Equally Split Entries into Groups

Dear Experts,

I have data of around 900 million entries that I want to split into subgroups. I am plotting a quantity (pT) over a range from 0 to 1000 GeV, where each subgroup will contain approximately the same number of entries (enough statistics per subgroup). To achieve this, I have applied a cumulative distribution function (CDF) to my histogram, but I’m not sure how to split this cumulative histogram into multiple subgroups with the same statistics.

Could you please provide me with a solution on how to achieve that?

Thanks.

Hi @Mustaphaa,

would you mind sharing what you already have so we can fully understand the problem?

Cheers,
Marta

Hi @Mustaphaa,
I attach a simple macro that finds the ranges to divide a distribution into equally populated samples.
Cumulative.C (1.8 KB)

Below you find two pictures obtained with my code.
In the top part of each picture you have your distribution, and in the bottom part its cumulative, rescaled so that it reaches one.

What I do is divide the y axis of the cumulative distribution into equal parts, then search for the bin where y reaches each of these values. The division is much more precise when the first distribution, and therefore the cumulative, has more bins.

I hope this is enough to solve your problem.


Dear Dilicus,

Thank you very much, this has solved my problem.

Best,
Mustapha


Hi @Dilicus,

Thank you for the code you provided; it works properly. However, I noticed a small issue with the “last range”. It is quite large in terms of limits, as shown in the attached image. Currently, it spans from 78 to 1000. I’m wondering if it is possible to make it smaller, perhaps between 300 and 1000 or so.

In general, I don’t have much data in the range from 130 to 1000 GeV compared to the range from 0 to 130 GeV.

*The number of bins I considered is 1000.

(attached image: ranges)

Best,
Mustapha

Maybe I need to adjust the end-of-loop condition a bit.
So you would like the last range to contain a smaller fraction of events, not only 5%?

Could you share a picture of the cumulative of your data obtained with my code?


Hi @Dilicus,

What I mean to say is that I don’t know if I could have the last two ranges like this:

  • Range Before Last: 120 - 300
  • Range Last: 300 - 1000

while preserving the same number of entries in each range. I don’t know if this is doable; perhaps by taking more than 5% per range?

Here is how the cumulative distribution looks for my data (without many cuts that I will be applying, which will reduce the data size):

Another thing I noticed is that the data size in each range is too small.

If you notice that the data size for each range is too small, maybe it is better to increase the size of your samples to 10% or 15%. This should also change the size of the last and second-to-last ranges. If you increase the number of bins to 5000, the division will be more precise.


Hi @Dilicus,

To approximately achieve the last two ranges like this:

  • Range Before Last: 120 - 300
  • Range Last: 300 - 1000

I set the step size to 0.0001, and it produced ranges similar to these. However, upon inspecting the number of entries in each range, I noticed that the first few ranges had too many entries, and some ranges were repeated.

The reason I set the step size to 0.0001 is that I know there are only around 90K entries between 200 and 1000 GeV, and I thought this would be the only way to achieve such ranges (the last two or three).

Could you please share the file you are using, and send back the modified version of the macro?

Best,
Stefano

Here is the modified version of the macro
Cumulative.C (2.8 KB)

Here is the data file I’m using; the file is heavy, so I give the CernBox path:
https://cernbox.cern.ch/s/6InhPvWvHU1RzZY

And here is the output I got for the different ranges:

Thanks

Hi,
I was able to find a solution.
The only thing to do was to increase the number of bins of the first histogram. I used 50000000 (5e7) bins. With just a few bins the cumulative was not precise enough.

I used only the first 1% of the file, otherwise the macro took too long to run on my laptop. So my ranges have only ~900 events.

Below you find a screenshot of my ranges, with the events.

EDIT
I just ran the macro with the first 10% of the file. These are the results for the last ranges.

2nd EDIT
I ran the macro with the full data set. These are the results for the last ranges.
I quickly checked all the ranges and they all have ~90K events. Using 1e8 bins should improve the results, and maybe you can also increase the step size a bit.

@Dilicus

Thanks so much, that solved my problem.

Dear @Dilicus ,

After splitting the data into groups, I calculated the average pT of each group. Then, I attempted to merge the groups whose average pT differed by less than 10% to achieve a smooth distribution. However, I am uncertain about the average pT formula I used for each merged group and how it should be computed.

The aim is to have enough statistics for each merged group, but the last two ranges could not satisfy the merging condition.

Here is the updated macro with these changes:

Cumulative_Mod.C (5.6 KB)

Here is the output I got:

Best,

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.