Equally Split Entries into Groups

Dear Experts,

I have data of around 900 million entries that I want to split into subgroups. I am plotting a quantity (pT) over a range from 0 to 1000 GeV, where each subgroup will contain approximately the same number of entries (enough statistics per subgroup). To achieve this, I have applied a cumulative distribution function (CDF) to my histogram, but I’m not sure how to split this cumulative histogram into multiple subgroups with the same statistics.

Could you please provide me with a solution on how to achieve that?

Thanks.

Hi @Mustaphaa,

would you mind sharing what you already have so we can fully understand the problem?

Cheers,
Marta

Hi @Mustaphaa,
I attach a simple macro that finds the ranges to divide a distribution into equally populated samples.
Cumulative.C (1.8 KB)

Below you find two pictures obtained with my code.
In the top part of each picture you have your distribution, and in the bottom part its cumulative, rescaled so that it reaches one.

What I do is divide the y axis of the cumulative distribution into equal parts, then search for the bin where y reaches each of these values. The division is much more precise when the first distribution, and therefore the cumulative, has more bins.

I hope this is enough to solve your problem.


Dear Dilicus,

Thank you very much, this has solved my problem.

Best,
Mustapha


Hi @Dilicus,

Thank you for the code you provided; it works properly. However, I noticed a small issue with the “last range”. It is quite large in terms of limits, as shown in the attached image. Currently, it spans from 78 to 1000. I’m wondering if it is possible to make it smaller, perhaps between 300 and 1000 or so.

In general, I don’t have much data in the range from 130 to 1000 GeV compared to the range from 0 to 130 GeV.

*The number of bins I considered is 1000.

(attached image: ranges)

Best,
Mustapha

Maybe I need to adjust the end-of-loop condition a bit.
So you would like the last range to contain a smaller fraction of events, not only 5%?

Could you share a picture of the cumulative of your data obtained with my code?


Hi @Dilicus,

What I mean to say is that I don’t know if I could have the last two ranges like this:

  • Range Before Last: 120 - 300
  • Range Last: 300 - 1000

while preserving the same number of entries in each range. I don’t know if this is doable; perhaps by taking more than 5% per range?

Here is how the cumulative distribution looks for my data (without many cuts that I will be applying, which will reduce the data size):

Another thing I noticed is that the data size in each range is too small.

If you notice that the data size for each range is too small, maybe it is better to increase the size of your samples to 10% or 15%. This should also change the size of the last and second-to-last ranges. If you increase the number of bins to 5000, the division will be more precise.


Hi @Dilicus,

To approximately achieve the last two ranges like this:

  • Range Before Last: 120 - 300
  • Range Last: 300 - 1000

I set the step size to 0.0001, and it produced ranges similar to these. However, upon inspecting the number of entries in each range, I noticed that the first few ranges had too many entries, and some ranges were repeated.

The reason I set the step size to 0.0001 is that I know there are only around 90K entries between 200 and 1000 GeV, and I thought this would be the only way to achieve such ranges (the last two or three).

Could you please share the file you are using, and send back the modified version of the macro?

Best,
Stefano

Here is the modified version of the macro
Cumulative.C (2.8 KB)

Here is the data file I’m using; the file is heavy, so I give the CernBox path:
https://cernbox.cern.ch/s/6InhPvWvHU1RzZY

And here is the output I got for the different ranges:

Thanks

Hi,
I was able to find a solution.
The only thing to do was to increase the number of bins of the first histogram. I used 50000000 (5e7) bins. With just a few bins the cumulative was not precise enough.

I used only the first 1% of the file, otherwise the macro took too long to run on my laptop. So my ranges have only ~900 events.

Below you find a screenshot of my ranges, with the events.

EDIT
I just ran the macro with the first 10% of the file. These are the results for the last ranges.

2nd EDIT
I ran the macro with the full data set. These are the results for the last ranges.
I quickly checked all the ranges and they all have ~90K events. Using 1e8 bins should improve the results, and maybe you can also increase the step size a bit.

@Dilicus

Thanks so much, that solved my problem.

Dear @Dilicus ,

After splitting the data into groups, I calculated the average pT of each group. Then, I attempted to merge the groups whose average pT differed by less than 10% to achieve a smooth distribution. However, I am uncertain about the average pT formula I used for each merged group and how it should be computed.

The aim is to have enough statistics for each merged group, but the last two ranges could not satisfy the merging condition.

Here is the updated macro with these changes:

Cumulative_Mod.C (5.6 KB)

Here is the output I got:

Best,

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.