I am reading large ROOT files with ROOT 6.08.06 which were generated by Delphes 3. Each file is around 70 GB and has around 16 million events.
Reading each file takes a long time, typically around an hour. Is that normal, and can one make the process faster? I came across something like GetEntriesFast() but I'm not sure if it helps.
ROOT provides a solution for running very efficiently on datasets of this kind: TDataFrame. It allows you to express your analysis easily and parallelise it without effort: this is ideal for datasets as big as yours.
Here you can find the TDataFrame code examples.
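Just to sketch the idea, here is what a minimal analysis could look like, assuming ROOT 6.12 where the class still lives in the ROOT::Experimental namespace. The file name and the column used in the cut are placeholders to be adapted to the branches actually present in your files (the tree produced by Delphes is called "Delphes"):

```cpp
// Sketch assuming ROOT 6.12 (TDataFrame is in ROOT::Experimental there).
// "delphes_output.root" and the column "Jet_size" are placeholders: adapt
// them to your own files and branches.
#include <ROOT/TDataFrame.hxx>
#include <TROOT.h>
#include <iostream>

void readDelphes()
{
   ROOT::EnableImplicitMT();  // run the event loop on all available cores
   ROOT::Experimental::TDataFrame df("Delphes", "delphes_output.root");

   // Cuts and histograms are declared up front; only the columns named here
   // are read from disk, and the event loop runs once, lazily, at the end.
   auto selected = df.Filter("Jet_size > 0");
   auto hist     = selected.Histo1D("Jet_size");

   std::cout << "selected events: " << *selected.Count() << std::endl;
   hist->Draw();
}
```

The key point for your use case is that only the branches your cuts and histograms actually use are touched, and the single event loop can be spread over all cores with ROOT::EnableImplicitMT().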
My recommendation would also be to move to ROOT 6.12 from 6.08.
16 million events an hour is about 4 kHz of throughput.
that's a relatively good throughput for reconstruction, but for fast simulation on today's hardware it's a bit disappointing.
70 GB for 16 million events is about 4 kB per event, so definitely a rather lightweight event size (compatible with a fast-sim toolkit like Delphes).
70 GB in an hour is about 20 MB per second, definitely an order of magnitude below the pure I/O speed of today's standard rotating disks.
from these back-of-the-envelope calculations, it seems:
either your analysis is CPU-bound (you compute a lot of things per event)
or you read too much data for what you really need to do (e.g. you read 20k branches from your TTree while only a small subset is needed to carry out your analysis; see the TTree::SetBranchStatus sketch at the end of this post).
migrating to TDataFrame may help with the last bullet (because it should enable only the needed branches and leave the others on disk).
it might also help with the first one (if some calculations can be carried out in parallel).
but you really need to understand the "profile" of your analysis (CPU-bound or I/O-bound) to know where the bottleneck is; the TTreePerfStats sketch below can help with that.
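If you stay with a plain TTree loop rather than TDataFrame, the same "read only what you need" idea can be applied by hand with TTree::SetBranchStatus. A minimal sketch, where the file, tree and branch names are placeholders to adapt to your files:

```cpp
#include <TFile.h>
#include <TTree.h>

void readSubset()
{
   TFile *f = TFile::Open("delphes_output.root");  // placeholder file name
   TTree *tree = nullptr;
   f->GetObject("Delphes", tree);

   tree->SetBranchStatus("*", 0);        // disable every branch...
   tree->SetBranchStatus("Jet*", 1);     // ...then re-enable only what the analysis needs
   tree->SetBranchStatus("MissingET*", 1);

   const Long64_t nEntries = tree->GetEntries();
   for (Long64_t i = 0; i < nEntries; ++i) {
      tree->GetEntry(i);  // only the enabled branches are read and decompressed
      // ... analysis code ...
   }
   f->Close();
}
```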
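To find out whether the time actually goes into I/O, you can attach a TTreePerfStats object to the tree before your event loop; it records disk time, bytes read and the number of read calls. A sketch, again with placeholder names:

```cpp
#include <TFile.h>
#include <TTree.h>
#include <TTreePerfStats.h>

void profileIO()
{
   TFile *f = TFile::Open("delphes_output.root");  // placeholder file name
   TTree *tree = nullptr;
   f->GetObject("Delphes", tree);

   // Attach the I/O monitor before the event loop starts.
   TTreePerfStats *ps = new TTreePerfStats("ioperf", tree);

   const Long64_t nEntries = tree->GetEntries();
   for (Long64_t i = 0; i < nEntries; ++i)
      tree->GetEntry(i);

   ps->Print();               // real time, CPU time, disk time, bytes read, read calls
   ps->SaveAs("ioperf.root"); // the saved object can be inspected and drawn later
   f->Close();
}
```

If the reported disk and decompression time is a small fraction of the total, the loop is CPU-bound and parallelising the computation is the way to go; if it dominates, reading fewer branches (or using faster storage) is the first thing to try.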