I am reading large ROOT files with ROOT 6.08.06 which were generated by Delphes 3. Each file is around 70 GB and has around 16 million events.
Reading each file takes a long time, typically around an hour. Is that normal, and can one make the process faster? I came across something like GetEntriesFast() but I'm not sure if it helps.
ROOT provides a solution for running very efficiently on datasets of this kind: TDataFrame. It allows you to express your analysis easily and parallelise it without effort: this is ideal for datasets as big as yours.
Here you can find the TDataFrame code examples.
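Just to sketch the idea, here is what a minimal analysis could look like, assuming ROOT 6.12 where the class still lives in the ROOT::Experimental namespace. The file name and the column used in the cut are placeholders to be adapted to the branches actually present in your files (the tree produced by Delphes is called "Delphes"):

```cpp
// Sketch assuming ROOT 6.12 (TDataFrame is in ROOT::Experimental there).
// "delphes_output.root" and the column "Jet_size" are placeholders: adapt
// them to your own files and branches.
#include <ROOT/TDataFrame.hxx>
#include <TROOT.h>
#include <iostream>

void readDelphes()
{
   ROOT::EnableImplicitMT();  // run the event loop on all available cores
   ROOT::Experimental::TDataFrame df("Delphes", "delphes_output.root");

   // Cuts and histograms are declared up front; only the columns named here
   // are read from disk, and the event loop runs once, lazily, at the end.
   auto selected = df.Filter("Jet_size > 0");
   auto hist     = selected.Histo1D("Jet_size");

   std::cout << "selected events: " << *selected.Count() << std::endl;
   hist->Draw();
}
```

The key point for your use case is that only the branches your cuts and histograms actually use are touched, and the single event loop can be spread over all cores with ROOT::EnableImplicitMT().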
My recommendation would also be to move to ROOT 6.12 from 6.08.
16 million events an hour is about 4 kHz of throughput.
that's a relatively good throughput for reconstruction, but for fast simulation on today's hardware it's a bit disappointing.
70 GB for 16 million events is about 4 kB per event, so definitely a rather lightweight event size (compatible with a fast-sim toolkit like Delphes).
70 GB in an hour is about 20 MB per second, definitely an order of magnitude below the pure I/O speed of today's standard rotating disks.
from these back-of-the-envelope calculations, it seems:
either your analysis is CPU-bound (you compute a lot of things per event)
or you read too much data for what you really need to do (e.g. you read 20k branches from your TTree while only a small subset is needed to carry out your analysis; see the TTree::SetBranchStatus sketch at the end of this post).
migrating to TDataFrame may help with the last bullet (because it should enable only the needed branches and leave the others on disk).
it might also help with the first one (if some calculations can be carried out in parallel).
but you really need to understand the "profile" of your analysis (CPU-bound or I/O-bound) to know where the bottleneck is; the TTreePerfStats sketch below can help with that.
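If you stay with a plain TTree loop rather than TDataFrame, the same "read only what you need" idea can be applied by hand with TTree::SetBranchStatus. A minimal sketch, where the file, tree and branch names are placeholders to adapt to your files:

```cpp
#include <TFile.h>
#include <TTree.h>

void readSubset()
{
   TFile *f = TFile::Open("delphes_output.root");  // placeholder file name
   TTree *tree = nullptr;
   f->GetObject("Delphes", tree);

   tree->SetBranchStatus("*", 0);        // disable every branch...
   tree->SetBranchStatus("Jet*", 1);     // ...then re-enable only what the analysis needs
   tree->SetBranchStatus("MissingET*", 1);

   const Long64_t nEntries = tree->GetEntries();
   for (Long64_t i = 0; i < nEntries; ++i) {
      tree->GetEntry(i);  // only the enabled branches are read and decompressed
      // ... analysis code ...
   }
   f->Close();
}
```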
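To find out whether the time actually goes into I/O, you can attach a TTreePerfStats object to the tree before your event loop; it records disk time, bytes read and the number of read calls. A sketch, again with placeholder names:

```cpp
#include <TFile.h>
#include <TTree.h>
#include <TTreePerfStats.h>

void profileIO()
{
   TFile *f = TFile::Open("delphes_output.root");  // placeholder file name
   TTree *tree = nullptr;
   f->GetObject("Delphes", tree);

   // Attach the I/O monitor before the event loop starts.
   TTreePerfStats *ps = new TTreePerfStats("ioperf", tree);

   const Long64_t nEntries = tree->GetEntries();
   for (Long64_t i = 0; i < nEntries; ++i)
      tree->GetEntry(i);

   ps->Print();               // real time, CPU time, disk time, bytes read, read calls
   ps->SaveAs("ioperf.root"); // the saved object can be inspected and drawn later
   f->Close();
}
```

If the reported disk and decompression time is a small fraction of the total, the loop is CPU-bound and parallelising the computation is the way to go; if it dominates, reading fewer branches (or using faster storage) is the first thing to try.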