Hi,
I commonly use Python to read ROOT files, using ROOT.TChain. However, this is sequential and single-threaded, which is not ideal. I wonder if there is a way to get the following two benefits (or two separate ways, each providing one of them):
1) Parallelism (I asked about this before, but I failed to make it work in PyROOT), so that I can use multiple threads to read the ROOT files.
2) Finding an “entry” by index, meaning I can directly read the n-th entry without reading the previous n-1 entries. This is useful when using ROOT files with neural networks, because I need to shuffle the data.
For example, a PyTorch dataset can read data from a folder directly, with both of the benefits I mentioned. Since a ROOT file is somehow a “folder” too, I guess there is no reason we can’t write something like the PyTorch dataset…
Thank you!
Best,
Li
ROOT Version: Not Provided Platform: Not Provided Compiler: Not Provided
Hi, RDataFrame is ROOT’s interface for parallel data analysis and manipulation (it requires at least v6.16). You can find several Python tutorials here.
However, RDataFrame does not offer a simple way to access entries in random ordering.
The reason is that it’s very, very easy to write very, very slow applications if you read entries from disk in a random order.
A much better approach, whenever viable, is to load your NN training dataset in memory (as a numpy array, for example) and read from RAM, which as its name implies offers efficient random access.
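To illustrate why in-memory data solves the shuffling problem, here is a minimal sketch using only numpy. The column names `pt` and `eta`, the array sizes, and the helper `shuffled_batches` are hypothetical and made up for the example; the random arrays stand in for columns you would already have loaded from your ROOT files (for instance as the dictionary of numpy arrays returned by RDataFrame's `AsNumpy`).

```python
import numpy as np

# Stand-in for columns already loaded into RAM; the names "pt" and "eta"
# and the number of entries are hypothetical.
rng = np.random.default_rng(seed=0)
columns = {
    "pt": rng.exponential(scale=30.0, size=1000),
    "eta": rng.uniform(-2.5, 2.5, size=1000),
}

# Stack the columns into one (n_entries, n_features) array.
features = np.column_stack([columns["pt"], columns["eta"]])


def shuffled_batches(data, batch_size, rng):
    """Yield mini-batches of rows from `data` in a fresh random order."""
    order = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        # Fancy indexing picks arbitrary rows directly from memory.
        yield data[order[start:start + batch_size]]


# Direct access to the n-th entry is a plain index operation.
entry_42 = features[42]

# One "epoch" of shuffled mini-batches; call again for a new ordering.
batches = list(shuffled_batches(features, batch_size=128, rng=rng))
```

Once the arrays are in RAM, both requests from the question (reading the n-th entry directly and shuffling) reduce to ordinary numpy indexing, and the same array could be wrapped in a PyTorch dataset if needed.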
With RDataFrame, you can apply cuts, define derived quantities and then load everything into numpy arrays in a few lines: