Machine Learning on a Large Dataset using PyTorch: RDataFrame vs. uproot?

Is there a general opinion on whether it is more efficient to load data via RDataFrame or via uproot directly from ROOT files for machine learning analysis in PyTorch? I have seen discussions of this in several places, but most of them are fairly old, and I am wondering whether anything has changed in the past two or three years.

For my purposes specifically, I believe I will not be doing much fiddling with the data itself, and I would like to avoid intermediate files, assuming there is no loss in performance (are approaches that go through intermediate files much faster?). I simply want to get the data from the ROOT files into a PyTorch-readable format in the most efficient way possible.
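For reference, the uproot route I have in mind looks roughly like this (just a sketch; the file, tree, and branch names are placeholders, and I'm assuming flat numerical branches):

```python
import uproot
import numpy as np
import torch

# Read a few flat branches directly from the ROOT file into numpy arrays.
with uproot.open("events.root") as f:        # placeholder file name
    arrays = f["Events"].arrays(             # placeholder tree name
        ["pt", "eta", "phi"],                # placeholder branch names
        library="np",
    )

# Stack the branches into an (n_events, n_features) array and hand it to PyTorch.
features = np.stack([arrays["pt"], arrays["eta"], arrays["phi"]], axis=1)
x = torch.from_numpy(features).float()
```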

Any links to external resources and comments are appreciated.

Thanks.

Older relevant post: python - What the fastest, most memory-efficient way of opening a ROOT NTuple for machine learning? - Stack Overflow

Hi @Sam_Kelson,

Welcome to the ROOT forum! I’m inviting @moneta and @vpadulan to the topic, as I think they can shed some light on this.

Cheers,
J.

Hi Sam,

A simple way, if your dataset is not too large, is to use RDataFrame.AsNumpy, as pointed out in the link above.
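For example, something along these lines (only a sketch, with placeholder file, tree, and column names, assuming flat numerical branches):

```python
import ROOT
import numpy as np
import torch

# Read the selected columns into a dict of numpy arrays with AsNumpy.
df = ROOT.RDataFrame("Events", "events.root")   # placeholder tree and file names
cols = df.AsNumpy(["pt", "eta", "phi"])          # placeholder column names

# Stack into an (n_events, n_features) array and convert to a PyTorch tensor.
features = np.stack([cols["pt"], cols["eta"], cols["phi"]], axis=1)
x = torch.from_numpy(features).float()
```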
If that is not convenient for your data, we are also developing a batch generator based on RDataFrame, which will efficiently provide batches of numpy arrays that can be used with ML tools like PyTorch or TensorFlow with minimal overhead. If you are interested, we can provide you with a prototype implementation.
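In the meantime, if your data does fit in memory, you can already get batches by wrapping the AsNumpy output in a standard PyTorch Dataset and DataLoader. This is only a sketch of the idea (placeholder column names, and not the batch generator prototype itself):

```python
import ROOT
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Load everything once with AsNumpy (assumes the full dataset fits in memory).
cols = ROOT.RDataFrame("Events", "events.root").AsNumpy(["pt", "eta", "phi", "label"])

features = torch.from_numpy(np.stack([cols["pt"], cols["eta"], cols["phi"]], axis=1)).float()
labels = torch.from_numpy(cols["label"]).float()

# A TensorDataset + DataLoader then serves shuffled mini-batches for training.
loader = DataLoader(TensorDataset(features, labels), batch_size=1024, shuffle=True)

for batch_features, batch_labels in loader:
    ...  # training step goes here
```

The batch generator mentioned above is aimed at the case where loading everything at once like this is not an option.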

Best regards

Lorenzo

Hi Lorenzo,

That batch generator sounds really interesting. I would love to try a prototype implementation if you could share one with me.

Thanks!

Sam
