Reading object from file currently processed in RDataFrame

cgrefe · June 20, 2025, 9:23am

Hi,

I would like to know if it is possible to dynamically access objects from the files currently being processed by the RDataFrame event loop. In particular, I would like to read the bin content of a histogram from the same file that holds the TTree, define that value as a column using DefinePerSample, and use the column as a weight when filling a histogram.

I found the topic “root-forum.cern.ch/t/define-new-rdataframe-column-using-values-from-th1f-in-pyroot/34052” (can’t add links as new user…), but that explicitly opens a separate file to load the histogram beforehand. That is not really an option (or at least adds a lot of I/O overhead) when processing a lot of files.
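For reference, the pre-loading approach could look roughly like the sketch below. It is only an illustration under several assumptions: the histogram name `h_norm`, the bin index, the column name `file_weight` and all helper names are hypothetical, and the file paths are assumed to match the names RDataFrame reports for each sample and to contain no characters that need escaping in a C++ string. Each file is opened once up front to read the bin content, and the resulting table is turned into an expression for the string overload of DefinePerSample, which exposes the current sample as `rdfsampleinfo_`.

```python
def read_bin_content(path, hist_name="h_norm", bin_index=1):
    """Open one file with ROOT and return the content of one histogram bin."""
    import ROOT  # imported lazily; the helpers below are plain Python

    root_file = ROOT.TFile.Open(path)
    if not root_file or root_file.IsZombie():
        raise OSError(f"could not open {path}")
    try:
        hist = root_file.Get(hist_name)  # hist_name is a hypothetical name
        if not hist:
            raise KeyError(f"{hist_name} not found in {path}")
        return hist.GetBinContent(bin_index)
    finally:
        root_file.Close()


def preload_weights(paths, reader=read_bin_content):
    """Read one weight per file before the event loop starts."""
    return {path: float(reader(path)) for path in paths}


def per_sample_expression(weights, default=1.0):
    """Build a C++ ternary chain that picks the weight of the current sample."""
    expr = repr(float(default))
    for path, weight in reversed(list(weights.items())):
        expr = f'rdfsampleinfo_.Contains("{path}") ? {float(weight)!r} : ({expr})'
    return expr


def add_weight_column(df, weights, column="file_weight"):
    """Attach the per-file weight to an existing RDataFrame as a new column."""
    return df.DefinePerSample(column, per_sample_expression(weights))
```

The new column could then be passed as the weight argument, e.g. `df.Histo1D(("h", "h", 50, 0.0, 100.0), "pt", "file_weight")`, where `pt` is again a placeholder. For a very long file list a lookup table declared on the C++ side would probably be a better fit than a nested ternary; the chain is used here only to keep the sketch short.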

Cheers,
Christian

ROOT Version: 6.34.04
Platform: x86_64
Compiler: gcc13

vpadulan · June 20, 2025, 12:05pm

Dear @cgrefe,

Thank you for reaching out to the ROOT forum! The use case you are presenting is intriguing. If I understand correctly, you would like an interface that lets you access arbitrary content in a ROOT file during the event loop, while a dataset from that same file (either TTree or RNTuple) is being processed. I must admit this is the first time I have heard of such a requirement, and I would really like to better understand the motivation behind it. On the surface, the approach in the forum post you link to seems to be the most effective answer: it adds little overhead while keeping a good balance between the usability and the functionality of the interface. I don’t really understand your comment about I/O overhead, since reading two different objects from the ROOT file will happen independently anyway. Also, keep in mind that RDataFrame is an interface for the analysis and processing of datasets, so it is designed mainly to wrap an existing dataset, and we support plenty of formats (TTree, RNTuple, Arrow, SQLite, Awkward arrays, NumPy arrays, pandas dataframes, CSV files, etc.).

Cheers,
Vincenzo

cgrefe · June 20, 2025, 12:28pm

Hi Vincenzo,

The overhead is not significant for a small number of locally available files. I am more worried about the case of hundreds or even thousands of files, which might be stored on some remote storage (e.g. EOS). There, just opening each file to pre-fetch the metadata before even starting the event loop can take a significant amount of time. Anyway, the workaround is probably to store such information as an additional column or a friend tree instead of a histogram. I just wanted to be sure that this feature does not exist before changing the workflow in that direction.
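A rough sketch of the friend-tree variant, under assumptions that are not part of the current workflow: every file holds a main tree `events` plus a friend tree `meta` with one entry per event, and `meta` has a branch `file_weight` that repeats the per-file value. The tree, branch and function names are all hypothetical.

```python
def friend_column(friend_tree, branch):
    """Friend branches can be addressed as '<friend name>.<branch>'."""
    return f"{friend_tree}.{branch}"


def book_weighted_histogram(paths, tree_name="events", friend_tree="meta",
                            value="pt", weight_branch="file_weight"):
    """Book a histogram of `value`, weighted by a branch of the friend tree."""
    import ROOT  # imported lazily so that friend_column works without ROOT

    events = ROOT.TChain(tree_name)
    meta = ROOT.TChain(friend_tree)
    for path in paths:
        events.Add(path)
        meta.Add(path)  # assumption: the friend tree lives in the same file
    events.AddFriend(meta)

    df = ROOT.RDataFrame(events)
    hist = df.Histo1D(("h_" + value, value, 50, 0.0, 100.0),
                      value, friend_column(friend_tree, weight_branch))
    # Return the chains as well so they stay alive as long as the dataframe.
    return hist, df, events, meta
```

With this layout the weight is read together with the events, so no file has to be opened before the event loop starts. If the value is written directly into the main tree instead, the friend chain can be dropped and the branch name passed to Histo1D as the weight.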

Cheers,
Christian
