Understanding RDataFrame from TFile with many trees

rquaglia · November 11, 2024, 10:50pm

Dear Experts,

I have a question about the way RDataFrame accesses a tree in a list of files.

In practice I have a series of ROOT files totalling 120 TB. Each file contains 200 TTrees, and all of them are accessed via XRootD.

I was wondering how RDataFrame's performance is affected, when it is asked to process just one TTree from the chain, by the large size of the files containing it.

I.e. is the event loop of RDataFrame("treename", listoffiles) penalised (network, overhead of opening big files even though you effectively access only 1 tree out of 200) by the fact that the files are big while the actual disk space taken by that tree in each file is not? Should one consider having many trees in as many TFiles instead, or is RDataFrame smart enough to do the same under the hood?

Thanks in advance

Renato

I hope I made the question clear.

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

Danilo · November 12, 2024, 7:27am

Hi Renato,

Thanks for the question, which is legitimate.
The functionality involved here is not related to RDF per se, but rather to the underlying I/O layer, which is optimised to handle this kind of case as well. Files are read partially, i.e. only the portion you require is read over the network.
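
To illustrate what "read partially" means, here is a minimal Python sketch using only the standard library. It is a toy model, not ROOT's actual on-disk format or I/O code: the container layout (a small index followed by the payloads) and the names write_container, read_one and tree_042 are assumptions made up for this example. The point is only that a reader can consult an index, seek to one payload and request that byte range, so the bytes requested stay small even when the file holds 200 payloads.

```python
import os
import struct
import tempfile

# Toy stand-in for a file holding many trees; NOT ROOT's real on-disk format.
HEADER = struct.Struct("<I")      # number of index entries
ENTRY = struct.Struct("<16sQQ")   # payload name, byte offset, byte length


def write_container(path, payloads):
    """Write an index (name, offset, length) followed by all payloads."""
    offset = HEADER.size + ENTRY.size * len(payloads)
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(payloads)))
        for name, data in payloads.items():
            f.write(ENTRY.pack(name.encode(), offset, len(data)))
            offset += len(data)
        for data in payloads.values():
            f.write(data)


def read_one(path, wanted):
    """Return (payload, bytes_requested) for a single named payload."""
    with open(path, "rb") as f:
        (count,) = HEADER.unpack(f.read(HEADER.size))
        index = f.read(ENTRY.size * count)
        bytes_requested = HEADER.size + len(index)
        for raw_name, offset, length in ENTRY.iter_unpack(index):
            if raw_name.rstrip(b"\0").decode() == wanted:
                f.seek(offset)            # jump straight to the payload
                data = f.read(length)     # request only its byte range
                return data, bytes_requested + len(data)
    raise KeyError(wanted)


# 200 hypothetical "trees" of 10 kB each in one file.
payloads = {f"tree_{i:03d}": bytes([i % 256]) * 10_000 for i in range(200)}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "many_trees.bin")
    write_container(path, payloads)
    file_size = os.path.getsize(path)
    data, bytes_requested = read_one(path, "tree_042")

print(f"file size: {file_size} bytes, requested: {bytes_requested} bytes")
```

In this toy the reader requests the index plus one payload, i.e. under 1% of the file. How much is actually transferred for a real ROOT file over the network depends on the file layout, compression and caching settings, so treat the ratio here as illustrative only.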

Best,
Danilo

system · November 26, 2024, 7:28am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.