Best way to random-access a TTree

schneiml · February 20, 2019, 1:48pm

Hi all,

I am trying to read very small amounts of data out of rather small ROOT files. The files in question contain a TTree with a branch of strings and a branch of TH1-derived objects [1]. The files come from EOS, and we are talking about tens of thousands of them.

I need to scan over the strings (tens of thousands) to find the objects that I am interested in, and then read those objects (a few tens). Currently I use this code:

        metree = getattr(f, treenames[metype])

        for x in range(firstidx, lastidx+1):
            metree.GetEntry(x)
            mename = metree.FullName
            if mename in interesting_mes:
                value = metree.Value
                # save the thing and continue

It is reasonably fast, but I suspect it still reads the full file over the network.

Is there a good way to read only one branch out of a TTree in Python, potentially in batch? (I can get the TBranch, but I can’t figure out how to get the value out of it…)

Also, any pointers on how to profile such a problem? (I can’t really monitor IO on a shared machine, and I am not even sure whether this code is CPU limited on the Python side.)
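One rough way to approach the profiling question without system-level IO monitoring is to compare wall-clock time with CPU time for the loop, and to ask the TFile how many bytes it actually read. The sketch below assumes Python 3 (for `time.perf_counter` and `time.process_time`); `measure` and `read_fraction` are hypothetical helper names, while `GetBytesRead()` and `GetSize()` are existing TFile methods.

```python
import time


def measure(work):
    """Run work() once and return (result, wall_seconds, cpu_seconds).

    CPU time close to wall time suggests the process is CPU-limited
    (Python loop, decompression). CPU time much smaller than wall time
    suggests it is mostly waiting, typically on IO.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def read_fraction(f):
    """Fraction of the file size that was actually read through this TFile.

    `f` is expected to behave like an open ROOT.TFile.
    """
    return f.GetBytesRead() / float(f.GetSize())
```

Wrapping the entry loop in a function and passing it to `measure`, then calling `read_fraction(f)` before closing the file, should give a first indication of whether the whole file is being transferred.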

If there is anything that can be done in terms of ROOT settings while writing these files, I’d be interested as well. But for now I am stuck with a few TB of these.

Thanks,

Marcel

[1] Sample file:
https://mschneid.web.cern.ch/mschneid/perlumi_136.892/step3_inDQM.root

ROOT Version: 6.12/07 (CMSSW)
Platform: slc6_amd64_gcc700
Compiler: linuxx8664gcc

etejedor · February 20, 2019, 6:28pm

Hi Marcel,

When reading a TTree, by default, all branches are read. So if you just do this:

for entry in metree:
    mename = entry.FullName
    if mename in interesting_mes:
        value = entry.Value

even if you just access FullName and Value, all branches will be read. You can however enable and disable the reading of branches before the loop with SetBranchStatus:

metree.SetBranchStatus("*", 0)         # disable all branches
metree.SetBranchStatus("FullName", 1)  # re-enable only the branches you need
metree.SetBranchStatus("Value", 1)
for entry in metree:
    ...
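Since the original question also asked how to get a value out of a single TBranch, here is a minimal sketch of reading branch by branch: `TBranch.GetEntry(i)` loads only that branch, and the value can then be read as an attribute of the tree. The function name `read_selected` and its parameters are hypothetical, and it assumes the `Value` branch holds TH1-derived objects (so `Clone()` and `SetDirectory()` are available). Both branches should stay enabled here, because as far as I know `TBranch.GetEntry` skips branches that were disabled with `SetBranchStatus`.

```python
def read_selected(tree, wanted, first, last,
                  name_branch="FullName", value_branch="Value"):
    """Scan the name branch and load the value branch only for matches.

    `tree` is expected to behave like a ROOT.TTree; `wanted` is a set of
    Python strings. Returns a dict mapping name -> copied object.
    """
    names = tree.GetBranch(name_branch)
    values = tree.GetBranch(value_branch)
    found = {}
    for i in range(first, last + 1):
        names.GetEntry(i)  # reads only the name branch for entry i
        name = str(getattr(tree, name_branch))  # std::string -> Python str
        if name in wanted:
            values.GetEntry(i)  # reads the (larger) object only when needed
            # Copy the object: the tree reuses its buffer on the next read.
            obj = getattr(tree, value_branch).Clone()
            obj.SetDirectory(0)  # assumption: TH1-derived, detach from file
            found[name] = obj
    return found
```

With the variables from the original post this would be called as `read_selected(metree, set(interesting_mes), firstidx, lastidx)`.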

If you want to go even faster, we are now promoting a new way of reading and processing the content of TTrees, called RDataFrame:
https://root.cern/doc/master/classROOT_1_1RDataFrame.html

If you use RDataFrame, you will not do a loop in Python, but instead the loop will be buried in the C++ implementation of RDataFrame. With this new approach, you write your program as a sequence of operations on a dataset, following a declarative pattern.
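As a rough illustration of the declarative style, the sketch below builds a C++ filter expression from a list of names and hands it to RDataFrame. The helpers `name_filter` and `count_matches` are hypothetical names; `RDataFrame(treeName, fileName)`, `Filter`, `Count` and `GetValue` are existing RDataFrame calls. It assumes ROOT 6.14 or later (in 6.12 the class was still `ROOT.ROOT.Experimental.TDataFrame`) and that `FullName` is a `std::string` branch.

```python
def name_filter(names, column="FullName"):
    """Build a C++ expression that is true if `column` equals any name."""
    def quote(s):
        # Escape backslashes and double quotes for a C++ string literal.
        return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

    terms = ["{} == {}".format(column, quote(n)) for n in sorted(names)]
    return " || ".join(terms) if terms else "false"


def count_matches(tree_name, path, names):
    """Count the entries whose name is in `names`; the loop runs in C++."""
    import ROOT  # imported lazily so name_filter is usable without ROOT

    df = ROOT.RDataFrame(tree_name, path)
    return df.Filter(name_filter(names)).Count().GetValue()
```

Only the columns mentioned in the expression are read, so a filter on `FullName` alone should not touch the `Value` branch.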

Cheers,
Enric

schneiml · February 20, 2019, 7:14pm

Thanks for the hint, Enric. By not reading the Value branch when I don’t need it, things got a bit faster. Now I can process >1000 files/min with enough parallel workers to hide the EOS latency, which is good enough for today…

I’ll play around with RDataFrame at some point; I really like the idea, though I am not sure it is well suited for this use case, where everything is deeply IO limited.

Cheers,

Marcel

etejedor · February 21, 2019, 4:25pm

Hi Marcel,
Good to hear it is running faster.

I just wanted to point out that RDataFrame is able to transparently parallelise the reading of your TTree (once implicit multi-threading is enabled with ROOT::EnableImplicitMT()), so the decompression and deserialization of the entries will be tackled in parallel by multiple threads. That should help if your use case is I/O bound.
