Hi! I have a general question about how to speed up my PyROOT scripts (if this is possible). A while ago I was developing something with nested loops over TTrees, and that was incredibly slow, so I switched to a .C macro, which did the job quite quickly.
Since I wondered if I was doing something wrong, I wrote the same script in Python and in C++. It does not do much, just loops over a TTree in a ROOT file (the one I’m using is quite big, 41 GB).
Python script:
import ROOT

path = 'path/to/rootfile.root'
root_file = ROOT.TFile(path, 'read')
tree = root_file.Get("Track")

# Loop over all entries; each attribute access reads the branch value through PyROOT.
for event in tree:
    pt = event.track_Pt

root_file.Close()
It is expected that your loop in C++ runs much faster than the equivalent loop in Python; this is due to the performance difference between the two languages.
One way to speed up your Python code when you are reading a tree is to move the loop itself into C++. This is precisely what the TDataFrame class does; please have a look at:
It proposes a declarative approach to processing data in ROOT trees. I would be happy to guide you through transforming your tree processing code into a TDataFrame chain of operations, if you are interested.
And the results from the last time I touched it (see slide 17; running at C++ speed was accomplished): ROOT User’s Workshop 2013
If use of Python in HEP ever reaches critical mass, then maybe management can be convinced to invest in such work, so that you can have your cake and eat it, too.
Until then, which version of Python do you use? In recent benchmarking, I found a sweet improvement in Python 3.6 over Python 2.7 (even more so with cppyy master).
The loop is run on a single thread for fairer comparison with your other macros, but with TDataFrame it’s awkward (by design) to just deserialize a variable and do nothing with it, so I’m filling a histogram instead.
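For readers following along, here is a minimal sketch of what such a TDataFrame version might look like. It is not the code from the post above. It assumes the tree name "Track" and the branch "track_Pt" from the original script, and it assumes a ROOT release where the class lives at ROOT.ROOT.Experimental.TDataFrame (it was later renamed to RDataFrame). The function name fill_pt_histogram is purely illustrative.

```python
def fill_pt_histogram(path, tree_name="Track", branch="track_Pt"):
    """Fill a 1D histogram of one branch, with the event loop running in C++."""
    # Imported lazily so the function can be defined on machines without ROOT.
    import ROOT

    # Assumption: class location in the ROOT releases that ship TDataFrame;
    # newer releases expose the same interface as ROOT.RDataFrame.
    TDataFrame = ROOT.ROOT.Experimental.TDataFrame
    df = TDataFrame(tree_name, path)

    # Booking the histogram is lazy: no entry is read at this point.
    hist = df.Histo1D(branch)

    # Accessing the result triggers the event loop, which runs in C++
    # (single-threaded, since implicit multi-threading is not enabled here).
    n_entries = hist.GetEntries()
    return hist, n_entries
```

Calling fill_pt_histogram('path/to/rootfile.root') would then replace the explicit Python for loop: Python only books the histogram and asks for the result, so the per-entry work never crosses the Python/C++ boundary.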