Speeding up the PyROOT loop over TTree with RDataFrame

ROOT Version: 6.16
Platform: lxplus

Dear experts,

I’ve looked for ways to speed up looping over TTrees in PyROOT, and RDataFrame seemed like the preferred way to do it. But when I tried to implement it in my code, the performance was either the same as or only slightly better than that of the pure Python script.
I’ve attached the basic versions of the looping script written in C++, in Python with a for loop over events, and in Python with RDataFrame, together with an example .root file. The approximate time it takes to loop over the tree is:

C++: 0.2 sec
Python (for event): 6.7 sec
Python (RDF, no MT): 5 sec
Python (RDF, implicit MT): 6.5 sec

Is there something I’m doing wrong in the RDataFrame script?

Thanks in advance.
ConvertTreeRDF.py (1.2 KB)
ConvertTree.py (1.3 KB)
ConvertTree.cpp (3.1 KB)
Input file (cernbox)


FYI: we’re investigating! (Yes Python is slow, but for the RDF case, most time should be spent on the C++ side!) Which exact ROOT version do you use?

It’s 6.16.00-x86_64-slc6-gcc62-opt.

Are there any updates on this topic?

Hi @apetukho

As far as we can see, your implementation with RDataFrame looks good. Running some performance analyses, we see that most of the overhead comes from the fact that ROOT needs to compile some parts of the analysis at runtime, i.e. template code in the internals of RDataFrame and the C++ strings used inside the Define and Filter calls on the Python side.

These are the numbers I get on my workstation:

C++: 0.2 sec
Python (for event): 3.4 sec
Python (RDF, no MT): 2.9 sec

The 2.9 seconds running the Python (RDF, no MT) version are distributed as follows (in wall-time):

Compiling JIT code: 0.836 s
Initialization of RDF nodes: 0.038 s
**Event loop: 0.087 s**
Snapshot (also compiling): 0.879 s
Other: 1.0 s

As you can see, the event loop itself runs quite fast, while compiling the code takes a big portion of the total time. The size of the input file also matters here: for a bigger input, the difference between the Python version based on an event loop and the one based on RDataFrame would be more significant, in RDataFrame’s favor.
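As a back-of-the-envelope check of the breakdown above (the numbers are simply copied from the profile quoted in this reply; the snippet only sums them):

```python
# Wall-time breakdown of the ~2.9 s "Python (RDF, no MT)" run quoted above.
timings = {
    "jit_compilation": 0.836,  # compiling the jitted Define/Filter strings
    "rdf_node_init":   0.038,  # initialization of the RDF nodes
    "event_loop":      0.087,  # the actual loop over entries
    "snapshot":        0.879,  # Snapshot, which also triggers jitting
    "other":           1.0,
}

total = sum(timings.values())
loop_fraction = timings["event_loop"] / total
print(f"total: {total:.2f} s, event loop: {loop_fraction:.1%} of wall time")
```

Only about 3% of the wall time goes into the event loop itself; the remainder is a fixed startup cost that amortizes over larger inputs.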

Alternatively, should you need to run this analysis faster, you can also use this version of the analysis written with RDataFrame in C++:

ConvertTreeRDF.cpp (1.7 KB)

On my machine it took 0.182 seconds to run the analysis. The main difference here is that we are not using any jitted code, so everything is compiled in advance, before running the analysis.

I hope that helps. Let us know if you have any further questions.

And thank you very much for reporting this; it has been very useful for us to detect this performance issue on small analyses.


Thank you @jcervant!

To add to this: we are looking into refactoring RDF in order to speed up its compilation (and therefore also jitting) – we might or might not succeed.

The current state of things is that, especially from PyROOT, the startup time of RDF is…important, unfortunately – but the event loop still runs at C++ speed (potentially subdividing batches of entries among multiple threads).

For larger datasets, you should see large improvements w.r.t. a python loop.

Thank you for your response. I’ve run some more tests, adding more cuts and branches to the ConvertTree function and running it on the real MC data (and swapping TLorentzVector for ROOT::Math::LorentzVector to speed everything up and get rid of the gInterpreter calls). I don’t want to switch to C++ yet, because for me it’s much easier to handle the datasets and normalization coefficients in Python. The script looks like this:

import time
import ROOT

processPathList = [...]
timeList = []
for processPath in processPathList:
	inputFile = ROOT.TFile(processPath, 'read')
	inputTree = inputFile.Get('output_tree')
	entryNum = inputTree.GetEntries()

	start = time.time()
	# ... conversion of inputTree (event loop or RDataFrame), elided here ...
	timeList.append((entryNum, time.time() - start))

And the measured times are (on lxplus machines with ROOT 6.16.00):

# of events in a tree	Python, sec	RDF, sec
167333			21.80		9.26
358497			47.85		7.30
468998			61.18		7.72
6205			0.83		4.93
7986			1.08		5.77
23283			2.71		5.30
13263			2.00		5.27
3729			0.61		5.16
5169			0.62		5.11
6228			0.82		5.83
5903			1.53		5.16
8381			1.19		5.27
10756			1.31		5.45
3342			0.56		5.43
6458			0.62		5.27
671951			58.14		9.19
509784			46.41		7.59
334797			33.33		6.42
580629			64.53		7.68
258548			30.31		7.58
66361			8.54		5.43
196346			22.86		6.42
164301			21.31		6.02
39374			5.76		5.58
206070			26.03		6.35
108375			14.31		7.08
33813			4.84		5.40
1437583			114.05		11.43
900101			84.94		10.57
497204			48.27		7.98
426843			44.61		7.86
2353543			62.37		14.41
3729460			121.76		10.37
1190529			39.71		6.73

Total			996.79		238.32

It’s interesting how RDF is way faster on the big datasets but, for trees with fewer than ~50k events, slower in comparison even to the plain Python loop.

Nice, thanks for the measurements!

The event loop is always fast with RDF, but between jitting and PyROOT, on lxplus, there’s a constant 5-second overhead; it’s very clear from your benchmarks.

We’ll try to reduce that constant offset…and there are also plans to introduce some sort of “verbose mode” that would directly tell you how many events per second RDF processed during the event loop, and how much time was spent on setting things up, similarly to what Javi showed.

Some of your measurements are surprising, e.g. the Python time at 2353543 events is way lower than what a fit would suggest.

For Python I get 2 s + 1E-4 s · N, for RDF it’s 6 s + 2E-6 s · N. The constant time isn’t what it should be, and we’ll likely improve both parameters by factors during the next 18 months: we know exactly what we want to do to fix them (FYI: bulk processing of RNTuple in RDF)!
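Setting those two fitted cost models equal gives the tree size where RDF starts to pay off. A quick sketch using the rough constants quoted above (the function names are just illustrative):

```python
# Rough cost models fitted above: time(N) = constant overhead + per-event cost * N
def t_python(n):
    return 2.0 + 1e-4 * n  # plain PyROOT for-loop over events

def t_rdf(n):
    return 6.0 + 2e-6 * n  # RDataFrame from Python (jitting dominates the constant)

# Crossover: 2 + 1e-4*N == 6 + 2e-6*N  ->  N = 4 / 9.8e-5
crossover = (6.0 - 2.0) / (1e-4 - 2e-6)
print(f"RDF becomes faster above ~{crossover:,.0f} events")
```

That gives a crossover around 41k events, consistent with the empirical observation in the table that RDF only pays off above roughly 50k entries.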


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.