RDataFrame tree conversion takes too much time and memory

Hi,
here’s my report.

TL;DR

Once the obvious issues are fixed (running multiple event loops; enabling multi-threading with 10 worker threads even though each event loop only has enough workload for a single thread), runtime and memory are mostly eaten up by just-in-time compilation of RDF’s computation graph. In particular, the interpreter never gives back allocated memory, so a long-running process that performs just-in-time compilation in a loop will see its memory usage slowly creep up.

A fully compiled C++ version of the program runs a factor of 6 faster or more, and converting each input file in a separate process (with all processes running in parallel) slashes runtimes further.

Possible solutions include:

  • running the macro/script once per period in a loop: memory is reclaimed after each execution, the macro can be a python script or an optimized C++ script
  • jitting less, by specifying template parameters to actions and using lambdas rather than strings in Filters/Defines (see the first sketch after this list)
  • from python, loading and running compiled C++ code via ROOT.gSystem.Load, or compiling it on the fly with ROOT.gInterpreter.ProcessLine or similar
  • running the RDF conversion in a subprocess (which frees the memory at exit); this also makes it possible to use multi-process parallelization to improve runtimes (see the second sketch after this list)
  • using something like the last C++ script I provide: it is harder to write (it requires listing template parameters by hand, which is tedious), but its memory usage stays constant and is the lowest of all options tried, and runtimes are an order of magnitude better
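
To make the "jitting less" bullet concrete, here is a minimal sketch of what a fully typed conversion function can look like. The tree name ("Events"), the branch names (nMuons, pt) and the cut are invented for illustration; they are not taken from the attached scripts:

```cpp
// Minimal sketch of a conversion with no just-in-time compilation.
// Tree name, branch names and the cut are placeholders, not the real ones.
#include <ROOT/RDataFrame.hxx>

void convert(const char *infile, const char *outfile)
{
   ROOT::RDataFrame df("Events", infile);

   auto skimmed = df
      // a lambda instead of a string: nothing has to be jitted for the cut
      .Filter([](int n) { return n >= 2; }, {"nMuons"})
      // same for Define: the expression is already compiled
      .Define("pt2", [](float pt) { return pt * pt; }, {"pt"})
      // Alias only renames a column, no computation is attached
      .Alias("npart", "nMuons");

   // spelling out the column types as template parameters lets Snapshot skip
   // the jitting of its helper code (the big win in the table below)
   skimmed.Snapshot<int, float, float>("Events", outfile, {"npart", "pt", "pt2"});
}
```

The lambdas, the Alias and the explicit Snapshot template parameters correspond to the last modifications listed in the table below: everything the event loop needs is known at compile time.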
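
The last attached script additionally converts each file in its own process. The sketch below is not that script, just one possible way to do the per-file fan-out with POSIX fork(), reusing a convert() helper like the one sketched above (file names are placeholders):

```cpp
// One possible way to convert each input file in its own process (POSIX only).
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>
#include <string>
#include <vector>

// the conversion helper from the previous sketch
void convert(const char *infile, const char *outfile);

int main()
{
   // placeholder input file names
   const std::vector<std::string> inputs = {"f1.root", "f2.root", "f3.root"};

   std::vector<pid_t> children;
   for (const auto &in : inputs) {
      const pid_t pid = fork();
      if (pid < 0)
         return 1;              // fork failed
      if (pid == 0) {           // child: convert one file, then exit so that
         convert(in.c_str(), ("skimmed_" + in).c_str());
         std::exit(0);          // all of its memory goes back to the OS
      }
      children.push_back(pid);  // parent: remember the child and keep going
   }

   for (const pid_t pid : children)
      waitpid(pid, nullptr, 0); // wait for every conversion to finish

   return 0;
}
```

Each child returns its memory to the OS when it exits, and the conversions of the different files run concurrently.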

Full measurements

In the following, I run over the 3 “heavy” files.
Times are measured with python’s time module or TStopwatch (for the C++ programs), and averaged over 3 executions.
Memory usage is the max resident size as reported by /usr/bin/time.

Each row in the table lists the modification added w.r.t. the row before (i.e. the last row includes all the modifications listed above it).

| Additional modifications | Time (s) | Max res. memory (kB) |
| --- | --- | --- |
| none (original python script) | 10.14 +/- 1.46 | 664876 |
| no EnableImplicitMT | 9.22 +/- 1.42 | 629028 |
| single event loop | 7.18 +/- 1.09 | 581932 |
| do not merge subprocess results | 6.30 +/- 1.13 | 575640 |
| C++ macro via the ROOT interpreter | 5.89 +/- 0.41 | 510592 |
| compiled C++ program (-O2) | 5.88 +/- 0.62 | 489944 |
| use Alias instead of Defines where possible | 5.76 +/- 0.60 | 487848 |
| use lambdas instead of strings in Defines/Filters | 4.63 +/- 0.44 | 449628 |
| use Snapshot template parameters | 1.52 +/- 0.28 | 285008 |
| processing different files in different processes concurrently | 0.83 +/- 0.11 | 269136 |

The scripts

Each subsequent script is a modified version of the one before, with the additions mentioned in the table.

  • original python script
  • python + no IMT + 1 evt loop + no result merging (4.4 KB)
  • compiled C++ program (3.0 KB)
  • C++ program with no jitting: using Alias, lambdas instead of strings, template parameters for Snapshot (3.3 KB)
  • C++ program with multi-process parallelism (the last row in the table) (3.4 KB)

Homework for the devs

  1. it would be lovely if RDataFrame had an option to be verbose about its event loops, so it could tell users where time is being spent
  2. just-in-time compilation of Snapshot calls could be faster

Conclusions

At the end of the day, the problem is that, because of just-in-time compilation of the computation graph, running many very short (order of ~1 second) event loops is a worst-case scenario for python+RDF (in C++, all is fine): each event loop hogs a bit more memory, start-up (i.e. just-in-time compilation) takes about as long as the actual processing, and there is no opportunity to parallelize the event loop over multiple threads because the workload is too small.

The good news is that RDF’s core is fast; the bad news is that it’s hidden under too many layers.

Above, in the TL;DR, I list several possible mitigations. The nuclear option is of course to write your program in fully typed C++, which is fast and does not hog memory.

Cheers,
Enrico