RDataFrame tree conversion takes too much time and memory


ROOT Version: 6.16/00
Platform: Ubuntu 16.04


Dear ROOT experts,

I’m using PyROOT with RDataFrame for experiment data processing. I need to do fairly basic things, as in the latest CERN Open Data RDataFrame examples (e.g. [1], [2]):

  1. Make some event selections.
  2. Apply weight corrections based on the sum of MC weights, the cross-section and the luminosity.
  3. Instead of making histograms, transform variables for further work.

Each process is split into three periods with different luminosities, and each period can be further split into subprocesses with different cross-sections and MC weight sums.

However, I found this to be a fairly slow way to work with ROOT files. It is especially slow when a process is made up of a few subprocesses, each containing only 1-100 events.
It also seems to gradually consume all available RAM, forcing the OS to kill the conversion process.
This gets very unwieldy when I need to work with systematic errors, because it means I have to redo every conversion up to 20 times.

I was advised to move the tree conversion function into a C++ file and load it from the python file with ROOT.gInterpreter.Declare('#include "foo.cpp"'), but that affected neither the speed nor the memory usage.

I’ve created a mock-up example of my conversion code and attached it to this post.
ConvertDataset.py (5.8 KB)
ConvertTree_cpp.cpp (1.6 KB)
They can also be found here together with the C++ code and some example input data. There are two processes, “heavy” and “light”:

  • “Heavy” has only 1 file per period, with 64613, 78998 and 104645 entries.
  • “Light” has 3 files per period, with (1, 79, 196), (0, 82, 196) and (1, 94, 496) entries.

I’ve run some tests; the runtimes in seconds for each combination of code and process are:

python, "heavy"	python, "light"	cpp, "heavy"	cpp, "light"
15.2174670696	30.719119072	14.1762371063	29.1049787998
12.5415570736	29.5174951553	12.7764348984	29.7464039326
12.3580510616	30.4264831543	12.8252680302	36.1380038261
12.3621308804	31.402148962	12.9889249802	28.4454379082
12.4532442093	30.9754590988	13.0270299911	33.2478058338
12.5667109489	32.0482950211	12.978730917	32.8336589336
12.5887079239	35.2845959663	13.110419035	33.0987081528
12.7872169018	34.1884410381	13.2643208504	31.2558979988
12.6519219875	32.9833519459	13.6302509308	33.3314220905
12.7544009686	32.929363966	13.9931509495	32.5388319492
12.6516251564	32.9172339439	14.0106520653	28.6607189178
12.8050370216	33.8565819263	14.4325330257	29.5839531422
12.7571280003	34.3305761814	13.8887300491	31.912913084
12.9673330784	33.6310958862	13.8388259411	34.5327329636
12.8155889511	33.6902740002	14.2913339138	34.5269930363
13.0519108772	33.7768409252	14.0997800827	34.1296551228
13.0272419453	34.1902039051	13.950273037	35.7440810204
12.9802839756	33.9651200771	13.9921731949	35.3556680679
13.1217548847	33.9447009563	13.8357200623	35.6759448051
12.8862700462	36.3453910351	14.1057138443	42.7853910923

Unfortunately, I don’t know how to properly present the memory usage issue.

I’ve got the following questions:

  1. Is the way I’m using RDataFrame correct? Maybe there’s a better way to merge the input data before feeding it to RDataFrame? Right now, having different normalization coefficients for every process file stops me from merging upfront.
  2. Is there a way to speed up the conversion times and decrease the memory usage? The way it is now really hinders my ability to do the analysis.

Thanks in advance,
Aleksandr

Hi Aleksandr,
thank you for the thorough report.
This is definitely a problematic behavior.

Before we dig into it, however, could you please check whether ROOT’s current master branch still presents these issues?

You can either download pre-compiled binaries of the nightlies or install them via conda (not an advertised distribution channel yet):

$ conda create -n root-nightly -c conda-forge -c https://root.cern/download/conda-nightly/latest root-nightly
$ conda activate root-nightly

Installing conda is fairly straightforward.

Cheers,
Enrico

Hi Enrico,

I’ve tested my code on the same Ubuntu machine with ROOT’s master version. The run times in seconds are

python, "heavy"	python, "light"	cpp, "heavy"	cpp, "light"
12.6357138157	21.2765231133	10.8265318871	20.9584360123
8.6492950916	16.8159852028	8.8113510609	17.554530859
8.3732161522	17.013764143	8.5724999905	18.1778972149
8.4610950947	18.0920848846	8.7197887898	18.6586410999
8.4218490124	17.5790328979	8.8868849278	18.871502161
8.6015489101	16.7283360958	8.7004339695	19.2456002235
8.7279579639	17.1717660427	9.1521298885	18.8681650162
8.5640778542	17.6466178894	8.8281259537	18.9460320473
8.6654288769	17.0789301395	9.1617879868	18.6554582119
8.7749049664	17.5882809162	8.6360168457	18.7109069824
8.5787391663	18.0985980034	8.6626999378	18.7014679909
8.8500330448	20.2722699642	8.7691810131	19.0996360779
8.6696448326	22.8912670612	9.2500059605	19.285490036
8.7564928532	20.6616809368	9.3329889774	20.0462639332
8.7350840569	18.5314588547	9.0782361031	19.4825839996
8.8899960518	18.2859280109	8.9463450909	19.6934659481
8.7146229744	18.0480101109	8.9019899368	18.2985579967
8.7302470207	18.2739129066	9.2911269665	17.2530460358
8.9851078987	20.4818389416	9.6494119167	19.7338588238
9.1543111801	20.0699179173	9.5622749329	20.7159221172

It’s slightly faster compared to v. 6.16/00 but still not as fast as pure C++ tree conversion code (as I’ve checked here).
Memory issues are still present too.

Hi,
this:

finalDF.Snapshot(outputTreeName, outputFileName, branchList);

double passedEntryNum = finalDF.Count().GetValue();
double passedWeightSum = finalDF.Sum("weight").GetValue();
double passedWeightSumNorm = finalDF.Sum("weightModified").GetValue();

triggers four event loops. You can rewrite it into:

auto passedEntryNum = finalDF.Count();
auto passedWeightSum = finalDF.Sum("weight");
auto passedWeightSumNorm = finalDF.Sum("weightModified");
finalDF.Snapshot(outputTreeName, outputFileName, branchList);

to produce all results in a single event loop.
Also, if your datasets contain many files or TTree clusters (you can check with tree->Print("clusters")), adding ROOT::EnableImplicitMT() at the top of the program will parallelize execution over files and clusters, which might speed things up further.
Compiling the code rather than running it via the ROOT interpreter might also give you a factor of 2 or more, depending on your use case.

I will take a better look at runtime and memory usage tomorrow.
What would be a reasonable runtime for this application?

Cheers,
Enrico

Hi,

to produce all results in a single event loop.

I’ve swapped the Snapshot() and the Sum()/Count() calls, but that yielded no improvement on my test files.

Also, if your datasets contain many files or TTree clusters (you can check with tree->Print("clusters")), adding ROOT::EnableImplicitMT() at the top of the program will parallelize execution over files and clusters, which might speed things up further.

I already have multithreading switched on; it’s on line 180 of ConvertDatasets.py.

Compiling the code rather than running it via the ROOT interpreter might also give you a factor 2 or more, depending on your use-case.

What would be the proper way to implement it?

What would be a reasonable runtime for this application?

1-2 seconds per process. But it’s the memory consumption that worries me the most. If it weren’t for the memory, I could just leave the processing running and come back in a few hours. But as it is now, I have to monitor the whole run to spot when it crashes and launch the processing again.

That’s not enough, you also have to remove the GetValue() calls, that’s what triggers the event loop.

RDataFrame parallelizes execution over files and TTree clusters, but your files are small, so they contain only one TTree cluster each, and execution over different files is serialized in a for loop; EnableImplicitMT will therefore yield no benefit (RDF only generates one task, so there is nothing to distribute to other worker threads).

I will produce some more measurements, provide an optimized version of the code, and take a look at what occupies memory.

Cheers,
Enrico

That’s not enough, you also have to remove the GetValue() calls, that’s what triggers the event loop.

What is the proper way to convert the values I get this way from types such as <cppyy.gbl.ROOT.RDF.RResultPtr<ULong64_t> object at 0x5dec470> (python) and ROOT::RDF::RResultPtr<unsigned long long> (C++) to regular int and double?

GetValue! :smiley: But after the event loop has run (in your case, Snapshot triggers it).
See here under “Executing multiple actions in the same event loop” for clarification (also, in the cheat sheet, you can see that Snapshot is an “instant action”, i.e. it triggers the event loop).

To clarify, this:

double passedEntryNum = finalDF.Count().GetValue();
double passedWeightSum = finalDF.Sum("weight").GetValue();
double passedWeightSumNorm = finalDF.Sum("weightModified").GetValue();

adds actions to the RDF and then requests the output values on the spot with GetValue, so RDF has to perform one event loop at each line.

This:

auto passedEntryNum = finalDF.Count();
auto passedWeightSum = finalDF.Sum("weight");
auto passedWeightSumNorm = finalDF.Sum("weightModified");

only books the actions, and as soon as you call GetValue (or any other method) on one of the results, all booked results are produced in a single event loop.
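
The booking pattern can be illustrated in a few lines of plain Python (no ROOT required). MockDataFrame and LazyResult below are hypothetical stand-ins for RDataFrame and RResultPtr, only meant to show why booking all actions before the first GetValue yields a single loop:

```python
# Plain-Python illustration of lazy result booking (no ROOT required).
# Actions are only *booked*; the first access to any result runs one
# shared "event loop" that fills all booked results at once.

class LazyResult:
    """Hypothetical stand-in for ROOT's RResultPtr."""
    def __init__(self, df, key):
        self._df = df
        self._key = key

    def GetValue(self):
        self._df._run_if_needed()   # one loop fills every booked result
        return self._df._values[self._key]

class MockDataFrame:
    """Hypothetical stand-in for RDataFrame over a list of 'events'."""
    def __init__(self, events):
        self._events = events
        self._booked = []
        self._values = {}
        self.loops_run = 0          # counts "event loops", for demonstration

    def Count(self):
        return self._book(("count", None))

    def Sum(self, column):
        return self._book(("sum", column))

    def _book(self, key):
        self._booked.append(key)
        return LazyResult(self, key)

    def _run_if_needed(self):
        if self._values:
            return                  # results already produced
        self.loops_run += 1         # a single pass over the data...
        for kind, column in self._booked:
            if kind == "count":
                self._values[(kind, column)] = len(self._events)
            else:
                self._values[(kind, column)] = sum(self._events)

df = MockDataFrame([1.0, 2.0, 3.0])
n = df.Count()        # booked, nothing runs yet
s = df.Sum("weight")  # booked, nothing runs yet
total = s.GetValue()  # first access triggers ONE loop that also fills n
count = n.GetValue()  # served from the already-filled results
```

In real RDF the same principle applies: book Count/Sum, let the instant action (Snapshot) run the loop, then read the booked results with GetValue at no extra cost.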

Hi,
here’s my report.

TL; DR

Once the obvious issues are fixed (running multiple event loops, enabling multi-threading with 10 worker threads even though each event loop only has workload for a single thread), runtime and memory are mostly eaten up by just-in-time compilation of RDF’s computation graph. In particular, the interpreter never gives back allocated memory, so a long running process that performs just-in-time compilation in a loop will have memory slowly creeping up.

A fully compiled C++ version of the program runs a factor 6 faster or more, and converting each input file in a different process (running all processes in parallel) further slashes runtimes.

Possible solutions include:

  • running the macro/script once per period in a loop: memory is reclaimed after each execution, the macro can be a python script or an optimized C++ script
  • jitting less by specifying template parameters to actions and using lambdas rather than strings in Filters/Defines
  • from python, you can load and run compiled C++ code via ROOT.gSystem.Load, or by compiling it on the fly with ROOT.gInterpreter.ProcessLine or similar
  • running the RDF conversion in a subprocess (which will free memory at exit), which also makes it possible to use multi-processing parallelization to improve runtimes
  • using something like the last C++ script I provide: it’s harder to write (it requires listing template parameters by hand, which is tedious), but its memory usage stays constant (the lowest of all the options tried) and runtimes are an order of magnitude better
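
The subprocess options above can be sketched in plain Python; convert_file below is a hypothetical stand-in for the real per-file RDF conversion, and the point is that each worker process returns its memory to the OS when it exits:

```python
import multiprocessing as mp

def convert_file(path):
    # Hypothetical stand-in for the real per-file RDataFrame conversion:
    # in the real script this would build the graph, Snapshot, and return
    # the bookkeeping numbers (entry count, weight sums).
    return (path, len(path))  # dummy payload

def convert_all(files, workers=3):
    # maxtasksperchild=1 recycles each worker after a single file, so any
    # memory the interpreter allocated is freed at process exit; running
    # several workers also parallelizes the conversion across files.
    ctx = mp.get_context("fork")  # assumes Linux; use "spawn" elsewhere
    with ctx.Pool(processes=workers, maxtasksperchild=1) as pool:
        return pool.map(convert_file, files)

if __name__ == "__main__":
    results = convert_all(["p1.root", "p2.root", "p3.root"])  # hypothetical names
```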

Full measurements

In the following, I run over the 3 “heavy” files.
Times are measured with python’s time module or TStopwatch (for the C++ programs), and averaged over 3 executions.
Memory usage is the max resident size as reported by /usr/bin/time.
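
(For reference, the same peak-memory figure can also be read from inside a Python script via the standard resource module on Unix, which avoids wrapping the run in /usr/bin/time; note that ru_maxrss is in kB on Linux but in bytes on macOS.)

```python
import resource

def max_resident_kb():
    # Peak resident set size of the current process so far.
    # On Linux, ru_maxrss is reported in kilobytes, matching /usr/bin/time.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```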

Each row in the table lists the modification added w.r.t. the row before (i.e. the last row includes all the modifications listed above).

Additional modifications	Time (s)	Max res. memory (kB)
none (original python script)	10.14 +/- 1.46	664876
no EnableImplicitMT	9.22 +/- 1.42	629028
single event loop	7.18 +/- 1.09	581932
do not merge subprocess results	6.30 +/- 1.13	575640
C++ macro via the ROOT interpreter	5.89 +/- 0.41	510592
compiled C++ program (-O2)	5.88 +/- 0.62	489944
use Alias instead of Defines where possible	5.76 +/- 0.60	487848
use lambdas instead of strings in Defines/Filters	4.63 +/- 0.44	449628
use Snapshot template parameters	1.52 +/- 0.28	285008
processing different files in different processes concurrently	0.83 +/- 0.11	269136

The scripts

Each subsequent script is a modified version of the one before, with the additions mentioned in the table.

original python script
python + no IMT + 1 evt loop + no result merging (4.4 KB)
compiled C++ program (3.0 KB)
C++ program with no jitting: using Alias, lambdas instead of strings, template parameters for Snapshot (3.3 KB)
C++ program with multi-process parallelism (the last line in the table) (3.4 KB)

Homework for the devs

  1. it would be lovely if RDataFrame had an option to be verbose about its event loops, so it could tell users where time is being spent
  2. just-in-time compilation of Snapshot calls could be faster

Conclusions

At the end of the day, the problem is that, because of just-in-time compilation of the computation graph, running many very short (order ~1 second) event loops is a worst case scenario for python+RDF (in C++, all is fine): each event loop will hog a bit more memory, start-up (i.e. just-in-time compilation) takes about as long as the actual processing, and there is no opportunity to parallelize the event loop on multiple threads because the workload is too small.

The good news is RDF’s core is fast, the bad news is it’s hidden under too many layers.

Above, in the TL;DR, I list several possible mitigations. The nuclear option is of course to write your program in fully typed C++, which is fast and does not hog memory.

Cheers,
Enrico

Hi Enrico,
Thank you (once again, this time publicly :slight_smile: ) for your extensive testing and advice.

I’ve decided not to drop the merging step: I’m willing to lose some processing time so that I won’t have to merge the outputs afterwards. So I’ve tried the following setups:

  1. Original python setup (5.9 KB)
  2. python setup + no excessive loops + no MT (7.7 KB)
  3. Python base + C++ convert function with the same fixes (1.8 KB)
  4. Python base + C++ convert function from your last file (Aliases + lambdas + Snapshot template) (1.9 KB)
  5. Python base + C++ convert function based on a simple tree for-loop (4.6 KB)

I’ve tested them on both datasets, running each 3 times and averaging. I ran on my local Ubuntu machine with the stable ROOT 6.20 version.

The results are:

Code version	‘heavy’ time, s	‘light’ time, s
Original python	13.7 +/- 2.1	31.2 +/- 2.1
Fixed python	10.4 +/- 1.3	23.8 +/- 1.9
python base + fixed C++ RDF	10.5 +/- 0.4	23.0 +/- 0.5
python base + modified C++ RDF	2.9 +/- 0.8	0.7 +/- 0.7
python base + C++ tree loop	3.0 +/- 0.3	0.3 +/- 0.1

The fourth example is exactly what I needed. It is not as easy to maintain as the JITted version (especially the Snapshot template; I think I can get away without using lambdas and aliases), but it is way faster, especially on the “light” dataset. And it’s much easier to write than the tree loop; I’d forgotten how clunky those tend to be.

Once again, I’m very grateful for your help.
Cheers,
Aleksandr.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.