TDataFrame feedback: performance comparisons, df.Count() instability


ROOT Version: v6.12.06 / v6.14.04
Platform: Ubuntu 18.04
Compiler: gcc
Cores: 8


Hi all,

I am exploring various aspects of the very nice new TDataFrame/RDataFrame objects and I have made some performance studies that could possibly be interesting feedback (I apologize in advance if it's not!). I provide a summary of my observations here; all the code/results are available in this github repository (tag v0.1), except the input root file which is ATLAS internal (feedback welcome on how to work around this if someone wants to reproduce these results).

Cheers,
Romain



1. Performance

The following numbers correspond to running over 0.8 million events, computing ~10 new variables (very simple ones - booleans), and saving the selected events (~50%) in a new ROOT file. One detail to mention is that the explicit event loop doesn't use multi-threading, while the DataFrame-based code does.

           TDataFrame (v6.12.06)   RDataFrame (v6.14.04)   Explicit event loop*
C++        261.37 s                57.75 s                 19.5 s
Python     114.82 s                28.50 s                 -

(*) the 10 new boolean variables are not computed in the explicit event loop, but I don't think they are time consuming
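
For context, the DataFrame-based version of this workflow is roughly of the following form (a sketch only; the branch names and cut values are illustrative, not the actual ATLAS ones):

// Sketch of the timed workflow (illustrative branch names and cuts)
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>

void RunTDF()
{
   ROOT::EnableImplicitMT(); // the DataFrame-based runs use multi-threading

   ROOT::RDataFrame df("nominal_Loose", "input.root");

   // ~10 very simple boolean Defines (only two shown here)
   auto df2 = df.Define("pass_leppt", "lep_pt > 25000.")
                .Define("pass_njet", "njet >= 4");
   // ... 8 more similar Defines ...

   // keep ~50% of the events and write them to a new file
   df2.Filter("pass_leppt && pass_njet")
      .Snapshot("nominal_Loose", "output.root");
}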

Several questions can be raised:

  • why the difference between python and C++ (in v6.14.04 for instance)?
  • why the difference between explicit event loop without MT and DataFrame with MT?
  • is it expected to have such a big difference between TDataFrame and RDataFrame (for instance in python)?

2. Instability of df.Count() results in v6.14.04

While doing these studies, I realized that the number of events in the final dataframe changed from one execution of the exact same code to another, at least in v6.14.04 (it doesn't seem to be the case for v6.12.06). The next table shows the number of selected events extracted in two ways, df.Count() and tree.GetEntries(), in both Python and C++, for four runs of the same code (*).

df.Count() / t.GetEntries()   Python            C++
Run 1                         421186 / 421186   422593 / 421638
Run 2                         421150 / 421150   422595 / 421638
Run 3                         421638 / 421638   423074 / 421638
Run 4                         422593 / 422593   420172 / 421638

Observations

  • in Python the two numbers are always equal but vary from one execution to another
  • in C++ the number of saved events in the actual output tree is constant from one execution to another, while that is not the case for the df.Count() output.

(*) what a “run” means in:

  • Python
python CompareTDF_v14.py
  • C++
root
.L CompareTDF_v14.C+
RunTDF()
.q
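
For clarity, the two numbers compared in the table above are obtained roughly like this (a sketch; df_output is the dataframe after the filters, and the output file/tree names are illustrative):

// Sketch of how the two counts are extracted (illustrative names)
#include <iostream>
#include <TFile.h>
#include <TTree.h>

template <typename RDF>
void CheckCounts(RDF &df_output)
{
   // 1) count directly from the dataframe, after all Filters
   auto n = df_output.Count();
   std::cout << "df.Count():        " << *n << std::endl;

   // 2) independently re-open the file written by Snapshot and count its entries
   TFile f("output.root");
   auto *t = static_cast<TTree *>(f.Get("nominal_Loose"));
   std::cout << "tree.GetEntries(): " << t->GetEntries() << std::endl;
}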

Hi @rmadar ,
this is super useful feedback.

TDataFrame in v6.12 was the first prototype; it's now obsolete and, as you can see, it was also much slower, so I would focus on the numbers you produced for v6.14, C++ and Python.

A few questions before we dig deeper in the performance measurements:

  • RDataFrame parallelizes over TTree clusters. How many clusters does your dataset have? You can check with tree->Print("clusters") or with the method described here
  • from your code it seems that the explicit event loop is doing less work, is it an apples-to-apples comparison?
  • could you produce timings for code compiled with optimizations? (g++ -O3)
  • what are the timings for RDataFrame without MT?
  • how hard would it be to try the same with ROOT master branch?

Regarding the instability of Count(): it looks like a bug, would it be possible to have a standalone reproducer that we can debug?

Many thanks for the super interesting feedback again!
Cheers,
Enrico

Hi @eguiraud,

Thanks for the quick feedback! Please see some replies below.

Cheers,
Romain




Regarding the instability of Count(): it looks like a bug, would it be possible to have a standalone reproducer that we can debug?

Edit: removing the multi-threading solved the df.Count() issue!

That's probably possible, yes, but it will take me a bit of time. Let me keep that in mind for a little later if that's OK. In the meantime, I have noticed something which could be relevant for this (and the other tests): when I compile my macro in ROOT, I get the following warning:

root [0] .L CompareTDF_v14.C+
Info in <TUnixSystem::ACLiC>: creating shared library /home/rmadar/cernbox/PythonDev/HEP/root-dataframe-studies/./CompareTDF_v14_C.so
Warning in cling::IncrementalParser::CheckABICompatibility():
  Possible C++ standard library mismatch, compiled with __GLIBCXX__ '20180415'
  Extraction of runtime standard library version was: '20180720'

  • RDataFrame parallelizes over TTree clusters. How many clusters does your dataset have? You can check with tree->Print("clusters") or with the method described here

Here is the output of tree->Print("clusters"):

[rmadar@clratlport01]:root-dataframe-studies>root input.root 
root [1] nominal_Loose->Print("clusters")
******************************************************************************
*Tree    :nominal_Loose: tree                                                   *
*Entries :   883595 : Total =      6048238919 bytes  File  Size = 2022801237 *
*        :          : Tree compression factor =   2.99                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                883594            1000
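
For completeness, the clusters can also be counted programmatically with TTree::GetClusterIterator (a small sketch using the file/tree names above):

// Sketch: count the clusters of the input tree programmatically
#include <iostream>
#include <TFile.h>
#include <TTree.h>

void CountClusters()
{
   TFile f("input.root");
   auto *tree = static_cast<TTree *>(f.Get("nominal_Loose"));

   auto clusterIt = tree->GetClusterIterator(0);
   Long64_t start = 0;
   int nClusters = 0;
   while ((start = clusterIt()) < tree->GetEntries())
      ++nClusters;

   std::cout << nClusters << " clusters" << std::endl;
}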

  • from your code it seems that the explicit event loop is doing less work, is it an apple to apple comparison?

Yes, you are right, and it was actually mentioned in the original post. The only thing which is not performed in the explicit loop is the calculation of the 10 new booleans. I'll quickly implement the same computation (it's not difficult, just a bit of time).


  • could you produce timings for code compiled with optimizations? (g++ -O3)

OK, to be honest, I have always compiled within ROOT and I am not really able to write a makefile. But I can try to ask for help around.


  • what are the timings for RDataFrame without MT?

That's easy to get (I keep the previous timings for the record). Interestingly, without MT the timing gets closer to the explicit event loop (and is much shorter than with MT!):

              Python    C++
EventLoop     -         19.5 s
DF w/ MT      28.5 s    57.7 s
DF w/o MT     23.3 s    22.1 s
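
For the record, the only difference between the two DF configurations is whether implicit multi-threading is enabled before constructing the dataframe, roughly (a sketch):

// Sketch: the only difference between "w/ MT" and "w/o MT"
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>

void RunTDF(bool useMT = true)
{
   if (useMT)
      ROOT::EnableImplicitMT(); // "w/ MT": RDF parallelizes over the TTree clusters
   // if it is not called, the event loop runs on a single thread ("w/o MT")

   ROOT::RDataFrame df("nominal_Loose", "input.root");
   // ... same Defines / Filter / Snapshot as before ...
}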

  • how hard would it be to try the same with ROOT master branch?

I am not sure since I have never tried. Is there a compiled version somewhere on lxplus? If yes, that can be tried fairly easily I think.

Some additional follow-up

1. Fair comparison with explicit event loop

from your code it seems that the explicit event loop is doing less work, is it an apple to apple comparison?

This is now done - the code is here - and the time goes from 19.5 s to 35.6 s, meaning that DF w/o MT is now faster (22.1 s).

2. Comparison of “small” vs. “large” datasets

This table summarizes the running time for the operations described at the beginning, for 5 kEvents or 880 kEvents, using the C++ interface with ROOT v6.14.

              5 kEvents   880 kEvents
RDF (w/o MT)  5.6 s       22.1 s
EventLoop     0.35 s      35.6 s

Main question: is it expected to have such a long running time for a few thousand events compared with an explicit loop?

You’re using expressions defined as strings, so the long run time for a small number of events might be the JIT overhead.

The warning you get is because someone updated the gcc version in LCG 94 without also rebuilding ROOT.

ROOT master should be available in LCG dev4.

I also noticed in the C++ test you run Count on df_output, which contains filters.

@rmadar I think we are starting to have a bit too many things in flight at the same time :smile:

These are the two things that I would like to look into a bit better:

1. Multi-thread event loop takes more than twice as long as the single-thread event loop or the Python event loop (single- or multi-threaded)

The main problem with your multi-thread timings is that you only have one cluster in input.root, so multi-threading is expected to buy you nothing. It is not expected to cause a large overhead though, certainly not 57s vs 22s. It would be good to check where the executable spends those extra 30s, I have never seen this effect.

2. Results of Count fluctuate

We have several tests for Count (including multi-thread ones) but we never saw this, and it looks scary.

Given a reproducer, we could look into these issues. Otherwise, good first steps would be to:

  • check which of these issues are still present in the master branch
  • check if C++ compiled with -O3 sees the performance problems
  • check whether anything out of the ordinary is visible in the performance profile produced by perf or vtune
  • check whether valgrind (memcheck, helgrind, drd) complains about anything: valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ./repro

Cheers,
Enrico

Hello @eguiraud,

Thanks for the feedback, I have put everything you need on my CERNbox so that you can reproduce and understand the two issues you are mentioning.

Another point which I would like to understand is the very long overhead observed when running on a few thousand events. Indeed, this point currently forces me to switch back to the t->SetBranchAddress() formalism for now (not that I am happy about it, but it saves me a lot of time overall).

Cheers,
Romain

Does switching to lambdas improve the overhead for small numbers of events?

It takes 4.6 s to run over 5 kEvts using lambda functions, so it improved, but just a little bit. There is still a factor of 10 with respect to an explicit event loop. The code I used can be found here.
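
For reference, the change is roughly from jitted string expressions to typed lambdas, e.g. (a sketch with an illustrative branch name and type):

// Sketch: jitted string expression vs. typed lambda
#include <ROOT/RDataFrame.hxx>

void DefineExamples()
{
   ROOT::RDataFrame df("nominal_Loose", "input.root");

   // string expression: the corresponding C++ code is generated and compiled at run time (JIT)
   auto df_str = df.Define("pass_leppt", "lep_pt > 25000.");

   // equivalent typed lambda: nothing to jit for this Define
   auto df_lam = df.Define("pass_leppt",
                           [](float lep_pt) { return lep_pt > 25000.f; },
                           {"lep_pt"});
}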

Thanks,
Romain

Hi @beojan,
jitting should have the same overhead with and without multi-threading, so it does not explain why multi-thread runs take more than double the time.

Thanks a lot,
it’s on my to-do list, will write back here as soon as I have something :slight_smile:


Hi,
some initial feedback:

  • initial exploration suggests that the slowdown of the multi-thread runs w.r.t. single-thread is due to Snapshot, and the reason lies in the clustering of the input file: input.root has 883 very small clusters of 1000 entries each, and RDF divides the processing into 883 different tasks which spend most of their time getting to the data they need to read, then process very little. How was input.root produced? Somehow ROOT master seems not to be affected by this problem, so ROOT v6.16 (released soon) won't be either, but I'd like to understand it more clearly
  • the slowdown of RDF w.r.t. a simple event loop is also due to Snapshot, because that call is just-in-time compiled and the jitted call contains many template parameters (one per output column) – the jitting alone takes ~4 seconds, and you can see it even from htop: even with implicit MT enabled, several seconds are spent in single-thread operations (jitting), then there is a short burst of multi-thread processing
  • from my experiments it looks like the instability of df.Count() is not really a problem of Count(): it’s again a problem of Snapshot, which has a race condition because of which sometimes some entries are not written out. This is the most severe issue and the one I’m debugging at the moment, thank you for reporting it!

The only thing I do not understand is how, in your first post, sometimes df.Count() (on the output of a Snapshot, right?) returns more entries than t.GetEntries() (where t is the output of a normal event loop that writes a tree t, right?). Are you applying the same cuts and all of the cuts in the normal event loop case there?

Cheers,
Enrico

Hello Enrico, thanks!


How was input.root produced? Somehow ROOT master seems to not be affected by this problem, so ROOT v6.16 (released soon) won’t either, but I’d like to see more clearly

This was done with a hadd command over ~10 files. Each of them was produced from a derived xAOD (ATLAS format) using AnalysisTop (internal ATLAS software based on AnalysisBase).


The slowdown of RDF w.r.t. a simple event loop is also due to Snapshot, because that call is just-in-time compiled and the jitted call contains many template parameters (one per output column) – the jitting alone takes ~4 seconds, and you can see it even from htop: even with implicit MT enabled, several seconds are spent in single-thread operations (jitting), then there is a short burst of multi-thread processing

OK, I understand, but I am still a bit puzzled by the last test I did without jitted strings, which takes ~4 s for only 5 kEvts. This is maybe a different issue from w/ MT versus w/o MT, but still…


t.GetEntries() where t is the output of a normal event loop that writes a tree t, right?

Not really, actually. The output of t.GetEntries() is obtained from the TTree saved by the df.Snapshot() command. In other words, I open the output file produced by df.Snapshot(), grab the TTree and check the number of entries there.

I am still a bit puzzled then by the last test I have done without jitted strings

What takes a long time to jit is not the strings, but the call to Snapshot!
Snapshot(...., {"a","b","c"}) is just-in-time compiled to Snapshot<A, B, C>(..., {"a", "b", "c"}), and in your case the list of template parameters is pretty long as you are snapshotting tens of branches. The 4 seconds of jitting should not be a problem for large analyses (even if the analysis is 40 minutes long and runs on N cores, it will still only take 4 seconds to do the jitting of that Snapshot) but of course it skews the timings of small benchmarks by a lot.
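
Concretely, the difference is between the jitted and the explicitly templated call, roughly as follows (a sketch with illustrative column names and types):

// Sketch: jitted vs. explicitly templated Snapshot (illustrative columns)
#include <ROOT/RDataFrame.hxx>

void SnapshotExamples(ROOT::RDataFrame &df)
{
   // jitted call: the column types are inferred and the fully templated call
   // is generated and compiled at run time (the ~4 s offset)
   df.Snapshot("nominal_Loose", "output.root", {"lep_pt", "njet", "pass_leppt"});

   // explicitly templated call: the types are spelled out, nothing left to jit
   df.Snapshot<float, int, bool>("nominal_Loose", "output_typed.root",
                                 {"lep_pt", "njet", "pass_leppt"});
}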

The output of t.GetEntries() is obtained from the TTree saved by the df.Snapshot() command

Uh, very interesting, thanks

The output of t.GetEntries() is obtained from the TTree saved by the df.Snapshot() command

Regarding this little puzzle, my guess is that you did not notice a typo because of which the output file produced by Snapshot was called ouput.root instead of output.root. So you were reading the entries written by Snapshot with out_df.Count(), but when calling t.GetEntries() you were always reading another file (the one previously produced by the python script, correctly named output.root). In fact, the output of t.GetEntries() is always 421638, which is the last number of entries you got in your Python tests according to the README.md attached to the reproducer you shared.

Performance
About the performance issues, you were measuring CPU time, not wallclock time (see std::clock docs). Total CPU time scales with the number of threads! :smile:

Using TStopwatch::RealTime I see wallclock time go from ~18s for a single-thread run on input.root to ~9s for a run on 4 cores. The scaling is still bad, but waaaay better than what we thought before :smile: I think the problem is a combination of bad clustering of the TTree and a constant offset of ~4s required to just-in-time compile your Snapshot.

Wrong number of entries in output
The instability in Count() was really a bug in Snapshot (jira ticket ROOT-9770, thanks for reporting and providing a repro, it was a huge help) that is now fixed in master. The fix will be present in ROOT v6.16 and v6.14/08, both coming soon.

This should solve everything as far as I can tell, but of course I’m available for further clarifications and/or discussions.

Cheers,
Enrico

Hi @eguiraud,

Thanks a lot for the feedback, that’s really interesting (and, well, a bit embarrassing too :wink: )!

Cheers,
Romain



but you were always reading another file

I apologize for this, you’re right! I have just corrected this in my git repository now.



About the performance issues, you were measuring CPU time, not wallclock time

Right! Now I understand the difference (I wasn't really familiar with it before, to be honest). I implemented checks based on clock_t, std::chrono and TStopwatch for input.root and I get the following times with v6.14 (C++ code with lambda functions):

time (s)    clock   chrono   twatch_CPU   twatch_Real
w/ MT       56.3    29.7     56.3         29.7
w/o MT      21.7    22.0     21.7         22.0
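
The checks look roughly like this (a sketch of the timing harness, where RunTDF() stands for the RDataFrame workload being timed):

// Sketch: clock_t measures CPU time, chrono measures wall time, TStopwatch gives both
#include <chrono>
#include <ctime>
#include <iostream>
#include <TStopwatch.h>

void TimeRunTDF()
{
   const std::clock_t c0 = std::clock();
   const auto t0 = std::chrono::steady_clock::now();
   TStopwatch sw;
   sw.Start();

   RunTDF(); // the RDataFrame workload being timed

   sw.Stop();
   const auto t1 = std::chrono::steady_clock::now();
   const std::clock_t c1 = std::clock();

   std::cout << "clock:       " << double(c1 - c0) / CLOCKS_PER_SEC << " s (CPU)\n"
             << "chrono:      " << std::chrono::duration<double>(t1 - t0).count() << " s (wall)\n"
             << "twatch_CPU:  " << sw.CpuTime() << " s\n"
             << "twatch_Real: " << sw.RealTime() << " s\n";
}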

So now, the numbers make more sense, but I still don’t get the same scaling factor as you.

I see wallclock time go from ~18s for a single-thread run on input.root to ~9s for a run on 4 cores

What was the ROOT version for these numbers?



The instability in Count() was really a bug in Snapshot (jira ticket ROOT-9770, thanks for reporting and providing a repro, it was a huge help) that is now fixed in master. The fix will be present in ROOT v6.16 and v6.14/08, both coming soon.

Very nice, I am happy I was able to help!

What was the ROOT version for these numbers?

ROOT master branch, compiled with CMAKE_BUILD_TYPE=Release (i.e. an optimized build).
The timings were for your Compare_v14_NoJittedString.C, to which I added a main function that only calls RunTDF(), and that I compiled with -O3 optimizations.

It would be nice if you could try to repeat those measurements for ROOT’s current master branch and for a version of your script compiled with optimizations (g++ -O3).

You can compile your script with optimizations with g++ -O3 -o tdf_bench CompareTDF_v14_NoJittedString.C $(root-config --libs --cflags) (after adding a main).

Cheers,
Enrico
