ROOT Version: v6.12.06 / v6.14.04
Platform: Ubuntu 18.04
Compiler: gcc
Cores: 8
Hi all,
I am exploring various aspects of the very nice new TDataFrame/RDataFrame objects, and I have made some performance studies that might be interesting feedback (I apologize in advance if they are not!). I summarize my observations here; all the code and results are available in this github repository (tag v0.1), except the input ROOT file, which is ATLAS internal (suggestions for working around this are welcome if someone wants to reproduce these results).
Cheers,
Romain
1. Performance
The following numbers correspond to the running time over 0.8 million events, with ~10 new variables computed (very simple ones, booleans) and the selected events (~50%) saved to a new ROOT file. One detail to mention is that the explicit event loop doesn’t use multi-threading, while the `DataFrame`-based code does.
| | TDataFrame (v6.12.06) | RDataFrame (v6.14.04) | Explicit event loop* |
|---|---|---|---|
| C++ | 261.37 s | 57.75 s | 19.5 s |
| Python | 114.82 s | 28.50 s | - |
(*) The 10 new boolean variables are not computed in the explicit event loop, but I don’t think computing them is time-consuming.
Several questions can be raised:
- why the difference between Python and C++ (in v6.14.04 for instance)?
- why the difference between the explicit event loop without MT and DataFrame with MT?
- is it expected to have such a big difference between `TDataFrame` and `RDataFrame` (for instance in Python)?
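To put numbers on these questions, the ratios implied by the timing table can be worked out directly. This is only arithmetic on the values quoted above; the variable names are mine, and no new measurement is involved.

```python
# Wall-clock times in seconds, copied from the timing table in this post.
times = {
    "C++": {"TDataFrame": 261.37, "RDataFrame": 57.75, "explicit loop": 19.5},
    "Python": {"TDataFrame": 114.82, "RDataFrame": 28.50},
}

# How much faster RDataFrame (v6.14.04) is than TDataFrame (v6.12.06), per language.
tdf_over_rdf = {
    lang: t["TDataFrame"] / t["RDataFrame"] for lang, t in times.items()
}

# How much slower the C++ version is than the Python one, per ROOT version.
cpp_over_python = {
    version: times["C++"][version] / times["Python"][version]
    for version in ("TDataFrame", "RDataFrame")
}

# How much slower multi-threaded RDataFrame is than the single-threaded explicit loop (C++).
rdf_over_loop = times["C++"]["RDataFrame"] / times["C++"]["explicit loop"]

for lang, r in tdf_over_rdf.items():
    print(f"{lang}: TDataFrame / RDataFrame = {r:.2f}")
for version, r in cpp_over_python.items():
    print(f"{version}: C++ / Python = {r:.2f}")
print(f"C++: RDataFrame / explicit loop = {rdf_over_loop:.2f}")
```

In short, the move from v6.12.06 to v6.14.04 gives roughly a factor 4 to 4.5, C++ is about a factor 2 slower than Python in both versions, and the explicit loop is about a factor 3 faster than the C++ RDataFrame code.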
2. Instability of `df.Count()` results in v6.14.04
While doing these studies, I realized that the number of events in the final dataframe changes from one execution to another of the exact same code, at least in v6.14.04 (it doesn’t seem to be the case for v6.12.06). The next table shows the number of selected events, extracted in two ways, `df.Count()` and `tree.GetEntries()`, in both Python and C++, for four runs of the same code (*).
| `df.Count()` / `t.GetEntries()` | Python | C++ |
|---|---|---|
| Run 1 | 421186 / 421186 | 422593 / 421638 |
| Run 2 | 421150 / 421150 | 422595 / 421638 |
| Run 3 | 421638 / 421638 | 423074 / 421638 |
| Run 4 | 422593 / 422593 | 420172 / 421638 |
Observations:
- in Python the two numbers are always equal but vary from one execution to another
- in C++ the number of saved events in the actual output tree is constant from one execution to another, while this is not the case for the `df.Count()` output.
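The size of these fluctuations can be read off the table with a few lines of arithmetic. The snippet below only re-uses the numbers quoted in this post; the variable names are mine, and it does not attempt to explain where the differences come from.

```python
# Numbers copied from the df.Count() / t.GetEntries() table in this post.
tree_entries_cpp = 421638                             # constant over the four C++ runs
count_cpp = [422593, 422595, 423074, 420172]          # df.Count() in C++, runs 1 to 4
count_python = [421186, 421150, 421638, 422593]       # df.Count() in Python, runs 1 to 4

# Difference between df.Count() and the entries actually written, per C++ run.
deviation_cpp = [c - tree_entries_cpp for c in count_cpp]

# Largest relative deviation in C++, in percent.
max_rel_dev_cpp = 100.0 * max(abs(d) for d in deviation_cpp) / tree_entries_cpp

# Run-to-run spread of the Python result.
spread_python = max(count_python) - min(count_python)

print("C++ df.Count() - t.GetEntries():", deviation_cpp)
print(f"largest relative deviation in C++: {max_rel_dev_cpp:.2f} %")
print("Python max - min over runs:", spread_python)
```

The deviations are at the level of a few permille of the selected events, so they are small but far from negligible for a cutflow.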
(*) What “a run” means in each case:
- Python: `python CompareTDF_v14.py`
- C++: start `root`, then enter `.L CompareTDF_v14.C+`, `RunTDF()` and `.q` at the prompt
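For readers without access to the repository, here is a minimal PyROOT sketch of the kind of check done in each run: define a couple of booleans, filter, save the selected events, and compare `df.Count()` with the entries of the written tree. It is not the code from the repository. The function name, the tree name, the file names and the branch names (`lep_pt`, `n_jets`) are all placeholders I made up for illustration, and the sketch makes no claim about the origin of the instability.

```python
def count_vs_tree_entries(in_file, out_file, tree_name="tree", n_threads=8):
    """Skim in_file and return (df.Count() value, entries of the written tree).

    All names (tree, branches, cuts) are hypothetical placeholders.
    """
    import ROOT  # imported here so the function can be defined without ROOT installed

    # The DataFrame event loop runs multi-threaded, as in the numbers quoted above.
    ROOT.ROOT.EnableImplicitMT(n_threads)

    df = ROOT.ROOT.RDataFrame(tree_name, in_file)

    # Two simple boolean variables (placeholders for the ~10 used in the study).
    df = (df.Define("pass_pt", "lep_pt > 25000.")
            .Define("pass_njets", "n_jets >= 2"))
    selected = df.Filter("pass_pt && pass_njets")

    # Count() is lazy: booking it before Snapshot() lets both results be
    # filled during the same event loop, which Snapshot() triggers.
    n_counted = selected.Count()
    selected.Snapshot(tree_name, out_file)

    # Read back the number of entries actually written to disk.
    out = ROOT.TFile.Open(out_file)
    n_written = out.Get(tree_name).GetEntries()
    out.Close()

    return n_counted.GetValue(), n_written
```

Calling something like `count_vs_tree_entries("input.root", "skim.root")` several times (file names again hypothetical) and printing the two returned numbers would give one row of the table per call.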