RDataFrame vs TTree performance

Dear experts,

I recently came across RDataFrame, and it looks like a very powerful tool.

My question is about performance. Suppose I want to write a plotter that handles multiple ROOT files, creates histograms and then manipulates them. My current plotter is a C++ framework that basically uses the tree->Draw() method and then manipulates the histograms to arrange the plots nicely. The running time depends on the size of the ROOT files.

Would a similar approach with RDataFrame be faster or slower?

If RDataFrame is faster, do you think a Python version of this plotter could still perform well?

Thanks for the attention.

Best regards,
Francesco




Hi @Francesco_Cirotto ,
TTree::Draw runs a full loop over the data for every invocation. Correctly written RDataFrame code runs a single event loop that fills many histograms. Coupled with multi-threading parallelization on large datasets, typically RDataFrame will be faster (if you fill more than a couple histograms and your datasets are not too small).
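To make the difference concrete, here is a toy model in plain Python (not ROOT code). A list stands in for the tree, simple counts stand in for histograms, and the hypothetical CountingDataset class only records how many full passes are made over the data. It is a sketch of the two looping patterns, not of how TTree::Draw or RDataFrame are actually implemented.

```python
class CountingDataset:
    """Toy stand-in for a tree: counts how many times it is iterated."""

    def __init__(self, values):
        self._values = list(values)
        self.passes = 0

    def __iter__(self):
        self.passes += 1  # one more full loop over the data
        return iter(self._values)


def fill_one_per_loop(data, selections):
    """One full pass over the data per 'histogram' (Draw-like pattern)."""
    return [sum(1 for v in data if sel(v)) for sel in selections]


def fill_all_in_one_loop(data, selections):
    """A single pass that fills every 'histogram' (RDataFrame-like pattern)."""
    counts = [0] * len(selections)
    for v in data:
        for i, sel in enumerate(selections):
            if sel(v):
                counts[i] += 1
    return counts


# Illustrative selections; each count plays the role of one histogram.
selections = [lambda v: v > 0, lambda v: v % 2 == 0, lambda v: v < 5]

per_call = CountingDataset(range(10))
single = CountingDataset(range(10))

counts_per_call = fill_one_per_loop(per_call, selections)
counts_single = fill_all_in_one_loop(single, selections)
```

Both functions give the same counts, but per_call.passes ends up equal to the number of selections, while single.passes stays at 1. The more histograms you fill, the more the single-loop pattern saves.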

Cheers,
Enrico

Hello @eguiraud thanks for the reply.

So, as far as I understand, RDataFrame has lazy actions, which avoid running the event loop for every histogram that is filled. Optimized code should work as follows:

  • define the selections
  • create the histograms, for example with RDataFrame.Filter(SELECTION).Histo1D()

Am I right?

Cheers,
Francesco

Yes, as long as you make sure that all the Histo1D calls are made before you access any of the results.
The event loop is started the first time any of the results is used, and it generates all the results booked so far. See the docs for more info.

Cheers,
Enrico

P.S.
in short, do (1 event loop):

h1 = df.Histo1D(...)
h2 = df.Histo1D(...)
h1.Draw()
h2.Draw("SAME")

and not (2 event loops):

h1 = df.Histo1D(...)
h1.Draw()
h2 = df.Histo1D(...)
h2.Draw("SAME")

You can use RDF’s verbose logging to debug, and/or ask RDF how many event loops it has run at a given point of the program execution with df.GetNRuns().

Thanks @eguiraud for the advice!
A final question: is it OK to use TStopwatch to compare the running time of the RDataFrame and TTree approaches? Or does ROOT offer more sophisticated tools?

Cheers,
Francesco

Yes, that’s OK.
RDataFrame also has its own logging in case it’s useful; see "Performance profiling of RDataFrame applications" in the ROOT::RDataFrame class reference.

Cheers,
Enrico

In my experience, with tuples of O(1M) entries, making O(40) histograms with RDataFrame was O(20-30) times faster than making them with plain ROOT, as well as easier to debug and quite elegant from a code-style point of view.
The real bottleneck I reached was due to having too much jitted code, but I think this is something that is either fixed (RunGraphs, if I am not wrong) or planned to be at some point. In any case, this will not answer your question @Francesco_Cirotto, but RDF sped up my analysis and helped me and the analysis team I work with to do in a matter of minutes what used to take days.

Ciao @RENATO_QUAGLIANI, thanks for the reply! Indeed, I agree with you. I really appreciate RDataFrame, which is why I’m interested in “converting” old code to this new tool.
Just out of curiosity, do you use RDF in a C++ or a Python environment?

Mainly C++, but I also compile all my code into libraries, which I then load directly in Python within my analysis framework if I need some C++ functionality there.

For example, with some LinkDef and libraries for the code I have, in C++ I do

node = HelperProcessing::AttachWeights(node);

and in Python

node = r.HelperProcessing.AttachWeights(node)

where HelperProcessing is just a class with public methods that internally use RDF C++ functors etc…
