RDataFrame vs TTree performance

Francesco_Cirotto · October 19, 2021, 2:56pm

Dear experts,

I recently started using RDataFrame, and it looks like a very powerful tool.

My question is about performance. Suppose I want to write a plotter that handles multiple ROOT files, creates histograms and then manipulates them. My current plotter is a C++ framework that basically calls the tree->Draw() method and then manipulates the resulting histograms to arrange the plots nicely. The running time depends on the size of the ROOT files.

Would a similar approach with RDataFrame be faster or slower?

If RDataFrame is faster, do you think a Python version of this plotter could still perform well?

Thanks for the attention.

Best regards,
Francesco

eguiraud · October 19, 2021, 3:06pm

Hi @Francesco_Cirotto ,
TTree::Draw runs a full loop over the data for every invocation. Correctly written RDataFrame code runs a single event loop that fills many histograms. Coupled with multi-thread parallelization on large datasets, RDataFrame will typically be faster (provided you fill more than a couple of histograms and your datasets are not too small).
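To make the difference in loop structure concrete, here is a toy model in plain Python. It does not use ROOT at all: the dataset, the column names and the helper functions `draw_like` and `single_loop` are all made up for illustration. It only counts how many times each event is read when every histogram gets its own pass, compared with one pass that fills all of them.

```python
from collections import Counter

# Hypothetical dataset: each "event" is a dict of column values.
events = [{"pt": i % 100, "n_jets": i % 7} for i in range(10_000)]


def bin_of(value, width):
    """Map a value to the index of the bin of the given width."""
    return int(value // width)


def draw_like(events, column, width):
    """One full pass over the data per histogram, like one Draw call."""
    hist = Counter()
    reads = 0
    for ev in events:
        reads += 1
        hist[bin_of(ev[column], width)] += 1
    return hist, reads


def single_loop(events, requests):
    """One pass over the data that fills every requested histogram."""
    hists = [Counter() for _ in requests]
    reads = 0
    for ev in events:
        reads += 1
        for hist, (column, width) in zip(hists, requests):
            hist[bin_of(ev[column], width)] += 1
    return hists, reads


# Three histograms, given as (column, bin width) pairs.
requests = [("pt", 10), ("n_jets", 1), ("pt", 25)]

separate = [draw_like(events, col, width) for col, width in requests]
reads_separate = sum(reads for _, reads in separate)

together, reads_together = single_loop(events, requests)

same_results = [hist for hist, _ in separate] == together
print(reads_separate, reads_together, same_results)  # 30000 10000 True
```

The histograms are identical in both cases; only the number of event reads changes, and it grows with the number of histograms in the first approach. In a real analysis the cost per read (I/O, decompression) is what makes this matter.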

Cheers,
Enrico

Francesco_Cirotto · October 20, 2021, 8:54am

Hello @eguiraud, thanks for the reply.

So, as far as I understand, RDataFrame has lazy actions, which avoid running the event loop for every histogram that is filled. Optimized code should work as follows:

- define the selections
- create the histograms, for example with RDataFrame.Filter(SELECTION).Histo1D()

Am I right?

Cheers,
Francesco

eguiraud · October 20, 2021, 9:15am

Yes, as long as you make all the Histo1D calls before you access any of the results.
The event loop is started the first time any of the results is used, and it generates all the results booked so far. See the docs for more info.

Cheers,
Enrico

P.S.
in short, do (1 event loop):

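h1 = df.Histo1D(...)  # booked lazily, no event loop yet
h2 = df.Histo1D(...)  # booked lazily, no event loop yet
h1.Draw()             # first access: one event loop fills h1 and h2
h2.Draw("SAME")       # h2 is already filled, no new loop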

and not (2 event loops):

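h1 = df.Histo1D(...)  # booked lazily
h1.Draw()             # first event loop, fills only h1
h2 = df.Histo1D(...)  # booked after the first loop has already run
h2.Draw("SAME")       # second event loop, fills h2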

To debug this, you can use RDF’s verbose logging and/or ask RDF how many event loops it has run up to a given point of the program with df.GetNRuns().
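If the booking order is still unclear, the following toy model in plain Python may help. It is only a sketch of the lazy-booking idea, not how RDataFrame is implemented: `ToyFrame`, `ToyResult`, `histo`, `get` and `n_runs` are hypothetical names invented for this example (`n_runs` plays the role of the loop counter mentioned above).

```python
class ToyFrame:
    """Toy stand-in for a lazy data frame; not ROOT's implementation."""

    def __init__(self, rows):
        self.rows = rows
        self.booked = []  # results waiting for the next event loop
        self.n_runs = 0   # number of event loops run so far

    def histo(self, column):
        """Book a histogram lazily; no data is read here."""
        result = ToyResult(self, column)
        self.booked.append(result)
        return result

    def run(self):
        """Run one event loop that fills every result booked so far."""
        self.n_runs += 1
        pending, self.booked = self.booked, []
        for result in pending:
            result.counts = {}
        for row in self.rows:
            for result in pending:
                value = row[result.column]
                result.counts[value] = result.counts.get(value, 0) + 1


class ToyResult:
    """Placeholder for a result that is computed on first access."""

    def __init__(self, frame, column):
        self.frame = frame
        self.column = column
        self.counts = None

    def get(self):
        if self.counts is None:  # first access triggers the event loop
            self.frame.run()
        return self.counts


rows = [{"x": i % 3, "y": i % 2} for i in range(12)]

# Book everything first, then access the results: one event loop.
df = ToyFrame(rows)
h1 = df.histo("x")
h2 = df.histo("y")
h1.get()
h2.get()
runs_booked_first = df.n_runs

# Interleave booking and access: two event loops.
df = ToyFrame(rows)
h1 = df.histo("x")
h1.get()
h2 = df.histo("y")
h2.get()
runs_interleaved = df.n_runs

print(runs_booked_first, runs_interleaved)  # 1 2
```

The results are the same either way; the interleaved version simply reads the data twice.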

Francesco_Cirotto · October 21, 2021, 8:00am

Thanks @eguiraud for the advice!
A final question: is it OK to use TStopwatch to compare the running time of the RDataFrame and TTree approaches? Or does ROOT offer more sophisticated tools?

Cheers,
Francesco

eguiraud · October 21, 2021, 8:02am

Yes, that’s OK.
RDataFrame also has its own logging in case it’s useful; see "Performance profiling of RDataFrame applications" in the ROOT::RDataFrame class reference.
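One thing to keep in mind when comparing: TStopwatch reports both real (wall-clock) time and CPU time, and with multi-threading the CPU time adds up over all threads, so wall-clock time is usually the number to compare. If the plotter ends up in Python, the standard library offers the same two measurements. Below is a minimal sketch; `timed`, `approach_a` and `approach_b` are hypothetical placeholders, and the two approaches are dummy workloads rather than real plotting code.

```python
import time


def timed(func, *args):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()  # wall-clock time
    cpu_start = time.process_time()   # CPU time of the whole process
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def approach_a(n):
    """Hypothetical stand-in for the TTree::Draw-based plotter."""
    return sum(i * i for i in range(n))


def approach_b(n):
    """Hypothetical stand-in for the RDataFrame-based plotter."""
    return sum(i * i for i in range(n))


for name, func in [("a", approach_a), ("b", approach_b)]:
    result, wall, cpu = timed(func, 200_000)
    print(f"{name}: wall {wall:.4f} s, cpu {cpu:.4f} s")
```

For a fair comparison, run each approach on the same input files and repeat the measurement a few times, since the first read of a file is often slower than later ones due to caching.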

Cheers,
Enrico

RENATO_QUAGLIANI · October 22, 2021, 11:01pm

In my experience with tuples of O(1M) entries, making O(40) histograms with RDataFrame was O(20-30) times faster than making them with plain ROOT, as well as easier to debug and quite elegant from a code-style point of view.
The real bottleneck I ran into was due to having too much jitted code, but I think this is something that has been fixed (RunGraphs, if I am not wrong) or is planned to be at some point. In any case, this will not answer your question @Francesco_Cirotto, but RDF sped up my analysis and helped me and the analysis team I work with do in a matter of minutes what used to take days.

Francesco_Cirotto · October 29, 2021, 3:37pm

Ciao @RENATO_QUAGLIANI, thanks for the reply! Indeed, I agree with you. I really appreciate RDataFrame, which is why I’m interested in “converting” old code to this new tool.
Just out of curiosity, do you use RDF in a C++ or a Python environment?

RENATO_QUAGLIANI · October 29, 2021, 6:13pm

Mainly C++, but I also compile all my code into libraries, which I then load directly in Python within my analysis framework if I need some C++ functionality in Python.

For example, with some LinkDef and libraries for the code I have, in C++ I do

node = HelperProcessing::AttachWeights(node);

and in Python

node = r.HelperProcessing.AttachWeights(node)

where HelperProcessing is just a class with public methods that internally use RDF C++ functors, etc.