I recently came across RDataFrame, and it looks like a very powerful tool.
My question is about performance. Let's assume I would like to write a plotter that handles multiple ROOT files, creates histograms, and then manipulates them. My current plotter is a C++ framework that basically calls the tree->Draw() method and then manipulates the histograms to arrange the plots nicely. The running time depends on the size of the ROOT files.
Would a similar approach with RDataFrame be faster or slower?
If RDataFrame is faster, do you think a Python version of this plotter could still perform well?
Hi @Francesco_Cirotto, TTree::Draw runs a full loop over the data for every invocation. Correctly written RDataFrame code runs a single event loop that fills many histograms. Coupled with multi-thread parallelization on large datasets, RDataFrame will typically be faster (if you fill more than a couple of histograms and your datasets are not too small).
So, as far as I understand, RDataFrame has lazy actions, which avoid running the event loop for every histogram that is filled. Optimized code should work as follows:
1. Define the selections.
2. Create the histograms, for example with RDataFrame.Filter(SELECTION).Histo1D().
Yes, as long as you make all the Histo1D calls before you actually access any of the results.
The event loop is started the first time any of the results is used, and it generates all the results booked so far. See the docs for more info.
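To make the "book everything first, then access results" pattern concrete, here is a minimal PyROOT sketch. It assumes a ROOT installation with PyROOT enabled, and it uses a toy in-memory dataset instead of real files; the function name, the column `x`, the cut, and the histogram names are all made up for illustration.

```python
def book_then_run(n_entries=1000):
    """Book two histograms on a toy dataset and report when the event loop runs."""
    import ROOT  # requires a ROOT build with PyROOT enabled

    # Toy dataset (an assumption for this sketch): x cycles through 0..99.
    df = ROOT.RDataFrame(n_entries).Define("x", "double(rdfentry_ % 100)")

    # Step 1: define the selections. This is lazy, nothing is processed yet.
    selected = df.Filter("x < 50")

    # Step 2: book ALL histograms before touching any result.
    h_all = df.Histo1D(("h_all", "all entries;x;entries", 100, 0.0, 100.0), "x")
    h_sel = selected.Histo1D(("h_sel", "x < 50;x;entries", 100, 0.0, 100.0), "x")
    runs_before = df.GetNRuns()  # no event loop has run so far

    # The first access to a result starts one event loop that fills both
    # histograms; the second access reuses the already computed result.
    n_all = int(h_all.GetEntries())
    n_selected = int(h_sel.GetEntries())
    runs_after = df.GetNRuns()

    return {
        "runs_before": runs_before,
        "runs_after": runs_after,
        "n_all": n_all,
        "n_selected": n_selected,
    }
```

Calling `book_then_run()` should report zero event loops before the first result is accessed and a single one afterwards. For real data you would construct the dataframe from a tree name and file name(s) instead, and calling `ROOT.EnableImplicitMT()` before constructing the dataframe enables the multi-threaded event loop.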
Thanks @eguiraud for the advice!
A final question: is it OK to use TStopwatch to compare the running time of the RDataFrame and TTree approaches? Or does ROOT have more sophisticated tools?
Yes, that's OK.
RDataFrame also has its own logging in case it's useful, see "Performance profiling of RDataFrame applications" at ROOT: ROOT::RDataFrame Class Reference.
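A rough sketch of the TStopwatch approach in PyROOT is below. It assumes ROOT with PyROOT is available and again uses a toy dataset; the function name and the column `x` are illustrative only. Since the histogram is booked lazily, the stopwatch is wrapped around the first access to the result, which is where the event loop actually runs.

```python
def time_event_loop(n_entries=100_000):
    """Time a single RDataFrame event loop with TStopwatch on a toy dataset."""
    import ROOT  # requires a ROOT build with PyROOT enabled

    # Toy dataset (an assumption for this sketch).
    df = ROOT.RDataFrame(n_entries).Define("x", "double(rdfentry_ % 100)")
    hist = df.Histo1D(("h", "x;x;entries", 100, 0.0, 100.0), "x")  # booked only

    stopwatch = ROOT.TStopwatch()
    stopwatch.Start()
    hist.GetValue()  # triggers the event loop (and any just-in-time compilation)
    stopwatch.Stop()

    return {
        "real_seconds": stopwatch.RealTime(),
        "cpu_seconds": stopwatch.CpuTime(),
        "entries": int(hist.GetEntries()),
    }
```

Note that the measured time here includes the just-in-time compilation of the string expressions, so for a fair comparison with TTree::Draw it is worth timing on a dataset large enough that this overhead does not dominate.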
In my experience, with tuples of O(1M) entries, making O(40) histograms with RDataFrame was 20-30 times faster than making them with plain ROOT, as well as easier to debug and quite elegant from a code-style point of view.
The real bottleneck I reached was due to having too much jitted code, but I think this is something which is fixed (RunGraphs, if I am not wrong) or planned to be at some point. In any case, this will not answer your question, @Francesco_Cirotto, but RDF sped up my analysis and helped me and the analysis team I work with to do in a matter of minutes what used to take days.
Ciao @RENATO_QUAGLIANI, thanks for the reply! Indeed, I agree with you. I really appreciate RDataFrame, which is why I'm really interested in "converting" old code to this new tool.
Just out of curiosity, do you use RDF in a C++ or a Python environment?
Mainly C++, but I also compile all my code into libraries, which I then load directly in Python within my analysis framework if I need some C++ functionality in Python.
For example, with a LinkDef and libraries for the code I have in C++, I write this in C++:
node = HelperProcessing::AttachWeights(node);
and this in Python:
node = r.HelperProcessing.AttachWeights(node)
where HelperProcessing is just a class with public methods that internally use RDF C++ functors, etc.
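For anyone who wants to try this pattern without setting up a LinkDef and a compiled library first, here is a small sketch that declares a C++ helper inline through the interpreter and calls it from Python. This is a swapped-in technique, not the setup described above: `HelperProcessingSketch`, its constant `weight` column, and the function name `attach_weights_demo` are hypothetical stand-ins. It assumes ROOT with PyROOT is available.

```python
def attach_weights_demo(n_entries=10):
    """Call a C++ helper that takes and returns an RDataFrame node from Python."""
    import ROOT  # requires a ROOT build with PyROOT enabled

    # Hypothetical stand-in for a compiled helper class. It is declared inline
    # here; the include guard makes repeated calls of this function safe.
    ROOT.gInterpreter.Declare("""
    #ifndef HELPER_PROCESSING_SKETCH
    #define HELPER_PROCESSING_SKETCH
    #include "ROOT/RDataFrame.hxx"
    struct HelperProcessingSketch {
        // Takes any RDataFrame node and returns it with a 'weight' column added.
        static ROOT::RDF::RNode AttachWeights(ROOT::RDF::RNode node) {
            return node.Define("weight", [] { return 2.0; });
        }
    };
    #endif
    """)

    df = ROOT.RDataFrame(n_entries)
    node = ROOT.RDF.AsRNode(df)  # convert to the generic node type
    node = ROOT.HelperProcessingSketch.AttachWeights(node)

    # Every entry gets weight 2.0, so the sum is 2.0 * n_entries.
    return node.Sum("weight").GetValue()
```

The C++ side uses a compiled lambda rather than a jitted string expression, which is one way to keep the amount of jitted code down while still driving the analysis from Python.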