I wanted to encourage someone to try RDataFrame but I found it dies like crazy on simple trees (ROOT 6.28/02).
These trees contain branches that are ordinary variables, simple arrays (fixed and variable size), and “split” and “unsplit” objects of some user’s classes.
Using TFile::MakeProject, I generate and load dictionaries for all “nonstandard” classes.
I then try to dump the available column names and types … that’s where RDataFrame breaks.
what(): TTree leaf V.V has both a leaf count and a static length. This is not supported.
*Br 1 :N : N/I
*Br 2 :V : V[N]/F
TTree::MakeSelector returns (for all such branches):
Warning in <TTreeReaderGenerator::AddReader>: Ignored branch SOME_BRANCH because type is unsupported.
Another tree (for the “V” column, RDataFrame returns “TClonesArray” as the column type):
what(): TTree leaf V.X has both a leaf count and a static length. This is not supported.
This is a limitation of TTreeReader: it does not support 2D arrays. As RDataFrame uses TTreeReader under the hood for I/O, it inherits this limitation and does not support multi-dimensional arrays either.
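To make the wording of that error message concrete: in a leaf descriptor, a bracket that names another leaf (as in `V[N]/F`) is a leaf count, i.e. a variable size, while a bracket holding an integer (as in `M[N][3]/F`) is a static length. The sketch below is a standalone illustration of that distinction and nothing more. It is not ROOT's code, it only inspects the descriptor string, and `LeafDims`, `classifyLeaf`, `mixesCountAndStatic` and the descriptor `M[N][3]/F` are hypothetical names made up for this example.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of ROOT: summarises the dimensions found in a
// leaf descriptor string such as "V[N]/F" or "M[N][3]/F".
struct LeafDims {
    bool hasLeafCount = false; // some bracket names another leaf, e.g. [N]
    int staticLength = 1;      // product of all integer brackets, e.g. [3][4] -> 12
};

LeafDims classifyLeaf(const std::string& descriptor) {
    LeafDims dims;
    std::size_t pos = 0;
    while ((pos = descriptor.find('[', pos)) != std::string::npos) {
        const std::size_t close = descriptor.find(']', pos);
        if (close == std::string::npos) break; // malformed descriptor: stop scanning
        const std::string inside = descriptor.substr(pos + 1, close - pos - 1);
        bool allDigits = !inside.empty();
        for (unsigned char c : inside) {
            if (!std::isdigit(c)) { allDigits = false; break; }
        }
        if (allDigits) {
            dims.staticLength *= std::stoi(inside); // fixed-size dimension
        } else if (!inside.empty()) {
            dims.hasLeafCount = true;               // variable-size dimension
        }
        pos = close + 1;
    }
    return dims;
}

// True when a descriptor combines a variable-size and a fixed-size dimension.
bool mixesCountAndStatic(const std::string& descriptor) {
    const LeafDims d = classifyLeaf(descriptor);
    return d.hasLeafCount && d.staticLength > 1;
}
```

Read this way, `mixesCountAndStatic("V[N]/F")` is false, since the descriptor carries only a leaf count, whereas a descriptor like `M[N][3]/F` carries both a leaf count and a static length of 3.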
That’s quite a misrepresentation of reality but hey, I guess for the multi-dim case, until we (I?) finally fix this, people have to resort to using TTree directly.
How so? We see accelerations of more than an order of magnitude in real-life analyses, and physicists are very happy that they don’t need to debug those void*& arguments of SetBranchAddress() anymore. Scaling across clusters is excellent, too. All of that assumes we’re talking about more than a minute of work.
Each time I tried the highly complicated RDataFrame, I could get nothing because I could not read some important data from the existing trees.
So, its performance is exactly 0 (zero) for me. Even if I multiply this performance by any number of available cores (multi-threaded) and by any number of available machines (distributed modes), I still get a very round 0 (zero).
If I need “multi-threading”, … the “=legacy” Selector plus “PROOF Lite” (I can’t find the corresponding web page now) is a solution.
And for the “distributed modes”, I need to go with any available ordinary “batch cluster” system (plus “Ganga” if desired).