I wanted to encourage someone to try RDataFrame but I found it dies like crazy on simple trees (ROOT 6.28/02).
These trees contain branches that are ordinary variables, simple arrays (fixed and variable size), and “split” and “unsplit” objects of some user’s classes.
Using TFile::MakeProject, I generate and load dictionaries for all “nonstandard” classes.
I then try to dump the available column names and types … that’s where RDataFrame breaks.
One tree:
what(): TTree leaf V.V has both a leaf count and a static length. This is not supported.
TTree::Print gives:
*Br 1 :N : N/I
*Br 2 :V : V[N][3]/F
TTree::MakeSelector returns (for all such branches):
Warning in <TTreeReaderGenerator::AddReader>: Ignored branch SOME_BRANCH because type is unsupported.
Another tree (for the “V” column, RDataFrame returns “TClonesArray” as the column type):
what(): TTree leaf V.X[4] has both a leaf count and a static length. This is not supported.
This is a limitation of TTreeReader: it does not support 2D arrays. As RDataFrame uses TTreeReader under the hood for I/O, it inherits this limitation, so RDataFrame does not support multi-dimensional arrays either.
Ah, yes. I’ve now even found my older thread (related to another set of trees):
So, the bad news is that RDataFrame is useless.
The good news is that it’s also useless in multi-threaded mode on a single machine and on Spark, Dask, and other distributed modes.
That’s quite a misrepresentation of reality but hey, I guess for the multi-dim case, until we (I?) finally fix this, people have to resort to using TTree directly.
How so? We see accelerations of more than an order of magnitude in real-life analyses, and physicists are very happy that they don’t need to debug those void*& arguments of SetBranchAddress() anymore. Scaling across clusters is excellent, too. All of that assumes we’re talking about more than a minute of work.
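For the multi-dimensional case, here is a rough sketch of what "using TTree directly" can look like for a branch declared as V[N][3]/F. It relies on the fact that such a leaf is read back as N consecutive triplets (row-major) in one flat float buffer. The helper names (EntryView, columnSums, kInner) and the upper bound kMaxN are made up for this illustration and are not part of ROOT; the ROOT calls are only shown in comments so that the indexing logic compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Static inner dimension taken from the leaf list "V[N][3]/F".
constexpr std::size_t kInner = 3;

// Hypothetical helper (not a ROOT class): a view of one tree entry.
struct EntryView {
    std::size_t n;      // value of the count branch N for this entry
    const float* flat;  // buffer V was read into: n * kInner floats, row-major

    // Element V[i][j] of the current entry.
    float at(std::size_t i, std::size_t j) const {
        assert(i < n && j < kInner);
        return flat[i * kInner + j];
    }
};

// Example use: sum each of the 3 components over the n rows of one entry.
std::vector<float> columnSums(const EntryView& e) {
    std::vector<float> sums(kInner, 0.0f);
    for (std::size_t i = 0; i < e.n; ++i)
        for (std::size_t j = 0; j < kInner; ++j)
            sums[j] += e.at(i, j);
    return sums;
}

// With ROOT available, the buffers would be attached roughly like this
// (kMaxN is an assumed upper bound on N, chosen by the user):
//
//   int n = 0;
//   std::vector<float> flat(kMaxN * kInner);
//   tree->SetBranchAddress("N", &n);
//   tree->SetBranchAddress("V", flat.data());
//   for (Long64_t entry = 0; entry < tree->GetEntries(); ++entry) {
//       tree->GetEntry(entry);
//       EntryView e{static_cast<std::size_t>(n), flat.data()};
//       auto sums = columnSums(e);
//   }
```

The buffer has to be large enough for the largest N in the tree, otherwise GetEntry writes past its end; that is the usual price of the SetBranchAddress() route.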
Well, the “TTreeReader” was introduced something like 10 years ago. It still does NOT support simple ordinary columns (“branches”) holding fundamental types and arrays thereof (don’t tell me to find when the old ROOT 5 started to support them, probably something like 30 years ago).
Each time I tried the highly complicated RDataFrame, I could get nothing because I could not read some important data from the existing trees.
So, its performance is exactly 0 (zero) for me. Even if I multiply this performance by any number of available cores (multi-threaded) and by any number of available machines (distributed modes), I still get a very round 0 (zero).
If I need “multi-threading”, … the “=legacy” selector (as generated by TTree::MakeSelector) plus “PROOF Lite” (I can’t find the corresponding web page now) is a solution.
And for the “distributed modes”, I need to go with any available ordinary “batch cluster” system (plus “Ganga” if desired).