Alright, I think there are two main points here.
Why doesn’t RDataFrame have as nice a syntax as TTree::Draw?
It is true that for simple, common histogram-filling tasks TTree::Draw offers a nice domain-specific language (DSL), and if all you have to do is fill one or two histograms, TTree::Draw code will be shorter and probably look nicer. However, that same DSL-based programming model hits some important limitations when analyses get a bit more complex:
- flexibility: TTree::Draw is, by design, geared towards plotting. Nothing wrong with that, but that's not all that analysts need to do. Extracting data, exploring values, performing custom operations and aggregations of events, or writing out new ROOT files is either hard or impossible with TTree::Draw, while RDataFrame offers facilities for each of these tasks and is easily extensible to arbitrary operations performed during the event loop. Skimming a TTree while, at the same time, adding a couple more branches, producing a control plot and writing out the processed data takes a handful of lines with RDataFrame
- performance: TTree::Draw runs one single-threaded event loop per histogram, which does not scale. RDataFrame produces all results, histograms and otherwise, in one multi-threaded event loop. TTree::Draw is perfectly ok for quick data exploration, but breaks down when the analysis becomes more complex. RDataFrame has a larger starting offset in terms of complexity but easily scales from the quick-exploration use case up to large and complex analyses (e.g. here and here)
- ease of debugging: TTree::Draw expressions are very hard to debug (e.g. you can't inspect what TTree::Draw is doing with a debugger, and you can't even easily insert print-outs). In contrast, RDataFrame Defines and Filters can be standard C++ functions: it's trivial to insert print-outs or to step through them with a debugger.
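As a concrete illustration of the last point, here is a minimal sketch of a debuggable selection function (the function name hasLargeValue and the column name "vals" are made up for this example):

```cpp
#include <iostream>
#include <vector>

// An ordinary C++ function used as an RDataFrame Filter:
// you can set a breakpoint in it or sprinkle print-outs at will.
bool hasLargeValue(const std::vector<double> &vals) {
   for (double v : vals) {
      if (v > 100.) {
         std::cerr << "accepting event, value = " << v << '\n'; // debug print-out
         return true;
      }
   }
   return false;
}

// In an RDataFrame analysis you would then plug it in as:
//   auto selected = df.Filter(hasLargeValue, {"vals"});
```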
So TTree::Draw works great for simple use cases, while RDataFrame is a bit more complex upfront but is a viable programming model all the way from quick data exploration up to large analyses with hundreds of histograms and branches involved. Typically you can run the same RDataFrame code on 1 CPU core, 8 or 64, with no changes. It scales better with the complexity of the analysis, both in terms of performance and code complexity.
Why do certain things work when I use strings as arguments to Define and Filter, and others only work when I use C++ functions?
The section called "Branch type guessing and explicit declaration of branch types" in the RDF user guide is supposed to explain what's going on under the hood, but I'm sure it can be improved (any concrete suggestion is very much welcome).
I think the confusion about the “3 argument form” and “2 argument form” will be reduced by clearly explaining what each form actually does.
A. passing C++ functions to Define/Filter:
auto getX = [](const vector<TVector3> &v) {
   RVec<double> out;
   for (const auto &p : v) out.push_back(p.X());
   return out;
};
auto h2 = df.Define("X", getX, {"vec_list"});
When using this form, for performance reasons, RDF infers the types of the columns involved from the signature of the C++ function getX. RDF can tell that "vec_list" is a vector<TVector3> from the signature of getX, and it generates the appropriate compiled code.
This is why using a template function does not work (you should see a compilation error, not a crash, but we should definitely make that compilation error more human-friendly). In this example:
template <typename T> auto getsize(const T &vec) { return vec.size(); }
auto dfplus = df.Define("num_list_size", getsize, {"num_list"}); // compilation error!
RDF cannot look at the signature of a template function, so at compile time RDF cannot know what the type of "num_list" is. Using getsize<vector<double>> instead of getsize will work, if that's the type of "num_list":
auto dfplus = df.Define("num_list_size", getsize<vector<double>>, {"num_list"});
B. passing strings to Define/Filter:
When using this form, RDataFrame will check at runtime what the types of the columns involved are and just-in-time compile the appropriate code.
auto dfplus3 = df.Define("num_list_size","getsize(num_list)");
will just-in-time compile code that is equivalent to:
auto thecallable = [](vector<double> &num_list) { return getsize(num_list); };
auto dfplus3 = df.Define("num_list_size", thecallable, {"num_list"});
Just-in-time compilation produces less performant code: it inserts virtual calls into the event loop and prevents certain compiler optimizations such as inlining. But it is often quicker to write; users should use whichever form works best for their use case.
However, for ROOT to be able to just-in-time compile a call to some function such as getsize, the function definition must be available to cling, ROOT's C++ interpreter. For example, this will not work unless cling knows about getsize_f:
auto dfplus = df.Define("num_list_size","getsize_f(num_list)");
You can use gInterpreter->Declare("#include \"somefile.h\"") to let cling know about a function definition, or directly copy-paste the definition into the Declare call, or run the whole program as a ROOT macro so that everything goes through the ROOT interpreter.
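As a small sketch of that workflow (using a hypothetical getsize_f helper; the gInterpreter calls require a ROOT session and are shown in comments):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: this definition is what cling needs to see
// before "getsize_f(num_list)" can be just-in-time compiled.
inline std::size_t getsize_f(const std::vector<double> &v) { return v.size(); }

// In a ROOT session you would make it known to the interpreter with:
//   gInterpreter->Declare(
//       "inline std::size_t getsize_f(const std::vector<double> &v)"
//       "{ return v.size(); }");
// or, if it lives in a header:
//   gInterpreter->Declare("#include \"somefile.h\"");
// and then the jitted form works:
//   auto dfplus = df.Define("num_list_size", "getsize_f(num_list)");
```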
I hope this clarifies why certain things work and some others do not, and how to make each of your examples work.
Cheers,
Enrico
P.S.
Selecting some events, defining a new branch/column, filling a control plot and writing out the skimmed dataset (including the new branch/column), in a multi-thread event loop, in 5 lines of code:
ROOT::EnableImplicitMT(); // enable multi-threading
ROOT::RDataFrame df("treename", "some/files*.root");
auto df2 = df.Filter("some_vec.size() > 0").Define("other_vec", "sqrt(vec1*vec1 + vec2*vec2)");
auto control_h = df2.Histo1D("other_vec");
// write out new dataset. this triggers the event loop and also fills the booked control plot
df2.Snapshot("newtree", "newfile.root", {"some_vec", "other_vec"});