RDataFrame and defining new columns

maurik · July 8, 2020, 6:43pm

I am trying to learn how to do some simple things with RDataFrame, and keep getting stumped. The basic issue is illustrated by a simple TTree that contains a vector object. Example for producing this is attached.
The “old” way to make plots would be relatively simple:

tree->Draw("vec_list.X()")
tree->Draw("vec_list.Mag()")

Relatively simple. Doing the same with an RDataFrame, not so much?
I was hoping there would be a simpler way than:

auto getX = [](vector<TVector3> &v){RVec<double> out; for( auto p: v) out.push_back(p.X()); return out;};
auto h2=df.Define("X",getX,{"vec_list"}).Histo1D("X");
h2->Draw();

Even this was rather difficult to deduce from the available documentation, so I seem to be missing something.

Also, adding a new column does not seem to quite follow the documentation. The following code does not work as expected:

// Templated (inline) standard function.
template<typename T> inline auto getsize( T& vec){ return vec.size();};
// The 3 argument form now crashes!
// auto dfplus = df.Define("num_list_size",getsize,{"num_list"});
// The 2 argument form now works!
auto dfplus3 = df.Define("num_list_size","getsize(num_list)");

Compared to the Lambda function, or non templated function:

// Lambda function for the size of a vector<double>; !! Templated Lambda's not until c++20 !!
auto getsize_l = [](vector<double> &vec){ return vec.size(); };
// Using a Lambda function, the define needs special 3 args form:
auto dfplus1 = df.Define("num_list_size",getsize_l,{"num_list"});
// The 2 argument form does not work:
// auto dfplus = df.Define("num_list_size","getsize_l(num_list)");

// non-templated standard function. It does not matter if it is inline or not.
inline auto getsize_f(vector<double> &vec){ return vec.size(); };
auto dfplus2 = df.Define("num_list_size",getsize_f,{"num_list"});
// Again the 2 argument form does not work:
// auto dfplus = df.Define("num_list_size","getsize_f(num_list)");

Sample code for generating the TTree, along with these snippets, is attached.

tree_test.cxx (3.1 KB) LinkDef.h (199 Bytes) snippets.C (2.5 KB)

ROOT Version: 6.20/07
Platform: MacOS
Compiler: clang 11.0.3

bellenot · July 8, 2020, 6:53pm

I’m sure @eguiraud can give you some hints

eguiraud · July 9, 2020, 8:48am

Hi @maurik,
thank you for the structured feedback, this is very useful. I am off today and tomorrow, I’ll give a proper reply on Monday!

Cheers,
Enrico

eguiraud · July 13, 2020, 9:55am

Alright, I think there are two main points here.

Why doesn’t RDataFrame have as nice a syntax as TTree::Draw?

It is true that for simple, common histogram-filling tasks TTree::Draw offers a nice domain-specific-language (DSL), and if all you have to do is to fill one or two histograms, TTree::Draw code will be shorter and probably look nicer. However, that same programming model based on a DSL hits some important limitations when analyses get a bit more complex:

flexibility: TTree::Draw is, by design, geared towards plotting. Nothing wrong with that, but that’s not all that analysts need to do. Extracting data, exploring values, performing custom operations and aggregations of events, and writing out new ROOT files are difficult or impossible with TTree::Draw, while RDataFrame offers facilities for each of these tasks and is easily extensible to arbitrary operations performed during the event loop. Skimming a TTree while, at the same time, adding a couple more branches, producing a control plot and writing out the processed data takes a handful of lines with RDataFrame.
performance: TTree::Draw runs one single-thread event loop per histogram. That does not scale. RDataFrame produces all results, histograms and otherwise, in one multi-thread event loop. TTree::Draw is perfectly ok for quick data exploration, but then breaks down when the analysis becomes more complex. RDataFrame has a larger starting offset in terms of complexity but easily scales from the quick exploration use case up to large and complex analyses (e.g. here and here).
ease of debugging: TTree::Draw expressions are very hard to debug (e.g. you can’t inspect what TTree::Draw is doing with a debugger, you can’t even easily insert print-outs). In contrast, RDataFrame Defines and Filters can be standard C++ functions. It’s trivial to insert print-outs or to step through them with a debugger.

So TTree::Draw works great for simple use-cases, while RDataFrame is a bit more complex upfront but it is a viable programming model from quick data exploration up to large analyses with hundreds of histograms and branches involved. Typically you can run RDataFrame code on 1 CPU core, 8 or 64, with no change. It scales better, both in terms of performance and code complexity, with the complexity of the analysis.

Why do certain things work when I use strings as arguments to Define and Filter, and others only work when I use C++ functions?

The section called “Branch type guessing and explicit declaration of branch types” in the RDF user guide is supposed to explain what’s going on under the hood, but I’m sure it can be improved (any concrete suggestion is very much welcome).

I think the confusion about the “3 argument form” and “2 argument form” will be reduced by clearly explaining what each form actually does.

A. passing C++ functions to Define/Filter:

auto getX = [](vector<TVector3> &v){RVec<double> out; for( auto p: v) out.push_back(p.X()); return out;};
auto h2=df.Define("X",getX,{"vec_list"});

When using this form, for performance reasons, RDF infers the types of the columns involved from the signature of the C++ function getX. RDF can tell that "vec_list" is a vector<TVector3> from the signature of getX, and generate the appropriate compiled code.

This is why using a template function does not work (you should see a compilation error, not a crash, but we should definitely make that compilation error more human-friendly). In this example:

template<typename T> inline auto getsize( T& vec){ return vec.size();};
auto dfplus = df.Define("num_list_size",getsize,{"num_list"});

RDF cannot look at the signature of a template function, and at compile time RDF cannot know what the type of "num_list" is. Using getsize<vector<double>> instead of getsize will work, if that’s the type of “num_list”:

auto dfplus = df.Define("num_list_size",getsize<vector<double>>,{"num_list"});
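
To make the point about signatures concrete, here is a minimal ROOT-free sketch in plain C++. The `Signature` struct and the `inspect` helper are hypothetical stand-ins for the kind of signature inspection a library can do; they are not RDataFrame internals. The return type of getsize is spelled out (instead of `auto`) only to keep the sketch simple.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Same shape as the getsize template from the thread (return type spelled out).
template <typename T>
typename T::size_type getsize(T &vec) { return vec.size(); }

// Hypothetical helper, NOT an RDataFrame internal: records the argument type
// of a one-argument function so that it can be read back at compile time.
template <typename Ret, typename Arg>
struct Signature {
    using arg_type = typename std::decay<Arg>::type;
};

// Deduces Ret and Arg from a pointer to a concrete function.
template <typename Ret, typename Arg>
Signature<Ret, Arg> inspect(Ret (*)(Arg)) { return {}; }

// inspect(getsize);  // would not compile: a bare template name has no single signature

// Naming one instantiation gives the compiler a concrete signature to read.
using InferredColumnType = decltype(inspect(getsize<std::vector<double>>))::arg_type;
static_assert(std::is_same<InferredColumnType, std::vector<double>>::value,
              "argument type is recovered from the instantiated signature");

// The instantiated function can be stored and called like any other function.
inline std::size_t call_instantiated(std::vector<double> &v) {
    auto fn = &getsize<std::vector<double>>;
    return fn(v);
}

// Small self-check on hand-written data.
inline void check_call_instantiated() {
    std::vector<double> v{1.0, 2.0, 3.0};
    assert(call_instantiated(v) == 3);
}
```

The commented-out `inspect(getsize)` line is the analogue of passing the bare template to Define: there is no single argument type to deduce until one instantiation is named.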

B. passing strings to Define/Filter:

When using this form, RDataFrame will check what the types of the columns involved are at runtime and just-in-time-compile the appropriate code.

auto dfplus3 = df.Define("num_list_size","getsize(num_list)");

will just-in-time compile code that is equivalent to:

auto thecallable = [](vector<double> &num_list) { return getsize(num_list); };
auto dfplus3 = df.Define("num_list_size", thecallable, {"num_list"});

Just-in-time compilation produces less performant code: it will insert virtual calls during the event loop and prevent certain compiler optimizations such as inlining. But it is often quicker to write; users should use whichever form they think works best for their use case.

However, for ROOT to be able to just-in-time-compile a call to some function such as getsize, the function definition must be available to cling, ROOT’s C++ interpreter. This will not work unless cling knows about getsize_f:

auto dfplus = df.Define("num_list_size","getsize_f(num_list)");

You can use gInterpreter->Declare("#include \"somefile.h\"") to let cling know about a function definition, or directly copy-paste the definition in the Declare call, or run the whole program as a macro so everything is piped through the ROOT interpreter.

I hope this clarifies why certain things work and some others do not, and how to make each of your examples work.

Cheers,
Enrico

P.S.

Selecting some events, defining a new branch/column, filling a control plot and writing out the skimmed dataset (including the new branch/column), in a multi-thread event loop, in 5 lines of code:

ROOT::EnableImplicitMT(); // enable multi-threading
ROOT::RDataFrame df("treename", "some/files*.root");
auto df2 = df.Filter("some_vec.size() > 0").Define("other_vec", "sqrt(vec1*vec1 + vec2*vec2)");
auto control_h = df2.Histo1D("other_vec");
// write out new dataset. this triggers the event loop and also fills the booked control plot
df2.Snapshot("newtree", "newfile.root", {"x","y","other_vec"});

maurik · July 13, 2020, 7:20pm

Hello Enrico,

Thank you for your clear and comprehensive answer. (Thank you for the references too.) A lot of this makes sense, and indeed the first three points are why I am looking seriously into RDataFrame. I am fairly enthusiastic about learning how to make all this work since it seems there is a lot to be gained.

It seems that my difficulties with the RDataFrame system come from a combination of less than excellent C++ skills and not quite complete (or sufficiently pedantic) documentation. I find it more difficult and time consuming (with lots of trial and error) to solve relatively simple problems than I have found for similar tasks using “old style” ROOT.

Since you ask, here are some ways in which I think the documentation can be improved:

1. The function Histo1D is easy to use, and all tutorial examples but one use Histo1D exclusively. In the RDataFrame documentation under “Defining custom columns”, the graphic seems to suggest you can use df.Histo2D("x","y"). Trying this did not work. It seems one needs the longer form: df.Histo2D({"xy","y versus x",10,-5,5,10,-5,5},"x","y"). That little detail was confusing.
2. In your documentation, or in the tutorial examples, it would be quite useful to have some examples where the TTree contains either vector<my_class> or TClonesArray with my_class. This would be especially useful if my_class is not trivial, and example code that produces the TTree is also provided. The example can then show how to filter on, say, chi2 that is part of this class. More complicated classes, where my_class contains a vector of my_tracks, would be even better. There are probably experiments with lots of legacy ROOT files written in such a manner that would benefit from such a detailed example. ^_^.
3. Some relatively simple sounding constructs are giving me significant difficulties. For instance, filter on a function that takes one column and a constant as its arguments. How do I pass this function a constant? The only method I found to work was creating a column with that constant, and then passing this column:
EG: ------
bool larger_than_i(std::vector<double> &vec,int lt){ return std::all_of(vec.begin(),vec.end(), [&lt](double i){return i> lt;}); }
df.Define("zero","0").Filter(larger_than_i,{"x","zero"}) // This works.
df.Filter(larger_than_i,{"x","0."}) // Error: libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Unknown column: 0. --> ROOT exits.
df.Filter(larger_than_i,{"x",0}) // Error: type ‘double’ cannot be narrowed to ‘std::__1::basic_string_view::size_type’

The other method that does work is to use a global variable ("double cut_on_this_value = 0.") and then write the function with only one argument and the global:
bool larger_than_i(std::vector<double> &vec){ double lt = cut_on_this_value; return std::all_of(vec.begin(),vec.end(), [&lt](double i){return i> lt;}); }
This works only if the global is copied to the local variable lt, but it irks my old school coding sensibility by introducing a global.

Thanks again!

Best,
Maurik

eguiraud · July 14, 2020, 8:20am

Thank you for the precious feedback!

1. is fixed by https://github.com/root-project/root/pull/6032
2. sounds like https://root.cern.ch/doc/master/df002__dataModel_8C.html
3. needs lambda captures (and optionally RVecs):

int i = 1;
auto larger_than_i = [i](const RVec<int> &vec) { return All(vec > i); };
df.Filter(larger_than_i, {"x"});
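
The same capture idea also works without RVec. Below is a minimal ROOT-free sketch, assuming the column holds a std::vector<double>; make_all_larger_than is a hypothetical helper name, not part of ROOT. The callable it returns takes only the vector as its argument, so it has the same one-argument shape as larger_than_i above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of ROOT): builds a one-argument predicate
// with the threshold captured by value, so neither a global variable nor an
// extra constant column is needed.
inline auto make_all_larger_than(double threshold) {
    return [threshold](const std::vector<double> &vec) {
        return std::all_of(vec.begin(), vec.end(),
                           [threshold](double x) { return x > threshold; });
    };
}

// Small self-check of the predicate on hand-written data.
inline void check_all_larger_than() {
    const auto all_positive = make_all_larger_than(0.0);
    assert(all_positive(std::vector<double>{1.0, 2.5}));
    assert(!all_positive(std::vector<double>{1.0, -2.5}));
}
```

Each call to make_all_larger_than yields an independent predicate, so several different cuts can coexist without any shared state.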

Cheers,
Enrico
