TTree::MakeClass analog for RDataFrame

Now that RDataFrame is the recommended way of doing analyses, a MakeClass analog would be useful. I would suggest creating a header file containing:

  • An std::vector<std::string> or initializer_list<std::string> containing column names
  • A preprocessor macro with the argument list with correct types, in the correct order
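A rough sketch of what such a generated header could contain (every name below is hypothetical, just to illustrate the shape of the idea):

// mytree_columns.h -- hypothetical output of a MakeClass-style generator
#pragma once
#include <ROOT/RVec.hxx>
#include <string>
#include <vector>

// every column of the tree, in branch order
const std::vector<std::string> kMyTreeColumns = {"nJets", "jetPt", "met"};

// argument list with the correct types, in the same order,
// ready to paste into a function signature
#define MYTREE_ARGS int nJets, const ROOT::RVec<float> &jetPt, double met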

Hi beojan,
thank you for the feedback!
Could you make a few examples of use-cases in which this would be useful (and how this would look roughly)?

Cheers,
Enrico

P.S.
Note that RDataFrame::GetColumnNames() returns a vector<string> of valid column names.
A macro that lists all types would probably not be very useful as in each transformation/action you want to pass only the types and column names that are needed there. Just-in-time compiled versions of transformations and actions (e.g. Filter("x > 0") and df.Min("x")) are provided so that users don’t have to type out typenames when there is no particular performance concern.
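For reference, a minimal sketch of the two flavours side by side, assuming a tree mytree in myfile.root with a double column x:

ROOT::RDataFrame df("mytree", "myfile.root");

// just-in-time compiled: the strings are compiled at runtime, no types spelled out
auto jitMin = df.Filter("x > 0").Min("x");

// fully typed: the callable's signature tells RDataFrame the column type at compile time
auto typedMin = df.Filter([](double x) { return x > 0; }, {"x"}).Min<double>("x");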

The JIT-ted versions are pretty limited because you can’t call functions or refer to captured variables.

I think a macro with all types would be perfectly fine for testing purposes. For production, a preprocessor that slimmed the branch list down to only the branches actually used in the function / lambda would provide the performance benefit of not having to read all the branches (there’s no reason this couldn’t also work for functions defined separately from the Filter or Define call, though the preprocessor would be more complicated).

Hi Beojan,
I see, thank you!

I am hesitant to provide macros that list all column names and types, because that makes it easy (or easier) to write very, very slow code that looks elegant and just works.

If I understand correctly, the problem you have is that you don’t want to write long template parameter lists by hand and/or you don’t want to have to check what type a given column is every two seconds.

What about some utilities that make it easier to write that kind of (fast) code?
E.g. at the ROOT prompt one could type MakeFilter(filename, "x > 0") and the string Filter([](double x) { return x > 0; }, {"x"}) could be automatically generated for you.
Also something like TypesFor({"x","y","z"}) which prints vector<int>, int, double would be easy to do.
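For illustration, a minimal sketch of what a TypesFor-like helper could look like, built on GetColumnNames/GetColumnType (the helper itself is hypothetical, and GetColumnType is assumed to be available in your ROOT version):

#include <ROOT/RDataFrame.hxx>
#include <iostream>
#include <string>
#include <vector>

// hypothetical helper: print the C++ type of each requested column
void TypesFor(ROOT::RDataFrame &df, const std::vector<std::string> &cols)
{
   for (const auto &c : cols)
      std::cout << c << ": " << df.GetColumnType(c) << '\n';
}

// usage at the prompt: TypesFor(df, {"x", "y", "z"});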

Cheers,
Enrico

That would be useful. What I was suggesting was that this be done by a preprocessor instead. For instance, you would run something like DataFrameDescription(treename, filename) that produces a data file (e.g. treename.desc) describing every branch, which the user could perhaps edit to reflect what they want to call the branches. Then, write for instance

bool filter1( [[treename.desc]] ) {
    // code using branches
    return true;
}
// ...
df.Filter(filter1, "Filter 1");

And run a preprocessor as part of the compile process that would fill in the argument list and branch list in the correct order. The ordering is one of the pain points, because it makes the default argument list a lot less useful.
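For concreteness, the output the preprocessor would generate for the snippet above might look something like this (types and column names invented for illustration):

// generated: argument list and branch list filled in from treename.desc
bool filter1(int nJets, const ROOT::RVec<float> &jetPt) {
    // code using branches
    return true;
}
// ...
df.Filter(filter1, {"nJets", "jetPt"}, "Filter 1");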

Yes, my doubt is that such a system would make it easy (or easier) to write very, very slow code that looks elegant and “just works”: each branch passed to Filter would be read at every entry, even if //code using branches only uses 2 out of 200 branches.

Each Filter should only take the branches it needs as input, as that limits useless reads (and reading is most of the runtime for typical use-cases), hence my alternative proposal :smiley:

Cheers,
Enrico

Sorry, I didn’t mean the C preprocessor, I meant another one like moc (for Qt) that would only include the branches that are used.

Uhm, wouldn’t that require a full-blown C++ parser to go through //code using branches and figure out what variables are used and which of those variables have names that correspond to branches?
Branch names such as Jet.Pt would be even more complicated.
Something like that would entail a significant development effort, even with cling available.

Taking a step back, what is the actual usability issue that you are encountering?
As I wrote above, as I understand it, it’s that you don’t want to write long template parameter lists by hand and/or you don’t want to have to check what type a given column is. Is this correct?

It would require a full parser, but my impression was that with libclang that’s not really a barrier.

My key issue is that the branches used need to be listed twice (once in the argument list, once in the branch list), and you need to list them in the exact same order. This is repeated for every node.

If you misspell a branch name in the branch list – runtime error. Get the order wrong (very easy if you define a function instead of using an inline lambda) – runtime error. Use the wrong type – runtime error.
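To make this concrete, a small made-up example where the two lists go out of sync and the mistake only shows up when the event loop starts:

bool selectEvent(int nJets, double met); // defined elsewhere

// the branch list has to repeat the arguments in exactly this order;
// swapping the names still compiles, but fails at runtime with a type mismatch
df.Filter(selectEvent, {"met", "nJets"}, "selection");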

My current solution is to define one column at the beginning containing a struct and put all the information I need in that struct. This doesn’t seem very efficient though.
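Roughly, the workaround looks like this (a sketch with invented names):

struct EventData {
    int nJets;
    double met;
};

// one Define at the start packs everything into a single column...
auto df2 = df.Define("evt",
                     [](int nJets, double met) { return EventData{nJets, met}; },
                     {"nJets", "met"});

// ...and downstream nodes only ever ask for that one column
auto filtered = df2.Filter([](const EventData &e) { return e.nJets > 0; }, {"evt"});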

Though on second thought, the preprocessor would have to somehow deal with defined columns. Even a checker to turn those runtime errors into compile time ones would probably require some fairly involved static analysis. Maybe this is a lot harder than I anticipated.

A function that returns branch names along with types (as you suggested) might be the best start. At least this would provide a guide to avoid some of the issues.

Alternatively it might be easier to just extend the JIT mode (with optimization and the ability to call functions) and rely on that. The design does give me the impression that the JIT mode was meant to be the primary way RDataFrame is used.

It would require a full parser, but my impression was that with libclang that’s not really a barrier.

cling makes something like this possible, but not necessarily straightforward (for the reasons you list and more).
Also the double compilation pass is a bit clunky imo.

If you misspell a branch name in the branch list – runtime error. Get the order wrong (very easy if you define a function instead of using an inline lambda) – runtime error. Use the wrong type – runtime error.

I hear you. I don’t have a non-jitted solution (we need the compiler to see the branch types in the signatures and RDataFrame to have the names of the branches), but I am aware that’s a pain point.

My current solution is to define one column at the beginning containing a struct and put all the information I need in that struct. This doesn’t seem very efficient though.

If you know that you will always read all the branches in the struct for each event, this costs one extra copy of those values, which is probably not a performance bottleneck. If it does slow you down, refactoring later is fairly straightforward. If you don’t want to read all the branches in the struct for each event, this method results in wasteful reading (which might have a noticeable runtime cost).

it might be easier to just extend the JIT mode (with optimization and the ability to call functions)

The cost of jitting is some offset before starting the event loop (during which things get compiled) and a virtual call per jitted node. Depending on the use-case, this might be reasonably low or unbearably high.
However, you can do in JIT mode anything that you could do in the ROOT interpreter, including calling functions (as long as cling knows about them – you might have to gInterpreter->Declare("#include ...") and/or load the corresponding libraries via the interpreter).
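For example, a free function can be made visible to the jitted strings roughly like this (a minimal sketch; the header name and function are made up):

// make the declaration (or the full definition) known to cling
gInterpreter->Declare("#include \"MyHelpers.h\""); // declares bool isGood(double x)

ROOT::RDataFrame df("mytree", "myfile.root");

// the jitted string can now call the function
auto h = df.Filter("isGood(x)").Histo1D("x");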

The design does give me the impression that the JIT mode was meant to be the primary way RDataFrame is used

JIT for quick, possibly interactive exploration, and more verbose, native C++ code for a performant implementation that you code once and use many times.

In any case, thank you for your great feedback, we should definitely think about ways to mitigate the verbosity/redundancy of the native C++ interface.

Cheers,
Enrico
