Pass a whole event content with RDataFrame

kandrosov · September 10, 2019, 1:29pm

Hello,
is there a way to pass all columns available in the TTree without specifying all of them in the signature of a custom function when calling ROOT::RDataFrame::Define() or ROOT::RDataFrame::Filter()?

For example, I want to store the result of some complex algorithm that uses information from (almost) all columns stored in the TTree in a new branch. I know this can be achieved with a wrapper function that takes all required columns as arguments, but the flexibility of that approach is quite limited when more than 40 columns are used. In my opinion, such cases would be much easier to handle if the entire event content could be passed wrapped in a simple structure, e.g. something like

template<typename Event>
double algo(const Event& e);

df = df.Define("score", algo);

Is there a way to do it in the latest ROOT version? Or are there any plans to implement similar functionality?

Thank you!

eguiraud · September 10, 2019, 2:04pm

Hi Konstantin,
welcome to the ROOT forum and thank you for the feedback!
Unfortunately this is not possible at the moment, and there are currently no plans to implement it. I agree that sometimes, especially when adapting old code, it could be useful to have an Event object available in RDF instead of passing a large number of columns and column names as arguments.

You can however write only one function that takes all of your branches and produces the Event class:

df.Define("event", create_event, all_branches)

where create_event would be a function with a giant signature and would return an Event. Performance would still be quite bad since you are making one copy of every branch at every entry.
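
A minimal sketch of this workaround, written without ROOT so it compiles on its own. The Event fields (met, n_jets, jet_pt) are hypothetical stand-ins for real branches, and the RDataFrame calls appear only in a comment, assuming that the column types in the signature match what RDataFrame reads from the tree:

```cpp
#include <vector>

// Hypothetical event layout; the fields stand in for real branches.
struct Event {
    float met;
    int n_jets;
    std::vector<float> jet_pt;
};

// The function with the "giant signature": one parameter per branch.
// Every argument is copied into the Event, which is the per-entry cost
// mentioned above.
Event create_event(float met, int n_jets, const std::vector<float>& jet_pt)
{
    return Event{met, n_jets, jet_pt};
}

// The algorithm itself only ever sees the Event.
double algo(const Event& e)
{
    double sum = e.met;
    for (float pt : e.jet_pt) sum += pt;
    return e.n_jets > 0 ? sum / e.n_jets : sum;
}

// With ROOT available, the two steps would be chained roughly as
// (column names are assumptions):
//   auto df2 = df.Define("event", create_event, {"met", "n_jets", "jet_pt"})
//                .Define("score", algo, {"event"});
```

Only create_event has to know the full list of branches; every later function can take a const Event&.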

Rant alert
If we are talking C++, Event would have to be a quite complicated object, since at compile-time, when its layout could be decided, there is no way to tell what data members it should have. The performance impact of performing one full copy of each event is also to be considered carefully. So Event would need to be some opaque wrapper object that loads branches lazily, on demand, e.g. returning a void * from event["mybranch"] or returning a T& from event.get<T>("mybranch"). But then you would need to specify the T every time you use a branch, so we’re back where we started. Another option could be to have a tool process your data and create a header file that contains the proper definition of Event. I’m not a fan of having users include and compile ROOT-generated code, but that would be possible, although probably not trivial.
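
To make the lazy-wrapper idea concrete, here is a rough standard-library-only sketch (C++17). LazyEvent, add_branch and loaded are hypothetical names, and caching values in std::any is just one possible choice, not something ROOT provides. It shows why the caller still has to spell out T on every access:

```cpp
#include <any>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical opaque event: each branch is a loader that runs on first access.
class LazyEvent {
public:
    void add_branch(const std::string& name, std::function<std::any()> loader)
    {
        loaders_[name] = std::move(loader);
    }

    // The caller must name T explicitly; nothing here can deduce it.
    template <typename T>
    const T& get(const std::string& name)
    {
        auto cached = cache_.find(name);
        if (cached == cache_.end()) {
            auto loader = loaders_.find(name);
            if (loader == loaders_.end())
                throw std::out_of_range("unknown branch: " + name);
            // First access: run the loader once and keep the value.
            cached = cache_.emplace(name, loader->second()).first;
        }
        // Throws std::bad_any_cast if T does not match the stored type.
        return std::any_cast<const T&>(cached->second);
    }

    // Number of branches that have actually been loaded so far.
    std::size_t loaded() const { return cache_.size(); }

private:
    std::map<std::string, std::function<std::any()>> loaders_;
    std::map<std::string, std::any> cache_;
};
```

A call looks like ev.get<float>("met"), and a wrong T only shows up as an exception at runtime.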

Cheers,
Enrico

kandrosov · September 10, 2019, 3:34pm

Hello Enrico,
thank you for the detailed reply!
Indeed, with the current functionality, defining a create_event function is probably the best workaround for old code. Thank you for the suggestion!

Rant alert
About a possible C++ implementation of my proposal: I don’t think that Event has to be a complicated object. For cases where more than 90% of the branches are used for most events, reading all branches up front instead of waiting for the first access is not a big overhead. In such cases Event can be a simple struct, and the data from the branches can be loaded directly into its fields, without any copying and the related performance losses. Moreover, using the clang interpreter, the definition of struct Event can be generated at runtime. So, for use cases where an analyst knows that all the branches will be used most of the time, they can request this when constructing the RDataFrame (e.g. by adding an argument Mode::Full); at that point, under the hood, the struct Event would be generated and its fields would be registered with TTree::SetBranchAddress. The specialization algo<Event> can also be generated at runtime by clang. In this way, all the other benefits of RDataFrame can still be exploited without considerable performance losses.
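
As an illustration of the code-generation step only, here is a small sketch that builds the text of such a struct from a list of columns. make_event_struct is a hypothetical helper, the column names are assumed to be valid C++ identifiers, and binding the fields to the tree (e.g. via TTree::SetBranchAddress) is not shown:

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Builds the source text of a struct with one data member per column.
// `columns` holds (C++ type, column name) pairs.
std::string make_event_struct(
    const std::vector<std::pair<std::string, std::string>>& columns)
{
    std::ostringstream out;
    out << "struct Event {\n";
    for (const auto& [type, name] : columns)
        out << "    " << type << " " << name << ";\n";
    out << "};\n";
    return out.str();
}

// With ROOT, the resulting string could then be handed to the interpreter,
// for example with gInterpreter->Declare(code.c_str()); this step is an
// assumption about how the idea might be wired up, not an existing
// RDataFrame feature.
```

For the columns {"float", "met"} and {"int", "n_jets"} this produces a struct Event with the two members float met and int n_jets.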

Cheers,
Konstantin.

eguiraud · September 10, 2019, 4:48pm

Yes! But then what do users put as the argument type in Define lambdas that should use Event?

Besides this problem, I agree that would be a way to do it.

kandrosov · September 10, 2019, 5:37pm

IMO, it is not very good practice to write complicated code inside lambdas, so in the case of externally defined lambdas, providing the ordered list of columns looks fine to me. Besides, in C++20 lambdas will also be able to have template parameters.

In case of functions or lambdas passed as a string into Define, clang will deduce the argument type, so users don’t need to specify it explicitly, e.g.:

template<typename Event> double myFn(const Event&) { return 21; }
df = df.Define("x", "myFn(event) * 2");

eguiraud · September 10, 2019, 5:42pm

RDF also supports free functions and functor classes, of course, but the problem remains. Yes, specifying everything via strings should work (note that you also have to declare myFn to cling as a string).
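
A small compiled sketch of the deduction being discussed. EventA, EventB and the helper x are hypothetical. The ROOT part appears only in a comment and assumes that a column named "event" exists and that myFn is declared to the interpreter with gInterpreter->Declare:

```cpp
#include <vector>

// Hypothetical event types; any type works because Event is deduced at the call site.
struct EventA { float met; };
struct EventB { std::vector<float> jet_pt; };

template <typename Event>
double myFn(const Event&) { return 21; }

// Stand-in for the expression string "myFn(event) * 2":
// the compiler deduces Event from the argument.
template <typename Event>
double x(const Event& event) { return myFn(event) * 2; }

// With ROOT, the string-based version would look roughly like:
//   gInterpreter->Declare(
//       "template<typename Event> double myFn(const Event&) { return 21; }");
//   auto df2 = df.Define("x", "myFn(event) * 2");
```

The deduction works here because the argument type is known where myFn is called. A compiled callable passed directly to Define still needs concrete parameter types in its signature.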

Cheers,
Enrico
