One-to-many transformation in RDataFrame

Is it possible to do a one-to-many transformation in RDataFrame, e.g. with multiple ways of reconstructing an event?

Since the answer is probably no, could this feature be added?

Hi,
could you provide a little snippet that showcases the feature?

Cheers,
Enrico

Something along the lines of:

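// MultiDefine is the proposed method; it does not exist in RDataFrame.
df.MultiDefine("dijet",
    [](const std::vector<TLorentzVector>& jets) -> std::vector<TLorentzVector> {
        std::vector<TLorentzVector> output;
        // Pair each jet with every later jet, so that no jet is paired with
        // itself and each pair appears only once.
        for (std::size_t i = 0; i < jets.size(); ++i) {
            for (std::size_t j = i + 1; j < jets.size(); ++j) {
                output.push_back(jets[i] + jets[j]);
            }
        }
        return output;
    }, {"jets"});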

The vector would then be unpacked, and there would be an RDataFrame entry with each dijet value.

That’s one way; it could also be an Unpack function that unpacks an existing vector column.
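To make the requested semantics concrete, here is a minimal sketch in plain C++ (no ROOT) of what unpacking a vector column could mean: one output row per element, tagged with the event it came from. The names `ExplodedRow` and `Explode` are hypothetical and are not part of RDataFrame.

```cpp
#include <cstddef>
#include <vector>

// One row of the exploded table: which event it came from, which element of
// that event's collection it is, and the element's value.
template <typename T>
struct ExplodedRow {
    std::size_t event;  // index of the original event
    std::size_t index;  // position inside that event's collection
    T value;
};

// Hypothetical "explode": turn one collection per event into one row per
// element. Events with an empty collection contribute no rows.
template <typename T>
std::vector<ExplodedRow<T>> Explode(const std::vector<std::vector<T>>& column) {
    std::vector<ExplodedRow<T>> rows;
    for (std::size_t ev = 0; ev < column.size(); ++ev) {
        for (std::size_t i = 0; i < column[ev].size(); ++i) {
            rows.push_back(ExplodedRow<T>{ev, i, column[ev][i]});
        }
    }
    return rows;
}
```

The number of rows after exploding is the sum of the collection sizes over all events, so it can be much larger than the number of events.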

Ok I see,
you want to unpack/explode a given collection column in a new separate RDataFrame that would loop over the elements of the collection rather than on the original events.

We had thought about it, but in the end decided it might end up being confusing, also in terms of computational complexity.

At the moment we are thinking that we would rather keep a clear separation between loop over events (managed by RDataFrame) and nested loops (e.g. over jets in an event), but we want to make these nested loops and selections as simple and elegant as possible. RVec was introduced with this goal in mind, and it goes a long way, but when dealing with multiple collections like in this case it requires a bit of not-so-elegant index manipulation.

So…we are thinking about it. It would be great if you could contribute your examples of nasty nested loops to this repo, either as a pull request or simply as an issue that describes your use case: the more use cases we are aware of, the nicer the solution we’ll propose.

Actually, I solved the basic “particles / jets / whatever in an event” problem by making copious use of range-v3, which provides a lot of really nice FPesque functionality for working with collections.

This Unpack / Explode (I think I prefer the Explode name, actually) serves a different usecase, dealing with cases where there are different, independent, ways of looking at an event. You wouldn’t then re-group the rows from a single event, for instance, as you would if you exploded out the particles.

Hi beojan,

“by making copious use of range-v3”

nice, that’s of course RVec on steroids (unfortunately we can’t just tell users to use range-v3 in general).

“This Unpack / Explode (I think I prefer the Explode name, actually) serves a different usecase, dealing with cases where there are different, independent, ways of looking at an event.”

Sorry, could you give an example?
I don’t see why your earlier usecase could not be addressed by some facility that makes it easy to write things like df.Define("dijets", AllPairs, {"jets"}).Define("some_var", ProcessDijets, {"dijets"}).

Cheers,
Enrico

OK, let’s take my specific usecase. I’m working on a HH->4b analysis, and part of this is deciding which two jets make up each of the two Higgses.

One solution is to simply select the best pairing. Unfortunately, doing this will inevitably sculpt the background, since every background event gets looked at in the “most HH like” way. The alternative is to consider every possible pairing as though it were an independent event (so an n jet event gets blown up into nP4 separate “views”, each of which is considered as an independent event), and continue applying cuts with Filters.

Without an Explode facility, all these cuts would have to be applied by either defining successively refined sets of candidate views for each event, or they would all be applied in the reconstruct function. Either way, much of the elegance of using RDataFrame is lost.
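As an illustration of the “views” idea, the following sketch enumerates every ordered choice of four distinct jets out of n, taking the nP4 count quoted above literally. `View` and `AllViews` are hypothetical names, and the convention that the first two indices form one Higgs candidate and the last two the other is an assumption made for the example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One "view" of an event: jets v[0], v[1] form the first Higgs candidate and
// v[2], v[3] the second. Entries are indices into the event's jet collection.
using View = std::array<std::size_t, 4>;

// Enumerate every ordered choice of 4 distinct jets out of nJets,
// i.e. nP4 = n(n-1)(n-2)(n-3) views.
std::vector<View> AllViews(std::size_t nJets) {
    std::vector<View> views;
    for (std::size_t a = 0; a < nJets; ++a) {
        for (std::size_t b = 0; b < nJets; ++b) {
            if (b == a) continue;
            for (std::size_t c = 0; c < nJets; ++c) {
                if (c == a || c == b) continue;
                for (std::size_t d = 0; d < nJets; ++d) {
                    if (d == a || d == b || d == c) continue;
                    views.push_back(View{a, b, c, d});
                }
            }
        }
    }
    return views;
}
```

With an Explode facility, each of these views would become its own row and the subsequent cuts would be ordinary Filters; without one, the whole vector of views has to be carried along per event and refined by hand.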

I see, thank you for the clear series of thoughts.
Indeed my alternative solution would be a series of Defines (with facilities to make the operations you need to perform in each Define nice and simple).

So in your case it would be (with a bit of imagination):

df.Define("dijets", AllPairs, {"jets"})
  .Explode("dijets")
  .Filter(IsGoodDijet, {"dijets"})...

while what I had in mind would look like this:

df.Define("dijets", AllPairs, {"jets"})
  .Define("good_dijets", SelectGoodDijets, {"dijets"})...

I realize that in your case it would be cool to have an Explode, but in general the kind of operations that you want to do on per-event collections are quite different from what RDataFrame provides: it’s not meaningful to sort or perform group-bys or joins on events, but it is often what you want to do with these nested collections. That’s why I think in general we need two different grammars for the two levels.
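For reference, a minimal sketch in plain C++ of what the per-event helpers named in the snippets above could look like. The function names AllPairs, IsGoodDijet and SelectGoodDijets come from this thread; the `P4` struct is a stand-in for TLorentzVector, and the mass window is made up purely for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal stand-in for a four-vector (TLorentzVector would be used with ROOT).
struct P4 {
    double px, py, pz, e;
    // Invariant mass, clamped to zero for slightly negative m^2.
    double M() const {
        const double m2 = e * e - px * px - py * py - pz * pz;
        return m2 > 0 ? std::sqrt(m2) : 0.0;
    }
};

P4 operator+(const P4& a, const P4& b) {
    return {a.px + b.px, a.py + b.py, a.pz + b.pz, a.e + b.e};
}

// AllPairs: every unordered pair of distinct jets, summed into a dijet.
std::vector<P4> AllPairs(const std::vector<P4>& jets) {
    std::vector<P4> dijets;
    for (std::size_t i = 0; i < jets.size(); ++i)
        for (std::size_t j = i + 1; j < jets.size(); ++j)
            dijets.push_back(jets[i] + jets[j]);
    return dijets;
}

// IsGoodDijet: illustrative cut; the window bounds are assumptions.
bool IsGoodDijet(const P4& dijet) {
    const double m = dijet.M();
    return m > 100.0 && m < 150.0;
}

// SelectGoodDijets: refine the collection inside a single event.
std::vector<P4> SelectGoodDijets(const std::vector<P4>& dijets) {
    std::vector<P4> good;
    for (const auto& d : dijets)
        if (IsGoodDijet(d)) good.push_back(d);
    return good;
}
```

In the Explode style, IsGoodDijet would be passed to a Filter acting on one dijet per row; in the Define style, SelectGoodDijets acts on the whole collection of one event and returns a smaller collection.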

I will very much keep in mind your suggestions as we go forward though! Thank you for sharing.
Cheers,
Enrico

Yep, that was my point. Generally a separate grammar is better for these nested loops (a good rule of thumb is that if you ever want to do a group-by or a sort, it shouldn’t be done at the RDataFrame level), including the usecases proposed in ROOT-9225. I think range-v3 is a good solution here, and probably could be recommended, at least for those using ROOT as a library in compiled code (it’s nicer with C++17, but perfectly usable with C++14). For macros or interpreted code, range-v3 is probably far too template-heavy though, I agree.

However, and this is my main point, there are usecases (like the one I described) where the post-Explode events are independent, and here, RDataFrame is the right level to work at, so Explode would be useful functionality, just not for the purpose it was originally proposed for in ROOT-9225.
