Hi all,
finally an update.
The problem as I understand it
As I understand the situation, it is common to perform the same operations (mostly selections and sorting) on all arrays representing different physical properties of the same objects (e.g. selecting some "good" elements from muon_pt, muon_eta, muon_phi that are at the same index in each of the arrays). However, it is annoying to have to spell out the operation for each array you want to apply it to, like in @FoxWise's example in the first post.
RDataFrame could provide features to perform these common operations with fewer characters typed.
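To make the annoyance concrete, here is a minimal standalone sketch of the pattern. It uses plain std::vector instead of RVec so it compiles without ROOT, and the names Take, MuonArrays and SelectGoodMuons are made up for illustration, they are not RDataFrame API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep the elements of `values` whose corresponding entry in `mask` is true.
std::vector<double> Take(const std::vector<double> &values, const std::vector<bool> &mask)
{
   assert(values.size() == mask.size());
   std::vector<double> out;
   for (std::size_t i = 0; i < values.size(); ++i)
      if (mask[i])
         out.push_back(values[i]);
   return out;
}

// Hypothetical container for the per-muon arrays of one event.
struct MuonArrays {
   std::vector<double> pt, eta, phi;
};

// The selection itself is one line, but it has to be repeated for each array.
MuonArrays SelectGoodMuons(const MuonArrays &in)
{
   std::vector<bool> good(in.eta.size());
   for (std::size_t i = 0; i < in.eta.size(); ++i)
      good[i] = in.eta[i] > 0;

   MuonArrays out;
   out.pt = Take(in.pt, good);   // same operation ...
   out.eta = Take(in.eta, good); // ... spelled out ...
   out.phi = Take(in.phi, good); // ... once per array
   return out;
}
```

Every additional muon property means one more near-identical line, which is the boilerplate the proposals below try to remove.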
Solutions considered
The proxy/binder object
As suggested by @nmangane :
df.Bind("muons", {"muon_pt", "muon_eta"})
would define a muons column that can be treated like a single array and would broadcast operations to all the arrays it's binding together. The original arrays would be accessible as data members, like muons.pt.
So far I could not figure out how to make this work with compiled code.
If the type of muons has to be known at compile time, then users will probably have to define that type (e.g. to specify the names and types of the data members) as @beojan suggests above, but that's already so much boilerplate that at that point I'd suggest to just use a Define:
df.Define("muons", [](...) { return MyMuons(pt, eta, phi); }, {"muon_pt", "muon_eta", "muon_phi"});
We could provide macros that make it easy to create a MyMuons type with the desired data members and that behaves like a "binder of arrays", but whoever is expert enough to use that machinery is probably also expert enough to write it for themselves and, more importantly, it would be awkward to use this technology from Python.
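For reference, this is roughly the amount of code a hand-written binder takes. It is only a sketch under my own assumptions: std::vector stands in for RVec, the member list is hard-coded, and GreaterThan is a hypothetical helper that plays the role of an expression like muons.eta > 5.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<bool>;

// Hypothetical hand-written "binder of arrays": the arrays are data members
// and a selection applied to the binder is broadcast to all of them.
struct MyMuons {
   std::vector<double> pt, eta, phi;

   MyMuons operator[](const Mask &mask) const
   {
      assert(mask.size() == pt.size());
      MyMuons out;
      for (std::size_t i = 0; i < mask.size(); ++i) {
         if (!mask[i])
            continue;
         out.pt.push_back(pt[i]);
         out.eta.push_back(eta[i]);
         out.phi.push_back(phi[i]);
      }
      return out;
   }
};

// Element-wise comparison that produces a mask (hypothetical helper).
Mask GreaterThan(const std::vector<double> &values, double threshold)
{
   Mask mask(values.size());
   for (std::size_t i = 0; i < values.size(); ++i)
      mask[i] = values[i] > threshold;
   return mask;
}
```

With this in place, muons[GreaterThan(muons.eta, 5.)] selects pt, eta and phi consistently, but the type has to be rewritten (or generated by a macro) for every new set of members.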
If the type of muons is not known at compile time but is created just-in-time, then muons can have precisely the data members we need, with names and types that are programmatically generated, but it is then impossible to use muons in a non-jitted Filter/Define:
df.Bind("muons", {"muon_pt", "muon_eta"})
.Redefine("muons", [] (??? muons) { return muons[muons.eta > 5]; })
So I don't know how to make this work well.
An ad-hoc df.Select
df.Select would be an ad-hoc method to perform selections of array elements, possibly on multiple arrays at the same time. Usage could look like this:
df.Select("muon_.*", "muon_eta > 0")
df.Select("muon_.*", [](RVecD &eta) { return eta > 0; }, {"muon_eta"})
This has two problems: it only solves simultaneous selections (which is probably the most common scenario, but it leaves e.g. simultaneous sorting out) and, perhaps more importantly, even the "compiled" version with the lambda expression actually requires jitting.
This would be the fully compiled version:
df.Select<RVecD>("muon_.*", [](RVecD &eta) { return eta > 0; }, {"muon_eta"})
Without that extra template parameter that tells RDF, at compile time, the type of all muon_.* columns, we have to wait until runtime to generate the code that performs the actual selection on each of the columns. Plus, in principle the arrays could have different types.
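The compile-time constraint can be illustrated outside of RDataFrame with a small sketch: when the element types are spelled out in the code, the compiler can instantiate one selection per column type, even if the types differ. Filtered and SelectAll are hypothetical names and std::vector stands in for RVec; with a regex like "muon_.*" the matching columns and their types are only known at runtime, so this instantiation cannot happen ahead of time.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Keep the elements of `values` whose corresponding entry in `mask` is true.
template <typename T>
std::vector<T> Filtered(const std::vector<T> &values, const std::vector<bool> &mask)
{
   assert(values.size() == mask.size());
   std::vector<T> out;
   for (std::size_t i = 0; i < values.size(); ++i)
      if (mask[i])
         out.push_back(values[i]);
   return out;
}

// Apply the same mask to several columns whose element types are all
// known at compile time (they are deduced from the arguments).
template <typename... Ts>
std::tuple<std::vector<Ts>...> SelectAll(const std::vector<bool> &mask, const std::vector<Ts> &...columns)
{
   return std::make_tuple(Filtered(columns, mask)...);
}
```

For example, SelectAll(mask, pt, charge) works for a std::vector<double> pt and a std::vector<int> charge because both types are visible to the compiler at the call site.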
A generic RedefineMany
df.RedefineMany("muon_.*", "_1[muon_eta > 0]")
auto select = [](RVecD &v, RVecD& eta) { return v[eta > 0]; };
df.RedefineMany("muon_.*", select, {"_1", "muon_eta"});
_1 is a placeholder for "the column being redefined". The compiled version only works if all columns being redefined have the same type, but RedefineMany allows arbitrary operations besides selections (e.g. it would cover sorting too).
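To show what "arbitrary operations" could mean in practice, here is a toy emulation of the idea on a plain map of columns, used for sorting rather than selecting. This is not RDataFrame code: Event, RedefineManyToy and SortByDescending are names I made up, all columns are assumed to have the same type, and the first argument of the operation plays the role of _1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <numeric>
#include <regex>
#include <string>
#include <vector>

using Column = std::vector<double>;
using Event = std::map<std::string, Column>;

// Toy stand-in for the proposed RedefineMany: every column whose name
// matches `pattern` is replaced by op(column, other), where `other` is
// always read from the original, unmodified event.
Event RedefineManyToy(const Event &event, const std::string &pattern,
                      const std::function<Column(const Column &, const Column &)> &op,
                      const std::string &otherName)
{
   const std::regex re(pattern);
   const Column &other = event.at(otherName);
   Event out = event;
   for (auto &kv : out)
      if (std::regex_match(kv.first, re))
         kv.second = op(event.at(kv.first), other);
   return out;
}

// Reorder `v` so that the corresponding entries of `key` are in descending order.
Column SortByDescending(const Column &v, const Column &key)
{
   assert(v.size() == key.size());
   std::vector<std::size_t> idx(v.size());
   std::iota(idx.begin(), idx.end(), 0);
   std::stable_sort(idx.begin(), idx.end(),
                    [&key](std::size_t a, std::size_t b) { return key[a] > key[b]; });
   Column out;
   for (std::size_t i : idx)
      out.push_back(v[i]);
   return out;
}
```

Calling RedefineManyToy(event, "muon_.*", SortByDescending, "muon_pt") reorders every muon_ column by descending pt and leaves the other columns untouched.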
I think this would solve the titular problem?
DefineMany could also exist but it would not address this issue as nicely.
I would love to hear your thoughts.
Cheers,
Enrico