Is there better way to filter array branches than defining new columns in RDataFrame?

Hi all,

Is there better way to filter array branches than defining new columns in RDataFrame?

Consider RDataFrame with a following data:

Each event has n_hits column
And hits properties array columns, e.g.

RVec<double> hit_position
RVec<double> hit_time
...
RVec<double> hit_energy
RVec<int> hit_pdg

I need to do many selections on hits, and study how do they affect my distributions.
The code I come up with results in a very repetitive code and long name column names, e.g.

df = df.Define("good_hits", " . . .")\
          .Define("good_hit_pos", hit_position[good_hits])\
          .Define("good_hit_t", hit_time[good_hits])\
          .Define("good_hit_e", hit_energy[good_hits])\
          .Define("good_hit_pdg", hit_pdg[good_hits])\

df = df.Define("sel_low_energy", "good_hit_energy < 10")\
          .Define("good_hit_low_e_pos", good_hit_pos[sel_low_energy])\
          .Define("good_hit_low_e_t", good_hit_t[sel_low_energy])\
          .Define("good_hit_low_e_e", good_hit_e[sel_low_energy])\
          .Define("good_hit_low_e_pdg", good_hit_pdg[sel_low_energy])\

df = df.Define("reject_electrons", "good_hit_low_e_pdg != 11")\
...

This all is very hard to read and maintain in the end.
Is there better way to write the following logic in concise manner?

The only option I could think of is to define all selections together in single column.

BUT then I lose the ability to compare how does hit_pdg, good_hit_pdg, good_hit_low_e_pdg distributions differ in the end.

This is just an illustrative example.
In the real analysis, there will be even more columns, e.g. for each pdg type, for each calorimeter layer, which is already dozens.

Is there anything experts could advice in terms of logic/style/etc.?

cheers,
Bohdan

Hi Bohdan,
thank you for bringing this up! Currently, to avoid defining many helper columns, you can:

  • use Redefine to use the same column names but change their contents
  • pass the mask together with the original columns and do the selection inside the expression

We could add syntactic sugar such as:

df.DefineMany("good_hit_\1", [](const RVecF &v, const RVecI &mask) { return v[mask]; }),
  {{"hit_position", "good_hits"}, {"hit_time", "good_hits"}, {"hit_energy", "good_hits"}, {"hit_pdg", "good_hits"}})

but I don’t know if it would really help in the end.

Very often these “repeated Defines” scenarios appear in the context of RVecs with different properties of the same physical object, just like in your example, so we could think of some mechanism to treat multiple RVecs simultaneously, but so far I haven’t had a “Eureka” moment.

@jbendavid @Axel @RENATO_QUAGLIANI @etejedor @vpadulan @beojan @pieterdavid etc. etc. if you have an idea of how to beautify that pattern let us know :grinning_face_with_smiling_eyes:
The tricky part is to reduce the characters without losing too much readability and still work with C++/lambdas or Python/strings – it’s a fairly constrained design space.

Cheers,
Enrico

An alternative proposal: what about RDF-side support for making proxies?

df = df.MakeProxy(“Muon”, {“Muon_pt”, “Muon_eta”, “Muon_phi”, “Muon_mass”, “Muon_pfIsoId”})
df = df.Define(“GoodMuon”, “Muon[Muon.pt > 30 && abs(Muon.eta) < 2.4 && Muon.pfIsoId > 3]”)
h1 = df.Histo1D((“h_m”, “”, 100, 25, 225), “GoodMuon.pt”, “nominal_weight”)
df.Snapshot( . . . , [“GoodMuon”], flatten_proxies=True)

Let RDF do the smart/right thing in terms of the internal structure (whether that’s some kind of function call (GoodMuon.pt → Muon_pt[some_mask].at(position) ) or an auto-generated struct/class, so that there aren’t unneccesary copies of data created, etc. Ideally, let snapshots be aware of these and optionally flatten them again when you want to write out to a root file, so that in this case it would produce branches for nGoodMuon, GoodMuon_pt, GoodMuon_eta, GoodMuon_phi, GoodMuon_mass, GoodMuon_pfIsoId with a single ‘column’ requested.

How to do that, especially in a way that doesn’t hamstring RDF from optimizing out branches that aren’t used (i.e. one overzealously requests all columns be added to the proxy, but only uses 20% of them in the current computation graph), is beyond me, if such a thing as this is even possible.

Cheers,
Nick

2 Likes

Like @nmangane I’ve found the most common pattern (in CMS NanoAOD analysis - other use cases may be less selection-heavy) to be using array branches from the original tree with indices that are the result of a calculation (often several steps of selection, sorting etc.).
The difficulty is that the format of the tree is not known at compile time (and may well differ between e.g. data and MC or different data-taking periods) - another way around that is generating code after loading the tree and using the JIT compiler (that’s what I do, but my collections live in a python layer, so RDF just sees the original and index arrays).
Since I can do the bookkeeping in python, I keep the index list always separate to avoid unnecessary copies; the same might perhaps be used to implement a MakeProxy prototype with a limited runtime cost for asking too many attributes (storing two pointers per attribute, if not used (?) - the branch would still be read then, though), something like this (nothing tested, just dumping some (pseudo)code ideas in case they help anyone :wink: ):

class MuonCollection {
public:
  // define length, operator[] (with an index list or single entry) etc.
  RVec<index_type> indices;
  // attributes (generated)
  RVecWithIndex<float> pt; // from base=Muon_pt, indices=this->indices

  MuonCollection(RVec<index_type> indices_, const RVec<float>& pt_, ...)
    : indices(indices_), pt(indices, pt_), /* ... */ {}

  class Element {
  public:
    index_type idx; // index in the original collection
    // attributes (generated)
    float pt() const { return m_parent.pt.base[m_idx]; }
  private:
    index_type m_idx;
  };
};

(where RVecWithIndex allows to write muons.pt[i] for Muon_pt[muons.indices[i]] - I think this could be a useful building block even without the collections, but maybe I’m worrying too much about the copies; the element class would take it all the way to muons[0].pt())
Then MakeProxy would reduce to Define("Muon", "MuonCollection{all_indices, Muon_pt, ...}");, but it becomes harder to figure out which branches are actually later on.
I’m not sure this is possible and practical to do in C++, and if it is something that RDF can and should solve, though, it may be more efficient or easier in a framework that is built on top.

2 Likes

I’d go with a bit less magic by having the MakeProxy call take a class to construct as a template argument.

There’s a good chance you would commonly want to use a Lorentz vector class (or something that inherits or contains a Lorentz vector) that supports addition, boosting, etc.

Maybe i am naive here, but wouldn’t a combination of Reduce, Define and Filters would do the trick?. Is there an example of the use case which one can look at? I guess with Reduce you could apply a mask to all the array elements given a selection in principle and then propagate the outcome?
EDIT:
What i usually do for collection of objects which in the original TTree are exposed as arrays or vectors, is to create a base class which contains the properties of them and then play with the vector of this object.
I see an alternative for such case maybe to generalize the approach without having the user to implement a class to abstract the parallel vectors into a Define Tuple where the users defines how many private members this pseudo class of a given collection has and they get forwarded to the std tuple. Not sure if it’s doable, but in a pseudo code like style

df.DefineTuple( “VecObject”, {“pt”, “eta”…})
Creating an RVec< std::tuple >
After this i guess one can define some sort of int mask for a given selection and use that later on.

And then you can filter myObject with some get<i> property.
I am not sure how elegant that would be, tough it would reduce a lot the amount of Defines one do.

Hi,
just a keepalive ping, I have not forgotten about this thread and I’d like to thank everyone for the input. I’ll try to allocate some brain time soon to come up with a design proposal.

1 Like

commenting on this… Currently (6.24.02) RDataFrame doesn’t know what is Redefine

Correct, it’s only in the nightlies for now (will be in 6.26).

I don’t think this is supposed to work, but in practice if you take arguments as non-const references, I think you can modify them in a Filter or Define call.

We tried not to break that quirk as long as we didn’t have Redefine, which is now the correct/superior way to do the same (e.g. Redefine can even change the type of the column).

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Hi all,
finally an update :slight_smile:

The problem as I understand it

As I understand the situation, it is common to perform the same operations (mostly selections and sorting) on all arrays representing different physical properties of the same objects (e.g. selecting some “good” elements from muon_pt, muon_eta, muon_phi that are at the same index in each of the arrays). However, it is annoying to have to spell out the operation for each array you want to apply it to, like in @FoxWise 's example in the first post.

RDataFrame could provide features to perform these common operations with less characters typed.

Solutions considered

The proxy/binder object

As suggested by @nmangane :

df.Bind("muons", {"muon_pt", "muon_eta"})

would define a muons column that can be treated like a single array and would broadcast operations to all the arrays it’s binding together. The original arrays would be accessible as data members, like muons.pt.

So far I could not figure out how to make this work with compiled code.
If the type of muons has to be known at compile time, then users will probably have to define that type (e.g. to specify the names and types of the data members) as @beojan suggests above, but that’s already so much boilerplate that at that point I’d suggest to just use a Define:

df.Define("muons" [](...) { return MyMuons(pt, eta, phi); }, {"muon_eta", "muon_eta", "muon_phi"}); 

We could provide macros that make it easy to create a MyMuons type with the desired data members and that behaves like a “binder of arrays”, but whoever is expert enough to use that machinery is probably also expert enough to write it for themselves and, more importantly, it would be awkward to use this technology from Python.

If the type of muons is not known at compile time, but it is created just-in-time, then muons can have have precisely the data members we need it to have with names and types that are programmatically generated, but it is impossible to use muons in a non-jitted Filter/Define:

df.Bind("muons", {"muon_pt", "muon_eta"})
  .Redefine("muons", [] (??? muons) { return muons[muons.eta > 5]; })

So I don’t know how to make this work well.

An ad-hoc df.Select

df.Select would be an ad-hoc method to perform selections of array elements, possibly on multiple arrays at the same time. Usage could look like this:

df.Select("muon_.*", "muon_eta > 0")
df.Select("muon_.*", [](RVecD &eta) { return eta > 0; }, {"muon_eta"})

This has two problems: it only solves simultaneous selections (which is probably the most common scenario, but it leaves e.g. simultaneous sorting out) and, perhaps more importantly, even the “compiled” version with the lambda expression actually requires jitting.
This would be the fully compiled version:

df.Select<RVecD>("muon_.*", [](RVecD &eta) { return eta > 0; }, {"muon_eta"})

Without that extra template parameter that tells RDF, at compile time, the type of all muon_.* columns, we have to wait until runtime to generate the code that performs the actual selection on each of the columns. Plus, in principle the arrays could have different types.

A generic RedefineMany

df.RedefineMany("muon_.*", "_1[muon_eta > 0]")
auto select = [](RVecD &v, RVecD& eta) { return v[eta > 0]; };
df.RedefineMany("muon_.*", select, {"_1", "muon_eta"});

_1 is a placeholder for “the column being redefined”. The compiled version only works if all columns being redefined have the same type, but RedefineMany allows arbitrary operations besides selections (e.g. it would cover sorting too).

I think this would solve the titular problem?
DefineMany could also exist but it would not address this issue as nicely.

I would love to hear your thoughts.
Cheers,
Enrico

1 Like

TL;DR:

Is it worth supporting something like

df.XXX(..., "collection[selection]")

natively, and that’s a zero-copy indexed view into the original column "collection"?

This could be used like:

df = df.Define("good_muons", [](...){ return muon.getListOfIndices(2.5 < muon.eta && muon.eta < 2.5 && ...); }, {"muon"});
...

std::string selection = "[good_muons]";
df = df.XXX(..., { "pt" + selection, "eta" + selection })
       .YYY(..., { "muons" + selection });

selection = "[loose_muons]";
df = df.XXX(..., { "pt" + selection, "eta" + selection })
       .YYY(..., { "muons" + selection });

Longer version

We were using a framework that did a similar thing as @pieterdavid mentioned. Translated to RDF, that would roughly look like:

// We were not using a mask, but a collection of indices, so you can sort at your convenience:
df = df.Define("good_muons", [](...){ return muon.getListOfIndices(2.5 < muon.eta && muon.eta < 2.5 && ...); }, {"muon"});

// And to do stuff with it:
MuonCalibration calib_mu{ /*input =*/ "muon[good_muons]",
    /* output =*/ "calibratedMuons",
    calibrationData = "<filename>" };

// And this thing would know how to do something like:
df = df.Define(calib_mu.outputName(), [&](....){ return calib_mu.calibrate(....); }, calib_mu.requiredBranches() );

I think you can implement all of this already now, but the thing that’s missing is a wrapper around an RVec that does the job of "muon[good_muons]" in an efficient way. It should obviously support operator[] and iterators.

I understand that since you don’t create copies, data won’t be nicely packed in memory, so no vectorisation and no SBO for RVec, but also you don’t create copies. When we used something like this, we were reading directly from the decompressed TTree buffers, and the original data usually stayed in the cache, so it was fast.

This wrapper might look something like

template <typename T>
struct IndexedRVec {
  std::vector<std::size_t> indices;
  RVec<T>& data;

  T& operator[](std::size_t i) { return data[indices[i]]; }
  ...
};

You may want to decide whether it’s allowed to

  • mutate the index list (we didn’t - just write a new list if you want to change something)
  • mutate the underlying data (we didn’t, just write a new column)

Edit:

One nice thing about these indexed vectors was it was very easy to run the same algorithm on a bunch of different collections (e.g. loose muons, tight muons, veto muons, forward muons, … electrons, … jets, etc., etc.).
The only thing that changed was the index array that’s in use.
I’m not sure that this is still a crucial feature with nanoAOD or xAOD, though.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Hi, sorry to resurrect the conversation.

@StephanH I’m wary of introducing expressions in place of column name, it gets complicated fast and it always requires just-in-time compilation, but I’ll keep it in the back of my mind.

In the context of performing the same operation on several similar arrays, how prominent is selecting elements vs other operations, e.g. sorting? Enough to justify a dedicated operation?

df = df.Select(("muon_pt", "muon_eta"), "some_cut")
auto df2 = df.Select({"muon_pt", "muon_eta"}, [] (...) { return some_rvec_mask; });

?

Cheers,
Enrico