RDataFrame feature request: Filter "OR" for Snapshot with variations

Hi!

I know the VariationsFor(SnapshotPtr_t resPtr) implementation is still a work-in-progress, and I don’t know what approach will be followed exactly, so I will take this chance to make a small request:

It would be extremely useful if the output of the VariationsFor(df.Snapshot(...)) call could be a single tree with an “OR” applied over all the variations acting on each Filter, instead of a separate tree for each variation (like what we currently have for histograms). If this is not already the default approach being considered, it could at least be offered as an option.

This would greatly help skimming scripts that run on NTuples with kinematic systematic variations affecting the Filters applied during the skimming. In most of these cases, we want our output NTuples to have roughly the same structure as the original tree (with maybe some columns added or removed), while still containing all the systematic variations.

Furthermore, a set of boolean flag columns could be optionally stored, signaling which filter-variation combination was passed on each RDF entry.
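
To make the request concrete, here is a minimal plain-Python sketch of the behaviour I have in mind. It does not use RDataFrame at all; the column names (pt, pt_up, pt_down), the threshold, the flag names and the helper skim_with_or are all hypothetical and only illustrate the idea of keeping an entry when the filter passes for the nominal value or for any variation, and of storing one boolean flag per case.

```python
# Plain-Python illustration only; no ROOT calls. All names and numbers are made up.
PT_CUT = 30.0  # hypothetical threshold of the Filter

events = [
    {"pt": 35.0, "pt_up": 38.5, "pt_down": 31.5},  # passes everywhere
    {"pt": 29.0, "pt_up": 31.9, "pt_down": 26.1},  # passes only for "up"
    {"pt": 31.0, "pt_up": 34.1, "pt_down": 27.9},  # fails only for "down"
    {"pt": 20.0, "pt_up": 22.0, "pt_down": 18.0},  # fails everywhere
]


def skim_with_or(rows, cut):
    """Keep a row if the filter passes for nominal OR any variation."""
    kept = []
    for row in rows:
        flags = {
            "pass_nominal": row["pt"] > cut,
            "pass_up": row["pt_up"] > cut,
            "pass_down": row["pt_down"] > cut,
        }
        if any(flags.values()):
            # Store the original columns plus one flag per variation.
            kept.append({**row, **flags})
    return kept


skimmed = skim_with_or(events, PT_CUT)
for row in skimmed:
    print(row)
```

Only the last entry is dropped; the other three end up in the single output "tree", each with flags telling which selection it passed.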

Thanks for all the ongoing work!

Best,

Jean Yves

Jean Yves,

Thanks for your input on this important topic. As a matter of fact, “Varied Snapshots” currently have high priority. We won’t have time to deliver this for the ROOT 6.34.00 release, expected in November, but we’ll do our best to ship the feature with 6.36.00, foreseen in May.
Technical discussions are already quite advanced: let me loop in @StephanH, @mczurylo and @vpadulan, the RDF lead developers.

Cheers,
Danilo

Hi @Jean_Beaucamp,

thanks for getting in touch!
We were in fact thinking about this problem, and we see how this is useful. A first idea is that the output tree would indeed look like:

A  B  C  A_VarUp  A_VarDown
1  2  3  1.1      0.9
1  2  3  1.1      x

And so on. What gives us headaches is what to do when a filter passes for nominal, but not for the variation, or vice-versa. What do we write in the tree? We were discussing various options, but there is no clear way forward yet:

  1. A very C++ way would be to write a std::optional, so values can possibly be empty. But we would change the schema of the tree. You would have different column types. Furthermore, also the nominal column would have to change its type, because it could be empty whereas the systematic selections pass (think of a pT or energy cut, where the “Up” variation will make new events appear in the selection). On the Python side, optionals are only slightly OK; you can test if they have a value, but retrieving their value only seems to work with a.value().
  2. If we don’t change the schema of the tree, we have to write “something” that will be accepted by the tree, so we would need an extra column that you can base a filter on, like so:

     A  B  C  A_VarUp_Valid  A_VarUp  A_VarDown_Valid  A_VarDown
     1  2  3  true           1.1      true             0.9
     1  2  3  true           1.1      false            0

     The only way to not cause memory corruption is to write default-constructed values, so we definitely will write some zeroes or similar. If you don’t test the new columns (whose names we will have to somehow drop from the sky such that they make sense), you will read weird stuff.
  3. Another way is to write vectors of length zero, if the selection didn’t pass, and length one if it passed. This is incredibly space-efficient in RNTuple, but again, the schema changes:

     A    B  C  A_VarUp  A_VarDown
     {1}  2  3  {1.1}    {0.9}
     {1}  2  3  {1.1}    { }
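
To compare how the three options would look to someone reading the tree back, here is a rough plain-Python sketch. It is not ROOT code and not how the feature would be implemented; the helper names are hypothetical, None stands in for an empty std::optional, and 0.0 stands in for a default-constructed value.

```python
from typing import List, Optional, Tuple

# Plain-Python illustration only; no ROOT calls. Helper names are hypothetical.


def encode_optional(value: Optional[float]) -> Optional[float]:
    """Option 1: the value may simply be absent (None ~ empty optional)."""
    return value


def encode_with_flag(value: Optional[float]) -> Tuple[bool, float]:
    """Option 2: a validity flag plus a default-constructed value (0.0)."""
    return (True, value) if value is not None else (False, 0.0)


def encode_as_vector(value: Optional[float]) -> List[float]:
    """Option 3: a list of length one if the selection passed, else empty."""
    return [value] if value is not None else []


# A_VarDown for two entries; the second one fails the varied selection.
a_var_down = [0.9, None]

as_optional = [encode_optional(v) for v in a_var_down]
as_flagged = [encode_with_flag(v) for v in a_var_down]
as_vectors = [encode_as_vector(v) for v in a_var_down]

# With option 2 the reader has to test the flag; skipping the test silently
# mixes the placeholder 0.0 into the result.
checked = [value for valid, value in as_flagged if valid]
unchecked = [value for _, value in as_flagged]

print(as_optional, as_flagged, as_vectors)
print(checked, unchecked)
```

In this sketch only option 2 lets a reader pick up a placeholder value without noticing, which is the “weird stuff” mentioned above; options 1 and 3 force a check but change the column type.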

Let us know if you come up with another idea, we are not terribly fond of any of the above yet.
