RDataFrame syntax more like TTree::Draw

mwilkins · November 26, 2018, 8:24pm

Presently, using RDataFrame, if I want to make a histogram from a function of other columns, I have to Define a new column first. It would be useful, for me anyway, if I could define a new column on the fly while calling, e.g., Histo1D. ROOT could define a column name based on the passed function, then create (if necessary) and refer to this column in the Histo1D call.

For example, instead of:

In [9]: df_defi = df.Define('pt_test', 'sqrt(X_PX*X_PX + X_PY*X_PY)')

In [10]: h_defi = df_defi.Histo1D('pt_test')

I would like to do:

In [11]: h = df.Histo1D('sqrt(X_PX*X_PX + X_PY*X_PY)')

In the case of line 11, ROOT would create a new column in the background (with some unique name), much as in line 9 above, before doing Histo1D as normal. Any future calls of 'sqrt(X_PX*X_PX + X_PY*X_PY)' would refer to this same column (since the same name would be generated).

This would provide functionality similar to TTree::Draw.
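
The request above can be approximated today with a small user-side helper. This is a minimal sketch, not something ROOT provides: `column_name_for` and `ExprCache` are hypothetical names, the hash-based naming scheme is only one possible choice, and the only RDataFrame methods assumed are `Define` and `Histo1D`.

```python
import hashlib


def column_name_for(expr):
    """Derive a deterministic, identifier-safe column name from an expression string."""
    digest = hashlib.sha1(expr.strip().encode("utf-8")).hexdigest()[:12]
    return "expr_" + digest


class ExprCache:
    """Hypothetical wrapper that defines each expression at most once on a data frame."""

    def __init__(self, df):
        self._df = df
        self._nodes = {}  # generated column name -> node returned by Define

    def histo1d(self, expr):
        name = column_name_for(expr)
        if name not in self._nodes:
            # First use of this expression: create the hidden column.
            self._nodes[name] = self._df.Define(name, expr)
        # Later uses reuse the node that already carries the column.
        return self._nodes[name].Histo1D(name)
```

With a real data frame this would be used as `cache = ExprCache(df)` followed by `h = cache.histo1d('sqrt(X_PX*X_PX + X_PY*X_PY)')`; repeated calls with the same string reuse the same generated column.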

EDIT:
additional thoughts below.

eguiraud · November 27, 2018, 9:04am

Hi @mwilkins,
thanks for the feedback!
Indeed, we have wanted to implement this since even before RDataFrame was part of ROOT!

Then when you get to the details of designing the feature things get messier – no blockers, just annoyances:

- if you implement this for 1D histograms, it will be hard to justify why the functionality is not there also for 2D, 3D histograms and all other actions: df.Max("x.size() - y.size()"), df.Snapshot(..., {"x*x","y*y", "x*y*z"})
- the proper way to implement this for histograms under the hood is with a function that calculates the quantity on the fly and fills the histogram with it directly, avoiding the cost of the copy and indirection that Define brings with it
- the feature is easy to abuse: you don’t want to encourage users to define "myexpensivefunc(x)" in-place every time they use it – Defining it once avoids extra computation
- the performant way to do this is with lambda functions rather than just-in-time compiled strings: df.Histo1D([](int x, int y) { return x*y; }, {"x", "y"}), but this is so verbose that just using a Define does not seem so bad now…

So…since the functionality is there with just a few more keystrokes, we never got around to implementing this. It’s on the bucket list though! And now we know users also feel this would be nice to have.

Cheers,
Enrico

pcanal · November 27, 2018, 3:07pm

In addition, caching/defining all the values in memory (or even on disk) might be very expensive in memory and time, especially if you don’t reuse them.

mwilkins · November 27, 2018, 3:34pm

Glad to hear this is on your radar! A few further thoughts:

I think implementation for 2D and 3D would be a good idea as well, e.g., "x:y" a la TTree::Draw(). (You can see I really just like the TTree::Draw() syntax.)

I think this point is addressed by dropping the one before it: calling Define under the hood avoids such a problem. I think the added cost of “copy and indirection” is worth the ease of implementation, from a user perspective, since this is what users have to do now anyway.

Thanks for filling me in!

mwilkins · November 27, 2018, 3:35pm

If df.Histo1D('sqrt(X_PX*X_PX + X_PY*X_PY)') were implemented as a sort of backdoor to Define, would it really be much more expensive? Seems to me you only lose the time required for Histo1D to call Define (vs. calling it directly).

pcanal · November 27, 2018, 4:01pm

‘Define’ has a much higher cost than a histogram. For a histogram you only need to keep O(number of bins) in memory, while for Define you need to keep O(number of entries in the TTree) in memory … this latter number can be larger than your machine’s RAM in some cases.

mwilkins · November 27, 2018, 4:14pm

Ah I think I see what you’re saying. Still, users already have to do that, since we have to call Define in order to make the histogram.

eguiraud · November 27, 2018, 4:15pm

This is false: Define does not store all the results of its computations, just the last one.
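
The memory behaviour described here can be illustrated with a toy model in plain Python. It is not how RDataFrame is implemented; `define` and `histo1d` are stand-in functions written for this sketch, and the generated entries are made-up values. The point is only that a lazily computed column holds one value at a time, while the histogram holds a fixed number of bins.

```python
import math


def define(rows, name, func):
    """Lazily attach a computed column: each value exists only while its entry is processed."""
    for row in rows:
        yield {**row, name: func(row)}


def histo1d(rows, column, nbins, lo, hi):
    """Consume entries one at a time; only the bin counts stay in memory."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for row in rows:
        x = row[column]
        if lo <= x < hi:
            # min() guards against a rounding artefact at the upper edge.
            counts[min(int((x - lo) / width), nbins - 1)] += 1
    return counts


# A generator of made-up entries: nothing is materialised as a full column.
entries = ({"X_PX": float(i % 7), "X_PY": float(i % 5)} for i in range(100_000))
with_pt = define(entries, "pt_test", lambda r: math.sqrt(r["X_PX"] ** 2 + r["X_PY"] ** 2))
counts = histo1d(with_pt, "pt_test", nbins=8, lo=0.0, hi=8.0)
```

Memory use here is set by the 8 bins plus the entry currently being processed, regardless of how many entries flow through.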

pcanal · November 27, 2018, 4:38pm

Thanks for the clarification @eguiraud. I was confusing Define and Snapshot.

mwilkins · December 3, 2018, 10:22pm

Additionally, it would be nice to be able to specify a weight in a Filter using the string syntax, similar to TTree::Draw, i.e., (weight) * (x > 10).

eguiraud · December 3, 2018, 11:08pm

TTree::Draw mixes the concepts of filtering out events and weighted events – it can do that because in the context of TTree::Draw, where the only possible output is a histogram, an event with weight zero is effectively a filtered out event.
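
A small plain-Python sketch of that equivalence, with made-up values and a hypothetical `fill` helper: when the only output is a histogram, multiplying the weight by a boolean cut gives the same bin contents as filtering first and then filling with the weight.

```python
def fill(values, weights, nbins, lo, hi):
    """Fill a fixed-range histogram, adding each entry's weight to its bin."""
    bins = [0.0] * nbins
    width = (hi - lo) / nbins
    for x, w in zip(values, weights):
        if lo <= x < hi:
            bins[min(int((x - lo) / width), nbins - 1)] += w
    return bins


xs = [5.0, 12.0, 15.0, 25.0]
ws = [0.5, 2.0, 1.0, 3.0]

# TTree::Draw style: the selection "(weight) * (x > 10)" becomes a per-entry weight,
# so entries failing the cut contribute a weight of zero.
draw_style = fill(xs, [w * (x > 10) for x, w in zip(xs, ws)], 3, 0.0, 30.0)

# Explicit filter first, then a weighted fill of the surviving entries.
kept = [(x, w) for x, w in zip(xs, ws) if x > 10]
filter_style = fill([x for x, _ in kept], [w for _, w in kept], 3, 0.0, 30.0)
```

Both approaches yield identical bin contents, which is why the distinction between a cut and a weight never surfaces in a histogram-only interface.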

In RDataFrame things are trickier. Event weights should be per branch of the computation graph, possibly compounding over several Filter calls. Also I’m not sure, should weights only be applied to the values histograms are filled with (probably also TProfiles?) or also to Sum, Mean, Reduce, Aggregate (Foreach, Book…)?

A more humble proposal is in ROOT-9786, where we suggest that weighted Filters might produce useful cutflow reports, but say nothing about the interplay with Defines and actions.

mwilkins · December 4, 2018, 3:17am

From my perspective, a weighted Filter would make the most sense if its weight were applied in all contexts, thus creating a true weighted data frame.

Thanks for the info about the other proposal.

eguiraud · December 4, 2018, 9:29am

Uhm, wouldn’t for example Max("pt") return weird things if we scaled pt by the event weight?

beojan · December 4, 2018, 10:35am

Why would you scale p_T by the event weight there? Max would (should) treat all events equally, regardless of weight.

Weights are really only relevant for histograms.

> the performant way to do this is with lambda functions rather than just-in-time compiled strings

I’ve always thought it would make sense to remove the JITted strings entirely, other than for PyROOT. Obviously this only becomes reasonable if abbreviated lambdas make it into C++.

eguiraud · December 4, 2018, 10:58am

> Why would you scale pT by the event weight there? Max would (should) treat all events equally, regardless of weight. Weights are really only relevant for histograms.

Yes, precisely.
I’m afraid things would start getting too implicit if we started weighing column values for certain actions and not for others. If we introduced weighted filters in RDF as a separate transformation, though, we could think of an interface like this:

df.EventWeight("w*(x > 0)").EventWeight("w2").Histo1D("x", "rdfcumweight_");

where rdfcumweight_ is a column RDF provides which contains the product of all upstream weights.
Or also

df.WeightedFilter("w*(x > 0)").Histo1D("x", "rdfcumweight_");

to have weighted cutflow reports.
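
To make the compounding concrete, here is a toy model in plain Python. `WeightedFrame`, `event_weight` and `cumulative_weight` are hypothetical names invented for this sketch; nothing here is an existing RDataFrame interface. Each node carries the weight functions of everything upstream, histograms use their product, and `max` ignores weights altogether, as suggested earlier in the thread.

```python
class WeightedFrame:
    """Toy stand-in for a data frame node that tracks upstream event weights."""

    def __init__(self, rows, weights=()):
        self._rows = rows
        self._weights = weights  # per-row weight functions, in upstream order

    def event_weight(self, func):
        """Return a new node with one more weight; the parent node is unchanged."""
        return WeightedFrame(self._rows, self._weights + (func,))

    def cumulative_weight(self, row):
        """Product of all upstream weights (the role played by rdfcumweight_)."""
        w = 1.0
        for func in self._weights:
            w *= func(row)
        return w

    def histo1d(self, column, nbins, lo, hi):
        """Weighted fill: each entry contributes its cumulative weight."""
        bins = [0.0] * nbins
        width = (hi - lo) / nbins
        for row in self._rows:
            x = row[column]
            if lo <= x < hi:
                bins[min(int((x - lo) / width), nbins - 1)] += self.cumulative_weight(row)
        return bins

    def max(self, column):
        """Unweighted: every entry counts equally, whatever its weight."""
        return max(row[column] for row in self._rows)


rows = [
    {"x": 1.0, "w": 2.0, "w2": 0.5},
    {"x": -1.0, "w": 3.0, "w2": 1.0},
    {"x": 3.0, "w": 1.0, "w2": 4.0},
]
node = (
    WeightedFrame(rows)
    .event_weight(lambda r: r["w"] * (r["x"] > 0))
    .event_weight(lambda r: r["w2"])
)
hist = node.histo1d("x", nbins=2, lo=-4.0, hi=4.0)
```

Because each `event_weight` call returns a new node, two branches of the graph can carry different weights without affecting each other.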

beojan · December 4, 2018, 11:03am

> I’m afraid things would start getting too implicit if we started weighing column values for certain actions and not for others.

I don’t think so. The weights would only be applied for cutflow reports and histograms.
I think I prefer your EventWeight idea though, actually, since it means you can split out different weights and scale factors.

mwilkins · December 4, 2018, 12:55pm

I’m afraid I am not familiar enough with the various use-cases of Max, etc., to have much insight here. I was speaking to what I would expect out of a Filter syntax using * to declare weights, a la TTree::Draw: I would expect a filtered data frame to always behave like a filtered data frame in every context by default, just as (I think) they do now.

Perhaps I am the only person who would find such a thing convenient, but that is the behavior I would expect.

Regarding your proposed syntax, it seems a bit unwieldy to me personally; specifying "rdfcumweight_" after chaining weights in this manner feels redundant. The logic makes sense; it just feels like extra typing. Would a flag, something like Histo1D::UseRDFWeights(), be a viable alternative to always having to specify a particular column name?

tobychev · December 4, 2018, 4:59pm

In the first example, how could ROOT give a sensible name to the result of the arbitrary function being passed in?

Best to keep the explicit definition and naming, I think, and instead allow chaining such that something like the following makes sense:

In [9]: df_defi = df.Define('pt_test', 'sqrt(X_PX*X_PX + X_PY*X_PY)').Histo1D()

where the define adds the extra column and sets what is the latest column to be added which the Histo function then picks up. Also, as @eguiraud points out, it will take about three uses of a form where no new column is stored before someone is recalculating the same expensive function over and over. Eight uses before someone makes a thread complaining about the form being slow.

I guess a sensible compromise for the special case where what you actually want is just the histogram is

In [9]: df.DefineHisto1D('pt_test', 'sqrt(X_PX*X_PX + X_PY*X_PY)')
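
Something close to this compromise can already be written as a user-side function. A minimal sketch, assuming only the existing `Define` and `Histo1D` methods; `define_histo1d` is a hypothetical helper name, not part of ROOT:

```python
def define_histo1d(df, name, expr):
    """Define column `name` from `expr` and immediately book a 1D histogram of it."""
    # The column keeps its explicit, user-chosen name, so nothing is hidden.
    return df.Define(name, expr).Histo1D(name)
```

It would be called as `h = define_histo1d(df, 'pt_test', 'sqrt(X_PX*X_PX + X_PY*X_PY)')`. Note that the node returned by `Define` is discarded, so the new column cannot be reused by later calls.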

Otherwise you just end up in the Draw world with all its hidden variables and silent population of the global namespace. Or however that magic works.

To me it seems dangerous to add the Draw syntax to the DataFrame world; I have the impression that the point of DataFrames was to support a new way of doing computation with ROOT files.

Reimplementing the mysterious and abused magic of the Draw command seems counter to that. Personally I hoped that the introduction of Dataframes would stop people from making half their analysis in one single 500 character long line passed to Draw().

mwilkins · December 4, 2018, 6:18pm

Depends what you mean by sensible. As long as special characters (that are not allowed in column names) are assigned unique representations, it seems to me this is straightforward. If by “sensible” you mean “easily read at a glance by a human”, I don’t think that is possible, but if that’s what the user wants, they wouldn’t be defining a column this way.
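
One way to read "unique representations" is a reversible escaping scheme. The sketch below is an assumption about how such naming could work, not a description of anything in ROOT; `mangle`, `demangle` and the `expr_` prefix are hypothetical. Every character that is not an ASCII letter or digit, including the underscore itself, is replaced by its hexadecimal code between underscores, so two different expressions can never produce the same name.

```python
PREFIX = "expr_"


def mangle(expr):
    """Turn an arbitrary expression into a valid identifier, one-to-one."""
    parts = []
    for ch in expr:
        if ch.isascii() and ch.isalnum():
            parts.append(ch)
        else:
            # Escape everything else, underscores included, as _<hex code>_.
            parts.append("_{:x}_".format(ord(ch)))
    return PREFIX + "".join(parts)


def demangle(name):
    """Recover the original expression from a mangled column name."""
    body = name[len(PREFIX):]
    out = []
    i = 0
    while i < len(body):
        if body[i] == "_":
            j = body.index("_", i + 1)  # closing underscore of the escape
            out.append(chr(int(body[i + 1:j], 16)))
            i = j + 1
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

For example, `mangle("x*y")` gives `expr_x_2a_y`. The names are unique and machine-recoverable but, as noted above, not pleasant to read.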

I actually dislike the “most recent” approach. Too reminiscent of cd, which I thought ROOT 7 was trying to get away from.

It’s always nice to have multiple ways of doing things.