Presently, using RDataFrame, if I want to make a histogram from a function of other columns, I have to Define a new column first. It would be useful, for me anyway, if I could define a new column on the fly while calling, e.g., Histo1D. ROOT could define a column name based on the passed function, then create (if necessary) and refer to this column in the Histo1D call.
For example, instead of:
In [9]: df_defi = df.Define('pt_test', 'sqrt(X_PX*X_PX + X_PY*X_PY)')
In [10]: h_defi = df_defi.Histo1D('pt_test')
I would like to do:
In [11]: h = df.Histo1D('sqrt(X_PX*X_PX + X_PY*X_PY)')
In the case of In [11], ROOT would create a new column in the background (with some unique name), much as in In [9] above, before doing Histo1D as normal. Any future calls with 'sqrt(X_PX*X_PX + X_PY*X_PY)' would refer to this same column (since the same name would be generated).
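A minimal sketch of how that "unique name" could work, assuming a hash-based naming scheme (this is not ROOT's actual implementation; `column_name_for` and `histo1d` are hypothetical helpers): the same expression string always maps to the same generated name, so repeated calls reuse one cached column.

```python
import hashlib

def column_name_for(expression: str) -> str:
    # Hash the expression so the same string always maps to the same
    # auto-generated column name; the prefix marks it as internal.
    digest = hashlib.sha1(expression.encode("utf-8")).hexdigest()[:12]
    return f"__rdf_expr_{digest}"

defined = {}  # generated column name -> expression (one registry per data frame)

def histo1d(df_defines, expression):
    # Define the helper column only once; later calls with the same
    # expression string reuse it.
    name = column_name_for(expression)
    if name not in df_defines:
        df_defines[name] = expression  # stand-in for df.Define(name, expression)
    return name

n1 = histo1d(defined, 'sqrt(X_PX*X_PX + X_PY*X_PY)')
n2 = histo1d(defined, 'sqrt(X_PX*X_PX + X_PY*X_PY)')
# n1 == n2 and only one entry in `defined`: the column is defined once.
```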
This would provide functionality similar to TTree::Draw.
Then when you get to the details of designing the feature, things get messier. No blockers, just annoyances:
if you implement this for 1D histograms, it will be hard to justify why the functionality is not also there for 2D and 3D histograms and all other actions: df.Max("x.size() - y.size()"), df.Snapshot(..., {"x*x", "y*y", "x*y*z"})
the proper way to implement this for histograms under the hood is with a function that calculates the quantity on the fly and fills the histogram with it directly, avoiding the cost of the copy and indirection that Define brings with it
the feature is easy to abuse: you don't want to encourage users to define "myexpensivefunc(x)" in place every time they use it; calling Define once avoids the extra computation
the performant way to do this is with lambda functions rather than just-in-time compiled strings: df.Histo1D([](int x, int y) { return x*y; }, {"x", "y"}), but this is so verbose that just using a Define does not seem so bad now…
So… since the functionality is there with just a few more keystrokes, we never got around to implementing this. It's on the bucket list though! And now we know users also feel this would be nice to have.
In addition, caching/defining all the values in memory (or even on disk) might be very expensive in both memory and time (especially if you don't reuse them).
Glad to hear this is on your radar! A few further thoughts:
I think implementing this for 2D and 3D would be a good idea as well, e.g. "x:y" a la TTree::Draw(). (You can see I really just like the TTree::Draw() syntax.)
I think this point is addressed by dropping the one before it: calling Define under the hood avoids such a problem. I think the added cost of "copy and indirection" is worth the ease of implementation, from a user perspective, since this is what users have to do now anyway.
If df.Histo1D('sqrt(X_PX*X_PX + X_PY*X_PY)') were implemented as a sort of backdoor to Define, would it really be much more expensive? Seems to me you only lose the time required for Histo1D to call Define (vs. calling it directly).
Define has a cost much higher than a histogram. For a histogram you only need to keep in memory O(number of bins), while for Define you need to keep in memory O(number of entries in the TTree)… this latter number can be larger than your machine's RAM in some cases.
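The O(bins) vs. O(entries) distinction can be illustrated in plain Python (a sketch, not ROOT code): a streaming fill keeps only the bin counters alive, while a Define-style cache materializes every entry first.

```python
def fill_streaming(values, nbins, lo, hi):
    # O(nbins) memory, independent of the number of entries; `values`
    # can be a generator, so nothing beyond the counters is stored.
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        b = int((v - lo) / width)
        if 0 <= b < nbins:
            counts[b] += 1
    return counts

def fill_via_cache(values, nbins, lo, hi):
    # O(nentries) memory: the whole column lives in RAM, like a cached
    # Define result, before the histogram is filled.
    cached = list(values)
    return fill_streaming(cached, nbins, lo, hi)

# 10000 entries pass through, but fill_streaming only ever holds 10 counters.
counts = fill_streaming((x * 0.001 for x in range(10000)), nbins=10, lo=0.0, hi=10.0)
```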
TTree::Draw mixes the concepts of filtering out events and weighting events; it can do that because in the context of TTree::Draw, where the only possible output is a histogram, an event with weight zero is effectively a filtered-out event.
In RDataFrame things are trickier. Event weights should be per branch of the computation graph, possibly compounding over several Filter calls. Also, I'm not sure: should weights only be applied to the values histograms are filled with (probably also TProfiles?), or also to Sum, Mean, Reduce, Aggregate (Foreach, Book…)?
A more humble proposal is in ROOT-9786, where we suggest that weighted Filters might produce useful cutflow reports, but say nothing about the interplay with Defines and actions.
Why would you scale pT by the event weight there? Max would (should) treat all events equally, regardless of weight.
Weights are really only relevant for histograms.
the performant way to do this is with lambda functions rather than just-in-time compiled strings
I've always thought it would make sense to remove the JITted strings entirely, other than for PyROOT. Obviously this only becomes reasonable if abbreviated lambdas make it into C++.
Why would you scale pT by the event weight there? Max would (should) treat all events equally, regardless of weight. Weights are really only relevant for histograms.
Yes, precisely.
I'm afraid things would start getting too implicit if we started weighting column values for certain actions and not for others. If we introduced weighted filters in RDF as a separate transformation, though, we could think of an interface like this:
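(The original snippet was lost here; judging from the reply below, it involved weighted Filters compounding into a cumulative-weight column named "rdfcumweight_". A hypothetical reconstruction, not an actual RDF API:)

```
// hypothetical sketch only; the weight argument and the
// auto-generated "rdfcumweight_" column are not real RDF features
df.Filter("x > 0", /*weight=*/"w1")
  .Filter("y > 0", /*weight=*/"w2")
  .Histo1D("pt", "rdfcumweight_");  // filled with weight w1*w2
```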
I'm afraid things would start getting too implicit if we started weighting column values for certain actions and not for others.
I donāt think so. The weights would only be applied for cutflow reports and histograms.
I think I prefer your EventWeight idea though, actually, since it means you can split out different weights and scale factors.
I'm afraid I am not familiar enough with the various use cases of Max, etc., to have much insight here. I was speaking to what I would expect out of a Filter syntax using * to declare weights, a la TTree::Draw: I would expect a filtered data frame to always behave like a filtered data frame in every context by default, just as (I think) they do now.
Perhaps I am the only person who would find such a thing convenient, but that is the behavior I would expect.
Regarding your proposed syntax, it seems a bit unwieldy to me personally; specifying "rdfcumweight_" after chaining weights in this manner feels redundant. The logic makes sense; it just feels like extra typing. Would a flag, something like Histo1D::UseRDFWeights(), instead of always having to specify a particular column name, be a viable alternative?
In the first example, how could ROOT give a sensible name to the result of the arbitrary function being passed in?
Best to keep the explicit definition and naming, I think, and instead allow chaining such that something like the following makes sense:
In [9]: df_defi = df.Define('pt_test', 'sqrt(X_PX*X_PX + X_PY*X_PY)').Histo1D()
where the Define adds the extra column and marks it as the latest column added, which the Histo function then picks up. Also, as @eguiraud points out, it will take about three uses of a form where no new column is stored before someone is recalculating the same expensive function over and over, and eight uses before someone makes a thread complaining about the form being slow.
I guess a sensible compromise for the special case where what you actually want is just the histogram is:
In [9]: df.DefineHisto1D('pt_test', 'sqrt(X_PX*X_PX + X_PY*X_PY)')
Otherwise you just end up in the Draw world with all its hidden variables and silent population of the global namespace. Or however that magic works.
To me it seems dangerous to add the Draw syntax to the DataFrame world; I have the impression that the point of data frames was to support a new way of doing computation with ROOT files.
Reimplementing the mysterious and abused magic of the Draw command seems counter to that. Personally, I hoped that the introduction of data frames would stop people from putting half their analysis in one single 500-character line passed to Draw().
Depends what you mean by sensible. As long as special characters (that are not allowed in column names) are assigned unique representations, it seems to me this is straightforward. If by "sensible" you mean "easily read at a glance by a human", I don't think that is possible, but if that's what the user wants, they wouldn't be defining a column this way.
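For concreteness, here is one way such a character-to-representation mapping could look (a sketch; the `sanitize` helper and its replacement table are made up for illustration): each special character gets a unique textual stand-in, so distinct expressions yield distinct, valid, reproducible column names.

```python
# Map characters that are illegal in column names to unique textual
# representations; alphanumerics and underscores pass through unchanged.
_REPL = {
    '*': '_mul_', '+': '_add_', '-': '_sub_', '/': '_div_',
    '(': '_lp_', ')': '_rp_', '.': '_dot_', ',': '_c_', ' ': '',
}

def sanitize(expression: str) -> str:
    return ''.join(_REPL.get(ch, ch) for ch in expression)

name = sanitize('sqrt(X_PX*X_PX + X_PY*X_PY)')
# -> 'sqrt_lp_X_PX_mul_X_PX_add_X_PY_mul_X_PY_rp_'
```

Not easily read at a glance, as noted, but deterministic and collision-free as long as the replacement strings cannot be confused with ordinary identifier fragments.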
I actually dislike the "most recent" approach. Too reminiscent of cd, which I thought ROOT 7 was trying to get away from.
Itās always nice to have multiple ways of doing things