Create Histogram of composite variable in RDataframe

avilla · October 22, 2021, 2:03pm

Hi,
I am curious about the Histo1D function of the RDataFrame class.
As I understand it, if my dataframe has a variable called D0_TAU, the only way to make a histogram of its logarithm would be:

histo = df.Define("log_D0_TAU", "log(D0_TAU)").Histo1D((name, title, bins, min, max), "log_D0_TAU")

which is fine, but cumbersome when I am making several histograms from a list of variables, some of which need the log, while some don’t.
My question is: is there a technical reason why RDataFrame::Histo1D cannot interpret composite variables the same way for example TTree::Draw does?

eguiraud · October 22, 2021, 3:23pm

Hi @avilla ,
there is no technical reason; it was a difficult, much-discussed design choice (and it’s still being discussed).

The upside is obvious: less typing.

The downsides are sneaky. Here are some I could think of:

It encourages users to write less efficient code:

df.Histo1D("log(D0_TAU)", "weight1")
df.Histo1D("log(D0_TAU)", "weight2")

now will evaluate the logarithm twice, while

auto df2 = df.Define("logd0", "log(D0_TAU)");
df2.Histo1D("logd0", "weight1");
df2.Histo1D("logd0", "weight2");

only evaluates the logarithm once per event as the dependency is clear.
There is some machinery to go from the string expression to executable code that would also have to run twice rather than once.
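For a list of variables where only some need a transformation, the "define once, reuse everywhere" pattern can be wrapped in a small helper. This is only a sketch in Python: define_once and the derived_N column names are hypothetical (not part of RDataFrame), and it assumes no existing column is already called derived_N. The only RDataFrame call it relies on is Define(name, expression).

```python
import re

def define_once(df, expressions):
    """Return (df, columns) with one Define per distinct non-trivial expression.

    columns maps each input string to the column name to pass to Histo1D.
    """
    columns = {}
    for expr in expressions:
        if expr in columns:
            continue  # already handled, do not define it a second time
        if re.fullmatch(r"[A-Za-z_]\w*", expr):
            columns[expr] = expr  # plain column name, nothing to define
        else:
            name = "derived_%d" % len(columns)  # hypothetical naming scheme
            df = df.Define(name, expr)
            columns[expr] = name
    return df, columns

# Usage with a real RDataFrame (needs ROOT, so left as comments; binning is made up):
# df, cols = define_once(df, ["log(D0_TAU)", "D0_TAU"])
# h1 = df.Histo1D(("h1", "log tau, w1", 50, -10., 0.), cols["log(D0_TAU)"], "weight1")
# h2 = df.Histo1D(("h2", "log tau, w2", 50, -10., 0.), cols["log(D0_TAU)"], "weight2")
```

Both histograms then read the same defined column, so the logarithm is evaluated once per event.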

It removes the clear separation between where you can put an expression and where you can put a column name (or list thereof):

df.Histo1D("sqrt(x)", "weight > 10 ? 10 : weight")

is nice, but

df.Snapshot(..., {"sqrt(x)", "x*x"})

does not make sense (we wouldn’t know what names to give the output columns). So now you have some places where you can use either column names or expressions and some others where it has to be a plain column name.
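With the current design, the Snapshot case is handled by naming each expression explicitly through Define and then passing those names. A rough Python sketch: snapshot_derived is a hypothetical helper, and it assumes a PyROOT version that accepts a Python list of strings as the Snapshot column list.

```python
def snapshot_derived(df, tree_name, file_name, named_expressions):
    """Define each name -> expression pair, then snapshot exactly those columns."""
    names = []
    for name, expr in named_expressions.items():
        df = df.Define(name, expr)  # the output column name is now explicit
        names.append(name)
    return df.Snapshot(tree_name, file_name, names)

# Usage with a real RDataFrame (needs ROOT; tree and file names are made up):
# snapshot_derived(df, "tree", "out.root", {"sqrt_x": "sqrt(x)", "x2": "x*x"})
```

The caller picks the output column names, which is exactly the information the string expressions alone cannot provide.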

Not the most important reason, but it would complicate the internals.

So that’s why I’m not completely sold on the idea…doesn’t mean I’m right, but it’s not simple either.
Cheers,
Enrico

avilla · October 22, 2021, 6:05pm

Hi @eguiraud,
thanks for the reply, you gave a clear picture of the arguments behind this decision.
My opinion is that I would always prefer flexibility and convenience over having the risk of doing something wrong (like your Snapshot example).
On the other hand, I understand why you may want to keep things safe and prevent the user from falling into these traps, so in the end it’s probably for the best to have it designed this way.
Anyway, since there is a way of doing this that only needs some more typing, I won’t complain about that.
Thanks again and have a nice weekend,
Andrea