Create Histogram of composite variable in RDataframe

eguiraud · October 22, 2021, 3:23pm

Hi @avilla ,
there is no technical reason; it was a difficult, much-discussed design choice (and it’s still being discussed).

The upside is obvious: less typing.

The downsides are sneaky. Here are some I could think of:

It encourages users to write less efficient code:

df.Histo1D("log(D0_TAU)", "weight1");
df.Histo1D("log(D0_TAU)", "weight2");

will now evaluate the logarithm twice per event, while

auto df2 = df.Define("logd0", "log(D0_TAU)");
df2.Histo1D("logd0", "weight1");
df2.Histo1D("logd0", "weight2");

only evaluates the logarithm once per event, because both histograms explicitly depend on the same defined column.
The machinery that turns a string expression into executable code would also have to run twice rather than once.
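
As a rough illustration of the point above, here is a toy sketch in plain C++ (no ROOT involved). It is not how RDataFrame is implemented; `Tally`, `fill_inline` and `fill_defined` are hypothetical names, and the weighted sums merely stand in for the two histograms. The only thing it shows is the difference in how often `log` gets evaluated.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy model only: this is NOT RDataFrame's implementation.
// Tally, fill_inline and fill_defined are hypothetical names.
struct Tally {
    double sum1 = 0.0;          // stands in for the histogram weighted by weight1
    double sum2 = 0.0;          // stands in for the histogram weighted by weight2
    std::size_t log_calls = 0;  // number of times log() was evaluated
};

// Mimics two results that each carry their own "log(D0_TAU)" expression:
// nothing ties the two expressions together, so each is evaluated separately.
Tally fill_inline(const std::vector<double>& tau,
                  const std::vector<double>& w1,
                  const std::vector<double>& w2) {
    assert(tau.size() == w1.size() && tau.size() == w2.size());
    Tally t;
    for (std::size_t i = 0; i < tau.size(); ++i) {
        t.sum1 += std::log(tau[i]) * w1[i];
        ++t.log_calls;
        t.sum2 += std::log(tau[i]) * w2[i];
        ++t.log_calls;
    }
    return t;
}

// Mimics Define("logd0", ...) followed by two results that read "logd0":
// the value is computed once per event and shared by both results.
Tally fill_defined(const std::vector<double>& tau,
                   const std::vector<double>& w1,
                   const std::vector<double>& w2) {
    assert(tau.size() == w1.size() && tau.size() == w2.size());
    Tally t;
    for (std::size_t i = 0; i < tau.size(); ++i) {
        const double logd0 = std::log(tau[i]);
        ++t.log_calls;
        t.sum1 += logd0 * w1[i];
        t.sum2 += logd0 * w2[i];
    }
    return t;
}
```

For N events, `fill_inline` reports 2N calls to `log` and `fill_defined` reports N, while both produce the same sums.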

It removes the clear separation between where you can put an expression and where you can put a column name (or list thereof):

df.Histo1D("sqrt(x)", "weight > 10 ? 10 : weight");

is nice, but

df.Snapshot(..., {"sqrt(x)", "x*x"});

does not make sense (RDataFrame wouldn’t know what names to give the output columns). So you would end up with some places where you can use either column names or expressions, and others where it has to be a plain column name.
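
To make the distinction concrete, here is a small sketch in plain C++ of the kind of check that separates a plain column name from an expression. `looks_like_column_name` is a hypothetical helper, not part of ROOT, and it assumes that a plain column name looks like a C++ identifier, which is narrower than what real datasets may contain.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of ROOT. Assumption: a plain column name
// looks like a C++ identifier (a letter or underscore, followed by letters,
// digits or underscores). Anything else is treated as an expression.
bool looks_like_column_name(const std::string& s) {
    if (s.empty()) return false;
    const unsigned char first = static_cast<unsigned char>(s[0]);
    if (!std::isalpha(first) && first != '_') return false;
    for (const char c : s) {
        const unsigned char u = static_cast<unsigned char>(c);
        if (!std::isalnum(u) && u != '_') return false;
    }
    return true;
}
```

Under that assumption, "x" and "logd0" pass the check and could be used directly as output column names, while "sqrt(x)" and "x*x" do not, so something would have to invent names for them.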

Not the most important reason, but it would complicate the internals.

So that’s why I’m not completely sold on the idea. That doesn’t mean I’m right, but it’s not simple either.
Cheers,
Enrico