To those writing an analysis out there: we’re discussing the following for ROOT’s new histograms, and we want to hear your opinions on how terrible it is
Currently, people do:
TH1F all(...);
TH1F pass(...);
for (...) {
  all.Fill(jet.pT());
  if (trigger_selection(jet))
    pass.Fill(jet.pT());
}
pass.Divide(&pass, &all, 1., 1., "B"); // "B" = binomial errors; or use TEfficiency
For RHist we will have integer-type axes (e.g. to fill “number of jets”) - and it would be logical to also have a boolean axis. E.g.
RHist<2, float, bool> hist(...);
for (...) {
hist.Fill(jet.pT(), trigger_selection(jet));
}
auto eff = DivideBinomial(hist);
This would actually be faster: one Fill per jet instead of one to two. Would that be awkward to use? Or do you think you could get used to it?
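To make the semantics concrete, here is a minimal, ROOT-free sketch of what a boolean axis would mean: one fill call routes each entry into a "pass" or "fail" count per pT bin, and the efficiency is derived afterwards. The names (BoolAxisHist, DivideBinomial) and the flat-array layout are purely illustrative, not the actual RHist design, and real axes would also have under/overflow bins and binomial uncertainties.

```cpp
#include <cstddef>
#include <vector>

// Toy model of a 1D float axis plus a boolean axis: two parallel bin
// arrays, one for trigger_selection == false, one for == true.
struct BoolAxisHist {
  double lo, hi;
  std::vector<double> fail, pass;

  BoolAxisHist(std::size_t nbins, double lo_, double hi_)
      : lo(lo_), hi(hi_), fail(nbins), pass(nbins) {}

  std::size_t BinIndex(double x) const {
    // clamp into range for simplicity; real axes keep under/overflow bins
    if (x <= lo) return 0;
    if (x >= hi) return fail.size() - 1;
    return static_cast<std::size_t>((x - lo) / (hi - lo) * fail.size());
  }

  // One Fill per jet, regardless of whether the jet passed
  void Fill(double x, bool passed) { (passed ? pass : fail)[BinIndex(x)] += 1.; }
};

// Per-bin efficiency pass / (pass + fail); a real DivideBinomial would
// also propagate binomial (e.g. Clopper-Pearson) uncertainties.
std::vector<double> DivideBinomial(const BoolAxisHist& h) {
  std::vector<double> eff(h.pass.size(), 0.);
  for (std::size_t i = 0; i < eff.size(); ++i) {
    const double all = h.pass[i] + h.fail[i];
    if (all > 0) eff[i] = h.pass[i] / all;
  }
  return eff;
}
```

The point of the exercise: the bin lookup and the pass/fail decision happen in a single call, so the "all" and "pass" bookkeeping can never get out of sync.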
This looks good. I have a question about something else that would be nice and efficient (and is perhaps already on a roadmap): what about support for multiple weight variations? Since CMS is emphasizing weight-based systematic variations in MC, it can easily happen that we have to fill dozens of histograms with the same x, y, z content and only the weight changing. I don’t know how (in)efficient RDataFrame is when creating all these histograms, but I’d imagine a much more efficient version could be filled with a signature like the one below in an old-style event loop…
Yes, I think that would be useful (and less awkward than keeping track of two histograms).
The idea by @nmangane is also interesting (weight-based systematics are very easy to do in RDataFrame, but we tend to have a lot of them, for many histograms, so even a small gain in storage or computing time may be worth it).
We already foresaw keeping track of multiple uncertainties. The systematics layer is a great idea - we were thinking of plugging this into RDataFrame, but it indeed has a data-storage side too, and I like your idea of a systematics “axis”. The bulk of histogram-filling time goes into the bin calculation, and that needs to be done only once for a whole set of systematics, i.e. this is a real usability and performance improvement!
I’ll have to think about whether that’s better modeled as a “wrapper” layer around N histograms, or whether histograms should support it themselves. So far I’m leaning towards a “wrapper object”, also because moments need to be kept per weight dimension. Opinions?
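For discussion, here is a rough sketch of what the “wrapper object” option could look like, assuming a simple linear axis: the bin is located once per Fill, and sum(w) and sum(w^2) are kept per weight variation so each systematic histogram retains its own statistical moments. All names and the storage layout here are made up for illustration; this is not an actual RHist interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical systematics wrapper: one shared axis, N weight variations.
struct SystematicsHist {
  double lo, hi;
  std::size_t nbins, nvars;
  std::vector<double> sumw;  // size nbins * nvars, variations contiguous per bin
  std::vector<double> sumw2; // same layout; per-variation moments for errors

  SystematicsHist(std::size_t nbins_, double lo_, double hi_, std::size_t nvars_)
      : lo(lo_), hi(hi_), nbins(nbins_), nvars(nvars_),
        sumw(nbins_ * nvars_), sumw2(nbins_ * nvars_) {}

  void Fill(double x, const std::vector<double>& weights) {
    // The expensive part - locating the bin - happens once per Fill,
    // not once per systematic variation.
    const std::size_t bin =
        x <= lo ? 0
        : x >= hi ? nbins - 1
                  : static_cast<std::size_t>((x - lo) / (hi - lo) * nbins);
    for (std::size_t v = 0; v < weights.size(); ++v) {
      sumw[bin * nvars + v] += weights[v];
      sumw2[bin * nvars + v] += weights[v] * weights[v];
    }
  }

  double Content(std::size_t bin, std::size_t var) const {
    return sumw[bin * nvars + var];
  }
};
```

A caller would pass e.g. {nominal, scale-up, scale-down} weights per event; whether this lives as a wrapper around N histograms or inside the histogram itself mainly changes who owns the per-variation moment arrays.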