ROOT's new histograms and efficiency


To those writing an analysis out there: we’re discussing the following for ROOT’s new histograms, and we want to hear your opinions on how terrible it is :)

Currently, people do:

TH1F all(...);
TH1F pass(...);

for (...) {
   all.Fill(jet.pT());
   if (trigger_selection(jet))
      pass.Fill(jet.pT());
}
pass.Divide(all, "binomial"); // or TEfficiency

For RHist we will have integer-type axes (e.g. to fill “number of jets”) - and it would be logical to also have a boolean axis. E.g.

RHist<2, float, bool> hist(...);
for (...) {
   hist.Fill(jet.pT(), trigger_selection(jet));
}
auto eff = DivideBinomial(hist);

This would actually be faster: one Fill per jet instead of one to two. Would that be awkward to use? Or do you think you could get used to it?
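To make the storage idea concrete, here is a minimal standalone sketch of what a boolean axis amounts to: every pT bin holds a "fail" and a "pass" cell, so each entry touches exactly one cell. `BoolAxisHist`, `Efficiency` and `divideBinomial` are hypothetical names for illustration only, not the RHist interface, and the error is the plain normal-approximation binomial one.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2D histogram with a float axis and a bool axis.
// This is NOT the RHist interface, only an illustration of the storage layout.
class BoolAxisHist {
public:
   BoolAxisHist(std::size_t nBins, double lo, double hi)
      : fLo(lo), fHi(hi), fCounts(2 * nBins, 0.0)
   {
      assert(nBins > 0 && hi > lo);
   }

   std::size_t NBins() const { return fCounts.size() / 2; }

   // One call per entry: the bool selects the "fail" (0) or "pass" (1) cell.
   void Fill(double x, bool passed, double weight = 1.0)
   {
      if (x < fLo || x >= fHi)
         return; // this sketch has no under/overflow bins
      const std::size_t bin = std::min(
         static_cast<std::size_t>((x - fLo) / (fHi - fLo) * NBins()), NBins() - 1);
      fCounts[2 * bin + (passed ? 1 : 0)] += weight;
   }

   double Passed(std::size_t bin) const { return fCounts[2 * bin + 1]; }
   double Total(std::size_t bin) const { return fCounts[2 * bin] + fCounts[2 * bin + 1]; }

private:
   double fLo, fHi;
   std::vector<double> fCounts; // layout: [fail0, pass0, fail1, pass1, ...]
};

struct Efficiency {
   double value = 0.0;
   double error = 0.0;
};

// Per-bin efficiency pass/total with the simple binomial error sqrt(e(1-e)/n).
// Only meaningful for unweighted fills; empty bins are left at zero.
std::vector<Efficiency> divideBinomial(const BoolAxisHist &h)
{
   std::vector<Efficiency> eff(h.NBins());
   for (std::size_t i = 0; i < h.NBins(); ++i) {
      const double n = h.Total(i);
      if (n <= 0.0)
         continue;
      const double e = h.Passed(i) / n;
      eff[i].value = e;
      eff[i].error = std::sqrt(e * (1.0 - e) / n);
   }
   return eff;
}
```

In an event loop this would be used as `hist.Fill(jet.pT(), trigger_selection(jet));` followed by a single `divideBinomial(hist)` at the end, which mirrors the snippet above.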

Cheers, Axel.

Hi Axel,

Where are the weights in all these examples? Will they be supported, too? We almost never have the privilege of filling without a weight; it’s always

all.Fill(jet.pT(), some_weight);
pass.Fill(jet.pT(), some_weight);

Sure, weights are there, just like before. So, back to the original question: bool axis okay? Convincing arguments?

Yes, absolutely!

This looks good. I have a question about something else that would be nice and efficient (and perhaps is already on a roadmap): what about support for multiple weight variations? Since CMS is emphasizing systematic variations using weights in MC, it can easily be the case that we have to fill dozens of histograms with the same x, y, z content and only a change in weight. I don’t know how (in)efficient RDataFrame is when creating all these histograms, but I’d figure there’s a much more efficient version that could be filled with a signature like the one below in an old-style event loop…

    std::map<std::string, float> weights;
    weights["Nominal"] = 0.36;
    weights["PileupDown"] = 0.34;
    weights["PileupUp"] = 0.39;
    weights["FactorizationUp"] = 0.4;
    hist.Fill(jet.pT(), trigger_selection(jet), weights);
}//end loop

So: a single histogram storing a vector of weights per bin, whether or not it actually takes a std::map as input.
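A rough standalone sketch of that idea, under loud assumptions: `MultiWeightHist` is a hypothetical class (nothing like it is claimed to exist in ROOT), the variation names are fixed at construction, and the bool axis is left out to keep it short. The point it illustrates is that the bin index is found once per fill and then every variation's sum is updated in the same bin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical 1D histogram keeping one weight sum per (bin, variation).
class MultiWeightHist {
public:
   MultiWeightHist(std::size_t nBins, double lo, double hi, std::vector<std::string> variations)
      : fNBins(nBins), fLo(lo), fHi(hi), fNames(std::move(variations)),
        fSumW(nBins * fNames.size(), 0.0)
   {
      assert(nBins > 0 && hi > lo && !fNames.empty());
   }

   void Fill(double x, const std::map<std::string, float> &weights)
   {
      if (x < fLo || x >= fHi)
         return; // this sketch has no under/overflow bins
      // The bin lookup happens once, however many variations there are.
      const std::size_t bin = std::min(
         static_cast<std::size_t>((x - fLo) / (fHi - fLo) * fNBins), fNBins - 1);
      for (std::size_t v = 0; v < fNames.size(); ++v) {
         const auto it = weights.find(fNames[v]);
         if (it != weights.end())
            fSumW[bin * fNames.size() + v] += it->second;
      }
   }

   // Sum of weights for one variation in one bin.
   double SumW(std::size_t bin, const std::string &name) const
   {
      const auto it = std::find(fNames.begin(), fNames.end(), name);
      assert(bin < fNBins && it != fNames.end());
      return fSumW[bin * fNames.size() + static_cast<std::size_t>(it - fNames.begin())];
   }

private:
   std::size_t fNBins;
   double fLo, fHi;
   std::vector<std::string> fNames; // variation names, fixed at construction
   std::vector<double> fSumW;       // layout: bin-major, variation-minor
};
```

The per-fill map lookup is only there to match the signature above; a positional vector of weights in a fixed order would presumably be cheaper in a hot loop.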

Yes, I think that would be useful (and less awkward than keeping track of two histograms).

The idea by @nmangane is also interesting (weight-based systematics are very easy to do in RDataFrame, but we tend to have a lot of them, for many histograms, so even a small gain in storage or computing time may be worth it).

We already foresaw keeping track of multiple uncertainties. The systematics layer is a great idea - we were thinking of plugging this into RDataFrame, but this has indeed a data-storage side, too, and I like your idea of a systematics “axis”. The bulk of hist filling goes into bin calculation, and that needs to be done only once for a set of systematics, i.e. this is a real usability and perf improvement!

I’ll have to think whether that’s better modeled as a “wrapper” layer around N histograms, or whether histograms should support that themselves. So far I’m leaning towards “wrapper object”, also because moments need to be kept per weight dimension. Opinions?
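For discussion, a minimal sketch of what the "wrapper object" option could look like, assuming each variation is its own small histogram with its own moments. `Hist1D` and `VariationSet` are hypothetical names, not existing or planned ROOT classes; the wrapper computes the bin once and forwards it, so the per-variation cost is just the additions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal 1D histogram that also tracks weighted moments of x.
struct Hist1D {
   std::vector<double> binContent;
   double sumW = 0.0, sumWX = 0.0, sumWX2 = 0.0;

   explicit Hist1D(std::size_t nBins) : binContent(nBins, 0.0) {}

   // Takes a precomputed bin index, so the caller pays for binning only once.
   void FillBin(std::size_t bin, double x, double w)
   {
      binContent[bin] += w;
      sumW += w;
      sumWX += w * x;
      sumWX2 += w * x * x;
   }

   double Mean() const { return sumW != 0.0 ? sumWX / sumW : 0.0; }
};

// Wrapper around N histograms sharing one axis, one histogram per weight variation.
class VariationSet {
public:
   VariationSet(std::size_t nVariations, std::size_t nBins, double lo, double hi)
      : fNBins(nBins), fLo(lo), fHi(hi), fHists(nVariations, Hist1D(nBins))
   {
      assert(nVariations > 0 && nBins > 0 && hi > lo);
   }

   // weights[i] is the weight for variation i; the order is fixed by the caller.
   void Fill(double x, const std::vector<double> &weights)
   {
      assert(weights.size() == fHists.size());
      if (x < fLo || x >= fHi)
         return; // this sketch has no under/overflow bins
      const std::size_t bin = std::min(
         static_cast<std::size_t>((x - fLo) / (fHi - fLo) * fNBins), fNBins - 1);
      for (std::size_t i = 0; i < fHists.size(); ++i)
         fHists[i].FillBin(bin, x, weights[i]);
   }

   const Hist1D &At(std::size_t i) const { return fHists.at(i); }

private:
   std::size_t fNBins;
   double fLo, fHi;
   std::vector<Hist1D> fHists;
};
```

Because every `Hist1D` owns its moments, each variation keeps a correct mean without the wrapper knowing anything about statistics. The cost is that the variations sit in separate buffers rather than next to each other per bin.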