To those writing an analysis out there: we’re discussing the following for ROOT’s new histograms, and we want to hear your opinions on how terrible it is
Currently, people do:
TH1F all(...);
TH1F pass(...);
for (...) {
  all.Fill(jet.pT());
  if (trigger_selection(jet))
    pass.Fill(jet.pT());
}
pass.Divide(&pass, &all, 1., 1., "B"); // "B" = binomial errors; or use TEfficiency
For RHist we will have integer-type axes (e.g. to fill “number of jets”) - and it would be logical to also have a boolean axis. E.g.
RHist<2, float, bool> hist(...);
for (...) {
hist.Fill(jet.pT(), trigger_selection(jet));
}
auto eff = DivideBinomial(hist);
This would actually be faster: one Fill per jet instead of one to two. Would that be awkward to use? Or do you think you could get used to it?
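To make the semantics concrete, here is a minimal, ROOT-free sketch of what a boolean axis would mean: one fill call routes each entry into a "pass" or "fail" count per pT bin, and the efficiency is derived afterwards. The names (BoolAxisHist, DivideBinomial) and the flat-array layout are purely illustrative, not the actual RHist design, and real axes would also have under/overflow bins and binomial uncertainties.

```cpp
#include <cstddef>
#include <vector>

// Toy model of a 1D float axis plus a boolean axis: two parallel bin
// arrays, one for trigger_selection == false, one for == true.
struct BoolAxisHist {
  double lo, hi;
  std::vector<double> fail, pass;

  BoolAxisHist(std::size_t nbins, double lo_, double hi_)
      : lo(lo_), hi(hi_), fail(nbins), pass(nbins) {}

  std::size_t BinIndex(double x) const {
    // clamp into range for simplicity; real axes keep under/overflow bins
    if (x <= lo) return 0;
    if (x >= hi) return fail.size() - 1;
    return static_cast<std::size_t>((x - lo) / (hi - lo) * fail.size());
  }

  // One Fill per jet, regardless of whether the jet passed
  void Fill(double x, bool passed) { (passed ? pass : fail)[BinIndex(x)] += 1.; }
};

// Per-bin efficiency pass / (pass + fail); a real DivideBinomial would
// also propagate binomial (e.g. Clopper-Pearson) uncertainties.
std::vector<double> DivideBinomial(const BoolAxisHist& h) {
  std::vector<double> eff(h.pass.size(), 0.);
  for (std::size_t i = 0; i < eff.size(); ++i) {
    const double all = h.pass[i] + h.fail[i];
    if (all > 0) eff[i] = h.pass[i] / all;
  }
  return eff;
}
```

The point of the exercise: the bin lookup and the pass/fail decision happen in a single call, so the "all" and "pass" bookkeeping can never get out of sync.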
This looks good. I have a question about something else that would be nice and efficient (and is perhaps already on a roadmap): what about support for multiple weight variations? Since CMS is emphasizing weight-based systematic variations in MC, it can easily happen that we have to fill dozens of histograms with the same x, y, z content and only the weight changing. I don’t know how (in)efficient RDataFrame is when creating all these histograms, but I’d imagine a much more efficient version could be filled with a signature like the one below in an old-style event loop…
Yes, I think that would be useful (and less awkward than keeping track of two histograms).
The idea by @nmangane is also interesting (weight-based systematics are very easy to do in RDataFrame, but we tend to have a lot of them, for many histograms, so even a small gain in storage or computing time may be worth it).
We already foresaw keeping track of multiple uncertainties. The systematics layer is a great idea - we were thinking of plugging this into RDataFrame, but it indeed has a data-storage side too, and I like your idea of a systematics “axis”. The bulk of histogram-filling time goes into the bin calculation, and that needs to be done only once for a whole set of systematics, i.e. this is a real usability and performance improvement!
I’ll have to think about whether that’s better modeled as a “wrapper” layer around N histograms, or whether histograms should support it themselves. So far I’m leaning towards a “wrapper object”, also because moments need to be kept per weight dimension. Opinions?
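For discussion, here is a rough sketch of what the “wrapper object” option could look like, assuming a simple linear axis: the bin is located once per Fill, and sum(w) and sum(w^2) are kept per weight variation so each systematic histogram retains its own statistical moments. All names and the storage layout here are made up for illustration; this is not an actual RHist interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical systematics wrapper: one shared axis, N weight variations.
struct SystematicsHist {
  double lo, hi;
  std::size_t nbins, nvars;
  std::vector<double> sumw;  // size nbins * nvars, variations contiguous per bin
  std::vector<double> sumw2; // same layout; per-variation moments for errors

  SystematicsHist(std::size_t nbins_, double lo_, double hi_, std::size_t nvars_)
      : lo(lo_), hi(hi_), nbins(nbins_), nvars(nvars_),
        sumw(nbins_ * nvars_), sumw2(nbins_ * nvars_) {}

  void Fill(double x, const std::vector<double>& weights) {
    // The expensive part - locating the bin - happens once per Fill,
    // not once per systematic variation.
    const std::size_t bin =
        x <= lo ? 0
        : x >= hi ? nbins - 1
                  : static_cast<std::size_t>((x - lo) / (hi - lo) * nbins);
    for (std::size_t v = 0; v < weights.size(); ++v) {
      sumw[bin * nvars + v] += weights[v];
      sumw2[bin * nvars + v] += weights[v] * weights[v];
    }
  }

  double Content(std::size_t bin, std::size_t var) const {
    return sumw[bin * nvars + var];
  }
};
```

A caller would pass e.g. {nominal, scale-up, scale-down} weights per event; whether this lives as a wrapper around N histograms or inside the histogram itself mainly changes who owns the per-variation moment arrays.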