On histogram fills and data dependency: potential misuse of `Fill`, its consequences, and a possible additional ROOT feature

Dear ROOT experts,

    I will take the example of TH1 here, but this holds true for any histogram, in any dimension, whenever the storage of the sum of squares of weights has been triggered (e.g. via Sumw2()).

When a histogram is filled with a weight, the weight is immediately squared and added to the running sum of squared weights, which is eventually used to compute the bin error with the trivial formula: bin error = sqrt(sum of squared weights).
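For concreteness, here is a minimal sketch of that behaviour (not taken from the original post), assuming a 1D TH1D with Sumw2() storage enabled:

```cpp
// Minimal sketch: with sum-of-squared-weights storage enabled, each Fill(x, w)
// adds w to the bin content and w*w to the running sum, and GetBinError(bin)
// returns the square root of that running sum.
#include "TH1D.h"
#include "TMath.h"
#include <iostream>

void binErrorSketch()
{
   TH1D h("h", "bin error from summed squared weights", 1, 0., 1.);
   h.Sumw2();            // trigger the storage of the sum of squared weights
   h.Fill(0.5, 2.0);     // weight 2
   h.Fill(0.5, 3.0);     // weight 3
   // content: 2 + 3 = 5 ; error: sqrt(2^2 + 3^2) = sqrt(13)
   std::cout << h.GetBinContent(1) << " +/- " << h.GetBinError(1)
             << " (expected error " << TMath::Sqrt(13.) << ")" << std::endl;
}
```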

For the bin error to be correct, this implies that consecutive calls to Fill need to be performed on independent data.
However, on the user side, this actually implies keeping an extra histogram object to track the data encountered within the event. Indeed, within an event (where the data is not independent), one needs to sum up, for each bin, the weights that are encountered (without squaring them).
At the end of each event, this extra histogram can be flushed into our histogram (with Fill or FillN): for each bin, the weight is then event-based, hence independent from the next event's Fill, and can then be squared.
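As a concrete illustration of this user-side workaround, here is a rough sketch (the function and variable names are mine, not from any framework): an extra histogram accumulates the within-event weights, and is flushed into the main histogram once per event, so that each event-level weight is squared exactly once.

```cpp
// Sketch of the user-side workaround: accumulate within-event weights in an
// extra histogram, then flush it once per event into the main histogram.
// Assumes hTotal.Sumw2() was called beforehand; under/overflow not handled.
#include "TH1D.h"
#include <vector>

void fillOneEvent(TH1D &hTotal, TH1D &hEvent, const std::vector<double> &values)
{
   hEvent.Reset();                  // start the event with an empty accumulator
   for (double x : values)
      hEvent.Fill(x);               // linear within-event sum of weights, no squaring
   for (int bin = 1; bin <= hEvent.GetNbinsX(); ++bin) {
      const double w = hEvent.GetBinContent(bin);
      if (w != 0.)
         hTotal.Fill(hEvent.GetBinCenter(bin), w);   // one event-level weight per bin, squared once
   }
   // Note: flushing at bin centres keeps bin contents and errors exact;
   // the global mean/std then use the bin centre as an approximation of x.
}
```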

    In practice, it seems that Fill is often misused: it is not always kept in mind that Fill should be called on independent data only, and that one should hence create extra accumulators at each call site (in addition to the ROOT histograms). Instead, histograms are filled directly, and there can be several calls to Fill within the same event.

Let’s take the example of the CMS experiment framework.
In the analyzers, at the end of each event, one typically loops over a collection of non-independent objects collected during the event: particles, vertices, digis, clusters, tracks, secondaries, etc.
For a given histogram, Fill can hence be called several times per event, on non-independent objects.
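Schematically (the names below are made up, this is not actual CMSSW code), the pattern looks like this:

```cpp
// Schematic of the pattern in question: Fill is called once per reconstructed
// object inside the per-event analysis, so several non-independent entries
// from the same event each contribute their own w^2 to the bin error.
#include "TH1D.h"
#include <vector>

struct RecoObject { double pt; };   // stand-in for a reconstructed object

void analyzeEvent(TH1D &hPt, const std::vector<RecoObject> &objectsOfThisEvent)
{
   for (const auto &obj : objectsOfThisEvent)   // objects from the same event
      hPt.Fill(obj.pt);                         // one Fill per correlated object
}
```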

I just give an example here, but this is a general pattern, observed in many analyzers across the CMS framework. I did not look at the other experiments' frameworks.

I originally noticed this in a G4 branch, which has a similar issue: it contains histogram classes with the same Fill implementation as ROOT, and the same misuse. I was observing a discrepancy in the bin errors with respect to what is obtained with a different histogramming toolkit (which supports within-the-event fills).

With this type of misuse, histogram bin errors are obviously underestimated: for a given bin, the squares of partial (per-object) weights are summed, instead of squaring the full event weight.

This obviously seems too huge to be true – yet, what am I overlooking?

    While this would anyway remain a misuse of the ROOT histogram interface, it could be beneficial to add direct support for within-the-event fills. This would make the user think about which situation applies (are my fills really independent from each other?), and would avoid duplicating, at every call site, the code dealing with an extra within-the-event histogram.

Let’s take the example of TH1. We need a third array to store the within-the-event info (let’s call it fEventSumw.fArray).
An extra TH1::FillWithinEvent(bin, weight=1) public member function could be used to sum up the weights (not their squares) into that array.
An extra TH1::EndOfEvent() public member function would be used by the user to signal that the end of the event has been reached. Internally, it flushes the content of the within-the-event array into the existing fArray and fSumw2.fArray (it could call Fill or FillN): for each (bin, weight) in fEventSumw.fArray, fArray[bin] += weight, fSumw2.fArray[bin] += weight^2, and fEventSumw.fArray[bin] = 0.
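To make the idea concrete, here is a rough sketch of what this could look like, written as a small standalone helper around TH1D rather than as an actual modification of TH1 (FillWithinEvent and EndOfEvent are the names proposed above, not existing ROOT methods):

```cpp
// Hypothetical sketch of the proposed interface, as a helper around TH1D.
#include "TH1D.h"
#include <vector>

class EventAwareHist {
public:
   EventAwareHist(const char *name, const char *title, int nbins, double xlow, double xup)
      : fHist(name, title, nbins, xlow, xup), fEventSumw(nbins + 2, 0.)   // + under/overflow
   {
      fHist.Sumw2();
   }
   // Sum up the within-event weight (not its square) for the bin containing x.
   void FillWithinEvent(double x, double weight = 1.)
   {
      fEventSumw[fHist.FindBin(x)] += weight;
   }
   // Signal the end of the event: flush the within-event array into the
   // histogram, so each bin receives one event-level weight, squared once.
   void EndOfEvent()
   {
      for (int bin = 0; bin < (int)fEventSumw.size(); ++bin) {
         if (fEventSumw[bin] != 0.) {
            fHist.Fill(fHist.GetBinCenter(bin), fEventSumw[bin]);
            fEventSumw[bin] = 0.;
         }
      }
   }
   TH1D &Hist() { return fHist; }

private:
   TH1D fHist;
   std::vector<double> fEventSumw;   // within-the-event sum of weights, per bin
};
```

In a real TH1 implementation, EndOfEvent would of course update fArray, fSumw2.fArray, and the global statistics counters directly, as described above.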

The resulting histogram class interface would hence inherently support several fills per event.

I thank you in advance for your time & thoughts.
Again, I am probably overlooking something, but anyway I wanted to share my puzzle with you :slight_smile:

Best regards,
Gabrielle


Hi,

Thank you for this interesting post. Do you have a concrete example clearly showing the under-estimation of the errors?
In the one provided above, I am not convinced it is correct to sum the weights linearly (instead of their squares), since the production of decay particles is itself a random process; as long as the event weight is not correlated with the process (e.g. the Higgs decay in the example above), I think it is correct to treat the entries as independent.

Best regards

Lorenzo

Dear Lorenzo,

Thanks a lot for your answer.

In the above example, the decays themselves are independent. There is one decay per event considered in this study. However, the transverse momenta of the decay products from the same decay are each filled (one Fill per decay product) → these fills are hence correlated?
Let’s say the event weights are 1, and consider a given pT bin of the histogram of the pT distribution of the gammas from the H → yy decay mode. In that bin, for each event, we can have 2 entries (from the same decay, which will only happen for a few events, n1), 1 entry (for a few events, n2), or 0 entries (most events, n3). To get the squared bin error, we want 2^2 x n1 + 1^2 x n2 + 0^2 x n3 to come into play, not (1^2 + 1^2) x n1 + 1^2 x n2 + 0^2 x n3, no?
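As a toy numerical check of this arithmetic (the event counts below are made up), the two prescriptions give, for that bin:

```cpp
// Toy check (made-up counts): bin error from per-event vs per-entry squaring.
#include <cmath>
#include <iostream>

int main()
{
   const int n1 = 10;   // events with 2 entries in the bin (both gammas)
   const int n2 = 50;   // events with 1 entry in the bin
   // the n3 events with 0 entries contribute to neither sum
   const double perEvent = std::sqrt(2. * 2. * n1 + 1. * 1. * n2);   // square the event-level weight
   const double perEntry = std::sqrt((1. + 1.) * n1 + 1. * n2);      // square each entry separately
   // per-event: sqrt(90) ~ 9.5 ; per-entry: sqrt(70) ~ 8.4 -> smaller, i.e. under-estimated
   std::cout << "per-event: " << perEvent << "  per-entry: " << perEntry << std::endl;
   return 0;
}
```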

Otherwise, one can find another example here: cmssw/ElectronAnalyzer.cc at master · cms-sw/cmssw · GitHub
This is looping over a collection of ECAL superclusters from the same event (in the context of electron showers analysis).
The fill in η is called for each supercluster (belonging to the same event).
Here the difference is a bit more striking: in the previous example, only 2 γ are analyzed per event, so it is not very likely that they fall into the same bin.

The mean and std displayed in the stats box should obviously not be affected (they are computed from the linear sums of w, w·x, and w·x², which stay the same whether the weights are grouped per event or not). The bin errors, however, are affected.

I have just run a few test cases (CMSSW_11_2_0_pre1, H130GGgluonfusion workflow, 300 events), since so far I had been reasoning without running anything (risky ;p).
ref indicates the standard output from CMSSW DQM, with no modification in CMSSW code. test indicates the standard output from CMSSW DQM, but with the addition of a data container to sum up the within-the-event data first.

Thanks again in advance for your thoughts,
Cheers,
Gabrielle


@gabrielle.hugo It seems to me that you try to completely neglect the cases where the γs (analyzed per event) fall into different bins. This actually introduces correlated uncertainties among the bins, and ROOT cannot deal with them.

Hi,

Thanks for your answer.

I am not trying to fit a histogram. I am questioning the error calculation as it is presently done at fill time in the histogram itself.
If Fill is called on independent data, everything is already correctly supported.
Issues arise due to the calls on non-independent data that occur in practice.

At the bin level, I believe the error does not have to reflect correlations with other bins, but only the error on its own entries. Within a bin, I think 2 entries from the same event should not be given the same statistical weight as 2 entries from 2 different events (even if those events themselves have entries in other bins).

At the global level instead, correlations between bins are consistently handled in the FillWithinEvent, EndOfEvent approach.
To go into more detail: in FillWithinEvent, the arrays as well as the global counters would need to be filled on the fly anyway (within-event sum of entries, within-event sum of all bin weights, etc.). At the EndOfEvent call, they would be flushed into the histogram member data (in a similar way to what is done in Fill). For example, at the EndOfEvent call, we would have fEntries += fWithinEventSumEntries, and fTsumw2 += fWithinEventTsumw * fWithinEventTsumw (in this case, exactly like what is done for each bin).
This implies that the histogram fTsumw2 would be numEvents * (1+1)^2, instead of the current numEvents * (1^2+1^2). Hence, it would properly take into account that the 2 entries per event are correlated, for all events, independently of the binning. In the current implementation, instead, the 2 entries per event are wrongly treated as if they were independent.

This would be a simple extension of the existing histogram interface.
Note that while this would not affect the mean, std, and higher order moments (for the reasons exposed above), an interesting effect is that GetEffectiveEntries() would be different: here, it would be 1 instead of 2 (times numEvents), which I believe makes sense (there is the statistical accuracy of 1 entry per event, not 2). This implies that the mean error and std error would be higher and conservative, while in the exposed use cases they are underestimated because the independence hypothesis is not satisfied.
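To spell out the GetEffectiveEntries() statement (it returns (sum of weights)^2 / (sum of squared weights), i.e. fTsumw^2 / fTsumw2), here is a schematic comparison for numEvents events with 2 unit-weight entries each:

```cpp
// Schematic effective-entries comparison for 2 unit-weight entries per event.
#include <iostream>

int main()
{
   const double numEvents = 300.;                      // arbitrary
   const double tsumw = 2. * numEvents;                // same in both cases
   const double tsumw2Current  = 2. * numEvents;       // (1^2 + 1^2) per event
   const double tsumw2Proposed = 4. * numEvents;       // (1 + 1)^2 per event
   // effective entries = tsumw^2 / tsumw2
   std::cout << "current : " << tsumw * tsumw / tsumw2Current  / numEvents << " per event\n"   // 2
             << "proposed: " << tsumw * tsumw / tsumw2Proposed / numEvents << " per event\n";  // 1
   return 0;
}
```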

Thanks in advance,
Best regards,
Gabrielle

I guess correlated uncertainties among bins would affect GetEffectiveEntries, too.

Yes, the fTsumw2 (which is the part that would affect GetEffectiveEntries) would be numEvents * 2^2 instead of numEvents * (1^2+1^2) → each event would then properly take into account the fact that its 2 fills are correlated (in all cases, whether the decay products fall into the same bin or not).
fTsumw2 is independent from the binning.

Hi Gabrielle,

Your reasoning above, to sum the weights linearly for computing the bin error, is correct only if the entries in one event are 100% correlated. In the case of the H → yy decay, if you assume the Higgs pT is zero, you will have the same pT for the two gammas, and they will enter the same bin. It is then correct to sum the weights linearly, as you show.
If the Higgs has some pT, or you have a > 2-body decay, then the two photons will not end up in the same bin and they will not be 100% correlated. In this case the best we can do is to sum the squared weights. I think this is the best one can have in the generic case, like the complex topology of a CMS event. For cases where strong correlations are present, it is better to verify the correct bin error using pseudo-experiments or other similar techniques, and to use the linear sum of weights only when you are sure you have 100% correlations (e.g. back-to-back decays).
If you then do a fit, it is correct that you should take cross-bin correlations into account.
This can be confirmed with a simple simulation.
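For the record, here is a rough sketch of such a pseudo-experiment check (all parameters made up): for one bin, it compares the observed spread of the bin content across toys with the two error prescriptions, once for fully correlated fills and once for independent fills.

```cpp
// Toy pseudo-experiment sketch: spread of one bin's content vs the two
// error prescriptions, for 100% correlated and for independent fills.
#include <cmath>
#include <iostream>
#include <random>

int main()
{
   std::mt19937 rng(12345);
   std::bernoulli_distribution hit(0.05);   // probability for one photon to land in the studied bin
   const int nEvents = 10000, nToys = 1000;
   const bool cases[] = {true, false};      // correlated, then independent

   for (bool correlated : cases) {
      double sumC = 0., sumC2 = 0., sumw2PerEntry = 0., sumw2PerEvent = 0.;
      for (int toy = 0; toy < nToys; ++toy) {
         double content = 0.;
         for (int ev = 0; ev < nEvents; ++ev) {
            // k = number of unit-weight entries this event puts in the bin
            const int k = correlated ? (hit(rng) ? 2 : 0)
                                     : (hit(rng) ? 1 : 0) + (hit(rng) ? 1 : 0);
            content       += k;
            sumw2PerEntry += k;       // k entries, each squared separately: k * 1^2
            sumw2PerEvent += k * k;   // one event-level weight, squared once
         }
         sumC  += content;
         sumC2 += content * content;
      }
      const double mean   = sumC / nToys;
      const double spread = std::sqrt(sumC2 / nToys - mean * mean);   // "true" bin error
      std::cout << (correlated ? "correlated : " : "independent: ")
                << "observed spread = " << spread
                << ", per-entry error ~ " << std::sqrt(sumw2PerEntry / nToys)
                << ", per-event error ~ " << std::sqrt(sumw2PerEvent / nToys) << std::endl;
   }
   return 0;
}
```

With these made-up parameters, only the linear within-event sum (per-event squaring) reproduces the spread in the fully correlated case, while the per-entry squaring reproduces it in the independent case; the per-event estimate is then only slightly larger, because two hits in the same bin are rare.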

Thank you again, Gabrielle, for raising this interesting issue.

Lorenzo

Hi Lorenzo,

Agreed that the real correlations within an event are complex and would not always be 100% in practice → they would actually range anywhere between 0 and 100%, and likely closer to 0 than 100.
In my understanding, the actual errors on the mean, std, and higher order moments (i.e. GetMeanError, GetStdDevError, etc.), which are calculated from GetEffectiveEntries and are also affected here, independently of the binning, should lie somewhere in an envelope delimited by the presently calculated errors (which assume 0% correlation) and the ones exposed above (which assume 100% correlation).

Since the correlations within an event are indeed too complex to analyze, the only thing easily feasible in practice, in the generic case, is to choose one of the envelope extrema (the 0% or the 100% approach). I would have thought it safer to just take the conservative approach (hence decide that independence is reached at the event level only, not between objects within the same event), but I agree it can be overly conservative.

Anyway, thanks again for the exchange, it was an interesting discussion indeed!

Best regards,
Gabrielle
