Dear ROOT experts,
I will use TH1 as the example here, but this holds true for any histogram, in any dimension, whenever the storage of the sum of squares of weights has been triggered.
When a histogram is filled with a weight, the weight is immediately squared and stored as such. The running sum of squared weights is eventually used to compute the bin error, with the trivial formula: bin error = sqrt(sum of the squared weights).
For the bin error to be correct, this implies that consecutive calls to
Fill need to be performed on independent data.
However, on the user side, this actually implies maintaining an extra histogram object that keeps track of the data encountered within the current event. Indeed, within an event (within which data is not independent), one needs, for each bin, to sum up the weights which are encountered, without squaring them.
At the end of each event, this extra histogram can be flushed into our histogram (with FillN): for each bin, the weight is then event-based, hence independent from the fills of the next event, and can safely be squared.
In practice, it seems that Fill is actually misused: it is not always kept in mind that Fill may only be called on independent data, and that one should hence create extra histograms at the call site (in addition to the ROOT histograms). Instead, histograms are filled directly, and there can be several calls to Fill within the same event.
Let’s take the example of the CMS experiment framework.
In the analyzers, at the end of each event, one typically loops on a collection of non-independent objects, which were collected during the event: particles, vertices, digis, clusters, tracks, secondaries, etc.
For a given histogram,
Fill can hence be called several times per event, on non-independent objects.
I give just one example here, but this is a general pattern, observed in many analyzers across the CMS framework. I did not have a look at the other experiments' frameworks.
I first noticed this in a G4 branch, which has a similar issue: it contains histogram classes with the same Fill implementation as ROOT, and the same misuse. I was observing a discrepancy in bin errors with respect to what is obtained with a different histogramming toolkit (which supports within-the-event fills).
With this type of misuse, histogram bin errors are obviously underestimated: for a given bin, the partial weights are squared individually, instead of squaring the full event weight.
This seems too huge to be true; yet, what am I overlooking?
While the current pattern would anyway remain a misuse of the ROOT histogram interface, it would potentially be beneficial to directly add support for within-the-event fills. This would make the user think about which situation applies (are my fills really independent from each other?), and would avoid duplicating, at every call site, the code dealing with an extra within-the-event histogram.
Let's take the example of TH1. We need a third array to store the within-the-event information. A TH1::FillWithinEvent(bin, weight = 1) public member function could be used to sum up the weights (not their squares) into that array. A TH1::EndOfEvent() public member function would be used by the user to signal that the end of the event has been reached. Internally, it would flush the content of the within-the-event array into the existing fArray and fSumw2.fArray (it could call FillN): for each (bin, weight) in the within-the-event array, it would perform fArray[bin] += weight and fSumw2.fArray[bin] += weight^2, and then reset the within-the-event array.
The resulting histogram class interface would hence inherently support several fills per event.
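This proposal can be sketched on a small stand-in class (plain C++, not a real TH1 patch; apart from FillWithinEvent and EndOfEvent, the member names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Stand-in histogram supporting several correlated fills per event.
class Hist {
public:
    explicit Hist(int nbins)
        : fArray(nbins, 0.0), fSumw2(nbins, 0.0), fEvent(nbins, 0.0) {}

    // Sum weights (not their squares) into the within-the-event array.
    void FillWithinEvent(int bin, double weight = 1.0) { fEvent[bin] += weight; }

    // End of event: flush the within-the-event array into the content and
    // sum-of-squares arrays, squaring only the event-level weights, then reset.
    void EndOfEvent() {
        for (std::size_t bin = 0; bin < fArray.size(); ++bin) {
            fArray[bin] += fEvent[bin];
            fSumw2[bin] += fEvent[bin] * fEvent[bin];
            fEvent[bin] = 0.0;
        }
    }

    double GetBinContent(int bin) const { return fArray[bin]; }
    double GetBinError(int bin) const { return std::sqrt(fSumw2[bin]); }

private:
    std::vector<double> fArray;   // bin contents
    std::vector<double> fSumw2;   // sum of squared event-level weights
    std::vector<double> fEvent;   // within-the-event sums
};
```

Usage: two correlated unit-weight fills of a bin followed by EndOfEvent() contribute 2 to the content and 4 (not 2) to the sum of squared weights.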
I thank you in advance for your time & thoughts.
Again, I am probably overlooking something, but anyway I wanted to share my puzzle with you.