TH1 stat calculations are inconsistent

mwilkins · June 30, 2020, 4:39pm

ROOT calculates histogram statistics in different ways depending on the context even though the function calls are the same.

As noted in this JIRA task, after the histogram is initially filled, its statistics are those of the dataset, not of the histogram. If one zooms in on an axis, however, GetMean returns the mean of the binned values. There are a few problems with this behavior:

1. It is not documented. (This was noted on the JIRA task, which was nonetheless marked as resolved…)
2. It is inconsistent. Zoomed histograms get histogram stats (even if the entire axis is in range), while unzoomed ones get dataset stats.
3. It is wrong. The mean of the histogram is the binned mean, not the mean of the dataset, no matter when it was filled.

Regarding (3), one gets the correct behavior by calling TH1::ResetStats after filling, but a user would only know about this if they specifically looked up ResetStats; it is not mentioned in the documentation for TH1::GetMean, TH1::GetStdDev, or any other function call, nor in the Users Guide chapter on histograms.

One can recover the dataset statistics by calling TAxis::UnZoom, but this only works if the histogram has been drawn; otherwise, the UnZoom call does nothing. This is also unintuitive: one would naively expect histogram calculations to be independent of whether the histogram is shown on a canvas. The behavior is not documented, and understanding it requires examining the source code. (Without a canvas, one can call TAxis::SetRange(0, 0) instead.)

Changing the default behavior would obviously break backward compatibility, so I suggest two changes:

1. Better documentation of this behavior, including up top in the TH1 class reference, the Users Guide, and all functions that get histogram stats.
2. TAxis::UnZoom should work even when there is no canvas, or at least print a warning when it doesn’t do anything, and the documentation should be updated accordingly.

I’m happy to submit a PR with these changes if desired. Reproducer below.

import ROOT as r
h = r.TH1I('h', 'h', 1, 0, 100)  # histogram with just 1 bin
for i in range(1000): 
    h.Fill(r.gRandom.Gaus(20, 2)) 
# stats of dataset:
print(h.GetMean(), h.GetStdDev())  # (20.053602822081633, 2.075704758987478)
# stats of histogram:
h.GetXaxis().SetRangeUser(0, 1)
print(h.GetMean(), h.GetStdDev())  # (50.0, 0.0)
# still stats of histogram despite being all the way zoomed out:
h.GetXaxis().SetRangeUser(0, 100)
print(h.GetMean(), h.GetStdDev())  # (50.0, 0.0)
# UnZoom does nothing:
h.GetXaxis().UnZoom()
print(h.GetMean(), h.GetStdDev())  # (50.0, 0.0)
# unless h is drawn:
r.gROOT.SetBatch()
c = r.TCanvas()
h.Draw()
h.GetXaxis().UnZoom()
print(h.GetMean(), h.GetStdDev())  # (20.053602822081633, 2.075704758987478)

ROOT Version: 6.20/04
Platform: Not Provided
Compiler: Not Provided

jblomer · July 1, 2020, 7:28am

@Axel or @moneta can you comment?

moneta · July 1, 2020, 10:43am

Hi,

When a range is set, the histogram statistics are computed from the bin contents. Initially, they are computed from the (unbinned) data statistics. This explains the difference.
Whenever possible, TH1 tries to return the data statistics, which might explain the inconsistencies.
I agree it should probably be better documented, and we will do that.

I don’t understand your point number 3, why the binned mean is wrong. It is what can be calculated given the bin centers, since we don’t know the distribution of the entries within each bin.
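
For illustration, here is a minimal sketch in plain Python (no ROOT needed) of the two means being discussed. The helper `binned_mean` is hypothetical and not part of ROOT; it simply places every entry at its bin center, which is all that can be done once the individual values are gone. The one-bin layout mirrors the reproducer earlier in the thread.

```python
import random
import statistics


def binned_mean(values, nbins, lo, hi):
    """Mean estimated from bin contents, with each entry placed at its bin center."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:                      # entries outside [lo, hi) are ignored
            counts[int((v - lo) / width)] += 1
    centers = [lo + (i + 0.5) * width for i in range(nbins)]
    total = sum(counts)
    return sum(c * n for c, n in zip(centers, counts)) / total


random.seed(1)
data = [random.gauss(20, 2) for _ in range(1000)]

unbinned = statistics.mean(data)              # uses every original value
one_bin = binned_mean(data, 1, 0, 100)        # single bin: all entries sit at 50
fine = binned_mean(data, 100, 0, 100)         # unit-width bins: close to unbinned

print(unbinned, one_bin, fine)
```

With a single bin the binned mean is the bin center regardless of the data, while with narrow bins the two estimates differ by at most half a bin width.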

Lorenzo

mwilkins · July 1, 2020, 11:34am

Thanks for your attention to this @moneta.

This is not quite the case, since even zoomed out to the full range, it still returns the binned stats until SetRange() is called.

I meant to say the binned mean is right and the unbinned mean is wrong. The idea is that histograms store binned data, so the mean of the histogram should be the mean of the binned data, not the mean of the dataset used to fill the histogram.

moneta · July 1, 2020, 2:49pm

Hello,

This needs to be checked. UnZoom in principle calls SetRange(), so one should be able to obtain the same result.

I could agree with this, but historically ROOT has always behaved this way and some people probably expect this functionality. I would therefore be reluctant to change it, since you can always get the binned mean by calling ResetStats. We need, as you said, to document this better.

Lorenzo

mwilkins · July 1, 2020, 3:04pm

Hi, @moneta,

You can see in the reproducer that h.GetXaxis().SetRangeUser(0, 100) results in binned stats, whereas h.GetXaxis().UnZoom(), iff h is drawn on a pad, results in unbinned stats.

It sounds like we are on the same page. I will make a PR with the suggested documentation and changes to UnZoom.

moneta · July 1, 2020, 3:40pm

Hi,

I think the explanation is that SetRange(0,0), which is called by UnZoom, has a different effect on the statistics than SetRange(firstbin, lastbin). This is because the first one clears the TAxis::kAxisRange bit while the second does not.
I think there are some reasons to have the bit set when doing SetRange(firstbin, lastbin), so I am reluctant to change this too.
It is then just a question of documenting this better. Users can always clear the bit and get back the original data statistics.
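
The mechanism described here can be sketched with a toy model in plain Python. This is not ROOT code: `ToyHist`, its methods, and the `axis_range` flag (standing in for the TAxis::kAxisRange bit) are hypothetical names used only to mirror the behaviour as described in this thread, under the assumption that the data statistics are running sums accumulated at fill time.

```python
class ToyHist:
    """Toy 1D histogram modelling the range-flag behaviour; not ROOT's implementation."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.width = (hi - lo) / nbins
        self.counts = [0] * nbins
        self.sumw = 0.0            # running number of in-range entries
        self.sumwx = 0.0           # running sum of the filled values
        self.axis_range = False    # stand-in for the TAxis::kAxisRange bit
        self.first, self.last = 1, nbins

    def fill(self, x):
        if self.lo <= x < self.hi:
            self.counts[int((x - self.lo) / self.width)] += 1
            self.sumw += 1
            self.sumwx += x

    def set_range(self, first, last):
        if first == 0 and last == 0:   # "reset" call: flag is cleared
            self.first, self.last, self.axis_range = 1, self.nbins, False
        else:                          # explicit range, even the full one: flag is set
            self.first, self.last, self.axis_range = first, last, True

    def mean(self):
        if not self.axis_range:        # unbinned: use the sums kept while filling
            return self.sumwx / self.sumw
        bins = range(self.first, self.last + 1)   # binned: recompute from contents
        n = sum(self.counts[b - 1] for b in bins)
        return sum((self.lo + (b - 0.5) * self.width) * self.counts[b - 1]
                   for b in bins) / n


h = ToyHist(1, 0, 100)
for x in (19, 20, 21):
    h.fill(x)

results = [h.mean()]      # flag clear: mean of the filled values
h.set_range(1, 1)         # full range, but the flag is now set
results.append(h.mean())  # mean from the bin center
h.set_range(0, 0)         # reset call clears the flag
results.append(h.mean())  # back to the mean of the filled values
print(results)            # [20.0, 50.0, 20.0]
```

In this model, selecting the full range explicitly and resetting the range cover the same bins but leave the flag in different states, which is why the two calls yield different means.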

Thank you very much if you can contribute a PR for the documentation.

Lorenzo

mwilkins · July 2, 2020, 4:23pm

I have opened PRs #5973 and #5974 making the suggested changes. Thanks for your attention.

moneta · July 2, 2020, 5:51pm

Thank you for the PR. We will review and comment there.
Best,
Lorenzo

system · July 16, 2020, 5:51pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.