The performance of TH1::Fit() using the chi-square method

liuk · March 5, 2008, 1:58pm

Hi, all

Recently I studied the performance of the chi-square and log-likelihood fit methods implemented in ROOT, and the results for the chi-square method were not what I expected.

I did a simple test to evaluate the performance of the two methods:

1. Randomly generate a histogram according to a pre-defined function fcn (I tested gaus and pol2), using TH1::FillRandom("fcn", Nentries).
2. Fit the histogram with the same function, using TH1::Fit("fcn") and TH1::Fit("fcn", "L") respectively.
3. Calculate the number of entries from the fitted parameters: number of entries = (definite integral of fcn from xmin to xmax) / bin width. Then calculate bias = (fitted number - generated number) / generated number.
4. Set different seeds and repeat the steps above to get the distribution of the bias for each of the two methods.
5. Keep the generated number of entries fixed, decrease the bin width, and repeat all the steps above.
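
Written out, the quantities computed in the steps above are roughly the following. The symbols (w for the bin width, \hat{\theta} for the fitted parameters, N_gen for the generated number of entries) are notation introduced here for illustration and do not appear in the original post.

```latex
N_{\text{fit}} = \frac{1}{w} \int_{x_{\min}}^{x_{\max}} f(x;\hat{\theta})\,dx ,
\qquad
\text{bias} = \frac{N_{\text{fit}} - N_{\text{gen}}}{N_{\text{gen}}}
```

Dividing by w converts the area under the fitted curve into a number of entries, on the assumption that the fitted function value in each bin approximates that bin's content.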

When the number of events in each bin is sufficiently large, both methods appear to work equally well. However, as the bin width decreases, the bias of the chi-square method deviates clearly from zero, while the bias of the log-likelihood method stays sharply peaked at zero. The RMS of the bias is also markedly larger for the chi-square method than for the log-likelihood method.

I had expected no such marked difference between the two methods, since both are mathematically sound. So my question is: is this really due to the mathematical difference between the two methods, to some specific treatment in the ROOT implementation, or to something else that I overlooked?

Thank you all in advance!

moneta · March 5, 2008, 3:20pm

Hi,

This is a known result. The least-squares (or chi2) method is known to be biased when the number of entries per bin in a histogram is small.
The least-squares method assumes the bin contents are Gaussian distributed, which is a good approximation only when the bin contents are large.
Normally, the counts in a histogram are Poisson distributed, and this is what the binned likelihood fit method takes into account.
Furthermore, the chi2 method has a problem with bins that have zero entries.
Currently, when using the chi2 method in ROOT these bins are excluded from the fit, but in reality they carry valuable statistical information.
In the likelihood method they are taken into account.
So, in conclusion, if your histogram contains bins with few entries
(let's say < 5), it is strongly recommended to use the likelihood fit.
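
The effect can be illustrated with a minimal standard-library C++ sketch. This is not ROOT's implementation, only a toy under stated assumptions: the model is flat (a single constant c fitted to all bins), so both estimators have closed forms. Least squares with errors sqrt(n_i) and empty bins skipped gives c = K / sum(1/n_i) over the K non-empty bins, while the Poisson likelihood gives c = sum(n_i) / B over all B bins. The names FlatFit, Bias, fitFlat and meanBias are hypothetical, and each bin is drawn independently from a Poisson distribution, which differs from the fixed total used by TH1::FillRandom.

```cpp
#include <random>
#include <vector>

// Estimates of a constant bin content c from one toy histogram.
struct FlatFit {
    double chi2;  // least squares, sigma_i = sqrt(n_i), empty bins skipped
    double like;  // Poisson maximum likelihood, all bins used
};

// Mean relative bias (estimate - mu) / mu of each estimator.
struct Bias {
    double chi2;
    double like;
};

// Closed-form fits of a constant to the given bin counts (assumed flat model).
FlatFit fitFlat(const std::vector<int>& counts) {
    double sumInverse = 0.0;  // sum of 1/n_i over non-empty bins
    int nonEmpty = 0;         // K, number of non-empty bins
    long total = 0;           // sum of n_i over all bins
    for (int n : counts) {
        total += n;
        if (n > 0) {
            sumInverse += 1.0 / n;
            ++nonEmpty;
        }
    }
    FlatFit result;
    // Minimising sum (n_i - c)^2 / n_i gives the harmonic mean of non-empty bins.
    result.chi2 = nonEmpty > 0 ? nonEmpty / sumInverse : 0.0;
    // Maximising the Poisson likelihood gives the arithmetic mean of all bins.
    result.like = counts.empty() ? 0.0 : static_cast<double>(total) / counts.size();
    return result;
}

// Average the relative bias over nToys histograms with true bin content mu.
Bias meanBias(double mu, int nBins, int nToys, unsigned seed) {
    std::mt19937 rng(seed);
    std::poisson_distribution<int> poisson(mu);
    std::vector<int> counts(nBins);
    double sumChi2 = 0.0;
    double sumLike = 0.0;
    for (int toy = 0; toy < nToys; ++toy) {
        for (int& n : counts) n = poisson(rng);
        const FlatFit fit = fitFlat(counts);
        sumChi2 += (fit.chi2 - mu) / mu;
        sumLike += (fit.like - mu) / mu;
    }
    return {sumChi2 / nToys, sumLike / nToys};
}
```

Calling, for example, meanBias(100.0, 50, 2000, 1) and meanBias(2.0, 50, 2000, 1) compares a well-populated histogram with a sparse one. In this toy the chi2 estimate is a harmonic mean, which weights low-count bins more heavily, while the likelihood estimate is the plain average over all bins, including the empty ones.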

You can find more information on this subject in the statistics chapter of the PDG review:

pdg.lbl.gov/2004/reviews/statrpp.pdf

Thank you for your post.

Lorenzo

liuk · March 5, 2008, 3:32pm

I learned a lot. Thank you very much for your help.