Chi-square definition in ROOT

yamabuki-shan · August 5, 2022, 4:35pm

Hi All,

My co-worker and I, using different tools, were trying to fit some histograms. Surprisingly, we found the chi2 of our results were different by 1.5 to 2 times. Here is a toy histogram for example:

Bin centres: [0.025, 0.075, 0.125, 0.175, 0.225, 0.275, 0.325, 0.375, 0.425, 0.475, 0.525, 0.575, 0.625, 0.675, 0.725, 0.775, 0.825, 0.875, 0.925, 0.975]
Bin edges: [0., 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.]
Bin heights: [0, 0, 0, 0, 0, 1, 3, 16, 17, 14, 18, 18, 7, 5, 1, 0, 0, 0, 0, 0]

ROOT, as well as the other tool, gave a similar fitting result as

f(x) = 19.0521 * exp(-(x-0.497859)^2 / (2*0.0980641^2))

ROOT’s TF1() somehow gave a chi2 of 7.27082, while the other one (and a hand calculation) gave

chi2 = \sum{(f(bin_centre_i) - bin_height_i)^2 / bin_height_i} = 12.9570

for those bins with non-zero heights, of course.

I checked the manual and could not find the definition of chi-square in ROOT (but in RooFit package). And I also found a discussion about different chi-square definitions in this forum and it led to a closed issue. But unfortunately, the author seems to have deleted the original comments.

Can anyone enlighten me, for instance, where can I find the definition in ROOT manual or even the codes?

Many Thanks,
Yamabuki

ROOT Version: 6.26/04

Wile_E_Coyote · August 6, 2022, 12:39pm

{
  double Bin_edges[] =
    {0., 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.};
  double Bin_heights[] =
    {0, 0, 0, 0, 0, 1, 3, 16, 17, 14, 18, 18, 7, 5, 1, 0, 0, 0, 0, 0};
  TH1D *h = new TH1D("h", "h", sizeof(Bin_edges) / sizeof(double) - 1, Bin_edges);
  for (int i = 0; i < sizeof(Bin_heights) / sizeof(double); i++)
    h->SetBinContent(i + 1, Bin_heights[i]);
  h->Sumw2(kFALSE); h->ResetStats(); // "statistics" from bin content
  gStyle->SetOptFit();
  TF1 *f = (TF1*)gROOT->GetFunction("gaus");
  h->Fit(f, ""); // "" = Neyman chi2, "P" = Pearson chi2, "L" = log-likelihood
  // note: below it is assumed that "bin error" = sqrt("bin content")
  double chi2_N = 0.;
  for (int i = 1; i <= h->GetNbinsX(); i++) {
    double v = h->GetBinContent(i);
    if (v) chi2_N += TMath::Sq(f->Eval(h->GetBinCenter(i)) - v) / TMath::Abs(v);
  }
  std::cout << "hand calculated Neyman chi2 = " << chi2_N << "\n";
  // note: below it is assumed that "bin error" = sqrt( f("bin center") )
  double chi2_P = 0.;
  for (int i = 1; i <= h->GetNbinsX(); i++) {
    double v = f->Eval(h->GetBinCenter(i));
    if (v) chi2_P += TMath::Sq(v - h->GetBinContent(i)) / TMath::Abs(v);
  }
  std::cout << "hand calculated Pearson chi2 = " << chi2_P << "\n";
  // note: below Baker-Cousins binned log-likelihood is used (Poisson-distributed "bin content" is assumed)
  double nll = 0.;
  for (int i = 1; i <= h->GetNbinsX(); i++) {
    double vh = h->GetBinContent(i);
    double vf = f->Eval(h->GetBinCenter(i));
    if (vf) {
      nll += vf - vh;
      if (vh) nll += vh * TMath::Log(TMath::Abs(vh / vf));
    }
  }
  std::cout << "hand calculated negative log-likelihood nll = " << nll << "\n";
  std::cout << "hand calculated Baker-Cousins likelihood chi2 = " << 2. * nll << "\n";
}

Run the above macro three times, changing the fit option ("" = Neyman chi2, "P" = Pearson chi2, "L" = log-likelihood).

See also: TH1::Fit

@yamabuki-shan Always try to post a macro that “reproduces” your problem.

@moneta It seems that the “chi2” is wrongly calculated when log-likelihood is used.
Try my macro with “h->Fit(f, "L");”.
It returns “FCN=4.95327” which agrees with the “hand calculated negative log-likelihood nll”.
Then the "hand calculated Baker-Cousins likelihood chi2 = 9.90654, which agrees with “h->Chisquare(f, "L")” (same as “2 * FCN”).
However, the “f->GetChisquare()” returns “8.2006239” (which also appears in the drawn stat box).
This is neither the “Neyman chi2 = 7.94587” (same as “h->Chisquare(f, "")”) nor the “Pearson chi2 = 9.55313”.

Eddy_Offermann · August 7, 2022, 5:12pm

@Wile_E_Coyote , a tiny change to your script. Before calculating by hand the chi^2, the Fit with the correct fitOption is invoked because this will be important for the f->Eval call.

I have tried to investigate the f->GetChisquare versus h->Chisquare results with
the different fitting options, Neyman, Pearson and Baker-Cousins likelihood chi2.
The results are:

Neyman : f->GetChisquare and h->Chisquare correct
Pearson. : f->GetChisquare correct and h->Chisquare incorrect
likelihood chi2: f->GetChisquare incorrect and h->Chisquare correct

The code where the histogram fitting is happening (hist/hist/src/HFitImpl.cxx) is convoluted and I will let the author figure out what is happening

{
  double Bin_edges[] =
    {0., 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.};
  double Bin_heights[] =
    {0, 0, 0, 0, 0, 1, 3, 16, 17, 14, 18, 18, 7, 5, 1, 0, 0, 0, 0, 0};
  TH1D *h = new TH1D("h", "h", sizeof(Bin_edges) / sizeof(double) - 1, Bin_edges);
  for (int i = 0; i < sizeof(Bin_heights) / sizeof(double); i++)
    h->SetBinContent(i + 1, Bin_heights[i]);
  h->Sumw2(kFALSE); h->ResetStats(); // "statistics" from bin content
  gStyle->SetOptFit();
  TF1 *f = (TF1*)gROOT->GetFunction("gaus");
  h->Fit(f, ""); // "" = Neyman chi2, "P" = Pearson chi2, "L" = log-likelihood
  // note: below it is assumed that "bin error" = sqrt("bin content")
  double chi2_N = 0.;
  for (int i = 1; i <= h->GetNbinsX(); i++) {
    double v = h->GetBinContent(i);
    if (v) chi2_N += TMath::Sq(f->Eval(h->GetBinCenter(i)) - v) / TMath::Abs(v);
  }
  std::cout << "hand calculated Neyman chi2 = " << chi2_N << "\n";
  // note: below it is assumed that "bin error" = sqrt( f("bin center") )
  h->Fit(f, "P"); // "" = Neyman chi2, "P" = Pearson chi2, "L" = log-likelihood
  double chi2_P = 0.;
  for (int i = 1; i <= h->GetNbinsX(); i++) {
    double v = f->Eval(h->GetBinCenter(i));
    if (v) chi2_P += TMath::Sq(v - h->GetBinContent(i)) / TMath::Abs(v);
  }
  std::cout << "hand calculated Pearson chi2 = " << chi2_P << "\n";
  // note: below Baker-Cousins binned log-likelihood is used (Poisson-distributed "bin content" is assumed)
  h->Fit(f, "L"); // "" = Neyman chi2, "P" = Pearson chi2, "L" = log-likelihood
  double nll = 0.;
  for (int i = 1; i <= h->GetNbinsX(); i++) {
    double vh = h->GetBinContent(i);
    double vf = f->Eval(h->GetBinCenter(i));
    if (vf) {
      nll += vf - vh;
      if (vh) nll += vh * TMath::Log(TMath::Abs(vh / vf));
    }
  }
  std::cout << "hand calculated negative log-likelihood nll = " << nll << "\n";
  std::cout << "hand calculated Baker-Cousins likelihood chi2 = " << 2. * nll << "\n";

  std::cout << std::endl << "investigation f->GetChisquare vs h->Chisquare" << std::endl;
  h->Fit(f, "Q");
  std::cout << "h->Fit(f,\"\");" << std::endl;
  std::cout << "f->GetChisquare()  : " << f->GetChisquare() << std::endl;
  std::cout << "h->Chisquare(f,\"\") : " << h->Chisquare(f,"") << std::endl;
  h->Fit(f, "QP");
  std::cout << "h->Fit(f,\"P\");" << std::endl;
  std::cout << "f->GetChisquare()  : " << f->GetChisquare() << std::endl;
  std::cout << "h->Chisquare(f,\"P\") : " << h->Chisquare(f,"P") << std::endl;
  h->Fit(f, "QL");
  std::cout << "h->Fit(f,\"L\");" << std::endl;
  std::cout << "f->GetChisquare()  : " << f->GetChisquare() << std::endl;
  std::cout << "h->Chisquare(f,\"L\"): " << h->Chisquare(f,"L") << std::endl;
}

Wile_E_Coyote · August 7, 2022, 6:26pm

@Eddy_Offermann My macro calculates ALL available “chi2” values (and the “nll”) for ANY fit option. This is intentional.
You are really supposed to run it three times, changing the fit option ("" = Neyman chi2, "P" = Pearson chi2, "L" = log-likelihood).
If you do it, you will notice that with ANY fit option, “h->Chisquare(f, "")” returns the “Neyman chi2” and “h->Chisquare(f, "L")” returns the “Baker-Cousins likelihood chi2”.
@moneta There is no way to get the “Pearson chi2” from the “TH1::Chisquare” method, and I think this should be added (i.e., “h->Chisquare(f, "P")” is missing).
The “f->GetChisquare()” returns a wrong value only for the log-likelihood fit.

Eddy_Offermann · August 7, 2022, 6:47pm

Out of all these permutations only three are really relevant, no point of running “P” and doing the “N” by hand. My change implemented that. The whole implementation is not consistent,
it is possible to invoke “PL” …

Wile_E_Coyote · August 7, 2022, 6:51pm

@Eddy_Offermann The purpose of the “TH1::Chisquare” method is to calculate the “chi2” of your histogram with ANY function you want (and it should be able to calculate any available “chi2”). This method is NOT associated with fitting at all (i.e., you do not need to fit the histogram at all).
The “TF1::GetChisquare” method is supposed to return the “chi2” that was calculated when fitting this function to some object (e.g., a histogram or a graph, but note that it doesn’t know which object was fitted).

moneta · August 8, 2022, 7:18am

Hi,

TF1::GetChisquare returns the chi2 value that is computed during fitting, using the fitted data and it is computed together with the number of degree of freedom, TF1::GetNdf.
In the case of binned likelihood fit I agree this is confusing, it returns the Neyman chi2 using your fitter function. Probably we should return in that case the Baker-Cousins likelihood ratio, which is 2 * FitResult::MinimumFcn(). I think was kept like this for backward compatibility with the previous fitting code.

I see also the documentation of TF1::GetChisquare is missing. I will add this with the suggested change in case of likelihood fits.

Thank you for the feedback !

Cheers

Lorenzo

Wile_E_Coyote · August 8, 2022, 7:23am

This is not the case.
See the numbers in my first post here (“Neyman chi2 = 7.94587” versus “f->GetChisquare()” which returns “8.2006239” ).

It seems this has already been raised a year ago or so (you fixed something but it’s still incorrect):

moneta · August 8, 2022, 8:41am

Hi,
Yes, they should return the same value and this is not correct.
The reason is that the Neyman chi2 computed by the FitResult class after a binned likelihood fit includes the empty-bins with an error=1. This is not correct, if one wants to include empty bins it should use the Pearson chi2.
I think we should change the FitResult class to compute as Chisquare the Baker-Cousins chi2 in case of binned likelihood fits, and, if one want to compute the Neyman (or Pearson) chi2, it should use ROOT::Fit::Chisquare or TH1::Chisquare.
I will open a GitHub issue for this. Thank you for raising this issue

Cheers

Lorenzo

Wile_E_Coyote · August 8, 2022, 8:48am

The"Baker-Cousins likelihood chi2" does include bins with zero content.
So, I think, if the log-likelihood fit is done, then the returned “chi2” should be “2 * FCN” (i.e., the proper “Baker-Cousins likelihood chi2”).

Then, if possible, please add the "P" option (“Pearson chi2”) to the “TH1::Chisquare” method.

moneta · August 8, 2022, 8:51am

I have created [hist] Chi-square computation in Binned likelihood fits · Issue #11143 · root-project/root · GitHub
summarising these findings.
Thank you !

Lorenzo