Compare histograms with Chi2Test, which option?

Dear ROOT experts,

I am comparing the following two histograms using the Chi2Test function for TH1.
Validation_m1_no_iso_CR.pdf (22.8 KB)

The black dots are actual data from 2017, while the red histogram comes from a RooFit PDF normalized to unity and then scaled to the number of data entries. Both histograms have the same binning.

Now my question is: which option (UU, UW, WW) should I use for this comparison? I tried Chi2Test with all three options and they give quite different results:

UU: Chi2 = 233.596233, Prob = 0.23755, NDF = 219
UW: Chi2 = 0.080180, Prob = 1, NDF = 219
WW: Chi2 = 0.115606, Prob = 1, NDF = 219

I also pasted the relevant script at the end if that helps.


// Data histogram from dataset stored in workspace
TH1D *h1_control_offDiagonal_massF_data = (TH1D*) w->data("ds_dimudimu_control_offDiagonal_2D")->createHistogram("m2",220);
// 2D pdf projection on Y and then scaled to total data entries
TH2D* h2D_template2D = (TH2D*)w->pdf("template2D")->createHistogram("m1,m2", 220, 220);
TH1D *h1_control_offDiagonal_massF_template = new TH1D( *h2D_template2D_offDiagonal->ProjectionY() );
h1_control_offDiagonal_massF_template->Scale( h1_control_offDiagonal_massF_data->Integral() / h1_control_offDiagonal_massF_template->Integral() );
// "UU/UW/WW P" is shorthand: the test was run once per option ("UU P", "UW P", "WW P")
h1_control_offDiagonal_massF_data->Chi2Test(h1_control_offDiagonal_massF_template, "UU/UW/WW P");


_ROOT Version: 6.12/07
_Platform: CentOS Linux release 7.7.1908 (Core)
_Compiler: gcc version 7.3.1 20180127 (GCC)


If both your histograms contain counts (i.e. integer contents) you should use the option “UU”.
If one of the histograms is weighted (e.g. scaled) you should use the option “UW”, while if both histograms are weighted you should use the option “WW”.

However, I see that with UW you are getting a result that makes little sense to me: a very small chi2 value. Maybe something is not correct in the histogram re-weighting and its error calculation.
Make sure that your scaled histogram, h1_control_offDiagonal_massF_template, has its bin errors stored (i.e. h1_control_offDiagonal_massF_template->Sumw2(true) ), otherwise the option UW will not work correctly.

If you still have an issue, please post your histograms in a ROOT file as an attachment, so we can look at them in detail.

Best regards

Hi Lorenzo,

Thanks for replying. Here is the ROOT canvas storing the relevant histograms in my previous question, please check:
Validation_m1_no_iso_CR.root (25.7 KB)

Regarding this, my scaled histogram is actually from a predefined well-parametrized pdf which has no error. Should I still do this? If so, how?

Link to original script if that helps:



As I suspected, the bin errors in your template histogram make little sense. If it comes from a pdf that has no error, then you should set the bin errors to zero and run the Chi2 test with the option “UW”, since the other histogram seems to represent counts (i.e. is unweighted).
Note that if you are doing a one-sample chi2 test (i.e. testing a histogram against a function) you can also use the function TH1::Chisquare(function,"L"), where the function represents the pdf normalized to your data histogram. Note that I use the option "L" to get a chi2 built from a Poisson log-likelihood, see



Hi Lorenzo,

Thanks a lot for the explanation. Just a small clarification: I am not fitting the pdf/template to the data histogram here. The pdf/template was already fixed beforehand by fitting another control dataset. Here I am only validating the obtained pdf/template on the actual signal dataset, so I believe Chi2Test is more appropriate in this case.

Anyway, following your suggestion, I now set all template histogram bin errors to zero and use the UW option. I still can’t understand some of the results, for example this one:
Validation_m2_iso_CR_exclude_Jpsi.root (9.8 KB)
The returned Chi2Test with UW option is: Chi2 = 1244.618453, Prob = 6.57401e-148, NDF = 206
However, just eyeballing the plot, the agreement seems good enough. Why would it give such a small p-value, of order 1e-148?

The second example I can’t understand is this:
Validation_m1_iso_CR.root (9.8 KB)
Where the returned Chi2Test result is: Chi2 = 177.821857, Prob = 0.980964, NDF = 219
But looking at the plot, there is an obvious disagreement at 3 GeV (the J/psi region), yet the p-value is close to one. Any idea what caused this?



If you are using the second histogram as a function, it is not true that TH1::Chi2Test is more appropriate; actually it is the opposite. TH1::Chisquare with option "L" is more appropriate, especially when the data histogram has low statistics.
This is the case for your first example. The bins have very low statistics, and there it is much more appropriate to use TH1::Chisquare with option "L", which takes into account that the bin contents are Poisson distributed. This is why TH1::Chi2Test, which is based on the Pearson chi-square, gives nonsensical results: it assumes a Gaussian distribution in each bin.
In this example, using TH1::Chisquare I get chi2 = 125, which is more reasonable. However, keep in mind that since the statistics are low, the p-value derived from that chi-square will not be truly uniform in the [0,1] range under the null hypothesis either. It should eventually be calibrated with pseudo-experiments.
In case of low statistics and very small bin sizes, you could also use TH1::KolmogorovTest.

For the second case, again, the statistics are probably too low to detect the disagreement that exists.
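To illustrate the low-statistics point, here is a minimal standalone sketch (no ROOT needed) comparing a simplified Pearson statistic with the Baker-Cousins Poisson-likelihood statistic. The function names `pearsonChi2` and `bakerCousinsChi2` are made up for this illustration, and the Pearson form used here (expected count as variance) is a textbook simplification; it is not claimed to be the exact expression behind Chi2Test with option UW.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified Pearson chi-square: sum_i (n_i - mu_i)^2 / mu_i,
// with n_i the observed counts and mu_i the expected counts (template).
// Illustration only; not ROOT's "UW" formula.
double pearsonChi2(const std::vector<double>& n, const std::vector<double>& mu) {
    assert(n.size() == mu.size());
    double chi2 = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        assert(mu[i] > 0.0);
        const double d = n[i] - mu[i];
        chi2 += d * d / mu[i];
    }
    return chi2;
}

// Baker-Cousins chi-square from the Poisson likelihood ratio:
// 2 * sum_i [ mu_i - n_i + n_i * ln(n_i / mu_i) ],
// where the logarithmic term is taken as zero for empty bins (n_i = 0).
double bakerCousinsChi2(const std::vector<double>& n, const std::vector<double>& mu) {
    assert(n.size() == mu.size());
    double chi2 = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        assert(mu[i] > 0.0);
        chi2 += 2.0 * (mu[i] - n[i]);
        if (n[i] > 0.0) chi2 += 2.0 * n[i] * std::log(n[i] / mu[i]);
    }
    return chi2;
}
```

With these definitions, a single event in a bin where the template predicts 0.01 events contributes about 98 to the Pearson sum but only about 7.2 to the Baker-Cousins sum. A handful of such bins in a sparsely populated 220-bin histogram is enough to inflate a Pearson-type chi2 far beyond the number of bins.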



Hi Lorenzo,

How did you convert the template histogram into a TF1 function in this case?

Also, I checked the K-S test, and the results now start to make sense for most plots. But just to confirm, do the following two results make sense to you? It has been a bit hard for me to judge how much the event counts and the binning affect the two cases.
(1) Validation_m1_iso_CR_exclude_Jpsi.root (9.8 KB)
K-S test probability: 0.0171648

(2) Validation_m1_iso_CR.root (9.9 KB)
K-S test probability: 0.856499

Both (1) and (2) have the same range and binning (220 bins), but the data histogram has 90 events for (1) versus 151 for (2). As before, the template histogram is scaled to the number of data entries in each case. Should I trust the probability here?



You can make a TF1 object from a histogram as follows (supposing hist is your template histogram):

// wrap the template histogram as a function of x returning the content of the bin containing x
auto func = [&](double *x, double*) { int ibin = hist->FindBin(x[0]); return hist->GetBinContent(ibin);};
auto f1 = new TF1("f1",func, hist->GetXaxis()->GetXmin(), hist->GetXaxis()->GetXmax(), 0);
// compute Poisson Likelihood (Baker-Cousins) chi-square using the data histogram hdata
double chi2 = hdata->Chisquare(f1,"L");
// p-value, taking the number of bins as the number of degrees of freedom
double prob = TMath::Prob(chi2, hdata->GetNbinsX());

Concerning your other question, the problem is that the statistics for histogram (1) are too low. I would not trust that value much. As I said before, I would run some toys to calibrate the obtained p-value, both for the Baker-Cousins chi-square (TH1::Chisquare) and for TH1::KolmogorovTest.
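The toy calibration mentioned above can be sketched in standalone C++ as follows. This is only an illustration under stated assumptions, not the macro attached later in this thread: the names `poissonChi2` and `toyPValue` are hypothetical, each toy draws a fixed number of events from the template shape (mirroring a template scaled to the data entries), and the template `mu` is assumed to be normalized to that same number of events.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Baker-Cousins statistic: 2 * sum_i [ mu_i - n_i + n_i * ln(n_i / mu_i) ].
double poissonChi2(const std::vector<double>& n, const std::vector<double>& mu) {
    assert(n.size() == mu.size());
    double chi2 = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        chi2 += 2.0 * (mu[i] - n[i]);
        if (n[i] > 0.0) {
            assert(mu[i] > 0.0);  // an event in a bin with zero expectation is undefined here
            chi2 += 2.0 * n[i] * std::log(n[i] / mu[i]);
        }
    }
    return chi2;
}

// Calibrated p-value: fraction of toys whose statistic is at least as large
// as the observed one. Each toy throws nEvents events according to the
// template shape mu (assumption: fixed total, i.e. multinomial toys).
double toyPValue(const std::vector<double>& mu, double observedStat,
                 int nEvents, int nToys, unsigned seed) {
    assert(nToys > 0 && nEvents >= 0 && !mu.empty());
    std::mt19937 rng(seed);
    std::discrete_distribution<std::size_t> pickBin(mu.begin(), mu.end());
    std::vector<double> toy(mu.size());
    int nAbove = 0;
    for (int t = 0; t < nToys; ++t) {
        std::fill(toy.begin(), toy.end(), 0.0);
        for (int e = 0; e < nEvents; ++e) toy[pickBin(rng)] += 1.0;
        if (poissonChi2(toy, mu) >= observedStat) ++nAbove;
    }
    return static_cast<double>(nAbove) / nToys;
}
```

The observed statistic would be computed once from the real data and the template, then passed in as `observedStat`. The same loop can calibrate any other statistic by swapping the function called inside it.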


Hi Lorenzo,

Following your suggestion:

I get the result here for testing four sets of histograms/template: StatisticalTest.pdf (392.3 KB)

But as you said, I probably shouldn’t trust these reported values without toys. I ran TH1::KolmogorovTest with the “X” option to use toys, but it returned a probability of 0 for all four sets. Is the “X” option the right way to do the toy calibration?

Also, for TH1::Chisquare I could not find an option for toy calibration (it only has the “L” and “R” options). Could you point me to some relevant instructions?



It is true that the “X” option does not work for a one-sample test (i.e. when comparing a histogram with a function, or with a histogram that has zero errors).
There is also no such option for TH1::Chisquare.
I attach here a simple macro doing this test for one of the sets of histograms you attached before.


example_toys.C (2.0 KB)

Hi Lorenzo,

Thanks for the macro. Following this, I did 1M toys for both methods and here is a summary: StatTest_Toys.pdf (413.2 KB)

As before, there are four sets of comparisons. Pages 1-2 both have 42 data events (black histogram), and pages 3-4 both have 90 data events. The template/function (red histogram) is slightly different on each page. For each set, I also put the toy distributions of the test statistics on the right.

There is some inconsistency between the two methods. Maybe you can comment on this? For p.1, the results from both methods seem to be OK. But for p.2, the chi2 probability is far lower than that of the K-S test. For p.3-4 it is the other way around: the K-S test probability is far lower than the chi2 one on p.4.

In the end, we want to pick one test method and use it consistently for all four sets. What we care about most is the case where the two are incompatible but we declare them compatible, which seems very hard to test. The probabilities currently returned by both methods mainly address the opposite case, where the two are compatible but we declare them incompatible.

Since K-S finds the maximum distance between the CDFs of the two histograms, which is probably more susceptible to fluctuations at such low statistics, maybe the Chisquare test is more suitable for us? What do you think?


PS: We recently found a small mistake in the template (red) when doing a sanity check. After the correction, the tail at the right end is slightly enhanced relative to the left end. But this shouldn’t affect our previous discussion.


Unfortunately, when doing a goodness-of-fit test we don't have an alternative hypothesis, so there is no way to say that one test is more powerful than another.
In general one can say that the KS test looks more at the overall deviation of the data and is less sensitive to deviations in the tails.
The chi2 test is instead more sensitive to local deviations than the K-S test. However, given the low statistics, the interpretation may be difficult even when using toys.

I think that in none of the cases you have shown can one say there is an incompatibility. It is true that in the last case the KS test with toys gives a small p-value, but I am not sure one can claim that the histograms are incompatible.
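The maximum-distance idea behind the K-S test can be sketched in standalone C++ as below. This is an illustrative sketch on binned contents only; the function name `ksDistance` is hypothetical and the code does not attempt to reproduce what TH1::KolmogorovTest does internally or to turn the distance into a probability.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Maximum absolute difference between the normalized cumulative
// distributions of two histograms with identical binning.
double ksDistance(const std::vector<double>& data, const std::vector<double>& tmpl) {
    assert(data.size() == tmpl.size());
    const double sumData = std::accumulate(data.begin(), data.end(), 0.0);
    const double sumTmpl = std::accumulate(tmpl.begin(), tmpl.end(), 0.0);
    assert(sumData > 0.0 && sumTmpl > 0.0);
    double cdfData = 0.0, cdfTmpl = 0.0, dmax = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        cdfData += data[i] / sumData;  // running fraction of data up to bin i
        cdfTmpl += tmpl[i] / sumTmpl;  // running fraction of template up to bin i
        dmax = std::max(dmax, std::fabs(cdfData - cdfTmpl));
    }
    return dmax;
}
```

Because the distance is built from cumulative sums, an excess confined to a single bin can shift the CDF by at most that bin's fraction of the total, while a broad shift of the whole shape accumulates over many bins. That is one way to see why the two tests can rank the same four comparisons differently.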

See here for more notes on GoF tests



Thanks Lorenzo, the discussions with you are very helpful.
