How to use the errors provided by TFractionFitter

Hello,

I am doing a study that fits two MC histograms (signal and background) to a toy histogram using TFractionFitter. By repeating the fit on different toys generated from the same original distribution, I can obtain the pull distribution.

The way I compute the pull is described in the following steps:

  1. Use TFractionFitter to fit the signal and background templates to the toy.
  2. Obtain the fitted parameters and their errors.
  3. As the fitted parameters are fractions, convert them to scales:

scale_component = parameter_component * area_toy / area_component

where component is either signal or background.

  4. Compute the errors of the scales using propagation of uncertainties.
  5. Finally, compute the pull:

pull = (scale_component - 1) / error_component

The pulls are expected to follow a normal distribution. In my result the pulls do form a Gaussian with mean 0, but its standard deviation (~0.6) is smaller than the expected value of 1.0.

As I’m not good at statistics, I can only guess that I might have overestimated the error, but I don’t know where things went wrong. Could you please tell me how to correct it?

Thank you very much,
Hoa.

Hi @LongHoa; I am sure @moneta can help you with this.

Cheers,
J.

Hello,

I just want to add that the component templates I used are weighted, meaning that when I called TH1::Fill, the weights were different from 1. Could this be the reason the fit gave me the unexpected errors?

Thank you,
Hoa.

Hi,
It is possible that with a weighted fit the errors are not fully correct, due to the weights.
In addition, it is known that the TFractionFitter method underestimates the errors, because it does not take into account the fluctuations of the normalisation. In your case, however, it seems to me that you are estimating too large an error. When you generate the toys, do you keep the total number of events fixed, or do you let it vary?

Also, if you are fitting for the number of signal and background events, it may be better to perform a direct fit of the two components using the TF1NormSum class (see the ROOT tutorial fitNormSum.C) or using RooFit.

Best regards

Lorenzo

@Axel Out of curiosity, I checked the TFractionFitter description, and there is not a single place that warns the user.
This is not the first time that some “insider knowledge” about ROOT giving incorrect results is “hidden” from the “wide public”. Could you please make sure that all “known problems” are explicitly mentioned in the relevant places in the documentation?

Well, I think here we simply need to ask @moneta to share his “insider knowledge” in the documentation of TFractionFitter, possibly pointing out superior alternatives. @moneta, could you update the doc, please?

Hi,

It is in the documentation; see the Assumptions paragraph:

Biased fit uncertainties may result if these conditions are not fulfilled (see e.g. arXiv:0803.2711).

We need to add the link to that arXiv paper.

Lorenzo

Hi Lorenzo,

Thank you for your answer.

When you are generating the toys, are you considering fixed or varying the total number of events

The number of events is fixed.

Also, if you are fitting for the number of signal and background events, it may be better to perform a direct fit of the two components using the TF1NormSum class (see the ROOT tutorial fitNormSum.C) or using RooFit.

I will try TF1NormSum, as RooFit seems to give me worse fits (by eye) compared to TFractionFitter.


In addition, the result in my first post came from the official data and MC samples (so they have weights). I then tried the template fit with unweighted toy signal and background templates, and the pulls from the fit resembled the normal distribution quite well (with std ~0.9).

Thank you again, I will try your suggestion and update the result,

Best,
Hoa.

Hi,
If you could share your input histograms and your code, I could investigate this issue further.

Lorenzo

Hi Lorenzo,

Sorry for the late reply. Unfortunately I cannot share my inputs, as they stay on a local machine and the files are too large to be uploaded to lxplus (my code is also messy).


It seems that I over-estimated the error because my toy template is a sum of 2 histograms (I generated the signal part and the background part of the toy separately, then added them together).

In pseudo code, the one that doesn’t work (in either TFractionFitter or RooFit) is:

hist_toy_signal->FillRandom(hist_real_signal, nSignal);
hist_toy_background->FillRandom(hist_real_background, nBackground);
hist_toy_sum = (TH1D*)hist_toy_signal->Clone();
hist_toy_sum->Add(hist_toy_background);

The following seems to work well (so far):

hist_real_sum = (TH1D*)hist_real_signal->Clone();
hist_real_sum->Add(hist_real_background);
hist_toy_sum->FillRandom(hist_real_sum, nSignal + nBackground);

I’m trying to reproduce the issue without using real data; then I may share the code with you for investigation.

Thank you,
Hoa.

In both cases, right after “Clone” and before any “Add”, make sure you execute: hist_..._sum->Sumw2(kTRUE);

Hi,

I don’t think one needs to use Sumw2(kTRUE) here, because the histograms are not weighted, and adding histograms is fine.
@LongHoa, if I have understood you well, you are saying that you get different pulls from TFractionFitter depending on whether you use the first or the second case to generate the input data histogram for each pseudo-experiment.
I think the first case is not correct: you are fixing the number of background and signal events, whereas they should fluctuate according to a binomial distribution.
The second case is correct and should be used.

Best regards

Lorenzo

Hi Lorenzo,

Yes, that’s what I mean. The second case gives me the expected sigma of the pull distribution.

Please correct me if I’m wrong: the sum of 2 randomized histograms (1st case) doesn’t have the correct uncertainty, which results in miscalculating the error, and thus should not be used.
If this is the case, will:

hist_toy_sum = (TH1D*)hist_toy_signal->Clone();
hist_toy_sum->Add(hist_toy_background);
hist_toy_sum->Sumw2(kTRUE);

help, as @Wile_E_Coyote suggested?

Thank you very much,
Hoa.

The “Sumw2” call must appear before “Add” (as pointed out by @moneta, it is only required if any histogram is filled with “weights” not equal to 1).

Hi,

The sum of 2 histograms has the correct uncertainty computed, provided you call Sumw2 if your histograms are weighted. If they are not, you don’t need to.

The problem in your case is not the uncertainty of the histogram; it is in the procedure.
If you randomise h1 and h2 with n1 and n2 as numbers of events, you get a different result than randomising the sum h1+h2 with (n1+n2) events. The result will be the same only if n1=n2 and the histograms have the same integrals.

Lorenzo

Hi Lorenzo,

Thank you. I understand that by generating the signal part and the background part of the toy separately, the number of events in each part is fixed, but I still don’t understand how that affects the fit uncertainty.

By the way, as my problem is solved now, I mark the topic as “Solved”. Thank you all again for the help.

Best regards,
Hoa.

Generating the signal and background parts separately will not affect the fit uncertainty, but it will affect the fluctuations of the fitted values you obtain (they will be smaller), and this will affect your pull results.