RooFit: Integral error and manual question

astauffe · July 15, 2021, 3:01pm

Hi everybody,

I am having trouble evaluating the error of one component of a “RooAddPdf” model with a fit result. The GetVal is correct but the error returns always 0. Providing a reproducible example would be a bit tedious, so I will try to first explain what I do with a few lines as probably it is just something stupid from my side.
Firstly, I build two pdfs out of a respective histogram:

RooHistPdf hist_pdf_sig("hist_pdf_sig", "hist_pdf_sig", x, hist_sig, 0);
RooHistPdf hist_pdf_bkg("hist_pdf_bkg", "hist_pdf_bkg", x, hist_bkg, 0);

Secondly, I add an amplitude to each pdf and add them together to build my model, which I then fit to some data (histogram):

RooRealVar N_sig("N_{sig}", "N_{sig}", 1, 0, 2000);
RooRealVar N_bkg("N_{bkg}", "N_{bkg}", 1, 0, 1e+08);
RooExtendPdf extended_pdf_sig("extended_pdf_sig", "extended_pdf_sig", hist_pdf_sig, N_sig);
RooExtendPdf extended_pdf_bkg("extended_pdf_bkg", "extended_pdf_bkg", hist_pdf_bkg, N_bkg);
RooAddPdf model("model", "model", RooArgList(extended_pdf_bkg, extended_pdf_sig), RooArgList(N_bkg, N_sig));
RooFitResult *fit_result_data = model.fitTo(hist_data, Range("fullRange"), Save(), PrintLevel(-1));

And then lastly, I would like to extract the integral of only one component pdf in a given range with the correct error:

RooAbsReal *integral_bkg = extended_pdf_bkg.createIntegral(x, NormSet(x), Range("signal"));
Double_t integral_bkg_value = integral_bkg->getVal();
Double_t integral_bkg_value_error = integral_bkg->getPropagatedError(*fit_result_data, x);
cout << " Background Integral: " << integral_bkg_value << " +/- " << integral_bkg_value_error << endl;

This always return me “xyz ± 0”. From a control experiment I know that the getVal “xyz” is correct. However, I do not understand why the getPropagatedError is always 0? My best guess is that the model has two parameters and the fit-result contains a 2x2 covariance matrix, but since I fit only one component pdf (with just one parameter) maybe this confuses the error propagation? But I also though that the call to “fitTo” would set all the parameters in the model correctly and also in the components “RooExtendPdf” (so the amplitude N_sig). What am I doing wrong? Or how does one extract the propagated error correctly? For a model consisting only of one RooExtendPdf I get the right result this way. Thanks a lot for the help.

Oh and also as a random question. I have been reading through the RooFit Users Manual (v2.91). I really appreciate this great manual and it help me a lot in understanding how RooFit/RooStat works. Is there already a rough timeline set when the rest of the chapters will be filled? Especially chs. 12-14 would interest me very much. Or is it possible to get a non-polished early version or so, if available? That would help me out a ton

Thanks a lot

jalopezg · July 15, 2021, 3:28pm

Hi @astauffe; I am sure either @jonas or @moneta can help you with this.

Cheers,
J.

jonas · July 16, 2021, 1:30pm

Hi @astauffe!

When you create an integral of a pdf, the associated number of events for the extended likelihood fit is not considered. Since the shape of your background pdf is fixed by the input histogram, the integral has always no uncertainty no matter what the range is (the full range would given an integral value of 1.0 by definition of a pdf in RooFit).

If you want to get the background event yield in the signal region, you first have the multiply the integral with the number of background events from the fit, and then evaluate this product with the propagated error. You can do this with a RooProduct:

    RooAbsReal *integral_bkg = hist_pdf_bkg.createIntegral(x, NormSet(x), Range("signal"));
    RooProduct backround_yield{"backround_yield", "backround_yield", {*integral_bkg, N_bkg}};
    Double_t integral_bkg_value = backround_yield.getVal();
    Double_t integral_bkg_value_error = backround_yield.getPropagatedError(*fit_result_data, x);
    cout << " Background Integral: " << integral_bkg_value << " +/- " << integral_bkg_value_error << endl;

I hope this solution works for you, otherwise let me know if you have further questions.

I don’t think the RooFit users manual will be updated any time soon by the way, and there is also no unpolished early version. I think ROOT goes away from the PDF user guides and we are focusing on the tutorials now, and if there will be an updated guide it will be on the website appearing here: Topical Manuals - ROOT.

Cheers,
Jonas

astauffe · July 19, 2021, 9:28am

Hi @jonas

Thank you very much for your answer! Ah, I see that I then have misunderstood how exactly the pdf in a model works. Just to clarify then, when ones plots a model with the option “VisualizeError(fit_result)” the visualized error in each bin would in the above case then just be the error of the amplitude (N_bkg) times the integral in the respective bin?

Also, thanks for your proposed solution. However, in my case I am unfortunately interested not in the background event yield but in the fraction of background events in the signal region. This should naively just be the integral of the pdf in the signal range. But as you explained me, this will have 0 error per definition. I even tried cheating and propagating the error of (N_bkg * integrated_pdf_signal_region) / (N_bkg * integrated_pdf_full_range) which should give Sqrt(2) * integrated_pdf_signal_region * N_bkg_error / N_bkg, but this error is too large. Hmm, so I am wondering now what the proper way is to extract the error on the fraction of background events in the signal region? Maybe you have a better approach?

Maybe I am choosing the wrong approach? So, if I understood correctly, when the pdf has an error of 0; this means that as soon as I initialise it with:

RooHistPdf hist_pdf_sig("hist_pdf_sig", "hist_pdf_sig", x, hist_sig, 0);

the errors of each bin of the original histogram are neglected, right? Maybe then the right approach would be to propagate these bin errors with the use of the Barlow-Beeston method from rf709__BarlowBeeston_8C? Then every bin of the pdf gets a scaling parameter associated to it, which I hope can keep track of the error?
Therefore I would use:

RooParamHistFunc pdf_parametrized_histo_sig("pdf_parametrized_histo_sig", "pdf_parametrized_histo_sig", hist_sig);
RooParamHistFunc pdf_parmetrized_histo_bkg("pdf_parmetrized_histo_bkg", "pdf_parmetrized_histo_bkg", hist_bkg);
RooRealVar Amplitude_sig("Amplitude_sig", "Amplitude_sig", 1, 0.01, 5000);
RooRealVar Amplitude_bkg("Amplitude_bkg", "Amplitude_bkg", 1, 0.01, 5000);
RooRealSumPdf model_unconstrained("summed_pdf_parametrized_histo", "summed_pdf_parametrized_histo", RooArgList(pdf_parmetrized_histo_bkg, pdf_parametrized_histo_sig), RooArgList(Amplitude_bkg, Amplitude_sig), true);
// And then to add the "constraints"
RooHistConstraint histo_constraint_sig("histo_constraint_sig", "histo_constraint_sig", pdf_parametrized_histo_sig);
RooHistConstraint histo_constraint_bkg("histo_constraint_bkg", "histo_constraint_bkg", pdf_parmetrized_histo_bkg);
// _BB -> Barlow-Beeston
RooProdPdf model_BB("model_BB", "model_BB", RooArgSet(histo_constraint_bkg, histo_constraint_sig), Conditional(model_unconstrained, x));

And then fittings this model instead and extracting the fraction of background events in the signal region from the “RooParamHistFunc” instead of the “RooExtendPdf” as before. Would this be (more) correct? But I don’t understand exactly what RooFit does under the hood when one supplies the "RooHistConstraint"s. Maybe you could explain me? I just adjusted this from the tutorial. Also I presume this would not solve the problem either as for every future fit of this model, the next fit would adjust the scaling parameters in each bin to the newly supplied data, right? So then maybe, one has to go through the trouble of fixing all scaling parameters after the initial fit… Or can one then use the “RooParamHistFunc” after initial fit as building blocks of a new “RooAddPdf”? But I guess this would then again remove any error of the integrals…

Sorry for all these inconveniences. I am just really confused on what is the right way of extracting a pdf from a histogram while also keeping track of the original errors and then fitting a model to several datasets and extracting the integral value in the signal region with the correct error.

And regarding the RooFit Manual: Ok, I understood. Thanks

jonas · July 20, 2021, 2:14pm

Hi @astauffe,

Just to clarify then, when ones plots a model with the option “VisualizeError(fit_result)” the visualized error in each bin would in the above case then just be the error of the amplitude (N_bkg) times the integral in the respective bin?

That is correct, as the plotting and also VisualizeError considers the expected number of events, unlike the integral from createIntegral().

I even tried cheating and propagating the error of (N_bkg * integrated_pdf_signal_region) / (N_bkg * integrated_pdf_full_range) which should give Sqrt(2) * integrated_pdf_signal_region * N_bkg_error / N_bkg, but this error is too large.

That’s surprising. This “cheat” should not work, because there is only an uncertainty in N_bkg that cancels out because it appears both in the numerator and the denominator. So in the end you should get again zero How did you implement it to get the non-zero error that is too large? Have you tried a RooFormulaVar? Like this you can get the correctly propagated error:

    RooAbsReal *integral_bkg_full = extended_pdf_bkg.createIntegral(
                                        x, NormSet(x));
    RooAbsReal *integral_bkg = extended_pdf_bkg.createIntegral(
                                        x, NormSet(x), Range("signal"));

    RooFormulaVar bkg_frac_in_sig_region("bkg_frac_in_sig_region",
           "x[0] * x[1] / (x[0] * x[2])",
           RooArgList(N_bkg, *integral_bkg, *integral_bkg_full));

std::cout << "Fraction of background in signal region:"
          << bkg_frac_in_sig_region.getVal() << " +/- "
          << bkg_frac_in_sig_region.getPropagatedError(*fit_result_data, x)
          << std::endl;

But I don’t understand exactly what RooFit does under the hood when one supplies the "RooHistConstraint"s. Maybe you could explain me?

First of all, your idea inspired by the Barlow-Beeston tutorial is the right approach when you have a template histogram with an independent statistical uncertainties in each bin. Is that the case?

What happens “under the hood” is that the true number of counts predicted for each bin are allowed to vary away from the template values. Your full likelihood then will include the likelihood to have obtained the count in the template histogram, given this floating true number of counts. This additional term in the likelihood is the “RooHistConstraint”. This is the correct approach if the template is itself the result of a binned counting experiment. Is that the case for you?

Also I presume this would not solve the problem either as for every future fit of this model, the next fit would adjust the scaling parameters in each bin to the newly supplied data, right? So then maybe, one has to go through the trouble of fixing all scaling parameters after the initial fit…

Here is where I don’t understand anymore where you are going. If your template has statistical uncertainties, then you are never allowed to fix the scaling parameters. Because what measurement can you do to completely eliminate the statistical uncertainty on the template predictions? If such a measurement would be possible, you wouldn’t need the template to begin with? Sorry I’m confused. Can you maybe explain more precisely what measurements and fits you are doing? Then I might be able to help you more if you still have questions.

Cheers,
Jonas

astauffe · July 21, 2021, 12:30pm

Hi @jonas,
Thank you very much for your reply! It clarified a lot again.

Regarding the error:
My naive “cheat” was done by hand, nothing noteworthy. I just artificially propagated the errors of the “N_bkg” in the numerator and denominator by hand, which, as you said, does not really make sense… I just wanted to see the output.
Your approach with the RooFormulaVar is very nice and consistent. I implemented it and the output was “0.000236859 ± 1.35525e-20”. So indeed an error which is practically 0… Thanks. I think my confusion in this regard originated from the following. First (before the model in my original question) I had a much simpler model to test some things:

RooRealVar mean_bkg("#mu_{bkg}", "mean of gaussian", 0.35, xMin, xMax);
RooRealVar sigma_bkg("#sigma_{bkg}", "width of gaussian", 0.5, 0, 10);
RooGaussian gauss_bkg("gauss_bkg", "gaussian", x, mean_bkg, sigma_bkg);
RooRealVar n_bkg("N_{bkg}", "number of bkg events", 100000, 0., 1e+08);
RooExtendPdf ex_pdf_bkg("ex_pdf_bkg", "extended bkg p.d.f", gauss_bkg, n_bkg, "fullRange");

So just a simple gaussian background. After fitting this to some histogram, I was then able to extract the fraction of background leakage in the exact same way as in my original question and got an error which made sense as:

RooFitResult * fit_result_bkg = ex_pdf_bkg.fitTo(hdata, Range("sideband"), Save(), PrintLevel(-1));
RooAbsReal * integ_bkg_Gauss_signal_region = ex_pdf_bkg.createIntegral(x, NormSet(x), Range("signal"));
std::cout << "Fraction of background in signal region: "
          << integ_bkg_Gauss_signal_region->getVal() << " +/- "
          << integ_bkg_Gauss_signal_region->getPropagatedError(*fit_result_bkg, x)
          << std::endl;

which returns “0.000234738 ± 8.30063e-06”. Comparing with the output of the more complicated model, the estimated value is very similar (as they should), but the error is only useful in the case of the simpler model. This is were my original problem came from. To my understanding, one model is “amplitude * gaussian-pdf” and the other is “amplitude_1 * histogram_pdf_1 + amplitude_2 * histogram_pdf_2” and in both cases, after fitting, I try to the extract the fraction of background in the signal region (and its error) from the “createIntegral” method applied to the extended pdf part of the model. I guess I expected the same output and somehow I am still confused As you said, I understand that the “histogram_pdf” has no integral error by construction, but the gaussian pdf seems to have. Is my confusion understandable? So my problem basically was “how can I reproduce the error from the simple gaussian model in my model?”.

Regarding the Barlow-Beeston approach:
Thank you very much for explaining the principle; this was already helpful. And sorry that I didn’t provide enough context. I will try to do so. Basically, I am working with histograms of a discriminator variable. The distribution of the discriminator variable comes form a counting experiment, yes (binned for now, but could also be unbinned if needed). I want to constrain the pdf components of my model “A_bkg * pdf_bkg + A_sig * pdf_sig” from input histograms → “A_bkg * hist_pdf_bkg + A_sig * hist_pdf_sig”. Finally, I would like to fit this model (just 2 parameters, the amplitudes) to other histograms and extract the amplitudes A_sig, A_bkg as well as the integral of the bkg-component in the signal region, all with useful errors. Therefore, I assumed that the histogram-pdf components would keep track of the initial bin errors (e. g. poissonian) of the input histograms (to construct the histogram-pdfs) and hence, after fitting, the resulting integral of the bkg-component would also have an error origining from the errors of the original histogram (to construct the bkg-pdf). Is this understandable? So far my only problem is recovering the error on the bkg fraction in the signal region.
The Barlow-Beeston method was only used to check if the scaling parameters of all the bins would not be too different from 1 when I fit my model to different data. If the case, then I consider the model to be good. I also wanted to experiment with it to see if I somehow have to use the RooParamHistFunc to propagate the errors of the initial histogram.
Hopefully this helps to resolve my question?
Thanks a lot already.

Cheers,
Alex

astauffe · July 21, 2021, 4:07pm

Just as a quick visual aid in case this helps. Below is a plot of the signal region after the fit:

Red is the Bkg template (A_bkg * pdf_bkg), green is the Signal template (A_sig * pdf_sig), black is the data, blue is the model (bkg template + signal template) and turqoise is the model’s error (via VisualizeErrors()). As we discussed, these errors are just coming from the error of the amplitudes. What I would intuitively like to do is get the fraction of the Bkg component in the signal region with its associated error coming from the errors of the model.

astauffe · August 3, 2021, 11:58am

Hi everybody,
is it possible to still answer this question? Basically, my question was: “How is it possible to reproduce the error of the integral (in a subrange) of the GaussianPDF with my HistogramPDF?”. In my opinion there should be an error in the subrange. Or differently put, one could sample from the GaussianPDF and use the resulting generated histograms as a new PDF and should be able to roughly reproduce the same error as with the RooGaussian->GetPropagatedError, no?
I am sorry if I provided too much or little detail or asked my question in the wrong way. Also I am using Root v6.24/02 and if necessary I could come up with a minimal working example.
Best,
Alex

astauffe · August 16, 2021, 1:36pm

Is it still possible to get an answer on the reproducablity of the error of the integral leakage (percentage of integral in a subrange) of the GaussianPDF with the HistogramPDF?
In my opinion there should be an error on this leakage even though the HistogramPDF is normalized. In the case of the GaussianPDF the error comes from the covariance-matrix of the fit result. Naively, I assumed that the HistogramPDF would somehow keep track of the original (poissonian) errors of the input-histogram (which would then lead to some error in the integration of the HistogramPDF).
But if I understood @jonas correctly, the condition that the whole integral should equal to 1 leads to forgetting the initial bin errors? Meaning that I can not reproduce the error of the GaussianPDF… I can live with that, I would just like to make sure I didn’t miss any way to reproduce my error.

So basically, this means that if I want to extract a meaningful error on a signal percentage/leakage in a subrange then I have to either do a fit or use one of the pre-existing pdf classes like “RooGaussian”, but I can not use a histogram as a template?

system · September 8, 2021, 6:15am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.