I have a few questions about a quick short RooFit script below.
Test script
This script generates two Gaussian with the same shape. One with kN1
entries is used as a model PDF, and the other with kN2
entries is fitted with the model PDF.
void pdf(const int kN1, const int kN2) {
const double kXmin = -3;
const double kXmax = 3;
const int kNbins = 100;
TH1D* h1 = new TH1D("h1", "h1", kNbins, kXmin, kXmax);
h1->FillRandom("gaus", kN1);
TH1D* h2 = new TH1D("h2", "h2", kNbins, kXmin, kXmax);
h2->FillRandom("gaus", kN2);
RooRealVar x("x", "x", kXmin, kXmax);
RooRealVar norm("norm", "normalization", 0.5 * kN2, 0., 1.5 * kN2);
RooDataHist dh1("dh1", "dh1", x, h1);
RooDataHist dh2("dh2", "dh2", x, h2);
RooHistPdf pdf("pdf", "pdf", x, dh1);
RooAddPdf model("model", "model", RooArgList(pdf), RooArgList(norm));
model.fitTo(dh2);
RooPlot* xframe = x.frame(RooFit::Title("Gaussian Fit Test"));
dh2.plotOn(xframe);
model.plotOn(xframe);
xframe->Draw();
}
Q1.
When I execute the script repeatedly, the best-fit normalization parameter norm
always converges to 9.99986e+03 +/- 9.99948e+01, whereas the both Gaussian histograms are regenerated every time and thus the best-fit value should fluctuate accordingly.
Why do I always get the same best-fit value?
$ root
root [0] .x pdf.C(10000, 10000)
RooFit v3.60 -- Developed by Wouter Verkerke and David Kirkby
Copyright (C) 2000-2013 NIKHEF, University of California & Stanford University
All rights reserved, please read http://roofit.sourceforge.net/license.txt
[#1] INFO:DataHandling -- RooDataHist::adjustBinning(dh1): fit range of variable x expanded to nearest bin boundaries: [-3,3] --> [-3,3]
[#1] INFO:DataHandling -- RooDataHist::adjustBinning(dh2): fit range of variable x expanded to nearest bin boundaries: [-3,3] --> [-3,3]
[#1] INFO:Minization -- p.d.f. provides expected number of events, including extended term in likelihood.
[#1] INFO:Minization -- RooMinimizer::optimizeConst: activating const optimization
[#1] INFO:Minization -- The following expressions have been identified as constant and will be precalculated and cached: (pdf)
**********
** 1 **SET PRINT 1
**********
**********
** 2 **SET NOGRAD
**********
PARAMETER DEFINITIONS:
NO. NAME VALUE STEP SIZE LIMITS
1 norm 5.00000e+03 1.50000e+03 0.00000e+00 1.50000e+04
**********
** 3 **SET ERR 0.5
**********
**********
** 4 **SET PRINT 1
**********
**********
** 5 **SET STR 1
**********
NOW USING STRATEGY 1: TRY TO BALANCE SPEED AGAINST RELIABILITY
**********
** 6 **MIGRAD 500 1
**********
FIRST CALL TO USER FUNCTION AT NEW START POINT, WITH IFLAG=4.
START MIGRAD MINIMIZATION. STRATEGY 1. CONVERGENCE WHEN EDM .LT. 1.00e-03
FCN=-66095 FROM MIGRAD STATUS=INITIATE 4 CALLS 5 TOTAL
EDM= unknown STRATEGY= 1 NO ERROR MATRIX
EXT PARAMETER CURRENT GUESS STEP FIRST
NO. NAME VALUE ERROR SIZE DERIVATIVE
1 norm 5.00000e+03 1.50000e+03 2.14402e-01 -7.07110e+03
ERR DEF= 0.5
MIGRAD MINIMIZATION HAS CONVERGED.
MIGRAD WILL VERIFY CONVERGENCE AND ERROR MATRIX.
COVARIANCE MATRIX CALCULATED SUCCESSFULLY
FCN=-68026.5 FROM MIGRAD STATUS=CONVERGED 26 CALLS 27 TOTAL
EDM=2.09874e-06 STRATEGY= 1 ERROR MATRIX ACCURATE
EXT PARAMETER STEP FIRST
NO. NAME VALUE ERROR SIZE DERIVATIVE
1 norm 9.99986e+03 9.99948e+01 2.54498e-03 -1.02441e-01
ERR DEF= 0.5
EXTERNAL ERROR MATRIX. NDIM= 25 NPAR= 1 ERR DEF=0.5
1.000e+04
**********
** 7 **SET ERR 0.5
**********
**********
** 8 **SET PRINT 1
**********
**********
** 9 **HESSE 500
**********
COVARIANCE MATRIX CALCULATED SUCCESSFULLY
FCN=-68026.5 FROM HESSE STATUS=OK 5 CALLS 32 TOTAL
EDM=2.09792e-06 STRATEGY= 1 ERROR MATRIX ACCURATE
EXT PARAMETER INTERNAL INTERNAL
NO. NAME VALUE ERROR STEP SIZE VALUE
1 norm 9.99986e+03 9.99949e+01 1.01799e-04 3.39816e-01
ERR DEF= 0.5
EXTERNAL ERROR MATRIX. NDIM= 25 NPAR= 1 ERR DEF=0.5
1.000e+04
[#1] INFO:Minization -- RooMinimizer::optimizeConst: deactivating const optimization
Info in <TCanvas::MakeDefCanvas>: created default TCanvas with name c1
Q2.
As the number of entries of the data histogram, kN2
, is 10000 in the above example, the obtained error estimate 9.99948e+01 is a reasonable value, which should be approximately sqrt(10000). However, the statistical fluctuation in the model PDF looks to be not taken into account in this fit. If it is also considered, the error estimate will be about sqrt(2)*sqrt(10000).
How do I also take into account the statistical fluctuation of the model PDF?
Q3.
When kN1
is small, say, 1000, I often get MIGRAD warning and wrong best-fit normalization result 1.91599e+03 +/- 5.95010e+03 (should be ~10000).
root [1] .x pdf.C(1000, 10000)
(snip)
[#0] WARNING:Minization -- RooMinimizerFcn: Minimized function has error status.
Returning maximum FCN so far (-1e+30) to force MIGRAD to back out of this region. Error log follows
Parameter values: norm=5000
RooAddPdf::model[ norm * pdf ]
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
getLogVal() top-level p.d.f evaluates to zero @ !refCoefNorm=(), !pdfs=(pdf = 0/1000), !coefficients=(norm = 5000)
RooNLLVar::nll_model_dh2[ paramSet=(norm) ]
function value is NAN @ paramSet=(norm = 5000)
(snip)
MINUIT WARNING IN HESSE
============== Second derivative zero for parameter1
MNHESS FAILS AND WILL RETURN DIAGONAL MATRIX.
FCN=-1e+30 FROM HESSE STATUS=FAILED 3 CALLS 28 TOTAL
EDM=0 STRATEGY= 1 ERROR MATRIX UNCERTAINTY 100.0 per cent
EXT PARAMETER APPROXIMATE INTERNAL INTERNAL
NO. NAME VALUE ERROR STEP SIZE VALUE
1 norm 1.91599e+03 5.95010e+03 0.00000e+00 -3.39837e-01
ERR DEF= 0.5
EXTERNAL ERROR MATRIX. NDIM= 25 NPAR= 1 ERR DEF=0.5
5.000e+07
ERR MATRIX APPROXIMATE
This is probably because the model PDF histogram has a few empty bins, and division by zero happens inside the likelihood calculation, I guess. However, it is likely that a model PDF made from a histogram filled with low-statistic experimental data can have a few empty bins like this example script.
Is there any way to avoid the above MIGRAD warning and wrong minimization?
Q4.
I always get INFO print below.
[#1] INFO:DataHandling -- RooDataHist::adjustBinning(dh1): fit range of variable x expanded to nearest bin boundaries: [-3,3] --> [-3,3]
[#1] INFO:DataHandling -- RooDataHist::adjustBinning(dh2): fit range of variable x expanded to nearest bin boundaries: [-3,3] --> [-3,3]
I do not understand why the variable range is extended from the original one [-3,3]
to the exactly same range [-3,3]
. Is it OK to ignore this message?