RooFit Chi2 fit to binned data yield suspicious results

GiacomoBoldrini · June 24, 2019, 10:57am

Hello everyone.
Before everything I’m new to RooFit so I chose the newbie category.
I have a toy MC generated from a 2d distribution with 3 parameters and another one for scale in the following way (t, p = theta and phi independent variables):

RooRealVar  t("t","t",0,2*M_PI) ;
RooRealVar  p("p","p", 0, 2*M_PI) ;
    
//Defining Parameters according to RooFit.
//Initial value is the 3rd argument, 4th and 5th arg
//are the range of existance of the parameter.
RooRealVar alpha("alpha", "alpha", 0.65, 0.62, 0.66);
RooRealVar beta("beta", "beta", 0.06, 0.04, 0.075);
RooRealVar gamma("gamma", "gamma", -0.18, -.2, -0.16);
RooRealVar scale("scale", "scale", 5., 0., 10.);

RooAbsPdf* genpdf = RooClassFactory::makePdfInstance("GenPdf","scale*(3./(4.*M_PI))*(0.5*(1.-alpha) + (0.5)*(3.*alpha-1)*cos(t)*cos(t) - beta*sin(t)*sin(t)*cos(2.*p)- sqrt(2.)*gamma*sin(2.*t)*cos(p))",RooArgSet(t,p,alpha, beta, gamma, scale)) ;

// Generate a toy MC dataset from the interpreted p.d.f
//data will contain unbinned data.
RooDataSet* data = genpdf->generate(RooArgSet(t,p),50000) ;

Now I can do an unbind likelihood fit and everything seems to work ok(results at the end):

RooFitResult* r_ML = genpdf->fitTo(*data, Save()) ;

Then I create a binned dataset (resolution on theta and phi of 0.01 rad) and perform a chi2 but the results are quite different/suspicious from what I expected:

t.setBins(624);
p.setBins(624);
RooDataHist data_hist("data_hist","binned version of data",RooArgSet(t,p),*data) ;
RooAbsReal* chi2 = genpdf->createChi2(data_hist, Extended(true), DataError(RooAbsData::Expected), PrintLevel(1), Save());
RooMinimizer n(*chi2);
n.migrad() ;
n.hesse() ;
RooFitResult* r_chi2 = n.save();

Here are the results of the RooFitResults.

==> Chi2 Fit results

  RooFitResult: minimized FCN value: 0, estimated distance to minimum: 0
                covariance matrix quality: Approximation only, not accurate
                Status : MIGRAD=0 HESSE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                 alpha    6.3022e-01 +/-  2.50e-02
                  beta    6.1174e-02 +/-  2.75e-02
                 gamma   -1.7819e-01 +/-  3.06e-02
                 scale    8.7985e+00 +/-  8.21e+00

==> ML Fit results

  RooFitResult: minimized FCN value: 176472, estimated distance to minimum: 2.35474e-06
                covariance matrix quality: Full, accurate covariance matrix
                Status : MINIMIZE=0 HESSE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                 alpha    6.4211e-01 +/-  3.41e-03
                  beta    5.9586e-02 +/-  2.77e-03
                 gamma   -1.7837e-01 +/-  1.95e-03
                 scale    5.7338e+00 +/-  4.16e+00

Moreover, some not pleasant messages appear while it is chi2 fitting:

[#0] ERROR:Eval -- RooChi2Var::RooChi2Var(chi2_GenPdf_data_hist) INFINITY ERROR: bin 2 has zero error

I suppose it has something to do with empty bins (from other posts) but I do not know how to solve this. Is anything allright or shll I change something?
Thank you in advance.

Giacomo

Axel · June 24, 2019, 1:33pm

@StephanH can you have a look here?

StephanH · June 25, 2019, 9:58pm

Hi Giacomo,

very well-written question! No need to put it in the Newbie category.

It’s true that the Chi2 fit result is off. In particular, you cannot reasonably use the errors, as the covarinance matrix couldn’t be determined.
Further, empty bins are indeed a problem. Since you need to divide by the bin error (variance, in fact) to obtain Chi2, it’s indeed unclear what the error of the zero bin should be.

I have three ideas:

Use a Range. Maybe you manage to exclude the evil bins.
Try the different error definitions. https://root.cern.ch/doc/master/classRooAbsPdf.html#a37b90418e779e3ecd92d43c42745d672
Manually put a small value/error into the bin.

GiacomoBoldrini · June 27, 2019, 9:55am

Thank you Stephan.
I tried the 3rd option (which I thought it had to work even if it couldd lead to biased results) but no results actually… Here is what I did:
Here’s the new datahist I created from a TH2F where I iterate over every bin and set the bin width as the error if the bin is empty [Noge carefully that I have 628x628 bins…Isn’t it too much?]:

TH2F* dh2_chi = new TH2F("h2_chi", "h2_chi",628, 0, 2*M_PI, 628, 0, 2*M_PI);
    data->fillHistogram(dh2_chi, RooArgList(t,p));

    std::cout << "Imposing errors" << "\n";
    for(int i = 1; i < dh2_chi->GetNbinsX()+1; i++){
      for(int j = 1; j <  dh2_chi->GetNbinsY(); j++){
          int bin = dh2_chi->GetBin(i,j);
          if(dh2_chi->GetBinContent(bin) == 0){
            dh2_chi->SetBinError(bin, 0.01);          
            }
      }
    }

    RooDataHist data_hist("data_hist","binned version of data",RooArgList(t,p),dh2_chi) ;

Then as before I tried the chi2 fit as follows:

RooAbsReal* chi2 = genpdf->createChi2(data_hist, Extended(true), DataError(RooAbsData::Auto), PrintLevel(1), Save());
    RooMinimizer n(*chi2);
    n.migrad() ;
    n.hesse() ;
    RooFitResult* r_chi2 = n.save();

At this point I tried every possible DataError (Auto, Expected, SumW2) but with the first two I keep getting the same message as before:

[#0] ERROR:Eval -- RooChi2Var::RooChi2Var(chi2_GenPdf_data_hist) INFINITY ERROR: bin 2 has zero error

Meanwhile with the SumW2 Error I do not get that message but the minimisation does not converge:

**   16 **HESSE        1500
 **********
 MINUIT WARNING IN HESSE
 ============== Second derivative zero for parameter1
  MNHESS FAILS AND WILL RETURN DIAGONAL MATRIX. 
 FCN=50000 FROM HESSE     STATUS=FAILED          3 CALLS         107 TOTAL
                     EDM=0    STRATEGY= 1  ERROR MATRIX UNCERTAINTY 100.0 per cent
  EXT PARAMETER                APPROXIMATE     INTERNAL      INTERNAL  
  NO.   NAME      VALUE            ERROR       STEP SIZE       VALUE   
   1  alpha        6.32321e-01   2.58180e-02   0.00000e+00   1.05936e-01
   2  beta         5.95737e-02   1.72670e-02   0.00000e+00  -2.36301e-01
   3  gamma       -1.78372e-01   3.05319e-02   0.00000e+00   8.14795e-02
 EXTERNAL ERROR MATRIX.    NDIM=  25    NPAR=  3    ERR DEF=1
  7.911e-04  0.000e+00  0.000e+00 
  0.000e+00  2.954e-04  0.000e+00 
  0.000e+00  0.000e+00  7.947e-04 
ERR MATRIX APPROXIMATE
 PARAMETER  CORRELATION COEFFICIENTS  
       NO.  GLOBAL      1      2      3
        1  0.00000   1.000  0.000  0.000
        2  0.00000   0.000  1.000  0.000
        3  0.00000   0.000  0.000  1.000
 ERR MATRIX APPROXIMATE

The Hessian is fully diagonal forever FCN = number of data in the histogram. The Status Is failed. The parameters nonetheless are not far from the true one and within their errors! Is this normal?
To be sure I also used ROOT to chi2 fit. In fact, as I know ROOT and RooFit chi2 formula are equal the only difference is in treating empty bins (correct me if I’m wrong). Here’s the output:

//defining a TF2 object with estimated chi2 parameters
    const int npar_chi = 4;
    double dist_params_chi[npar_chi] = {alpha_est_chi, beta_est_chi, gamma_est_chi, 5};
    TF2* myPdf_chi = new TF2("myPdf_chi", myPDF, 0, 2*M_PI, 0, 2*M_PI, npar_chi);
    myPdf_chi->SetParameters(dist_params_chi);
    myPdf_chi->SetParNames("#alpha", "#beta", "#gamma", "#scaling");


    //plotting chi2 results
    TCanvas* from_chi2 = new TCanvas("from_ML", "from_ML", 1000,1000,1000,800);
    dh2_chi->Fit(myPdf_chi);

Results on terminal:

 FCN=49999.2 FROM MIGRAD    STATUS=CONVERGED     162 CALLS         163 TOTAL
                     EDM=9.33674e-07    STRATEGY= 1      ERROR MATRIX ACCURATE 
  EXT PARAMETER                                   STEP         FIRST   
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE 
   1  #alpha       6.57498e-01   2.45317e+00   9.67434e-02   2.23753e-04
   2  #beta        5.42563e-02   1.30304e+00   1.25570e-01   9.29672e-05
   3  #gamma      -1.89619e-01   8.04239e-01   7.85479e-02  -1.72380e-03
   4  #scaling     1.39232e-04   4.44031e-04   1.70329e-05   3.83498e+00

The errors are completely off scale.

I don’t know If I can limit myself to a range. I have too many empty bins with a 628x628 binning…So basically 1 in 2 bins is empty roughly.
What is happening here?

StephanH · June 28, 2019, 4:19pm

Hi Giacomo,

I can only explain why setting the DataError to Auto, Expected doesn’t work:
In both cases, the error is taken as sqrt(bin content) == 0, and when dividing by the error, you divide by 0.
For SumW2, I don’t know what’s happening. The Chi2 test statistic is rarely used, and I only have experience with log-likelihood. Maybe the vanishing second derivative doesn’t have to do with the error definition, maybe it has. Wouter Verkerke might be able to help, but he is very hard to reach.

What might help is to put a very large error for empty bins. In that case, they don’t contribute to the fit because they get weighted by 1 / err^2. And empty bin means bins where the PDF is zero.

GiacomoBoldrini · June 29, 2019, 8:43pm

Hello StepanH.
Thank you very much for the explanation.
I solved the problem reducing the number of bins down to 100x100 and fitting with:

RooFitResult* r_chi2 = genpdf->chi2FitTo(data_hist, Save());

now the minimisation runs perfectly and the results agree with my expectations.

But still the previous methods did not work even with 100 bins (and so do the normal ROOT chi2 routine)…I don’t know but it’s done and hopefully correct! Thank you again.

Giacomo

system · July 13, 2019, 8:44pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.