Creating Asimov dataset from background-only fit to estimate the POI uncertainty

Dear RooStats experts,

I’m using HistFitter framework build on top of the HistFactory package to create RooFit workspaces and perform fits.

I create a setup with 3 control regions (CR) and a signal region (SR) and then fit in the control regions only. After that I want create an Asimov dataset using POI = 1, norm factors and nuisance parameters values from the CR fit and use this dataset in CR and SR to estimate the POI estimate error. I do it with the following code

  TFile *file = TFile::Open("path/to/file.root");
  RooWorkspace* w = (RooWorkspace *)file->Get("w"); 

  TString dataName ="obsData";
  RooAbsData * data = w->data(dataName);
 
  TString modelSBName = "ModelConfig";
  ModelConfig* sbModel = (ModelConfig*) w->obj(modelSBName);

  RooRealVar *mu = (RooRealVar*)w->obj("mu_EWK");
  mu->setConstant(false);
  mu->setVal(1);

  RooArgSet *allParams = sbModel->GetPdf()->getParameters(*data);
  RooArgSet globObs("globObs");
  RooAbsData *asimov = RooStats::AsymptoticCalculator::MakeAsimovData(*sbModel, *allParams, globObs);

  RooAbsPdf *simPdf = w->pdf("simPdf");
  RooFitResult* result = simPdf->fitTo(*asimov, Save(), SumW2Error(true));
  result->Print("V");

But for some reason I sometimes get an estimate of POI != 1. Is it possible to fit to Asimov data and not get a value of the parameter you have used to create the Asimov data? Or am I creating the Asimov data wrong?

Thanks in advance,
Aleksandr

Hi Aleksandr,

that’s indeed surprising. I have some ideas/questions:

  • The code you posted runs after the control-region fit, I assume?
  • Do the parameters make sense before you generate the Asimov data? Use allParams->Print("V") to print them.
  • Is your model using nuisance parameters with constraint terms? The observables of these constraint terms should be in the set of global observables. The global observables the you are passing are empty.

Hi, thanks for answering.

Yes.

I’ve check the output of the Print(“V”) and everything seems to be fine.

Yes, I do use NPs with constraint terms. What would be the proper way to extract the global observable values from the workspace? Is it sbModel->GetPdf()->getObservables(*data)?

Ok for the first two questions. Global observables are all variables of the model that are not observables (i.e. they are not in the dataset), and not parameters. I think HistFactory calls them <parameterName>_obs and the name of the parameter may contain something like con or constraint.
RooFit normally guesses these automatically, but I guess you have to collect them manually just to be sure that things are working correctly.

Further, I think it is possible to get non-unity POIs if the fit model is slightly unstable. How frequent and how strong are the deviations?
Do you see large correlations in the fit? That’s a sure sign of instabilities.

There were some large correlation in my fit setup. I’ve tried the following configurations:

  1. No systematics.
  2. Only systematics with low correlations.
  3. All systematics.

Asimov data creation and fit for configurations 1 and 2 resulted in norm factors estimates equal to the ones used to create the Asimov dataset, configuration 3 estimates were biased.
So I guess the problem was in the way MINUIT deals with fits with large correlations and not in the way I’m creating the Asimov dataset.
Thank you for your help.

Hi @apetukho,

that’s good to hear. It’s not a problem of MINUIT, minimisation with large correlations is something that cannot be solved easily. Imagine that the minimum of the function is not a single point in the parameter space, but a valley (that’s what happens when two parameters are correlated). Which point inside this valley should the minimiser return as the “best parameter values”? There is an infinite number of “correct” parameter values that would solve the problem equally well.

I recommend to look at the parameters that are correlated, and try to rewrite the fit model in a way where this correlation vanishes.

Thank you for the explanation! I’ve got two more questions.

First question:
I also want to use this Asimov dataset to estimate the median discovery significance. I’m doing it by saving the Asimov dataset to the workspace

asimov->SetName("AsimovDataAfterFit");
w->import(*asimov);
w->writeToFile(filePath.c_str());

and running the StandardHypoTestDemo with the following options:

StandardHypoTestDemo("path/to/workspace.root", "w", "ModelConfig", "no", "AsimovDataAfterFit", 2, 3)

(It crashes if I left the modelBName blank, so I pass some name that is not in the workspace for the script to create the model config for me).
Is this the correct type of the calculator and statistic for estimating discovery significance? Is the significance I get in the Results HypoTestAsymptotic_result: really the mean discovery significance?
The value will obviously be wrong for now, since there are problems with the model, but I want to be sure that the way to get it is correct.

Second and slightly off-topic question:
Is there a good explanation of how the uncertainties of the estimates are calculated with the second-order derivatives and how the introduction of the nuisance parameters influences those uncertainties? The information in the MINUIT manual is scarce, unfortunately.

Hi @apetukho,

The Asymptotic calculator is the one that you want to use, provided you have enough data. As the name says, it’s asymptotically correct. And what’s called significance in the output of the tutorial is the median significance, not mean, but that’s what you want, anyway.

Second question:
I’m not aware of many explanations that bring it to the point quickly. I will point you to one via PM.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.