Simultaneous 1D likelihood fits in different observables, with multiple PDFs

Hi,
I am new to RooFit but I have some experience with C++ and ROOT. Mine is a RooFit specific question. This is my first time posting to the forum but I have benefited immensely from all the past posts I have found in the forum.

Main goal: I have a single data ntuple with 5 observables in them. I have three Monte Carlo simulations ntuples with the same 5 observables, each MC representing ‘background 1’, ‘background 2’ , ‘signal’. I would like to perform a simultaneous 1D fit of the three MC across all 5 observables, using binned extended maximum likelihood.

What I have done: I have tried to follow the tutorial example here: https://root.cern.ch/root/html/tutorials/roofit/rf501_simultaneouspdf.C.html. For my MC ntuples, I have created histograms of each type (background 1, background 2, signal) and each type its own observable, and I converted them into RooHistPdfs (so I have about 15 RooHistPdf total for the MCs). For my data ntuple, I have made RooDataHist objects for each of the observables, so I have 5 RooDataHist objects, each corresponding to an observable.

My main issue is making ‘combData’ (line 92) in the example. In ‘combData’, instead of the observable ‘x’, I passed a RooArgSet with my 5 observables. When I ran it, my memory consumption spiked and it never progressed past line 92. What I think is happening is that it is trying to make a 5D histogram which I think doesn’t work. I am thinking I should use RooCategory somehow, but I am not sure how.

I am not sure how to proceed. I have looked up other forum posts but I couldn’t piece anything together. Any guidance would be great. Thanks!

@StephanH could you help here?

Hey @ilam,

let’s investigate a bit what you need to do.

  • Do you really need 5 observables? Are you e.g. fitting, pT, eta, phi, … simultaneously? In that case it’s a 5D fit.
  • Do you have 5 categories, and all share the same observable (e.g. the mass of a particle in a specific phase space region)? In that case it’s a 1D fit and you need RooSimultaneous.
  • Do you have 5 categories, but each one only has a single observable, and observables are not shared between the categories? In that case it’s again a 1D fit, but I would have to look up how to do this exactly.

Next, what have you done? You could post a (possibly stripped-down version) of the code, so it’s easier to see what’s happening.

For now, I suspect that you have to create a RooDataHist with categories. This is shown in rf401, which imports from a tree. You should be able to import from histograms, though.

Now, you can construct a simultaneous PDF with the same categories, and try the fit.

1 Like

Hi StephanH,

From the three points you listed, mine is the third one but I have 5 observables with 3 categories.
I stripped down my code here (as much as I could) to show 4 observables with 3 categories. The end game is to have 5 observables with around 6 categories. I also added some comments to hopefully clarify my code. To state my issue another way for clarity: I want to perform five 1D fits, each of them corresponding to an observable. And each of those 1D fits have 6 categories(each with their own PDF). The categories are the same type across the 1D fits. The quantity I am estimating is the number of events of each category.

reduced.cpp (10.6 KB)

Thanks you!

Hi @ilam,

you are right that it tries to create 5D dataset. This will indeed create a spike in memory usage if you have a lot of bins.

There is a workaround:
You can use the RooDataSet (the one that saves data points) instead. What you do is that for each bin you save the bin position as a data point, and you save the bin content as the weight of the respective point. In this way, you only save as many datapoints as you have bins.
I hacked the rf501 example in order to show how it’s done.

Now, you can use this for fitting. Although every data point contains values for all observables (and we are only ever interested in one of them), the category, which also is in the dataset will only switch on a single PDF that only reacts to its observable. Whatever the values of the other observables for this datapoint are, the PDF doesn’t depend on it.
RooFit will cycle through all datapoints, and only one single PDF will react to each of the points, according to the category number that is associated to this point. Note that I set the category to a new state right before I start filling. This is important!

To illustrate, I printed the data storage and the first datapoint:

DataStore combData (combined data)
  Contains 202 entries
  Observables: 
    1)       x = 8  L(-8 - 8)  "x"
    2)       y = 8  L(-8 - 8)  "y"
    3)  sample = Y(idx = 1)
  "sample"

  Dataset variable "weight" is interpreted as the event weight
The first datapoint is:
  1) RooRealVar::      x = -7.92
  2) RooRealVar::      y = 0
  3) RooCategory:: sample = X(idx = 0)

As you can see, y is at its default value, but we don’t care. The category is set to X.
When plotting, you just have to make sure that you only take the points that belong to the respective category. This is also shown in the macro.

Do this for all 5 observables, and happy fitting!

rf501_simultaneouspdf.C (4.0 KB)

1 Like

Hi @StephanH ,

Thanks for your reply! I have been working on it, hence the slow reply. Also, while waiting, I found another work-around by Wouter in this link [Simultaneous fits]. I modified my code with this approach and it also worked (I have attached the code). I ran both methods (NLL-method for mine, label-method for yours), I get the exact same answer, which is expected since the only difference is how the data files were handled. Both the NLL-method and label-method converged with no error messages. As I understand it, the NLL-method basically builds the negative log likelihood piece by piece and minimizes that.

Coincidently, I was looking over related forum posts again and I found this post [RooSimultaneous fit and Datasets in RooFit]. In here, Wouter basically seems to give both NLL-method and label-method as solutions, although his implementation of the “label-method” is a bit different (he uses addColumn).

I do have a follow-up question which is related to this issue in a sense that it is about simultaneous/multi-dimension fitting but about a different “how-to” issue. Let me know if I should move it to a different thread but I’ll address it here first.

Now that I have the fit working, I want to make a likelihood ratio plot of one of the quantities (say it is a quantity representing the signal). By likelihood ratio I mean profiled likelihood (found by maximizing the likelihood with respect to all nuisance parameters as a function of the signal) divided by the likelihood at the best fit signal. I would then want to integrate the likelihood ratio plot to 90% to set a 90% upper limit on the quantity using the Feldman-Cousins ranking method. To summarize, the goal now is to produce the likelihood ratio plot and integrate it to 90% using Feldman-Cousins ranking.

My first thought to approach this was the “brute force” method. Since I have my simultaneous PDF, I could feed it the results of the fit to get the denominator of my likelihood ratio by calculating the likelihood for each bin and multiply them. For the numerator, I would do the same but I instead vary my signal quantity (hold the other parameters fixed). I’ll calculate the ratio as I go and since I have the numbers, I can easily store them, make the plots and integrate.

Second thing I found was the FeldmanCousins method in RooStats, in particular this example here: [https://root.cern.ch/root/html/tutorials/roostats/StandardFeldmanCousinsDemo.C.html]. I am even less familiar with RooStats than RooFit so if I could confirm (or not) that this example is what I want, I can give it a go.

I bring this issue back full circle with the two fitting methods, in a sense that if this is the example I want, would it work for the two methods of handling the data?

Thanks,
Ian L.

reduced_2.cpp (8.5 KB)

Profile Likelihood

There is a profile likelihood helper for plotting the curves. See here:
rf605_profilell.C

Feldman Cousins

There’s two FelmanCousins examples in the roostats tutorials:
https://root.cern.ch/doc/master/dir_87facaa6959ceb8d3dacccaf663f7b84.html

Given that both methods give you the same log likelihood, you can use either of them to compute a Feldman Cousins interval. After all, you get the same likelihood from both.
What you need to do is to put your model and data into a ModelConfig such that the FeldmanCousins class can pick it up.
You can e.g. look at
https://root.cern/doc/master/HybridStandardForm_8C.html
to get an impression on how to do that.

Hi @StephanH,

Thank you for the references! I’ll work through them. You mentioned about “both methods giving me the same log likelihood”. As mentioned, I get the same results for my parameters regardless of the method but the FCN value returned from both is slightly different (if I understood correctly, FCN is the function value, this case the log-likelihood, at it’s minimum value). For the label-method, I get FCN=-12324.7 and for the NLL-method I get FCN= -18823.7. Does this difference matter, given that the best-fit values of the parameters are the same?

Thank you!

Hi @ilam,

Since we are usually doing likelihood ratio tests, the difference does not matter as long as the likelihoods differ by a constant factor. (As log-likelihoods, this means that they differ by a constant offset.)
Since in your case you seem to get the same parameters with both methods, the offset seems to be constant. This indicates that it’s just a normalisation difference, but this difference is constant.

When you conduct a likelihood ratio test (which the profile likelihood is), such normalisation differences cancel. The only thing you should not do is to conduct a likelihood ratio test where you mix these two approaches.

Hi @StephanH ,

Thanks for your explanation. That makes sense.
Along the lines of simultaneous fitting, I have another question. In my original question and your subsequent response, I was fitting across multiple observables in 1D. Now, say I found out that two of those observables have some correlation and I want to capture that correlation in the fit. To do that, I was thinking of making a 2D fit of those two observables, along with the usual 1D fit of the other observables. I have been using your solution to my original question i.e. the labeling method to combine datasets. I tried modifying it to take in a 2D dataset, something like this:

TH1 *hdataudrposnr = datatb1.createHistogram("hdataudrposnr", sposr, Binning(100), YVar(udotr,Binning(20)));

RooCategory sample("sample", "sample");
  sample.defineType("b14");
  sample.defineType("E");
  sample.defineType("sct");
  sample.defineType("udrpsnr");

RooRealVar urpsr("urpsr", "urpsr", 500.);

sample.setLabel("udrpsnr");
  for (int i=1; i<=hdataudrposnr->GetNbinsX()+1;++i){
    for(int j=1; j<=hdataudrposnr->GetNbinsY()+1; ++j){
      urpsr.setVal(hdataudrposnr->FindBin(i,j));
      combData.add(datasetVars,hdataudrposnr->GetBinContent(hdataudrposnr->FindBin(i,j)));
    }
  }

The idea I had was to store the global bin number instead of the bin center and get the content of that bin. Another idea I was thinking of was to use tutorial rf501 where I first make a combined dataset for the 2D case and somehow combine it with the other 1D cases using your labelling method. But I am unsure how to merge the two methods.

Hello @ilam,

the labelling method is a bit inferior to making one dataset per category. I advise to try something like this:

   RooCategory sample("sample", "sample");
   ...

   RooArgSet setX(x1, x2, weight);
   RooDataSet datax("datax", "datax", setX, WeightVar(weight));
   for (int i=1; i <= histX->GetNbinsX()+1; ++i) {
    for (int j=1; .... y-coordinate loop) {
      x1.setVal(histX->GetBinCenter(i));
      x2.setVal(... j ...)
      datax.add(setX, histX->GetBinContent(i, j));
    }
   }
   [... Same with y, but one-dimensional. ]

   
   // Construct combined dataset
   RooArgSet datasetVars(x1, x2, y, weight, sample);
   std::map<std::string, RooDataSet*> sampleMap;
   sampleMap["xSampleName"] = &dataX;
   sampleMap["ySampleName"] = &dataY;
   ...
   RooDataSet combData("combData", "combined data", datasetVars, Index(sample), WeightVar(weight), Import(sampleMap));
   
   combData.Print("V");
   combData.get(0)->Print("V");

Check out the documentation of the RooDataSet constructor to see where I got the arguments from:
https://root.cern.ch/doc/master/classRooDataSet.html#aeea71f616bd2cc6743f6869d31a4dbd5

Hi @StephanH,

Thanks for your response! I implemented your method and I think it worked. I just want to clarify that I understood your method correctly. The code below shows how I implemented it. Did I understand correctly?
Thank you!

RooRealVar x1(...);
RooRealVar x2(...);
RooRealVar y(...);

TH1 *hx1x2= datafile.createHistogram("hx1x2", x1, Binning(100), YVar(x2,Binning(20)));

RooCategory sample("sample", "sample");
sample.defineType("sampleY");
sample.defineType("sampleX1X2");

RooRealVar weight("weight", "weight", 1.);
RooArgSet setX(x1,x2,weight);
RooArgSet setY(y,weight);

RooDataSet datax("datax","datax", setX, WeightVar(weight));
RooDataSet datay("datay","datay", setY, WeightVar(weight));

for (int i=1; i<=histX1->GetNbinsX()+1;++i){
    for(int j=1; j<=histX2->GetNbinsX()+1; ++j){
      x1.setVal(histX1->GetBinCenter(i));
      x2.setVal(histX2->GetBinCenter(j));
      datax.add(setX, hx1x2->GetBinContent(i,j));
    }
  }

for (int i=1; i<=histY->GetNbinsX()+1;++i){
    y.setVal(histY->GetBinCenter(i));
    datay.add(setY, histY->GetBinContent(i));
    }

RooArgSet datasetVars(x1, x2, y);
  std::map<std::string, RooDataSet*> sampleMap;
  sampleMap["sampleX1X2"]=&datax;
  sampleMap["sampleY"]=&datay;
  
  RooDataSet combData("combData","combData",datasetVars,Index(sample),WeightVar(weight),Import(sampleMap));

Hi @ilam,

that’s exactly what I had in mind.