Unbinned NLL for a Poisson Distribution with mean as a function of time

RemiHill · June 8, 2020, 11:00pm

Hello,

I am struggling to get the NLL working for a Poisson distribution. I am trying to extract an observable called t0, where I have a signal of some counts (1,2,…N) with time indices t_i. To do this, I was going to use an unbinnned maximum likelihood approach to extract t0, where the PDF is a Poisson distribution with a mean as a function of time. When I try to construct my PDF and use it to evaluate the likelihood, I get some errors that I am not sure what is causing.

Some of the errors are NAN, which makes sense to me if it scans t0 to a point where t_i - t0 is less than zero since log_e( t_i - t0) would be undefined. But if I try to draw this on a TCanvas it comes back empty. Any suggestions or help would be great. I have attached my code and the error message down below.

Thank you for taking the time to help, apologies if this is trivial.

likelihood.cc (2.3 KB)
t.txt (302.3 KB)

bellenot · June 9, 2020, 6:27am

The root file is missing (to be able to run the code), but maybe @StephanH can take a look

StephanH · June 9, 2020, 7:55am

Hello,

indeed, the macro doesn’t work. I took the time to rewrite it by letting the PDF generate some toy data.
Fitting those doesn’t work, though, since the formula tPrime evaluates to something like 4.5E9, and the mean of the Poisson is about 15. To all reasonable floating-point precision this is 0, and log(0) is -infinity, and if you use this in further computations, it’s not a number.

I guess you need to fix the formula that computes tPrime. The likely cause of the problem is
RooRealVar meanTime("meanTime", "meanTime", -1e9, 10e9);
This sets the default value of mean to the middle between the range limits you passed as arguments. If you wantd to make it constant, pass only a single value.

RemiHill · June 9, 2020, 12:22pm

Good morning Stephan,

So the line you mention RooRealVar meanTime("meanTime", "meanTime", -1e9, 10e9), I was under the impression that this, in conjunction with meanCounts and the mean RooDataSet, would load the data from the TTree on line 47 of likelihood.cc, resulting in a mean as a function of time. I attached a figure of this below. So it is not a constant that I am after. I am not sure if this clarifies what I am trying to do, but the Poisson PDF I am attempting to construct would look like this:

P = n_bar(t-t0)^( n(t) ) * e^-( n_bar(t-t0) ) / n(t)!

where the observable is t0, n(t) is the counts at a discrete times t and n_bar is the mean (below).

I have also added the ROOT file below that I was running my above code on, I apologize for leaving that out in my original post. I was running the code as: ./likelihood OUTPUT-SimData-Neutron-Batch-0.root temp. The second ROOT file (t0-analyzed.root) will not upload since its size is ~100 Mb, the upload hangs at 13%.

OUTPUT-SimData-Neutron-Batch-0.root (21.3 KB)

Thank you so much for helping me.

StephanH · June 9, 2020, 4:05pm

Ah sure. I deleted all lines from your macro that did something with data, since it wouldn’t run without the files. So obviously I didn’t see that you were reading those values … For me this was a constant.

Ok, since you cannot upload, can you check that the dataset loads as you expect? Something like

data->get(0)->Print("V");
data->get(1)->Print("V");
...

Check if these values are reasonable.

Next, you can always print all values by e.g. doing something like

*pdf->getObservables() = *data->get(0); // assigns values of entry zero to the observables of the PDF
pdf->Print("T"); // Prints the values of all terms

See e.g. getObservables() to retrieve the observables (or getParameters for the values of parameters).

Now check if the x and mean of the poisson are

unreasonable (e.g. negative or very large)
or drastically different.

If that’s the case, something needs to be fixed in the model definition or when reading the data.

StephanH · June 9, 2020, 4:16pm

Maybe I found the problem:
Your mean dataset is not used. Try merging them, so mean and t values get loaded simultaneously. I wrote something like

 44   data.get(0)->Print("V");
 45   data.get(1)->Print("V");
 46   // Open the second file and get the mean as a function of time.
 47   TFile* f2 = new TFile("t0-analyzed.root", "OPEN");
 48   TTree* t2 = (TTree*)f2->Get("mean");
 49   
 50   // Store these in a RooRealVar.
 51   RooRealVar meanTime("meanTime", "meanTime", -1e9, 10e9);
 52   RooRealVar meanCounts("meanCounts", "meanCounts", 0, 30);
 53 
 54   std::cout << "Creating the mean dataset!" << std::endl;
 55   // Form the dataset for the mean.
 56   RooDataSet mean("mean", "mean", RooArgSet(meanTime, meanCounts), Import(*t2));
 57   data.merge(&mean);
 58   data.get(0)->Print("V");

But I’m missing the second file, so I cannot actually merge them. Check that after merging, you see all relevant variables.

RemiHill · June 9, 2020, 8:14pm

Hello Stephan,

Sorry for the late response. In terms of printing out data and mean, both output exactly as I would expect. As follows:

Opening the TFiles!
temp
Creating the observed dataset!
  1) RooRealVar::   time = 8.55282e+06
  2) RooRealVar:: counts = 1
Creating the mean dataset!
[#1] INFO:Eval -- RooTreeDataStore::loadValues(mean) Ignored 500002 out of range events
  1) RooRealVar::   meanTime = -2e+07
  2) RooRealVar:: meanCounts = 0.0166056

So it would appear that both are loaded in correctly. In your first repsonse I see you use pdf defined from some RooAbsArg, that is not defined in my code and I am not sure where you defined it. Also, the merging of the two data sets doesn’t work as the RooDataSet data and mean are of different lengths.

[#0] ERROR:InputArguments -- RooDataSet::merge(data) ERROR: datasets have different size

The mean data has about 10^7 entries, this is calculated from 10,000 “data” sets and interpolated to 1 microsecond spacing between points. I had done this before using RooFit, when I was just coding my own likelihood function and was interested in getting a fine resolution.

So, do I need to reduce the entries in my Poisson pdf to match the number in my dataset?

I had tried to create a public Git repository for you to grab the second ROOT file, but it is even too big for Github. I am currently writing a script to separate them into smaller files so I can append them to this message. It might help you, help me .

RemiHill · June 9, 2020, 9:04pm

Hello Stephan,

I was able to break the t0-analyzed.root file into 5 separate files. They are located in a public repository on my Github page. I apologize for how roundabout this is.

git clone https://github.com/RemiHill/ROOTFile

Let me know if that helps.

StephanH · June 10, 2020, 9:42am

To clarify: When I wrote something like

StephanH:

Next, you can always print all values by e.g. doing something like

*pdf->getObservables() = *data->get(0); // assigns values of entry zero to the observables of the PDF
pdf->Print("T"); // Prints the values of all terms

I assumed that you would replace pdf by the PDF that you are using, the poisson.

Now to your real problem:
In your fit only one dataset is used, data. The fact that you create mean doesn’t mean anything to the fit, because the fit instruction is
createNLL(data).
That means that the variable meanTime stays at the last value that it had before the fit was started.
I thought that you wanted to do a 4-D fit in {time, counts, meanTime, meanCounts}, so you would have to merge the datasets. The fact that they don’t have the same length shows that that’s apparently not the case, so you cannot merge them.

If I understand correctly, you wanted to use mean as kind of lookup table that converts some number into a “meanTime” or into a “tPrime”. So instead of tPrime = meanTime - t0, you should use the function that you coded for your own likelihood. It’s perfectly fine to make it a function that looks up the value in a histogram or in a tree, and passes the result to RooFit. RooFit doesn’t even need to know that the values are looked up in a table.

Something like this converts any function pointer into a RooFit object, e.g.

RooAbsReal* tPrime = RooFit::bindFunction("tPrime", &myTPrimeImplementation, t0);

This way, RooFit passes t0 as the only argument into the function, and uses whatever the function returns.

An alternative is to find the functional form of what you wanted to do with the lookup table. You could e.g. fit the mean dataset, extract fit parameters, and use this to model tPrime. If that’s easy to do really depends whether there is a functional relation between tPrime and t0.

RemiHill · June 18, 2020, 12:30am

Hello Stephan,

This is off topic to what I was discussing with you last week, but I figured this would be the best place to ask some follow up questions.

I suspect I am not fully understanding how RooFit handles event generation with some pre-defined PDF. I have the following code:


#include <RooFit.h>
#include <RooRealVar.h>
#include <RooAbsReal.h>
#include <RooDataSet.h>
#include <RooRandom.h>
#include <RooFormulaVar.h>
#include <RooMinuit.h>

#include <TF1.h>
#include <TGraph.h>
#include <TFile.h>
#include <TTree.h>

using namespace RooFit;

TGraph* g;
TGraph* g2;

double theFunction(Double_t* x, Double_t*){
  double num = g->Eval(x[0]);  
  return num;
}

void likelihoodInterpreter(){

  RooRandom::randomGenerator()->SetSeed(62);

  std::cout << "Loading our data!" << std::endl;
  TFile* file = new TFile("t0-analyzed.root", "OPEN");
  TTree* tree = (TTree*)file->Get("mean");
  g = (TGraph*)file->Get("mean-time");
  TFile* file2 = new TFile("OUTPUT-SimData-Neutron-Batch-0.root");
  TTree* tree2 = (TTree*)file2->Get("temp");

  RooRealVar tim("time", "time", -0.02, 10);
  RooRealVar cts("counts", "counts", 0, 100);

  RooDataSet SimData("SimData", "SimData", RooArgSet(tim), Import(*tree2));
  
  std::cout << "Creating our variables and formulas!" << std::endl;
  RooRealVar t("t", "t", -0.02, 10);
  RooRealVar mean("mean", "mean", 0, 26);
  RooRealVar t0("t0", "t0", 0);
  RooFormulaVar tPrime("tPrime", "t-t0", RooArgList(t,t0));

  std::cout << "Creating our function and poisson PDF!" << std::endl;
  // This repersents the mean as a function of the monte carlo truth time.  
  TF1* f = new TF1("f", theFunction, -0.02, 10);  
  
  TCanvas* c = new TCanvas("c", "c", 400, 400);
  c->Divide(3,1);
 
  // We bind this to our time.
  RooAbsReal* f1 = RooFit::bindFunction(f, t);  
  RooPoisson poisson("poisson", "poisson", t, *f1, kFALSE);

  int points = 246891;
  std::cout << "Generate our data!" << std::endl;
  TFile* output = new TFile("output-test.root", "RECREATE");
  
  RooDataSet* data = poisson.generate(t, points);

  // Get the Geant4 Monte Carlo data.
  TFile* tempFile = new TFile("createCountPDF.root", "OPEN");
  TTree* tempTree = (TTree*)tempFile->Get("data");
  
  RooRealVar timeDist("timeDist", "timeDist", -0.02, 10);
  RooDataSet MC("MC", "MC", RooArgSet(timeDist), Import(*tempTree));

  c->cd(3);
  RooPlot* frame3 = timeDist.frame();
  MC.plotOn(frame3, Binning(10020));
  frame3->Draw();
  
  c->cd(2);
  RooPlot* frame = t.frame();
  data->plotOn(frame, Binning(10020));
  frame->Draw();
  
  c->cd(1);
  RooPlot* frame2 = t.frame();
  f1->plotOn(frame2);
  frame2->Draw();
  
  data->Write();

}

I was working with the bindFunction() function you suggested last week to load a TF1 object of my mean (as a function of time) and then hand that over to my RooPoisson. I generate a few hundred thousand points from this Poisson distribution and get a toy data set that is wildly inconsistent with a dataset that comes directly from the Geant4 Monte Carlo. In the attached figure(s), the middle is the toy dataset and the right is the Monte Carlo. It almost appears like the toy data is a simple exponential, there is also some features in the Geant4 data set that are not accurately reflected in the toy data set. The left plot is the RooAbsReal f1, which is the mean (as a function of time) and is fine. Does anything jump out to you in my code in how I am defining my distribution? Is anything blatantly incorrect? I apologize if this is trivial and for my frequent questions.

To run the above code, there is two ROOT files you will need that were not present in our previous discussions. See below.

createCountPDF.root (1.8 MB)
t0-analyzed.root (14.8 KB)

StephanH · June 18, 2020, 8:24am

I don’t really know what makes you think that the toy data are wrong, but here are some random things I noticed:

You can directly bind a c++ function (won’t probably change results):

 19 double theFunction(double x){
 20   double num = g->Eval(x);  
 21   return num;
 22 }

 52   // We bind this to our time.
 53   RooAbsReal* f1 = RooFit::bindFunction("meanFunction", &theFunction, t);  
 54   RooPoisson poisson("poisson", "poisson", t, *f1, kFALSE);

The file OUTPUT-SimData-Neutron-Batch-0.root is missing, but I commented out all sections that use it.

Next, toy data doesn’t necessarily have the same distribution as the data you read from files. That would only be the case if the model is able to describe the data, and after you fit it to data. So unless you know that your model parameters are chosen correctly, I wouldn’t expect them to agree. Makes sense?

One more thing I see is that the MC “leaks” below zero. That’s not possible for a Poisson distribution. If this property made you distrust the toy data, it’s correct, and it proves that the MC doesn’t follow a poisson distribution.

Edit:
In general, what RooFit should do in this case, since the mean parameter of the poisson is not a constant, is to sample the function with an event generator. It uses TFoam for this, and this figures out which parts of the distribution have to be sampled with highest frequency.

system · July 2, 2020, 8:24am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.