Minimized NLL not corresponding to best chi2

Iain_Bertram1 · December 13, 2022, 5:47am

@Jonas If I wasn’t clear having the new NLL definition in the new root release would be fantastic…

I can build from the nighties if required (or head, let me know when available).

jonas · December 21, 2022, 2:24pm

Alright I’ll let you know! The NLL definition with the offset will be available soon.

In the meantime, I have realized there is another problem in the way that RooFit defines likelihoods.

I will elaborate on it here before opening a GitHub issue for this, because here I can use LaTeX formulas which is quite nice.

If you have a continuous function f that you use to model some binned data, the non-extended NLL with the correct predicted probabilities p_i would be:

l = \sum_i{n_i\log{p_i}} = -\sum_i{n_i\log\left(\frac{1}{A}\int_{a_i}^{b_i} f dx\right)},

where n_i is the number of observations in each bin, \int_{a_i}^{b_i} f dx is the integral of f over the bin, and A = \int f dx is the normalization factor (integral over the full domain of x, the union of all bins).

Because the integration for each bin is inefficient, one usually just takes the value of f at the bin center, let’s call it \tilde{f}_i. The actual bin integral is this value times an arbitrary bias: b_i\tilde{f}_i.

The correct NLL is therefore:

l = - \sum_i n_i{\log{\left(\frac{1}{A}b_i\tilde{f}_i\right)}} = - \sum_i n_i{\log{\tilde{f}_i}} - \sum_i n_i \log{b_i} + N\log{A},

where N = \sum_i n_i.

Moving on to the RooFit likelihood. The biases are unknown, and the likelihood is approximated by dropping them.

l_\text{RooFit} = - \sum_i n_i{\log{\tilde{f}_i}} + N\log{A}.

Note for later that no approximation was done for the normalization integral!

Let’s go back to the correct NLL, but this time also express the integral A as a sum with biases for the binning effects, A = \sum_i b_i \tilde{f}_i:

l = - \sum_i n_i{\log{\tilde{f}_i}} - \sum_i n_i \log{b_i} + N\log{\sum_i b_i \tilde{f}_i}

We now split the bias into an average bias B times a bin-depended part \epsilon_i, i.e., b_i = B\epsilon_i:

l = - \sum_i n_i{\log{\tilde{f}_i}} - \sum_i n_i \log{B\epsilon_i} + N\log{\sum_i B\epsilon_i \tilde{f}_i}

If you pull the mean bias B out of the sum and the log, the terms with B cancel:

l = - \sum_i n_i{\log{\tilde{f}_i}} - \sum_i n_i \log{\epsilon_i} + N\log{\sum_i \epsilon_i \tilde{f}_i}.

Now it becomes a bit more vague. For most analytical PDF shapes, the constant bias is more important than the bin-variation of the bias. Think for example of an exponential decay: the binning effect bias is a flat multiplicative factor. Therefore RooFit looses quite some fit precision by doing the normalization integral analytically over the full domain: the cancellation of the mean bias term doesn’t happen.

In conclusion: it would also help if the NLL in RooFit for binned fits is reformulated such that the normalization integral is only approximated by evaluating at the bin centers. Like this, the average binning effects are cancelling out because they appear both in the numerator and denominator. Only bin-dependent binning effects are remaining then.

Besides reducing binning effects, this would improve numerical fit stability because no numeric integration of the PDF function needs to be done in the integral is not known analytically.

Sorry for the long text! It’s mainly for myself to not forget what I should do after the Christmas break.

The associated GitHub issue is this one:

github.com/root-project/root

[RF] Improve precision of binned likelihood fits by redefining normalization term in NLL

opened 02:37PM - 21 Dec 22 UTC

guitargeek

improvement in:RooFit/RooStats

It would also help if the NLL in RooFit for binned fits is reformulated such tha…t the normalization integral is only approximated by evaluating at the bin centers. Like this, the average binning effects are cancelling out because they appear both in the numerator and denominator. Only bin-dependent binning effects are remaining then. Besides reducing binning effects, this would improve numerical fit stability because no numeric integration of the PDF function needs to be done in the integral is not known analytically. More detail with the mathematical background can be found on the ROOT forum: https://root-forum.cern.ch/t/minimized-nll-not-corresponding-to-best-chi2/50120/22

JackLindon · January 17, 2023, 9:31am

Hi @jonas , I’m wondering when this update is planned to be included, is it the 6.28/00 release which is planned for May? If it is, is there any possibility of this being added to a nightly sooner?

Thanks for your work on this,
Jack

Iain_Bertram1 · January 17, 2023, 9:32am

Hi @jonas

We are using the RooBinSamplingPDF to deal with some of the issues with the approximation of using the bin centre. Hopefully the above problem is handled correctly in this case…

jonas · January 18, 2023, 1:55pm

Hi @JackLindon,

the implementation of the more stable likelihood definition has been in the nightlies for a few weeks now, but I haven’t reported on it yet. I first wanted to also write a tutorial to explain all of that.

The tutorial is proposed here:

You can try it with the ROOT nightlies, otherwise the Offset("bin") feature is not there yet.

Can you try if it works for you? Also your feedback on the tutorial is greatly appreciated!

JackLindon · January 18, 2023, 4:20pm

Hi @jonas , great thanks very much . I will try to run with a recent nightly and use it, will let you know how it goes.

Iain_Bertram1 · January 24, 2023, 11:03am

@Jonas,

Thanks, this looks great. I have run some test using my standard fits to the dijet mass spectrum. So far the new Offset("bin") method works really well for the resulting fit parameters.

I have compared results with my own stand alone Minuit fitting code and get the same fit parameters within a small variation which is probably due to choices of integrator (the NLL value differs by ~0.1 out of 31.8).

I am using this in conjunction with RooBinSamplingPdf to get the integration over the bin.

One problem I am having is in making a pull distribution from a RooPlot.

In this case the resulting pull distribution is quite different from the true pull distribution I obtain by evaluating the resulting fit directly using the resulting fit parameters.

The RooPlot distribution (first plot) and my alternative plots are attached.

In conclusion the fitting appears to be working perfectly but the pdf that is stored in association with the fit has issues. If it is using a weighted bin centre rather than the integrated pdf this may be causing it.I have seen this before and will see if I can dig out the fix.

I can send my code privately or attach here. It is long and complicated so probably does not belong here.

JackLindon · April 13, 2023, 9:47am

Hi @jonas ,

Thanks again for this update. I’ve been attempting to use the new offset(“bin”) option in createNLL in ROOT v6.28 and have been having some problems. These problems are not directly because of the new option but I thought it’s likely you’d know how to resolve it (they are really just a problem with me not understanding RooAbsData in general I think). When I try to use it I just get the output

[#0] ERROR:Minimization -- The Offset("bin") option doesn't suppot fits to RooDataSet yet, only to RooDataHist. Falling back to no offsetting.

The error message is pretty clear on the reason for this, but for some reason I am struggling to pass it a RooDataHist. RooDataHist seems to be automatically converting itself to RooDataSet for some reason.

The code I’m using is very long and involved and I can’t share it but some relevant parts:

//Get RooDataHist from ws.
RooDataHist *data = (RooDataHist *)ws->data(_dataName.c_str());
if (dynamic_cast<RooDataSet*>(&(*data))){ std::cout << "After getting data from ws: data is RooDataSet" << std::endl;}
if (dynamic_cast<RooDataHist*>(&(*data))){ std::cout << "After getting data from ws: data is RooDataHist" << std::endl;}
//At this point the code outputs that data is a RooDataHist as expected.

[...]

if (dynamic_cast<RooDataSet*>(&(*data))){ std::cout << "Before startFit: data is RooDataSet" << std::endl;}
  if (dynamic_cast<RooDataHist*>(&(*data))){ std::cout << "Before startFit: data is RooDataHist" << std::endl;}
  int status = fitter->startFit(mc, data); // Perform fit
// At this point the code still says that data is a RooDataHist. The function int startFit(ModelConfig *mc, RooAbsData *data); is called

int fitTool::startFit(ModelConfig *mc, RooAbsData *data)
{
  if (dynamic_cast<RooDataSet*>(&(*data))){ std::cout << "Immediately in startFit: data is RooDataSet" << std::endl;}
  if (dynamic_cast<RooDataHist*>(&(*data))){ std::cout << "Immediately in startFit: data is RooDataHist" << std::endl;}
//At this point the code says that data is a RooDataSet *not* a RooDataHist
[...]

Why/how do I prevent the RooDataHist converting to a RooDataSet when I pass it to a function that uses RooAbsData?

JackLindon · April 13, 2023, 10:07am

P.S. I find that even if I immediately call createNLL while it still outputs that data is a RooDataHist, it still gives the error

[#0] ERROR:Minimization -- The Offset("bin") option doesn't suppot fits to RooDataSet yet, only to RooDataHist. Falling back to no offsetting.

I assume this is because of the same reason and that when data is passed into the createNLL function it converts into a RooDataSet?

jonas · April 13, 2023, 10:21am

Hi! The output in your latest post is empty. Can you edit it, and also tell me how your call to createNLL looks like? And what is the type of the pdf? A RooSimultaneous?

JackLindon · April 13, 2023, 10:28am

Hi @jonas

Thanks for the help, of course. I originally called createNLL as such:

RooAbsReal *fitTool::makeNLL(ModelConfig *mc, RooAbsData *data){
  RooAbsPdf *pdf = mc->GetPdf();
  RooAbsReal *nll = nullptr;
  if (dynamic_cast<RooDataSet*>(&(*data))){ std::cout << "data is RooDataSet" << std::endl;}
  if (dynamic_cast<RooDataHist*>(&(*data))){ std::cout << "data is RooDataHist" << std::endl;}
  nll = pdf->createNLL(*data, NumCPU(_nCPU), BatchMode(_useBatch), IntegrateBins(_samplingRelTol),
                           Constrain(*mc->GetNuisanceParameters()),
                           GlobalObservables(*mc->GetGlobalObservables()),
                           ExternalConstraints(*_externalConstraint),
                           Offset("bin"));
}

This as mentioned previously says it data is a RooDataSet even though I passed a RooDataHist to the function so I have tried (producing dataTemp in RooDataHist and passing this to createNLL):

RooAbsReal *fitTool::makeNLL(ModelConfig *mc, RooAbsData *data){
  RooDataHist *dataTemp = (RooDataHist *)data;
  RooAbsPdf *pdf = mc->GetPdf();
  RooAbsReal *nll = nullptr;
  if (dynamic_cast<RooDataSet*>(&(*dataTemp))){ std::cout << "dataTemp is RooDataSet" << std::endl;}
  if (dynamic_cast<RooDataHist*>(&(*dataTemp))){ std::cout << "dataTemp is RooDataHist" << std::endl;}
  nll = pdf->createNLL(*dataTemp, NumCPU(_nCPU), BatchMode(_useBatch), IntegrateBins(_samplingRelTol),
                           Constrain(*mc->GetNuisanceParameters()),
                           GlobalObservables(*mc->GetGlobalObservables()),
                           ExternalConstraints(*_externalConstraint),
                           Offset("bin"));
}

In which case it outputs “dataTemp is RooDataHist” as expected, but it still gives the error

[#0] ERROR:Minimization -- The Offset("bin") option doesn't suppot fits to RooDataSet yet, only to RooDataHist. Falling back to no offsetting.

jonas · April 13, 2023, 10:41am

Hold on, what you are doing there is super illegal in C++:

This is a C-style “forced” cast where the new object is interpreted as a RooDataHist, whether it is one or not.

And then you do dynamic_cast to check the type, but the compiler thinks it’s smart and will just say it’s a RooDataHist because you just told it so.

Here is an example to show this behavior:

#include <iostream>

class A {
public:
    virtual ~A() = default;
};

class B : public A  {
};
class C : public A  {
};

void testo() {

    auto c = new C{};

    B* tmp = (B*)c;

    std::cout << dynamic_cast<B*>(tmp) << std::endl;
    std::cout << dynamic_cast<C*>(tmp) << std::endl;
}

The output will be:

0x55bf4bf81770
0

So you see that the dynamic_cast gave not the behavior you expect there.

Can you try again removing the C-style cast to dataTemp and use the RooAbsData *data everywhere? Which type is it actually? You can also use data->Print() to check.

JackLindon · April 13, 2023, 10:50am

Hi @jonas ,

Ah ok that’s very helpful (as you can probably tell I don’t really understand polymorphism).

If I do at the very start:

RooWorkspace *ws = (RooWorkspace *)tf->Get(_wsName.c_str());
RooDataHist *data = (RooDataHist *)ws->data(_dataName.c_str());
std::cout << "Got data from ws" << std::endl;
data->Print();
if (dynamic_cast<RooDataSet*>(&(*data))){ std::cout << "After getting data from ws: data is RooDataSet" << std::endl;}
if (dynamic_cast<RooDataHist*>(&(*data))){ std::cout << "After getting data from ws: data is RooDataHist" << std::endl;}

I get the output

Got data from ws
RooDataSet::combData[obs_x_channel,channellist,weight:wt] = <a number> entries (<a number> weighted)
After getting data from ws: data is RooDataHist

So I guess the problem is that the data in the ws is a RooDataSet. Is there any way to convert this to a RooDataHist or do I need to go further back and change the code that generates the ws?

jonas · April 13, 2023, 11:02am

Yes, that’s very common. Most frameworks use RooDataSet, even for binned data. That’s because for simultaneous fits with many observables, you have to use RooDataSet if you don’t want your memory usage to blow up because of the curse of dimensionality.

That’s why I think at some point the OffsetBin feature needs support also for RooDataSet, as outlined here: [RF] Completely implement `Offset("bin")` feature · Issue #11965 · root-project/root · GitHub. Give the issue a thumbs up if you want to see it implemented

So far I’m still thinking about how to implement this best. I hope that by ROOT 6.28.06 it will work.

If you need this feature more urgently for your fit, we can also discuss possible temporary workarounds.

JackLindon · April 13, 2023, 11:09am

Hi @jonas ,

Thanks very much, I have mentioned my support on the issue as it would be very useful. I notice Hongtao has also supported it (the code I am currently trying to get to work here with Offset(“bin”) is from quickFit which Hongtao maintains as they mentioned in the issue).

However it is needed fairly urgently, my analysis group have been held up by this issue with the NLL for quite a long time, what possible temporary workarounds are there?

Thanks again for your help

fbartels · April 14, 2023, 8:26am

Hi Jack,
Thanks for bringing this up. I’ve been trying to get the new offset option to work in quickFit as well, but failed due to it using RooSimultaneous and RooDataSet. Bumping the GitHub issue as well.