createIntegral() gives unexpected result

mwilkins · February 6, 2019, 8:32pm

I would expect the integral of a shape to be equal to its yield after an Extended log likelihood fit, but this is not the case when I use createIntegral. With a yield of 1457, s1s2.createIntegral(ROOT.RooArgSet(m1, m2)) gives 228. What am I missing?

Details:

I construct the fit shapes as:

    ws.factory(
        'SUM::model('
        'Ns1s2[0, {0}]*PROD::s1s2(s1, s2),'
        'Nb1b2[0, {0}]*PROD::b1b2(b1, b2),'
        'Ns1b2[0, {0}]*PROD::s1b2(s1, b2),'
        'Nb1s2[0, {0}]*PROD::b1s2(b1, s2)'
        ')'.format(nents)
    )

where s1, b1, etc., are RooGaussians or RooChebychevs with various constrained parameters.

I do the fit like:

r = model.fitTo(data, RooFit.Extended(), RooFit.NumCPU(8), RooFit.Save(), RooFit.ExternalConstraints(extCons))

I would expect the yields (Ns1s2, etc.) to be equal to the value of the integral of the sub shapes. This is not the case:

import ROOT

wsf = ROOT.TFile.Open('../output/FitLcLc.root')
ws = wsf.Get('myWS')

s1s2 = ws.pdf('s1s2')
Ns1s2 = ws.var('Ns1s2')
m1 = ws.var('Lc1_M_Con')
m2 = ws.var('Lc2_M_Con')

int1 = s1s2.createIntegral(ROOT.RooArgSet(m1, m2))
print int1.getVal()
print Ns1s2.getVal()

gives

228.34193321559758
1494.1134543643934

What am I doing wrong? I’ve tried also using NormSet, but this always gives me a value of 1.0, regardless of the Range specified.

StephanH · February 7, 2019, 11:03am

I don’t know, yet, but we can figure this out.

If you use NormSet, you actively ask RooFit to normalise the integral with respect to the variables in NormSet. That’s why you get 1: You probably ask to normalise with respect to the integration variables.
You are doing something like this:
Sum = N1 * PDF1 + N2 * PDF2 + ...
If the number of coefficients in a RooAddPdf is the same as the number of PDFs, RooFit interprets this as an extended PDF, where N1, N2 are the number of events for the different components. In your case, there seem to be 1494 events in the first component.
The integral of PDF1 has nothing to do with N1. In most cases, the integral of PDF1 is just 1, and N1 is just the number of events. You would not expect these two to yield the same number.
In your case, the integral of PDF1 is apparently not 1. I can think of several reasons:
- PDF1 itself is extended. Now, N1*int(PDF1) is the true number of events.
- The integration ranges are different. Maybe the “automatic” integration in RooFit uses a different reference for the normalisation of the PDF.
- The coefficients of the RooAddPdf have been defined in a different range. In that case, RooFit will convert the values of the coefficients into the corresponding value of the active fit range, and run the fit.

mwilkins · February 7, 2019, 2:59pm

Thank you for your help and prompt reply. See responses to each of your suggestions below. I really appreciate your willingness to help me on this as I’m a bit lost.

Yes, I do normalize w.r.t. the integration variables, but changing the range of integration has no effect. Is this the expected behavior? For example:

print 'm1 Min, Max', m1.getMin(), m1.getMax()
print 'm2 Min, Max', m2.getMin(), m2.getMax()

m1.setRange('sig', *[2270, 2300])
m2.setRange('sig', *[2270, 2300])
m1.setRange('bkg', *[2320, 2330])
m2.setRange('bkg', *[2320, 2330])

int1 = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.NormSet(ROOT.RooArgSet(m1, m2)), ROOT.RooFit.Range('sig'))
int2 = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.NormSet(ROOT.RooArgSet(m1, m2)), ROOT.RooFit.Range('bkg'))
print 'sig int', int1.getVal()
print 'bkg int', int2.getVal()

yields:

m1 Min, Max 2235.0 2335.0
m2 Min, Max 2235.0 2335.0
sig int 1.0
bkg int 1.0

Ah yes, thank you. This is of course correct. I am curious, however: what is the intended effect of specifying a NormSet of the integration variables? This formulation suggests the NormSet argument is only useful when normalizing over something other than the original integration variables.

I do not believe this is the case. The sum of the shape yields equals the total number of events. This is the full declaration:

    # -- set up constrained parameters
    mg = wswithpars.var('mg')
    ws.factory('Gaussian::LcMg(m_g[{0}, {1}, {2}], {3}, {4})'.format(mg.getVal(), 2276.46, 2296.46, mg.getVal(), mg.getError()))
    o1 = wswithpars.var('o1')
    ws.factory('Gaussian::LcO1(o_1[{0}, {1}, {2}], {3}, {4})'.format(o1.getVal(), 0, 10, o1.getVal(), o1.getError()))
    o2 = wswithpars.var('o2')
    ws.factory('Gaussian::LcO2(o_2[{0}, {1}, {2}], {3}, {4})'.format(o2.getVal(), 0, 20, o2.getVal(), o2.getError()))
    f1 = wswithpars.var('f1')
    ws.factory('Gaussian::LcF1(f_1[{0}, {1}, {2}], {3}, {4})'.format(f1.getVal(), 0, 1, f1.getVal(), f1.getError()))
    if not all((mg, o1, o2, f1)):
        raise IOError('could not get all required values from wswithpars')
    
    # -- construct signal and background shapes
    # declare mass spaces
    ws.factory('RooRealVar::{}(2235, 2335, "MeV")'.format('Lc1_M_Con'))
    ws.factory('RooRealVar::{}(2235, 2335, "MeV")'.format('Lc2_M_Con'))
    # use previous fit parameters, require symmetric shapes
    ws.factory('SUM::s1(f_1 * Gaussian::g11({nm}, m_g, o_1), Gaussian::g12({nm}, m_g, o_2))'.format(nm='Lc1_M_Con'))
    ws.factory('Chebychev::b1({}, {{c11[-5, 5], c12[-5, 5]}})'.format('Lc1_M_Con'))
    ws.factory('SUM::s2(f_1 * Gaussian::g21({nm}, m_g, o_1), Gaussian::g22({nm}, m_g, o_2))'.format(nm='Lc2_M_Con'))
    ws.factory('Chebychev::b2({}, {{c21[-5, 5], c22[-5, 5]}})'.format('Lc2_M_Con'))
    # make combined shape
    ws.factory(
        'SUM::model('
        'Ns1s2[0, {0}]*PROD::s1s2(s1, s2),'
        'Nb1b2[0, {0}]*PROD::b1b2(b1, b2),'
        'Ns1b2[0, {0}]*PROD::s1b2(s1, b2),'
        'Nb1s2[0, {0}]*PROD::b1s2(b1, s2)'
        ')'.format(nents)
    )

Perhaps. When I do not use NormSet, I get the following behavior:

m1.setRange('sig', *[2270, 2300])
m2.setRange('sig', *[2270, 2300])
m1.setRange('bkg', *[2320, 2330])
m2.setRange('bkg', *[2320, 2330])
m1.setRange('tot', *[2235, 2335])
m2.setRange('tot', *[2235, 2335])

inttot = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.Range('tot'))
intsig = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.Range('sig'))
intbkg = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.Range('bkg'))
print 'tot int', inttot.getVal()
print 'sig int', intsig.getVal()
print 'bkg int', intbkg.getVal()

yields:

tot int 228.341933216
sig int 211.176452521
bkg int 2.40364550153e-08

Given the Gaussian shape in the sig region, this is roughly the expected behavior, just the wrong normalization. Do you know why it would be picking up this mysterious normalization or how to fix it?

I’m not sure what you mean by this, but see above for how the coefficients are defined.

mwilkins · February 7, 2019, 9:17pm

I believe I have confirmed that the problem is the normalization. It appears normalized to some value (not sure how this value is selected), so all I have to do is renormalize to get the expected result. For example, if I take the one-dimensional integral of one of the Gaussian sub-shapes in the 2 sigma range, I get the expected 95.44% in all 4 cases:

m1.setRange('sig_1', *[m_g.getVal() - 2 * o_1.getVal(), m_g.getVal() + 2 * o_1.getVal()])
m2.setRange('sig_1', *[m_g.getVal() - 2 * o_1.getVal(), m_g.getVal() + 2 * o_1.getVal()])
m1.setRange('sig_2', *[m_g.getVal() - 2 * o_2.getVal(), m_g.getVal() + 2 * o_2.getVal()])
m2.setRange('sig_2', *[m_g.getVal() - 2 * o_2.getVal(), m_g.getVal() + 2 * o_2.getVal()])
g11sig = g11.createIntegral(ROOT.RooArgSet(m1), ROOT.RooFit.Range('sig_1'))
g11tot = g11.createIntegral(ROOT.RooArgSet(m1), ROOT.RooFit.Range('tot'))
g12sig = g12.createIntegral(ROOT.RooArgSet(m1), ROOT.RooFit.Range('sig_2'))
g12tot = g12.createIntegral(ROOT.RooArgSet(m1), ROOT.RooFit.Range('tot'))
g21sig = g21.createIntegral(ROOT.RooArgSet(m2), ROOT.RooFit.Range('sig_1'))
g21tot = g21.createIntegral(ROOT.RooArgSet(m2), ROOT.RooFit.Range('tot'))
g22sig = g22.createIntegral(ROOT.RooArgSet(m2), ROOT.RooFit.Range('sig_2'))
g22tot = g22.createIntegral(ROOT.RooArgSet(m2), ROOT.RooFit.Range('tot'))
print 'g11tot', g11tot.getVal(), 'g11sig', g11sig.getVal(), 'ratio', g11sig.getVal() / g11tot.getVal()
print 'g12tot', g12tot.getVal(), 'g12sig', g12sig.getVal(), 'ratio', g12sig.getVal() / g12tot.getVal()
print 'g21tot', g21tot.getVal(), 'g21sig', g21sig.getVal(), 'ratio', g21sig.getVal() / g21tot.getVal()
print 'g22tot', g22tot.getVal(), 'g22sig', g22sig.getVal(), 'ratio', g22sig.getVal() / g22tot.getVal()

yields, as expected:

g11tot 19.9530283749 g11sig 19.0451603382 ratio 0.954499737103
g12tot 11.9696090033 g12sig 11.4249886349 ratio 0.954499736104
g21tot 19.9530283749 g21sig 19.0451603382 ratio 0.954499737103
g22tot 11.9696090033 g22sig 11.4249886349 ratio 0.954499736104

Still an open question is why RooFit is not properly handling the normalization itself.

StephanH · February 11, 2019, 11:02am

getMin() gets the minimum of the default range, getMin("sig") should get the minimum of the signal range.

Let’s say you have a PDF in x, y, z. You can integrate over z, but normalise over [x, y, z], such that the integral always reflects the total probability.

I think I understand now what you need the integral for. You want the number of signal(/background) events in a specific range, and not only the total number. Is this correct?

Total number of signal events
The bare number of signal/background events can be obtained using the coefficients of the sum pdf, for signal e.g. Ns1s2.getVal().
Warning: It seems that you are forcing the number of events to be less than nents in the factory expression where you .format(nents). If this number is too low, the fit has no way of putting the correct coefficient in front of the single PDFs.

Number of events in a specific range
Let’s say you are looking for the number of events found in the range sig.

That’s probably correct.
As I said, event yields in a sum PDF are:
Sum = N1 * PDF1 + N2 * PDF2 + ...,
where Sum(N_i) is the total sum of signal and background events, and the PDFs are all normalised to unity (over the default ranges of their variables). The problem you see is that evaluation and normalisation are separate steps in RooFit.
[Some details: The bare PDF values in RooFit are not normalised. If you use pdf.getVal(), it’s not even clear which range/observables this probability refers to. It’s not possible to normalise it properly. So you may as well also return an unnormalised value, which many PDFs do. The Gaussian, for example, skips the 1/sqrt(...) term for speed reasons.
Instead of getVal(), you can use pdf.getVal(RooArgset(x, y, z)), though. Now it’s clear that you want to normalise with respect to those variables, and RooFit will compute pdf.evaluate() / Int[x,y,z | current range of x, , y, z] (PDF). ]

In your case, the normalisation has to be done by calculating the two integrals that are relevant for the problem:
You start with the number of signal events, Ns1s2.getVal(), and multiply with Int[x | sub range](PDF) / Int[x | full range](PDF). This is what the NormSet argument is for.

Have a look at $ROOTSYS/tutorials/roofit/rf110_normintegration.[C|py] (online), where this is explained in a bit more detail.

mwilkins · February 11, 2019, 1:45pm

Yes, this is the conclusion I came to, and normalizing myself (without using NormSet) works as expected:

(s1b2.createIntegral(ROOT.RooArgSet(Lc1M, Lc2M), RooFit.Range('S1S2')).getVal() / s1b2.createIntegral(ROOT.RooArgSet(Lc1M, Lc2M)).getVal()) * Ns1b2.getVal()

Using NormSet, however, does not work as expected. If I give the NormSet argument, I always end up with 1.0, regardless of the specified range. I would expect NormSet to normalize w.r.t. the variables in the total range, then return the value in the specified range; indeed, rf110 says it should.

To reproduce:

import ROOT

wsf = ROOT.TFile.Open('FitLcLc.root')
ws = wsf.Get('myWS')

s1s2 = ws.pdf('s1s2')
m_g = ws.var('m_g')
o_1 = ws.var('o_1')
m1 = ws.var('Lc1_M_Con')
m2 = ws.var('Lc2_M_Con')

print 'm1 Min, Max', m1.getMin(), m1.getMax()
print 'm2 Min, Max', m2.getMin(), m2.getMax()

m1.setRange('sig_1', *[m_g.getVal() - 1 * o_1.getVal(), m_g.getVal() + 1 * o_1.getVal()])
m2.setRange('sig_1', *[m_g.getVal() - 1 * o_1.getVal(), m_g.getVal() + 1 * o_1.getVal()])
m1.setRange('bkg', *[2320, 2330])
m2.setRange('bkg', *[2320, 2330])
m1.setRange('tot', *[2235, 2335])
m2.setRange('tot', *[2235, 2335])

inttot = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.NormSet(ROOT.RooArgSet(m1, m2)), ROOT.RooFit.Range('tot'))
intsig = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.NormSet(ROOT.RooArgSet(m1, m2)), ROOT.RooFit.Range('sig_1'))
intbkg = s1s2.createIntegral(ROOT.RooArgSet(m1, m2), ROOT.RooFit.NormSet(ROOT.RooArgSet(m1, m2)), ROOT.RooFit.Range('bkg'))
print 'tot int', inttot.getVal()
print 'sig int', intsig.getVal()
print 'bkg int', intbkg.getVal()

gives:

m1 Min, Max 2235.0 2335.0
m2 Min, Max 2235.0 2335.0
tot int 1.0
sig int 1.0
bkg int 1.0

FitLcLc.root (224.8 KB)

StephanH · February 12, 2019, 10:24am

It’s a really nasty problem with temporaries:
https://sft.its.cern.ch/jira/browse/ROOT-9861

You can work around it by binding the ArgSets to names, such that don’t get destroyed by Python and survive until the integrals are actually evaluated:

set_m1m2 = ROOT.RooArgSet(m1, m2)
normSet = ROOT.RooFit.NormSet(set_m1m2)

inttot = s1s2.createIntegral(set_m1m2, normSet, ROOT.RooFit.Range('tot'))
intsig = s1s2.createIntegral(set_m1m2, normSet, ROOT.RooFit.Range('sig_1'))
intbkg = s1s2.createIntegral(set_m1m2, normSet, ROOT.RooFit.Range('bkg'))
print 'tot int', inttot.getVal()

Result:

m1 Min, Max 2235.0 2335.0
m2 Min, Max 2235.0 2335.0
[#1] INFO:Eval -- RooRealVar::setRange(Lc1_M_Con) new range named 'sig_1' created with bounds [2279.32,2295.24]
[#1] INFO:Eval -- RooRealVar::setRange(Lc2_M_Con) new range named 'sig_1' created with bounds [2279.32,2295.24]
[#1] INFO:Eval -- RooRealVar::setRange(Lc1_M_Con) new range named 'bkg' created with bounds [2320,2330]
[#1] INFO:Eval -- RooRealVar::setRange(Lc2_M_Con) new range named 'bkg' created with bounds [2320,2330]
[#1] INFO:Eval -- RooRealVar::setRange(Lc1_M_Con) new range named 'tot' created with bounds [2235,2335]
[#1] INFO:Eval -- RooRealVar::setRange(Lc2_M_Con) new range named 'tot' created with bounds [2235,2335]
tot int 1.0
sig int 0.667829588723
bkg int 6.03744066677e-11

mwilkins · February 12, 2019, 1:48pm

Yes, that’s solved it. I agree with your characterization that this is a nasty problem.

system · February 26, 2019, 1:48pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.