Unphysical signal yield in RooFit 2D MCStudy

Hi all,

I am having problems with a two-dimensional extended maximum likelihood fit of a signal plus a background component. A large fraction of the fits returns a completely unphysical number of signal events. Both signal and background are fixed-shape PDFs without any additional shape parameters; they are created from histograms as RooHistPdfs (see plot 1, with linear and log scale for each histogram) and have only a small overlap, i.e. they are well separated.

The only fit parameters are therefore the numbers of signal and background events, N_sig and N_bkgd.

For the moment I am testing the model to determine its discrimination power. To do that I set the signal yield to zero events and the background yield to some value, e.g. 100 events. From this model I generate 10,000 Monte Carlo toy data sets and fit each one with the total PDF. The idea is that the width of the N_sig distribution tells me about the discrimination power of my PDF model.
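
For reference, once the study has run I read the width off the toy results roughly like this (a sketch appended after MCStudy.generateAndFit() in the script below; I am assuming that RooMCStudy.fitParDataSet() holds one entry per toy fit):

# Sketch: mean and RMS of the fitted N_sig values over all toys.
# Assumes fitParDataSet() returns one entry per generated toy sample.
fitpars = MCStudy.fitParDataSet()
vals = [fitpars.get(i).getRealValue('N_sig') for i in range(fitpars.numEntries())]
mean = sum(vals) / len(vals)
rms = (sum((v - mean)**2 for v in vals) / len(vals))**0.5
print('N_sig: mean = %.3f, RMS = %.3f' % (mean, rms))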

Now to the problem:
As plot 2 shows, an overwhelming fraction of the fits does not converge around N_sig = 0 (as one would expect, since no signal component is present in the toy samples). Instead the fitted values extend all the way down to the lower limit of the N_sig parameter, no matter whether that limit is set to -10 or -500! So small upward fluctuations into the signal PDF are apparently fitted as expected, but downward fluctuations lead to completely unphysical signal yields.
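
As a quick check of how bad it is, I count the toys whose fitted N_sig ends up pinned at the limit (a small diagnostic appended after generateAndFit() in the script below; I am assuming RooMCStudy.fitParams(i) returns the fitted parameter set of toy i):

# Diagnostic sketch: count fits that ended at the lower N_sig limit.
n_toys = 10000
n_at_limit = 0
for i in range(n_toys):
    fitted = MCStudy.fitParams(i).getRealValue('N_sig')
    if fitted <= N_sig.getMin() + 1e-3:  # within numerical tolerance of the limit
        n_at_limit += 1
print('fits at the lower N_sig limit: %d / %d' % (n_at_limit, n_toys))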

Limiting the signal to positive values makes most of the fits return exactly N_sig = 0.0, so I get a spike at the boundary rather than a usable distribution.
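
The bounded variant is just a one-line change of the yield definition in the script below:

# Variant with N_sig bounded at zero: most background-only toys then end
# up exactly at the boundary N_sig = 0.0, rather than forming a distribution.
N_sig = RooRealVar('N_sig', 'N_sig', 0., 0., 20.)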

Attached are a ROOT file with the two histograms used to create the PDFs and a Python script with a minimal working example of the problem.

I would be very grateful for any comments and suggestions to help me understand and maybe solve this problem. Thanks,

lukash

Plot 1: [attached image: signal and background histograms, linear and log scale]

Plot 2: [attached image: distribution of the fitted N_sig values from the toy study]

Minimal working example:

#!/usr/bin/env python
"""Minimal working example

"""

from ROOT import *
from ROOT import RooFit
from ROOT.RooFit import *


# Observables
x = RooRealVar('x', 'x', 0., 20.)
y = RooRealVar('y', 'y', 0., 10.)


# Open the file holding the input histograms
HistFile = TFile('histograms.root', 'READ')


# Get background histogram and create PDF
BkgdHist = HistFile.Get('background')
BkgdDataHist = RooDataHist('bkgd_datahist', 'bkgd_datahist', RooArgList(x, y), BkgdHist)
BkgdPdf = RooHistPdf('bkgd_pdf', 'bkgd_pdf', RooArgSet(x, y), BkgdDataHist)
N_bkgd = RooRealVar('N_bkgd', 'N_bkgd', 100., 0., 200.)


# Get signal histogram and create PDF
SigHist = HistFile.Get('signal')
SigDataHist = RooDataHist('sig_datahist', 'sig_datahist', RooArgList(x, y), SigHist)
SigPdf = RooHistPdf('sig_pdf', 'sig_pdf', RooArgSet(x, y), SigDataHist)
N_sig = RooRealVar('N_sig', 'N_sig', 0., -10., 20.)  # note: lower limit allows negative yields


# Total PDF consisting of signal and background
TotalPdf = RooAddPdf('total_pdf', 'total_pdf', RooArgList(BkgdPdf, SigPdf), RooArgList(N_bkgd, N_sig))


# Monte Carlo study
MCStudy = RooMCStudy(TotalPdf, RooArgSet(x, y), Extended(kTRUE), FitOptions(Save(kTRUE), Verbose(kFALSE)))
MCStudy.generateAndFit(10000)


# Plot parameter and pull distribution of the Monte Carlo Study
canvas = TCanvas()
canvas.Divide(2, 2)

canvas.cd(1)
frame1 = MCStudy.plotParam(N_sig)
frame1.Draw()

canvas.cd(2)
frame2 = MCStudy.plotPull(N_sig)
frame2.Draw()

canvas.cd(3)
frame3 = MCStudy.plotParam(N_bkgd)
frame3.Draw()

canvas.cd(4)
frame4 = MCStudy.plotPull(N_bkgd)
frame4.Draw()

minimal_working_example.py (1.45 KB)
histograms.root (551 KB)

I have found some more information about the problem:
If I change signal and background so that they are completely separated in both observables (no overlap at all), then every fit of a background-only sample returns a negative signal yield (at the lower boundary).
This seems to be connected to the form of the extended likelihood:

L = Poisson(n_obs; n_exp) * prod_i [ n_sig/(n_sig+n_bkgd) * SignalPDF(x_i) + n_bkgd/(n_sig+n_bkgd) * BkgdPDF(x_i) ]

with n_exp = n_sig + n_bkgd and the product running over the n_obs observed events.

If there are no signal events, SignalPDF(x_i) evaluates to zero for every observed event, so n_sig never enters the first part of any per-event term. The likelihood can still be improved by increasing n_bkgd, with n_sig reduced by the same amount to keep n_exp constant (so that the Poisson term stays "high"). Nothing limits this trade-off, so the minimiser drives n_sig to its lower boundary; the small numerical sketch below illustrates this.
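
Here is a small numerical sketch of that argument, independent of RooFit. It assumes complete separation, so SignalPDF(x_i) = 0 for every observed event, and a flat background PDF, whose value only adds a constant to the NLL:

# Extended NLL up to constants, for a background-only sample with
# complete signal/background separation: the signal part vanishes in
# every per-event term, leaving NLL = n_exp - n_obs * log(n_bkgd) + C.
import math

n_obs = 100  # observed (background-only) events

def nll(n_sig, n_bkgd):
    n_exp = n_sig + n_bkgd
    return n_exp - n_obs * math.log(n_bkgd)

for n_sig in (5., 0., -5., -10., -50.):
    n_bkgd = n_obs - n_sig  # compensate so that n_exp = n_obs stays fixed
    print('n_sig = %6.1f  ->  NLL = %8.3f' % (n_sig, nll(n_sig, n_bkgd)))

# The NLL decreases monotonically as n_sig goes negative, so MINUIT runs
# to whatever lower limit I set on N_sig.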

I tried to incorporate constraint terms, but there is no a-priori justification for them. Does anyone know how to circumvent this behaviour? This must be a general problem, and I can't believe I'm the first one to encounter it.
Thanks a lot.