RooStats: Problem for counting experiment with large numbers

Hello,

I am trying to set up RooStats so that I can reproduce 95% CLs upper limits on Nsig in counting experiments with given Nobs, Nbkg and dNbkg. For that, I used the example “Counting Experiment” program given on twiki.cern.ch/twiki/bin/view/Ro … August2012

Everything works fine and I am able to reproduce most numbers given in ATLAS papers. All of these have Nobs values smaller than 6000.

However, there is one ‘exceptional’ analysis with a very large number of events, and it gives strange results. It has the following parameters:
Observed: 350932
Expected: 344400 ± 12822

First, during the evaluation RooStats prints lots of warning messages, sometimes including:

[quote] *******************************************************************************
FUNCTION VALUE DOES NOT SEEM TO DEPEND ON ANY OF THE 2 VARIABLE PARAMETERS.
VERIFY THAT STEP SIZES ARE BIG ENOUGH AND CHECK FCN LOGIC.
[/quote]
Then, the result looks odd [see attachment]: for small s, the CL distributions look right (the larger s, the smaller CLs+b and CLs). But at some point, CLb behaves strangely (rising to 1, dropping to 0, or taking strange intermediate values), and CLs+b and CLs then also take strange values. The TestStatistics plot [see attachment] looks odd from that point on as well, but to be honest I don't understand why. For comparison, I attached the results I get from a well-behaved dataset (Nobs = 210, Nbkg = 214 ± 23.4).

I’ve tried a lot of things now (changing the parameter ranges, increasing the number of toys, switching from the FrequentistCalculator to the HybridCalculator or AsymptoticCalculator), but I always get the same problem: for large s, the CLb, CLs and CLs+b curves show peculiar behaviour, so that I cannot read off the ‘actual’ CLs = 0.05 value for the upper limit.

As I don’t see that this is expected behaviour: can you tell me what causes this and how to fix it or construct a workaround? I do not really understand what the above error messages are trying to tell me. Is there some numerical problem because of the large event counts, or is there something fundamentally wrong in my setup which by luck did not produce any errors in my other tests?
example.cpp (3.7 KB)
TestStatistics2.eps (270 KB)
CLdistribution2.eps (14.3 KB)
TestStatistics.eps (193 KB)
CLdistribution.eps (18.9 KB)

Hi,

I suspect what you have is a numerical problem and that the fit is failing because of it. A possible solution is to redefine the parameters you are fitting, s and b, so that they are as close as possible to 1, by scaling them with a constant factor. For example, define a variable mu as the parameter of interest and use s = 33000 * mu in the pdf.
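As a minimal sketch of what such a reparametrisation could look like in RooWorkspace factory syntax (Python): the scale factor 33000 is taken from Lorenzo's example, the observed and expected numbers are the ones from the original post, and all variable names and ranges are illustrative assumptions rather than code from this thread.

import ROOT

w = ROOT.RooWorkspace("w")
# parameter of interest mu stays of order 1; the expected signal yield is 33000 * mu
w.factory("expr::nsig('33000.*mu', mu[1, 0, 3])")
# background = nominal value times an order-1 nuisance parameter eps_b
w.factory("sum::nexp(nsig, prod::nbkg(b[344400], eps_b[1, 0.8, 1.2]))")
w.factory("Gaussian::constraint_b(eps_b, 1., sigma_b[0.0372])")  # 12822/344400, i.e. ~3.7% relative uncertainty
w.factory("Poisson::pdf_sb(nobs[350932, 0, 1000000], nexp)")
w.factory("PROD::model_sb(pdf_sb, constraint_b)")
w.Print()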

Lorenzo

Hi,

To bring this thread back from the dead: I have the identical problem, and I think the issue, as Lorenzo said, is numerical. Basically, for the Poisson pdf P(n|s+b), as you scan larger and larger values of “s” the pdf evaluates to smaller and smaller numbers until presumably some threshold is hit and the pdf simply evaluates to zero (the point seems to be around 10^-280). From that point on all the test statistics return nonsensical results.
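As a rough back-of-the-envelope illustration of that underflow (using the numbers from the original post; this is an assumed example, not code from the thread), the raw Poisson probability drops out of double-precision range for fairly moderate signal values, while its logarithm stays perfectly finite:

import math

nobs, b = 350932, 344400
for s in (0, 20000, 50000):
    mean = s + b
    # log of Poisson(nobs | s + b), evaluated via the log-gamma function to avoid overflow
    log_p = nobs * math.log(mean) - mean - math.lgamma(nobs + 1)
    print("s = %6d   log P = %10.1f   P = %g" % (s, log_p, math.exp(log_p)))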

I am curious as to how this problem was eventually solved. Did you end up renormalizing your signal parameter in some way? If so how?

I am considering something like Poisson(n | mu*s + b), where mu is the POI and is defined as, say, mu[1, 0, 5], with 1 being SM, 0 being no signal at all and 5 being 5x SM, and “s” is then some fixed scaling factor. However, in this case am I not just transferring the large-number dependence to s? For instance, I would do my CLs scan from 0 to 5 instead of 0 to 15000, but I would need to set s to 3000 if my original POI s was varying from 0 to 15000. The pdf would still be evaluated at large numbers because of the mu*s term.

Hi,

I found two possible workarounds. Both of them produced reasonable results and didn’t raise severe RooFit/RooStats errors.

The first one is actually what Lorenzo proposed, i.e. reparametrising into order-1 parameters (note that the following is in Python syntax):

w.factory("Gaussian:constraint_b(eps_b[1, "+str(max(0, 1.-5.*sigmab/b0))+", "+str(1.+5.*sigmab/b0)+"], 1., sigma_b["+str(sigmab/b0)+"])")
w.factory("Gaussian:constraint_s(eps_s[1, "+str(max(0, 1.-5.*sigmas/s0))+", "+str(1.+5.*sigmas/s0)+"], 1., sigma_s["+str(sigmas/s0)+"])")
w.factory("sum:nexp(s[0,"+str(s0)+"]*eps_s,b["+str(b0)+"]*eps_b)")
w.factory("Poisson:pdf_sb(nobs["+str(nobs)+"],nexp)")
w.factory("PROD:model_sb(pdf_sb, constraint_b)")

[...]
mc_sb.SetParametersOfInterest(RooArgSet(w.var("s")))
mc_sb.SetObservables(RooArgSet(w.var("nobs")))
mc_sb.SetNuisanceParameters(RooArgSet(w.var("eps_b"), w.var("eps_s")))
mc_sb.SetGlobalObservables(RooArgSet(w.var("b")))

Currently, I am using this one and it seems to work quite well.
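Since the question of how the limit is actually extracted from this model comes up again below, here is a hedged sketch of how a ModelConfig like mc_sb might be fed to the HypoTestInverter, following the standard RooStats counting-experiment example; the dataset, the numbers of toys and the scan range are placeholders, not values used in this thread.

from ROOT import RooStats, RooArgSet

# assumes w, mc_sb and a RooDataSet 'data' containing the observed nobs already exist
poi = w.var("s")
mc_sb.SetSnapshot(RooArgSet(poi))        # snapshot of the S+B model at the current value of s
mc_b = mc_sb.Clone()                     # background-only model: same pdf, with s fixed to 0
mc_b.SetName("ModelConfig_b")
poi.setVal(0.)
mc_b.SetSnapshot(RooArgSet(poi))

calc = RooStats.FrequentistCalculator(data, mc_b, mc_sb)   # null = S+B, alternative = B-only
calc.SetToys(1000, 500)                                    # toy numbers are placeholders

# one-sided profile likelihood test statistic, as in the standard counting example
profll = RooStats.ProfileLikelihoodTestStat(mc_sb.GetPdf())
profll.SetOneSided(True)
calc.GetTestStatSampler().SetTestStatistic(profll)

inverter = RooStats.HypoTestInverter(calc)
inverter.SetConfidenceLevel(0.95)
inverter.UseCLs(True)
inverter.SetFixedScan(20, 0., s0)        # scan the POI from 0 up to its nominal value s0
result = inverter.GetInterval()
print("95% CLs upper limit on s:", result.UpperLimit())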

The second option, which I actually tried first, uses a hand-waving argument: the problem occurred for large numbers, and I thought it was because of the small width of the Poisson distribution. But a Poisson for large numbers approaches a Gaussian with width sqrt(N). Now, in the cases I had, the main uncertainty was given by the background uncertainty sigma_b, which is also a Gaussian width around b (and sigma_s around s). Hence, I used a single Gaussian pdf whose width is the combination of sqrt(N), sigma_b and sigma_s:

w.factory("sum:nexp(s[0,"+str(s0)+"],b["+str(b0)+"])")
w.factory("Gaussian:pdf_sb(nobs["+str(nobs)+"],nexp,expr::dnexp('sqrt(nexp+sigmab*sigmab+sigmas*sigmas)', nexp,sigmab["+str(sigmab)+"],sigmas["+str(sigmas)+"]))")
#w.factory("PROD:model_sb(pdf_sb)")

I am not a statistics expert, but I believe that this is not completely correct. However, it produced reasonable results for the parameter sets I looked at, so I took it as an ‘alternative pdf’ to be used for cases with very small statistical uncertainties sqrt(N). But as I said, I am now using version 1, which at least so far has not produced the errors I encountered before.
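As a quick numerical check of that combined width (using the numbers from the original post, Nobs = 350932 and sigma_b = 12822, and neglecting sigma_s; this is my own illustration, not part of the thread), the combined width is completely dominated by sigma_b:

import math

nexp, sigmab = 350932., 12822.
width_poisson = math.sqrt(nexp)                 # ~ 592 events
width_combined = math.sqrt(nexp + sigmab**2)    # ~ 12836 events
print("sqrt(N) = %.0f, combined width = %.0f" % (width_poisson, width_combined))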

Hi,

(answering to the previous post).

A large value of s * mu is not a problem per se. The pdf can be very small, close to zero; in the end what is computed is the log of the pdf, so very small values should be fine. Also, it does not make sense to scan in regions where the pdf is essentially zero, so if you are having a problem because of this, you should restrict your parameter-of-interest range.
The numerical problem observed before happens during the likelihood minimisation. Having parameter values close to one reduces the condition number of the covariance matrix and makes the minimisation much more stable numerically.

Best Regards 

Lorenzo

Hi Daschm,

Could you define what the parameters b0, sigmab, s0 and sigmas are? Could you also tell me what values you entered for them, and the range over which you scan the POI “s”? I am assuming b0 and s0 are the expected background and signal events and sigmab and sigmas are the absolute uncertainties on them. If so, I am having trouble understanding the model. For

w.factory("sum:nexp(s[0,"+str(s0)+"]*eps_s,b["+str(b0)+"]*eps_b)")

I suppose s and b here are your signal and background strengths, but they appear to range from 0 all the way up to the nominal values. Or do s0 and b0 define something else? Am I right in thinking that eps_s and eps_b are simply constraints representing your systematic uncertainties? I am not sure I understand where the “reparametrising to order 1” occurs; it doesn’t look different from the original model.

Also, for the second method I get the same problem as before. I built my model like this:

Gauss(n | s+b, sqrt(s+b)) * Gauss(b0 | b, sigma_b) * Gauss(lumi0 | lumi, sigma_lumi), but I encounter the same problem when scanning large values of s. What I did here was simply replace the Poisson distribution with a Gaussian, under the assumption that for large means Poisson(n|mu) ~ Gauss(n|mu, sqrt(mu)). I kept my additional Gaussian constraints as they are (I am not sure why you chose to add the widths of the background systematics to the same pdf).
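In factory syntax, a model of that form might look roughly like the sketch below. This is purely illustrative: the variable names, ranges, example numbers and the assumption that lumi multiplies the signal yield are mine, not taken from the thread.

import ROOT

w = ROOT.RooWorkspace("w")
# nexp = lumi * s + b, assuming the luminosity scales the signal yield
w.factory("sum::nexp(prod::nsig(lumi[1., 0.5, 1.5], s[0., 500000.]), b[344400., 0., 700000.])")
w.factory("expr::width('sqrt(nexp)', nexp)")
w.factory("Gaussian::pdf_sb(nobs[350932., 0., 1000000.], nexp, width)")    # Gauss(n | s+b, sqrt(s+b))
w.factory("Gaussian::constraint_b(b0[344400.], b, sigma_b[12822.])")       # Gauss(b0 | b, sigma_b)
w.factory("Gaussian::constraint_lumi(lumi0[1.], lumi, sigma_lumi[0.03])")  # Gauss(lumi0 | lumi, sigma_lumi)
w.factory("PROD::model_sb(pdf_sb, constraint_b, constraint_lumi)")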

I am really curious how HistFitter handles this problem; if you try this number-counting problem with their default python script MyUserAnalysis.py, it doesn’t have this large-number problem.

Cheers,

Kuhan

[quote=“Daschm”]I found two possible workarounds. Both of them produced reasonable results and didn’t raise severe RooFit/RooStats errors. [...] Currently, I am using this one and it seems to work quite well.[/quote]

Hi,

To start with: I am a complete statistics rookie, so it could easily be that what I did is somewhat flawed or even complete rubbish. Still, for the parameter sets I had, I got the right results, so I guess it can’t be too wrong.

So, I guess the following extra part of the code might be important for understanding:

Also, I have to mention that the code from my second post is not actually the solution to what I did in the first post: now I want to calculate the CLs value for a given value of S ± dS, whereas before I was looking for the S that corresponds to CLs = 0.05. But the numerical problems I ran into before do not appear here anymore, even if I choose the same values for B, dB and O. The ‘hand-waving’ second option is what I actually used for the original problem, and it worked.

So, physically, I measured b and s with certain inaccuracies, such that I assume the ‘actual’ b is the expected b0 (which is an observable) times a Gaussian-distributed factor with centre 1 and width sigma_b/b (which is a nuisance parameter). The same holds for s, the POI. For CLs, I need to evaluate the background-only pdf with s = 0, which is why I allow that value in the defined parameter range.

I guess whether you use a Gaussian or a Poisson does not make a difference, because they are practically equivalent, as we’ve discussed. So I wouldn’t expect your second solution to work.
My idea (which could be totally wrong) was that since the ‘width’ of the Poisson is very small, you need a very particular value of b to get the maximum likelihood. However, it is hard to find this numerically if, for most of the b values preferred by the Gauss(b, b0, sigmab) pdf, the Poisson(o, s+b) pdf is practically zero. So I combined the two into an approximate ‘total Gaussian’ Gauss(o, s+b, sqrt(s+b + sigmab^2 + sigmas^2)), so that the algorithm finds the ‘right’ b more easily.

I hope that my statements make some sense.

Hi Daschm,

Maybe it would be clearer if you posted the section of code that shows how you set the limits, given the model you have constructed.

Are you saying you are not scanning a region of the POI? I don’t understand the difference between looking for the value of s that corresponds to CLs = 0.05 and evaluating the CLs at a given value of s. The latter is found by scanning over a region of s, evaluating the CLs value at each particular value of s, and then drawing a horizontal line at 0.05 to indicate the value of s that corresponds to 5%. So you are just asking what the CLs value is at many different values of s.

I agree with your comment about the second solution. Replacing the Poisson distribution with a Gaussian won’t work, since the problem seems to be hitting a numerical limit, not a difficulty in calculating the distribution itself.

Cheers,

Kuhan

Hi Daschm,

I am trying to teach myself RooStats and I came across your post here. I tried using your code, but I ended up getting an error.

testroostats.exe(12975) malloc: *** error for object 0x1101c17c0: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

It occurs during the execution of this line:

ModelConfig mc("ModelConfig",&w);

Did you ever encounter that? I assume you are using the ROOT interpreter to run your macro, correct? I use ROOT as libraries in compiled C++ code, and that may be part of the issue.

That being said, I am curious whether there exists any simple code to calculate the 95% exclusion limit given Nobs, Nbkg and the error on Nbkg. It seems like all of the RooStats examples I have seen go through excessive steps to calculate something that should take just a few lines. I can calculate CLs values in C++ without accounting for the error on Nbkg with about 10 lines of code, and it runs quickly enough for Nobs < 10k. While I understand the power and usefulness of a system like RooStats, it certainly is not user friendly.
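For reference, a rough sketch of the kind of ‘few-line’ CLs calculation described above, with no background uncertainty and simple event counting, i.e. CLs = P(n <= nobs | s+b) / P(n <= nobs | b). This is an illustration of the idea under those assumptions (in Python, using ROOT’s cumulative Poisson), not the poster’s actual code, and the scan step is arbitrary:

import ROOT

def cls_no_syst(nobs, b, s):
    # CLs+b and CLb from the cumulative Poisson probability of observing <= nobs events
    clsb = ROOT.Math.poisson_cdf(nobs, s + b)
    clb = ROOT.Math.poisson_cdf(nobs, b)
    return clsb / clb

# crude scan for the 95% CLs upper limit on s (numbers from the 'well-behaved' example above)
nobs, b = 210, 214.
s = 0.
while cls_no_syst(nobs, b, s) > 0.05:
    s += 0.1
print("95%% CLs upper limit on s (no background uncertainty): %.1f" % s)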