Error for unbinned/binned/weighted fits

ibelyaev · August 25, 2022, 10:24am

Dear Experts,

To make a simultaneous fit to binned and unbinned data, I need to convert the binned data into “weighted” data (as has been proposed in some thread). However, I’ve observed that this causes a drastic explosion of the fit uncertainty, which is not what I expected.
I’ve tried to narrow the problem (or feature) down to a rather simple case, where I perform a simple Gaussian fit for the following configurations:

(1) Unbinned dataset
(2) RooDataHist created from the histogram
(3) Weighted RooDataSet created from the histogram
(4) Same as (3), but with SumW2Error(True) treatment of the uncertainty
(5) Same as (3), but with AsymptoticError(True) treatment of the uncertainty

The results are rather unexpected (for me). The fitted values are practically the same for all scenarios, but the uncertainties are different:

(1) and (2) are almost the same - well, it is expected for narrow bins and large statistics
(3) appears to be the same as (2)… it is not so clear why, since without special treatment of the weights one would expect an incorrect estimate of the uncertainty
(4) is different from (1, 2, 3). Naively I would expect (4) to be closer to (1, 2) than (3) is
(5) is the same as (4). Again, naively I would expect (5) to be closer to (1, 2) than (3, 4) are

I have the impression that I am doing something wrong.

My code is available here: https://gist.github.com/VanyaBelyaev/2f385ea57efa04a618e1b39f01b39cbe

eguiraud · August 29, 2022, 9:08am

Hi @ibelyaev ,

We need @jonas’s help here, let’s ping him.

Cheers,
Enrico

jonas · August 29, 2022, 11:14am

(1) and (2) are almost the same - well, it is expected for narrow bins and large statistics

Yes, (1) and (2) should be the same

(3) appears to be the same as (2)… it is not so clear why, since without special treatment of the weights one would expect an incorrect estimate of the uncertainty

There is a special treatment of weights, also for the RooDataSet: each log-likelihood term is multiplied by the weight. So (1), (2), and (3) should all be the same, which they indeed are.
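
The effect of multiplying each log-likelihood term by the weight can be sketched in plain Python, without RooFit. The sketch assumes a Gaussian with known width, so that the only fit parameter is the mean and both the estimate and its inverse-Hessian error have closed forms. The helper names (weighted_nll, fit_mean) and the binning are illustrative assumptions and are not taken from the original script.

```python
import math
import random


def weighted_nll(mu, sigma, entries):
    """Negative log-likelihood where each term is multiplied by its weight."""
    norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(w * (0.5 * ((x - mu) / sigma) ** 2 + norm) for x, w in entries)


def fit_mean(entries, sigma):
    """Closed-form minimum of weighted_nll in mu, plus the naive error.

    The second derivative of the NLL with respect to mu is sum(w) / sigma^2,
    so the inverse-Hessian error is sigma / sqrt(sum(w)).
    """
    sumw = sum(w for _, w in entries)
    mu_hat = sum(w * x for x, w in entries) / sumw
    return mu_hat, sigma / math.sqrt(sumw)


random.seed(1)
sigma = 1.0
values = [random.gauss(5, 1) for _ in range(1000)]

# (1) unbinned: every event enters with weight 1
unbinned = [(x, 1.0) for x in values]

# (3) histogram turned into a weighted dataset: one entry per non-empty bin,
# placed at the bin centre and weighted with the bin content
nbins, lo, hi = 200, 0.0, 10.0
width = (hi - lo) / nbins
counts = [0] * nbins
for x in values:
    if lo <= x < hi:
        counts[min(int((x - lo) / width), nbins - 1)] += 1
binned = [(lo + (i + 0.5) * width, float(n)) for i, n in enumerate(counts) if n]

mu1, err1 = fit_mean(unbinned, sigma)
mu3, err3 = fit_mean(binned, sigma)

# The closed-form estimate really is the minimum of the weighted NLL
assert weighted_nll(mu3, sigma, binned) < weighted_nll(mu3 + 0.1, sigma, binned)

print("unbinned:         mu = %.4f +- %.4f" % (mu1, err1))
print("binned as weights: mu = %.4f +- %.4f" % (mu3, err3))
```

In this toy both configurations see the same sum of weights, so the naive errors agree, and the means differ only by the bin-centre approximation.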

(4) is different from (1, 2, 3). Naively I would expect (4) to be closer to (1, 2) than (3) is

(4) and (5) are for a special case. You should read more about it in the documentation of RooAbsPdf::fitTo(), under SumW2Error. It is for the case where you have weighted (usually MC) events and you want to know what your uncertainty would be if the events were unweighted. But what is needed here is the sum of the original event weights, not the histogram counts! Usually the original event weights are one, and in your case they also seem to be unity, since you filled the TH1D without weights.

Now, the RooDataSet doesn’t know about the original sum of squared weights anymore, so your fit result (4) is completely irrelevant and doesn’t mean anything. It’s like pretending each bin got filled with exactly one event that had a huge weight. You can try to do (4) with the hdset RooDataHist. The RooDataHist knows about the sum of squared weights, which in your case is identical to the sum of weights, so (4) will then again give the correct result, like (1), (2), and (3).
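
The size of the effect can be estimated with a small plain-Python sketch. It assumes the same one-parameter toy (Gaussian mean with known width) and assumes the SumW2 correction has the form V' = V C^-1 V, with V taken from the weighted fit and C from a fit with squared weights, as described in the fitTo() documentation. The bin contents and function names are made up for illustration.

```python
import math


def naive_error(sumw, sigma=1.0):
    """Inverse-Hessian error on the mean of a Gaussian with known width."""
    return sigma / math.sqrt(sumw)


def sumw2_corrected_error(sumw, sumw2, sigma=1.0):
    """Error after the V C^-1 V correction for the same toy model.

    V = sigma^2 / sumw   (covariance from the weighted likelihood)
    C = sigma^2 / sumw2  (covariance from the squared-weight likelihood)
    V' = V * (1 / C) * V = sigma^2 * sumw2 / sumw^2
    """
    return sigma * math.sqrt(sumw2) / sumw


# Hypothetical histogram: 1000 unweighted events spread over five bins
counts = [50, 200, 500, 200, 50]
sumw = float(sum(counts))

# Histogram filled without weights: the per-bin sumw2 equals the bin content
sumw2_hist = float(sum(counts))

# Weighted dataset with one entry per bin: sumw2 is taken as weight squared
sumw2_dset = float(sum(n * n for n in counts))

err_naive = naive_error(sumw)
err_hist = sumw2_corrected_error(sumw, sumw2_hist)
err_dset = sumw2_corrected_error(sumw, sumw2_dset)

print("naive error:                     %.4f" % err_naive)
print("corrected, true sumw2:           %.4f" % err_hist)
print("corrected, weight-squared sumw2: %.4f" % err_dset)
```

With the true sum of squared weights the correction changes nothing, while squaring the bin contents inflates the error by a factor of sqrt(sum n^2 / sum n).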

(5) is the same as (4). Again, naively I would expect (5) to be closer to (1, 2) than (3, 4) are

(5) gives a wrong result anyway right now, because it has a bug that I need to fix ASAP. But it also uses the sum of weights squared, so the expectation would be that once you use the hdset that keeps track of these correctly you get the same result.

Hope this makes things a bit clearer, if not let me know!
Jonas

ibelyaev · August 29, 2022, 11:24am

Dear Jonas,
Thank you very much.
Let me explain why I am asking these questions.
I want to make a simultaneous fit to binned and unbinned data.
Following the earlier advice, I convert the binned data into weighted data (since I cannot combine a RooDataHist and a RooDataSet into a single dataset). As a result, I have a weighted dataset.
Usually, as soon as I have a weighted dataset I need to specify SumW2Error(True), otherwise the error estimates are wrong. And it is not clear to me why it is not needed in this particular case…

(Indeed, my final case is even a bit more complicated: I want to combine a weighted unbinned dataset (after sPlot) with a binned dataset. Following the previous steps, I can convert the binned dataset into a weighted dataset and then combine the two weighted datasets. The question here is: should I activate SumW2Error here or not? For sPlotted datasets I must do it. As the example above shows, for a binned->weighted dataset I should not do it. What should be done for the combined dataset?)

jonas · August 29, 2022, 12:21pm

Usually, as soon as I have a weighted dataset I need to specify SumW2Error(True), otherwise the error estimates are wrong. And it is not clear to me why it is not needed in this particular case…

Right, if you do your sPlots fit you need the error correction. I was however referring to your script attached on GitHub, in which you filled your histogram and data completely unweighted:

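    # Excerpt from the script: xvar, varset, dataset and histo are defined earlier
    for i in range(1000):
        value = random.gauss(5, 1)  # unweighted toy event
        xvar.setVal(value)
        dataset.add(varset)         # unbinned dataset, no weight given
        histo.Fill(value)           # histogram filled without weights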

In that case the error correction was not needed because you had no weights to begin with.

Good that you explained your use case with the sPlots then! Okay, so you want to combine a weighted unbinned dataset with a binned dataset, and that apparently only works when converting the binned dataset to an unbinned one. However, when you do that conversion, the dataset forgets the original sum of weights squared, and then all results with SumW2Error() or AsymptoticError() are screwed up.

I will also ask some colleagues if they have an idea, but right now the only answer I can give is that this is not supported by RooFit and it will take me a while to give you a recommendation.

Which experiment are you working on? Have your colleagues maybe already encountered this fit setup?

ibelyaev · August 29, 2022, 12:25pm

Hi Jonas,
Yes, my use case is exactly as you have shortly and nicely described.
I’ll wait for your advice.
Cheers, Vanya
P.S. I am working on the LHCb experiment. I’ve checked with my colleagues, and nobody has a clear solution.

jonas · August 30, 2022, 12:53pm

Hi @ibelyaev,

In your script, you were trying to propagate the sumw2 errors from the histogram to the RooDataSet by using weight errors:

    wdset.add ( vset , value , error )

I think it was reasonable of you to expect this to work, so we plan to change RooFit such that it does. The plan is to make the following change of behavior: if your RooDataSet has weight errors, we assume them to be sumw2-based errors and hence use them to get the sum of weight squares for each dataset entry, instead of just squaring the weight.
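
The proposed rule can be illustrated with a short plain-Python sketch. This is not RooFit code: entry_sumw2 is a hypothetical helper, and the sketch assumes that the weight error passed to add() is the histogram bin error, which for a histogram filled without weights is the square root of the bin content.

```python
import math


def entry_sumw2(weight, weight_error=None):
    """Sum of squared weights attributed to one dataset entry.

    Without a weight error the only available guess is weight^2.
    With a weight error, it is interpreted as sqrt(sumw2) of the
    original events that were merged into this entry.
    """
    if weight_error is None:
        return weight * weight
    return weight_error * weight_error


# Hypothetical bin contents of a histogram filled without weights
counts = [50, 200, 500, 200, 50]

# One entry per bin, as (weight, weight error)
with_errors = [(float(n), math.sqrt(n)) for n in counts]
without_errors = [(float(n), None) for n in counts]

sumw2_with_errors = sum(entry_sumw2(w, e) for w, e in with_errors)
sumw2_without_errors = sum(entry_sumw2(w, e) for w, e in without_errors)

print("sumw2 using weight errors:   %.1f" % sumw2_with_errors)
print("sumw2 from squaring weights: %.1f" % sumw2_without_errors)
```

Under this rule the weighted dataset recovers the histogram's sum of squared weights, which for unweighted fills equals the number of events.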

I have opened this PR where the change is suggested:

When it’s merged, you can probably achieve what you want to do with the next ROOT release and the nightly builds.

Until then, the easiest (but imperfect) solution to your problem is to make a completely binned fit and also convert your RooDataSet from the sPlots to a binned dataset, such that you can combine it with the other RooDataHist without losing its sum of weight squares.
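
What the binned route preserves can be sketched in plain Python. The sketch bins a handful of made-up weighted events (sWeights can be negative) and keeps, per bin, both the sum of weights and the sum of squared weights. The function name, binning, and event values are assumptions for illustration only.

```python
def bin_weighted(events, nbins, lo, hi):
    """Bin (x, weight) pairs, accumulating sumw and sumw2 for each bin."""
    width = (hi - lo) / nbins
    sumw = [0.0] * nbins
    sumw2 = [0.0] * nbins
    for x, w in events:
        if lo <= x < hi:
            i = min(int((x - lo) / width), nbins - 1)
            sumw[i] += w
            sumw2[i] += w * w  # kept per bin, so it survives the conversion
    return sumw, sumw2


# Hypothetical sWeighted events as (observable value, weight)
events = [(4.9, 1.2), (5.1, -0.3), (5.2, 0.8), (6.7, 1.1)]
sumw, sumw2 = bin_weighted(events, nbins=2, lo=4.0, hi=8.0)

for i, (w, w2) in enumerate(zip(sumw, sumw2)):
    print("bin %d: sumw = %.2f, sumw2 = %.2f" % (i, w, w2))
```

Because sumw2 is stored per bin, a later error correction can use the true sum of squared weights instead of the square of the bin content.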

ibelyaev · September 1, 2022, 6:17am

Dear Jonas,
Thank you very much.
Since I usually work with the LCG dev3 nightly slot, it is perfect for me.
