
RooNDKeysPdf integral over range

Hello,
I am trying to integrate a RooNDKeysPdf (2D) over a specified range in its two variables.
I noticed that as soon as I specify a range for the integration, with code like the attached example,
RooFit falls back to numerical integration, which poses a performance problem in my case.
However, I can see that an analytical integral exists, so I am wondering whether this is the expected behaviour.
Attached is an example script. Can anyone advise whether, and how, I can achieve analytic integration in this case?
Thanks in advance
Giulio

testRooKeysPdf.py (898 Bytes)

Digging more, I found that the behavior happens because of this:
https://root.cern.ch/doc/master/RooNDKeysPdf_8cxx_source.html#l01134
However, why is it like that? It seems that analyticalIntegral would be able to handle ranges.
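For context, my understanding is that for Gaussian kernels the integral over a range reduces to a sum of error-function differences, one per kernel, which is why I would expect ranges to be tractable analytically. A rough pure-Python sketch of that idea (my own toy with an assumed bandwidth h, not the actual RooNDKeysPdf code):

```python
import math
import random

def kernel_cdf(x, mu, h):
    """CDF of a single Gaussian kernel centered at mu with bandwidth h."""
    return 0.5 * (1.0 + math.erf((x - mu) / (h * math.sqrt(2.0))))

def kde_integral(points, h, lo, hi):
    """Analytic integral of a normalized Gaussian KDE over [lo, hi]:
    one erf difference per kernel, then an average over kernels."""
    return sum(kernel_cdf(hi, mu, h) - kernel_cdf(lo, mu, h)
               for mu in points) / len(points)

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Over (effectively) the full real line the integral is 1 by construction;
# a sub-range gives the probability mass inside that range.
full = kde_integral(data, h=0.3, lo=-50.0, hi=50.0)
sub = kde_integral(data, h=0.3, lo=-1.0, hi=1.0)
```

So mechanically a sub-range does not seem any harder than the full-domain case, at least for plain Gaussian kernels.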
Thanks
Giulio

Hi @lenzip,

Thanks for taking the time to investigate the origin of this behavior in the source code. I am inviting @jonas to this topic, as I am sure he will provide you with more detailed information.

Cheers,
J.

Hi @lenzip!

I’m looking into this. You are right, there is some code in RooNDKeysPdf that looks like it would support ranges, but if you enable that code it doesn’t give the correct result. That’s all I can say so far :slight_smile:

Hi @lenzip,

I won’t have time to continue looking at this problem tomorrow, but today I implemented the analytical integration for custom ranges to see how it behaves differently from the numerical integration:

However, for your use case you would not get a big speedup from analytical integration. Your range is really narrow, so the numerical integration is rather fast; it could definitely be worse. Actually, I’m more worried about the stability of the numerical integration when you have many data points in the density estimate.
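Roughly, the comparison looks like the following simplified stand-in (plain Gaussian kernels in pure Python with a made-up bandwidth, not the actual RooFit code): a midpoint-rule numerical integral costs one kernel evaluation per grid point per data point, while the analytic version needs only a single erf difference per data point.

```python
import math
import random

def kde_pdf(x, points, h):
    """Normalized Gaussian KDE evaluated at x."""
    norm = 1.0 / (len(points) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - mu) / h) ** 2) for mu in points)

def integral_numeric(points, h, lo, hi, nsteps=200):
    """Midpoint rule: nsteps * len(points) kernel evaluations."""
    dx = (hi - lo) / nsteps
    return sum(kde_pdf(lo + (i + 0.5) * dx, points, h)
               for i in range(nsteps)) * dx

def integral_analytic(points, h, lo, hi):
    """One erf difference per kernel: len(points) evaluations."""
    s2 = h * math.sqrt(2.0)
    return sum(0.5 * (math.erf((hi - mu) / s2) - math.erf((lo - mu) / s2))
               for mu in points) / len(points)

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
num = integral_numeric(data, 0.3, -1.0, 1.0)
ana = integral_analytic(data, 0.3, -1.0, 1.0)
```

The two agree closely for a smooth density like this; the difference shows up in cost and, potentially, in numerical stability.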

I have two questions for you if you don’t mind:

  1. You say the numerical integration causes performance issues for you. How much faster would the integration need to be for you to work with it? When I run the script you uploaded here, it only takes a few tens of milliseconds.
  2. What do you need these integrals for? While investigating the problem, I noticed that numerical integrals of a RooNDKeysPdf might have stability issues when you have many data points in the kernel density. Could this be a problem for you?

Oh, and one more thing: I would probably disable the rotate feature of RooNDKeysPdf in your use case, because x and y are completely independent.
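To illustrate why: with rotation disabled, the kernels are axis-aligned Gaussian products, so the integral over any rectangle factorizes into two 1D erf differences per kernel. A minimal pure-Python sketch of that factorization (toy bandwidths, not the RooNDKeysPdf implementation):

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kde2d_rect_integral(points, hx, hy, x1, x2, y1, y2):
    """Integral of a product-kernel 2D Gaussian KDE over [x1,x2] x [y1,y2].
    Without rotation, each kernel splits into a product of 1D masses."""
    total = 0.0
    for mx, my in points:
        ix = phi((x2 - mx) / hx) - phi((x1 - mx) / hx)
        iy = phi((y2 - my) / hy) - phi((y1 - my) / hy)
        total += ix * iy
    return total / len(points)

random.seed(7)
pts = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(400)]

# Over a rectangle that covers essentially all the mass, the integral is 1.
mass = kde2d_rect_integral(pts, 0.3, 0.3, -10.0, 10.0, -10.0, 10.0)
```

With rotation enabled, the kernels pick up correlations between x and y and this clean factorization is lost.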

Sorry for coming up with only an intermediate report rather than a final solution.

Cheers,
Jonas

Dear @jonas
Thank you so much for providing this, I will try it.
The script I provided was just a minimal toy example to show the fallback to numerical integration. In reality my use case is much more complicated. I have a sample of weighted data events that I use to estimate a particular background. I need the integral because I need to get a histogram out of the PDF, as the rest of the analysis uses histogram templates taken from high-statistics MC.

The performance issue comes from the statistical uncertainty estimate: I need to resample the data many times to construct different PDF replicas and get a histogram from each of them. The numerical integration was taking several seconds per bin, so producing just one histogram took several minutes.

In terms of points, I have between 5k and 10k points per PDF.
It may seem a bit dumb that I go through this burden to make a histogram, instead of doing a histogram from the start. I am tackling a subtle problem, where I believe my issue is that a naive histogram gives me an estimate of the underlying PDF that is too dependent on the binning. I want to mitigate the dependency on the binning, that’s why I want a PDF and then integrate it.
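A minimal sketch of what I mean by integrating the PDF per bin (pure Python with Gaussian kernels and a made-up bandwidth, not my actual code): each event contributes its kernel's probability mass inside the bin, so the bin contents depend on the smoothing rather than on where a point happens to fall relative to a bin edge.

```python
import math
import random

def kernel_mass(lo, hi, mu, h):
    """Probability mass of one Gaussian kernel inside [lo, hi]."""
    s2 = h * math.sqrt(2.0)
    return 0.5 * (math.erf((hi - mu) / s2) - math.erf((lo - mu) / s2))

def hist_from_kde(points, h, edges):
    """Expected bin contents: sum of per-kernel masses in each bin.
    Summed over all bins this approaches len(points)."""
    return [sum(kernel_mass(lo, hi, mu, h) for mu in points)
            for lo, hi in zip(edges[:-1], edges[1:])]

random.seed(3)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
edges = [-4.0 + 0.5 * i for i in range(17)]  # 16 bins on [-4, 4]
contents = hist_from_kde(data, h=0.3, edges=edges)
```

This is the 1D version of what I do per bin in 2D, which is why the per-bin integration cost matters so much.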
Thanks for your help
Giulio

Hi @lenzip,

thanks for explaining your problem in more detail!

Unfortunately I don’t completely understand it :frowning: You write:

I need the integral because I need to get a histogram out of the PDF, as the rest of the analysis uses histogram templates taken from high stat MC.

And then in the end:

It may seem a bit dumb that I go through this burden to make a histogram, instead of doing a histogram from the start. I am tackling a subtle problem, where I believe my issue is that a naive histogram gives me an estimate of the underlying PDF that is too dependent on the binning. I want to mitigate the dependency on the binning, that’s why I want a PDF and then integrate it.

Don’t you still have a binning effect when you make a histogram out of the PDF? How do you even get the histogram? Do you do something like:

RooDataSet → RooNDKeysPdf → RooDataHist

I feel like you could avoid using the RooNDKeysPdf altogether, but unfortunately I can’t give you good advice because I don’t understand exactly what’s going on. Maybe it would make things clearer if you shared your full code, if you don’t mind?

Cheers,
Jonas