Roofit - multiple fit different datasets using different PDFs with some shared parameters - ROOT version issue

Dear colleagues,
I’d like to follow up on the issue that I was discussing here: Roofit - multiple fit different datasets using different PDFs with some shared parameters - #10

Apparently, the error was that I was using a different naming scheme for the combined dataset and for the model (thanks @jonas for reporting this).

However, I now notice that if I download exactly the same code that is attached to the original topic (inside the archive code.tar), the command python3 ana_globalFit.py 1 no longer works.

  • ROOT version 6.26.04, Python 3.9.12 (LCG_102): YES
  • ROOT version 6.28.04, Python 3.9.12 (LCG_104): NO (error on RooDataHist constructor)
  • ROOT version 6.30.02, Python 3.9.12 (LCG_105): NO (error on RooDataHist constructor)
  • ROOT version 6.32.02, Python 3.11.9 (LCG_106): NO (error on RooDataHist constructor)

I also get an error: ERROR:InputArguments -- Layout or binning of histogram hcal0_data_1 is not consistent with first processed histogram hcal0_data_0. The two histograms do indeed have different binning, but this is expected.

May I ask you for a suggestion regarding this? It is clearly due to some ROOT change, since it was working with earlier versions.

The good news is that, after fixing the naming in my script, it now works as expected.

Thanks
Bests
Andrea

Dear @andrea.celentano,

Thanks for reaching out to the forum!

Let’s ask @jonas for help; it’s possible that some change in ROOT indeed altered the previous behaviour. For convenience, here is the link to your comment from the other post where you uploaded the reproducer: Roofit - multiple fit different datasets using different PDFs with some shared parameters - #6 by andrea.celentano.

Cheers,
Vincenzo

Hi @andrea.celentano!

So the error is valid: the idea of combining histograms with an index category is that you can only combine histograms that have the same axes, and the index category will be added as the only new axis. You have different observables for each histogram. So if you fill the values from histogram “sample_0” into the combined multidimensional histogram that has all variables as axes, what should the value for the remaining axes be? It’s undefined and therefore not encouraged, even though it didn’t matter in practice: in the fit, the projections of the histograms are extracted again, which resolves the problem. But your combined histogram, even though it is only an intermediate structure to build the NLL, is not well defined.

There is another reason why combining multidimensional histograms like this is discouraged: memory usage blows up fast. Let’s say you have 100 bins in all three dimensions: the combined histogram will have one million bins, which becomes a problem once you add more channels. For that reason, RooFit allows you to create combined RooDataSets from multiple histograms, which use a different data structure under the hood: it stores pointers to clones of the input histograms, and if you get values from it, it just gives you “random” values for the observables that are not relevant in a given sample. This is what HistFactory also does.
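The bin-count arithmetic above can be sketched in plain Python (no ROOT needed). This is only an illustration of the scaling, not of RooFit internals: the helper names are made up for this example, and the extra axis from the index category is ignored, as in the estimate above.

```python
from math import prod

def combined_bins(bins_per_observable):
    """Bins in one dense histogram that has every observable as an axis."""
    return prod(bins_per_observable)

def separate_bins(bins_per_observable):
    """Total bins when each channel keeps its own one-dimensional histogram."""
    return sum(bins_per_observable)

# Three channels, each with its own observable and 100 bins (numbers from the post)
bins = [100, 100, 100]
print(combined_bins(bins))  # 1000000
print(separate_bins(bins))  # 300

# Adding a fourth channel multiplies the dense size by another factor of 100
print(combined_bins(bins + [100]))  # 100000000
```

The dense size grows as a product over channels, while keeping per-channel histograms grows only as a sum, which is why the combined RooDataSet layout scales much better.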

So long story short, you should just use RooDataSet instead of RooDataHist for dataALL:

   dataALL=ROOT.RooDataSet(name+"_data",name+"_data",xALL,Index=sample,Import=dictData)

But even with that change, another error will hit you later because RooFit got more pedantic :smiley:

invalid_argument: Value 30.5 is outside the default range [40, 120] of the variable "hcal0_x_2"!

That’s because your MC histograms don’t have the same binning as the data, forcing a change of binning of the observables that can be seen earlier in the script output:

[#1] INFO:DataHandling -- RooDataHist::adjustBinning(hcal0_dh_0): fit range of variable hcal0_E_0 expanded to nearest bin boundaries: [0,500] --> [0,10]
[#1] INFO:DataHandling -- RooDataHist::adjustBinning(hcal0_dh_1): fit range of variable hcal0_E_1 expanded to nearest bin boundaries: [0,500] --> [0,20]
[#1] INFO:DataHandling -- RooDataHist::adjustBinning(hcal0_dh_2): fit range of variable hcal0_E_2 expanded to nearest bin boundaries: [0,500] --> [40,120]
[#1] INFO:DataHandling -- RooDataHist::adjustBinning(hcal0_dh_2): fit range of variable hcal0_x_2 expanded to nearest bin boundaries: [30,120] --> [40,120]

Don’t ignore this error! Since the creation of the MC histograms narrowed the range of the observables, all your data counts are clipped into the MC histogram range in the fit, which might have given you wrong results with the previous ROOT versions where the fit worked. In the past, the values were silently clipped without logging anything, which is dangerous for exactly the kind of reasons we see here.

Note: I checked your histograms, and the data counts are zero anyway in the bins that were not part of the MC histograms, so you seem to be safe even with the old ROOT versions. So just double check that and change the range of your RooRealVars from xmin=[0. ,0., 30.] to xmin=[0. ,0., 40.] and it will work. And as far as I checked, the results are identical with ROOT 6.26 or ROOT master.

So you have to update your code a bit to somehow have data and MC histograms with consistent binning. Then it should hopefully work and give you the right results :smiley:
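The check mentioned in the note above (that no data counts sit in bins outside the MC histogram range) can be sketched in plain Python. The function name, bin edges and counts below are hypothetical and only loosely mimic the hcal0_x_2 case, where the variable range was narrowed from [30,120] to [40,120]; in a real script the edges and counts would come from the actual histograms.

```python
def counts_outside_range(edges, counts, lo, hi):
    """Sum the counts of all bins that are not fully contained in [lo, hi]."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more bin edge than bin counts")
    outside = 0
    for left, right, n in zip(edges[:-1], edges[1:], counts):
        if left < lo or right > hi:
            outside += n
    return outside

# Hypothetical data histogram on [30, 120] with 10-unit bins; the first bin is empty
edges = [30 + 10 * i for i in range(10)]   # 30, 40, ..., 120
counts = [0, 5, 12, 20, 17, 9, 4, 2, 1]

# MC histogram range taken from the adjustBinning log for hcal0_x_2
lost = counts_outside_range(edges, counts, 40, 120)
print(lost)  # 0: nothing would be clipped in this made-up example
```

A non-zero result would mean that data would be dropped or clipped when the variable range is narrowed to the MC range, so the binning should be fixed before trusting the fit.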

Cheers,
Jonas

edit: some ROOT versions might not support the import from a map of RooDataHist to RooDataSet yet, but let’s maybe only try to find a fix for that if you’re actually using such a version (which ROOT version do you intend to use in the end?).

Dear @jonas, thanks for the very instructive reply!

I indeed understand why the use of histograms with different binning to create a global RooDataHist is not supported (both because technically the intermediate multi-axis object will be undefined for some regions, even if these are not relevant for the fit, and because of the memory consumption). I am currently using ROOT version 6.36.02.

I’d like to follow up on the second issue, the histogramming range. I take this opportunity to ask you to clarify my (probably imprecise) understanding of how this should work. Let’s assume that I have a dataset for an observable x, in the form of a single histogram with x ranging from 0 to 100. The dataset is handled as a RooDataHist in RooFit.

  • In the simpler case, I may want to fit a PDF that has an analytical expression (and some free parameters) to this RooDataHist. This case is trivial, provided that the analytical expression can be evaluated consistently in the range [0,100].

  • In a more elaborate scenario, let’s assume that I want to derive a PDF from another histogram (e.g. a MC prediction), and fit this to the data. For discussion, let’s say that the fit model is the sum of the MC prediction plus a background described by an analytical PDF, and there is a single free parameter, the relative ratio of the two contributions. To build the PDF from the MC prediction, I use a RooHistPdf. In my view the requirements here are that (i) the analytical PDF can be properly computed between 0 and 100, and (ii) the MC prediction histogram from which the RooHistPdf is built is defined at least in the range (0,100). It could also be defined for a larger range. The bin width is not required to be the same; clearly, the finer the binning of the MC prediction, the better in terms of RooHistPdf construction.

  • In the more sophisticated scenario that I typically consider, let’s say that the MC prediction is in terms of a different observable y, related to x by a linear scaling y=\alpha x, with \alpha being an unknown calibration parameter with a value close to 1. In this case, I start from a histogram of the y prediction in a range larger than that of x (here, I would use at least [-50,150] for y), and I create a RooDataHist from it. Then, I define a RooFormulaVar to relate y and x: RooFormulaVar vx("vx","x*alpha",RooArgList(x,alpha)). From this, I create a RooHistPdf using this constructor (ROOT: roofit/roofitcore/src/RooHistPdf.cxx Source File), in which vx is the pdfObs and y the histObs. In this case, it is clear that the MC histogram and the data histogram should have different ranges, with the MC range being larger to accommodate the case of \alpha not equal to one.
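To make the range requirement of the third scenario concrete, here is a small plain-Python sketch (no ROOT involved). It computes which y interval the MC histogram has to cover when y=\alpha x. The function names and the bounds on \alpha (0.75 to 1.25) are assumptions chosen only for illustration; they do not come from the reproducer.

```python
def required_hist_range(x_min, x_max, alpha_min, alpha_max):
    """Interval spanned by y = alpha * x for x in [x_min, x_max] and
    alpha in [alpha_min, alpha_max]. The product is linear in each
    argument, so its extrema are reached at the corners of the rectangle."""
    corners = [a * x for a in (alpha_min, alpha_max) for x in (x_min, x_max)]
    return min(corners), max(corners)

def covers(hist_range, needed_range):
    """True if hist_range fully contains needed_range."""
    return hist_range[0] <= needed_range[0] and hist_range[1] >= needed_range[1]

# Data observable x in [0, 100]; alpha allowed to float in [0.75, 1.25] (assumed)
needed = required_hist_range(0.0, 100.0, 0.75, 1.25)
print(needed)                           # (0.0, 125.0)
print(covers((-50.0, 150.0), needed))   # True: the [-50,150] MC range is wide enough
print(covers((0.0, 100.0), needed))     # False: same range as the data is too narrow
```

With these assumed bounds, an MC histogram defined only on the data range [0,100] would not cover all values that x*alpha can take during the fit, which is the reason for choosing a wider y range.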

I attach a small reproducer of the third scenario, just for illustration.
test.C (2.0 KB)

Any feedback is really welcome!
Thanks,
Andrea
