Dear experts,
I am seeing weird behavior when running my RDF code while creating weighted histograms. In our analysis we create weighted cutflows by booking a histogram after each Filter we apply, which looks a bit like this:
df = df.Define("cutbin", "0.5")
df = df.Filter("x > 10")
cutflowhists = []
cutflowhists.append(df.Histo1D(("weighted_cutflow1", weight_title, 1, 0.0, 1.0), "cutbin", "weight"))
df = df.Filter("y > 10")
cutflowhists.append(df.Histo1D(("weighted_cutflow2", weight_title, 1, 0.0, 1.0), "cutbin", "weight"))
...
Here weight is the final weight to be used: it includes basically all the event-by-event weights as well as some metadata about cross-sections/luminosity and so on.
In our code we run over multiple RDatasetSpecs, each with its own dataframe. Each RDatasetSpec contains multiple (sub)samples; think of it as a DY process (RDatasetSpec) containing a 2016 MC sample, a 2017 MC sample, etc.
We define RDataFrames for all of the RDatasetSpecs and run them all together via RDF.RunGraphs.
In the end we create the cutflow by looping over all the histograms in the cutflowhists list and taking the Integral() of each histogram. What we noticed is that for the first few cuts/hists the Integral is basically nonsense. I'll paste an example here:
Cut Input Pass Eff CumEff
__startcut__ 10794198204690424573386219210145792.0000 10794198204690424573386219210145792.0000 100.000% 100.000%
((bjet1.get_HadronConeExclTruthLabelID()==0 && bjet2.get_HadronConeExclTruthLabelID()==0)) 10794198204690424573386219210145792.0000 1090736866060343443482038239232.0000 0.010% 0.010%
n_analysis_leptons == 0 1090736866060343443482038239232.0000 1090736866060343443482038239232.0000 100.000% 0.010%
at least 2 central jets 1090736866060343443482038239232.0000 1090736866060343443482038239232.0000 100.000% 0.010%
at least 1 b-jet 1090736866060343443482038239232.0000 5041.5377 0.000% 0.000%
Trigger dependent cuts 5041.5377 4083.4054 80.995% 0.000%
mmc mass > 60 GeV 4083.4054 3972.9277 97.294% 0.000%
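For reference, the Eff/CumEff columns above are derived from the per-cut integrals roughly like this (a simplified, ROOT-free sketch with a hypothetical helper, just to show how the table is built from the Integral() values):

```python
def cutflow_rows(names, integrals):
    """Build (name, input, passed, eff, cumeff) rows from per-cut integrals.

    integrals[i] is the Integral() of the histogram booked after cut i;
    the first entry is the "__startcut__" histogram before any cut.
    """
    rows = []
    for i in range(1, len(integrals)):
        inp, passed = integrals[i - 1], integrals[i]
        eff = passed / inp if inp else 0.0          # per-cut efficiency
        cumeff = passed / integrals[0] if integrals[0] else 0.0  # vs. start
        rows.append((names[i], inp, passed, eff, cumeff))
    return rows
```

So a bogus Integral() in the first histogram corrupts both the first Input column and every CumEff value downstream.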
As you can see, the Integral for the first few cuts is an absurdly large number, and then it suddenly starts to make sense. I have tried many different things but I can't get it fixed. A few things I noticed:
- It's rather random, at least to me, whether this happens for a given dataset: some RDatasetSpecs show a completely normal cutflow, while others are weird.
- For some of them the cutflow is normal if I run only one RDatasetSpec at a time, while for others the problem persists even when I run that RDatasetSpec on its own.
- I have tried running with ROOT 6.32.08 as well as 6.36.02, but both show the problem.
- If I use a constant weight of one instead, everything works normally.
- The entries from the histogram via GetEntries() are compatible with the entries from the cutflow report RDF produces automatically (or at least for the few I looked at).
I suspect something weird is going on with the memory backing the weight column, but I have no idea what it is. Any help here is highly appreciated.
Thank you so much for your help.
Cheers, Jordy
ROOT setup info:
I use views via cvmfs: lsetup "views LCG_106b x86_64-el9-gcc11-opt"
or lsetup "views LCG_108 x86_64-el9-gcc13-opt"