Question on RDataFrame Vary functionality (bootstrapping samples and weights)

RENATO_QUAGLIANI · August 1, 2022, 7:58pm

Dear experts,

I am trying to upgrade myself and use the latest and greatest RDataFrame functionality, in particular the Vary function.

Let's say I want to obtain 100 varied 1D histograms from 100 variations of a weight.

Namely, I have an RVec column filled with random Poissonian weights representing 100 bootstrapping slices of my data, and a baseline weight attached from an existing map.

In practice, I have a

node= node.Define( "weight", "baseweight * rndpoisson")

so that weight is an RVec<double>.

What I want to achieve, without having to define 100 new columns (one per weight[i]), are 100 histograms of a given variable, where each histogram is obtained with

Hist = node.Histo1D( (....), "Myvar", "weight[i]")

If I understood correctly, Vary and VariationsFor have been designed exactly to achieve this goal, but I failed to understand the example in the Doxygen documentation. In the past I created a custom Book function filling a 2D histogram with 100 bins in y, but I feel I should rather use Vary.
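For reference, the goal described above can be sketched in plain Python without ROOT. This is only a model of the bookkeeping (one histogram per bootstrap replica, each filled with weight[i]), not how RDataFrame implements Vary. The names poisson, fill, and N_REPLICAS are hypothetical, the Gaussian mass is a stand-in for "Myvar", and the Poisson draw uses Knuth's multiplication method because the standard library has no Poisson sampler.

```python
import math
import random

N_REPLICAS = 100  # number of bootstrap replicas, as in the post

def poisson(rng, lam=1.0):
    """Draw one Poisson(lam) integer with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def fill(hist, x, w, lo, hi):
    """Add weight w to the bin containing x; values outside [lo, hi) are ignored."""
    if lo <= x < hi:
        index = min(int((x - lo) / (hi - lo) * len(hist)), len(hist) - 1)
        hist[index] += w

rng = random.Random(1234)
n_bins, lo, hi = 100, 5100.0, 5450.0
nominal = [0.0] * n_bins
replicas = [[0.0] * n_bins for _ in range(N_REPLICAS)]

for _ in range(10000):
    mass = rng.gauss(5270.0, 40.0)  # stand-in for "Myvar"
    base_weight = 1.0               # stand-in for "baseweight"
    # One Poisson(1) factor per replica: this is the per-event weight vector.
    weights = [base_weight * poisson(rng) for _ in range(N_REPLICAS)]
    fill(nominal, mass, base_weight, lo, hi)
    for hist, w in zip(replicas, weights):
        fill(hist, mass, w, lo, hi)  # replica i is filled with weights[i]
```

The spread of each bin across the 100 replicas is then the bootstrap estimate of that bin's statistical uncertainty.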

Thanks in advance for any help and tips,
Renato

ROOT Version: 6.26

RENATO_QUAGLIANI · August 2, 2022, 7:24am


import ROOT as r

# Helper returning 100 Poisson(1) bootstrap weights for one entry.
r.gInterpreter.Declare("""
ROOT::VecOps::RVec<double> AddBootstrapping(const int &_rdfentry, const int &_rdfslot) {
    TRandom3 rnd;
    // Caveat: rdfslot_ is 0 in single-threaded runs, so this seed is 0,
    // and TRandom3::SetSeed(0) picks a random, non-reproducible seed.
    rnd.SetSeed(_rdfentry * _rdfslot);
    ROOT::VecOps::RVec<double> rndPoisson;
    rndPoisson.reserve(100);
    for (int i = 0; i < 100; ++i) {
        rndPoisson.push_back((double)rnd.PoissonD(1));
    }
    return rndPoisson;
}
""")

# Helper returning a toy Gaussian mass value for one entry (same seed caveat).
r.gInterpreter.Declare("""
double randommass(const int &_rdfentry, const int &_rdfslot) {
    TRandom3 rnd;
    rnd.SetSeed(_rdfentry * _rdfslot);
    return rnd.Gaus(5270, 40);
}
""")

# Toy dataset: 10000 empty entries, with columns defined on the fly.
df = r.RDataFrame(10000)
node = df.Define("RndPoisson", "AddBootstrapping(rdfentry_,rdfslot_)")
node = node.Define("massB", "randommass(rdfentry_,rdfslot_)")
node = node.Define("weight", "1.")

# Register three variations of "weight", then book the nominal histogram.
nominal = node.Vary("weight", "ROOT::RVecD{weight, RndPoisson[0] * weight , RndPoisson[1] * weight}", ["nominal","bs0","bs1"])\
              .Histo1D(r.RDF.TH1DModel("","",100,5100,5450), "massB", "weight")

# Request the varied histograms; this returns a map keyed by variation name.
histsVaried = r.RDF.Experimental.VariationsFor(nominal)

cc = r.TCanvas()
print(histsVaried.GetKeys())
histsVaried["nominal"].Draw()
histsVaried["weight:bs0"].SetLineColor(r.kRed)
histsVaried["weight:bs0"].Draw("same HIST")
histsVaried["weight:bs1"].SetLineColor(r.kCyan)
histsVaried["weight:bs1"].Draw("same HIST")
cc.Draw()

I guess I should do something like this, right?

RENATO_QUAGLIANI · August 2, 2022, 7:29am

Actually, the code is smart enough to understand directly that I have 100 entries in an RVec, so the Vary expression can return the whole RVec at once:

node = df.Define(   "RndPoisson", "AddBootstrapping(rdfentry_,rdfslot_)")
node = node.Define( "massB",    "randommass(rdfentry_,rdfslot_)")
node = node.Define( "weight",    "1.")
nominal     = node.Vary( "weight", "RndPoisson * weight", [f"bs{i}" for i in range(100)])\
                  .Histo1D( r.RDF.TH1DModel("","",100,5100,5450), "massB", "weight")

This also works.
Amazing stuff!

etejedor · August 2, 2022, 12:41pm

@eguiraud can perhaps comment on this, thanks!

eguiraud · August 2, 2022, 4:11pm

Hi @RENATO_QUAGLIANI ,

yep that’s it, you need to have the nominal value already in a column and then return an RVec of varied values from the Vary expression.

Let us know if you have any more questions or if you think anything specific in the docs should be expanded/clarified – we’ll be adding a Vary tutorial soon.

Cheers,
Enrico

RENATO_QUAGLIANI · August 2, 2022, 7:06pm

The only bit that confused me is the role of the map's “nominal” entry.
In practice, is it a pointer to the main result obtained from the Vary call?
In other words, given

nominal     = node.Vary( "weight", "ROOT::RVecD{weight, RndPoisson[0] * weight , RndPoisson[1] * weight}", ["nominal","bs0","bs1"])\
                  .Histo1D( r.RDF.TH1DModel("","",100,5100,5450), "massB", "weight")
histsVaried = r.RDF.Experimental.VariationsFor( nominal)

What is histsVaried["nominal"] pointing to?
How is the “nominal” key generated? Is it based on the name of the variable holding the result of the first line?

eguiraud · August 2, 2022, 7:11pm

The “nominal” key just points to the same histogram contained in your nominal variable, it’s there as a usability thing.
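The key layout printed earlier in the thread ("nominal", "weight:bs0", "weight:bs1") can be modelled in a few lines of plain Python. This is only an illustration of the naming and of the "same histogram" statement above; variation_keys is a hypothetical helper, not a ROOT function, and plain lists stand in for histograms.

```python
def variation_keys(column, tags):
    """Key layout seen in the thread: 'nominal', then '<column>:<tag>' per tag."""
    return ["nominal"] + [f"{column}:{tag}" for tag in tags]

# Plain lists stand in for histograms here.
nominal_hist = [0.0] * 100
tags = [f"bs{i}" for i in range(3)]
results = {key: [0.0] * 100 for key in variation_keys("weight", tags)}

# The "nominal" entry refers to the nominal result itself, not to a variation.
results["nominal"] = nominal_hist

print(list(results))
# ['nominal', 'weight:bs0', 'weight:bs1', 'weight:bs2']
```

In this model the "nominal" key does not depend on the name of any Python variable; it is a fixed label for the unvaried result.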

RENATO_QUAGLIANI · August 3, 2022, 9:26am

Hi @eguiraud, thanks again.
Let’s say I now want to combine two bootstrapped columns.
In my node I have defined

weight     (scalar)
weight_BS  (RVec<double>, 100 values)
RndPoisson (RVec<double>, 100 values)

Does Vary expect the variation expression to reuse the “scalar” input?
I.e., does

nominal = node.Vary("weight",  "weight_BS * RndPoisson", [ f"BS_{i}" for i in range(100) ] )\
                                 .Histo1D( r.RDF.TH1DModel("","",100,5100,5450), "massB", "weight")
histsVaried = r.RDF.Experimental.VariationsFor( nominal)

perform a replacement of weight with weight_BS[i] * RndPoisson[i] for each map value?
In other words, must the second argument of Vary reuse the baseline column named in the first argument, or can it be anything one wants, as long as it is already defined in the node?

Thanks
Renato

eguiraud · August 3, 2022, 4:57pm

Yes – you can verify this with a small reproducer with dummy values.

The Vary expression (in this case "weight_BS * RndPoisson") can be any arbitrary code, as it is for Defines and Filters.
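As a rough model of that answer, the plain-Python sketch below mimics the replacement with dummy values and no ROOT. It assumes, as confirmed above, that variation i uses element i of the expression's result in place of the nominal weight, and that multiplying two RVecs is element-wise. The events list, the varied_weights helper, and the sum-of-weights "action" are hypothetical stand-ins chosen for illustration.

```python
# Two dummy events, each with a scalar nominal weight and two 3-element vectors.
events = [
    {"weight": 1.0, "weight_BS": [0.5, 2.0, 1.0], "RndPoisson": [1.0, 0.0, 2.0]},
    {"weight": 1.0, "weight_BS": [1.5, 1.0, 1.0], "RndPoisson": [2.0, 1.0, 0.0]},
]
tags = ["BS_0", "BS_1", "BS_2"]

def varied_weights(event):
    """Element-wise product, standing in for "weight_BS * RndPoisson"."""
    return [a * b for a, b in zip(event["weight_BS"], event["RndPoisson"])]

# A sum of weights plays the role of the booked action (Histo1D in the thread).
sums = {"nominal": sum(e["weight"] for e in events)}
for i, tag in enumerate(tags):
    # Variation i ignores the nominal "weight" and uses element i instead.
    sums[f"weight:{tag}"] = sum(varied_weights(e)[i] for e in events)

print(sums)
# {'nominal': 2.0, 'weight:BS_0': 3.5, 'weight:BS_1': 1.0, 'weight:BS_2': 2.0}
```

Note that the expression never mentions the nominal "weight" column, yet each variation still replaces it.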

Cheers,
Enrico

system · August 17, 2022, 4:58pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.