Python Interfacing and memory leaks when looping over different files

ROOT version: 6.32.02
Python version: 3.12.4

Dear experts,

I am writing a python code that loops over several ROOT files. In each loop, the root file is loaded into a RDataFrame, then the data is fitted according to some model, and the sPlot method is applied to retrieve the weights.
Now, the goal is to write out the sWeights in a new file (with all other branches from the initial ROOT file) using RDataFrame.Define(), then Snapshot(). The reason I’m using RDataFrame is to take advantage of the “easy” interface + speed that C++ offers, instead of converting to e.g. a pandas dataframe, adding the weights, and converting back to RDataFrame.

To write out the weights, I define a C++ lambda function using ROOT.gInterpreter.ProcessLine(), as passing a python callable is not supported yet in RDataFrame.Define():

ROOT.gInterpreter.ProcessLine("""
    auto getSignalSWeight = [](ULong64_t entry_number) -> Double_t {
        return global_sData->GetSWeight(entry_number, "nsig");
    };
    """)

In order for the C++ code to work, I declare the pointer to the RooStats::SPlot instance and give it the adress of the RooStats.SPlot object I defined through python:

sData = RooStats.SPlot("sData", "sPlot", data, model, RooArgList(nsig, nbkg))
# Declare `sData` globally for the JIT compiler
ROOT.gInterpreter.ProcessLine(f"RooStats::SPlot* global_sData = nullptr;")
ROOT.gInterpreter.ProcessLine(f"global_sData = (RooStats::SPlot*){ROOT.addressof(sData)};")

Now, this works well when I run the code through one root file, but when I start looping on different files, the pipeline fails at the 2nd iteration. I understand it must have something to do with the RooStats::SPlot* pointer, but even deleting the pointer manually or the sData instance causes a segfault.

What is happening there? How can I fix this issue without having to modify the workflow? (eg, not having to run the file X times instead of looping over X files). Also, is this horrible practice? I’m trying to make the most of the python interfacing and RDataFrame() but I don’t if this is good or not.

Here’s a minimal example to reproduce the issue:

import ROOT
from ROOT import RooFit, RooRealVar, RooGaussian, RooAddPdf, RooStats, RooDataSet, RooArgList, RooArgSet

def fit(index):

    x = RooRealVar("x", "Observable", -10, 10)

    # Signal 
    mean_sig = RooRealVar("mean_sig", "Mean of Gaussian", 1, -10, 10)
    sigma_sig = RooRealVar("sigma_sig", "Width of Gaussian", 1, 0.1, 5)
    gauss_sig = RooGaussian("gauss_sig", "Signal Gaussian", x, mean_sig, sigma_sig)

    # Background 
    slope_bkg = RooRealVar("slope_bkg", "Slope of Exponential", -0.1, -5, 0)
    expo_bkg = ROOT.RooExponential("expo_bkg", "Background Exponential", x, slope_bkg)

    nsig = RooRealVar("nsig", "Number of Signal Events", 500, 0, 1000)
    nbkg = RooRealVar("nbkg", "Number of Background Events", 500, 0, 1000)
    model = RooAddPdf("model", "Signal + Background", RooArgList(gauss_sig, expo_bkg), RooArgList(nsig, nbkg))

    # Generate a toy dataset
    data = model.generate(RooArgSet(x), 1000)

    # Fit the model to the data
    model.fitTo(data, RooFit.PrintLevel(-1))  # Suppress fit output for cleaner display

    # sPlot to get sWeights
    sData = RooStats.SPlot("sData", "sPlot", data, model, RooArgList(nsig, nbkg))
    for i in range(10):  # Print first 10 weights
        sweight_signal = sData.GetSWeight(i, "nsig")
        sweight_background = sData.GetSWeight(i, "nbkg")
        print(f"Entry {i}: Signal sWeight = {sweight_signal}, Background sWeight = {sweight_background}")

    # Up to now this is pretty standard RooFit code


    # Create a new ROOT file to save results
    tree = data.GetClonedTree()
    rdf = ROOT.RDataFrame(tree)

    # Declare `sData` globally for the JIT compiler
    ROOT.gInterpreter.ProcessLine(f"RooStats::SPlot* global_sData = nullptr;")
    ROOT.gInterpreter.ProcessLine(f"global_sData = (RooStats::SPlot*){ROOT.addressof(sData)};")

    # Define lambda functions to retrieve sWeights
    ROOT.gInterpreter.ProcessLine("""
    auto getSignalSWeight = [](ULong64_t entry_number) -> Double_t {
        return global_sData->GetSWeight(entry_number, "nsig");
    };
    """)

    # Apply the lambda functions in RDataFrame to add sWeights as columns
    df_with_weights = rdf.Define("sWeight_signal", "getSignalSWeight(rdfentry_)", ["rdfentry_"])

    df_with_weights.Snapshot("tree_with_sWeights", f"file_{index}.root")



def main():
    for i in range (3):
        print(i)
        fit(i) # segfault at 2nd iteration after Define()
    
if __name__ == "__main__":
    main()

Hi @leadreyfus,

sorry to hear you’re experiencing the issues described and thank you for the reproducer - I will start investigating the issue and will get back to you soon.

Cheers,
Marta

Hi @leadreyfus,

In order to save your sweight_signal or sweight_background, you could create numpy arrays for each and use a function to add those arrays to your dataset (you need to declare this function to the interpreter first). Modifying your example this would be:

import numpy
import ROOT
from ROOT import RooFit, RooRealVar, RooGaussian, RooAddPdf, RooStats, RooDataSet, RooArgList, RooArgSet

ROOT.gInterpreter.Declare('''
        ROOT::RDF::RNode AddArray(ROOT::RDF::RNode df, ROOT::RVec<double> &v, const std::string &name) {
        return df.Define(name, [&](ULong64_t e) { return v[e]; }, {"rdfentry_"});
        }
    ''')

def fit(index):

    x = RooRealVar("x", "Observable", -10, 10)

    # Signal 
    mean_sig = RooRealVar("mean_sig", "Mean of Gaussian", 1, -10, 10)
    sigma_sig = RooRealVar("sigma_sig", "Width of Gaussian", 1, 0.1, 5)
    gauss_sig = RooGaussian("gauss_sig", "Signal Gaussian", x, mean_sig, sigma_sig)

    # Background 
    slope_bkg = RooRealVar("slope_bkg", "Slope of Exponential", -0.1, -5, 0)
    expo_bkg = ROOT.RooExponential("expo_bkg", "Background Exponential", x, slope_bkg)

    nsig = RooRealVar("nsig", "Number of Signal Events", 500, 0, 1000)
    nbkg = RooRealVar("nbkg", "Number of Background Events", 500, 0, 1000)
    model = RooAddPdf("model", "Signal + Background", RooArgList(gauss_sig, expo_bkg), RooArgList(nsig, nbkg))

    # Generate a toy dataset
    data = model.generate(RooArgSet(x), 1000)

    # Fit the model to the data
    model.fitTo(data, RooFit.PrintLevel(-1))  # Suppress fit output for cleaner display

    # sPlot to get sWeights
    sData = RooStats.SPlot("sData", "sPlot", data, model, RooArgList(nsig, nbkg))
    
    signalsweight = []
    bcksweight = []
    
    for i in range(1000):
        sweight_signal = sData.GetSWeight(i, "nsig")
        sweight_background = sData.GetSWeight(i, "nbkg")
        signalsweight.append(sweight_signal)
        bcksweight.append(sweight_background)
        
    tree = data.GetClonedTree()
    rdf = ROOT.RDataFrame(tree)
    arr_as_rvec = ROOT.VecOps.AsRVec(numpy.array(signalsweight))
    rdf = ROOT.AddArray(ROOT.RDF.AsRNode(rdf), arr_as_rvec, "sWeights")
                
    rdf.Snapshot("tree_with_sWeights", f"file_{index}.root")
    
def main():
    for i in range (3):
        print(i)
        fit(i)
    
if __name__ == "__main__":
    main()

Cheers,
Marta

1 Like

Hi Marta!
Thanks for getting back to me.
I implemented your solution (with minor changes, ie, I realised that RooDataHist data contains the sweight after applying RooStats::sPlot, and I can just take directly the numpy array using RooDataHist.to_numpy()), and this workaround indeed works for my use, thanks!

Still, do you have any idea as to why the memory leak is happening in the first place? Funnily enough, to save the output of the RooFit, I use the same trick of declaring a pointer with the gInterpreter and manually assigning it the adress of an object I created in the pyROOT part, and this does not cause any memory issue.

# This codes is looped over with different RooFitResults during the same execution of the python script

# Saving the fit result in a txt file:
    ROOT.gInterpreter.ProcessLineSynch(f"RooFitResult* res = nullptr;")
    ROOT.gInterpreter.ProcessLineSynch(f"res = (RooFitResult*){ROOT.addressof(result)};")
    ROOT.gInterpreter.ProcessLine(f'''
        std::ofstream out;
        out.open("{path_to_save}/{cfg.channel.name}.txt");
        res->printMultiline(out, 1, true);
        out.close();
    ''')

Anyways, thanks a lot again!
Cheers,
Léa

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.