RooFit NumCPU in PyROOT

Hi,
I’m using PyROOT to do fits in RooFit with NumCPU set to use multiple cores. It appears that the parallel processes used for the minimization don’t get killed after removing the NLL when I create the NLL manually. The automated fitting procedure using .fitTo(...) works fine. Is there a mistake in my code?
Short reproducer:

import ROOT
def make_nll():
    w = ROOT.RooWorkspace('w')
    w.factory('Gaussian::test(x[0,100],mean[50,10,90],sig[10,5,20])')
    data = w['test'].generate(w['x'],1000)
    nll = w['test'].createNLL(data,NumCPU=(6,2))
    return nll
nll = make_nll()
del nll

Longer story:
I’m doing a simultaneous fit with a binned and an unbinned data set. Searching previous posts, I found a suggestion to create the NLLs individually and combine them with RooAddition. I’m also trying to do MC studies, but I cannot use the combined NLL with RooMCStudy, so I wanted to write the toy loop manually, which is how I ran into this issue. I can switch to the C++ interface if that’s the suggestion.
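To make the structure concrete, here is a minimal sketch of what I mean, with two toy Gaussian channels standing in for my real binned and unbinned components (pdf_a, pdf_b and the generated data sets are just placeholders):

import ROOT

w = ROOT.RooWorkspace('w')
w.factory('Gaussian::pdf_a(x[0,100],mean[50,10,90],sig[10,5,20])')
w.factory('Gaussian::pdf_b(y[0,100],mean,sig)')

# Toy stand-ins: in my real setup one channel is binned, the other unbinned
data_a = w['pdf_a'].generate(w['x'], 1000)
data_b = w['pdf_b'].generateBinned(w['y'], 1000)

# Build per-channel NLLs and combine them with RooAddition
nll_a = w['pdf_a'].createNLL(data_a)
nll_b = w['pdf_b'].createNLL(data_b)
total_nll = ROOT.RooAddition('total_nll', 'total_nll', ROOT.RooArgList(nll_a, nll_b))

# Minimize the combined NLL manually
m = ROOT.RooMinimizer(total_nll)
m.migrad()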

ROOT version: 6.30/00
Best,
Ata

Hi @atasattari,
thanks for reaching out!
I’m not sure I have fully understood your problem: you execute your code snippet, say $ python script.py, and right after that you observe those processes hanging?

Cheers,
Monica

If I call createNLL in a loop, the number of processes keeps increasing until I have to restart the computer. The attached notebook reproduces the same behavior.
Untitled1.ipynb (1.7 KB)

Best,
Ata

Hello! The problem is that many functions in RooFit return pointers that the user needs to call delete on, and PyROOT can’t do this automatically for you (there is no way it can guess whether it needs to call delete or not).

There is more on this problem in this GitHub issue. One day, all these functions will return std::unique_ptr and the leaks will be gone, but until then you need to claim ownership of the relevant objects explicitly with ROOT.SetOwnership() so that they get deleted.

For example, here is the leak-free version of the code in your notebook:

import ROOT

w = ROOT.RooWorkspace('w')
w.factory('Gaussian::test(x[0,100],mean[50,10,90],sig[10,5,20])')

def reset(w):
    w['mean'].setVal(50)
    w['mean'].setError(0)
    w['sig'].setVal(10)
    w['sig'].setError(0)

def make_nll(w):
    reset(w)
    data = w['test'].generate(w['x'],1000)
    nll = w['test'].createNLL(data,NumCPU=(6,2))
    nll._data = data # Make sure that data lives as long as NLL
    ROOT.SetOwnership(data, True)
    ROOT.SetOwnership(nll, True)
    return nll

def fit(nll):
    m = ROOT.RooMinimizer(nll)
    m.migrad()
    out = m.save()
    ROOT.SetOwnership(out, True)
    return out

for i in range(100):
    # With the explicit ownership transfer above, the NLL and its worker
    # processes are cleaned up on every iteration instead of piling up.
    nll = make_nll(w)
    result = fit(nll)

Does that solve the problem for you?

Note, by the way, that del nll alone doesn’t help here: it only deletes the Python reference, and since PyROOT doesn’t know that it should also delete the underlying C++ object, the leak is still there.
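For illustration, a minimal sketch using the make_nll from the example above (which already transfers ownership with ROOT.SetOwnership): once PyROOT owns the object, dropping the last Python reference also destroys the C++ object and with it the NumCPU worker processes.

nll = make_nll(w)  # make_nll calls ROOT.SetOwnership(nll, True) internally
result = fit(nll)
del nll            # PyROOT owns the object, so the underlying C++ NLL is deleted as well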

Cheers,
Jonas

@jonas, thanks, that makes sense now. Yes, it fixed the issue.

One follow-up question. My likelihood is a simultaneous fit over 6 (or 8) different data sets. For the MC studies I repeat the fit many times, applying Poisson weights to the original events, and I run these fits on a cluster. For better performance, would it make sense to turn off the NLL parallelization and instead submit single-core job arrays?

Best,
Ata

Yes, it makes sense to parallelize over jobs if you can, and avoid using the NumCPU() parallelization. The latter has quite a large overhead because of the inter-process communication.
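For instance, here is a sketch of a single-core toy-fit script that could be submitted as a job array (the TOY_SEED environment variable and the in-script toy workspace are only placeholders for whatever your cluster and model setup provide):

import os
import ROOT

# One seed per job, e.g. taken from the job-array index
seed = int(os.environ.get('TOY_SEED', '1'))
ROOT.RooRandom.randomGenerator().SetSeed(seed)

w = ROOT.RooWorkspace('w')
w.factory('Gaussian::test(x[0,100],mean[50,10,90],sig[10,5,20])')

data = w['test'].generate(w['x'], 1000)
nll = w['test'].createNLL(data)  # no NumCPU: one core per job
ROOT.SetOwnership(data, True)
ROOT.SetOwnership(nll, True)

m = ROOT.RooMinimizer(nll)
m.migrad()
result = m.save()
ROOT.SetOwnership(result, True)
result.Print()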

By the way, have you tried EvalBackend("cpu"), i.e. the new optimized single-threaded likelihood evaluation backend? See also the docs of RooAbsPdf::createNLL().
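In PyROOT that would look roughly like this (a sketch, reusing the data and workspace from the earlier example; the keyword form relies on the usual RooFit keyword-argument pythonization):

nll = w['test'].createNLL(data, EvalBackend="cpu")
# or, equivalently, with the explicit command argument:
# nll = w['test'].createNLL(data, ROOT.RooFit.EvalBackend("cpu"))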

Cheers,
Jonas
