Memory leak with RDataFrame in Python

rooter_03 · January 12, 2021, 9:54am

Hi,

I am having the same problem as in:

there we are told that this:

https://sft.its.cern.ch/jira/browse/ROOT-9438

suggests a way around the problem. However, this ticket was opened 2.5 years ago and is still open, and it does not really tell me what to do to avoid this leak; it seems to be just people discussing how good it would be to improve the code. Could you please tell us what exactly to do? Like what function to call, what lines to include, etc.

Thanks.


ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

eguiraud · January 12, 2021, 10:02am

Hi,
just to clarify, there is no memory leak in RDF, there is some memory hogging on the part of the interpreter: whenever C++ code is just-in-time compiled, the generated binary occupies some RAM until the process exits.

If that’s the case in your application, a solution might be to run each event loop in a separate subprocess using Python’s multiprocessing module or similar. Alternatively, just change the order of operations so that you first build all computation graphs and only afterwards trigger the first event loop, and therefore the jitting. This way all code is generated once, in one go, which is faster and occupies less RAM than doing it many separate times, once per computation graph.

A minimal reproducer of your particular issue would be useful.

ROOT-9438 remains a feature that would be nice to have, but so far it has been lower in priority than other things. It might happen in the coming months, as TMVA has a use case that would require something like that.

Cheers,
Enrico

EDIT: you did not provide a ROOT version, but if it’s <6.22 I would also suggest you switch to ROOT v6.22 in which we introduced some optimizations when it comes to RDataFrame and just-in-time compiled code – they might or might not impact your application, I would need a minimal repro to be sure.

rooter_03 · January 12, 2021, 10:11am

Hi,

OK, so I can either use multiprocessing or something else related to “jitting”, which I do not really understand and do not have the time to look into, given that I am already pretty busy with my job. Regarding the reproducer, I could provide it, but it would take me too much time to extract the exact lines needed. However, the relevant lines look pretty close to what you can find in:

i.e.:

import ROOT
import os
import time

# @profile is injected by the profiler at run time (e.g. python -m memory_profiler script.py)
@profile
def loopDataFrame(treeName, file, cuts):
	print("Processing %s"%file)
	print("Processing %s"%cuts)
	df = ROOT.ROOT.RDataFrame(treeName, file)
	# apply every cut currently stored in the dictionary
	for cutName, cutDef in cuts.items():
		df = df.Filter(cutDef)
	model = ROOT.RDF.TH1DModel("lep_0_p4.Pt()", ";p_{T} (lep_{0}) GeV;", 100, 0., 100.)
	myHisto = df.Define("myP4", "lep_0_p4.Pt()").Histo1D(model, "myP4")
	return myHisto

@profile
def main():
	file = 'merged.root'
	treeName = 'NOMINAL'

	ROOT.ROOT.EnableImplicitMT()

	cuts = {}
	for i in range(0,100):
		cuts['lePt%s'%i] = 'lep_0_p4.Pt()>%s'%i
		hist = loopDataFrame( treeName, file, cuts )
		hist.Draw()
		time.sleep(1)

	input("Press Enter to continue...")

if __name__ == '__main__': main()
#EOF

Could you please add your second solution, and ideally also the first one to this script? I think that would be pretty useful to anyone seeing this in the future.

Cheers.

eguiraud · January 14, 2021, 11:14am

Hi,
that script is a bit weird because as it loops it keeps adding more cuts, while I think the original intention was to use a different cut per iteration…?
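To make that observation concrete, here is a small sketch that replays only the dictionary bookkeeping of the script, without ROOT. The function name `cuts_per_iteration` is made up for illustration; the assumption is simply that `loopDataFrame` calls `Filter` once per dictionary entry, as in the script above.

```python
def cuts_per_iteration(n_iterations):
    """Replay the loop of the script and record how many cuts each call receives."""
    cuts = {}
    counts = []
    for i in range(n_iterations):
        # same bookkeeping as in main(): one more entry per iteration, none removed
        cuts['lePt%s' % i] = 'lep_0_p4.Pt()>%s' % i
        # loopDataFrame would call Filter once for each entry in the dictionary
        counts.append(len(cuts))
    return counts


counts = cuts_per_iteration(100)
total_filters = sum(counts)
print(counts[0], counts[-1], total_filters)
```

So the first dataframe gets 1 filter, the last one gets 100, and over the whole loop 5050 string filters are booked, each of them an expression that RDataFrame may need to just-in-time compile.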

Anyway with ROOT v6.22 this should help: instead of

	cuts = {}
	for i in range(0,100):
		cuts['lePt%s'%i] = 'lep_0_p4.Pt()>%s'%i
		hist = loopDataFrame( treeName, file, cuts )
		hist.Draw()

book all computations first, use the results second:

	cuts = {}
	histos = []
	for i in range(0,100):
		cuts['lePt%s'%i] = 'lep_0_p4.Pt()>%s'%i
		hist = loopDataFrame( treeName, file, cuts )
		histos.append(hist)
	for h in histos:
		h.GetValue() # or Draw, or whatever

For the multiprocessing solution, the idea is to run each loopDataFrame invocation in a different subprocess using e.g. a process pool, so that when the processing of a dataframe is done the related worker process is killed and the memory allocated by the interpreter is freed with it.
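A minimal sketch of that idea, under a few assumptions: the helper names `run_isolated` and `histogram_contents` are hypothetical, the "fork" start method is assumed to be available (Linux/macOS), and the worker returns plain bin contents rather than a ROOT object to keep the transfer between processes simple. ROOT is imported only inside the worker, so the parent process never loads the interpreter.

```python
import multiprocessing


def run_isolated(func, *args):
    """Run func(*args) in a short-lived worker process and return its result.

    A fresh one-worker pool is created per call and torn down afterwards, so
    whatever memory the worker accumulated is released when it exits.
    """
    # Assumption: the "fork" start method exists on this platform.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=1) as pool:
        return pool.apply(func, args)


def histogram_contents(tree_name, file_name, cut):
    """Hypothetical worker: fill one histogram and return its bin contents."""
    import ROOT  # imported here so that only the worker process loads ROOT

    df = ROOT.RDataFrame(tree_name, file_name).Filter(cut)
    model = ROOT.RDF.TH1DModel("lep_0_p4.Pt()", ";p_{T} (lep_{0}) GeV;", 100, 0., 100.)
    hist = df.Define("myP4", "lep_0_p4.Pt()").Histo1D(model, "myP4")
    # Accessing the histogram triggers the event loop inside the worker.
    return [hist.GetBinContent(i) for i in range(1, hist.GetNbinsX() + 1)]


# Possible usage, with the file and tree names from the script above:
#   for i in range(100):
#       contents = run_isolated(histogram_contents, 'NOMINAL', 'merged.root',
#                               'lep_0_p4.Pt()>%s' % i)
```

Starting a process per dataframe has a cost of its own (ROOT is loaded again each time), so this trades some speed for a bounded memory footprint in the parent process.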

These might be good suggestions for your use case or not; I’d need to see a reproducer to check what exactly is hogging memory in your case.

Hope this helps!
Enrico