How to delete RDataFrame and clean up memory


ROOT Version: v6-16-00
Platform: MacOS Mojave 10.14.3 (18D109)
Compiler: gcc 4.2.1


Hi,

This is probably a rather straightforward question, but I didn't find an answer.
When I call RDataFrame multiple times, I can see that the process memory keeps growing. Is there any method to clean up the RDF object and delete it in Python?

import ROOT
import os
import time

# @profile presumably comes from the memory_profiler package
@profile
def loopDataFrame(treeName, files):
   df = ROOT.ROOT.RDataFrame(treeName, files)

   print "Processing %s"%files
   model = ROOT.RDF.TH1DModel("lep_0_p4.Pt()", ";p_{T} (lep_{0}) GeV;", 100, 0., 100.)

   myHisto = df.Define("myP4", "lep_0_p4.Pt()").Histo1D(model, "myP4")

   # to show that we wish to get the histogram as output
   myHisto = myHisto.Clone("uniqueName")
   myHisto.SetDirectory(0)
   del df  # try to delete the RDataFrame
   return myHisto 



#_____________________________________________________________________________
@profile
def main():
   file1 = 'user.dponomar.17123811._000001.SM_WLepton.root'
   treeName = 'NOMINAL'

   ROOT.ROOT.EnableImplicitMT()

   hist1 = loopDataFrame(treeName, file1)
   hist1.Draw()

   time.sleep(3)

   file2 = 'user.dponomar.17123811._000003.SM_WLepton.root'
   hist2 = loopDataFrame(treeName, file2)
   hist2.Draw()

   time.sleep(3)
   raw_input("Press Enter to continue...")

#_____________________________________________________________________________
if __name__ == '__main__': main()
#EOF

Best regards,
Daniil

Hi Daniil,
the del at the end of loopDataFrame is redundant: df is deleted when the function returns anyway. @etejedor can correct me if I’m wrong, but I think PyROOT guarantees that when the last Python variable referencing a given Python object goes out of scope, the underlying C++ object is deleted.

So even without the del, the C++ RDataFrame object should go out of scope and its memory should be freed.

In your case, the question is rather “what is allocating memory and never freeing it?”. It’s possible that the culprit is the ROOT interpreter: RDataFrame just-in-time compiles (jits) C++ code. For example, "lep_0_p4.Pt()" is jitted into a corresponding C++ function, and Histo1D(model, "myP4") is jitted into the template call Histo1D<float>(model, "myP4").
The memory allocated by the ROOT interpreter contains code, and is never released.
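
As an illustration of that point (this snippet is not from the original thread, and the dummy function names are made up), one can watch the process memory grow simply by handing cling new code to compile in a loop, with no RDataFrame involved:

import resource  # Unix-only; used to read the process's peak memory
import ROOT

# Hypothetical demo: each Declare() call hands new C++ code to cling.
# The compiled code stays in memory for the lifetime of the process.
for i in range(50):
    # a differently named dummy function each time forces fresh compilation
    ROOT.gInterpreter.Declare("int dummyFunc_%d() { return %d; }" % (i, i))
    # ru_maxrss is reported in kB on Linux and in bytes on macOS
    print("peak RSS so far: %s" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)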

I’m not sure what else might contribute to that memory usage. One could check whether the corresponding C++ program (with and without jitted calls) uses less memory and how much. Or a tool like valgrind --tool=massif might help in identifying memory hoggers.

I’m available for further clarifications!
Cheers,
Enrico


That is indeed the default behaviour. Can the memory increase be related to jitting?

At least part of it for sure, but 150 MB per event loop seems like a lot, doesn't it?

Thanks a lot for taking care of it!

Simple tests with valgrind didn't show me any memory leaks of more than 1%.

In the meantime I have a temporary solution, but it is definitely a crutch: if I wrap loopDataFrame() in a Process from the multiprocessing module, it helps.

from multiprocessing import Process

p = Process(target=loopDataFrame, args=(treeName, file1))
p.start()
p.join()

That also means I have to pass the histogram back (in a rather tricky way) or create a cache ROOT file inside loopDataFrame() to store my histograms there and read them out later.
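
For reference, a minimal sketch of that workaround (not from the original post; the helper names runAndStore and loopInSubprocess are made up, and it assumes loopDataFrame clones its histogram as "uniqueName", as in the first snippet):

import ROOT
from multiprocessing import Process

# Hypothetical helpers illustrating the workaround: run the event loop in a
# child process, cache the histogram in a ROOT file, and read it back in the
# parent. All interpreter memory is released when the child process exits.
def runAndStore(treeName, files, outName):
    h = loopDataFrame(treeName, files)     # same function as in the first snippet
    f = ROOT.TFile(outName, "RECREATE")
    h.Write()
    f.Close()

def loopInSubprocess(treeName, files, outName="tmp_hist.root"):
    p = Process(target=runAndStore, args=(treeName, files, outName))
    p.start()
    p.join()
    f = ROOT.TFile(outName)
    h = f.Get("uniqueName")                # name given by Clone() in loopDataFrame
    h.SetDirectory(0)                      # detach the histogram from the file
    f.Close()
    return h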

With this workaround, the memory-usage plot of my main project looks as follows (7 input datasets with friend trees).

Before: [memory profile plot]
After: [memory profile plot]

Hi,
looks like the peak RAM usage is actually lower before. Is there something I’m missing?

If you can provide a simple example that makes RAM usage really explode, e.g. up to 4 or 8 GB, so that it's obvious there is something that should be freed and isn't, we might be able to take a look and investigate whether something can be done about it.

Cheers,
Enrico


Good day,

here is a simple example that makes RAM usage really explode:

import ROOT
import os
import time

@profile
def loopDataFrame(treeName, file, cuts):
	print "Processing %s"%file
	print "Processing %s"%cuts
	df = ROOT.ROOT.RDataFrame(treeName, file)
	for cutName, cutDef in cuts.iteritems():
		df = df.Filter(cutDef)
	model = ROOT.RDF.TH1DModel("lep_0_p4.Pt()", ";p_{T} (lep_{0}) GeV;", 100, 0., 100.)
	myHisto = df.Define("myP4", "lep_0_p4.Pt()").Histo1D(model, "myP4")
	return myHisto

@profile
def main():
	file = 'merged.root'
	treeName = 'NOMINAL'

	ROOT.ROOT.EnableImplicitMT()

	cuts = {}
	for i in range(0,100):
		cuts['lePt%s'%i] = 'lep_0_p4.Pt()>%s'%i
		hist = loopDataFrame( treeName, file, cuts )
		hist.Draw()
		time.sleep(1)

	raw_input("Press Enter to continue...")

if __name__ == '__main__': main()
#EOF

Plots for only 10 iterations: [memory profile plot]

And another plot, without calling hist.Draw(): [memory profile plot]

Logs: logs.txt (3.5 KB)

Looks like @eguiraud's first guess is the right one: if the script doesn't trigger new jitting, there is no memory slope. However, this memory stays in use even after the function has returned.
It would be awesome to have some sort of RDataFrame::Clean() or RDataFrame::Delete() method to clean up that memory.

Hi,
to me it looks like calling hist.Draw() or not makes little difference: there is a memory creep upwards in both cases, isn't there?

While I fully acknowledge that this is a problem, note that it is most probably not RDataFrame itself that is hogging memory: it's cling, ROOT's C++ interpreter, which RDataFrame uses, for every loop, to just-in-time compile the C++ code used in that loop. Code occupies memory, and that code is never deleted.

A solution might be what is proposed here: this proposed improvement to RDF would allow re-using the same RDF computation graph on different datasets, which in your case would mean that just-in-time compilation happens only once. Could that help? (See also the sketch at the end of this reply.)

A note: I would expect more code to be just-in-time compiled (i.e. more memory being hogged) in the case in which hist.Draw is called, but it seems this is not the case, which confuses me.
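
As an aside (not part of the original reply), here is a minimal sketch of how the example above could be restructured today so that every cut string is jitted only once: book all filters and histograms on a single RDataFrame and let one event loop fill them all. The function name bookAllCuts is made up; the tree and file names are the ones from the example:

import ROOT

# Sketch (assumptions: same 'merged.root' file and 'NOMINAL' tree as above).
# All cuts are booked on one RDataFrame, so every string is just-in-time
# compiled exactly once and a single event loop fills all histograms.
def bookAllCuts(treeName, fileName, nCuts=100):
    ROOT.ROOT.EnableImplicitMT()
    df = ROOT.ROOT.RDataFrame(treeName, fileName)
    node = df.Define("myP4", "lep_0_p4.Pt()")
    model = ROOT.RDF.TH1DModel("lep_0_p4.Pt()", ";p_{T} (lep_{0}) GeV;", 100, 0., 100.)
    histos = {}
    for i in range(nCuts):
        node = node.Filter("lep_0_p4.Pt()>%s" % i)          # cumulative cuts, as in the loop above
        histos["lePt%s" % i] = node.Histo1D(model, "myP4")  # booked lazily, no event loop yet
    # accessing any result (e.g. histos["lePt0"].Draw()) triggers one event loop for all of them
    return histos

This does not free any interpreter memory, and it is not the computation-graph re-use mentioned above; it just avoids jitting the same strings again in every iteration of the Python loop.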


In case someone stumbles upon this topic, also see this post for some possible solutions.