Memory leak with RDataFrame in Python II

Hi,

I am running into the memory problem from my previous post again, this time in code that looks like:

import ROOT
import warnings

warnings.filterwarnings( action='ignore', category=FutureWarning, message='Instantiating a function template.*' )
#--------------------
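def test(replica):
    # Column expression and a valid column name derived from it
    expr="v_trg_wgt_sta[{}]".format(replica)
    repl=expr.replace("[", "_").replace("]", "_")
    # Each call books a new Define on the shared dataframe and runs its own event loop
    dff=df.Define(repl, expr)
    d_var=dff.AsNumpy([repl])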
#--------------------
df=ROOT.RDataFrame("tree", "file.root")

for i in range(0, 10):
    test(i)

This problem has been getting my jobs killed for days. As you mentioned, we could do:

import ROOT
import warnings

warnings.filterwarnings( action='ignore', category=FutureWarning, message='Instantiating a function template.*' )
#--------------------
def test(replica):
    expr="v_trg_wgt_sta[{}]".format(replica)
    repl=expr.replace("[", "_").replace("]", "_")
    d_var=df.AsNumpy([repl])
#--------------------
def define(replica):
    global df
    
    expr="v_trg_wgt_sta[{}]".format(replica)
    repl=expr.replace("[", "_").replace("]", "_")
    df=df.Define(repl, expr)
#--------------------
df=ROOT.RDataFrame("tree", "file.root")

for i in range(0, 600):
    define(i)
    
for i in range(0, 10):
    test(i)

This way, whatever needs to be allocated by “RDataFrame::Define” gets allocated only once. This seems to lower the memory usage to 1.5 GB.
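
The book-once pattern above can be wrapped in a small helper so that every function sharing the dataframe does not need its own booking loop. This is only a sketch: `ColumnBooker` and `column_name` are hypothetical names, not part of ROOT, and the only ROOT calls assumed are `Define` and `AsNumpy` with a list of column names, as used in the scripts. Whether it reduces the memory footprint in 6.20 has not been checked here.

```python
def column_name(replica):
    """Return (expression, column name), built the same way as in the scripts."""
    expr = "v_trg_wgt_sta[{}]".format(replica)
    return expr, expr.replace("[", "_").replace("]", "_")


class ColumnBooker:
    """Hypothetical helper: books each column at most once on a shared dataframe."""

    def __init__(self, df):
        self.df = df         # current dataframe node; replaced after each Define
        self.booked = set()  # column names already defined through this helper

    def book(self, replicas):
        """Define the columns that are still missing and return all requested names."""
        names = []
        for replica in replicas:
            expr, name = column_name(replica)
            if name not in self.booked:
                self.df = self.df.Define(name, expr)
                self.booked.add(name)
            names.append(name)
        return names

    def as_numpy(self, replicas):
        """Fetch all requested columns with a single AsNumpy call."""
        return self.df.AsNumpy(self.book(replicas))


# Intended usage (requires ROOT and the input file from the scripts):
#   import ROOT
#   booker = ColumnBooker(ROOT.RDataFrame("tree", "file.root"))
#   arrays = booker.as_numpy(range(10))
```

Only the replicas that are actually requested get booked, and asking for several columns in one `AsNumpy` call avoids looping over the file once per column.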

However:

  1. The dataframe is needed by many functions, not just “test”. These functions might need to read a different part of the file or do different things with the data. So I would strongly prefer to have one instance of it, rather than creating it multiple times.
  2. I need a booking function every time I use RDataFrame, as in the second approach, which is less simple than the first one.
  3. Everything that gets booked occupies memory for the whole lifetime of the process, even if it is used only once.
  4. 1.5 GB is not a small amount for memory that we won’t get back until the end of the process. We might need to process several files, which might need several dataframes, and that is a lot of wasted memory.
  5. The garbage collector does not deallocate that memory.

As the scripts show, my use case is pretty standard; I am not doing anything exotic.

Question: Is there a way to work around this, or should I go back to using TFile and TTree?

The type of answer I expect is:

“You can deallocate the object with df.Free(repl) after every call.”

But I assume we cannot do such a thing. Feel free to modify the scripts with a realistic and workable approach.

Cheers.


ROOT Version: 6.20
Platform: x86_64-centos7
Compiler: gcc9-dbg


Hi,
can you please share the input files with me (privately is fine too) so I can check the source of the memory hogging? I assume it’s not a leak but a hog, which can happen when a lot of code is just-in-time compiled.
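
One rough way to tell the two cases apart is to record the peak memory of the process after each call. The sketch below is an assumption-laden illustration, not a ROOT facility: `peak_rss_mb` and `track_memory` are hypothetical helpers built on the standard-library `resource` module (Unix only).

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_memory(func, args_list):
    """Call func once per argument and record the peak RSS after each call."""
    history = []
    for arg in args_list:
        func(arg)
        history.append(peak_rss_mb())
    return history


# Intended usage with the scripts from the question:
#   history = track_memory(test, range(10))
#   print(history)
```

If the recorded values keep growing by a similar amount with every call, memory is being held per call; if they level off after the first few calls, the cost is more likely a one-time overhead.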

Also note that my suggestion from the previous post, which you applied if I understand correctly, might yield some more gains in 6.22 compared with 6.20.

Cheers,
Enrico