Memory usage issue in RDataFrame snapshot

Dear ROOT experts,

I am trying to sort the events in a TTree containing hundreds of branches and about a million events using RDataFrame.
The TTree has branches named GenModel_TChiWH_950_400, GenModel_TChiWH_900_400, etc. For any given event, exactly one of these is true and all the others are false. I want to create multiple ROOT files by sorting the events according to GenModel_TChiWH_*. For example, I need to create SortedFile_GenModel_TChiWH_950_50.root, containing all events for which GenModel_TChiWH_950_50 == true.

Here is the script [1] I have. The issue I am facing is that the memory usage keeps growing as the files are produced. At the end, I think it consumes about 2.5 GB for an input file of 221 MB and ~130k events.

Is there a way to “clear some memory” inside the for loop? I can accept some increase in computing time. I have turned off multi-threading, since running single-threaded reduces the memory usage to some extent.

Thanks,
Vinay

ROOT Version: 6.26/07
Platform: AlmaLinux release 9.4 (Seafoam Ocelot)
Compiler: Not Provided


[1]

import ROOT as rt
import sys

# rt.EnableImplicitMT() # Enable multi-threading. Not that helpful for this code. Can save some memory with single thread.
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 makeTTreeForEachMass_v2.py <In_root_file> <OutFile string>")
        sys.exit(1)
    
    file_path = sys.argv[1]
    outFnameSt = sys.argv[2]

    print("Starting analysis")

    tree_name = "Events"
    df = rt.RDataFrame(tree_name, file_path)
    Nentries = df.Count().GetValue()
    print("Number of events:", Nentries)

    # Get all branch names
    branch_names = df.GetColumnNames()
    
    # print(branch_names)
    # Extract mass pairs from branch names.
    # There are branches named GenModel_TChiWH_950_50, GenModel_TChiWH_950_400, etc.
    # For a given event, only one of these GenModel_TChiWH_* branches is 1 (true);
    # the others are set to 0 (false).
    branchNamePatr = "GenModel_TChiWH_"
    mass_pairs = []
    for branch in branch_names:
        branch_str = str(branch)
        if branchNamePatr in branch_str:
            # The last two underscore-separated fields are the two masses.
            mX, mY = (float(m) for m in branch_str.split('_')[-2:])
            mass_pairs.append((mX, mY, branch_str))
            # Keep only the events belonging to this mass point.
            df_temp = df.Filter(f"({branch_str} == 1)")
            outName = f"{outFnameSt}_{int(mX)}_{int(mY)}.root"
            print("Creating file", outName)
            # Write a TTree containing only the events in which this branch
            # (GenModel_TChiWH_950_400, for example) is true.
            df_temp.Snapshot("Events", outName)

Hi,

Thanks for the post and welcome to the ROOT Community!

You are using a rather old version of ROOT, especially where RDF is concerned. Do you see the same behaviour with ROOT 6.32.04, the latest stable release?

Cheers,
Danilo

Hi Danilo,

Here is the comparison carried out on a different machine.
ROOT 6.32.04: 1.2 GB
ROOT 6.30/07: 3.8 GB

So the latest version does consume less memory. However, the usage still went up as the job progressed.

Thanks,
Vinay

Hi Danilo, experts,

I just wanted to check whether there is any fix for this growing memory consumption.

Thanks,
Vinay
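
Since the thread does not settle on a fix, here is a minimal sketch of one possible workaround for the "clear some memory inside the for loop" question: run each Filter + Snapshot in its own short-lived Python process, so that whatever memory that step accumulates is returned to the operating system when the process exits. This is an untested assumption, not a confirmed solution from the ROOT team. The names CHILD_CODE, build_child_command and snapshot_in_child are hypothetical helpers introduced only for illustration, and the sketch assumes PyROOT is importable in the child processes.

```python
import subprocess
import sys

# Code run by each child process. It assumes PyROOT is available there and
# reuses the same Filter + Snapshot calls as the script in the first post.
CHILD_CODE = """
import sys
import ROOT
in_file, out_file, branch = sys.argv[1:4]
df = ROOT.RDataFrame("Events", in_file)
df.Filter(f"{branch} == 1").Snapshot("Events", out_file)
"""


def build_child_command(in_file, out_file, branch):
    """Return the argument list for one child process (hypothetical helper)."""
    return [sys.executable, "-c", CHILD_CODE, in_file, out_file, branch]


def snapshot_in_child(in_file, out_file, branch):
    """Write one mass point in a separate process and wait for it to finish.

    check=True raises CalledProcessError if the child exits with an error.
    """
    subprocess.run(build_child_command(in_file, out_file, branch), check=True)
```

In the loop of the original script, the Filter and Snapshot lines would then be replaced by a call such as snapshot_in_child(file_path, outName, branch_str). Each child re-opens the input file and pays the ROOT start-up and just-in-time compilation cost again, so this trades computing time for a bounded memory footprint per output file.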
