Memory usage issue in RDataFrame snapshot

vhegde · September 13, 2024, 4:20pm

Dear ROOT experts,

I am trying to sort the events in a TTree containing hundreds of branches and about a million events using RDataFrame.
The TTree has branches named GenModel_TChiWH_950_400, GenModel_TChiWH_900_400, etc. For any given event only one of these is true and all the others are false. So I want to create multiple ROOT files by sorting the events according to GenModel_TChiWH_*. For example, I need to create SortedFile_GenModel_TChiWH_950_50.root containing all events with GenModel_TChiWH_950_50 == true.

Here is the script [1] I have. The issue I am facing is that the memory usage keeps growing as the files are produced. At the end, I think it consumes about 2.5 GB for an input file of 221 MB and ~130k events.

Is there a way to “clear some memory” inside the for loop? I can accept a somewhat longer run time. I have turned off multi-threading since that reduces the memory usage to some extent.

Thanks,
Vinay

ROOT Version: 6.26/07
Platform: AlmaLinux release 9.4 (Seafoam Ocelot)
Compiler: Not Provided

[1]

import ROOT as rt
import sys

# rt.EnableImplicitMT() # Multi-threading is left disabled: it does not help much for this code, and a single thread uses less memory.
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 makeTTreeForEachMass_v2.py <In_root_file> <OutFile string>")
        sys.exit(1)
    
    file_path = sys.argv[1]
    outFnameSt = sys.argv[2]

    print("Starting analysis")

    tree_name = "Events"
    df = rt.RDataFrame(tree_name, file_path)
    Nentries = df.Count().GetValue()
    print("Number of events:", Nentries)

    # Get all branch names
    branch_names = df.GetColumnNames()

    # Branches are named GenModel_TChiWH_<mX>_<mY>, e.g. GenModel_TChiWH_950_50, GenModel_TChiWH_950_400, etc.
    # For a given event exactly one of these is 1 (true); all the others are 0 (false).
    branchNamePatr = "GenModel_TChiWH_"
    mass_pairs = []
    for branch in branch_names:
        branch_str = str(branch)
        if branchNamePatr in branch_str:
            # The last two underscore-separated fields of the name are the two masses
            mXY = branch_str.split('_')[-2:]
            mX = float(mXY[0])
            mY = float(mXY[1])
            mass_pairs.append((mX, mY, branch_str))
            # Keep only the events for which this model flag is set
            df_temp = df.Filter(f"({branch_str} == 1)")
            outName = outFnameSt+"_"+str(int(mX))+"_"+str(int(mY))+".root"
            print("Creating file",outName)
            df_temp.Snapshot("Events",outName) # write a TTree with only the events in which this branch (GenModel_TChiWH_950_400, for example) is true

Danilo · September 13, 2024, 5:10pm

Hi,

Thanks for the post and welcome to the ROOT Community!

You are using a rather old version of ROOT, especially when considering RDF. Do you see the same behaviour with ROOT 6.32.04, the latest stable?

Cheers,
Danilo

vhegde · September 15, 2024, 11:42am

Hi Danilo,

Here is the comparison carried out on a different machine.
ROOT 6.32.04 : 1.2 GB
ROOT 6.30/07 : 3.8 GB.

So the latest version does consume less memory, but the usage still went up as the job progressed.
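To make "the usage went up as the job progressed" easier to quantify, here is a minimal sketch of a per-iteration memory log. It uses only the Python standard library (Unix only); the function names `peak_rss_mb` and `log_memory` are made up for illustration and are not part of ROOT.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(label):
    """Print and return the peak memory reached so far."""
    mb = peak_rss_mb()
    print(f"[{label}] peak RSS so far: {mb:.1f} MB")
    return mb
```

Calling `log_memory(outName)` right after each `Snapshot` in the loop would give one line per output file. Note that this is the peak value, so it can show growth but not memory that was released again.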

Thanks,
Vinay

vhegde · September 20, 2024, 12:58pm

Hi Danilo, experts,

I just wanted to check if we have any fix for this increasing memory consumption?

Thanks,
Vinay

system · October 4, 2024, 12:58pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Danilo · October 14, 2024, 6:28am

Hi Vinay,

I must have missed your post.
Moving to the latest stable release seems a good thing to do at this point, irrespective of anything else…

Now, for the memory consumption: have you perhaps tried to rebuild the RDF for every iteration in the loop instead of growing the computation graph?
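A rough sketch of what rebuilding the RDF in every iteration could look like, adapted from the script in the first post. The helper names `model_branches` and `snapshot_per_model` are hypothetical, and whether this actually lowers the memory footprint has not been confirmed in this thread (JIT-compiled code may still accumulate).

```python
def model_branches(branch_names, prefix="GenModel_TChiWH_"):
    """Return (branch, mX, mY) for every branch whose name starts with prefix."""
    selected = []
    for name in branch_names:
        name = str(name)
        if name.startswith(prefix):
            # The last two underscore-separated fields are the two masses
            m_x, m_y = name.split("_")[-2:]
            selected.append((name, int(float(m_x)), int(float(m_y))))
    return selected


def snapshot_per_model(file_path, out_prefix, tree_name="Events"):
    """Write one output file per GenModel_TChiWH_* branch."""
    import ROOT  # imported here so that model_branches works without ROOT

    # Read the column names once from a throwaway data frame
    names = [str(c) for c in ROOT.RDataFrame(tree_name, file_path).GetColumnNames()]
    for branch, m_x, m_y in model_branches(names):
        # A fresh RDataFrame per output file: each computation graph holds
        # exactly one Filter and one Snapshot instead of growing with the loop
        df = ROOT.RDataFrame(tree_name, file_path)
        out_name = f"{out_prefix}_{m_x}_{m_y}.root"
        print("Creating file", out_name)
        df.Filter(f"{branch} == 1").Snapshot(tree_name, out_name)
        del df  # drop the Python reference before the next iteration
```

With the names from the original script this would be called as `snapshot_per_model(file_path, outFnameSt)`.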

Cheers,
D

vpadulan · July 1, 2025, 8:43am

Dear @vhegde ,

I am taking the liberty of bringing up this old topic again because we have just finished improving the RDataFrame Snapshot so that it is no longer necessary to specify any template arguments at all. At the same time, Snapshot no longer requires any JIT compilation. In practice, you get the benefit of a simpler API (i.e. no need to care about template arguments) together with much faster runtime performance.

In your original example, you are not just calling Snapshot but also other RDataFrame operations. So it might be that your memory usage won’t go down to zero, but removing the contribution from the Snapshot calls will surely help!

The change has just been merged to the development branch of ROOT, so it’s going to be available in the next ROOT version, 6.38, scheduled for the end of this year. In the meantime, if you have access to an LCG release via CVMFS, for example using lxplus, you can already see this in action. You can source the environment via e.g.

source /cvmfs/sft.cern.ch/lcg/views/dev3/latest/x86_64-el9-gcc13-opt/setup.sh

Cheers,
Vincenzo