Hello,
I’m currently rewriting my analysis pipeline to always use RDataFrame for data loading. For simple Filters and Snapshots this works extremely well and fast, but there is a part of the code where I need pandas as an interface for training a BDT. Currently I’m using the AsNumpy function to get a dictionary, which is then converted to a pandas DataFrame. However, I found that when using multi-threading for speed-up, I often run into out-of-memory errors. I wrote the following profiling script to monitor the behaviour for different numbers of threads.
from ROOT import RDataFrame, EnableImplicitMT
import time
import pandas as pd
from memory_profiler import memory_usage
import matplotlib.pyplot as plt
import numpy as np
import argparse as ap

parser = ap.ArgumentParser()
parser.add_argument('-t', '--threads', help='Number of threads to use for RDataFrame.')
args = parser.parse_args()
n_threads = int(args.threads)

def pandas_from_RDataFrame(treeName, files):
    # The sleeps only mark the phase boundaries in the memory trace
    time.sleep(2)
    df = RDataFrame(treeName, files)
    time.sleep(10)
    print(f"Processing {files}")
    # Materialise all branches as numpy arrays (the first, large peak)
    dic = df.AsNumpy()
    time.sleep(5)
    # Convert the dictionary of arrays to a pandas DataFrame (the second, smaller peak)
    pd_df = pd.DataFrame(dic)
    time.sleep(2)
    return pd_df

def loop(treeName, file1, file2, n_threads):
    EnableImplicitMT(n_threads)
    # memory_usage runs the function and samples the process memory (default every 0.1 s)
    mem_usage1 = np.array(memory_usage((pandas_from_RDataFrame, (treeName, file1))))
    t1 = np.linspace(0, len(mem_usage1) / 10, num=len(mem_usage1))
    mem_usage2 = np.array(memory_usage((pandas_from_RDataFrame, (treeName, file2))))
    t2 = np.linspace(0, len(mem_usage2) / 10, num=len(mem_usage2))
    # Normalise to the file size on disk (in MB) and to the number of threads
    mem_norm1 = mem_usage1 / 6460.68763351 / n_threads
    mem_norm2 = mem_usage2 / 266.533475876 / n_threads
    t_mem_1 = np.stack((t1, mem_norm1))
    t_mem_2 = np.stack((t2, mem_norm2))
    np.savetxt("save/t_mem_1_" + str(n_threads) + ".txt", t_mem_1)
    np.savetxt("save/t_mem_2_" + str(n_threads) + ".txt", t_mem_2)

file1 = '/data/DATA.root'
file2 = '/mc/MC.root'
treeName = 'D02hh'

loop(treeName, file1, file2, n_threads)
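The script is run once per thread count via the -t option; each run writes one normalized trace per file. A minimal sketch of the plotting step that turns the saved traces into the plots attached below (the thread counts and the output name here are just example values):

import numpy as np
import matplotlib.pyplot as plt

for n in (1, 2, 4, 8):  # example thread counts
    t, mem = np.loadtxt(f"save/t_mem_1_{n}.txt")  # rows: time [s], memory / disk size / threads
    plt.plot(t, mem, label=f"{n} threads")
plt.xlabel("time [s]")
plt.ylabel("memory / (size on disk x threads)")
plt.legend()
plt.savefig("NORM_MEM_DATA.pdf")  # example output name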
I chose these two files for their different sizes on disk (about 6.5 GB of data, 300 MB of MC), but I see the same trend in all my datasets.
As expected, the memory consumption is larger when using more cores, since the data has to be copied multiple times. The second (smaller) peak is the conversion from the dictionary to the pandas.DataFrame.
MEM_DATA.pdf (26.1 KB)
MEM_MC.pdf (16.3 KB)
Since the required memory scales with the size of the dataset, I also produced a version normalized to the size on disk and to the number of threads. Because of that, the height of the second peak is not meaningful in these plots, since the dictionary-to-DataFrame conversion is always single-threaded. It seems that the memory used by the multi-threaded AsNumpy is consistently 2-4 times the disk size per thread.
NORM_MEM_DATA.pdf (31.4 KB)
NORM_MEM_MC.pdf (15.6 KB)
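Translating that per-thread factor back into absolute numbers (the 2-4x factor is read off the normalized plots above; the thread counts are just example values):

disk_size_gb = 6.5  # size of the larger (data) file on disk
for n_threads in (4, 8, 16):  # example thread counts
    print(f"{n_threads} threads: {2 * disk_size_gb * n_threads:.0f} - {4 * disk_size_gb * n_threads:.0f} GB")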
Is this expected behavior? Especially for larger files (where higher thread counts would be most beneficial) this means the memory used goes into the O(100 GiB) range, as can be seen in the first plot. In my case, for example, the larger dataset is already a selected sample; if I try to load the full dataset used in the analysis workflow, it exceeds the available resources already at 5-10 threads.
Cheers
Jonah
ROOT Version: 6.28/00 (installed via conda)
Platform: CentOS 7 x86_64
Compiler: 11.3.0