Hello,
I’m currently rewriting my analysis pipeline to always use RDataFrame for data loading. For simple Filters and Snapshots this works extremely well and fast, but there is a part of the code where I need pandas as an interface for training a BDT. Currently I’m using the AsNumpy function to get a dictionary, which is then converted to a pandas DataFrame. However, I found that when using multi-threading for speed-up, I often run into out-of-memory errors. I wrote the following profiling script to monitor the behaviour for different numbers of threads.
from ROOT import RDataFrame, EnableImplicitMT
import time
import pandas as pd
from memory_profiler import memory_usage
import matplotlib.pyplot as plt
import numpy as np
import argparse as ap

parser = ap.ArgumentParser()
parser.add_argument('-t', '--threads', help='Number of threads to use for RDataFrame.')
args = parser.parse_args()
n_threads = int(args.threads)

def pandas_from_RDataFrame(treeName, files):
    # The sleeps only mark the phase boundaries in the memory trace
    time.sleep(2)
    df = RDataFrame(treeName, files)
    time.sleep(10)
    print(f"Processing {files}")
    # Materialise all branches as numpy arrays (the first, large peak)
    dic = df.AsNumpy()
    time.sleep(5)
    # Convert the dictionary of arrays to a pandas DataFrame (the second, smaller peak)
    pd_df = pd.DataFrame(dic)
    time.sleep(2)
    return pd_df

def loop(treeName, file1, file2, n_threads):
    EnableImplicitMT(n_threads)
    # memory_usage runs the function and samples the process memory (default every 0.1 s)
    mem_usage1 = np.array(memory_usage((pandas_from_RDataFrame, (treeName, file1))))
    t1 = np.linspace(0, len(mem_usage1) / 10, num=len(mem_usage1))
    mem_usage2 = np.array(memory_usage((pandas_from_RDataFrame, (treeName, file2))))
    t2 = np.linspace(0, len(mem_usage2) / 10, num=len(mem_usage2))
    # Normalise to the file size on disk (in MB) and to the number of threads
    mem_norm1 = mem_usage1 / 6460.68763351 / n_threads
    mem_norm2 = mem_usage2 / 266.533475876 / n_threads
    t_mem_1 = np.stack((t1, mem_norm1))
    t_mem_2 = np.stack((t2, mem_norm2))
    np.savetxt("save/t_mem_1_" + str(n_threads) + ".txt", t_mem_1)
    np.savetxt("save/t_mem_2_" + str(n_threads) + ".txt", t_mem_2)

file1 = '/data/DATA.root'
file2 = '/mc/MC.root'
treeName = 'D02hh'

loop(treeName, file1, file2, n_threads)
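The script is run once per thread count via the -t option; each run writes one normalized trace per file. A minimal sketch of the plotting step that turns the saved traces into the plots attached below (the thread counts and the output name here are just example values):

import numpy as np
import matplotlib.pyplot as plt

for n in (1, 2, 4, 8):  # example thread counts
    t, mem = np.loadtxt(f"save/t_mem_1_{n}.txt")  # rows: time [s], memory / disk size / threads
    plt.plot(t, mem, label=f"{n} threads")
plt.xlabel("time [s]")
plt.ylabel("memory / (size on disk x threads)")
plt.legend()
plt.savefig("NORM_MEM_DATA.pdf")  # example output name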
I chose these two files for their different sizes on disk (about 6.5 GB of data, 300 MB of MC), but I see the same trend in all my datasets.
As expected, the memory consumption is larger when using more cores, since the data has to be copied multiple times. The second (smaller) peak is the conversion from the dictionary to the pandas.DataFrame.
MEM_DATA.pdf (26.1 KB)
MEM_MC.pdf (16.3 KB)
Since the required memory scales with the size of the dataset, I also produced a version normalized to the size on disk and to the number of threads. Because of that, the height of the second peak is not meaningful in these plots, since the dictionary-to-DataFrame conversion is always single-threaded. It seems that the memory used by the multi-threaded AsNumpy is consistently 2-4 times the disk size per thread.
NORM_MEM_DATA.pdf (31.4 KB)
NORM_MEM_MC.pdf (15.6 KB)
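Translating that per-thread factor back into absolute numbers (the 2-4x factor is read off the normalized plots above; the thread counts are just example values):

disk_size_gb = 6.5  # size of the larger (data) file on disk
for n_threads in (4, 8, 16):  # example thread counts
    print(f"{n_threads} threads: {2 * disk_size_gb * n_threads:.0f} - {4 * disk_size_gb * n_threads:.0f} GB")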
Is this expected behavior? Especially for larger files (where higher thread counts would be most beneficial) this means the memory used goes into the O(100 GiB) range, as can be seen in the first plot. In my case, for example, the larger dataset is already a selected sample; if I try to load the full dataset used in the analysis workflow, it exceeds the available resources already at 5-10 threads.
Cheers
Jonah
ROOT Version: 6.28/00 (installed via conda)
Platform: CentOS 7 x86_64
Compiler: 11.3.0