A memory leak with RDataFrame.AsNumpy() and vector<vector<vector<float>>>

Hi,

I’ve encountered a small but significant memory leak with RDataFrame.AsNumpy() in ROOT 6.38.00. I have a tree whose largest branch is a 3D vector trace_ch: vector<vector<vector<float>>>. The following code:

import numpy as np
import ROOT

ROOT.gInterpreter.GenerateDictionary("ROOT::VecOps::RVec<vector<vector<float>>>", "vector;ROOT/RVec.hxx")
df = ROOT.RDataFrame(dd.trawvoltage._tree)
for ij in range(np.ceil(dd.trawvoltage.get_entries() / events_to_read).astype(int)):
    print(ij)
    df.Range(ij * events_to_read, (ij + 1) * events_to_read).AsNumpy(["trace_ch"])

leaks memory. In my tests I was reading 3000 events at a time, with a single-entry vector of roughly 5x4x1024 dimensions on average (the first dimension varies, but it is 5 on average). So a single iteration should read out ~246 MB in C++ terms. After about 300 iterations, the code uses 1.1 GB more than at the first iteration. The growth is not linear - sometimes it stays almost constant for several dozen iterations, then grows quickly. Perhaps it is something caching-related?

Equivalent C++ code called from Python (where I pass the vectors and do the iterating and filling in C++) works without any memory leak.

I know I should provide a working example, but I am uncertain how I could give you as much data as is needed to notice the leak…


ROOT Version: 6.38.00
Platform: Fedora 43


Dear @LeWhoo ,

Thanks for reaching out! Could you provide one input data file so I could start debugging from there?

Cheers,
Vincenzo

Thank you! I shared them with you in a private message.

What I forgot to mention is that I read them as a TChain (the RDataFrame was initialised with a TChain) - perhaps this has something to do with the problem.

Hi, before the topic closes - any news on the leak?

Dear @LeWhoo ,

I have started debugging your issue with the files you sent me. The first thing I notice in your short snippet above is that you are using Range. This operation needs to read all the entries up to the starting one (the first argument). Say, if you call Range(1000, 2000), RDataFrame will first need to read entries 0 to 999 anyway; then of course you will only get entries 1000 to 1999 as a result, as requested in the call. This may be partly responsible for the leak you report. In order for me to understand whether that’s the case, I would also need to see the C++ code you mention:

Equivalent C++ code called from Python (where I pass the vectors and do the iterating and filling in C++) works without any memory leak.
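To give an idea of why this pattern is expensive, here is a back-of-the-envelope sketch (plain Python, not ROOT; the numbers are hypothetical) of how many entries end up being scanned if every Range call has to start again from entry 0:

```python
# Back-of-the-envelope: if every Range(begin, end) must first skip over all
# entries before `begin`, then each iteration scans entries 0..end-1.
def entries_scanned(total_entries, batch):
    scanned = 0
    for begin in range(0, total_entries, batch):
        end = min(begin + batch, total_entries)
        scanned += end  # entries 0..end-1 are touched to serve this Range
    return scanned

# 30000 entries in batches of 3000: only 10 batches, but 165000 entries
# scanned instead of 30000 - the cost grows quadratically with batch count
print(entries_scanned(30000, 3000))  # -> 165000
```

This is just the arithmetic of the skipping behaviour described above, not a claim about what RDataFrame allocates per call.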

Meanwhile I’ll continue the investigation.
Cheers,
Vincenzo

Dear @LeWhoo ,

Here is a first attempt to reproduce your issue, so far without success. I think you may be finding yourself in a situation where you are reading inefficiently and seeing the accumulated effects of that. I have created this GitHub repository: vepadulano/root-forum-64767

In the repo you will find some preparatory code to generate the dictionary for ROOT::RVec<std::vector<std::vector<float>>>, together with a Python script. In the script, I keep track of the RSS used during the for loop with the Range + AsNumpy calls.

Keeping an approach similar to your initial post shows some fluctuation in RSS between iterations (albeit still fairly small). But when I move to a for loop that takes cluster boundaries into account, I see practically no fluctuation between iterations. I do see something in the neighbourhood of 0.5 MB, but I don’t think this can be called a leak at this stage.

For a brief explanation, a “cluster” is the smallest compressed group of entries of a TTree on disk. This means that when you read the TTree back from disk, you uncompress an entire cluster of entries at a time, irrespective of how many entries you actually need. For example, the files you sent me each have one cluster. That means that when reading each tree of the chain, the entire tree will be read into memory, even if you just request a smaller number of entries via Range. More concretely, the first file has 197 entries and only one cluster [0, 197). When you call Range(0, 100), the TTree will still uncompress all 197 entries from disk. This is the same in C++ and Python, with or without RDataFrame, with or without ROOT I/O (i.e. any software reading a TTree will do this).
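As a concrete sketch of what cluster-aligned iteration looks like (plain Python, not ROOT; it assumes, as here, one cluster per file, with entry counts inferred from the begin/end values printed further down):

```python
# Given the per-file entry counts of a chain where each file holds a single
# cluster, produce chain-global (begin, end) ranges aligned to cluster
# boundaries, so each cluster is uncompressed exactly once.
def cluster_ranges(entries_per_file):
    begin = 0
    for n in entries_per_file:
        yield begin, begin + n
        begin += n

print(list(cluster_ranges([197, 286, 95, 99, 104])))
# -> [(0, 197), (197, 483), (483, 578), (578, 677), (677, 781)]
```

In a real script the per-cluster boundaries could instead be obtained from ROOT itself (e.g. via TTree::GetClusterIterator), rather than hard-coded as above.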

Can I ask you to take a look at my example, maybe try to run it and reproduce it, and in case tell me what I am missing to show the leak you see?

On my machine, I get the following:

Delta RSS before for loop: 434.71
begin=0,end=197
pmem(rss=615362560, vms=3308306432, shared=334733312, text=4096, lib=0, data=2581798912, dirty=0)
Delta RSS: 145.05
begin=197,end=483
pmem(rss=615882752, vms=3308609536, shared=334733312, text=4096, lib=0, data=2582093824, dirty=0)
Delta RSS: 0.52
begin=483,end=578
pmem(rss=616181760, vms=3308765184, shared=334733312, text=4096, lib=0, data=2582249472, dirty=0)
Delta RSS: 0.30
begin=578,end=677
pmem(rss=616595456, vms=3308765184, shared=334733312, text=4096, lib=0, data=2582249472, dirty=0)
Delta RSS: 0.41
begin=677,end=781
pmem(rss=617107456, vms=3309228032, shared=334733312, text=4096, lib=0, data=2582712320, dirty=0)
Delta RSS: 0.51
Delta RSS w.r.t. before the for loop: 146.79

Note that the first iteration of the for loop shows an increase in memory usage but without further knowledge that’s most probably just due to the JITting of the column types and function instantiations for the first AsNumpy call.

Cheers,
Vincenzo

Thanks. I tried your script, and I think that the lack of a visible memory leak is simply due to too small a number of events. May I suggest you replicate the files I’ve sent you with something like:

for i in {6..600}; do cp 5.root "$i.root"; done

And then read 3000 entries at each iteration for all files. I attach your modified script for your convenience.

If I exclude the initial 4 iterations, then the difference in RSS between the last iteration and iteration 4 is 47 MB. That’s after reading ~63000 entries. This is shown on the plot below:

Not much. However, I read ~5,000,000 entries. That’s roughly 100 times more, so naively the memory would blow up by ~4.7 GB. As I wrote, the growth is not linear, and given my real files, I think it was over 10 GB, and I started to run out of memory.

Perhaps reading without crossing the file boundaries, as you suggest, could help, but doesn’t that defeat the purpose of using a TChain?
forum64767.py (2.8 KB)

Dear @LeWhoo ,

Thanks for the update. I will re-run with more files and see if I can reproduce the trend you show.

Perhaps reading without crossing the file boundaries, as you suggest, could help, but doesn’t that defeat the purpose of using a TChain?

Don’t get me wrong. Mine was just an explanation of the internals of the TTree data format, which may or may not influence this particular case. Also, the fact that your files only have one cluster each is just by chance. Usually a single TTree has tens if not hundreds of clusters. RDataFrame already takes this into account and exploits it for optimised performance under the hood, for example in multithreaded runs. In your example you are actively working against this by selecting a range of entries manually, which may or may not correspond to cluster boundaries. The fact you are using a TChain does not really matter here: if your files had more than one cluster each, you could still call e.g. Range(1000,2000), Range(2000,3000) where both ranges would end up in the same file of the chain, just different clusters.

Thank you!

Sure, I get it. The point is that in my logic, I read the events into a NumPy array and then convert it to a CuPy array on a GPU with limited memory. Thus, I read a limited number of events that fit on the GPU (a rough estimate, as the event size varies significantly). I guess there is no simple mechanism that would let me iterate through a TTree optimally (reading whole clusters only, never re-reading a cluster) while at the same time not exceeding a specific amount of memory used by the read-out events. That’s not related to the leak I am describing, but perhaps I am missing some simple mechanism (I know I could scan the chain, derive some common denominator between clusters and occupied memory, etc., but that seems to complicate the logic significantly).
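For what it’s worth, the batching logic described here need not be complicated. A minimal sketch (plain Python, with hypothetical cluster sizes) that greedily groups whole clusters into batches under a memory budget could look like:

```python
def batch_clusters(cluster_sizes_mb, budget_mb):
    """Greedily group whole clusters into batches that fit a memory budget.
    A single cluster larger than the budget becomes a batch of its own."""
    batches, current, used = [], [], 0.0
    for size in cluster_sizes_mb:
        # Close the current batch if adding this cluster would exceed the budget
        if current and used + size > budget_mb:
            batches.append(current)
            current, used = [], 0.0
        current.append(size)
        used += size
    if current:
        batches.append(current)
    return batches

# Hypothetical cluster sizes in MB, against a 250 MB GPU budget
print(batch_clusters([120, 80, 200, 60, 90], 250))
# -> [[120, 80], [200], [60, 90]]
```

With the (begin, end) entry range of each cluster in hand, each batch would then translate into one Range + AsNumpy call covering whole clusters only.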

Dear @LeWhoo ,

I’ve been exploring your use case extensively, trying out many things. I started profiling the memory usage of your application with memray (bloomberg/memray on GitHub), which produces very detailed reports that can be turned into flamegraphs showing both the total heap size and the RSS over time.

With regard to the example at hand with AsNumpy (visible at CERNBox), this is a typical graph I see

With the full flamegraph visible at CERNBox

Then I went on and changed the example slightly. Everything stays the same, except I substitute the AsNumpy operation with Take (see example at CERNBox):

col_type = "ROOT::RVec<std::vector<std::vector<float>>>"
df.Range(begin, end).Take[col_type]("trace_ch").GetValue()

The flamegraph differs in that I see drops and rises of the RSS

But when I zoom in, I see that there is still a slight rising trend in the peaks of the RSS: the second-highest peak is at 3.35 GB and the last peak is at 3.40 GB

The full flamegraph is visible at CERNBox

For the record, I also used the awkward-array integration with RDataFrame, as in

range_df = df.Range(begin, end)
awkward.from_rdataframe(range_df, "trace_ch")

in each iteration of the for loop (see example at CERNBox). This does not change the picture much, but it does increase both the total heap size and the RSS substantially:

Note that the peaks now hover around the 7GB mark. The full flamegraph is available at CERNBox

From this analysis I can see that there are multiple factors at play that contribute to your memory usage:

First, the size of your data

Your dataset is made of deeply-nested, very large arrays. The flamegraph with the raw Take operation shows that even the net heap size reaches 2.8GB at each iteration, which is just the size of the vectors you have.

Then, the specific type ROOT::RVec<std::vector<std::vector<float>>> is also more complicated to deal with than a flat vector. At each event fill, the higher-level array that represents your event batch (in this case of size 3000) must allocate (and most often re-allocate) storage for an entire ROOT::RVec<std::vector<std::vector<float>>> object at a time.

This is even more noticeable when changing the Python code to C++. I did that (see example at CERNBox) and profiled it with valgrind --tool=massif. Here is the output massif file at CERNBox. You can visualise it, for example, with Massif-Visualizer, which shows:

The vast majority of your memory is used by your own data.

Second, the specific implementation of AsNumpy

After removing the largest contributor, there is still the difference that while Take shows rises and drops in the RSS, in conjunction with the end of one batch and the creation of the next, AsNumpy shows a steady increase in RSS consumption. This difference is most likely due to the fact that we create a numpy.array with dtype=object, because ROOT::RVec<std::vector<std::vector<float>>> cannot be represented with a native NumPy type. For each 3000-sized batch, we create and destroy 3000 Python objects that wrap the C++ type. Each destruction may trigger lookups in the Python-C++ bindings engine to instantiate and call the function (the destructor) responsible for deleting that C++ object.
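The per-object churn described above can be illustrated with a plain-Python analogy (this is a model of the cost shape, not the actual bindings machinery): every batch materialises one wrapper object per entry, and every wrapper needs an individual destructor dispatch when the batch is dropped:

```python
import gc

class Wrapper:
    """Stand-in for a Python proxy around one C++ RVec object."""
    live = 0
    def __init__(self):
        Wrapper.live += 1
    def __del__(self):
        Wrapper.live -= 1

# One wrapper per entry in a 3000-entry batch, as with a dtype=object array
batch = [Wrapper() for _ in range(3000)]
assert Wrapper.live == 3000
del batch     # 3000 individual destructor dispatches, one per object
gc.collect()  # make sure any remaining garbage is collected
print(Wrapper.live)  # -> 0
```

The memory is eventually returned, but paying the create/destroy cycle per object, 3000 times per batch, is where the overhead relative to a flat native array comes from.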

I hoped this could be improved by using the awkward.from_rdataframe function, which bypasses the scheme above, but it actually made things worse overall.

Third, leftover TTree and interpreter allocations

There is always a bit of memory that’s not easy to get rid of.

One part of this is represented by the TTreeCache. Every TTree creates this cache for I/O optimization reasons, but it may represent a bit of leftover memory increase in your application. In my tests, I disabled this simply by calling chain.SetCacheSize(0), that’s why you don’t see it in the flamegraphs.

The other part of the remaining memory increase may just be due to interaction with the interpreter, which happens at multiple levels (RDF surely, but also the I/O with TTree and TFile). There’s nothing I can practically do here. In principle, if your application allows it, you could wrap each batch iteration in a subprocess, which would release all the allocated memory when it exits:

# This bit must be saved to another Python script, e.g. "work.py"
import sys

def run_df(begin, end):
    # Create the RDF here and do stuff with it within the given range
    ...

if __name__ == "__main__":
    run_df(int(sys.argv[1]), int(sys.argv[2]))

# This bit can be executed from the main script
import subprocess, sys
for begin, end in ranges:
    subprocess.run([sys.executable, "work.py", str(begin), str(end)], check=True)

All in all, let me say that I’m not fully satisfied with these results, but I don’t see any immediate thing we could do to improve the situation. Most of the memory is used by your data, which is clearly non-negotiable; the leftover comes mostly from interactions with the interpreter, which can’t generally be fixed except by working around them with the subprocess approach.

Cheers,
Vincenzo