A memory leak with RDataFrame.AsNumpy() and vector<vector<vector<float>>>

Hi,

I’ve encountered a small but significant memory leak with RDataFrame.AsNumpy() in ROOT 6.38.00. I have a tree whose largest branch is a 3D vector, trace_ch: vector<vector<vector<float>>>. The following code:

import numpy as np
import ROOT

ROOT.gInterpreter.GenerateDictionary("ROOT::VecOps::RVec<vector<vector<float>>>", "vector;ROOT/RVec.hxx")
df = ROOT.RDataFrame(dd.trawvoltage._tree)
for ij in range(np.ceil(dd.trawvoltage.get_entries() / events_to_read).astype(int)):
    print(ij)
    # Result intentionally discarded; memory grows nonetheless
    df.Range(ij * events_to_read, (ij + 1) * events_to_read).AsNumpy(["trace_ch"])

leaks memory. In my tests I was reading 3000 events at a time, with average per-entry dimensions of 5×4×1024 (the first dimension varies, but it is 5 on average). A single iteration should therefore read out ~246 MB in C++ terms. After about 300 iterations, the code uses 1.1 GB more than at the first iteration. The growth is not linear: sometimes it stays almost constant for several dozen iterations, then grows quickly. Perhaps it is something caching-related?
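The ~246 MB figure can be sanity-checked with quick arithmetic (a back-of-the-envelope Python check, assuming 4-byte floats and the average 5×4×1024 shape stated above):

```python
# Rough size of one batch of entries in C++ terms (floats only,
# ignoring per-vector bookkeeping overhead).
events_per_batch = 3000
avg_shape = (5, 4, 1024)   # average dimensions of one trace_ch entry
bytes_per_float = 4

entry_floats = 1
for dim in avg_shape:
    entry_floats *= dim     # 5 * 4 * 1024 = 20480 floats per entry

batch_bytes = events_per_batch * entry_floats * bytes_per_float
print(f"{batch_bytes / 1e6:.0f} MB")  # ~246 MB, matching the estimate above
```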

Equivalent C++ code called from Python, where I pass the vectors and do the iterating and filling in C++, works without any memory leaks.

I know I should provide a working example, but I am uncertain how I could give you as much data as is needed to notice the leak…


ROOT Version: 6.38.00
Platform: Fedora 43


Dear @LeWhoo ,

Thanks for reaching out! Could you provide one input data file so I could start debugging from there?

Cheers,
Vincenzo

Thank you! I shared them with you in a private message.

What I forgot to mention is that I read them as a TChain (the RDataFrame was initialised with a TChain) - perhaps this has something to do with the problem.

Hi, before the topic closes - any news on the leak?

Dear @LeWhoo ,

I have started debugging your issue with the files you sent me. The first thing I notice from your short snippet above is that you are using Range. This operation needs to read all the entries up to the starting one (the first argument). Say, if you call Range(1000, 2000), RDataFrame will first need to read entries 0 to 999 anyway; of course you will only get entries 1000 to 1999 as a result, as requested in the call. This may be partly responsible for the leak you report. In order for me to understand whether that’s the case, I would also need to see the C++ code you mention:
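To quantify this effect for the loop in the first post: each AsNumpy call triggers a fresh event loop, and Range(begin, end) only discards the leading entries, so the total number of entries stepped through grows quadratically with the number of iterations. A quick illustrative calculation, using the figures from the original post (3000 events per iteration, ~300 iterations):

```python
events_to_read = 3000
iterations = 300

# Iteration i runs Range(i * events_to_read, (i + 1) * events_to_read):
# its event loop must step through entries 0 .. end-1, even though only
# the last `events_to_read` of them end up in the result.
total_read = sum((i + 1) * events_to_read for i in range(iterations))
useful_read = iterations * events_to_read

print(f"entries stepped through: {total_read:,}")   # 135,450,000
print(f"entries actually wanted: {useful_read:,}")  # 900,000
```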

Equivalent C++ code called from Python, where I pass the vectors and do the iterating and filling in C++, works without any memory leaks.

Meanwhile I’ll continue the investigation.
Cheers,
Vincenzo

Dear @LeWhoo ,

Here is a first attempt to reproduce your issue, so far without success. I think you may be in a situation where you are reading inefficiently and seeing some cumulative effects of that. I have created the GitHub repository vepadulano/root-forum-64767.

In the repo you will find some preparatory code to generate the dictionary for ROOT::RVec<std::vector<std::vector<float>>>, together with a Python script. In the script, I keep track of the RSS used during the for loop with the Range + AsNumpy calls.
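As a side note, RSS can also be tracked without extra dependencies; here is a minimal sketch using the standard library's resource module (note that ru_maxrss is a high-water mark, so it only ever increases, unlike the instantaneous psutil figures; still, steady growth across iterations is exactly what a leak would show):

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process in MB.

    On Linux, ru_maxrss is reported in kilobytes (on macOS it is bytes).
    This value is a peak, not an instantaneous reading, so it can only
    grow -- good enough to spot steady growth across iterations.
    """
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kb / 1024.0  # assumes Linux (kB)

before = peak_rss_mb()
# ... run one Range(...).AsNumpy(...) iteration here ...
print(f"Delta RSS: {peak_rss_mb() - before:.2f} MB")
```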

Keeping an approach similar to your initial post shows some fluctuation in RSS between iterations (albeit still fairly small). But when I move to a for loop that respects cluster boundaries, I see practically no fluctuation between iterations. I do see something in the neighbourhood of 0.5 MB, but I don’t think this can be called a leak at this stage.

For a brief explanation of what a “cluster” is: it is the smallest compressed group of entries of a TTree on disk. This means that when you read the TTree back from disk, you uncompress an entire cluster of entries at a time, irrespective of how many entries you actually need. For example, the files you sent me each have one cluster. That means that when reading each tree of the chain, the entire tree is read into memory, even if you just request a smaller number of entries via Range. Even more concretely, the first file has 197 entries and only one cluster [0, 197): when you call Range(0, 100), the TTree will still uncompress all 197 entries from disk. This is the same in C++ or Python, with or without RDataFrame, with or without ROOT I/O (i.e. any software reading a TTree will do this).
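A sketch of how one could derive such cluster-aligned (begin, end) ranges. The helper itself is plain Python; the docstring shows, as an untested assumption, where the cluster starts would come from in PyROOT (TTree::GetClusterIterator is the relevant API, see the TTree reference):

```python
def cluster_ranges(cluster_starts, n_entries):
    """Yield (begin, end) entry ranges aligned to cluster boundaries.

    cluster_starts: sorted first-entry index of each cluster, e.g. [0, 197, 483].
    In PyROOT these could be collected roughly like (untested sketch):
        it = tree.GetClusterIterator(0)
        starts, start = [], it.Next()
        while start < tree.GetEntries():
            starts.append(start)
            start = it.Next()
    """
    for i, begin in enumerate(cluster_starts):
        end = cluster_starts[i + 1] if i + 1 < len(cluster_starts) else n_entries
        yield begin, end

# Example with the boundaries from the log below:
print(list(cluster_ranges([0, 197, 483], 578)))
# [(0, 197), (197, 483), (483, 578)]
```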

Can I ask you to take a look at my example, maybe try to run it and reproduce it, and in case tell me what I am missing to show the leak you see?

On my machine, I get the following:

Delta RSS before for loop: 434.71
begin=0,end=197
pmem(rss=615362560, vms=3308306432, shared=334733312, text=4096, lib=0, data=2581798912, dirty=0)
Delta RSS: 145.05
begin=197,end=483
pmem(rss=615882752, vms=3308609536, shared=334733312, text=4096, lib=0, data=2582093824, dirty=0)
Delta RSS: 0.52
begin=483,end=578
pmem(rss=616181760, vms=3308765184, shared=334733312, text=4096, lib=0, data=2582249472, dirty=0)
Delta RSS: 0.30
begin=578,end=677
pmem(rss=616595456, vms=3308765184, shared=334733312, text=4096, lib=0, data=2582249472, dirty=0)
Delta RSS: 0.41
begin=677,end=781
pmem(rss=617107456, vms=3309228032, shared=334733312, text=4096, lib=0, data=2582712320, dirty=0)
Delta RSS: 0.51
Delta RSS w.r.t. before the for loop: 146.79

Note that the first iteration of the for loop shows an increase in memory usage, but without further knowledge that’s most probably just due to the JITting of the column types and function instantiations for the first AsNumpy call.

Cheers,
Vincenzo

Thanks. I tried your script, and I think the lack of a visible memory leak is simply due to too small a number of events. May I suggest you multiply the files I’ve sent you with something like:

for i in {6..600}; do cp 5.root "$i.root"; done

And then read 3000 entries at each iteration, over all the files. I attach your modified script for your convenience.

If I exclude the initial 4 iterations, the difference in RSS between the last iteration and iteration 4 is 47 MB, after reading ~63,000 entries. This is shown on the plot below:

Not much. However, in my real case I read ~5,000,000 entries. That’s roughly 100 times more, so naively the memory would grow by ~4.7 GB. As I wrote, the growth is not linear, and with my real files I think it was over 10 GB, and I started to run out of memory.

Perhaps reading without crossing the file boundaries, as you suggest, could help, but doesn’t that defeat the purpose of using a TChain?
forum64767.py (2.8 KB)

Dear @LeWhoo ,

Thanks for the update. I will re-run with more files and see if I can reproduce the trend you show.

Perhaps reading without crossing the file boundaries, as you suggest, could help, but doesn’t that defeat the purpose of using a TChain?

Don’t get me wrong: mine was just an explanation of the internals of the TTree data format, which may or may not influence this particular case. Also, the fact that your files have only one cluster each is just chance; usually a single TTree has tens if not hundreds of clusters. RDataFrame already takes this into account and exploits it for optimised performance under the hood, for example in multithreaded runs. In your example you are actively working against this by selecting entry ranges manually, which may or may not correspond to cluster boundaries. The fact that you are using a TChain does not really matter here: if your files had more than one cluster each, you could still call e.g. Range(1000, 2000) and Range(2000, 3000), where both ranges would end up in the same file of the chain, just in different clusters.

Thank you!

Sure, I get it. The point is that in my logic I read the events into a NumPy array and then convert it to a CuPy array on a GPU with limited memory. Thus I read a limited number of events that fit on the GPU (a rough estimate, as the event size varies significantly). I guess there is no simple mechanism that would let me iterate through a TTree optimally (reading whole clusters only, never re-reading a cluster) while at the same time not exceeding a specific amount of memory occupied by the read-out events. That’s not related to the leak I am describing, but perhaps I am missing some simple mechanism (I know I could scan the chain, derive some common denominator between clusters and memory occupied, etc., but that seems to complicate the logic significantly).
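For what it’s worth, one such mechanism could be built on top of cluster boundaries: greedily merge consecutive clusters into batches until an estimated byte budget is reached, then call Range + AsNumpy once per batch. A hypothetical, pure-Python sketch of the grouping step; the per-entry size estimate and the budget are assumptions (the estimate could come from a measured average, or roughly from tree.GetZipBytes() / tree.GetEntries()):

```python
def batch_clusters(cluster_ranges, est_bytes_per_entry, budget_bytes):
    """Greedily merge consecutive cluster (begin, end) ranges into batches
    whose estimated in-memory size stays under budget_bytes.

    A batch always contains at least one whole cluster, so a single
    oversized cluster still forms its own batch (it cannot be split
    anyway: the whole cluster is decompressed when any of it is read).
    """
    batch_begin, batch_end = None, None
    for begin, end in cluster_ranges:
        if batch_begin is None:
            batch_begin, batch_end = begin, end
            continue
        if (end - batch_begin) * est_bytes_per_entry > budget_bytes:
            yield batch_begin, batch_end
            batch_begin, batch_end = begin, end
        else:
            batch_end = end
    if batch_begin is not None:
        yield batch_begin, batch_end

# Example: ~82 kB per entry (5*4*1024 floats), 16 MiB budget.
ranges = [(0, 197), (197, 483), (483, 578), (578, 677)]
print(list(batch_clusters(ranges, 20480 * 4, 16 * 1024**2)))
# [(0, 197), (197, 483), (483, 677)]
```

Each resulting (begin, end) pair can then be passed to Range, so every cluster is decompressed exactly once and each AsNumpy result stays (approximately) within the GPU budget.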