RDF memory consumption increased significantly with merged files

Dear experts,

I am running RDataFrame and am doing some tests comparing runs on merged and unmerged files. We have a lot of files with only very few events stored within the ntuples, each only a few MB in size. I first tried merging them into bigger files of about 0.5-1 GB. This led to a speed improvement of about a factor of 5. This is of course great news, although not completely unexpected I suppose. However, what worries me a bit is that I noticed that running the code on the merged files also increased the memory consumption by about 25%: from 20 GB virtual memory (unmerged) to about 28 GB (merged), and from 15 GB RES memory to 19 GB. Is this a known issue? Naively I always thought this was quite a strength of ROOT, i.e. that the size of the ntuple would not increase the memory consumption, given that only one event is in memory at a time (or a few, now that RDataFrame is multithreaded).

To give a bit more of an idea of what exactly I am doing: I have multiple RDatasetSpecs (~25), each with their own RDF computation graph, which I then run with RunGraphs. I only make histograms, about 100 per RDF.
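
In code, the setup looks roughly like the following sketch (not my actual analysis code; sample, tree, file and branch names are placeholders, and only one histogram per graph is booked instead of ~100):

```cpp
#include <ROOT/RDataFrame.hxx>
#include <ROOT/RDFHelpers.hxx>       // ROOT::RDF::RunGraphs
#include <ROOT/RDF/RDatasetSpec.hxx> // ROOT::RDF::Experimental::RDatasetSpec
#include <TH1D.h>
#include <TROOT.h>
#include <string>
#include <vector>

int main()
{
   ROOT::EnableImplicitMT();

   std::vector<ROOT::RDF::RResultPtr<TH1D>> histos; // keeps results (and graphs) alive
   std::vector<ROOT::RDF::RResultHandle> handles;

   for (int i = 0; i < 25; ++i) {
      // one dataset spec and one computation graph per sample (placeholder names)
      ROOT::RDF::Experimental::RDatasetSpec spec;
      spec.AddSample(ROOT::RDF::Experimental::RSample(
         "sample" + std::to_string(i), "ntuple", "merged_" + std::to_string(i) + ".root"));
      ROOT::RDataFrame df(spec);
      // in the real code ~100 histograms are booked per graph; one shown here
      const auto name = "h" + std::to_string(i);
      auto h = df.Histo1D({name.c_str(), "", 100, 0., 1.}, "x");
      histos.push_back(h);
      handles.emplace_back(h);
   }

   ROOT::RDF::RunGraphs(handles); // runs all 25 event loops concurrently
}
```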

Personally it's not such a big problem, as the PC on my local cluster can deal with this increase in memory; however, lxplus usually kills processes with large memory consumption…

Anyway, my question is whether this is expected behavior or not.

Cheers,
Jordy

My ROOT is set up from CVMFS via:

which root
/cvmfs/sft.cern.ch/lcg/views/LCG_106b/x86_64-el9-gcc11-opt/bin/root
ROOT Version: 6.32.08
Built for linuxx8664gcc on Dec 03 2024, 17:12:25
From tags/6-32-08@6-32-08

@vpadulan could you please take a look at this question?

Hello Jordy,

There are multiple factors that contribute to the memory consumption of RDF. (A few paragraphs on this can be found in the RDataFrame documentation under “Memory usage”.)

I think in your case it might be the computation graphs that have to be just-in-time compiled for each RDatasetSpec. The compiled products will stay in memory until the process exits.
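
As an illustration (assuming a dataframe `df` with a float branch "pt"; this may not correspond to your code): every string expression passed to e.g. Define or Filter, and every action whose column types are not spelled out, is compiled by the interpreter at run time, and that compiled code is never unloaded. Fully typed calls avoid these extra compilation products:

```cpp
// Jitted: the string expression is compiled at run time by the interpreter,
// and the resulting code stays loaded until the process exits.
auto hJit = df.Define("pt2", "pt*pt").Histo1D("pt2");

// Typed: a C++ callable compiled together with the rest of the program,
// so no additional run-time compilation products accumulate.
auto hTyped = df.Define("pt2", [](float pt) { return pt * pt; }, {"pt"})
                 .Histo1D<float>("pt2");
```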

Another reason for higher memory consumption is the thread-local histograms that have to be filled. The higher the level of parallelism, the more histograms get created. If you used to run your graphs one after the other, those intermediate histograms might already have been cleared when the next graph starts running, saving you some RAM. Now that you merged the inputs, it could be that multiple graphs are running in parallel, so there are more histograms “in flight”.

If you want to test, you could run the 25 different graphs one after the other instead of via RunGraphs, as sketched below. If that uses less memory, it is the histograms in flight. If it uses a similar amount or more, it is the compiled computation graphs that fill up the RAM.
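
A minimal sketch of that test, reusing the hypothetical `histos` vector from the setup sketch above:

```cpp
// Instead of ROOT::RDF::RunGraphs(handles), trigger the event loops
// sequentially, so each graph's per-thread histogram copies can be
// merged and released before the next loop starts.
for (auto &h : histos) {
   h.GetValue(); // runs this graph's event loop (if it has not run yet)
}
```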