Memory usage and large DataFrames

sbrommer · February 15, 2024, 10:46am

ROOT Version: 6.30/02
Platform: CentOS7
Compiler: g++ 11.3.0

Hello everyone,

I have a more conceptual question concerning RDataFrame. I am one of the developers of CROWN (KIT-CMS/CROWN on GitHub: C++ based ROOT Workflow for N-tuples), where we use a configuration to automatically generate C++ code that is then compiled. The heavy lifting is done by a single, large RDataFrame. Most functions in the framework are optimized and do not use any JITting.

Now we are at a point where one of our users is running a DF that is hard to handle on our HTCondor system. This DF contains a large number of Defines and outputs, which results in a memory consumption of about 40 GB when running with 4 threads:

Total Number of Defines: 28620 
Total Number of Outputs: 21778

I think it is very nice that DF can handle something like this; however, the memory footprint is worrisome for large-scale ntuple productions.

It is possible to split the task into multiple DFs by hand, but I was thinking about ways to reduce the overall memory usage from the framework side, so that users do not have to optimize this themselves. One idea I had was to split the DF into multiple DFs that generate intermediate output files. This would cost a significant amount of performance, since events have to be processed multiple times, but it would allow the overall task to succeed with less memory.
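To make the splitting idea a bit more concrete, here is a minimal sketch of the bookkeeping part only: cutting the list of output columns into batches, each of which would be handled by its own, smaller DF. The function name `split_outputs` and the cap per batch are hypothetical; this is not an existing CROWN or ROOT feature, and it assumes that each smaller DF would book only the Defines its own outputs need.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split the list of output column names into
// consecutive batches of at most max_per_batch names. The idea is that each
// batch is written by a separate, smaller RDataFrame job; the ROOT part
// itself is not shown here.
std::vector<std::vector<std::string>>
split_outputs(const std::vector<std::string>& outputs, std::size_t max_per_batch)
{
    assert(max_per_batch > 0 && "batch size must be positive");

    std::vector<std::vector<std::string>> batches;
    for (std::size_t begin = 0; begin < outputs.size(); begin += max_per_batch) {
        // The last batch may be shorter than max_per_batch.
        const std::size_t end = std::min(begin + max_per_batch, outputs.size());
        batches.emplace_back(
            outputs.begin() + static_cast<std::ptrdiff_t>(begin),
            outputs.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```

With the 21778 outputs quoted above and an arbitrary cap of 5000 outputs per job, this gives five batches, the last one holding 1778 names. Whether this actually lowers the peak memory depends on how many Defines are shared between batches, since shared Defines would either be recomputed in every job or have to be written to an intermediate file first.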

Are there any other solutions you can think of that might help to reduce the memory usage of such a DF? Are there perhaps some built-in mechanisms that might help here?

Best Wishes
Sebastian

[1] The code of the main executable can be found here: PrivateBin

mczurylo · February 15, 2024, 2:38pm

Hi @sbrommer,

thanks for reaching out!

We’ve recently heard of similar issues from other framework developers, but we don’t have a solution just yet. We’ll take a deeper look in the coming days and get back to you. If you don’t hear from us, please ping me here.

Cheers,
Marta

sbrommer · February 22, 2024, 9:25am

Hi,

Is there any news yet, or is there something I can assist you with?

Best Wishes
Sebastian

mczurylo · February 22, 2024, 10:55am

Hi @sbrommer,

unfortunately, we haven’t had a chance to get closer to a solution yet. We opened a GitHub issue here: Large computation graphs cause serious memory and runtime overhead · Issue #14510 · root-project/root · GitHub, but the solution is non-trivial.

Also, thank you for bearing with us and for offering to help. We might ping you to test the new solution once we have it.

Cheers,
Marta

sbrommer · February 22, 2024, 12:15pm

Hi,

thanks for the quick update! I will keep watching the ticket. Please do not hesitate to ask me to test potential improvements within our framework.

Best Wishes
Sebastian

system · March 7, 2024, 12:15pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.