RDataFrame+cling memory usage

Hi,

When using RDataFrame with JITted nodes, I see the memory usage grow by (roughly) 0.7MB per histogram. I saw that this was also discussed in How to delete RDataFrame and clean up memory, and from that thread and the answer in RDataFrame Foreach causing memory leak I gather that this is due to cling keeping the AST in memory (this is also confirmed by profiling: most of the space in use is allocated by llvm and clang symbols), and that it is being worked on. Is there a time estimate, or a Jira ticket I could follow? (I'm sorry for opening a new thread for this, but the other ones are closed.)
In analysis use cases, filling thousands of histograms in one go is not uncommon, and having a smaller memory footprint makes quite a difference for the turnaround time on a batch system. It is possible to organise analysis code such that not too many histograms are made in one loop, but the lower the practical limit, the more work that becomes, so it does matter whether that is at 2000, 5000, or 10000.
Please let me know if there are any more performance numbers I can provide, or checks that I can do, to help.

Thanks in advance,
Pieter


ROOT Version: 6.18/04 (through LCG_96bpython3, x86_64-centos7-gcc9-opt)
Platform: CentOS 7
Compiler: GCC 9.2.0



I guess @eguiraud can help.

Hi Pieter,

I'm not sure this is being worked on. It's a conceptual problem: clang thinks that it's a compiler that will exit after being done compiling the current source file, so it doesn't have to give back any memory.
In JIT-ed RDF nodes, we are "abusing" this compiler a bit, because we don't let it exit when it's done compiling. The memory it allocates will therefore not be given back.
Unless @Axel has another idea, the only solution I can offer is to reduce JITting to a minimum if the memory footprint is an issue.
Which function are you using to create histograms? Maybe there is a JIT-free version of it? At a first look, I didn't see JITting when creating histograms.
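For reference, here is a minimal sketch contrasting the JIT-ed and JIT-free call styles (tree, file and column names are invented):

#include <ROOT/RDataFrame.hxx>

void jitVsCompiled() {
  ROOT::RDataFrame df("Events", "file.root");
  // JIT-ed: the column type is inferred at runtime, so cling compiles glue code in-process
  auto hJit = df.Histo1D({"hJit", "jitted", 64u, 0., 128.}, "myColumn");
  auto cutJit = df.Filter("myColumn > 0.");
  // JIT-free: explicit template argument / C++ callable, nothing left for cling to compile
  auto hCompiled = df.Histo1D<double>({"hCompiled", "compiled", 64u, 0., 128.}, "myColumn");
  auto cutCompiled = df.Filter([](double x) { return x > 0.; }, {"myColumn"});
}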

Thanks for the answer @StephanH!
In the bamboo framework (repository, documentation; for an overview see this talk) I rely quite heavily on the JIT to provide an (even more) high-level python interface on top of RDataFrame: currently, code strings are passed to the RDataFrame methods. With the latest PyROOT features this could probably be done differently, but I think I would then either end up calling into python code, which would be slower, or calling the same JIT from python, which would run into the same issue. So within that context, avoiding the JIT is not really an option. (To be fair, there is also overhead from the python layer, so this is not the ultimate low-memory approach, but in the test I did, 70% of the memory was allocated after calling the first RResultPtr<TH1F>->Write(), so at least that part comes from RDataFrame/cling.)
Up to a few thousand histograms this works really well, but when including systematic variations one gets there quickly. I think it is workable like this, but one then needs to be very careful with the size of the graph when writing analysis code.
I am not familiar with the architecture of clang/llvm, but I suppose that, in principle, after the JIT is done compiling, only the equivalent of the output binary is needed (which should be much smaller).

Hi Pieter,

maybe you can reduce the amount of JITting by defining a C++ function once, something like

template <typename RDF>
ROOT::RDF::RResultPtr<TH1D> makeHistogramDouble(RDF& df) {
  // the explicit <double> template argument avoids JITting the column type
  return df.template Histo1D<double>({"histName", "histTitle", 64u, 0., 128.}, "myColumn");
}

Note that if you leave out the template argument, JITting is required, see here.
If you JIT this once at the beginning using ROOT.gInterpreter.Declare(), or load a library where it has been defined, you might be able to get most of the JITting out of the way. You can then call this factory function from python to make histograms. I'm not sure if there's more JITting going on for converting the arguments, but it's worth a try. I'm curious whether that actually works, so please report back if you give it a try. :)

For Filter and Define it's hopefully ok to JIT a bit. If it's really the histograms causing the problems, you might get a step further.

About the point that, in principle, only the equivalent of the output binary should be needed:

You are right in principle. However, we are using clang/llvm in a way that wasn't foreseen by the developers. They optimised their memory management to hog, i.e. always grow and never give back until the end of the process. It's faster like this for compiling, but it's obviously bad if you keep the process running. Again, @Axel might know more, but I think that we cannot change such a fundamental design decision in clang/llvm.

Hi Stephan,

Thanks for the suggestion, using factory methods like these

template<typename RDF, typename VAR1, typename WEIGHT>
ROOT::RDF::RResultPtr<TH1D> Histo1D(RDF& rdf, const ROOT::RDF::TH1DModel& model, std::string_view vName, std::string_view wName)
{
    return rdf.template Histo1D<VAR1,WEIGHT>(model, vName, wName);
}
template<typename RDF, typename VAR1>
ROOT::RDF::RResultPtr<TH1D> Histo1D(RDF& rdf, const ROOT::RDF::TH1DModel& model, std::string_view vName)
{
    return rdf.template Histo1D<VAR1>(model, vName);
}

works indeed (directly instantiating the template member methods from python does not, maybe because there is confusion between the template and non-template overloads), and I see a small speed improvement for the event loop. For the overall memory, however, there is almost no difference: 12.85MB out of 2565.11MB (166.48MB more is allocated while defining the graph, but almost the same amount less while evaluating it; the numbers also change a bit from run to run, so I don't know if that is significant). I am caching pointers to the few instantiations of these that I need, but I think PyROOT or cling does something like that anyway.
The graph I have been using for these tests also has 634 Define nodes and 90 Filter nodes (and about 40 methods of a few lines that are compiled with gInterpreter.Declare and used from the Define strings), so maybe the Histo1D methods were not the dominant factor after all, despite their large number (I quote these numbers to give an idea of the size of the graph).
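To illustrate the pattern (helper name, formula and column names are invented for the example), in C++ terms one of those declared methods together with a JIT-ed Define looks roughly like this:

#include <ROOT/RDataFrame.hxx>
#include <TInterpreter.h>

void declareAndUse() {
  // a small helper compiled once through the interpreter...
  gInterpreter->Declare(
      "#include <cmath>\n"
      "float transverseMass(float pt1, float pt2, float dphi) {\n"
      "  return std::sqrt(2.f * pt1 * pt2 * (1.f - std::cos(dphi)));\n"
      "}");
  // ...and then referenced from a JIT-ed Define string
  ROOT::RDataFrame df("Events", "file.root");
  auto df2 = df.Define("mt_ll", "transverseMass(l1_pt, l2_pt, dphi_ll)");
}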
I will try to gather a bit more information on which calls exactly are contributing most, and also compare with a different, similarly-sized graph, from a colleague.

We are trying to combine as many JIT-ed code snippets as possible into one big piece of code. That already reduces the allocations quite a lot.
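A user-side analogue of the same idea (only a sketch, with invented helpers) is to hand cling one big declaration instead of many small ones:

#include <TInterpreter.h>

void declareAllHelpers() {
  // one interpreter transaction for many declarations instead of one call per helper
  gInterpreter->Declare(
      "float weightA(float x) { return 2.f * x; }\n"
      "float weightB(float x) { return x * x; }\n"
      // ... all remaining helpers appended to the same string ...
      "float weightC(float x) { return x + 1.f; }");
}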

Sharing a massif profile with us might help! You can get one by running valgrind, see https://valgrind.org/docs/manual/ms-manual.html - please use a low --threshold parameter, e.g. --threshold=0.1.

Thanks @Axel! I collected a few massif profiles in this directory (massif.out.40040 corresponds to the job discussed before in the thread, massif.out.31442 is a more extreme case; both with the explicit template instantiations for Histo1D proposed by @StephanH). Please let me know if any other information can help.

Hi Pieter!

I've printed your massif profiles, see the relevant parts attached!

massif.out.40040.msprint.relevant.txt (44.4 KB) massif.out.31442.msprint.relevant.txt (55.4 KB)

Short analysis of 31442 with a memory footprint of 6.8 GB:

  • 50% data from histograms (bin counts and so on)
  • 12% Python strings
  • 25% "jitting" (llvm and clang functions)
  • 11% under the threshold of 1%

Short analysis of 40040 with a memory footprint of 6 GB:

  • 26% data from histograms (bin counts and so on)
  • 7% Python strings
  • 43% "jitting" (llvm and clang functions)
  • 13% under the threshold of 1%

These are only rough numbers, so they do not always add up to 100% ;)

What can we learn from this? E.g., what is the difference between the software in 31442 and 40040? Do you use jitting more extensively in 40040?

Best
Stefan

Hi Stefan,

Thanks! I realised I didn't pass the massif files I intended to, I'm sorry - but these are also interesting: they fill the same histograms, but with a graph with less duplication (and far less JITting) in the latter case. It's nice that the profiling data confirm this (outside valgrind I saw a larger difference in the overall memory usage, though: about 30% more for 40040).
Based on your feedback I looked into the python strings, and I found some optimisations that I could do in my python code (they moved to 'below threshold', and I also got rid of some other python objects) - that's a nice byproduct of this investigation :) (for completeness, the massif profile after that is in the same directory, as massif_ZA_MR80.out). Another useful insight from your files is that 2D histograms are quite expensive memory-wise (not really a surprise, but good to keep in mind); next to that, the overhead from JITting is not the only main contribution, but it is still one of the two, so any improvement there would be welcome.

Thanks,
Pieter

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.