JIT compiling RDataFrame filters and defines within HTCondor

Dear experts,

My question is somewhat related to a few recent posts about JIT compiling RDataFrame filters and defines and memory hogging. I’m building a configurable framework where the cuts to apply and the input datasets to use are set via JSON files given at runtime. The code can be packaged and sent to HTCondor, with each job receiving a different set of cut and dataset JSONs to better parallelize things.

However, I’m running into the issue of memory usage growing with each dataset my code processes over the lifetime of a job. I believe this is because I create a new RDF instance for each dataset inside a given HTCondor job, which is deleted once that dataset is finished. The cuts and defines therefore get JIT-compiled every time, and because that code is not released even after the RDF object is deleted, it accumulates and uses up a significant chunk of the available memory (although this has improved quite significantly with the latest 6.22 release). Barring other limitations such as disk I/O, I think I could raise the thread count of my RDF object and speed things up if I weren’t so limited by this memory issue (though I’m not sure about that). By trial and error I find that 4 threads is currently the highest I can go without exceeding the default 2 GB HTCondor memory allocation.
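
In pseudo-form, the processing loop looks roughly like this (a minimal sketch; the tree name, column names and cut strings are placeholders for what actually comes from the JSON configs):

```cpp
#include <ROOT/RDataFrame.hxx>
#include <iostream>
#include <string>
#include <vector>

// Minimal sketch of the pattern described above: one RDataFrame per dataset,
// with the cuts applied as string expressions taken from the JSON config.
void ProcessAll(const std::vector<std::string> &datasets,
                const std::vector<std::string> &cuts)
{
   for (const auto &file : datasets) {
      ROOT::RDataFrame df("Events", file); // tree name is a placeholder
      ROOT::RDF::RNode node(df);
      for (const auto &cut : cuts)
         node = node.Filter(cut); // string expression: jitted by cling
      std::cout << file << ": " << *node.Count() << " events pass\n";
      // df and node are destroyed here, but the jitted code is not unloaded
   }
}
```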

My question is whether I could JIT-compile the cuts only once at the beginning of the whole execution, according to whatever config the user provides, and then simply pass those compiled functions to each RDF object instead. This would avoid triggering the JIT for every dataset while still allowing runtime customization. But I’m not seeing how I could go about doing that. I hope this belabored explanation of what I’m trying to do is clear enough.
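
For illustration, this is what I mean by passing compiled functions: RDataFrame also accepts typed C++ callables plus the list of columns they read, which involves no jitting at all. The catch, as far as I can tell, is that the cut logic then has to exist at compile time, while mine comes from the JSON config (the column names and cut values below are made up):

```cpp
#include <ROOT/RDataFrame.hxx>

// Fully compiled Filter/Define: a typed callable plus the columns it reads.
// No string parsing and no cling jitting, but the logic is fixed at compile
// time, so it cannot be assembled from a runtime JSON config as-is.
void CompiledCuts()
{
   ROOT::RDataFrame df("Events", "data.root"); // placeholder names
   auto pt2 = df.Define("pt2", [](float pt) { return pt * pt; }, {"pt"})
                .Filter([](float pt2) { return pt2 > 900.f; }, {"pt2"});
   auto count = pt2.Count();
}
```

One middle ground might be to compile a fixed set of cut shapes once and only read thresholds from the JSON (captured by the lambdas), but that would restrict what the config can express.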

Is this something possible to do? Or would you recommend some other structuring of my code?

Thanks a lot for any ideas,
Andre



ROOT Version: 6.22/06
Platform: x86_64-centos7-gcc8-opt (LCG99)
Compiler: GCC 8.3.0


Hi Andre,
the explanation is super clear, and I’m happy to hear that the improvements we made in 6.22 are visible.

The amount of memory taken by the jitting does not increase with the number of threads though, and all per-thread memory allocations (e.g. thread-local histograms that are merged into a final histogram result) are freed when the dataframe is destructed.
It would be great if you could run a test program under valgrind --tool=massif, which produces an output file (massif.out.<pid>, readable with ms_print) that lists exactly who is allocating how much during execution. It will also slow down execution massively, so don’t run a 30-minute job inside valgrind :smiley:.

About pre-jitting the cuts and reusing them: that’s not a thing at the moment. If the same long-running process builds many different dataframes (with string filters/defines), one option might be to do those operations in a sub-process, whose memory is freed when it exits.
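
Very roughly, something like this on Linux (an untested sketch: the tree name, histogram model and output file are illustrative, and the fork should happen before any worker threads exist):

```cpp
#include <ROOT/RDataFrame.hxx>
#include <TFile.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>
#include <string>
#include <vector>

// Run the jitting and the event loop for one dataset in a child process.
// When the child exits, all of its memory, including the code jitted by
// cling, is returned to the operating system.
void ProcessInChild(const std::string &dataset,
                    const std::vector<std::string> &cuts,
                    const std::string &outfile)
{
   const pid_t pid = fork();
   if (pid < 0)
      return;      // fork failed
   if (pid == 0) { // child: all jitting happens here
      ROOT::RDataFrame df("Events", dataset);
      ROOT::RDF::RNode node(df);
      for (const auto &cut : cuts)
         node = node.Filter(cut);
      auto h = node.Histo1D({"h_pt", ";p_{T};events", 100, 0., 500.}, "pt");
      TFile f(outfile.c_str(), "RECREATE");
      h->Write(); // triggers the event loop, then writes the result
      f.Close();
      std::_Exit(0); // exit without running static destructors
   }
   int status = 0;
   waitpid(pid, &status, 0); // parent: just wait for the result file
}
```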

I will think about this a bit more; let’s see :smiley: It would be nice to simplify this use case.

Cheers,
Enrico

Hi Enrico,

Thanks very much for the quick reply, and for giving this some thought.

I had run the code through valgrind before and didn’t find anything very conclusive, but that may be because I’m not too familiar with the tool. I can definitely give it another try though.

Regarding the multi-threading and jitting, I mostly meant that overall memory use increases with the number of threads (not because of jitting per se). With a 2 GB budget, every time the cuts are recompiled for a new dataset more memory is used up, leaving less for the threads and increasing the chance of going over the HTCondor budget, i.e. the constraint is roughly budget >= memory_per_thread * N_threads + M_datasets * JIT_memory_use. At least that’s my understanding of it. So more JIT recompilations mean fewer threads per HTCondor job.

The subprocess idea could be interesting. I guess I would then have to use a ROOT file for inter-process communication instead? I.e. the subprocess with the RDF computes everything for a given dataset and saves the results to a ROOT file, which then gets read and aggregated by the parent process. Is that what you had in mind?
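
For concreteness, on the caller side I’m picturing something like this (a sketch; the histogram name h_pt and the file list are made up):

```cpp
#include <TFile.h>
#include <TH1D.h>
#include <memory>
#include <string>
#include <vector>

// Read each per-dataset result file written by a subprocess and merge the
// histograms into a single total in the parent process.
std::unique_ptr<TH1D> Aggregate(const std::vector<std::string> &partialFiles)
{
   std::unique_ptr<TH1D> total;
   for (const auto &fname : partialFiles) {
      TFile f(fname.c_str());
      auto *h = f.Get<TH1D>("h_pt"); // name must match what the child wrote
      if (!h)
         continue;
      if (!total) {
         total.reset(static_cast<TH1D *>(h->Clone()));
         total->SetDirectory(nullptr); // keep the clone when the file closes
      } else {
         total->Add(h);
      }
   }
   return total;
}
```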

Cheers,
Andre

valgrind --tool=massif is what we want, not the default memcheck tool. It should say who’s allocating how much; I am mainly interested to see whether the culprit is cling, RDF, TFile, TTree, or something else.

Yes, that’s precisely it (unfortunately in your case it seems to be a <= :stuck_out_tongue:).

I wasn’t sure how (or whether) you were aggregating the results returned by each RDF :sweat_smile: Yes, an intermediate file would do the job.

I do realize this memory hogging is undesirable for an application such as yours. I will look into what other options there are; maybe if we ask very, very nicely the interpreter can unload that code from memory, but it’s probably not trivial to implement.
