PyROOT: Running RDataFrame event loop in subprocess (to avoid JIT memory issue) with EnableImplicitMT()

joeycarter · October 29, 2021, 2:16pm

Hi there,

I’m using RDataFrames in PyROOT for my analysis to read in ntuples and fill a number of histograms. In most cases I use a pre-compiled library to apply my filters and new-column definitions; in some cases, however, it was more convenient to apply these actions as string expressions, resulting in just-in-time compilations. Because of the large number of samples I have, this JITing resulted in the memory-hog issue described elsewhere, e.g. JIT Compilation Memory Hogs in RDataFrame.

To fix this memory issue, I run the RDataFrame event loop in a new process so that all the memory consumed by the JIT compilation is released once the event loop finishes. I define the histograms in the parent process and trigger the event loop in the child process. This is a condensed version of what I am doing:

import ROOT
import multiprocessing

ROOT.ROOT.EnableImplicitMT()

def fill_hists(df, hist_args):
    df_local = df.Filter("...", "...").Define("...", "...")
    rhists = []
    for args in hist_args:
        rhists.append(df_local.Histo1D(*args))

    with multiprocessing.Manager() as manager:
        results = manager.dict()
        proc = multiprocessing.Process(
            target=fill_hists_worker,
            args=(results, rhists)
        )
        proc.start()
        proc.join()

def fill_hists_worker(results, rhists):
    for rhist in rhists:
        hist = rhist.GetValue()  # the first GetValue() call triggers the event loop
        results[hist.GetName()] = hist

However, when I trigger the event loop in a new process in this way, the execution time to fill the histograms increases by a factor roughly equal to the number of CPU cores I have, leading me to believe it is running the event loop in a single thread despite having called ROOT.ROOT.EnableImplicitMT(). The alternative, as mentioned elsewhere, is to use a multiprocessing.Pool() instead of a single process, but I would have thought this would run a separate event loop in each process in the pool, which is not ideal since only a single event loop is necessary.

Is it possible to use ROOT’s implicit MT in a Python multiprocessing.Process()? And if not, what would the solution be to run a single event loop in parallel while still avoiding the JIT memory-hogging issue?

NOTE: I am using ROOT v6.20/04, since this is the latest version available on my local cluster.

Thanks,
Joey

ROOT Version: 6.20/04
Platform: x86_64 (Red Hat 4.8.5-44)
Compiler: gcc 9.3.0

eguiraud · October 29, 2021, 2:24pm

Hi Joey,
that’s interesting, I am not sure what’s happening, I’d like to try with a more recent ROOT version in which RDF has verbose logging available; that should give us a clear picture of what’s going on.

Could you please provide a self-contained reproducer that I can run? Basically a filled-in version of what you have above that still causes the problem. As input we can use a fake dataset produced with RDataFrame(100000).Define(...).Define(...).Snapshot().

In case this is easy to try, it might also be worth checking what happens if you do everything in the subprocess, including calling ROOT.EnableImplicitMT and the Filter, Define, Histo1D calls.
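A minimal sketch of that "everything in the subprocess" variant could look like the following. The names run_in_child, fill_hists_in_child, tree_name and file_glob are hypothetical and not taken from the attached scripts, the Filter/Define steps are left out, and the "fork" start method is an assumption that only holds on Linux/Unix. Returning plain bin contents instead of the histogram objects is also just one possible choice, made here so that nothing ROOT-specific has to be pickled.

```python
import multiprocessing


def _child_entry(queue, func, args):
    """Run func(*args) in the child and send back the result or the error."""
    try:
        queue.put(("ok", func(*args)))
    except Exception as exc:  # report failures instead of leaving the parent waiting
        queue.put(("error", repr(exc)))


def run_in_child(func, *args):
    """Run func(*args) in a fresh process; its memory is released when it exits."""
    ctx = multiprocessing.get_context("fork")  # assumption: Linux/Unix only
    queue = ctx.Queue()
    proc = ctx.Process(target=_child_entry, args=(queue, func, args))
    proc.start()
    status, payload = queue.get()  # read before join() so a large result cannot block
    proc.join()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def fill_hists_in_child(tree_name, file_glob, hist_args):
    """Hypothetical worker: every ROOT call happens inside the child process."""
    import ROOT  # imported here so the parent process never loads ROOT

    ROOT.ROOT.EnableImplicitMT()
    df = ROOT.RDataFrame(tree_name, file_glob)
    booked = [df.Histo1D(*args) for args in hist_args]  # lazy: no event loop yet
    results = {}
    for rhist in booked:
        hist = rhist.GetValue()  # the first call triggers the event loop
        # Send back plain Python data (bin contents, without under/overflow).
        results[hist.GetName()] = [
            hist.GetBinContent(i) for i in range(1, hist.GetNbinsX() + 1)
        ]
    return results
```

It would be called as run_in_child(fill_hists_in_child, tree_name, file_glob, hist_args), with hist_args in the same form as in the snippet above.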

Cheers,
Enrico

joeycarter · October 29, 2021, 9:09pm

Hi Enrico,

I’ve prepared a few minimal examples. I ran these tests on my own machine, where I have v6.24/00 installed, and already I see some performance improvements with respect to v6.20/04 (both in CPU time and memory usage). I’ll ask the sysadmins at my local cluster to install a more recent version of ROOT…

In the meantime, I think it’s still worthwhile to look at a few test cases. To begin, the script create_test_dataset.py creates a test dataset with random data split over 10 ROOT files.
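For readers without the attachment, a dataset of this kind can be sketched roughly as below. This is not the attached create_test_dataset.py: the file naming scheme, the tree name and the column names x and y are assumptions made for illustration, and the sketch follows the RDataFrame(N).Define(...).Snapshot() recipe suggested earlier in the thread.

```python
def dataset_file_names(n_files, prefix="test_data"):
    """Hypothetical naming scheme: test_data_0.root, test_data_1.root, ..."""
    return [f"{prefix}_{i}.root" for i in range(n_files)]


def create_test_dataset(n_files=10, n_events=100000, tree_name="tree"):
    """Write n_files ROOT files, each holding n_events random entries."""
    import ROOT  # imported lazily; needs a working PyROOT installation

    # Implicit MT is deliberately not enabled here: gRandom is a single
    # global generator and is not meant to be shared between threads.
    names = dataset_file_names(n_files)
    for name in names:
        (
            ROOT.RDataFrame(n_events)
            .Define("x", "gRandom->Gaus(0, 1)")
            .Define("y", "gRandom->Rndm()")
            .Snapshot(tree_name, name)  # Snapshot runs the event loop right away
        )
    return names
```

The Define strings are themselves jitted, which is fine for a one-off generator script.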

The first test, rdataframe_test1.py, is the baseline case, where I define the histograms and run the RDataFrame event loop in the main process with ROOT.ROOT.EnableImplicitMT(). I used the memory-profiler Python package to profile the program, first with a single thread, and then again with 8:

$ mprof run rdataframe_test1.py && mprof plot

$ mprof run rdataframe_test1.py -j8 && mprof plot

You can see the memory usage creep up over time (bad), but more threads = faster execution (good!).

In the second test, rdataframe_test2.py, I call ROOT.ROOT.EnableImplicitMT() and define the histograms in the parent process, and run the RDataFrame event loop in the child process.

$ mprof run --include-children rdataframe_test2.py && mprof plot

$ mprof run --include-children rdataframe_test2.py -j8 && mprof plot

Here, while the peak memory usage is greater, it does not increase over time (good!). However, the execution times are worse compared to the baseline test (I assume due to the overhead of creating the child processes), and even using 8 threads gives only a modest improvement (although it is an improvement nonetheless, contrary to what I had originally thought—this may be due to the ROOT version or some quirk in resource allocation on my local cluster).
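As a quick cross-check without plots, the peak memory of the child can also be read from the standard library. This is only a sketch: run_and_report is a hypothetical helper, the "fork" start method assumes Linux/Unix, and the value is a single peak number (kilobytes on Linux) rather than the usage-over-time curve that mprof gives.

```python
import multiprocessing
import resource


def peak_child_rss_kb():
    """Largest resident set size among finished child processes (kB on Linux)."""
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


def run_and_report(target, *args):
    """Run target(*args) in a child process; return its exit code and the peak RSS."""
    ctx = multiprocessing.get_context("fork")  # assumption: Linux/Unix only
    proc = ctx.Process(target=target, args=args)
    proc.start()
    proc.join()  # the child's memory goes back to the OS once it has exited
    return proc.exitcode, peak_child_rss_kb()
```

Note that the reported value is the maximum over all children that have been waited for so far, so it does not reset between successive calls.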

Finally, in rdataframe_test3.py, I call ROOT.ROOT.EnableImplicitMT(), create the RDataFrame, define the histograms and run the event loop all in the child process.

$ mprof run --include-children rdataframe_test3.py && mprof plot

$ mprof run --include-children rdataframe_test3.py -j8 && mprof plot

Here, the memory usage is about the same as in test #2, but execution is slower, so clearly we get no benefit from doing it this way.

I think the take-aways from these tests are:

1. Update to a more recent version of ROOT (e.g. v6.24).
2. Define the histograms in the main process and run the RDataFrame event loops in a child process (take a hit in execution time, but no increase in memory usage over time).

Would you agree?

Thanks,
Joey

create_test_dataset.py (881 Bytes)
rdataframe_test1.py (2.4 KB)
rdataframe_test2.py (2.7 KB)
rdataframe_test3.py (2.7 KB)

eguiraud · November 1, 2021, 12:35pm

Hi,
I agree with everything, with some extra observations:

- If you can wait to call the first GetValue until all the fill_hists calls have been performed (or whatever the equivalent is in your real code), jitting will happen once for all computation graphs. It might or might not help: you should not have a memory creep anymore, but baseline memory usage will be higher.
- The extra cost due to running things in the subprocess should not depend on the size of the dataset, so it should become less and less important as the size of the dataset increases.
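The first point can be sketched as a "book everything, then run" pattern. The helper name book_all_then_run is hypothetical, and hist_args is assumed to have the same form as in the original snippet (arguments that are unpacked into Histo1D).

```python
def book_all_then_run(dataframes, hist_args):
    """Book every histogram on every dataframe before triggering any event loop."""
    booked = []
    for df in dataframes:
        for args in hist_args:
            booked.append(df.Histo1D(*args))  # lazy booking, nothing runs yet
    # Results are only touched once everything is booked, so the first
    # GetValue() call is also the first time an event loop is triggered.
    return [rhist.GetValue() for rhist in booked]
```

Each dataframe still runs its own event loop; the only difference from calling GetValue inside the booking loop is that no result is requested until all computation graphs are known.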

Cheers,
Enrico

joeycarter · November 1, 2021, 1:47pm

Hi Enrico,

Thanks for the additional information, this is good to know. However, I think given how large my datasets are (~400 GB in total), and how many there are (~100 in total, and I would need a separate RDataFrame for each of them), in the end I’ll get the best performance if I can do away with RDataFrame calls that require jitting altogether and re-write the equivalent of fill_hists() in a C++ library that I can pre-compile. I know my ntuples, so there really isn’t a need to infer data types, for example.

Thanks for your help!

Cheers,
Joey

eguiraud · November 1, 2021, 2:02pm

That’s true at the moment. FYI we are working on reducing the performance penalty of jitted code (to zero, if everything goes well) inside the event loop, but the memory cost and some start-up cost will always be there compared to pre-compiled code.

Cheers,
Enrico
