Memory hoarding with RDFs

Hi,

I am running some code using RDFs with ROOT 6.22/06.

The analysis code I am working on is meant to sklim some ROOT n-tuples, adding some newly computed branches and removing others. I am trying to do this sklimming for all trees in the ROOT file (~200 trees) corresponding to systematic variations. The code takes in one input file at a time. We then loop over the trees, creating a dataframe for each tree and snapshotting it to an output file. All our input trees have the same branches, and we dump the trees to one output file.

We have noticed that the code hoards memory for the duration of the loop. Upon looking at valgrind output, we saw that the memory consumption is not driven by leaks in our own code – most leaks are in ROOT libraries. With the help of the ATLAS physics analysis helpdesk, we were advised that we hit a limitation of RDFs: cling state is not reset after every tree is processed, which means memory is not really freed after we call the Snapshot() method.

We try to minimise this hoarding by modifying the dataframe for each tree in-place whenever we add a group of columns to it or filter it.

We were wondering if you have any advice on how to get around this issue?

I guess @eguiraud or @vpadulan might help.

Hi @MoAly98 ,

Thanks for reaching out. I don’t know whether that is really a leak: you are within the same process, so there is always going to be only one interpreter, and code you declare to the interpreter stays there until the end of the application. In fact, this allows RDF to potentially cache functions whose bodies have already been “seen” before.

and we dump the trees to one output file.

I don’t know whether that also means that you are keeping the trees alive throughout the application, which will naturally require more memory.

We try to minimise this hoarding by modifying the dataframe for each tree in-place whenever we add a group of columns to it or filter it.

I’m not sure about this: do you mean that you always call new operations on the same RDataFrame object (one per tree)? What were you doing before? I don’t see alternatives here, that’s why I ask.

Cheers,
Vincenzo

Hi @vpadulan,

Thanks a lot for the quick reply!

Maybe I can provide some pseudo-code:

std::vector<std::string> skimmedTreenames = {"tree1", "tree2"};
for (unsigned int i = 0; i < skimmedTreenames.size(); i++) {
   TChain *tree = new TChain(TStringIt(skimmedTreenames[i]));
   int chained = tree->Add(TStringIt(inpaths[f]).Data());
   if (!tree->GetEntries()) continue;
   ROOT::RDataFrame df(*tree);
   ROOT::RDF::RNode d = df;
   ROOT::RDF::RSnapshotOptions opts;
   opts.fMode = "UPDATE";
   d = AddColumns(d);
   d = d.Filter("cutvar>10");
   d = AddMoreColumns(d);
   d.Snapshot(TStringIt(skimmedTreenames[i]).Data(), fout, {"column1", "column2", ...}, opts);
   delete tree;
}

(TStringIt() is just a helper that converts an std::string to a TString)

From this code, I see consumption graphs such as the one attached (from massif).

Regarding the in-place part: in the past we were defining a new dataframe every time we added more columns, which was obviously not very efficient :slight_smile:

Hi @MoAly98 ,

Other than suggesting that you avoid creating the TChain on the heap and just declare it on the stack, like you already do for the RDataFrame, I don’t see anything inherently faulty in your logic. If I understand correctly, you need to keep the trees separated in the output file, so you can’t create a single RDataFrame that processes all the trees together. Thus, every RDataFrame you create at each iteration of the loop will internally have to JIT its computation graph, which probably accounts for the increase in memory consumption you see. I guess the functions AddColumns and AddMoreColumns have to do different things depending on the input tree? I would expect some caching of the functions that RDF needs to JIT when they have the same body, but maybe that is already the case and the increase in memory still stands.

Cheers,
Vincenzo

Ah, another very small detail: you could just pass skimmedTreenames[i].c_str() to the TChain API instead of converting to TString first. It shouldn’t change anything, other than avoiding the extra conversion.
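
Putting the two tips together, the top of your loop could look something like this (just a sketch based on your pseudo-code, reusing your inpaths and f):

// stack-allocated TChain: no manual delete needed, and nothing is
// leaked when we `continue` on an empty chain
TChain tree(skimmedTreenames[i].c_str());
tree.Add(inpaths[f].c_str()); // c_str() directly, no TString round-trip
if (!tree.GetEntries()) continue;
ROOT::RDataFrame df(tree);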

Hi @vpadulan

Thanks a lot for the advice, kept note of it :slight_smile:

The functions AddColumns and AddMoreColumns exist mostly for modularisation reasons. I wanted to separate, for example, adding columns for object kinematics from event-level variables, weights, etc… Also, some columns are needed to run a filter, so I define them, filter, then add some more on the reduced dataset.

I do need to keep the trees separated, since they are systematics trees.

Maybe I misunderstood your reply, but does that mean there is no way around this, except running the code once per file per tree in the dataset? I am quite surprised by this, and by the fact that nobody complained before haha :smiley: When I ran this on the grid, consumption reached 6 GB before my job got killed, so the accumulation is quite significant and is completely stopping us from proceeding with the production.

TL;DR

  • switch to a more recent ROOT version, it might make a difference
  • if that’s not enough, run each iteration of the loop in its own sub-process
  • you can also pre-compile a fully typed C++ function that does your Define/Filter/Snapshot calls (and then call that from Python) so that they don’t need to be just-in-time-compiled

Hi @MoAly98 , @vpadulan ,

let me add my two cents. As mentioned above, the issue is likely due to the ROOT interpreter (cling) allocating memory every time some C++ code needs to be just-in-time-compiled and never releasing it (that’s a design decision in LLVM itself, not something that’s easy to fix at the level of cling or RDataFrame).

With that said, could you please check how things look with v6.26.10 or v6.28.00? Although some level of memory hoarding will still be present, the situation might be better.

If that does not help, then the simplest solution is to just run each iteration of the loop over trees in its own sub-process, in order to sandbox its memory allocations (or, say, 10 iterations of the loop per sub-process in order to amortize the cost of process initialization/teardown).
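
For illustration, a driver along these lines could do it. This is just a minimal sketch, assuming a hypothetical ./skim executable that processes the trees with indices in [first, last) of a given input file:

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>

int main()
{
   const std::size_t nTrees = 200;   // e.g. the ~200 systematic trees in one file
   const std::size_t chunkSize = 10; // trees processed per sub-process
   for (std::size_t first = 0; first < nTrees; first += chunkSize) {
      const std::size_t last = std::min(first + chunkSize, nTrees);
      // each chunk runs in its own process, so the memory cling hoards is
      // returned to the system when the child process exits
      const std::string cmd =
         "./skim input.root " + std::to_string(first) + " " + std::to_string(last);
      if (std::system(cmd.c_str()) != 0) {
         std::cerr << "chunk [" << first << ", " << last << ") failed\n";
         return 1;
      }
   }
   return 0;
}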

Please also take a look at Memory leak when processing RDataFrames in python loop - #6 by eguiraud and the discussion there – among other things, it shows another possible mitigation/workaround at the user level: pre-compiling functions that use fully typed Filter/Define/Snapshot calls. E.g. if you have a function like RNode ApplyFilter(RNode df) { return df.Filter([](int x) { return x < 10; }, {"myValue"}); } and you call ROOT.ApplyFilter(df) from Python, RDF does not have to just-in-time-compile anything; if you call df.Filter("x < 10") directly, RDF has to just-in-time-compile the corresponding filter function.
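
Since you work in C++ anyway, here is the same idea as a self-contained sketch (the column name myValue and its int type are just placeholders):

#include <ROOT/RDataFrame.hxx>
#include <iostream>

using ROOT::RDF::RNode;

// fully typed Filter: the lambda signature tells RDF the column type,
// so nothing needs to be just-in-time-compiled for this node
RNode ApplyFilter(RNode df)
{
   return df.Filter([](int x) { return x < 10; }, {"myValue"});
}

int main()
{
   ROOT::RDataFrame df{10};
   // fully typed Define as well: no strings, no JIT
   auto d = df.Define("myValue", [] { return 5; });
   std::cout << *ApplyFilter(d).Count() << " entries passed\n";
}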

I hope (at least some of) this helps!
Cheers,
Enrico

Hi @eguiraud

Thanks a lot for the insight, help and the link!

The good news: switching to ROOT 6.28 reduced the maximum consumption by a factor of ~2. It also sped up the code by a similar factor, so thanks a lot :slight_smile: This may be sufficient, but for big files I have a feeling I am going to regret leaving it at that haha :smiley:

I am not sure I am able to decipher what ROOT is JIT-compiling in my code. I run native C++, not PyROOT, and I don’t use any string definitions except for applying the filter. Can that really use this much memory? All other functions in my code are pre-compiled. I am also surprised (looking at the other post you point to) that cling allocates so much memory for JIT-ed code :open_mouth:

Regarding sub-processes – this is something I have been thinking about as a workaround since I first encountered the issue. I wouldn’t go for one sub-process per iteration, since the memory consumption is not that dramatic; one sub-process for every 20 trees or so should be fine. I have no idea how difficult this would be in C++ code, but I currently (in my non-test setup) execute the C++ code via a Python subprocess with

subprocess.run("./program arg1 arg2", check=True, shell=True)

and I was thinking of switching to Popen and running one Popen for every 20 trees, for example (would this eventually run into the same issue because we parallelise? Can valgrind keep track of subprocesses properly so that I can profile?). Otherwise I can keep using run and just serially process chunks of trees.

Sorry for the ramble, I am trying to think out loud :smiley:

Again, thanks a lot for the help :slight_smile:

Cheers, Mo

nice :grimacing: :grimacing:

There is the Filter with a string expression, and there are actions such as Snapshot(...): the fully typed version is Snapshot<ColType1, ColType2, ColType3, ...> (which of course quickly becomes unwieldy).
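
For example, a sketch assuming just two columns, "x" of type int and "y" of type float:

// fully typed Snapshot: nothing to JIT, but the template arguments
// must match the column types exactly and in order
df.Snapshot<int, float>("outtree", "out.root", {"x", "y"});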

Going fully typed would get rid of the problem permanently. If you run different sub-processes in parallel, memory usage will not keep increasing, but the maximum memory usage will be higher than with sub-processes executed serially.

Cheers,
Enrico

Hi @MoAly98 ,

If you want to be sure you are not JIT-ting anything, you can use the verbose mode of RDF described here.

For example, here’s code with JIT-ting:

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RLogger.hxx>

auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

int main()
{
   ROOT::RDataFrame df{10};
   df.Filter("rdfentry_ > 5").Count().GetValue();
}

Output:

Info in <[ROOT.RDF] Info /home/vpadulan/Programs/rootproject/rootsrc/tree/dataframe/src/RLoopManager.cxx:805 in void ROOT::Detail::RDF::RLoopManager::Run(bool)>: Starting event loop number 0.
Info in <[ROOT.RDF] Info /home/vpadulan/Programs/rootproject/rootsrc/tree/dataframe/src/RLoopManager.cxx:778 in void ROOT::Detail::RDF::RLoopManager::Jit()>: Just-in-time compilation phase completed in 1.044992 seconds.
Info in <[ROOT.RDF] Info /home/vpadulan/Programs/rootproject/rootsrc/tree/dataframe/src/RLoopManager.cxx:830 in void ROOT::Detail::RDF::RLoopManager::Run(bool)>: Finished event loop number 0 (0s CPU, 0.00035882s elapsed).

And code without JIT-ting:

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RLogger.hxx>

auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

int main()
{
   ROOT::RDataFrame df{10};
   df.Filter([](unsigned long long x) {return x > 5;}, {"rdfentry_"}).Count().GetValue();
}

Output:

Info in <[ROOT.RDF] Info /home/vpadulan/Programs/rootproject/rootsrc/tree/dataframe/src/RLoopManager.cxx:805 in void ROOT::Detail::RDF::RLoopManager::Run(bool)>: Starting event loop number 0.
Info in <[ROOT.RDF] Info /home/vpadulan/Programs/rootproject/rootsrc/tree/dataframe/src/RLoopManager.cxx:770 in void ROOT::Detail::RDF::RLoopManager::Jit()>: Nothing to jit and execute.
Info in <[ROOT.RDF] Info /home/vpadulan/Programs/rootproject/rootsrc/tree/dataframe/src/RLoopManager.cxx:830 in void ROOT::Detail::RDF::RLoopManager::Run(bool)>: Finished event loop number 0 (0s CPU, 2.59876e-05s elapsed).

Notice the Nothing to jit and execute. statement. If you increase the verbosity level to ROOT::Experimental::ELogLevel::kDebug+10, you will also see all the code that RDF declares to cling for JIT-ting.
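
For reference, that would be:

// kDebug+10 also prints all the code that RDF declares to cling
auto verbosity = ROOT::Experimental::RLogScopedVerbosity(
   ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kDebug + 10);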

Cheers,
Vincenzo

Hey @vpadulan and @eguiraud !

@vpadulan This is very useful, thanks a lot! I have had a go with it, and it is very clear that the only JIT-ting happening is in the Snapshot() calls, which is reassuring in the sense that the code is written OK.

I have now changed my code around to run subprocesses, and I wanted to outline some of the results here. In these tests I am running over 23 systematic trees:

1- If I run the trees directly using the C++ executable, all in one go, the memory profile is:

2- If I run them using Python subprocesses, serially, in chunks of at most 10 trees (10, 10, 3), the profile is:

which is much better in terms of accumulation of memory! Also, unexpectedly, I see a massive improvement in speed :open_mouth: I am not sure I can explain that, so I would appreciate your insight!

It is worth noting that in the Python script I do some light checks on the file using PyROOT before I pass it to the C++ code. There is a flat region there where nothing is happening. This was intriguing, so I did a bit of digging to see why this memory is kept (my initial guess was that Python needed it). I had a look at the memory usage between 151s and 182s and I find the following:

95.94% (198,508,949B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->24.38% (50,451,900B) in 5445 places, all below massif's threshold (1.00%)
| 
->13.68% (28,307,136B) 0x621B5782: clang::ASTReader::ReadASTBlock(clang::serialization::ModuleFile&, unsigned int) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| ->13.68% (28,307,136B) 0x621DCB6B: clang::ASTReader::ReadAST(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, unsigned int, llvm::SmallVectorImpl<clang::ASTReader::ImportedSubmodule>*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|   ->13.68% (28,307,136B) 0x61DFA5E5: clang::CompilerInstance::findOrCompileModuleAndReadAST(llvm::StringRef, clang::SourceLocation, clang::SourceLocation, bool) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|     ->13.68% (28,307,136B) 0x61DFBF08: clang::CompilerInstance::loadModule(clang::SourceLocation, llvm::ArrayRef<std::pair<clang::IdentifierInfo*, clang::SourceLocation> >, clang::Module::NameVisibilityKind, bool) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|       ->13.68% (28,307,136B) 0x627A8E45: clang::Sema::ActOnModuleImport(clang::SourceLocation, clang::SourceLocation, clang::SourceLocation, llvm::ArrayRef<std::pair<clang::IdentifierInfo*, clang::SourceLocation> >) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|       | ->13.68% (28,307,136B) 0x617A2343: cling::Interpreter::loadModule(clang::Module*, bool) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|       |   ->13.68% (28,307,136B) 0x616BE316: LoadModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::Interpreter&) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|       |     ->13.68% (28,307,136B) 0x616E7748: RegisterCxxModules(cling::Interpreter&) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|       |     | ->13.68% (28,307,136B) 0x616E944D: TCling::TCling(char const*, char const*, char const* const*, void*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|       |     |   ->13.68% (28,307,136B) 0x616EB3E3: CreateInterpreter (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
|       |     |     ->13.68% (28,307,136B) 0x5505E7FB: TROOT::InitInterpreter() (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCore.so)
|       |     |       ->13.68% (28,307,136B) 0x5505ECAE: ROOT::Internal::GetROOT2() (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCore.so)
|       |     |         ->13.68% (28,307,136B) 0x54E9F9FB: __static_initialization_and_destruction_0(int, int) [clone .constprop.0] (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy_backend3_9.so)
|       |     |           ->13.68% (28,307,136B) 0x400F9C2: _dl_init (in /usr/lib64/ld-2.17.so)
|       |     |             ->13.68% (28,307,136B) 0x401459D: dl_open_worker (in /usr/lib64/ld-2.17.so)
|       |     |               ->13.68% (28,307,136B) 0x400F7D3: _dl_catch_error (in /usr/lib64/ld-2.17.so)
|       |     |                 ->13.68% (28,307,136B) 0x4013B8A: _dl_open (in /usr/lib64/ld-2.17.so)
|       |     |                   ->13.68% (28,307,136B) 0x5227FAA: dlopen_doit (in /usr/lib64/libdl-2.17.so)
|       |     |                     ->13.68% (28,307,136B) 0x400F7D3: _dl_catch_error (in /usr/lib64/ld-2.17.so)
|       |     |                       ->13.68% (28,307,136B) 0x52285AC: _dlerror_run (in /usr/lib64/libdl-2.17.so)
|       |     |                         ->13.68% (28,307,136B) 0x5228040: dlopen@@GLIBC_2.2.5 (in /usr/lib64/libdl-2.17.so)
|       |     |                           ->13.68% (28,307,136B) 0x4C1B60F: _PyImport_FindSharedFuncptr (dynload_shlib.c:100)
|       |     |                             ->13.68% (28,307,136B) 0x4BE6F04: _PyImport_LoadDynamicModuleWithSpec (importdl.c:137)
|       |     |                               ->13.68% (28,307,136B) 0x4BE4864: _imp_create_dynamic_impl (import.c:2302)
|       |     |                                 ->13.68% (28,307,136B) 0x4BE4864: _imp_create_dynamic (import.c.h:330)
|       |     |                                   ->13.68% (28,307,136B) 0x4B2084E: cfunction_vectorcall_FASTCALL (methodobject.c:430)
|       |     |                                     ->13.68% (28,307,136B) 0x4A8F213: do_call_core (ceval.c:5125)
|       |     |                                       ->13.68% (28,307,136B) 0x4A8F213: _PyEval_EvalFrameDefault (ceval.c:3582)
|       |     |                                         ->13.68% (28,307,136B) 0x4BBF78D: _PyEval_EvalFrame (pycore_ceval.h:40)
|       |     |                                           ->13.68% (28,307,136B) 0x4BBF78D: _PyEval_EvalCode (ceval.c:4329)
|       |     |                                             ->13.68% (28,307,136B) 0x4ADC330: _PyFunction_Vectorcall (call.c:396)
|       |     |                                               ->13.68% (28,307,136B) 0x4A91831: _PyObject_VectorcallTstate (abstract.h:118)
|       |     |                                                 ->13.68% (28,307,136B) 0x4A91831: PyObject_Vectorcall (abstract.h:127)
|       |     |                                                   ->13.68% (28,307,136B) 0x4A91831: call_function (ceval.c:5077)
|       |     |                                                     ->13.68% (28,307,136B) 0x4A91831: _PyEval_EvalFrameDefault (ceval.c:3489)
|       |     |                                                       ->13.68% (28,307,136B) 0x4A89BAA: _PyEval_EvalFrame (pycore_ceval.h:40)
|       |     |                                                         ->13.68% (28,307,136B) 0x4A89BAA: function_code_fastcall (call.c:330)
|       |     |                                                           
|       |     ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
|       |     
|       ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
|       
->12.01% (24,854,010B) 0x617933B9: cling::Dyld::ContainsSymbol((anonymous namespace)::LibraryPath const*, llvm::StringRef, unsigned int) const (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| ->09.97% (20,625,225B) 0x61796107: cling::Dyld::searchLibrariesForSymbol[abi:cxx11](llvm::StringRef, bool) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| | ->09.97% (20,625,225B) 0x61796384: cling::DynamicLibraryManager::searchLibrariesForSymbol[abi:cxx11](llvm::StringRef, bool) const (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |   ->09.97% (20,625,225B) 0x6172305A: AutoloadLibraryGenerator::tryToGenerate(llvm::orc::LookupState&, llvm::orc::LookupKind, llvm::orc::JITDylib&, llvm::orc::JITDylibLookupFlags, llvm::orc::SymbolLookupSet const&) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |     ->09.97% (20,625,225B) 0x62C334BC: llvm::orc::ExecutionSession::OL_applyQueryPhase1(std::unique_ptr<llvm::orc::InProgressLookupState, std::default_delete<llvm::orc::InProgressLookupState> >, llvm::Error) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |       ->09.97% (20,625,225B) 0x62C3E0A9: llvm::orc::ExecutionSession::lookup(llvm::orc::LookupKind, std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolLookupSet, llvm::orc::SymbolState, llvm::unique_function<void (llvm::Expected<llvm::DenseMap<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr>, llvm::detail::DenseMapPair<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol> > >)>, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> >, llvm::DenseMapInfo<llvm::orc::JITDylib*>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> > > > const&)>) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         ->09.43% (19,514,165B) 0x62C3ECD7: llvm::orc::ExecutionSession::lookup(std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolLookupSet const&, llvm::orc::LookupKind, llvm::orc::SymbolState, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> >, llvm::DenseMapInfo<llvm::orc::JITDylib*>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> > > > const&)>) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         | ->09.43% (19,514,165B) 0x62C3EF33: llvm::orc::ExecutionSession::lookup(std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolStringPtr, llvm::orc::SymbolState) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         |   ->09.43% (19,514,165B) 0x62C546D4: llvm::orc::LLJIT::lookupLinkerMangled(llvm::orc::JITDylib&, llvm::orc::SymbolStringPtr) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         |     ->09.43% (19,514,165B) 0x61820393: cling::IncrementalJIT::getSymbolAddress(llvm::StringRef, bool) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         |       ->09.43% (19,514,165B) 0x6181A761: cling::IncrementalExecutor::getAddressOfGlobal(llvm::StringRef, bool*) const (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         |         ->09.43% (19,514,165B) 0x617A65E2: cling::Interpreter::getAddressOfGlobal(clang::GlobalDecl const&, bool*) const (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         |           ->09.43% (19,514,165B) 0x6173F9C7: TClingDataMemberInfo::Offset() (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCling.so)
| |         |             ->09.43% (19,514,165B) 0x5516C979: TDataMember::GetOffsetCint() const (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libCore.so)
| |         |               ->09.43% (19,514,165B) 0x54EA55AD: Cppyy::GetDatamemberOffset(unsigned long, unsigned long) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy_backend3_9.so)
| |         |                 ->09.43% (19,514,165B) 0x54E21C12: CPyCppyy::CPPDataMember::Set(unsigned long, unsigned long) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                   ->09.43% (19,514,165B) 0x54E4A92F: CPyCppyy::BuildScopeProxyDict(unsigned long, _object*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                     ->09.43% (19,514,165B) 0x54E4D2E5: CPyCppyy::CreateScopeProxy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, _object*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                       ->09.43% (19,514,165B) 0x54E4DED9: CPyCppyy::CreateScopeProxy(unsigned long) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                         ->09.43% (19,514,165B) 0x54E4DF43: CPyCppyy::BindCppObjectNoCast(void*, unsigned long, unsigned int) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                           ->09.43% (19,514,165B) 0x54E259A5: CPyCppyy::CPPMethod::ExecuteFast(void*, long, CPyCppyy::CallContext*) [clone .isra.0] (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                             ->09.43% (19,514,165B) 0x54E28367: CPyCppyy::CPPMethod::ExecuteProtected(void*, long, CPyCppyy::CallContext*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                               ->09.43% (19,514,165B) 0x54E26439: CPyCppyy::CPPMethod::Execute(void*, long, CPyCppyy::CallContext*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                                 ->09.43% (19,514,165B) 0x54E270AB: CPyCppyy::CPPMethod::Call(CPyCppyy::CPPInstance*&, _object*, _object*, CPyCppyy::CallContext*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                                   ->09.43% (19,514,165B) 0x54E2B7B6: CPyCppyy::(anonymous namespace)::mp_call(CPyCppyy::CPPOverload*, _object*, _object*) (in /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.28.00-a8528/x86_64-centos7-gcc11-opt/lib/libcppyy3_9.so)
| |         |                                     ->09.43% (19,514,165B) 0x4ADC5A7: _PyObject_MakeTpCall (call.c:191)
| |         |                                       ->09.43% (19,514,165B) 0x4A92208: _PyObject_VectorcallTstate (abstract.h:116)
| |         |                                         ->09.43% (19,514,165B) 0x4A92208: _PyObject_VectorcallTstate (abstract.h:103)
| |         |                                           ->09.43% (19,514,165B) 0x4A92208: PyObject_Vectorcall (abstract.h:127)
| |         |                                             ->09.43% (19,514,165B) 0x4A92208: call_function (ceval.c:5077)
| |         |                                               ->09.43% (19,514,165B) 0x4A92208: _PyEval_EvalFrameDefault (ceval.c:3489)
| |         |                                                 ->09.43% (19,514,165B) 0x4BBF78D: _PyEval_EvalFrame (pycore_ceval.h:40)
| |         |                                                   ->09.43% (19,514,165B) 0x4BBF78D: _PyEval_EvalCode (ceval.c:4329)
| |         |                                                     ->09.43% (19,514,165B) 0x4BBFACD: _PyEval_EvalCodeWithName (ceval.c:4361)
| |         |                                                       ->09.43% (19,514,165B) 0x4BBFB1A: PyEval_EvalCodeEx (ceval.c:4377)
| |         |                                                         
| |         ->00.54% (1,111,060B) in 1+ places, all below ms_print's threshold (01.00%)
| |  

with another 20% similar to the stacks above, and then small contributions from the Clang AST, LLVM and PyObjectMalloc making up the remaining 30% of the memory.

To me this suggests that the flat part is coming from C++ ROOT (starting with PyEval, I am guessing from invoking the subprocess). I am not complaining about the flatness, but I am quite surprised by it. Is this expected? Is it some result of caching? I would appreciate your insight here too (also see the attached ms_print output for this setup).

ms_print_subproc_10trees_23syst_skip_nonrgxd_chunks_root628_massif.out.513.txt (1.9 MB)

That’s surprising, no idea why that would happen. Using the RDF logging functionality that Vincenzo mentioned above you should be able to see how much time each round of jitting and each event loop take. What changes? Is each event loop actually faster? That would be very surprising.

200 MB of base memory usage is not totally unexpected, see the 400 MB max resident I get just by importing ROOT:

$ python -c 'import ROOT; ROOT.RDataFrame(1)'                                                                                                                      
0.97user 0.15system 0:01.12elapsed 100%CPU (0avgtext+0avgdata 513252maxresident)k
0inputs+0outputs (0major+50738minor)pagefaults 0swaps

Hello @eguiraud

200 MB of base memory usage is not totally unexpected, see the 400 MB max resident I get just by importing ROOT:

Yeah, I am not shocked by the 200 MB, I am just shocked by the flatness of it. I would have thought there would be bumps in usage as chunks of trees are processed.

That’s surprising, no idea why that would happen. Using the RDF logging functionality that Vincenzo mentioned above you should be able to see how much time each round of jitting and each event loop take. What changes? Is each event loop actually faster? That would be very surprising.

Surprising to me too haha. The ROOT debug output for one tree, when using subprocesses, is:

Info in <[ROOT.RDF] Info /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RLoopManager.cxx:821 in void ROOT::Detail::RDF::RLoopManager::Run(bool)>: Starting event loop number 0.
Info in <[ROOT.RDF] Debug /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RDFUtils.cxx:329 in Long64_t ROOT::Internal::RDF::InterpreterCalc(const string&, const string&)>: Jitting and executing the following code:

ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Snapshot, int, int, Int_t, Int_t, Int_t, Int_t, float, Int_t, vector<TLorentzVector>, vector<TLorentzVector>, UInt_t, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, Int_t, Int_t, Int_t, Int_t, ULong64_t, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, UInt_t, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, vector<int>, vector<int>, vector<float>, vector<float>, vector<float>, vector<float>, vector<int>, vector<int>, ROOT::VecOps::RVec<vector<unsigned long>>, vector<TLorentzVector>, vector<int>, vector<int>, vector<int>, vector<vector<int> >, vector<float>, vector<float>, vector<float>, vector<float>, vector<int>, vector<float>, vector<float>, vector<float>, vector<float>, UInt_t, ROOT::VecOps::RVec<float>, vector<string>, vector<float>, Float_t, Float_t, Float_t, Float_t, vector<float>, float, Float_t, Float_t, Int_t, Int_t, Int_t, Int_t, Int_t, int, int, TLorentzVector, TLorentzVector, int, int, float, float, float, float, int, int, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, int, UInt_t, UInt_t, float, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, ULong64_t, float, Int_t, Int_t, Float_t, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, double, double, double, double, double, double, double, double, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, double>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x11e91f80), new const char*[258]{"HFStopClassification_partonlabel", "HFStopClassification_truthflavonly", "HF_Classification", "HF_ClassificationGhost", "HF_SimpleClassification", "HF_SimpleClassificationGhost", "Ht", "TopHeavyFlavorFilterFlag", "alljets_p4", "allparticles", "backgroundFlags", "bjets_e", "bjets_eta", "bjets_p4", "bjets_pcbt", "bjets_phi", "bjets_pt", "bjets_truthPartonLabel", "bjets_truthflav", "ejets_2015", "ejets_2016", "ejets_2017", 
"ejets_2018", "eventNumber", "fwdjets_e", "fwdjets_eta", "fwdjets_p4", "fwdjets_pcbt", "fwdjets_phi", "fwdjets_pt", "fwdjets_truthPartonLabel", "fwdjets_truthflav", "hasBadMuon", "jets_e", "jets_eta", "jets_p4", "jets_pcbt", "jets_phi", "jets_pt", "jets_truthPartonLabel", "jets_truthflav", "leptons_ID", "leptons_PLIVtight", "leptons_e", "leptons_eta", "leptons_phi", "leptons_pt", "leptons_tight", "leptons_truthIFFClass", "ljetGA_trackjets_indeces", "ljets_p4", "ljets_pcbt", "ljets_truthPartonLabel", "ljets_truthflav", "lrjets_GA_tjet_idxs", "lrjets_e", "lrjets_eta", "lrjets_phi", "lrjets_pt", "lrjets_truthLabel", "lrjets_xbb", "lrjets_xbb_higgs", "lrjets_xbb_qcd", "lrjets_xbb_top", "mcChannelNumber", "mc_generator_weights", "mc_generator_weights_name", "mc_generator_weights_norm", "met_met", "met_phi", "met_px", "met_py", "met_pz", "met_rescaled", "met_sumet", "mu", "mujets_2015", "mujets_2016", "mujets_2017", "mujets_2018", "nPDFFlavor", "nalljets", "nbjets", "neutrino1", "neutrino2", "nfwdjets", "njets", "njets_CBT0", "njets_CBT123", "njets_CBT4", "njets_CBT5", "nljets", "nnonfwdljets", "nonfwdljets_e", "nonfwdljets_eta", "nonfwdljets_p4", "nonfwdljets_pcbt", "nonfwdljets_phi", "nonfwdljets_pt", "nonfwdljets_truthPartonLabel", "nonfwdljets_truthflav", "ntjets", "randomRunNumber", "runNumber", "sphericity", "tjets_e", "tjets_eta", "tjets_p4", "tjets_pcbt", "tjets_phi", "tjets_pt", "tjets_truthflav", "tjets_truthflavExt", "totalEvents", "totalEventsWeighted", "truth_HDecay", "ttbb_categories", "weight_bTagSF_DL1r_Continuous", "weight_bTagSF_DL1r_Continuous_eigenvars_B_down", "weight_bTagSF_DL1r_Continuous_eigenvars_B_up", "weight_bTagSF_DL1r_Continuous_eigenvars_C_down", "weight_bTagSF_DL1r_Continuous_eigenvars_C_up", "weight_bTagSF_DL1r_Continuous_eigenvars_Light_down", "weight_bTagSF_DL1r_Continuous_eigenvars_Light_up", "weight_custom_leptonSF_PLIV_ELSTATDOWN", "weight_custom_leptonSF_PLIV_ELSTATUP", "weight_custom_leptonSF_PLIV_ELSYSDOWN", "weight_custom_leptonSF_PLIV_ELSYSUP", "weight_custom_leptonSF_PLIV_MUSTATDOWN", "weight_custom_leptonSF_PLIV_MUSTATUP", "weight_custom_leptonSF_PLIV_MUSYSDOWN", "weight_custom_leptonSF_PLIV_MUSYSUP", "weight_forwardjvt", "weight_forwardjvt_DOWN", "weight_forwardjvt_UP", "weight_globalLeptonTriggerSF", "weight_globalLeptonTriggerSF_EL_Trigger_DOWN", "weight_globalLeptonTriggerSF_EL_Trigger_UP", "weight_globalLeptonTriggerSF_MU_Trigger_STAT_DOWN", "weight_globalLeptonTriggerSF_MU_Trigger_STAT_UP", "weight_globalLeptonTriggerSF_MU_Trigger_SYST_DOWN", "weight_globalLeptonTriggerSF_MU_Trigger_SYST_UP", "weight_indiv_SF_EL_ChargeID", "weight_indiv_SF_EL_ChargeID_DOWN", "weight_indiv_SF_EL_ChargeID_UP", "weight_indiv_SF_EL_ChargeMisID", "weight_indiv_SF_EL_ChargeMisID_STAT_DOWN", "weight_indiv_SF_EL_ChargeMisID_STAT_UP", "weight_indiv_SF_EL_ChargeMisID_SYST_DOWN", "weight_indiv_SF_EL_ChargeMisID_SYST_UP", "weight_indiv_SF_EL_ID", "weight_indiv_SF_EL_ID_DOWN", "weight_indiv_SF_EL_ID_UP", "weight_indiv_SF_EL_Isol", "weight_indiv_SF_EL_Isol_DOWN", "weight_indiv_SF_EL_Isol_UP", "weight_indiv_SF_EL_Reco", "weight_indiv_SF_EL_Reco_DOWN", "weight_indiv_SF_EL_Reco_UP", "weight_indiv_SF_MU_ID", "weight_indiv_SF_MU_ID_STAT_DOWN", "weight_indiv_SF_MU_ID_STAT_LOWPT_DOWN", "weight_indiv_SF_MU_ID_STAT_LOWPT_UP", "weight_indiv_SF_MU_ID_STAT_UP", "weight_indiv_SF_MU_ID_SYST_DOWN", "weight_indiv_SF_MU_ID_SYST_LOWPT_DOWN", "weight_indiv_SF_MU_ID_SYST_LOWPT_UP", "weight_indiv_SF_MU_ID_SYST_UP", "weight_indiv_SF_MU_Isol", "weight_indiv_SF_MU_Isol_STAT_DOWN", 
"weight_indiv_SF_MU_Isol_STAT_UP", "weight_indiv_SF_MU_Isol_SYST_DOWN", "weight_indiv_SF_MU_Isol_SYST_UP", "weight_indiv_SF_MU_TTVA", "weight_indiv_SF_MU_TTVA_STAT_DOWN", "weight_indiv_SF_MU_TTVA_STAT_UP", "weight_indiv_SF_MU_TTVA_SYST_DOWN", "weight_indiv_SF_MU_TTVA_SYST_UP", "weight_jvt", "weight_jvt_DOWN", "weight_jvt_UP", "weight_leptonSF", "weight_leptonSF_EL_SF_ID_DOWN", "weight_leptonSF_EL_SF_ID_UP", "weight_leptonSF_EL_SF_Isol_DOWN", "weight_leptonSF_EL_SF_Isol_UP", "weight_leptonSF_EL_SF_Reco_DOWN", "weight_leptonSF_EL_SF_Reco_UP", "weight_leptonSF_EL_SF_Trigger_DOWN", "weight_leptonSF_EL_SF_Trigger_UP", "weight_leptonSF_MU_SF_ID_STAT_DOWN", "weight_leptonSF_MU_SF_ID_STAT_LOWPT_DOWN", "weight_leptonSF_MU_SF_ID_STAT_LOWPT_UP", "weight_leptonSF_MU_SF_ID_STAT_UP", "weight_leptonSF_MU_SF_ID_SYST_DOWN", "weight_leptonSF_MU_SF_ID_SYST_LOWPT_DOWN", "weight_leptonSF_MU_SF_ID_SYST_LOWPT_UP", "weight_leptonSF_MU_SF_ID_SYST_UP", "weight_leptonSF_MU_SF_Isol_STAT_DOWN", "weight_leptonSF_MU_SF_Isol_STAT_UP", "weight_leptonSF_MU_SF_Isol_SYST_DOWN", "weight_leptonSF_MU_SF_Isol_SYST_UP", "weight_leptonSF_MU_SF_TTVA_STAT_DOWN", "weight_leptonSF_MU_SF_TTVA_STAT_UP", "weight_leptonSF_MU_SF_TTVA_SYST_DOWN", "weight_leptonSF_MU_SF_TTVA_SYST_UP", "weight_leptonSF_MU_SF_Trigger_STAT_DOWN", "weight_leptonSF_MU_SF_Trigger_STAT_UP", "weight_leptonSF_MU_SF_Trigger_SYST_DOWN", "weight_leptonSF_MU_SF_Trigger_SYST_UP", "weight_mc", "weight_oldTriggerSF", "weight_oldTriggerSF_EL_Trigger_DOWN", "weight_oldTriggerSF_EL_Trigger_UP", "weight_oldTriggerSF_MU_Trigger_STAT_DOWN", "weight_oldTriggerSF_MU_Trigger_STAT_UP", "weight_oldTriggerSF_MU_Trigger_SYST_DOWN", "weight_oldTriggerSF_MU_Trigger_SYST_UP", "weight_pileup", "weight_pileup_DOWN", "weight_pileup_UP", "weight_tauSF", "weight_tauSF_EFF_ELEOLR_TOTAL_DOWN", "weight_tauSF_EFF_ELEOLR_TOTAL_UP", "weight_tauSF_EFF_RECO_TOTAL_DOWN", "weight_tauSF_EFF_RECO_TOTAL_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2025_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2025_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2530_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2530_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT3040_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT3040_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPTGE40_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPTGE40_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2025_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2025_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2530_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2530_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT3040_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT3040_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPTGE40_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPTGE40_UP", "weight_tauSF_EFF_RNNID_HIGHPT_DOWN", "weight_tauSF_EFF_RNNID_HIGHPT_UP", "weight_tauSF_EFF_RNNID_SYST_DOWN", "weight_tauSF_EFF_RNNID_SYST_UP", "weight_tauSF_ELE_EFF_ELEOLR_STAT_DOWN", "weight_tauSF_ELE_EFF_ELEOLR_STAT_UP", "weight_tauSF_ELE_EFF_ELEOLR_SYST_DOWN", "weight_tauSF_ELE_EFF_ELEOLR_SYST_UP", "weight_trackjet_bTagSF_DL1r_Continuous", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_B_down", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_B_up", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_C_down", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_C_up", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_Light_down", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_Light_up", "xsec_weight"}, 258, 1, reinterpret_cast<shared_ptr<ROOT::Internal::RDF::SnapshotHelperArgs>*>(0x11b2a830), 
reinterpret_cast<std::weak_ptr<ROOT::Internal::RDF::RJittedAction>*>(0xf798670), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0xcd75ad0));

Info in <[ROOT.RDF] Info /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RLoopManager.cxx:794 in void ROOT::Detail::RDF::RLoopManager::Jit()>: Just-in-time compilation phase completed in 6.865265 seconds.
Info in <[ROOT.RDF] Debug /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RLoopManager.cxx:548 in void ROOT::Detail::RDF::RLoopManager::RunTreeReader()>: Processing trees {nominal_Loose} in files {/eos/user/m/maly/thbb_boost/test_grid_sklim/testInputs/user.brle.mc16_13TeV.410470.PhPy8EG_ttbar_l.SGTOP1.e6337_s3126_r9364_p4514.lj.v26.skim_sys_v1_1_output.root/user.brle.32476059._000309.output.root}: entry range [0,17757], using slot 0 in thread 139682238956608.

while when processing all trees in one go:

Info in <[ROOT.RDF] Info /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RLoopManager.cxx:821 in void ROOT::Detail::RDF::RLoopManager::Run(bool)>: Starting event loop number 0.
Info in <[ROOT.RDF] Debug /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RDFUtils.cxx:329 in Long64_t ROOT::Internal::RDF::InterpreterCalc(const string&, const string&)>: Jitting and executing the following code:

ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Snapshot, int, int, Int_t, Int_t, Int_t, Int_t, float, Int_t, vector<TLorentzVector>, vector<TLorentzVector>, UInt_t, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, Int_t, Int_t, Int_t, Int_t, ULong64_t, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, UInt_t, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, vector<int>, vector<int>, vector<float>, vector<float>, vector<float>, vector<float>, vector<int>, vector<int>, ROOT::VecOps::RVec<vector<unsigned long>>, vector<TLorentzVector>, vector<int>, vector<int>, vector<int>, vector<vector<int> >, vector<float>, vector<float>, vector<float>, vector<float>, vector<int>, vector<float>, vector<float>, vector<float>, vector<float>, UInt_t, ROOT::VecOps::RVec<float>, vector<string>, vector<float>, Float_t, Float_t, Float_t, Float_t, vector<float>, float, Float_t, Float_t, Int_t, Int_t, Int_t, Int_t, Int_t, int, int, TLorentzVector, TLorentzVector, int, int, float, float, float, float, int, int, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, int, UInt_t, UInt_t, float, vector<float>, vector<float>, vector<TLorentzVector>, vector<int>, vector<float>, vector<float>, vector<int>, vector<int>, ULong64_t, float, Int_t, Int_t, Float_t, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, double, double, double, double, double, double, double, double, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, Float_t, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, ROOT::VecOps::RVec<float>, double>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x864d27e0), new const char*[258]{"HFStopClassification_partonlabel", "HFStopClassification_truthflavonly", "HF_Classification", "HF_ClassificationGhost", "HF_SimpleClassification", "HF_SimpleClassificationGhost", "Ht", "TopHeavyFlavorFilterFlag", "alljets_p4", "allparticles", "backgroundFlags", "bjets_e", "bjets_eta", "bjets_p4", "bjets_pcbt", "bjets_phi", "bjets_pt", "bjets_truthPartonLabel", "bjets_truthflav", "ejets_2015", "ejets_2016", "ejets_2017", 
"ejets_2018", "eventNumber", "fwdjets_e", "fwdjets_eta", "fwdjets_p4", "fwdjets_pcbt", "fwdjets_phi", "fwdjets_pt", "fwdjets_truthPartonLabel", "fwdjets_truthflav", "hasBadMuon", "jets_e", "jets_eta", "jets_p4", "jets_pcbt", "jets_phi", "jets_pt", "jets_truthPartonLabel", "jets_truthflav", "leptons_ID", "leptons_PLIVtight", "leptons_e", "leptons_eta", "leptons_phi", "leptons_pt", "leptons_tight", "leptons_truthIFFClass", "ljetGA_trackjets_indeces", "ljets_p4", "ljets_pcbt", "ljets_truthPartonLabel", "ljets_truthflav", "lrjets_GA_tjet_idxs", "lrjets_e", "lrjets_eta", "lrjets_phi", "lrjets_pt", "lrjets_truthLabel", "lrjets_xbb", "lrjets_xbb_higgs", "lrjets_xbb_qcd", "lrjets_xbb_top", "mcChannelNumber", "mc_generator_weights", "mc_generator_weights_name", "mc_generator_weights_norm", "met_met", "met_phi", "met_px", "met_py", "met_pz", "met_rescaled", "met_sumet", "mu", "mujets_2015", "mujets_2016", "mujets_2017", "mujets_2018", "nPDFFlavor", "nalljets", "nbjets", "neutrino1", "neutrino2", "nfwdjets", "njets", "njets_CBT0", "njets_CBT123", "njets_CBT4", "njets_CBT5", "nljets", "nnonfwdljets", "nonfwdljets_e", "nonfwdljets_eta", "nonfwdljets_p4", "nonfwdljets_pcbt", "nonfwdljets_phi", "nonfwdljets_pt", "nonfwdljets_truthPartonLabel", "nonfwdljets_truthflav", "ntjets", "randomRunNumber", "runNumber", "sphericity", "tjets_e", "tjets_eta", "tjets_p4", "tjets_pcbt", "tjets_phi", "tjets_pt", "tjets_truthflav", "tjets_truthflavExt", "totalEvents", "totalEventsWeighted", "truth_HDecay", "ttbb_categories", "weight_bTagSF_DL1r_Continuous", "weight_bTagSF_DL1r_Continuous_eigenvars_B_down", "weight_bTagSF_DL1r_Continuous_eigenvars_B_up", "weight_bTagSF_DL1r_Continuous_eigenvars_C_down", "weight_bTagSF_DL1r_Continuous_eigenvars_C_up", "weight_bTagSF_DL1r_Continuous_eigenvars_Light_down", "weight_bTagSF_DL1r_Continuous_eigenvars_Light_up", "weight_custom_leptonSF_PLIV_ELSTATDOWN", "weight_custom_leptonSF_PLIV_ELSTATUP", "weight_custom_leptonSF_PLIV_ELSYSDOWN", "weight_custom_leptonSF_PLIV_ELSYSUP", "weight_custom_leptonSF_PLIV_MUSTATDOWN", "weight_custom_leptonSF_PLIV_MUSTATUP", "weight_custom_leptonSF_PLIV_MUSYSDOWN", "weight_custom_leptonSF_PLIV_MUSYSUP", "weight_forwardjvt", "weight_forwardjvt_DOWN", "weight_forwardjvt_UP", "weight_globalLeptonTriggerSF", "weight_globalLeptonTriggerSF_EL_Trigger_DOWN", "weight_globalLeptonTriggerSF_EL_Trigger_UP", "weight_globalLeptonTriggerSF_MU_Trigger_STAT_DOWN", "weight_globalLeptonTriggerSF_MU_Trigger_STAT_UP", "weight_globalLeptonTriggerSF_MU_Trigger_SYST_DOWN", "weight_globalLeptonTriggerSF_MU_Trigger_SYST_UP", "weight_indiv_SF_EL_ChargeID", "weight_indiv_SF_EL_ChargeID_DOWN", "weight_indiv_SF_EL_ChargeID_UP", "weight_indiv_SF_EL_ChargeMisID", "weight_indiv_SF_EL_ChargeMisID_STAT_DOWN", "weight_indiv_SF_EL_ChargeMisID_STAT_UP", "weight_indiv_SF_EL_ChargeMisID_SYST_DOWN", "weight_indiv_SF_EL_ChargeMisID_SYST_UP", "weight_indiv_SF_EL_ID", "weight_indiv_SF_EL_ID_DOWN", "weight_indiv_SF_EL_ID_UP", "weight_indiv_SF_EL_Isol", "weight_indiv_SF_EL_Isol_DOWN", "weight_indiv_SF_EL_Isol_UP", "weight_indiv_SF_EL_Reco", "weight_indiv_SF_EL_Reco_DOWN", "weight_indiv_SF_EL_Reco_UP", "weight_indiv_SF_MU_ID", "weight_indiv_SF_MU_ID_STAT_DOWN", "weight_indiv_SF_MU_ID_STAT_LOWPT_DOWN", "weight_indiv_SF_MU_ID_STAT_LOWPT_UP", "weight_indiv_SF_MU_ID_STAT_UP", "weight_indiv_SF_MU_ID_SYST_DOWN", "weight_indiv_SF_MU_ID_SYST_LOWPT_DOWN", "weight_indiv_SF_MU_ID_SYST_LOWPT_UP", "weight_indiv_SF_MU_ID_SYST_UP", "weight_indiv_SF_MU_Isol", "weight_indiv_SF_MU_Isol_STAT_DOWN", 
"weight_indiv_SF_MU_Isol_STAT_UP", "weight_indiv_SF_MU_Isol_SYST_DOWN", "weight_indiv_SF_MU_Isol_SYST_UP", "weight_indiv_SF_MU_TTVA", "weight_indiv_SF_MU_TTVA_STAT_DOWN", "weight_indiv_SF_MU_TTVA_STAT_UP", "weight_indiv_SF_MU_TTVA_SYST_DOWN", "weight_indiv_SF_MU_TTVA_SYST_UP", "weight_jvt", "weight_jvt_DOWN", "weight_jvt_UP", "weight_leptonSF", "weight_leptonSF_EL_SF_ID_DOWN", "weight_leptonSF_EL_SF_ID_UP", "weight_leptonSF_EL_SF_Isol_DOWN", "weight_leptonSF_EL_SF_Isol_UP", "weight_leptonSF_EL_SF_Reco_DOWN", "weight_leptonSF_EL_SF_Reco_UP", "weight_leptonSF_EL_SF_Trigger_DOWN", "weight_leptonSF_EL_SF_Trigger_UP", "weight_leptonSF_MU_SF_ID_STAT_DOWN", "weight_leptonSF_MU_SF_ID_STAT_LOWPT_DOWN", "weight_leptonSF_MU_SF_ID_STAT_LOWPT_UP", "weight_leptonSF_MU_SF_ID_STAT_UP", "weight_leptonSF_MU_SF_ID_SYST_DOWN", "weight_leptonSF_MU_SF_ID_SYST_LOWPT_DOWN", "weight_leptonSF_MU_SF_ID_SYST_LOWPT_UP", "weight_leptonSF_MU_SF_ID_SYST_UP", "weight_leptonSF_MU_SF_Isol_STAT_DOWN", "weight_leptonSF_MU_SF_Isol_STAT_UP", "weight_leptonSF_MU_SF_Isol_SYST_DOWN", "weight_leptonSF_MU_SF_Isol_SYST_UP", "weight_leptonSF_MU_SF_TTVA_STAT_DOWN", "weight_leptonSF_MU_SF_TTVA_STAT_UP", "weight_leptonSF_MU_SF_TTVA_SYST_DOWN", "weight_leptonSF_MU_SF_TTVA_SYST_UP", "weight_leptonSF_MU_SF_Trigger_STAT_DOWN", "weight_leptonSF_MU_SF_Trigger_STAT_UP", "weight_leptonSF_MU_SF_Trigger_SYST_DOWN", "weight_leptonSF_MU_SF_Trigger_SYST_UP", "weight_mc", "weight_oldTriggerSF", "weight_oldTriggerSF_EL_Trigger_DOWN", "weight_oldTriggerSF_EL_Trigger_UP", "weight_oldTriggerSF_MU_Trigger_STAT_DOWN", "weight_oldTriggerSF_MU_Trigger_STAT_UP", "weight_oldTriggerSF_MU_Trigger_SYST_DOWN", "weight_oldTriggerSF_MU_Trigger_SYST_UP", "weight_pileup", "weight_pileup_DOWN", "weight_pileup_UP", "weight_tauSF", "weight_tauSF_EFF_ELEOLR_TOTAL_DOWN", "weight_tauSF_EFF_ELEOLR_TOTAL_UP", "weight_tauSF_EFF_RECO_TOTAL_DOWN", "weight_tauSF_EFF_RECO_TOTAL_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2025_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2025_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2530_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT2530_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT3040_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPT3040_UP", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPTGE40_DOWN", "weight_tauSF_EFF_RNNID_1PRONGSTATSYSTPTGE40_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2025_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2025_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2530_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT2530_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT3040_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPT3040_UP", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPTGE40_DOWN", "weight_tauSF_EFF_RNNID_3PRONGSTATSYSTPTGE40_UP", "weight_tauSF_EFF_RNNID_HIGHPT_DOWN", "weight_tauSF_EFF_RNNID_HIGHPT_UP", "weight_tauSF_EFF_RNNID_SYST_DOWN", "weight_tauSF_EFF_RNNID_SYST_UP", "weight_tauSF_ELE_EFF_ELEOLR_STAT_DOWN", "weight_tauSF_ELE_EFF_ELEOLR_STAT_UP", "weight_tauSF_ELE_EFF_ELEOLR_SYST_DOWN", "weight_tauSF_ELE_EFF_ELEOLR_SYST_UP", "weight_trackjet_bTagSF_DL1r_Continuous", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_B_down", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_B_up", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_C_down", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_C_up", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_Light_down", "weight_trackjet_bTagSF_DL1r_Continuous_eigenvars_Light_up", "xsec_weight"}, 258, 1, reinterpret_cast<shared_ptr<ROOT::Internal::RDF::SnapshotHelperArgs>*>(0x8512da70), 
reinterpret_cast<std::weak_ptr<ROOT::Internal::RDF::RJittedAction>*>(0x8656d0c0), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0x80405b70));

Info in <[ROOT.RDF] Info /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RLoopManager.cxx:794 in void ROOT::Detail::RDF::RLoopManager::Jit()>: Just-in-time compilation phase completed in 58.144986 seconds.
Info in <[ROOT.RDF] Debug /build/jenkins/workspace/lcg_release_pipeline/build/projects/ROOT-6.28.00/src/ROOT/6.28.00/tree/dataframe/src/RLoopManager.cxx:548 in void ROOT::Detail::RDF::RLoopManager::RunTreeReader()>: Processing trees {nominal_Loose} in files {/eos/user/m/maly/thbb_boost/test_grid_sklim/testInputs/user.brle.mc16_13TeV.410470.PhPy8EG_ttbar_l.SGTOP1.e6337_s3126_r9364_p4514.lj.v26.skim_sys_v1_1_output.root/user.brle.32476059._000309.output.root}: entry range [0,17757], using slot 0 in thread 196836544.

I see a huge difference in the JIT-ting time! The JIT-ting speed-up is about 8.5x, while the overall speed-up is about 3-5x. It’s worth noting that the tree I compare in both cases is the last tree being processed. In the subprocess approach it is processed pretty much on its own, while in the one-go code it gets processed after memory has been hoarded for the 22 trees before it. Maybe that’s why?

The snippets do not show the event loop time?

Also can you confirm that every jitted Snapshot invocation has a different set of template parameters? (if not, jits after the first should take very little time when processing all trees in the same process, because they can just reuse the existing template instantiations).

Hi @eguiraud,

Also can you confirm that every jitted Snapshot invocation has a different set of template parameters? (if not, jits after the first should take very little time when processing all trees in the same process, because they can just reuse the existing template instantiations).

Out of the 23, the first 22 trees (systematic trees) have the same template, but the last one (the one I posted here) has a different template. Indeed, when looking at the first 22 trees in the C++ code without chunking, the first JIT takes 56s and all following JIT-tings take 5.6s. Then the last (nominal) tree takes 34s.

Yes, I had forgotten to add these. Here are the results, summarised:

  1. Running the C++ code over 23 trees with no chunking:

    • Execution time: 613s
    • JIT time for nominal tree: 33.3s
    • Loop time for nominal tree: (17.52s CPU, 19.2893s elapsed)
    • JIT time for first systematic tree: 56s
    • Loop time for first systematic tree: (14.67s CPU, 16.8767s elapsed)
    • JIT time for non-first systematic trees: 5.6s
    • Loop time for non-first systematic trees: (14.26s CPU, 16.5614s elapsed)

Reference Distribution:

  2. Running the C++ code through a Python subprocess.run() over 23 trees with no chunking:

    • Execution time: 160s
    • JIT time for nominal tree: 3.97s
    • Loop time for nominal tree: (1.1s CPU, 2.53289s elapsed)
    • JIT time for first systematic tree: 6.55s
    • Loop time for first systematic tree: (0.6s CPU, 4.16665s elapsed)
    • JIT time for non-first systematic trees: 0.78s
    • Loop time for non-first systematic trees: (0.61s CPU, 1.24314s elapsed)

Reference Distribution:

  3. Running the C++ code through a Python subprocess.run() over 23 trees in chunks of 10 trees per subprocess:

    • Execution time: 167s
    • JIT time for nominal tree: 7s
    • Loop time for nominal tree: (1.05s CPU, 2.76378s elapsed)
    • JIT time for first systematic tree: 6.4s
    • Loop time for first systematic tree: (0.57s CPU, 0.653824s elapsed)
    • JIT time for non-first systematic trees: 0.93s
    • Loop time for non-first systematic trees: (0.54s CPU, 0.798583s elapsed)

Reference Distribution:

I added these reference distributions because I am really unsure what’s going on; my first guess was that the processing was not done properly, but the distributions look consistent. I am trying to profile CPU usage…

Okay, ignore me, I was stupidly comparing

valgrind python run.py against valgrind ./run args

so valgrind was simply not slowing down the C++ code when I ran the Python script (by default valgrind does not follow child processes), but it of course slows down the direct C++ call. The execution times without valgrind actually are:

  1. 100s
  2. 110s
  3. 120s

Sorry for the noise :slight_smile:

Normality restored :slight_smile:

(also: valgrind distorts the runtime of a program so much that it basically makes the numbers irrelevant; runtime comparisons should only be made between programs running with full optimizations, compiled with -O2, etc.)
