Multithread JIT for RDF?

Dear ROOT experts,

In my physics analysis I need to construct O(100) RDataFrame instances (for different MC channels and data-taking periods). It seems that for every Define/Filter on each RDF, ROOT needs to generate one global line like this ROOT::Internal::RDF::JitDefineHelper<...>(...) call for the JIT step.

For my use case this means O(10000) such lines need to be JITed before running the event loop with the RunGraphs([df1, df2, …]) call, which takes 18 seconds on my local machine, while running the event loop itself only takes 18 seconds without MT.

Looking at the LoopManager Jit() method, it appears that the JIT is done sequentially, in 1000-line chunks.

I was wondering if it is possible for ROOT to do the compilation in parallel to speed up the process?

Here is my script in which the slow JIT is observed: (4.5 KB)

Here are some logs from the script and some sample code that is JITed:

Info in <[ROOT.RDF] Info /usr/src/debug/root/root-6.28.04/tree/dataframe/src/RDFHelpers.cxx:69 in void ROOT::RDF::RunGraphs(std::vector<RResultHandle>)>: Just-in-time compilation phase for RunGraphs (198 unique computation graphs) completed in 17.314043 seconds.

Info in <[ROOT.RDF] Info /usr/src/debug/root/root-6.28.04/tree/dataframe/src/RDFHelpers.cxx:91 in void ROOT::RDF::RunGraphs(std::vector<RResultHandle>)>: Finished RunGraphs run (198 unique computation graphs, 18.5s CPU, 20.2613s elapsed).

ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo2D, double, double, double>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x5586dbec2770), new const char*[3]{"var1_to_be_fill", "var2_to_be_fill", "final_weight"}, 3, 1, reinterpret_cast<shared_ptr<TH2D>*>(0x5586dbec2750), reinterpret_cast<std::weak_ptr<ROOT::Internal::RDF::RJittedAction>*>(0x5586dbec28c0), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0x5586dbebd220));

ROOT::Internal::RDF::JitFilterHelper(R_rdf::func27, new const char*[0]{}, 0, "", reinterpret_cast<std::weak_ptr<ROOT::Detail::RDF::RJittedFilter>*>(0x5586dbec2ac0), reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x5586dbec2aa0), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0x5586dbebd1c0));

ROOT::Internal::RDF::JitDefineHelper<ROOT::Internal::RDF::DefineTypes::RDefineTag>(R_rdf::func28, new const char*[1]{"omega"}, 1, "var1_to_be_fill", reinterpret_cast<ROOT::Detail::RDF::RLoopManager*>(0x558691fc6830), reinterpret_cast<std::weak_ptr<ROOT::Detail::RDF::RJittedDefine>*>(0x5586dbec2e60), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0x5586dbec2ef0), reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x5586dbec2e40));

ROOT::Internal::RDF::JitDefineHelper<ROOT::Internal::RDF::DefineTypes::RDefineTag>(R_rdf::func29, new const char*[1]{"mutau_colin_p4"}, 1, "var2_to_be_fill", reinterpret_cast<ROOT::Detail::RDF::RLoopManager*>(0x558691fc6830), reinterpret_cast<std::weak_ptr<ROOT::Detail::RDF::RJittedDefine>*>(0x5586dbec4190), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0x5586dbec4370), reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x5586dbec4170));

Cheers, Qichen Dong

ROOT Version: 6.28/04
Platform: Arch Linux
Compiler: gcc 13.1.1

Hi @qidong,

Welcome to the ROOT forum! It's an interesting question that @vpadulan might be able to help with.


Hi @mczurylo,

Thanks for your reply!

Hope this post receives some attention :wink:

Cheers, Q Dong

Hi @qidong ,

Sorry for the high latency! Unfortunately it is not possible to parallelize the just-in-time compilation; it's not something cling (ROOT's C++ interpreter) can do.

RDF employs a couple of tricks to speed up just-in-time compilation:

  • the chunking into 1000-line batches: clang (and therefore cling) slows down to a crawl when just-in-time compiling functions with very large bodies, which is what happened before we introduced the chunking
  • RDF makes sure to reuse as many template instantiations as possible across different invocations of the jitting. To that end it also unifies calls such as df.Filter("x > 0") and df.Filter("y > 0"): if x and y are of the same type, RDF produces only one helper function instead of two, reducing the number of different template instantiations needed

For both of these optimizations it's necessary that all code to be jitted has already been registered with RDF when the first event loop starts, which indeed seems to be the case in the script you shared.

Our general expectation is that computation graphs (or hundreds thereof) large enough that jitting takes ~20 seconds also take a correspondingly long time to process (say, at least 10 minutes), so that the 20 seconds of startup don't hurt much. Is that not the case here?

One way to work around this problem is to avoid the jitting altogether: you can write a pure C++ function that takes an RDF object (as the ROOT::RDF::RNode type) and returns an RDF object, and instead of making calls such as:

.Filter("Sum(good_taus_muonrm) >= 1", "filter_tau_id")

it does

ROOT::RDF::RNode ApplyFiltersAndDefines(ROOT::RDF::RNode df) {
   return df.Filter([] (const ROOT::RVecF &taus) { return Sum(taus) >= 1; }, {"good_taus_muonrm"}, "filter_tau_id");
}

From Python you can then call that single C++ function and skip the RDF jitting step altogether.

I realize the second version is clunkier to write, but it makes sense to me as a performance optimization in an extreme case such as this one.

We could also check what exactly is taking so long to compile in cling, but given the problem size you describe, 18 seconds is not surprising: it's in the ballpark of what a C++ translation unit of similar size would take to compile.

Let me know if this helps.


Hi @eguiraud,

Thanks a lot for your informative answer! That is indeed really helpful!
Avoiding the JIT altogether with pure C++ functions seems to be the way to go :wink: It is indeed a bit more work to write all the types out, but it is completely manageable.

As you correctly pointed out, if the JIT takes so long one would also expect a heavy computational load in general. But in my analysis I am looking at a niche phase space where I still need many separate MC samples (hence the large number of RDFs) while the number of events in each MC is low, so the actual computation takes very little time.
I did notice the compilation optimisation you mentioned: only a handful of functions were JITed for the actual computation. The majority of the time was instead spent compiling these lines, which are generated for every df and every booked action:


And finally, I do have a naive question: could you elaborate a bit on why clang/cling is not able to do the JIT in parallel? I am asking because it is common to compile different translation units in parallel.

Cheers, Dong

I’m not a cling expert (we would need @Axel or @vvassilev ) but my understanding is that all jitted code ends up in the equivalent of a single translation unit.

Once we adopt the ORCv2 infrastructure we might be able to teach cling to JIT code in parallel while executing other code. However, that's only partly available in LLVM 16, and in practice we need LLVM 17 (which is not yet released).

Hi @vvassilev, all,

Thanks a lot for your answer; I hope that in the future, analyses in Python with RDF will be even more efficient!

Just some notes from implementing the C++ function with the RNode type and calling it from Python:

  • In Python, the RDataFrame type does not automatically convert to RNode;
  • one has to use the ROOT::RDF::AsRNode() function to wrap the df in Python for it to work with a C++ function with signature RNode myFunc(RNode df, ...);
  • another solution is to use a template (and, hopefully with C++20, a concept) for the function declaration: template <typename T_df> ROOT::RDF::RNode myFunc(T_df df, ...) works flawlessly.

Cheers, Dong

Indeed, that's the motivation for having ROOT.RDF.AsRNode: cppyy/PyROOT does not pick up the conversion operator. I wonder whether something can be done there; maybe @vpadulan knows.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.