Simple way to test/prove EnableImplicitMT performance

I’m writing my annual ROOT tutorial. I want to simply demonstrate how EnableImplicitMT makes RDataFrame run faster.

My problem is that, on two different environments running ROOT 6.26/10, using something very similar to the multi-threading example from the ROOT manual, I can’t show that EnableImplicitMT is actually doing anything.

My C++ code is (split between two Jupyter notebook cells):

ROOT::EnableImplicitMT();
auto poolSize = ROOT::GetThreadPoolSize();
std::cout << "Pool size = " << poolSize << std::endl;

%%time
auto dataframe = ROOT::RDataFrame("tree1","experiment.root");
auto pzcut = dataframe.Filter("pz < 145").Histo1D("pz");
pzcut->Draw();

and in Python:

import ROOT
ROOT.ROOT.EnableImplicitMT()
poolSize = ROOT.GetThreadPoolSize()
print ("Pool size =",poolSize)

%%timeit
dataframe = ROOT.RDataFrame("tree1","experiment.root")
pzcut = dataframe.Filter("pz < 145").Histo1D("pz")
pzcut.Draw()

When I run the above code in either C++ or Python, on both an AlmaLinux 9 system with Intel processors and a MacOS 13 Ventura system with M1 processors, I get essentially the same time result with EnableImplicitMT as with DisableImplicitMT.

On both systems, poolSize is 0 with DisableImplicitMT and 10 with EnableImplicitMT.

I restart the Jupyter notebook kernels in-between each test, to make sure that one test isn’t “contaminating” the other.

The file experiment.root was created by me about 20 years ago. It’s a simple n-tuple of 100,000 events.

Am I doing anything obviously wrong? Is the example in the manual flawed in some way? Did I mis-read it? Or have I misunderstood something and EnableImplicitMT does not actually work in this way? If so, I’ll omit the discussion of multi-threading from my tutorial.

Remember, my goal here is not to create some detailed framework to take advantage of multi-threaded processing. I’d just like for my students to type in a few simple commands (basically the same as in the manual!) and go “Oh, wow! Multi-threaded is faster!”


ROOT Version: 6.26/10
Platform: AlmaLinux 9 and MacOS 13 Ventura
Compiler: gcc 11.3.1 and clang 14.0.0

Distributions:
On AlmaLinux 9, I installed ROOT from the EPEL repository with dnf -y install root\*
On MacOS, I installed ROOT via conda-forge with conda create --name jupyter-pyroot python jupyter jupyterlab root


I guess @eguiraud can help.

Hi Bill,

With Enrico on vacation I’ll try to help out!

IMT only really helps if the file is big enough, or there are enough files. If you divide the processing time by 10, how much time would you expect each thread to take? We do have some setup time, maybe that dominates? You can also process a TChain of 10x the same file: that should take 10x as long, does it with IMT?

Do you have a chance to look at (and share) the output from time?

Cheers, Axel

First I’ll respond to the requests. Then a general comment.

Result from single-thread Python test using %%timeit:

230 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Result for multiple-thread Python test (EnableImplicitMT called) using %%timeit

230 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Result from single-thread C++ test using %%time (the ROOT C++ notebook doesn’t appear to support %%timeit):

Time: 2.959040880203247 seconds.

Result for multiple-thread C++ test (EnableImplicitMT called) using %%time

Time: 3.097882032394409 seconds.

Looking at the TTree cluster(s) in experiment.root:

root [1] tree1->Print("clusters")
******************************************************************************
*Tree    :tree1     : Reconstruction ntuple                                  *
*Entries :   100000 : Total =         2810761 bytes  File  Size =    2171135 *
*        :          : Tree compression factor =   1.30                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry           Size   Number of clusters
0                0                99999            -30000000          1 (estimated)
Total number of clusters: 1 (estimated)

All of this is consistent with Wile_E_Coyote’s statement that only one thread is actually executing. I didn’t mention this, but when I try Snapshot(), all of the output lines are in chronological order, instead of the “scrambling” I’d expect from multi-threaded processing. This further supports the idea that I’m only getting one thread to execute.

So that kinda-sorta diagnoses the issue. Now for the general comments:

  • None of what’s been pointed out so far is in the manual or the tutorial examples. They just state something like “Call EnableImplicitMT() to turn on multi-threading.” From the perspective of a new user (which, by definition, all my students are), the initial impression is that multi-threading is useless.

  • Which leads to my original goal: What is a simple way to demonstrate within the context of a beginner’s ROOT tutorial that EnableImplicitMT is of any value? Remember, my basic issue is that if I do exactly what it says in the documentation, I don’t see any improvement.

Create a TChain, add your (single) ROOT file 100 times to it, and then “analyze” it.

Your suggestion worked, but the results are not what I would have hoped for.

In order to see a 10x increase in speed, I had to add my file 1000 times to a chain; with fewer copies, the speed improvement was poorer. It appears to grow only logarithmically with the number of copies. Also, the 10x speed improvement was on the Python side, but not on the C++ side.

Python code (in two cells):

ROOT.ROOT.EnableImplicitMT()
chain = ROOT.TChain("tree1")
for i in range(1000):
    chain.Add("experiment.root")

%%timeit
dataframe = ROOT.RDataFrame(chain)
pzcut = dataframe.Filter("pz < 145").Histo1D("pz")
pzcut.Draw()

C++ code in two cells:

ROOT::EnableImplicitMT();
TChain chain("tree1");
for ( int i = 0; i < 1000; ++i ) {
    chain.Add("experiment.root");
    }

%%time
auto dataframe = ROOT::RDataFrame(chain);
auto pzcut = dataframe.Filter("pz < 145").Histo1D("pz");
pzcut->Draw();

Results from Python test, EnableImplicitMT off:

17 s ± 314 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python test, EnableImplicitMT on:

2.37 s ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

C++ test, EnableImplicitMT off:

Time: 19.272875547409058 seconds

C++ test, EnableImplicitMT on:

Time: 5.420331239700317 seconds.

If I only added 100 copies of the file to the chain, there was at best a factor of two improvement on the Python side; if I only add 10 copies, it was only a 20% improvement.

I question whether I should be introducing them to EnableImplicitMT in my tutorial. I’d have to also teach about TChain (which I know seems basic enough to this forum, but is unlikely to be of any use to them in the work they’d be asked to do) solely to bloat the input file. And the results of the tests don’t seem dramatic enough to make a fuss about it. Plus I’d have to explain why the improvement for the C++ version is poorer than the Python version; the reason isn’t obvious to me.

I’m curious: For those folks who developed the RDataFrame tutorial examples, was the code tested with and without EnableImplicitMT to see if it provided any noticeable improvements?

Try to create one “big” file using the “hadd” utility to sum your “small” file 100 times.

Instead of using hadd (which would be a brand-new external utility that the students would have to learn), I ran my original n-tuple creation program with 100x more events:

ls -lh experiment.root tree.root 
-rw-r--r-- 1 seligman atlas 2.1M Jun 11  2001 experiment.root
-rw-r--r-- 1 seligman atlas 207M Mar 23 08:41 tree.root

I even checked to see if I now had more “clusters” in the file:

root tree.root
root [1] tree1->Print("clusters")
******************************************************************************
*Tree    :tree1     : Reconstruction ntuple                                  *
*Entries : 10000000 : Total =       280121799 bytes  File  Size =  216558000 *
*        :          : Tree compression factor =   1.29                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry           Size   Number of clusters
0                0                9999999           1388694          7
Total number of clusters: 7 

Given what you said before, I would have expected a 7x improvement in speed. But that’s not what I got; the results were consistent with setting up a TChain with only 100 repeats:

Python, no EnableImplicitMT:

1.63 s ± 72.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python, EnableImplicitMT:

526 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

C++, no EnableImplicitMT:

Time: 4.817873477935791 seconds.

C++, EnableImplicitMT:

Time: 3.295820951461792 seconds.

I dunno. Maybe the lesson to be learned is that ROOT is so efficient that multithreading doesn’t have much effect.

Just for fun … try a TChain adding 10 times your new “big” file (this would then be a direct comparison with 1000 times your original “small” file).

BTW. I guess it’s something for @eguiraud

Hi,

lots to unpack here in order to give a good answer – I’ll need a bit of time, bear with me! :smiley:

I went ahead and created an input file with 1000 times more events than my original input file, which of course made it 1000x larger:

$ ls -lh experiment.root tree.root 
-rw-r--r-- 1 seligman atlas 2.1M Jun 11  2001 experiment.root
-rw-r--r-- 1 seligman atlas 2.1G Mar 23 15:35 tree.root

Lots of those clusters:

root tree.root
   ------------------------------------------------------------------
  | Welcome to ROOT 6.24/08                        https://root.cern |
  | (c) 1995-2021, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Sep 29 2022, 13:04:57                 |
  | From tags/v6-24-08@v6-24-08                                      |
  | With c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)                 |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q'       |
   ------------------------------------------------------------------

root [0] 
Attaching file tree.root as _file0...
(TFile *) 0x1f8f150
root [1] tree1->Print("clusters")
******************************************************************************
*Tree    :tree1     : Reconstruction ntuple                                  *
*Entries : 100000000 : Total =      2800204747 bytes  File  Size = 2164558164 *
*        :          : Tree compression factor =   1.29                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry           Size   Number of clusters
0                0                99999999          1388694         72
Total number of clusters: 72 

My results were:

Python, 1 thread:

CPU times: user 16.1 s, sys: 253 ms, total: 16.3 s
Wall time: 16.2 s

Python, 10 threads:

CPU times: user 19.8 s, sys: 286 ms, total: 20.1 s
Wall time: 4.92 s

C++, 1 thread:

Time: 16.523686170578003 seconds.

C++, 10 threads:

Time: 5.195289134979248 seconds.

These are not the performance improvements I’d hoped for.

Is there any chance that either of these are affecting the results?

  • Each row of the n-tuple is quite short, only seven variables.

  • I’m reading the file over NFS, which is likely to be the case for any students using our JupyterHub server.

So, from my perspective …

Assuming that NFS transfers something like 20 MB/s, you need roughly 100 seconds just to “download” the 2.1 GB of data (the ROOT Team tests “performance” on machines that have high-speed local SSD disks, so no time is “spent” here).

Your actual “action” on entries is very simple. So, I think ROOT will “spend” most of the time “unpacking” the tree branches.
Well, you could try to create your ROOT file with “compression” 0 (then at least the “uncompressing” time would disappear, but the file will be bigger, so the “downloading” will take longer).
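For instance, here is a minimal, untested sketch of how one might rewrite the n-tuple without compression via RDataFrame’s Snapshot (the output file name below is made up):

// Untested sketch: copy tree1 into a new file written without compression.
#include "ROOT/RDataFrame.hxx"

void make_uncompressed()
{
   ROOT::RDF::RSnapshotOptions opts;
   opts.fCompressionLevel = 0; // no compression: bigger file, no decompression cost

   ROOT::RDataFrame df("tree1", "experiment.root");
   // An empty column regexp ("") selects all columns.
   df.Snapshot("tree1", "experiment_uncompressed.root", "", opts);
}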

Maybe you could try to create some “very complicated” actions (which would need a significant CPU time per entry), and then the “multithreading” would show real improvements.

Note however that decompression is CPU time that is parallelized.

Nevertheless there is a lot to say about why the scaling behavior with the setup in this thread might not be ideal. NFS might very well play a role. I don’t know if I’ll manage to finish, today, the write-up I started yesterday – if not, early next week :sweat_smile: sorry

TL;DR

I’m not too surprised you don’t see great scaling with your setup. The cause is probably a mix of several of the potential issues I mention below, and this write-up should give some indication of how to get nicer results by cleaning up the benchmark. Given that you are reading from NFS, running with more threads than there are physical cores, measuring the full runtime of the application (which includes just-in-time compilation and other overheads), and depending on the CPU configuration (throttling etc.), it might be that this is the best RDF can do. At the end of the post I provide a not-completely-synthetic benchmark that, in controlled conditions, shows good scaling up to the number of physical cores on my laptop.


Alright, here we go!

The idea of EnableImplicitMT is that users just turn it on and trust ROOT to do what’s best.
That does not always mean actually using all available cores. And even when all available cores are used, sometimes the best possible scaling is still much worse than ideal.

In particular RDataFrame is optimized for large datasets and complex workloads and sometimes that means that small applications don’t benefit from implicit multi-threading that much (but those are also the applications that run in O(1 minute) and where absolute gains would be smaller).

With that said, in realistic scenarios we expect excellent (that is, not ideal but as-good-as-it-gets) performance and scaling, see e.g. [1], [2] (both presentations at this link have scaling plots for different benchmarks, some synthetic some realistic), [3] (slides 11-15 ), [4] (slides 7-9), [5].

Why doesn’t my program scale better?

Parallelization of typical HEP analysis workloads is a complex topic, but let’s try to look at the most important factors that contribute to it.

Granularity of the tasks

One thing that’s a frequent source of confusion is that just because N threads have been spawned, it does not mean all of them will have work to do.
This is typically the case for small inputs.

When running on TTrees or TChains: RDataFrame creates no more than one task per TTree cluster (tree->Print("clusters") displays information about entry clustering for a given TTree). We also try not to have more than N=10 tasks per worker, so multiple clusters could be processed by the same task (N can be tweaked via TTreeProcessorMT::SetTasksPerWorkerHint); this is because starting up and tearing down a task has a constant overhead, so we try to keep the total number of tasks low while still providing enough “meat” to the scheduler.
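As an illustration, here is a rough sketch (function and file names are just examples) of inspecting the clustering and tweaking the tasks-per-worker hint:

// Sketch: inspect the entry clusters of the input tree and tweak RDF's task granularity.
#include "TFile.h"
#include "TTree.h"
#include "ROOT/TTreeProcessorMT.hxx"
#include "ROOT/RDataFrame.hxx"

void check_granularity()
{
   TFile f("experiment.root");
   f.Get<TTree>("tree1")->Print("clusters"); // at most one task per cluster

   // Ask for roughly 20 tasks per worker instead of the default hint of 10.
   ROOT::TTreeProcessorMT::SetTasksPerWorkerHint(20);

   ROOT::EnableImplicitMT();
   ROOT::RDataFrame df("tree1", "experiment.root");
   // ... book actions and trigger the event loop as usual ...
}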

When running on empty entries (using the RDataFrame(n_entries) constructor), we schedule 2 tasks per thread, dividing n_entries equally between tasks.
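For completeness, a tiny sketch of that constructor (nothing specific to this thread’s file):

// Sketch: RDataFrame over "empty" entries (no input file); with IMT on,
// the entries are split into roughly 2 tasks per worker thread.
#include <iostream>
#include "ROOT/RDataFrame.hxx"

void empty_entries_example()
{
   ROOT::EnableImplicitMT();
   ROOT::RDataFrame df(10000000ULL); // 10 million entries, no input data
   auto h = df.Define("x", [](ULong64_t e) { return double(e % 1000); }, {"rdfentry_"})
              .Histo1D<double>("x");
   std::cout << h->GetEntries() << std::endl;
}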

When running on other data sources (RNTuple, CSV, SQLite), it’s the data source implementation that provides a certain number of separate “entry ranges”, and we schedule one task per entry range (i.e. it’s the data source that decides at which granularity tasks will run).

Constant overheads

The second most frequent source of confusion: for runtimes up to the order of a few seconds, overheads that are constant with the dataset size might dominate or account for a large part of the execution time. Only the event loop is actually parallelized, so the scaling of the total application wallclock time will look worse than the scaling of the event loop itself. The largest sources of overhead in RDataFrame are:

  • RDF’s just-in-time compilation phase
  • import ROOT and interpreter startup: /usr/bin/time python -c "import ROOT; ROOT.RDataFrame;" alone takes 1 second on my laptop

Until v6.26 there was also some initial thread contention when every thread called TFile::Open at the same time; this was resolved in later ROOT versions.

RDataFrame’s logging facility reports the real event loop time as well as the time spent in just-in-time compilation (see the section on execution logs in the RDataFrame documentation). The event loop time as reported by RDF is what is actually parallelized.
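For reference, a minimal sketch of how to activate that logging (as far as I recall this matches the snippet in the RDataFrame docs; adjust the headers if your ROOT version differs):

// Sketch: enable RDataFrame's informational logs, which report the time spent
// in just-in-time compilation and in the (parallelized) event loop.
#include "ROOT/RDataFrame.hxx"
#include "ROOT/RLogger.hxx"
#include <iostream>

void rdf_logging_example()
{
   auto verbosity = ROOT::Experimental::RLogScopedVerbosity(
      ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

   ROOT::RDataFrame df("tree1", "experiment.root");
   auto h = df.Filter("pz < 145").Histo1D("pz");
   std::cout << h->GetMean() << std::endl; // triggers the event loop; timings are logged
}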

Compiler/interpreter optimization level

Another tricky bit is that C++ code can be compiled at different optimization levels, and the ROOT interpreter does not have the best default.

Runtimes and scaling behavior can change significantly based on the compiler optimization level used for:

  • the RDF program/macro/script itself
  • the code that is just-in-time compiled by RDataFrame right before the event loop starts

For just-in-time compilation (and for the Python-generated C++ calls of PyROOT), the interpreter by default uses optimization level O1. Setting the environment variable EXTRA_CLING_ARGS='-O2' usually has a visible (sometimes important) effect. We plan to change the default to O2, but that’s not the case for now, and it means that RDF usage from Python, or with a lot of strings in Filters and Defines, has worse performance (and usually also worse scaling) than a fully compiled RDF application (one that uses C++ functions instead of strings and explicitly specifies template parameters, as in Histo1D<double>).

For compiled code, remember to pass -O2 as a compiler flag (or -DCMAKE_BUILD_TYPE=Release with cmake).
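To make the jitted-vs-compiled distinction concrete, here is a minimal sketch of the same booking written both ways (it assumes the pz branch holds doubles; adjust the type otherwise):

// Sketch: the same filter + histogram, written jitted (strings) and fully typed.
#include "ROOT/RDataFrame.hxx"

void jit_vs_typed()
{
   ROOT::RDataFrame df("tree1", "experiment.root");

   // Jitted: the string is compiled by the interpreter right before the event loop (default O1).
   auto h_jit = df.Filter("pz < 145").Histo1D("pz");

   // Fully typed: compiled with this program, at whatever -O level you pass to the compiler.
   auto h_typed = df.Filter([](double pz) { return pz < 145.; }, {"pz"})
                    .Histo1D<double>("pz");
}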

Hyperthreading

We expect good scaling in the event loop time (that is, excluding the overheads discussed above) up to the number of physical cores.
EnableImplicitMT() by default creates as many worker threads as there are logical cores, because hyper-threading usually still gives a little bit of a boost, but scaling will be far from ideal beyond the number of physical cores.
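If you want to check the effect yourself, EnableImplicitMT also accepts an explicit thread count, so you can cap the pool at the number of physical cores (the value below is just an example):

// Sketch: request an explicit number of worker threads instead of the default,
// which equals the number of logical cores.
ROOT::EnableImplicitMT(4); // e.g. the number of physical cores on your machine
std::cout << "Pool size = " << ROOT::GetThreadPoolSize() << std::endl;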

CPU throttling

Modern processors have a number of mechanisms to scale their frequency up and down in order to not overheat the machine and to provide a good balance between performance and power consumption. The kernel also has a CPU frequency “governor” (on Linux typically “powersave” by default, with “performance” being the other usual governor) that it uses to scale the CPU frequency based on load.

On my Intel laptop, in the BIOS configuration I can turn on/off 3 relevant settings: Intel Turbo-boost, Intel speed-step and hyperthreading.

The kernel’s CPU frequency governor can be set with cpupower frequency-set --governor performance (requires superuser privileges).

I/O bandwidth saturation

Sometimes, especially with a larger number of cores or with data stored on a single spinning disk (as opposed to an SSD or several spinning disks), it can happen that scaling is capped by the speed at which the storage layer can provide data. The rootreadspeed CLI tool, which comes with ROOT since v6.28, can be used to measure the read throughput available to a given application, to make sure it’s not capped by what the storage layer can do.

In particular, when multiple threads hit a single spinning disk, adding threads can perform much worse than the nominal maximum disk throughput, because seeking back and forth throughout the file to satisfy each thread’s request is slower than just providing a single thread with sequential chunks of data.
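An example invocation might look like the line below; the option names are from my memory of the 6.28 tool, so double-check with rootreadspeed --help on your installation:

rootreadspeed --files tree.root --trees tree1 --all-branches --threads 10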

Amdahl’s law

This is really just something to keep in mind to manage expectations.

The relationship between the effective percentage of a workload that can be perfectly parallelized and the consequent speed-up is somewhat counterintuitive. Amdahl’s law is the equation that governs it, and online calculators for it are easy to find.

As an example, if 90% of a workload is perfectly parallelized between 32 threads while the remaining 10% remains serial, what kind of speed-up would you expect? 30x? 28x? The answer is less than 8x.
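To spell out the arithmetic behind that number: with parallel fraction p = 0.9 and N = 32 threads, Amdahl’s law gives

S = 1 / ((1 - p) + p/N) = 1 / (0.1 + 0.9/32) = 1 / 0.128 ≈ 7.8

so even a 90%-parallel workload tops out below 8x on 32 threads.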

In the case of multi-thread RDF, for instance, the data processing part is parallelized (ignoring non-idealities such as bottlenecking on I/O or thread contention on memory allocations), but the merging of results coming from each thread is serial (and becomes larger the more threads there are, to complicate matters).

Enough talk, just show me something that scales

Alright!

I’m running the following benchmark A) on my laptop with turbo-boost, speed-step and hyper-threading turned off and the CPU frequency governor set to “performance” via cpupower frequency-set --governor performance and B) on a powerful machine with 64 physical cores (AMD EPYC 7702P). Also notably, nothing else is running on the laptop at the same time (certainly not a browser or a mail client! :smiley: ).

Before taking the times I ran the benchmark once in order to cache the input file in the filesystem cache (so that we will definitely not be limited by I/O throughput).

I am using a recent build of the ROOT master branch (cmake build type RelWithDebInfo, that is ROOT is compiled with -O2).

The following code:

#include "ROOT/RDataFrame.hxx"
#include "ROOT/RVec.hxx"
using ROOT::RVecI;
using ROOT::VecOps::InvariantMass;

void df102_NanoAODDimuonAnalysis()
{
   ROOT::RDataFrame df("Events", "Run2012BC_DoubleMuParked_Muons.root");

   auto df_2mu = df.Filter([](unsigned nMuon) { return nMuon == 2; }, {"nMuon"});
   auto df_os = df_2mu.Filter([](const RVecI &charges) { return charges[0] != charges[1]; }, {"Muon_charge"});
   auto df_mass = df_os.Define("Dimuon_mass", InvariantMass<float>, {"Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"});
   auto h = df_mass.Histo1D<float>({"Dimuon_mass", "Dimuon mass;m_{#mu#mu} (GeV);N_{Events}", 30000, 0.25, 300}, "Dimuon_mass");
   std::cout << h->GetMean() << '\n';
}

int main(int argc, char **argv) {
  unsigned n_threads = 0;
  if (argc > 1)
    n_threads = std::stoi(argv[1]);

  std::cout << "threads: " << n_threads << '\n';
  if (n_threads > 0)
    ROOT::EnableImplicitMT(n_threads);
   df102_NanoAODDimuonAnalysis();
}

compiled as g++ -g -Wall -Wextra -Wpedantic -o "df102_NanoAODDimuonAnalysis" "df102_NanoAODDimuonAnalysis.cpp" $(root-config --cflags --libs) -O2

then produces the following timings (in seconds; the columns are wallclock time, CPU time and %CPU usage):

A. Timings for my laptop
# WALL CPU %CPU
## 0 threads (no IMT)
43.87 43.37 99
44.19 43.72 99
44.35 43.89 99

## 1 thread (IMT activated but with only 1 thread)
44.47 43.97 99
44.17 43.68 99
44.19 43.68 99

## 2 threads
24.80 46.13 188
24.54 45.74 188
24.31 45.23 188

## 4
12.78 46.88 371
12.61 46.33 371
12.88 48.17 378

## 8
06.85 47.53 705
07.14 49.00 697
07.04 47.84 690

B. Timings for a beefy machine with 64 physical cores
# WALL CPU %CPU
## 0 threads (no IMT)
307 303 99

## 2 threads
154 304 199

## 4
85 323 388

## 8
43 322 767

## 16
22 328 1509

## 32
12 348 2964

## 64
8 438 5492

The input file can be downloaded from root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root.

Note that in this code all RDF calls are fully typed, so there is no just-in-time-compilation phase and the wallclock time of the program is very close to the event loop time. Otherwise I would have had to add RDF logging in order to grab the actual event loop time, and I would have had to set EXTRA_CLING_ARGS='-O2' to have the jitted code compiled at the O2 instead of the O1 optimization level (see above).


Thank you, Enrico. This helps me to understand what’s going on.

While your write-up is informative, it’s too much for my ROOT tutorial. I have to assume that my students are all undergraduates with (if I’m lucky) background solely in Python. Also, with respect to the rest of the tutorial, multi-threading is a minor point that’s not likely to be relevant for an undergrad-level summer work assignment.

So I think I’ll mention EnableImplicitMT, let them run it on my tiny experiment.root file to see that it actually makes things worse, then point them to a file that’s 1000 times bigger so they can see a difference. I’ll link to this forum post so they can follow up if they’re interested.

I would very much like to see this write-up somewhere in the ROOT documentation, so others don’t have to go down the same rabbit-hole that I did.

I completely agree! The write-up is really for you (and anyone else who might have the same question) – that way you can decide how to present, digest and omit the information to best fit your audience :slight_smile:

With that said, if I were you I would try to convey that adding EnableImplicitMT() at the top of the program a) tells ROOT that it can use all the available cores as it thinks best, and b) should provide a visible speed-up. I wouldn’t try to look for an example application that has close-to-ideal scaling, but I hope you can find an example that provides a sensible speed-up – that’s the expected effect in most situations.
