Simple way to test/prove EnableImplicitMT performance

TL;DR

I’m not too surprised you don’t see great scaling with your setup. The cause is probably a mix of several of the potential issues I mention below, and this write-up should give some indication of how to get nicer results by cleaning up the benchmark. Between reading from NFS, running with more threads than the number of physical cores, measuring the full runtime of the application (which includes just-in-time compilation and other overheads) and the CPU configuration (CPU throttling etc.), it might simply be that this is the best RDF can do in your environment. At the end of the post I provide a not-completely-synthetic benchmark that, in controlled conditions, shows good scaling up to the number of physical cores on my laptop.


Alright, here we go!

The idea of EnableImplicitMT is that users just turn it on and trust ROOT to do what’s best.
That does not always mean actually using all available cores, and even when all available cores are used, sometimes the best possible scaling is still much worse than ideal.

In particular, RDataFrame is optimized for large datasets and complex workloads, and sometimes that means small applications don’t benefit much from implicit multi-threading (but those are also the applications that run in O(1 minute), where the absolute gains would be small anyway).

With that said, in realistic scenarios we expect excellent (that is, not ideal but as-good-as-it-gets) performance and scaling, see e.g. [1], [2] (both presentations at this link have scaling plots for different benchmarks, some synthetic some realistic), [3] (slides 11-15 ), [4] (slides 7-9), [5].

Why doesn’t my program scale better?

Parallelization of typical HEP analysis workloads is a complex topic, but let’s look at the main factors that contribute to the performance you observe.

Granularity of the tasks

One thing that’s a frequent source of confusion is that just because N threads have been spawned, it does not mean all of them will have work to do.
This is typically the case for small inputs.

When running on TTrees or TChains: RDataFrame creates no more than one task per TTree cluster (tree->Print("clusters") displays information about entry clustering for a given TTree). We also try not to have more than N=10 tasks per worker, so multiple clusters could be processed by the same task (N can be tweaked via ROOT::TTreeProcessorMT::SetTasksPerWorkerHint); this is because starting up and tearing down a task has a constant overhead, so we try to keep the total number of tasks low while still providing enough “meat” to the scheduler.
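For concreteness, here is a minimal sketch (with hypothetical file and tree names) of how to inspect a tree’s clustering and tweak the tasks-per-worker hint mentioned above:

#include <TFile.h>
#include <TTree.h>
#include <ROOT/TTreeProcessorMT.hxx>
#include <memory>

void inspect_clusters()
{
   // Print the entry clusters of a TTree: each multi-thread task processes
   // one or more whole clusters, never a fraction of a cluster.
   std::unique_ptr<TFile> f(TFile::Open("data.root")); // hypothetical file name
   auto *tree = f->Get<TTree>("Events");               // hypothetical tree name
   tree->Print("clusters");

   // Ask for more (smaller) tasks per worker thread than the default hint of 10,
   // e.g. to improve load balancing at the cost of a bit more per-task overhead.
   ROOT::TTreeProcessorMT::SetTasksPerWorkerHint(20);
}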

When running on empty entries (using the RDataFrame(n_entries) constructor), we schedule 2 tasks per thread, dividing n_entries equally between tasks.

When running on other data sources (RNTuple, CSV, SQLite), it’s the data source implementation that provides a certain number of separate “entry ranges”, and we schedule one task per entry range (i.e. it’s the data source that decides the granularity at which tasks will run).

Constant overheads

The second most frequent source of confusion: for runtimes up to the order of a few seconds, overheads that are constant with the dataset size might dominate or account for a large part of the execution time. In that case only a fraction of the total runtime is actually parallelized (the event loop), so the scaling of the total application wallclock time will look worse than the scaling of the event loop itself. The largest sources of overhead in RDataFrame are:

  • RDF’s just-in-time compilation phase
  • import ROOT and interpreter startup: /usr/bin/time python -c "import ROOT; ROOT.RDataFrame;" alone takes 1 second on my laptop

Until v6.26 there was also some initial thread contention when every thread called TFile::Open at the same time; this has been resolved in later ROOT versions.

RDataFrame logging reports the actual event loop time as well as the time spent in just-in-time compilation; a sketch of how to enable it is shown below. The event loop time as reported by RDF is what is actually parallelized.
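This is a minimal sketch of how to enable that logging (namespaces as documented for recent 6.2x releases; they may differ in other ROOT versions):

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RLogger.hxx>
#include <iostream>

int main()
{
   // Raise the verbosity of the RDF log channel for as long as this object is alive:
   // the event loop time and the just-in-time compilation time are then printed.
   auto verbosity = ROOT::Experimental::RLogScopedVerbosity(
      ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

   ROOT::RDataFrame df(1000);
   auto h = df.Define("x", [] { return 1.f; }).Histo1D<float>("x");
   std::cout << h->GetMean() << '\n';
}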

Compiler/interpreter optimization level

Another tricky bit is that C++ code can be compiled at different optimization levels, and the ROOT interpreter does not have the best default.

Runtimes and scaling behavior can change significantly based on the compiler optimization level used for:

  • the RDF program/macro/script itself
  • the code that is just-in-time compiled by RDataFrame right before the event loop starts

For just-in-time compiled code (and for the Python-generated C++ calls of PyROOT), the interpreter uses optimization level O1 by default. Setting the environment variable EXTRA_CLING_ARGS='-O2' usually has a visible (sometimes important) effect. We plan to change the default to O2, but that’s not the case yet, which means that RDF usage from Python, or with a lot of strings in filters and defines, has worse performance (and usually also worse scaling) than a fully compiled RDF application, i.e. one that uses C++ functions instead of strings and explicitly specifies template parameters as in Histo1D<double>.

For compiled code, remember to pass -O2 as a compiler flag (or -DCMAKE_BUILD_TYPE=Release with CMake).
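To make the distinction concrete, here is a minimal sketch contrasting the two styles (the column name and cut are made up for illustration):

#include <ROOT/RDataFrame.hxx>

void jitted_vs_typed()
{
   ROOT::RDataFrame df(1000);
   auto dfx = df.Define("x", [] { return 2.0; });

   // Jitted: the strings below are compiled by the interpreter at run time,
   // at the optimization level set via EXTRA_CLING_ARGS (O1 by default).
   auto h_jit = dfx.Filter("x > 1").Histo1D("x");

   // Fully typed: C++ callables and explicit template arguments, no jitting;
   // this code is compiled together with the rest of the program (so use -O2).
   auto h_typed = dfx.Filter([](double x) { return x > 1.; }, {"x"}).Histo1D<double>("x");

   h_jit->GetEntries();   // triggers the event loop (both histograms are filled here)
   h_typed->GetEntries();
}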

Hyperthreading

We expect good scaling in the event loop time (that is, excluding the overheads discussed above) up to the number of physical cores.
EnableImplicitMT() by default creates as many worker threads as there are logical cores, because hyper-threading usually still gives a small boost, but scaling will be far from ideal beyond the number of physical cores.
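If you want to look at scaling up to the physical cores only, you can cap the number of workers yourself. A minimal sketch (the core count below is machine-specific and just an assumption for illustration):

#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <iostream>

int main()
{
   const unsigned nPhysicalCores = 4;      // assumed value; check e.g. lscpu on your machine
   ROOT::EnableImplicitMT(nPhysicalCores); // at most this many worker threads

   ROOT::RDataFrame df(1'000'000);
   auto h = df.Define("x", [] { return 42; }).Histo1D<int>("x");
   std::cout << h->GetEntries() << '\n';
}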

CPU throttling

Modern processors have a number of mechanisms to scale their frequency up and down in order not to overheat the machine and to provide a good balance between performance and power consumption. The kernel also has a CPU frequency “governor” (on Linux typically “powersave” by default, with “performance” being the other usual choice) that it uses to scale the CPU frequency based on load.

On my Intel laptop, the BIOS configuration lets me turn on/off three relevant settings: Intel Turbo Boost, Intel SpeedStep and hyper-threading.

The kernel’s CPU frequency governor can be set with cpupower frequency-set --governor performance (requires superuser privileges).

I/O bandwidth saturation

Sometimes, especially with a larger number of cores or with data stored on a single spinning disk (as opposed to an SSD or several spinning disks), it can happen that scaling is capped by the speed at which the storage layer can provide data. The rootreadspeed CLI tool, which comes with ROOT since v6.28, can be used to measure the read throughput achievable on a given dataset and make sure the application is not capped by what the storage layer can do.

In particular, with multiple threads hitting a single spinning disk, the aggregate throughput can end up much lower than the nominal maximum disk throughput, because seeking back and forth throughout the file to satisfy each thread’s requests is slower than just providing a single thread with sequential chunks of data.

Amdahl’s law

This is really just something to keep in mind to manage expectations.

The relationship between the effective percentage of a workload that can be perfectly parallelized and the resulting speed-up is somewhat counterintuitive. Amdahl’s law is the equation that governs it, and online calculators are available.

As an example, if 90% of a workload is perfectly parallelized between 32 threads while the remaining 10% remains serial, what kind of speed-up would you expect? 30x? 28x? The answer is less than 8x.
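For the record, that number comes straight out of the formula, with parallel fraction p = 0.9 and N = 32 threads:

speedup = 1 / ((1 - p) + p/N) = 1 / (0.1 + 0.9/32) ≈ 7.8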

In the case of multi-thread RDF, for instance, the data processing part is parallelized (ignoring non-idealities such as bottlenecking on I/O or thread contention on memory allocations), but the merging of results coming from each thread is serial (and becomes larger the more threads there are, to complicate matters).

Enough talk, just show me something that scales

Alright!

I’m running the following benchmark A) on my laptop, with Turbo Boost, SpeedStep and hyper-threading turned off and the CPU frequency governor set to “performance” via cpupower frequency-set --governor performance, and B) on a powerful machine with 64 physical cores (AMD EPYC 7702P). Also notably, nothing else is running on the laptop at the same time (certainly not a browser or a mail client! :smiley: ).

Before taking the times I ran the benchmark once in order to cache the input file in the filesystem cache (so that we will definitely not be limited by I/O throughput).

I am using a recent build of the ROOT master branch (CMake build type RelWithDebInfo, that is, ROOT is compiled with -O2).

The following code:

#include "ROOT/RDataFrame.hxx"
#include "ROOT/RVec.hxx"
using ROOT::RVecI;
using ROOT::VecOps::InvariantMass;

void df102_NanoAODDimuonAnalysis()
{
   ROOT::RDataFrame df("Events", "Run2012BC_DoubleMuParked_Muons.root");

   auto df_2mu = df.Filter([](unsigned nMuon) { return nMuon == 2; }, {"nMuon"});
   auto df_os = df_2mu.Filter([](const RVecI &charges) { return charges[0] != charges[1]; }, {"Muon_charge"});
   auto df_mass = df_os.Define("Dimuon_mass", InvariantMass<float>, {"Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"});
   auto h = df_mass.Histo1D<float>({"Dimuon_mass", "Dimuon mass;m_{#mu#mu} (GeV);N_{Events}", 30000, 0.25, 300}, "Dimuon_mass");
   std::cout << h->GetMean() << '\n';
}

int main(int argc, char **argv) {
  unsigned n_threads = 0;
  if (argc > 1)
    n_threads = std::stoi(argv[1]);

  std::cout << "threads: " << n_threads << '\n';
  if (n_threads > 0)
    ROOT::EnableImplicitMT(n_threads);
  df102_NanoAODDimuonAnalysis();
}

compiled as g++ -g -Wall -Wextra -Wpedantic -o "df102_NanoAODDimuonAnalysis" "df102_NanoAODDimuonAnalysis.cpp" $(root-config --cflags --libs) -O2

then produces the following timings (in seconds; the columns are wallclock time, CPU time and %CPU usage):

A. Timings for my laptop
# WALL CPU %CPU
## 0 threads (no IMT)
43.87 43.37 99
44.19 43.72 99
44.35 43.89 99

## 1 thread (IMT activated but with only 1 thread)
44.47 43.97 99
44.17 43.68 99
44.19 43.68 99

## 2 threads
24.80 46.13 188
24.54 45.74 188
24.31 45.23 188

## 4
12.78 46.88 371
12.61 46.33 371
12.88 48.17 378

## 8
06.85 47.53 705
07.14 49.00 697
07.04 47.84 690

B. Timings for a beefy machine with 64 physical cores
# WALL CPU %CPU
## 0 threads (no IMT)
307 303 99

## 2 threads
154 304 199

## 4
85 323 388

## 8
43 322 767

## 16
22 328 1509

## 32
12 348 2964

## 64
8 438 5492

The input file can be downloaded from root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root.

Note that in this code all RDF calls are fully typed, so there is no just-in-time compilation phase and the wallclock time of the program is very close to the event loop time. Otherwise I would have had to enable RDF logging in order to grab the actual event loop time, and I would have had to set EXTRA_CLING_ARGS='-O2' to have the jitted code compiled at O2 instead of O1 optimization level (see above).
