Performance of RDataFrame on KNL


ROOT Version: v6.18/00
Platform: Linux
Compiler: g++


Dear ROOT experts,

I am trying to port my analysis to RDataFrame. My plan is to run on KNL nodes with multi-threading enabled, as they are largely unused at my institute’s cluster. From what I’ve seen reported [1], I should expect roughly linear scaling with the number of threads; that is, doubling the threads should (roughly) double the performance.

However, I am having trouble reproducing those results. I start from a data source with a single column and 100M events. I try both on-the-fly generation and a stored TTree as the data source, to see what effect reading from disk has. After a few arbitrary Defines and a Filter, I process the data frame using Stats. I also have a version where I specify the Defines using C++ lambdas instead of strings+JIT. The general structure of the code is as follows.

#include <iostream>
#include <chrono>
#include <ROOT/RDataFrame.hxx>

void dftest(uint32_t threads=0)
{
  ROOT::EnableImplicitMT(threads);

  ROOT::RDataFrame r(100000000);
  auto rr = r.Define("v", "rdfentry_")
             .Define("w", "return 1./(v+1)")
             .Define("x", "v*w")
             .Filter("v%100==0");

  ROOT::RDF::RResultPtr<TStatistic> stats_iw = rr.Stats("x", "w");


  auto start = std::chrono::high_resolution_clock::now();
  stats_iw->Print();
  auto finish = std::chrono::high_resolution_clock::now();

  std::chrono::duration<double> elapsed=finish-start;
  
  std::cout << threads << "\t" << elapsed.count() << std::endl;
}

The macros for other tests are attached.

dfotfnojit.C (779 Bytes) gendftest.C (435 Bytes) dfotf.C (643 Bytes) dffile.C (626 Bytes) dftest.py (836 Bytes)

I’ve run this for different arguments to EnableImplicitMT on a KNL node that was allocated exclusively to me (i.e. this was the only big process running). The results are shown here.

[plot: event-loop time vs number of threads for the different tests]

The gains from the additional threads are not great. In the on-the-fly example, the speed-up from running with 64 threads (the CPU has 68 physical cores) versus a single thread is only about 3x.

Am I setting everything up correctly? I do see the CPU utilization of the test reach threads*100%, so the setup seems OK.

Is this expected? Maybe my example is too simple and most of the time is spent merging results from the threads.

Related to the simplicity of the example, is there any extra overhead? I saw a 2x improvement when compiling the Define functions (as C++ lambdas) instead of using JIT. Maybe there is some extra per-thread setup?
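In sketch form, the non-jitted variant looks roughly like this (the actual dfotfnojit.C attached above may differ in details, and I am assuming here that Stats takes template arguments in the same way Sum does):

#include <iostream>
#include <chrono>
#include <ROOT/RDataFrame.hxx>

void dfotfnojit(uint32_t threads = 0)
{
  ROOT::EnableImplicitMT(threads);

  ROOT::RDataFrame r(100000000);
  // all expressions are compiled C++ callables, so nothing is just-in-time compiled
  auto rr = r.Define("v", [](ULong64_t e) { return e; }, {"rdfentry_"})
             .Define("w", [](ULong64_t v) { return 1. / (v + 1); }, {"v"})
             .Define("x", [](ULong64_t v, double w) { return v * w; }, {"v", "w"})
             .Filter([](ULong64_t v) { return v % 100 == 0; }, {"v"});

  // template arguments give the column types explicitly, avoiding jitted type inference
  auto stats_iw = rr.Stats<double, double>("x", "w");

  auto start = std::chrono::high_resolution_clock::now();
  stats_iw->Print();
  auto finish = std::chrono::high_resolution_clock::now();

  std::chrono::duration<double> elapsed = finish - start;
  std::cout << threads << "\t" << elapsed.count() << std::endl;
}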

Is it possible to run the JIT on the dataframe graph in a single thread before processing the data?

There is also a considerable slowdown when reading from a file beyond 50 threads. I haven’t looked much into this, since I would like to understand the on-the-fly generation first. The input file is not large (134 MB) and is stored on relatively fast storage. The file also has 10000 clusters, so there are plenty of tasks to be split among the threads. But it is quite possible that the file is not written optimally, since I had to manually tweak AutoFlush to increase the number of clusters; the default settings resulted in only 20.
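For reference, the file is generated roughly like this (the attached gendftest.C may differ in details; the file name and branch name here are just placeholders). The manual SetAutoFlush(10000) call is what yields the 10000 clusters:

#include <TFile.h>
#include <TTree.h>

void gendftest()
{
  TFile f("dftest.root", "RECREATE");
  TTree t("t", "t");

  ULong64_t v = 0;
  t.Branch("v", &v);

  // start a new cluster every 10000 entries instead of relying on the default
  // size-based AutoFlush, which produced only ~20 clusters for this tree
  t.SetAutoFlush(10000);

  for (ULong64_t i = 0; i < 100000000; ++i) {
    v = i;
    t.Fill();
  }

  t.Write();
  f.Close();
}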


Karol Krizka

[1] https://indico.cern.ch/event/587955/contributions/2937534/attachments/1683046/2704767/RDataframe__CHEP.pdf


Hi Karol,
this is very interesting, and definitely unexpected, especially in the case with no disk access. As you say, let’s try to understand that case first.

I’d start with the simplest test possible, make sure we understand that, and work from there by incrementally making it more complicated. The simplest (and most forgiving) test possible is on-the-fly data generation and processing, with one “fat” non-jitted Define and a non-jitted Sum action. Importantly, if you care about performance, as in this case, you should not execute ROOT macros but compile the code into an executable (with -O3 optimization).

#include <chrono>
#include <iostream>
#include <ROOT/RDataFrame.hxx>

// simulate some per-event work in a compiled (non-jitted) Define
double FatDefine() {
  volatile int a = 0;
  // these might be too many iterations
  // the point is that we want to simulate some work
  for (int i = 0; i < 500; ++i)
    ++a;
  return 42;
}

int main() {
  ROOT::EnableImplicitMT(); // or pass the desired number of threads

  ROOT::RDataFrame df(100000000);
  // note the template argument to Sum<double>: nothing is just-in-time compiled
  auto result = df.Define("x", FatDefine).Sum<double>("x");

  auto start = std::chrono::high_resolution_clock::now();
  *result; // this triggers the (multi-thread) event loop
  auto finish = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed = finish - start;

  std::cout << elapsed.count() << " s" << std::endl;
}

RDataFrame just-in-time compiles your expressions in a single thread before processing the data, but that overhead (constant with the number of threads and the number of entries) is indeed counted as part of the event loop in your snippet, and apparently it accounts for a large fraction of the total runtime. I suggest we focus on runs with nothing just-in-time compiled at first.

Why does the C++ (on-the-fly, no JIT) line start at a point and then suddenly go down?

Pinging @amadio and @xvallspl for KNL-specific suggestions (e.g. you might want to try CPU pinning to avoid too strong NUMA effects).

In the meantime, thank you for the feedback!
Enrico


Hi @eguiraud,

I gave your example a try and I get very similar results. The first column is the number of threads and the second the time in seconds it took to execute the example.

1       19.8535
2       25.9802
4       19.9489
8       15.7371
16      11.9369
32      6.07841
64      3.17857
128     1.80281
256     1.47805
272     1.35839

I also tried running the example on a Haswell node and received similar results.

1       1.60788
2       18.3308
4       13.8879
8       12.5712
16      10.584
32      5.0262
64      4.20023

In both cases, I ran the test using an executable compiled with g++ dfsimple.cpp -o dfsimple $(root-config --ldflags --cflags --libs) -O3. One thing to note is that both the example and ROOT itself were compiled on the login node of our cluster, which has a Haswell CPU.

dfsimple.cpp (602 Bytes)

Your observation of the nthreads=1 case being much faster is very reproducible and independent of the order in which I run the thread scan. I am not sure why this is happening. Maybe it is a hint to the slowdown I am seeing when running with more threads?


Karol Krizka

Hi,
thank you for trying things out! This is unexpected, and I’m not sure what the cause is, but I can reproduce it on a Xeon Phi. Preliminary profiling suggests suboptimal memory access patterns might be the cause, but I don’t understand why we’d see it only now for these benchmarks (as per your first post, we have run other things on KNL before, with way better results).

I will definitely look into this a bit more, but I’m not sure I’ll be able to provide a solution anytime soon, sorry about that.

EDIT: note that the same benchmarks (both your dfsimple.cpp and the code I proposed before) scale well up to 8 threads (not many, but still the difference with KNL is quite stark) on my workstation and my laptop.

Cheers,
Enrico

Some follow-up as @Axel and I looked into this for a while this afternoon:

As far as we can tell, the problem is that, due to RDF’s memory layout and access pattern, there is a high amount of contention between threads when performing loads/stores of thread-local state (e.g. the last result of a Define, the value of a thread’s partial sum, etc.). That state is not shared between threads, but the cache lines in which it is stored are (a classic false-sharing situation). This might not be the only effect (NUMA effects might play a role at higher core counts, for example), but it seems to be the determining factor for these small benchmarks.
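As a toy illustration of the effect (this is just a sketch, unrelated to RDF’s actual internals, and it assumes 64-byte cache lines): two threads that each increment their own counter slow down dramatically when the two counters happen to sit on the same cache line, even though no data is logically shared.

#include <chrono>
#include <cstdio>
#include <thread>

// the two counters share a cache line
struct Packed { volatile long a = 0, b = 0; };
// each counter gets its own 64-byte cache line
struct Padded { alignas(64) volatile long a = 0; alignas(64) volatile long b = 0; };

template <typename Counters>
double run()
{
  Counters c;
  auto start = std::chrono::high_resolution_clock::now();
  // each thread only ever touches its own counter
  std::thread t1([&c] { for (long i = 0; i < 100000000; ++i) ++c.a; });
  std::thread t2([&c] { for (long i = 0; i < 100000000; ++i) ++c.b; });
  t1.join();
  t2.join();
  auto finish = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double>(finish - start).count();
}

int main()
{
  std::printf("same cache line:      %f s\n", run<Packed>());
  std::printf("separate cache lines: %f s\n", run<Padded>());
}

On typical hardware the padded version runs several times faster; in RDF the analogous per-thread state (Define results, partial sums) ends up similarly close together in memory.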

The good news is that, if we are right, scaling should get better as the analysis starts performing more and more meaningful work per event (e.g. disk I/O!). [Disk I/O, however, also does not scale to hundreds of cores that share the same disk.]

The still-good-but-less-than-the-other news is that we have an idea how to fix or largely mitigate this problem, but it will take a bit. We are currently targeting v6.20 for at least an important mitigation of the problem.

@Axel please feel free to add anything!

Cheers,
Enrico

EDIT: here’s the repo with the benchmarks we used, for future reference

EDIT2: this patch to ROOT master completely resolves the issue for your dfsimple.cpp benchmark (it’s in no way production code, just a proof of concept). What it does is increase the distance (in address space) between the storage used by the various threads. It would be great if you could try to apply it and confirm that it also resolves the issue for you (at least for dfsimple.cpp: it’s a very targeted patch). If yes, then we understand the problem well.
