TL;DR
- ROOT I/O does not scale well on the original dataset (though it does better now, with a few patches I added). However, runtimes of O(few seconds) at high core counts and different behavior with larger datasets make this a “weird” corner of the phase space
- on an artificially enlarged dataset, `root-readspeed` scales well, but the RDF macros showed a different scaling problem (a long-standing false-sharing issue); with a dedicated patch, things look good for the RDF macros too
- compiled versions of the RDF macros that @swunsch kindly implemented improve runtimes by up to 3x
Summary
Running on the original dataset, neither `root-readspeed` nor the opendata benchmarks scale very well, but after fixing a few issues I got some improvements: an 18.7x speed-up at 64 cores w.r.t. the single-core run with `root-readspeed`, and a 15.9x speed-up with the benchmark8 macro. In both cases, CPU utilization is low (e.g. `root-readspeed` at 64 cores only yields 3336% average CPU utilization), which indicates threads are scheduled out of CPUs because of lock waits or similar (it could also be because of I/O waits, but that's unlikely in this setup). I am still investigating exactly what causes this: the fact that it happens with `root-readspeed` points to an issue with ROOT I/O; however, total runtimes in this scenario are on the order of a few seconds, where load imbalance and lock contention at thread start-up play a much more important role than in longer-running analyses. And anyway, 7 seconds is a perfectly acceptable runtime for a quick turnaround cycle, as long as larger workloads do not suffer from the same issue.
So I ran the tests with an artificially enlarged dataset (90x the original data for `root-readspeed`, 20x the original data for the opendata benchmarks). In that setting, CPU utilization is good (e.g., for `root-readspeed` with the latest patches, 6086% at 64 cores), and scaling of `root-readspeed` is good (~46x speed-up with 64 cores w.r.t. the single-core wall-clock time; looking at Amdahl's law, that is very good scaling: as if 99.4% of the original single-thread workload was perfectly parallelized).
The RDF benchmarks, however, showed good CPU utilization but worse scaling than `root-readspeed`. `perf` points to a false-sharing problem in `RFilter::CheckFilters` and `RDefine::Update`. With a patch that removes the false sharing, the RDF macros are now also well behaved: benchmark #8 now shows a 38x speed-up w.r.t. the single-core version for the macro, and 35x for the compiled, optimized version (which is also 3x faster; see the sketch below for what “compiled” means here).
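As a side note on terminology: the macros as distributed use string expressions that RDF just-in-time compiles at runtime, while the compiled versions pass typed C++ callables that the optimizer can inline. Here is a minimal sketch of the difference; the cut and histogram are illustrative, not benchmark #8's actual analysis:

```cpp
// Jitted vs. compiled RDF usage; the cut and histogram below are illustrative.
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <iostream>

int main() {
  ROOT::EnableImplicitMT(); // one worker thread per available core

  ROOT::RDataFrame df("Events", "data/Run2012B_SingleMu.root");

  // Jitted: the string expression is parsed and compiled at runtime by cling.
  auto hJitted = df.Filter("nMuon >= 2")
                   .Histo1D({"pt_jit", "Muon p_{T}", 100, 0., 200.}, "Muon_pt");

  // Compiled: a typed callable that the optimizer can inline
  // (nMuon is a UInt_t in this dataset, hence the unsigned int parameter).
  auto hCompiled = df.Filter([](unsigned int nMuon) { return nMuon >= 2; }, {"nMuon"})
                     .Histo1D({"pt", "Muon p_{T}", 100, 0., 200.}, "Muon_pt");

  // Accessing a result triggers the (single) multi-thread event loop.
  std::cout << hJitted->GetMean() << " " << hCompiled->GetMean() << '\n';
}
```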
Benchmark setup
Data always read from the filesystem cache.
`root-readspeed` invocation as above:

```
./root-readspeed/build/src/root-readspeed --trees Events --files data/Run2012B_SingleMu.root \
  --branches nMuon Muon_pt Muon_eta Muon_phi Muon_mass Muon_charge MET_pt \
  --threads $t
```
Using `/usr/bin/time` to collect CPU %usage.
Machine CPU specs:

```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    64
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7702P 64-Core Processor
Stepping:              0
CPU MHz:               2000.000
CPU max MHz:           3353.5149
CPU min MHz:           1500.0000
BogoMIPS:              3992.45
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-127
```
Max speed reached when reading the data from the filesystem cache with `dd` is ~12 GB/s:

```
$ dd if=../data/Run2012B_SingleMu.root of=/dev/null bs=200k
84482+1 records in
84482+1 records out
17301918071 bytes (17 GB, 16 GiB) copied, 1.39704 s, 12.4 GB/s
```
Benchmark outputs and further discussion
`root-readspeed` scaling with the latest patches on a dataset 90x the original
| #threads | Throughput (MB/s) | %CPU |
| -------- | ----------------- | ---- |
| 0        | 95.6228           | 99   |
| 2        | 187.517           | 199  |
| 4        | 369.332           | 392  |
| 8        | 743.988           | 780  |
| 16       | 1463.24           | 1550 |
| 32       | 2752.07           | 3085 |
| 48       | 3699.3            | 4602 |
| 64       | 4412.84           | 6063 |
This is a 46x speed-up with 64 threads with respect to the single-thread case. That is very good according to Amdahl's law: it's as if 99.4% of the single-thread runtime was perfectly parallelized between the 64 threads (a number of effects might come into play here to the advantage of the multi-thread run, but the point remains that the observed scaling is very good in this case).
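In formulas, that number comes from solving Amdahl's law for the parallel fraction $p$, given the measured speed-up $S \approx 46$ on $N = 64$ threads:

$$
S = \frac{1}{(1-p) + p/N}
\quad\Longrightarrow\quad
p = \frac{N}{N-1}\left(1 - \frac{1}{S}\right) = \frac{64}{63}\left(1 - \frac{1}{46}\right) \approx 0.994
$$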
`root-readspeed` scaling on the original dataset
Before patches:
| #threads | Throughput (MB/s) | %CPU |
| -------- | ----------------- | ---- |
| 0        | 93.255            | 99   |
| 2        | 185.494           | 194  |
| 4        | 356.754           | 378  |
| 8        | 675.031           | 716  |
| 16       | 1181.63           | 1310 |
| 32       | 1210.60           | 1602 |
| 48       | 1113.97           | 1673 |
| 64       | 1069.62           | 1810 |
After patches:
| #threads | Throughput (MB/s) | %CPU |
| -------- | ----------------- | ---- |
| 0        | 94.10             | 99   |
| 2        | 182.49            | 195  |
| 4        | 364.39            | 381  |
| 8        | 676.34            | 713  |
| 16       | 1210.47           | 1309 |
| 32       | 1947.17           | 2214 |
| 48       | 2375.57           | 2858 |
| 64       | 2516.84           | 3336 |
That's a 26x speed-up with 64 threads. Not terrible, and definitely better than a slow-down (as in some of the cases @ingomueller.net reported), but not fantastic either.
I can see that CPU usage is low with little data and improves as the amount of data grows. Running `root-readspeed` with a fixed 64 threads:
| #files | Throughput (MB/s) | %CPU |
| ------ | ----------------- | ---- |
| 1      | 2495.52           | 3309 |
| 2      | 2880.08           | 4259 |
| 5      | 3603.36           | 5190 |
| 10     | 3984.13           | 5594 |
| 20     | 4210.87           | 5829 |
| 40     | 4389.58           | 5971 |
| 60     | 4455.99           | 6021 |
| 80     | 4423.99           | 6012 |
| 90     | 4467.27           | 6086 |
One part of the problem is a high amount of lock contention at start-up, when all 64 threads want to prepare ROOT I/O for reading the data; given that the total runtime is O(1s), that warm-up phase weighs heavily. Workload imbalance might also play a role, with tails in which many threads have run out of work lasting a significant fraction of a total runtime of a few seconds.
However, I'd like the computer to tell me exactly why CPU usage is low: why are the threads going idle? What are they waiting on? The best I could do was this off-CPU flamegraph, which should summarize the reasons threads are scheduled out of the CPUs, but it does not correspond much to my mental model of what's happening and is definitely missing information, so meh. This story isn't over (but it only concerns workloads that take a few seconds on 48/64 CPUs).
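For reference, this is roughly the reading pattern at play, sketched here with `ROOT::TTreeProcessorMT` (a minimal sketch, not `root-readspeed`'s actual code; file and branch names as in the benchmark): each task sets up its own reader before touching data, and that per-task setup is what all 64 threads contend on at start-up.

```cpp
// Sketch of ROOT's task-based multi-thread reading: each task independently
// initializes a TTreeReader over its own range of clusters, which is the
// warm-up phase where many threads can contend on locks at once.
#include <ROOT/TTreeProcessorMT.hxx>
#include <TROOT.h>
#include <TTreeReader.h>
#include <TTreeReaderValue.h>
#include <atomic>
#include <cstdio>

int main() {
  ROOT::EnableImplicitMT(64); // spawn up to 64 worker threads

  std::atomic<unsigned long long> muonsSeen{0};

  ROOT::TTreeProcessorMT processor("data/Run2012B_SingleMu.root", "Events");
  processor.Process([&muonsSeen](TTreeReader &reader) {
    // Per-task setup: this is the start-up work each task repeats.
    TTreeReaderValue<unsigned int> nMuon(reader, "nMuon");
    unsigned long long local = 0;
    while (reader.Next()) // deserializes this task's entry range
      local += *nMuon;
    muonsSeen += local;
  });

  std::printf("total muons: %llu\n", muonsSeen.load());
}
```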
Open data RDF benchmark #8
Taking benchmark #8 (one of those with bad scaling over 16 cores in Ingo’s tests), here’s what I see with the original workload (these are wall-clock times for the execution of the full script):
| threads | time (s) | speed-up | %CPU  |
| ------- | -------- | -------- | ----- |
| 1       | 114.23   | 1.00     | 99%   |
| 2       | 68.82    | 1.65     | 192%  |
| 4       | 40.67    | 2.81     | 365%  |
| 8       | 24.07    | 4.75     | 667%  |
| 16      | 15.29    | 7.47     | 1175% |
| 32      | 9.85     | 11.60    | 1855% |
| 48      | 8.60     | 13.28    | 2298% |
| 64      | 8.01     | 14.26    | 2668% |
Here RDF suffers from the same low-CPU-usage issue that we saw for `root-readspeed`, but the extra work performed during the event loop (especially for the more “beefy” tasks) makes it a bit less of a concern. With a 20x bigger dataset:
| threads | time (s) | speed-up | %CPU  |
| ------- | -------- | -------- | ----- |
| 1       | 2312.99  | 1.00     | 99%   |
| 2       | 1329.10  | 1.74     | 197%  |
| 4       | 738.78   | 3.13     | 391%  |
| 8       | 398.64   | 5.80     | 778%  |
| 16      | 220.58   | 10.48    | 1541% |
| 32      | 120.25   | 19.23    | 3024% |
| 48      | 87.42    | 26.46    | 4481% |
| 64      | 70.08    | 33.00    | 5878% |
So, at least on this machine, I don't actually see scaling as terrible as what Ingo initially reported, but the 14x speed-up with 64 threads on the original dataset is still not great. Luckily, this time CPU %usage was good in all cases, so the problem must have been different from the issue `root-readspeed` encounters (which causes low CPU %usage), and I took a look at whether something could be done.
Results with a fix for false sharing issues in RDataFrame
As @amadio pointed out several times before, RDF has false-sharing problems: in each thread, Filters and Defines read and write different but contiguous elements of `std::vector<T>`s. If the workload of the filters and defines is relatively small (as happens, for instance, with `Filter("AdditionalLepton_pt != -999")`), the false sharing becomes a bottleneck.
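To make the mechanism concrete, here is a minimal, self-contained sketch of the pattern and of a padding-based fix (an illustration assuming 64-byte cache lines, not the actual RDF code or the patch itself):

```cpp
// Each thread updates only its own element of a contiguous vector, but
// neighbouring elements live on the same cache line, so the line ping-pongs
// between cores. Padding each per-thread slot to a full cache line removes
// the contention.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineSize = 64; // assumption: 64-byte cache lines

// One counter per thread, padded so that each slot owns a full cache line.
struct alignas(kCacheLineSize) PaddedSlot {
  std::uint64_t value = 0;
};

int main() {
  const unsigned nThreads = std::max(2u, std::thread::hardware_concurrency());

  // Problematic layout: adjacent per-thread counters share cache lines.
  std::vector<std::uint64_t> contended(nThreads, 0);
  // Fixed layout: one cache line per per-thread counter.
  std::vector<PaddedSlot> padded(nThreads);

  std::vector<std::thread> workers;
  for (unsigned slot = 0; slot < nThreads; ++slot) {
    workers.emplace_back([&, slot] {
      for (std::uint64_t i = 0; i < 50'000'000; ++i)
        ++padded[slot].value; // swap in ++contended[slot] to see the slowdown
    });
  }
  for (auto &w : workers)
    w.join();

  std::printf("padded[0] = %llu, contended[0] = %llu\n",
              static_cast<unsigned long long>(padded[0].value),
              static_cast<unsigned long long>(contended[0]));
}
```

Compile with `-std=c++17` or later so the over-aligned `PaddedSlot` elements are allocated with the correct alignment. The actual patch applies the same idea inside RDF's internals.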
With the patch applied, the numbers for the original workload become:
| threads | time (s) | speed-up | %CPU  |
| ------- | -------- | -------- | ----- |
| 1       | 119.00   | 1.00     | 99%   |
| 2       | 67.48    | 1.76     | 191%  |
| 4       | 37.40    | 3.18     | 359%  |
| 8       | 21.26    | 5.60     | 652%  |
| 16      | 12.93    | 9.20     | 1091% |
| 32      | 9.00     | 13.22    | 1680% |
| 48      | 7.88     | 15.10    | 2097% |
| 64      | 7.48     | 15.91    | 2355% |
Here the false sharing is fixed, so CPU usage actually drops a bit as we go back to the ROOT I/O bottleneck.
For the 20x bigger dataset:
| threads | time (s) | speed-up | %CPU  |
| ------- | -------- | -------- | ----- |
| 1       | 2328.16  | 1.00     | 99%   |
| 2       | 1271.88  | 1.83     | 196%  |
| 4       | 660.75   | 3.53     | 394%  |
| 8       | 349.16   | 6.67     | 778%  |
| 16      | 182.78   | 12.74    | 1535% |
| 32      | 100.28   | 23.21    | 3017% |
| 48      | 74.34    | 31.32    | 4432% |
| 64      | 61.09    | 38.11    | 5835% |
Much better!
Conclusions
Please don't use the opendata benchmarks as an indicator of the performance of real-world analysis tasks at high core counts: they were not designed as such, and they happen to hit a corner of the phase space where ROOT I/O performs pretty badly (at high core counts). The performance on these benchmarks is simply not indicative of the performance you will get with a realistic workload, unless your workload is really as small as in those benchmarks, in which case you will get total runtimes of O(few seconds), which I guess is totally ok anyway.
We will upstream the optimized versions of the RDF benchmarks; the `root-readspeed` patches are already in, and the RDataFrame patches will be merged in master as soon as possible (they might need some tweaks to make them prettier). We now track both the version with just-in-time compilation and the fully compiled one for each benchmark as part of rootbench, so I might reply here again in case we introduce performance optimizations that impact them.
Ingo, it would be great if you could check what performance you see for the RDF benchmarks with the RDF patch applied.