Scalability of RDataFrames on 16+ cores

Hi @eguiraud,

Thanks a lot for the quick reply. I just ran a quick test with root-readspeed on the largest instance type, which has 48 cores and 4 SSDs configured as RAID 0 (to maximize bandwidth). The machine has 380 GB of DRAM and I did some warm-up runs, so the benchmark should actually be reading from the OS cache. Just to be sure, I also ran the following test with dd (the RAID is mounted at /data/):

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ dd if=/data/input/Run2012B_SingleMu.root of=/dev/null bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 4.23678 s, 4.1 GB/s
$ dd if=/data/input/Run2012B_SingleMu.root of=/dev/null bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 2.59621 s, 6.6 GB/s

Now the results with root-readspeed (I took the list of branches from Q5):

for t in 1 2 4 8 12 24 48; do echo "Threads:                        $t"; ./src/root-readspeed --trees Events --files /data/Run2012B_SingleMu_65536000.root --branches nMuon Muon_pt Muon_eta Muon_phi Muon_mass Muon_charge MET_pt --threads $t; echo; done
Threads:                        1
Real time to setup MT run:      0.0879152 s
CPU time to setup MT run:       0.1 s
Real time:                      17.8905 s
CPU time:                       17.88 s
Uncompressed data read:         1916353044 bytes
Throughput:                     102.153 MB/s

Threads:                        2
Real time to setup MT run:      0.0875871 s
CPU time to setup MT run:       0.08 s
Real time:                      9.15649 s
CPU time:                       18.32 s
Uncompressed data read:         1916353044 bytes
Throughput:                     199.594 MB/s

Threads:                        4
Real time to setup MT run:      0.0845201 s
CPU time to setup MT run:       0.09 s
Real time:                      4.87768 s
CPU time:                       19.24 s
Uncompressed data read:         1916353044 bytes
Throughput:                     374.681 MB/s

Threads:                        8
Real time to setup MT run:      0.087266 s
CPU time to setup MT run:       0.09 s
Real time:                      2.86205 s
CPU time:                       22.36 s
Uncompressed data read:         1916353044 bytes
Throughput:                     638.556 MB/s

Threads:                        12
Real time to setup MT run:      0.086822 s
CPU time to setup MT run:       0.09 s
Real time:                      2.34745 s
CPU time:                       25.06 s
Uncompressed data read:         1916353044 bytes
Throughput:                     778.538 MB/s

Threads:                        24
Real time to setup MT run:      0.0854959 s
CPU time to setup MT run:       0.09 s
Real time:                      4.99729 s
CPU time:                       40.51 s
Uncompressed data read:         1916353044 bytes
Throughput:                     365.713 MB/s

Threads:                        48
Real time to setup MT run:      0.0869319 s
CPU time to setup MT run:       0.09 s
Real time:                      12.566 s
CPU time:                       66.22 s
Uncompressed data read:         1916353044 bytes
Throughput:                     145.439 MB/s

Going by the README of root-readspeed, I seem to be mostly in the “decompression is the bottleneck” category. However, even with that tool, scaling stops at around 8-12 threads.
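
If it is useful, I can also double-check the compression settings of the file, since the decompression cost depends on the algorithm. I assume a small macro along these lines would print them (the function name is just for illustration):

// Sketch: print the file's compression algorithm and level; the numeric
// algorithm code corresponds to ROOT::RCompressionSetting::EAlgorithm.
#include <TFile.h>
#include <iostream>
#include <memory>

void print_compression(const char *path = "/data/Run2012B_SingleMu_65536000.root")
{
   std::unique_ptr<TFile> f{TFile::Open(path)};
   std::cout << "compression algorithm: " << f->GetCompressionAlgorithm()
             << ", level: " << f->GetCompressionLevel() << std::endl;
}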

The file is about 17 GB in size, i.e., roughly 360 MB per physical core. Is that “too small”? As @swunsch found out, it seems to have enough clusters. (How can I check that myself, BTW?)
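
My guess is that a macro along the following lines would tell me, by walking the cluster boundaries with TTree::GetClusterIterator (please correct me if there is a simpler way):

// Sketch: count the clusters of the Events tree by iterating over the
// cluster start entries with TTree::GetClusterIterator.
#include <TFile.h>
#include <TTree.h>
#include <iostream>
#include <memory>

void count_clusters(const char *path = "/data/Run2012B_SingleMu_65536000.root")
{
   std::unique_ptr<TFile> f{TFile::Open(path)};
   auto *tree = f->Get<TTree>("Events");
   auto clusterIt = tree->GetClusterIterator(0);
   Long64_t clusterStart = 0, nClusters = 0;
   while ((clusterStart = clusterIt()) < tree->GetEntries())
      ++nClusters;
   std::cout << nClusters << " clusters for " << tree->GetEntries() << " entries" << std::endl;
}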

As you guessed, the 48 cores are split over 2 NUMA nodes; however, I also ran the above test pinned to a single NUMA node, and it shows the same scalability problem:

for t in 1 2 4 8 12 24; do echo "Threads:                        $t"; taskset -c 0-23 ./src/root-readspeed --trees Events --files /data/Run2012B_SingleMu_65536000.root --branches nMuon Muon_pt Muon_eta Muon_phi Muon_mass Muon_charge MET_pt --threads $t; echo; done
Threads:                        1
Real time to setup MT run:      0.0933809 s
CPU time to setup MT run:       0.09 s
Real time:                      19.3513 s
CPU time:                       18.81 s
Uncompressed data read:         1916353044 bytes
Throughput:                     94.442 MB/s

Threads:                        2
Real time to setup MT run:      0.085289 s
CPU time to setup MT run:       0.08 s
Real time:                      9.43973 s
CPU time:                       18.89 s
Uncompressed data read:         1916353044 bytes
Throughput:                     193.605 MB/s

Threads:                        4
Real time to setup MT run:      0.085273 s
CPU time to setup MT run:       0.09 s
Real time:                      5.16885 s
CPU time:                       20.41 s
Uncompressed data read:         1916353044 bytes
Throughput:                     353.575 MB/s

Threads:                        8
Real time to setup MT run:      0.085464 s
CPU time to setup MT run:       0.08 s
Real time:                      2.89653 s
CPU time:                       22.65 s
Uncompressed data read:         1916353044 bytes
Throughput:                     630.955 MB/s

Threads:                        12
Real time to setup MT run:      0.0852692 s
CPU time to setup MT run:       0.08 s
Real time:                      2.51951 s
CPU time:                       25.38 s
Uncompressed data read:         1916353044 bytes
Throughput:                     725.371 MB/s

Threads:                        24
Real time to setup MT run:      0.0857332 s
CPU time to setup MT run:       0.08 s
Real time:                      4.7662 s
CPU time:                       39.32 s
Uncompressed data read:         1916353044 bytes
Throughput:                     383.445 MB/s

I understand that the RDataFrame implementations of the benchmark are not optimized for performance. However, (1) their performance is still interesting, since there is generally a trade-off between implementation effort and achieved performance (though I agree that a more optimized implementation with a different trade-off would be very interesting to compare!), and (2) I do not yet see why that should hurt scalability rather than “just” overall (single-core) efficiency.
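
To illustrate point (2): with RDataFrame, the per-event logic enters through Define/Filter expressions and the actions, while the multi-threaded event loop itself is handled by the framework once ROOT::EnableImplicitMT is called. A minimal sketch of that structure (with a made-up filter and histogram, not the actual benchmark code):

// Sketch only: RDataFrame runs the multi-threaded event loop after
// ROOT::EnableImplicitMT; the user code just supplies expressions and actions.
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>

void sketch(unsigned int nThreads = 48)
{
   ROOT::EnableImplicitMT(nThreads);
   ROOT::RDataFrame df("Events", "/data/Run2012B_SingleMu_65536000.root");
   auto h = df.Filter("nMuon >= 2")
              .Histo1D({"met", ";MET (GeV);Events", 100, 0., 200.}, "MET_pt");
   h->Print(); // triggers the event loop
}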

Cheers,
Ingo