Hi @eguiraud,
Thanks a lot for the quick reply. I just ran a quick test with root-readspeed
on the largest instance types, which have 48 cores and 4 SSDs configured as RAID 0 (to maximize bandwidth). The machine has 380 GB of DRAM and I did some warm-up runs, so the benchmark should actually read from the OS cache. Just to be sure the storage itself is not the limit, I also ran the following test with dd (the RAID is mounted at /data/):
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
$ dd if=/data/input/Run2012B_SingleMu.root of=/dev/null bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 4.23678 s, 4.1 GB/s
$ dd if=/data/input/Run2012B_SingleMu.root of=/dev/null bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 2.59621 s, 6.6 GB/s
Now the results with root-readspeed. I took the list of branches from Q5.
$ for t in 1 2 4 8 12 24 48; do echo "Threads: $t"; ./src/root-readspeed --trees Events --files /data/Run2012B_SingleMu_65536000.root --branches nMuon Muon_pt Muon_eta Muon_phi Muon_mass Muon_charge MET_pt --threads $t; echo; done
Threads: 1
Real time to setup MT run: 0.0879152 s
CPU time to setup MT run: 0.1 s
Real time: 17.8905 s
CPU time: 17.88 s
Uncompressed data read: 1916353044 bytes
Throughput: 102.153 MB/s
Threads: 2
Real time to setup MT run: 0.0875871 s
CPU time to setup MT run: 0.08 s
Real time: 9.15649 s
CPU time: 18.32 s
Uncompressed data read: 1916353044 bytes
Throughput: 199.594 MB/s
Threads: 4
Real time to setup MT run: 0.0845201 s
CPU time to setup MT run: 0.09 s
Real time: 4.87768 s
CPU time: 19.24 s
Uncompressed data read: 1916353044 bytes
Throughput: 374.681 MB/s
Threads: 8
Real time to setup MT run: 0.087266 s
CPU time to setup MT run: 0.09 s
Real time: 2.86205 s
CPU time: 22.36 s
Uncompressed data read: 1916353044 bytes
Throughput: 638.556 MB/s
Threads: 12
Real time to setup MT run: 0.086822 s
CPU time to setup MT run: 0.09 s
Real time: 2.34745 s
CPU time: 25.06 s
Uncompressed data read: 1916353044 bytes
Throughput: 778.538 MB/s
Threads: 24
Real time to setup MT run: 0.0854959 s
CPU time to setup MT run: 0.09 s
Real time: 4.99729 s
CPU time: 40.51 s
Uncompressed data read: 1916353044 bytes
Throughput: 365.713 MB/s
Threads: 48
Real time to setup MT run: 0.0869319 s
CPU time to setup MT run: 0.09 s
Real time: 12.566 s
CPU time: 66.22 s
Uncompressed data read: 1916353044 bytes
Throughput: 145.439 MB/s
From the README of root-readspeed, it seems like I am usually in the “decompression is the bottleneck” category. However, scalability is also limited to 8-12 cores with that tool.
The file is about 17 GB, i.e. roughly 360 MB per physical core. Is that “too small”? As @swunsch found out, it seems to have enough clusters. (How can I check that myself, BTW?)
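For what it's worth, here is a rough sketch of how I would try to count the clusters myself with TTree::GetClusterIterator (assuming that is the right interface; file and tree names as in the runs above); please correct me if there is a simpler way:

// count_clusters.C: rough sketch, assuming TTree::GetClusterIterator is the
// right interface for this. Run with: root -l -b -q count_clusters.C
#include <TFile.h>
#include <TTree.h>
#include <iostream>
#include <memory>

void count_clusters()
{
   std::unique_ptr<TFile> f{TFile::Open("/data/Run2012B_SingleMu_65536000.root")};
   auto *t = f->Get<TTree>("Events");
   const Long64_t nEntries = t->GetEntries();

   auto clusterIt = t->GetClusterIterator(0);
   Long64_t nClusters = 0;
   // Each call advances to the next cluster and returns its first entry;
   // once the last cluster has been passed, it returns nEntries.
   while (clusterIt() < nEntries)
      ++nClusters;

   std::cout << nClusters << " clusters for " << nEntries << " entries\n";
}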
As you guessed, the 48 cores are split over 2 NUMA nodes; however, I also ran the above test pinned to a single NUMA node, and it shows the same scalability problem:
$ for t in 1 2 4 8 12 24; do echo "Threads: $t"; taskset -c 0-23 ./src/root-readspeed --trees Events --files /data/Run2012B_SingleMu_65536000.root --branches nMuon Muon_pt Muon_eta Muon_phi Muon_mass Muon_charge MET_pt --threads $t; echo; done
Threads: 1
Real time to setup MT run: 0.0933809 s
CPU time to setup MT run: 0.09 s
Real time: 19.3513 s
CPU time: 18.81 s
Uncompressed data read: 1916353044 bytes
Throughput: 94.442 MB/s
Threads: 2
Real time to setup MT run: 0.085289 s
CPU time to setup MT run: 0.08 s
Real time: 9.43973 s
CPU time: 18.89 s
Uncompressed data read: 1916353044 bytes
Throughput: 193.605 MB/s
Threads: 4
Real time to setup MT run: 0.085273 s
CPU time to setup MT run: 0.09 s
Real time: 5.16885 s
CPU time: 20.41 s
Uncompressed data read: 1916353044 bytes
Throughput: 353.575 MB/s
Threads: 8
Real time to setup MT run: 0.085464 s
CPU time to setup MT run: 0.08 s
Real time: 2.89653 s
CPU time: 22.65 s
Uncompressed data read: 1916353044 bytes
Throughput: 630.955 MB/s
Threads: 12
Real time to setup MT run: 0.0852692 s
CPU time to setup MT run: 0.08 s
Real time: 2.51951 s
CPU time: 25.38 s
Uncompressed data read: 1916353044 bytes
Throughput: 725.371 MB/s
Threads: 24
Real time to setup MT run: 0.0857332 s
CPU time to setup MT run: 0.08 s
Real time: 4.7662 s
CPU time: 39.32 s
Uncompressed data read: 1916353044 bytes
Throughput: 383.445 MB/s
I understand that the RDataFrame implementations of the benchmark are not optimized for performance. However, (1) the performance is still interesting, since there is generally a trade-off between implementation effort and obtained performance (though I agree that a more optimized implementation with a different trade-off would be very interesting to compare!), and (2) I do not yet see why that should affect scalability rather than “just” general (single-core) efficiency.
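To make point (2) a bit more concrete: the kind of check I have in mind is a deliberately trivial RDataFrame event loop over the same file, something like the sketch below (just an illustration, not the benchmark code; file, tree and branch names as in the root-readspeed runs above), whose scaling with ROOT::EnableImplicitMT should depend essentially on the framework and the file layout rather than on the analysis logic:

// rdf_scaling.C: minimal sketch (not the benchmark implementation) of a trivial
// RDataFrame event loop, to look at framework scaling independently of the
// analysis code. Run e.g. with: root -l -b -q 'rdf_scaling.C(12)'
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <TStopwatch.h>
#include <iostream>

void rdf_scaling(unsigned int nThreads = 1)
{
   if (nThreads > 1)
      ROOT::EnableImplicitMT(nThreads);

   ROOT::RDataFrame df("Events", "/data/Run2012B_SingleMu_65536000.root");
   // Touch two of the branches used in the root-readspeed runs above.
   auto hMet = df.Histo1D({"met", ";MET_pt;events", 100, 0., 200.}, "MET_pt");
   auto hNMuon = df.Histo1D({"nmuon", ";nMuon;events", 20, 0., 20.}, "nMuon");

   TStopwatch sw;
   sw.Start();
   hMet.GetValue(); // triggers the (multi-threaded) event loop for both histograms
   sw.Stop();

   std::cout << "threads=" << nThreads << "  real=" << sw.RealTime() << " s  cpu="
             << sw.CpuTime() << " s  entries=" << hMet->GetEntries() << '\n';
}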
Cheers,
Ingo