Hi @eguiraud,
thanks for your continued advice on this. I did some more tests and made some more observations, summarised here.
(0) I confirmed that there is no limitation on running multithreaded jobs on our system. I wrote a simple C++ program that randomly generates numbers and then adds them up (i.e. no I/O), and the execution time decreased as expected as the size of the thread pool was increased.
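For context, that check was along these lines (a minimal sketch rather than the exact program I ran; the names and sizes are illustrative):

```cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <string>
#include <thread>
#include <vector>

int main(int argc, char **argv)
{
   // number of worker threads, taken from the command line (default: 1)
   const unsigned nThreads = argc > 1 ? std::stoul(argv[1]) : 1ul;
   const std::size_t nPerThread = 100'000'000; // random numbers generated per thread

   std::vector<double> sums(nThreads, 0.);
   auto work = [&](unsigned i) {
      std::mt19937 gen(i);
      std::uniform_real_distribution<double> dist(0., 1.);
      double s = 0.;
      for (std::size_t j = 0; j < nPerThread; ++j)
         s += dist(gen);
      sums[i] = s; // each thread writes only its own slot, so no synchronisation is needed
   };

   const auto start = std::chrono::steady_clock::now();
   std::vector<std::thread> pool;
   for (unsigned i = 0; i < nThreads; ++i)
      pool.emplace_back(work, i);
   for (auto &t : pool)
      t.join();
   const auto stop = std::chrono::steady_clock::now();

   std::cout << "sum = " << std::accumulate(sums.begin(), sums.end(), 0.)
             << ", elapsed = "
             << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
             << " ms with " << nThreads << " threads\n";
   return 0;
}
```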
(1) I have realised that the way I was running the tests above, switching the number of threads within a single execution, isn't a valid approach, since the number of threads sticks at the first encountered setting of `EnableImplicitMT(ncores)`. So in the results I showed above, the timing is flat above two threads simply because the job was always running with 2 threads (apart from the first test, where no MT was set); subsequent calls to `EnableImplicitMT(ncores)` were ignored. I confirmed this by looking at `top -H` whilst the job was running: only two cores were active for all of the tests. I've now fixed this by restarting the program each time with an nThreads value set from the command line, and the results are now more understandable. See the new code below.
(2) I used fixed binning in the new tests, as you can see in the code snippet at the end of this post. With this change and the adjustment in (1), I get the following (again the mean of five tests per nThreads value, with times in milliseconds):
Cores = 0 : 265902 +/- 770.27
Cores = 2 : 220590 +/- 1365.78
Cores = 4 : 175986 +/- 1060.36
Cores = 8 : 401575 +/- 23488.7
Cores = 16 : 649170 +/- 23897.1
Cores = 32 : 1.7848e+06 +/- 17963.5
Cores = 64 : 4.95775e+06 +/- 235233
Cores = 128 : 1.20217e+07 +/- 978047
So you can see that the time decreases up to four threads, but adding more beyond that becomes increasingly counterproductive.
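For reference, the driver used for these timings has roughly the shape sketched below: the thread count comes in from the command line, so the process is restarted for every measurement, and the histogram uses fixed binning. The tree and branch names are the real ones from this thread; everything else (file list, binning, timing) is illustrative, and the actual snippet is at the end of the post.

```cpp
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char **argv)
{
   // thread count from the command line: 0 means "leave implicit MT off"
   const int nThreads = argc > 1 ? std::stoi(argv[1]) : 0;
   if (nThreads > 0)
      ROOT::EnableImplicitMT(nThreads); // must happen once, before the RDataFrame is constructed

   // illustrative input list; the real test runs over the full set of DAOD_PHYS files
   std::vector<std::string> files = {"DAOD_PHYS.29580128._000001.pool.root.1"};
   ROOT::RDataFrame df("CollectionTree", files);

   const auto start = std::chrono::steady_clock::now();
   // fixed binning, so RDataFrame does not have to buffer values to work out the axis range
   auto h = df.Histo1D({"phi", "phi", 64, -3.2, 3.2}, "InDetTrackParticlesAuxDyn.phi");
   h->GetEntries(); // dereferencing the result triggers the event loop
   const auto stop = std::chrono::steady_clock::now();

   std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
             << " ms with " << nThreads << " threads\n";
   return 0;
}
```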
(3) I also looked at the CPU utilisation via `top -H`. I observe that as the number of threads is increased, the activity of each committed core decreases, which is in line with the observation above. Of course this is just a snapshot, but I think it is instructive.
With 16 threads:
492314 jamesrc 20 0 4919656 3.9g 215148 S 30.3 0.2 7:58.49 a.out
492288 jamesrc 20 0 4919656 3.9g 215148 S 24.4 0.2 8:04.07 a.out
492316 jamesrc 20 0 4919656 3.9g 215148 R 21.2 0.2 8:07.33 a.out
492319 jamesrc 20 0 4919656 3.9g 215148 R 20.2 0.2 7:59.12 a.out
492315 jamesrc 20 0 4919656 3.9g 215148 S 19.9 0.2 7:54.09 a.out
492321 jamesrc 20 0 4919656 3.9g 215148 R 19.5 0.2 8:18.57 a.out
492323 jamesrc 20 0 4919656 3.9g 215148 S 19.2 0.2 8:04.62 a.out
492326 jamesrc 20 0 4919656 3.9g 215148 S 17.9 0.2 7:52.47 a.out
492322 jamesrc 20 0 4919656 3.9g 215148 R 17.3 0.2 7:52.98 a.out
492327 jamesrc 20 0 4919656 3.9g 215148 S 17.3 0.2 7:59.11 a.out
492320 jamesrc 20 0 4919656 3.9g 215148 S 16.6 0.2 7:59.09 a.out
492325 jamesrc 20 0 4919656 3.9g 215148 S 16.6 0.2 7:50.36 a.out
492328 jamesrc 20 0 4919656 3.9g 215148 S 16.6 0.2 7:53.66 a.out
492317 jamesrc 20 0 4919656 3.9g 215148 S 16.0 0.2 7:47.94 a.out
492318 jamesrc 20 0 4919656 3.9g 215148 S 16.0 0.2 7:52.97 a.out
492324 jamesrc 20 0 4919656 3.9g 215148 S 16.0 0.2 8:02.34 a.out
With 32 threads:
501194 jamesrc 20 0 7605448 5.9g 214888 R 11.4 0.3 3:38.88 a.out
501168 jamesrc 20 0 7605448 5.9g 214888 S 11.1 0.3 3:27.68 a.out
501172 jamesrc 20 0 7605448 5.9g 214888 R 11.1 0.3 3:27.20 a.out
501178 jamesrc 20 0 7605448 5.9g 214888 S 11.1 0.3 3:26.33 a.out
501184 jamesrc 20 0 7605448 5.9g 214888 R 11.1 0.3 3:27.79 a.out
501189 jamesrc 20 0 7605448 5.9g 214888 S 11.1 0.3 3:30.77 a.out
501142 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:30.91 a.out
501170 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:25.89 a.out
501171 jamesrc 20 0 7605448 5.9g 214888 R 10.7 0.3 3:26.65 a.out
501180 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:26.16 a.out
501181 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:26.88 a.out
501182 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:28.06 a.out
501185 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:27.78 a.out
501188 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:24.44 a.out
501190 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:25.81 a.out
501191 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:26.93 a.out
501192 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:24.29 a.out
501193 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:33.65 a.out
501196 jamesrc 20 0 7605448 5.9g 214888 R 10.7 0.3 3:26.19 a.out
501197 jamesrc 20 0 7605448 5.9g 214888 S 10.7 0.3 3:27.51 a.out
501169 jamesrc 20 0 7605448 5.9g 214888 S 10.4 0.3 3:26.11 a.out
501174 jamesrc 20 0 7605448 5.9g 214888 S 10.4 0.3 3:24.42 a.out
501176 jamesrc 20 0 7605448 5.9g 214888 S 10.4 0.3 3:25.84 a.out
501177 jamesrc 20 0 7605448 5.9g 214888 S 10.4 0.3 3:26.15 a.out
(4) I re-ran a few of the tests using `/usr/bin/time`, and the results are as follows:
1 thread:
Time measured by C++: 267865ms
264.00user 3.23system 4:28.11elapsed 99%CPU (0avgtext+0avgdata 724748maxresident)k
0inputs+32outputs (0major+139217minor)pagefaults 0swaps
2 threads:
Time measured by C++: 222339ms
326.68user 10.22system 3:42.63elapsed 151%CPU (0avgtext+0avgdata 1024552maxresident)k
0inputs+32outputs (0major+1221848minor)pagefaults 0swaps
Time measured by C++: 353561ms
448.77user 449.42system 5:53.88elapsed 253%CPU (0avgtext+0avgdata 2064396maxresident)k
0inputs+16outputs (0major+6354196minor)pagefaults 0swaps
In each case the elapsed time matches the time measured by the program itself (which is a nice sanity check). You can see that there are diminishing returns as the number of cores increases.
(5) I set up `root-readspeed`; I'm not 100% sure how to interpret the results, and also whether the results depend on how many files are passed to it (as they would be in a real test) - I used five inputs, which is about 100K rows in total. This is the command (with `--threads` varied for each run):
./build/src/root-readspeed --trees CollectionTree --files /storage/jamesrc/validation/DAOD_PHYS.29580128._000001.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000002.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000003.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000004.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000005.pool.root.1 --branches InDetTrackParticlesAuxDyn.phi --threads 1
These are the results:
1 thread
Total number of tasks: 10
Thread pool size: 1
Real time to setup MT run: 0.347169 s
CPU time to setup MT run: 0.35 s
Real time: 0.407732 s
CPU time: 0.4 s
Uncompressed data read: 13539652 bytes
Compressed data read: 12436253 bytes
Uncompressed throughput: 31.6689 MB/s
31.6689 MB/s/thread for 1 threads
Compressed throughput: 29.0881 MB/s
29.0881 MB/s/thread for 1 threads
2 threads
Total number of tasks: 20
Thread pool size: 2
Real time to setup MT run: 0.34556 s
CPU time to setup MT run: 0.34 s
Real time: 0.411653 s
CPU time: 0.81 s
Uncompressed data read: 13539652 bytes
Compressed data read: 12436253 bytes
Uncompressed throughput: 31.3673 MB/s
15.6836 MB/s/thread for 2 threads
Compressed throughput: 28.811 MB/s
14.4055 MB/s/thread for 2 threads
4 threads
Total number of tasks: 40
Thread pool size: 4
Real time to setup MT run: 0.35015 s
CPU time to setup MT run: 0.35 s
Real time: 0.459726 s
CPU time: 1.76 s
Uncompressed data read: 13539652 bytes
Compressed data read: 12436253 bytes
Uncompressed throughput: 28.0872 MB/s
7.0218 MB/s/thread for 4 threads
Compressed throughput: 25.7983 MB/s
6.44957 MB/s/thread for 4 threads
8 threads
Total number of tasks: 80
Thread pool size: 8
Real time to setup MT run: 0.350529 s
CPU time to setup MT run: 0.35 s
Real time: 0.499285 s
CPU time: 3.82 s
Uncompressed data read: 13539652 bytes
Compressed data read: 12611871 bytes
Uncompressed throughput: 25.8618 MB/s
3.23273 MB/s/thread for 8 threads
Compressed throughput: 24.0897 MB/s
3.01121 MB/s/thread for 8 threads
16 threads
Total number of tasks: 160
Thread pool size: 16
Real time to setup MT run: 0.345709 s
CPU time to setup MT run: 0.35 s
Real time: 0.642795 s
CPU time: 9.33 s
Uncompressed data read: 13539652 bytes
Compressed data read: 12980994 bytes
Uncompressed throughput: 20.0879 MB/s
1.2555 MB/s/thread for 16 threads
Compressed throughput: 19.2591 MB/s
1.20369 MB/s/thread for 16 threads
32 threads
Total number of tasks: 180
Thread pool size: 32
Real time to setup MT run: 0.339456 s
CPU time to setup MT run: 0.33 s
Real time: 0.737031 s
CPU time: 12.46 s
Uncompressed data read: 13539652 bytes
Compressed data read: 12980994 bytes
Uncompressed throughput: 17.5195 MB/s
0.547485 MB/s/thread for 32 threads
Compressed throughput: 16.7966 MB/s
0.524895 MB/s/thread for 32 threads
64 threads
Total number of tasks: 180
Thread pool size: 64
Real time to setup MT run: 0.346924 s
CPU time to setup MT run: 0.35 s
Real time: 1.12878 s
CPU time: 19.92 s
Uncompressed data read: 13539652 bytes
Compressed data read: 12980994 bytes
Uncompressed throughput: 11.4393 MB/s
0.178739 MB/s/thread for 64 threads
Compressed throughput: 10.9673 MB/s
0.171364 MB/s/thread for 64 threads
If I understand the notes on GitHub correctly, this implies that decompression is the bottleneck, since by the time we get to 64 threads we have a real time of 1.13 s but a CPU time per thread of only about 0.3 s (19.92 s / 64 threads).
I guess this means that if we want to get better performance, we should probably look at our compression settings. In fact, this data is in the DAOD_PHYS format (analogous to CMS's miniAOD), and in "real life" we are more likely to be passing DAOD_PHYSLITE (somewhat like nanoAOD) through RDataFrame, so it's likely that we can optimise the compression differently. I'll talk with our core software team about this.
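If compression really is the limiting factor, one thing I could try (a sketch only, using RDataFrame's Snapshot with RSnapshotOptions; the ZSTD/level-5 choice below is just an example, not a recommendation) is to rewrite the branch of interest with different settings and then point root-readspeed at the output:

```cpp
#include <ROOT/RDataFrame.hxx>
#include <ROOT/RSnapshotOptions.hxx>
#include <Compression.h>

// Rewrite one branch of the DAOD with different compression settings,
// producing a small file that root-readspeed can be pointed at.
void recompress()
{
   ROOT::RDF::RSnapshotOptions opts;
   opts.fCompressionAlgorithm = ROOT::RCompressionSetting::EAlgorithm::kZSTD; // example choice
   opts.fCompressionLevel = 5;                                                // example choice

   ROOT::RDataFrame df("CollectionTree", "DAOD_PHYS.29580128._000001.pool.root.1");
   df.Snapshot("CollectionTree", "recompressed.root", {"InDetTrackParticlesAuxDyn.phi"}, opts);
}
```

Running the same `root-readspeed --trees CollectionTree --files recompressed.root --branches InDetTrackParticlesAuxDyn.phi --threads N` command on the rewritten file should then show whether the per-thread throughput scales differently.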
In the meantime, in case you want to take a look, I put one of the files here:
/afs/cern.ch/work/j/jcatmore/public/DAOD_PHYS.29580128._000001.pool.root.1
As you can tell from the above, the main data payload is in `CollectionTree`, and the branch I've been looking at is called `InDetTrackParticlesAuxDyn.phi`, which goes from -pi to pi.
I also wondered about processing the same file repeatedly, to avoid having to download all of these files, but I was worried about caching skewing the results (I don't have sudo rights on this system, so I can't flush the cache). I guess I could share all of the files with you, but it is 2.5 TB in total, which might be a bit much.
Anyway, thanks again for your help!
James