Understanding performance of EnableImplicitMT in RDataFrame

Hello,

I’ve observed that using EnableImplicitMT seems to slow down a simple RDataFrame operation such as filling a histogram from a single column. This is the same issue (and the same computing cluster) that was reported by my colleague in this ticket:

The only difference is that in my case I’m using compiled C++ rather than Python. A stripped down version of my code is as follows:

#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <TStyle.h>
#include "TCanvas.h"
#include "TH1D.h"
#include "TF1.h"
#include <iostream>
#include <chrono>

void make_histos()
{
    // Run in batch mode - output is pdf
    gROOT->SetBatch(kTRUE);
    // No stats box
    gStyle->SetOptStat(0);
    // Run parallel RDF operations
    ROOT::EnableImplicitMT();

    // Create RDataFrame
    ROOT::RDataFrame refr("CollectionTree", "./p5201/*");

    // Canvas (for PDF) 
    auto c1 = new TCanvas("c1", "DAOD comparison");

    // Loop over the required columns and plot them for each sample along with the ratio
    // Write resulting plots to a pdf file
    const char * colNameCh = "InDetTrackParticlesAuxDyn.phi";
    auto start = std::chrono::high_resolution_clock::now();
    auto h1 = refr.Histo1D(colNameCh);
    h1->Draw();
    auto stop = chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    c1->Print("p5201_p5205.pdf");
    std::cout << "Duration: " << duration.count() << std::endl;
}

The input file I’m using (an ATLAS DAOD) can be found here:

/afs/cern.ch/work/j/jcatmore/public/DAOD_PHYS.29580128._000001.pool.root.1

I observe the following timings (in microseconds, as printed by the code above) for the step where the data is extracted from the RDF column and written into a histogram:

No EnableImplicitMT: 618966
EnableImplicitMT(2): 616910
EnableImplicitMT(4): 629247
EnableImplicitMT(8): 631385
EnableImplicitMT(): 634479

So, broadly, the execution time increases with the number of threads and the worst (or almost worst) performance is when the number of threads is left unspecified.

Reading previous posts on this topic, the issue of there being too few clusters in the file to get decent parallelism is often mentioned, so I checked this for the example file here:

******************************************************************************
*Tree    :CollectionTree: CollectionTree                                         *
*Entries :    18000 : Total =      5468072149 bytes  File  Size =  949057834 *
*        :          : Tree compression factor =   5.76                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry           Size   Number of clusters
0                0                17999                 500         36
Total number of clusters: 36 
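For reference, a listing like the one above can be produced with TTree::Print("clusters"); a minimal sketch, using the file shared above:

#include <TFile.h>
#include <TTree.h>

// Minimal sketch: print the cluster layout of CollectionTree for one input file.
void print_clusters()
{
    TFile f("/afs/cern.ch/work/j/jcatmore/public/DAOD_PHYS.29580128._000001.pool.root.1");
    auto tree = f.Get<TTree>("CollectionTree");
    tree->Print("clusters"); // prints entry ranges and the number of clusters per range
}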

Note that in reality I am trying to read around 50 files like the one above, and I observe the same behaviour, but much more pronounced.

I’m using a 256 core machine - full lscpu details are at the end of the post. Finally, one other detail that may be pertinent is that I’m running in a CentOS 7 singularity container rather than on the base system OS.

If you think that this issue is likely due to something in our local system configuration, I’ll follow this up with our admins, but it would be useful to have some pointers about what I should tell them, so if you can advise, this would be very helpful!

Thanks and best wishes,

James Catmore, University of Oslo & ATLAS

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             2250.000
CPU max MHz:         2250.0000
CPU min MHz:         1500.0000
BogoMIPS:            4500.03
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255


ROOT Version: 6.24/06
Platform: CentOS7 in Singularity container
Compiler: GCC 8.3.0


Hello @jcatmore ,

thank you for the detailed report.

Can you reproduce the issue outside of that computing cluster?

Some hypotheses regarding what the problem could be (it could very well be a combination of these + something else I did not think of):

  1. Task overhead. Given that every task does little work, it might be that the time it takes to set a task up and tear it down (e.g. opening/closing the file it needs to read) is comparable to the time the actual work takes. In that case, more threads mean more tasks and therefore more global overhead. In 6.24.06, by default we try to spawn a maximum of 24 tasks per thread. In v6.26 that was reduced to 10 tasks per thread precisely because fewer tasks == less task setup/teardown overhead (but we still want a certain number of tasks per thread because of load balancing and other considerations). You could try limiting the number of tasks to, say, 1 or 2 per thread by calling the static method ROOT::TTreeProcessorMT::SetTasksPerWorkerHint(2). You could also check whether things still get worse with more threads when you always run with the same total number of tasks, e.g. with ROOT::EnableImplicitMT(n); ROOT::TTreeProcessorMT::SetTasksPerWorkerHint(8 / n); (see the sketch after this list).

  2. NUMA effects. Does this still happen if you pin the execution to just some CPUs, e.g. with taskset? (I don’t know how taskset and singularity interact, but the idea is that you want to run on a specific set of physical CPUs that live on the same NUMA domain rather than allowing the host and its kernel to migrate your threads between CPUs in different NUMA domains)

  3. Threads frequently being migrated between CPUs, for example because of settings regarding the priority/“nice-ness” of jobs running inside singularity containers.
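To make point 1 concrete, here is a minimal sketch of running with a (roughly) fixed total number of tasks while varying the thread count; the total of 8 tasks is only an example value, and the paths/column name are taken from your post:

#include <ROOT/RDataFrame.hxx>
#include <ROOT/TTreeProcessorMT.hxx>

// Sketch: keep the total number of tasks roughly constant (here ~8) while
// varying the number of threads n, so that task setup/teardown overhead does
// not grow with the thread count. Assumes n <= 8.
void run_with_fixed_task_count(unsigned int n)
{
    ROOT::EnableImplicitMT(n);
    ROOT::TTreeProcessorMT::SetTasksPerWorkerHint(8 / n);

    ROOT::RDataFrame df("CollectionTree", "./p5201/*");
    auto h = df.Histo1D("InDetTrackParticlesAuxDyn.phi");
    h->Draw(); // first access to the result triggers the (multi-thread) event loop
}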

Let us know if any of this helps, or otherwise whether you can reproduce this outside of that specific cluster.

Cheers,
Enrico

P.S.
I assume that runtimes such as 618966 are in milliseconds. If those are instead microseconds, that’s just too small of a runtime for RDF multi-threading to make sense.

P.S. 2

…but the situation might still improve by forcing fewer tasks, as mentioned above, and by switching to v6.26.04, which reduces some start-up thread contention when many threads are used

P.S. 3

@vpadulan pointed out that the chrono timer should be around h1->Draw and not around refr.Histo1D(colNameCh) (which just books the filling of the histogram, but RDF is lazy and won’t start the event loop until any result is accessed for the first time), which trumps everything else I said :upside_down_face:
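For reference, a minimal sketch of the corrected measurement (paths and column name as in your post): the Histo1D call only books the operation, and the event loop runs the first time the result is accessed, here via Draw().

#include <ROOT/RDataFrame.hxx>
#include <chrono>
#include <iostream>

// Sketch: time the event loop rather than the lazy booking of the histogram.
void time_event_loop()
{
    ROOT::RDataFrame df("CollectionTree", "./p5201/*");
    auto h1 = df.Histo1D("InDetTrackParticlesAuxDyn.phi"); // booked only, nothing runs yet

    auto start = std::chrono::high_resolution_clock::now();
    h1->Draw(); // first access to the result: this triggers the event loop
    auto stop = std::chrono::high_resolution_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "Event loop duration: " << ms.count() << " ms" << std::endl;
}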

Hi @eguiraud , @vpadulan ,

thanks for the fast replies! Good point about the positioning of the timing measurement - I forgot about the lazy evaluation. I re-ran with this adjustment. I also processed 1M events rather than 18K to give the job some more substance, which I think should address your point (1) above. I observe the same trend, but more pronounced (confirming what I observed when I’ve been trying to use this for real):

No MT:    12 089 993
MT(2):    18 752 123
MT(4):    31 889 555
MT(8):    66 976 548 
MT(16):  176 405 504 
MT(): still running after many minutes

The units are indeed microseconds rather than milliseconds.

I’ll try to repeat the exercise on a desktop machine, and will also try to pin the tasks to certain CPUs as you suggested in point (2). I’ll also see if I can run this outside of a container (which will mean using CentOS 8).

Thanks again!

James

v6.26 introduced some scaling improvements that might be relevant here, so it would be nice to see what the situation is there, because it might be very different (unless you are reading via xrootd, in which case we first need to fix `READ_WITHOUT_GLOBALREGISTRATION` has no effect on remote files · Issue #10742 · root-project/root · GitHub). Reducing the number of tasks should also help. Let us know how the tests go :slight_smile:

Hi again @eguiraud ,

I tried this again, a bit more systematically. This is what I did now:

  • System: CentOS8, no container
  • ROOT version 6.26/04, local installation
  • Different machine to the one in the previous test, with no other users, but basically the same set up (see end)
  • Processed 10 million events over 544 input files stored on a local disk
  • Read in a single column, as before
  • Ran without MT and with 2, 4, 8, 16, and 32 cores
  • Ran with and without SetTasksPerWorkerHint(1);
  • Repeated each test 5 times and recorded the mean
  • Tried to pin the task to a given set of cores using the following command:
taskset -c 0-31 ./run_test

The results are as follows. All values are in milliseconds. Cores = 0 means ROOT::EnableImplicitMT() isn’t called.

Cores = 0 : 257715 +/- 804.626
Cores = 2 : 231662 +/- 1362.99
Cores = 4 : 231929 +/- 1216.48
Cores = 8 : 232911 +/- 1299.78
Cores = 16 : 234121 +/- 989.924
Cores = 32 : 230121 +/- 1714.59

SetTasksPerWorkerHint(1);
Cores = 0 : 256150 +/- 572.275
Cores = 2 : 229457 +/- 1352.39
Cores = 4 : 228317 +/- 646.704
Cores = 8 : 228857 +/- 1103.62
Cores = 16 : 229042 +/- 840.582
Cores = 32 : 228737 +/- 1121.42

With taskset -c 0-31 ./run_test the results are broadly the same (unfortunately I accidentally overwrote the log file, but nothing in the results indicated any difference).

I’m not really sure where this leads us… I observe that with 6.26/04 the execution time no longer increases with the number of cores, but adding more cores doesn’t seem to improve the performance either.

The only other things I can think of are:

  • a large number of files causes a big I/O overhead… I guess I’d need to run this through a profiler to understand that
  • access to more than a single core is for some reason being blocked. I can find out about this from our system admins.

Anyway, I thought I’d report these extra findings. The updated code I’m using is below.

Cheers,

James.

processor	: {0-255}
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 49
model name	: AMD EPYC 7702 64-Core Processor
stepping	: 0
microcode	: 0x8301052
cpu MHz		: 1996.192
cache size	: 512 KB
physical id	: 1
siblings	: 128
core id		: 63
cpu cores	: 64
apicid		: 255
initial apicid	: 255
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3981.31
TLB size	: 3072 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
#include <ROOT/RDataFrame.hxx>
#include <ROOT/TTreeProcessorMT.hxx>
#include <TROOT.h>
#include <TStyle.h>
#include "TCanvas.h"
#include "TH1D.h"
#include "TF1.h"
#include "TMath.h"
#include <iostream>
#include <chrono>
#include <fstream>

int timing(int ncores)
{
    // Run in batch mode - output is pdf
    gROOT->SetBatch(kTRUE);

    if (ncores!=0) {
        // Run parallel RDF operations
        if (ncores==99) {ROOT::EnableImplicitMT();}
        else {ROOT::EnableImplicitMT(ncores);}
        ROOT::TTreeProcessorMT::SetTasksPerWorkerHint(1);
    }

    // Create RDataFrame
    ROOT::RDataFrame refr("CollectionTree", "./files/*.pool.root.1");

    // Canvas
    auto c1 = new TCanvas("c1", "DAOD comparison");

    // Read in a column
    const char * colNameCh = "InDetTrackParticlesAuxDyn.phi";
    auto start = std::chrono::high_resolution_clock::now();
    auto h1 = refr.Histo1D(colNameCh);
    h1->Draw();
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    return(duration.count());
}

int main(void) {
    // open text file for writing
    std::ofstream fw("./results_1task_worker.txt", std::ofstream::out);
    std::vector<int> ncores{ 0, 2, 4, 8, 16, 32};
    for (int i : ncores) {
        std::cout << i << " cores" << std::endl;
        std::vector<int> results;
        results.push_back(timing(i));
        results.push_back(timing(i));
        results.push_back(timing(i));
        results.push_back(timing(i));
        results.push_back(timing(i));
        auto mean = TMath::Mean(results.begin(), results.end());
        auto rms = TMath::RMS(results.begin(), results.end());
        fw << "Cores = " << i << " : " << mean << " +/- " << rms << std::endl;
    }
    fw.close();
    return(1);
}

Hi @jcatmore ,

thank you for investigating further and sorry for the late reply, I was off last week.

I’m happy to see the situation is more sensible with v6.26.04. With these new numbers my updated educated guess would be that for whatever reason (a bottleneck in upload or download of the data, for example) 1 thread is already reading at close to the maximum possible throughput.

There are two things that it would be interesting to know:

  1. what is the average CPU usage of those runs (e.g. only without taskset and without SetTasksPerWorkerHint as they don’t seem to make a difference), measured e.g. with /usr/bin/time ./run_test N_THREADS? If CPU usage is very low, most of the time is likely being spent in I/O waits, so it makes sense that adding more CPUs to the mix does not help
  2. is there any scaling if you use root-readspeed (GitHub - eguiraud/root-readspeed: a tool to measure what throughput can be expected from ROOT for a given application) instead of RDataFrame? root-readspeed is a simple command-line tool that provides some read speed metrics for a given set of input files, branches read, and number of cores. It should be painless to git-clone the repo, build the tool following the instructions at the bottom of the README, and then run e.g. ./root-readspeed --files $YOUR_FILES --trees $TREE_NAME --branches $YOUR_BRANCH --threads $N_THREADS

I can also try to reproduce the problem on our end if you can share the input files with us, even privately (or even just one file that we can then replicate N times).

Cheers,
Enrico

Ah also:

  3. Can you please try with fixed bins, e.g. refr.Histo1D({"h","h",100, 0, 100}, colNameCh) (or whatever range makes sense for your variable)?

Maybe try 3 even before 1 and 2, in case that’s the issue. Multi-thread filling of a histogram without a specified binning is more expensive and less thread-friendly (I did not think it would have such a dramatic effect, but local tests suggest it might).
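A minimal sketch of what that could look like for the branch discussed in this thread (the 64-bin, [-3.2, 3.2] range is only a guess based on it being a phi variable; adjust as needed):

#include <ROOT/RDataFrame.hxx>

// Sketch: pass an explicit histogram model so the binning is fixed up front,
// instead of letting RDF buffer values to deduce the axis range at the end.
void histo_with_fixed_bins()
{
    ROOT::RDataFrame df("CollectionTree", "./files/*.pool.root.1");
    // {name, title, nbins, xmin, xmax} builds a ROOT::RDF::TH1DModel
    auto h = df.Histo1D({"phi", "track phi", 64, -3.2, 3.2},
                        "InDetTrackParticlesAuxDyn.phi");
    h->Draw();
}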

P.S.

a mitigation for the scaling issues with the version without fixed binning is now included in master and will be part of the next patch release, [DF] Remove false sharing from BufferedFillHelper by eguiraud · Pull Request #11089 · root-project/root · GitHub

Hi @eguiraud ,

thanks for your continued advice with this. I did some more tests and made some more observations, summarised here.

(0) I confirmed that there is no system limitation on running multithreaded jobs on our system. I wrote a simple C++ program generating random numbers and then adding them up (i.e. no I/O), and the execution time decreased as expected as the size of the thread pool was increased.

(1) I have realised that the way I was running the tests above, switching the number of threads within a single execution, isn’t a valid approach, since the number of threads sticks at the first encountered setting of EnableImplicitMT(ncores). So in the results I showed above, the timing is flat above two threads just because the job is always running with 2 threads (apart from the first test, where no MT was set) - subsequent EnableImplicitMT(ncores) calls were ignored. I verified this by looking at top -H whilst the job was running: only two cores were active for all of the tests. I’ve now fixed this by re-starting the program each time with an nThreads value set from the command line, and now the results are more understandable. See the new code below, and the small cross-check sketch right after this point.
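As a cross-check (a minimal sketch, assuming ROOT::GetThreadPoolSize() is available in this ROOT version), one can print the size of the pool actually in use:

#include <TROOT.h>
#include <iostream>

// Sketch: report how many threads are actually in the pool. As observed above,
// within a single process the pool size sticks at the first EnableImplicitMT
// call, so later calls with a different argument have no effect.
void check_pool_size(unsigned int ncores)
{
    ROOT::EnableImplicitMT(ncores);
    std::cout << "Requested " << ncores << " threads, actual pool size: "
              << ROOT::GetThreadPoolSize() << std::endl;
}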

(2) I used fixed binning in the new tests, as you can see in the code snippet at the end of this post. With this change and the adjustment in (1) I get the following (again mean of five tests per nThreads and times in milliseconds):

Cores = 0 : 265902 +/- 770.27
Cores = 2 : 220590 +/- 1365.78
Cores = 4 : 175986 +/- 1060.36
Cores = 8 : 401575 +/- 23488.7
Cores = 16 : 649170 +/- 23897.1
Cores = 32 : 1.7848e+06 +/- 17963.5
Cores = 64 : 4.95775e+06 +/- 235233
Cores = 128 : 1.20217e+07 +/- 978047

So you see that the time decreases up to four threads, but adding more above that becomes increasingly counterproductive.

(3) I also looked at the CPU utilisation via top -H. I observe that as the number of threads is increased, the activity of each committed core decreases, which is in line with the observation above. Of course this is just a snapshot, but I think it is instructive.

With 16 threads

 492314 jamesrc   20   0 4919656   3.9g 215148 S  30.3   0.2   7:58.49 a.out                                                                                  
 492288 jamesrc   20   0 4919656   3.9g 215148 S  24.4   0.2   8:04.07 a.out                                                                                 
 492316 jamesrc   20   0 4919656   3.9g 215148 R  21.2   0.2   8:07.33 a.out                                                                                 
 492319 jamesrc   20   0 4919656   3.9g 215148 R  20.2   0.2   7:59.12 a.out                                                                                 
 492315 jamesrc   20   0 4919656   3.9g 215148 S  19.9   0.2   7:54.09 a.out                                                                                 
 492321 jamesrc   20   0 4919656   3.9g 215148 R  19.5   0.2   8:18.57 a.out                                                                                 
 492323 jamesrc   20   0 4919656   3.9g 215148 S  19.2   0.2   8:04.62 a.out                                                                                 
 492326 jamesrc   20   0 4919656   3.9g 215148 S  17.9   0.2   7:52.47 a.out                                                                                 
 492322 jamesrc   20   0 4919656   3.9g 215148 R  17.3   0.2   7:52.98 a.out                                                                                 
 492327 jamesrc   20   0 4919656   3.9g 215148 S  17.3   0.2   7:59.11 a.out                                                                                 
 492320 jamesrc   20   0 4919656   3.9g 215148 S  16.6   0.2   7:59.09 a.out                                                                                 
 492325 jamesrc   20   0 4919656   3.9g 215148 S  16.6   0.2   7:50.36 a.out                                                                                 
 492328 jamesrc   20   0 4919656   3.9g 215148 S  16.6   0.2   7:53.66 a.out                                                                                 
 492317 jamesrc   20   0 4919656   3.9g 215148 S  16.0   0.2   7:47.94 a.out                                                                                 
 492318 jamesrc   20   0 4919656   3.9g 215148 S  16.0   0.2   7:52.97 a.out                                                                                 
 492324 jamesrc   20   0 4919656   3.9g 215148 S  16.0   0.2   8:02.34 a.out

With 32 threads:

501194 jamesrc   20   0 7605448   5.9g 214888 R  11.4   0.3   3:38.88 a.out                                                                                  
 501168 jamesrc   20   0 7605448   5.9g 214888 S  11.1   0.3   3:27.68 a.out                                                                                 
 501172 jamesrc   20   0 7605448   5.9g 214888 R  11.1   0.3   3:27.20 a.out                                                                                 
 501178 jamesrc   20   0 7605448   5.9g 214888 S  11.1   0.3   3:26.33 a.out                                                                                 
 501184 jamesrc   20   0 7605448   5.9g 214888 R  11.1   0.3   3:27.79 a.out                                                                                 
 501189 jamesrc   20   0 7605448   5.9g 214888 S  11.1   0.3   3:30.77 a.out                                                                                 
 501142 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:30.91 a.out                                                                                 
 501170 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:25.89 a.out                                                                                 
 501171 jamesrc   20   0 7605448   5.9g 214888 R  10.7   0.3   3:26.65 a.out                                                                                 
 501180 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:26.16 a.out                                                                                 
 501181 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:26.88 a.out                                                                                 
 501182 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:28.06 a.out                                                                                 
 501185 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:27.78 a.out                                                                                 
 501188 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:24.44 a.out                                                                                 
 501190 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:25.81 a.out                                                                                 
 501191 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:26.93 a.out                                                                                 
 501192 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:24.29 a.out                                                                                 
 501193 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:33.65 a.out                                                                                 
 501196 jamesrc   20   0 7605448   5.9g 214888 R  10.7   0.3   3:26.19 a.out                                                                                 
 501197 jamesrc   20   0 7605448   5.9g 214888 S  10.7   0.3   3:27.51 a.out                                                                                 
 501169 jamesrc   20   0 7605448   5.9g 214888 S  10.4   0.3   3:26.11 a.out                                                                                 
 501174 jamesrc   20   0 7605448   5.9g 214888 S  10.4   0.3   3:24.42 a.out                                                                                 
 501176 jamesrc   20   0 7605448   5.9g 214888 S  10.4   0.3   3:25.84 a.out                                                                                 
 501177 jamesrc   20   0 7605448   5.9g 214888 S  10.4   0.3   3:26.15 a.out

(4) I re-ran a few of the tests using /usr/bin/time, and the results are as follows:

1 thread:

Time measured by C++: 267865ms
264.00user 3.23system 4:28.11elapsed 99%CPU (0avgtext+0avgdata 724748maxresident)k
0inputs+32outputs (0major+139217minor)pagefaults 0swaps

2 threads:

Time measured by C++: 222339ms
326.68user 10.22system 3:42.63elapsed 151%CPU (0avgtext+0avgdata 1024552maxresident)k
0inputs+32outputs (0major+1221848minor)pagefaults 0swaps
Time measured by C++: 353561ms
448.77user 449.42system 5:53.88elapsed 253%CPU (0avgtext+0avgdata 2064396maxresident)k
0inputs+16outputs (0major+6354196minor)pagefaults 0swaps

In each case the elapsed time matches the time measured by the program itself (which is a nice sanity check). You can see that there are diminishing returns as the number of cores increases.

(5) I set up root-readspeed; I’m not 100% sure how to interpret the results, and also whether the results depend on how many files are passed to it (as with a real test) - I used five input files, which amounts to about 100K rows. This is the command:

./build/src/root-readspeed --trees CollectionTree --files /storage/jamesrc/validation/DAOD_PHYS.29580128._000001.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000002.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000003.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000004.pool.root.1 /storage/jamesrc/validation/DAOD_PHYS.29580128._000005.pool.root.1 --branches InDetTrackParticlesAuxDyn.phi --threads 1

These are the results:

1 thread

Total number of tasks: 10
Thread pool size:               1
Real time to setup MT run:      0.347169 s
CPU time to setup MT run:       0.35 s
Real time:                      0.407732 s
CPU time:                       0.4 s
Uncompressed data read:         13539652 bytes
Compressed data read:           12436253 bytes
Uncompressed throughput:        31.6689 MB/s
                                31.6689 MB/s/thread for 1 threads
Compressed throughput:          29.0881 MB/s
                                29.0881 MB/s/thread for 1 threads

2 threads

Total number of tasks: 20
Thread pool size:               2
Real time to setup MT run:      0.34556 s
CPU time to setup MT run:       0.34 s
Real time:                      0.411653 s
CPU time:                       0.81 s
Uncompressed data read:         13539652 bytes
Compressed data read:           12436253 bytes
Uncompressed throughput:        31.3673 MB/s
                                15.6836 MB/s/thread for 2 threads
Compressed throughput:          28.811 MB/s
                                14.4055 MB/s/thread for 2 threads

4 threads

Total number of tasks: 40
Thread pool size:               4
Real time to setup MT run:      0.35015 s
CPU time to setup MT run:       0.35 s
Real time:                      0.459726 s
CPU time:                       1.76 s
Uncompressed data read:         13539652 bytes
Compressed data read:           12436253 bytes
Uncompressed throughput:        28.0872 MB/s
                                7.0218 MB/s/thread for 4 threads
Compressed throughput:          25.7983 MB/s
                                6.44957 MB/s/thread for 4 threads

8 threads

Total number of tasks: 80
Thread pool size:               8
Real time to setup MT run:      0.350529 s
CPU time to setup MT run:       0.35 s
Real time:                      0.499285 s
CPU time:                       3.82 s
Uncompressed data read:         13539652 bytes
Compressed data read:           12611871 bytes
Uncompressed throughput:        25.8618 MB/s
                                3.23273 MB/s/thread for 8 threads
Compressed throughput:          24.0897 MB/s
                                3.01121 MB/s/thread for 8 threads

16 threads

Total number of tasks: 160
Thread pool size:               16
Real time to setup MT run:      0.345709 s
CPU time to setup MT run:       0.35 s
Real time:                      0.642795 s
CPU time:                       9.33 s
Uncompressed data read:         13539652 bytes
Compressed data read:           12980994 bytes
Uncompressed throughput:        20.0879 MB/s
                                1.2555 MB/s/thread for 16 threads
Compressed throughput:          19.2591 MB/s
                                1.20369 MB/s/thread for 16 threads

32 threads

Total number of tasks: 180
Thread pool size:               32
Real time to setup MT run:      0.339456 s
CPU time to setup MT run:       0.33 s
Real time:                      0.737031 s
CPU time:                       12.46 s
Uncompressed data read:         13539652 bytes
Compressed data read:           12980994 bytes
Uncompressed throughput:        17.5195 MB/s
                                0.547485 MB/s/thread for 32 threads
Compressed throughput:          16.7966 MB/s
                                0.524895 MB/s/thread for 32 threads

64 threads

Total number of tasks: 180
Thread pool size:               64
Real time to setup MT run:      0.346924 s
CPU time to setup MT run:       0.35 s
Real time:                      1.12878 s
CPU time:                       19.92 s
Uncompressed data read:         13539652 bytes
Compressed data read:           12980994 bytes
Uncompressed throughput:        11.4393 MB/s
                                0.178739 MB/s/thread for 64 threads
Compressed throughput:          10.9673 MB/s
                                0.171364 MB/s/thread for 64 threads

If I understand the notes on GitHub correctly, this implies that decompression is the bottleneck, since by the time we get to 64 threads we have a real time of 1.13 s and a CPU time per thread of about 0.3 s.

I guess this means that if we want to get better performance, we should probably look at our compression settings. In fact this data is in the DAOD_PHYS format (analogous to CMS’s miniAOD), and in “real life” we will more likely be passing DAOD_PHYSLITE (somewhat like nanoAOD) through RDataFrame, so it’s likely that we can optimise the compression differently. I’ll talk with our core software team about this.

In the meantime, in case you want to take a look, I put one of the files here:

/afs/cern.ch/work/j/jcatmore/public/DAOD_PHYS.29580128._000001.pool.root.1

As you can tell from the above, the main data payload is in CollectionTree, and the branch I’ve been looking at is called InDetTrackParticlesAuxDyn.phi, which ranges from -pi to pi.

I also wondered about processing the same file multiple times to avoid having to download all of these files, but I was worried about caching messing up the results (I don’t have sudo rights on this system, so I can’t flush the cache). I guess I could share all of the files with you, but it is 2.5 TB in total, which might be a bit much :slight_smile:

Anyway, thanks again for your help!

James


Latest code

#include <ROOT/RDataFrame.hxx>
#include <ROOT/TTreeProcessorMT.hxx>
#include <TROOT.h>
#include <TStyle.h>
#include "TCanvas.h"
#include "TH1D.h"
#include "TF1.h"
#include "TMath.h"
#include <iostream>
#include <chrono>
#include <fstream>
#include <string>

int timing(int ncores)
{
    // Run in batch mode - output is pdf
    gROOT->SetBatch(kTRUE);

    if (ncores!=0) {
        // Run parallel RDF operations
        if (ncores==99) {ROOT::EnableImplicitMT();}
        else {ROOT::EnableImplicitMT(ncores);}
        //ROOT::TTreeProcessorMT::SetTasksPerWorkerHint(1);
    }

    // Create RDataFrame
    ROOT::RDataFrame refr("CollectionTree", "./files/*.pool.root.1");
    //ROOT::RDataFrame refr("CollectionTree", "./files/DAOD_PHYS.29580128._000001.pool.root.1");

    // Canvas
    auto c1 = new TCanvas("c1", "DAOD comparison");

    // Read in a column
    const char * colNameCh = "InDetTrackParticlesAuxDyn.phi";
    auto start = std::chrono::high_resolution_clock::now();
    //auto h1 = refr.Histo1D(colNameCh);
    //auto max = refr.Max(colNameCh).GetValue();
    //auto min = refr.Min(colNameCh).GetValue();
    float max{3.2};
    float min{-3.2};
    auto h1 = refr.Histo1D({colNameCh,colNameCh,128, min, max}, colNameCh);
    h1->Draw();
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    return(duration.count());
}

int main(int argc, char* argv[]) {
    int nCores = strtol(argv[1], nullptr, 0);
    std::cout << "Number of cores = " << nCores << std::endl;
    // open text file for writing
    std::string arg1(argv[1]);
    std::string outputFName{"./results_"+arg1+".txt"};
    std::ofstream fw(outputFName, std::ofstream::out);
    // run the tests
    std::vector<int> results;
    results.push_back(timing(nCores));
    results.push_back(timing(nCores));
    results.push_back(timing(nCores));
    results.push_back(timing(nCores));
    results.push_back(timing(nCores));
    auto mean = TMath::Mean(results.begin(), results.end());
    auto rms = TMath::RMS(results.begin(), results.end());
    fw << "Cores = " << nCores << " : " << mean << " +/- " << rms << std::endl;
    fw.close();
    return(0);
}

Hi @jcatmore ,

thank you very much for keeping this up!

The root-readspeed numbers are very interesting. They also show the terrible scaling behavior, so it looks like the problem is not related to RDF (root-readspeed uses raw TTrees).
The real time is much higher than the CPU time per thread, so the threads are mostly idling (confirmed by the average CPU% shown by /usr/bin/time), which indicates that they probably spend most of their time waiting for data → raw I/O is the bottleneck.
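As a rough illustration with the 64-thread root-readspeed numbers above: 19.92 s of CPU time divided over 64 threads is about 0.31 s per thread, against 1.13 s of real time, i.e. each thread is on a CPU for only roughly 28% of the run.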

Also, as you can see, the throughput peaks at around 30 MB/s and then only goes down. That’s in the same order of magnitude as the typical throughput of a spinning disk. If you are reading data from a single spinning disk, that would also explain why throughput gets worse with more and more threads: the disk wastes more and more time seeking back and forth to read a different sector for each thread. Could this be what’s happening?

Cheers,
Enrico

Hi @eguiraud ,

yes, @Attila_Krasznahorkay (to whom I forwarded this thread today) made a very similar comment about the raw I/O. I’ll find out how this storage is arranged - it’s a volume connected to storage resources in the data centre (rather than a disk on a desktop) - so I’ll drop a mail to the system admins.

Cheers,

James
