RDataFrame multi threading slower than one thread

Dear experts,
could be possible that EnableImplicitMT() in RDataFrame were slower than disable multi threading ?

Imanol


Please read tips for efficient and successful posting and posting code

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided


Hi @imanol097 ,
it’s certainly possible in princple, but it’s hard to tell whether it should be expected or not in your case without further context.

Cheers,
Enrico

Well, i would like to know more about this.
What kind of information about my context do you need to say something @eguiraud ?

Imanol

With no info, it could be anything:

  • if the machine where you run is very busy, threads will be scheduled on and off CPUs a lot resulting in more overhead than if you run with a single thread
  • old CPUs are bad at hyper-threading, so you might have better results using as many threads as the number of physical cores rather than the logical cores (which is what EnableImplicitMT() uses as default number of worker threads)
  • above 32 threads, until recent ROOT versions, scaling was limited by an internal issue (resolved in 6.24.02)
  • there could be locks or similar contention issues in the user logic injected in the RDataFrame event loop
  • the dataset could be small enough that additional cores bring no benefit because they have no work to do
  • when reading data over the network or from a slow disk, using a high amount of cores could choke the I/O bandwidth

…or a combination of these, or something else I did not think of.

It would be nice to have a reproducer and/or information about your platform (e.g. from lscpu, root-config --version, rootls -t <dataset>, etc.).

Cheers,
Enrico

From lscpu i got:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

From root-config --version i got 6.22/08

And I am currently working with root files organized in collections, I don´t know if this helps

Imanol

With such a large amount of cores I strongly recommend to try out v6.24.06 rather than v6.22.08 for the reason mentioned above.

Threads migrating between CPUs on different NUMA nodes might be another issue, you might want to limit the execution to cores in the same numa node using taskset.

EDIT: and if you run with 56 or 112 cores you should make sure you have at least double the amount of “clusters” of entries in the input files. rootls -t <filename> displays the amount of clusters for the trees in the file <filename>. tree->Print("clusters") can be used for the same purpose.

EDIT 2: if the scaling issues persist with v6.24.06, limiting the execution to a single NUMA node and having enough data, I would love to see a scaling plot to get an idea of the behavior you see.

EDIT 3: Also what kind of runtimes are we talking about? On hundreds of cores there might be a few seconds of warm-up time due to locks in ROOT I/O. This might be a major issue if total runtimes are in the order of seconds.

Dear @eguiraud,

thanks for your detailed answer.
I will check and try carefully your suggestions.
I am talking about runtimes not less than 1minute

Imanol

1 Like

If you can share a reproducer (code + data) we can also investigate on our side.

Cheers,
Enrico

@eguiraud After moving to v6.24.06 the performance improved a lot.

1 thread → 30min
EnableImplicitMT() → 8-10m

I checked the number of clusters in the trees and it is 1

About taskset, for some reason is not working properly.

Alright, although a factor 3 speed-up with 100 cores is still abysmal (you should get pretty much the same speed-up just using 3-4 cores). But this might be the problem:

If that’s true, RDataFrame can do no parallelization within each TTree, as the parallelization happens over clusters of entries. RDataFrame will still parallelize over different TTrees if your dataset consists of multiple trees.

However:

  • the typical size of one cluster is ~30 MB of data
  • a runtime of 30 minutes indicates that you are processing a lot of data

So either you are processing a lot of single-cluster trees, or there is something we are missing.
In case it can be useful, activating RDataFrame’s logging will print to screen some info about the beginning and end of each parallel processing task, see " Performance profiling of RDataFrame applications" at ROOT: ROOT::RDataFrame Class Reference.

Cheers,
Enrico

@eguiraud Yes, i am processing several .root files with one TTree in each file.
So I guess that RDataFrame is doing in parallel the analysis of each TTree, in fact if I compare the execution time of one TTree with and without parallelization it doesn’t change.

Now my question is: Could the execution time be reduced by clustering my TTrees?

If you are processing N .root files (each with one TTree with one cluster), you won’t be able to get a speed-up higher than N, in the ideal case. In practice you would need at least 2 or 3 times more trees than threads to avoid imbalance effects (e.g. one tree that’s slower to process and that causes a long single-thread processing tail).

If the TTrees are large but somehow not clustered, clustering them would increase the granularity of the parallelization and might be beneficial (more threads can be working at the same time and as there will be more overall tasks imbalance effects will be reduced). If the TTrees contain only one cluster because they are small, then you might see some benefit by forcing very small clusters, but the overhead of starting/finishing/merging the tasks might be significant with respect to the runtime of each task (for good scaling you want each task to take much more time than its starting/stopping/handling overhead).

Another strategy could be to only use a number of threads equal to a third or a fifth of the number of files, so you get a reasonable speed-up for the workload you have and leave other CPUs free for other jobs (e.g. that run on other datasets), then start multiple few-core jobs rather than a single job that requests 100 core but can’t properly utilize them (I can’t say whether this makes sense or not for your usecase).

Cheers,
Enrico

P.S.
I would still strongly suggest to only schedule the run on a single NUMA domain (e.g. only on even-numbered CPUs), threads migrating between cores on different domains or accessing memory from another domain might be slowing down the processing.

P.S. 2:
to clarify the behavior it would be good to know the total number of TTrees you are processing and to produce a scaling plot when running on a single numa domain (runtimes at 1, 4, 16, 64 cores on one domain) – then we can extend the measurements to cross numa domains but I would expect scaling to take a hit in that case.

Hi @eguiraud, I have tried schedule the run on one single NUMA domain but the performance improvement is negligible if any. Maybe in this situation is sightly different because I have hyper-threading ? I don’t know what can be happening.

Hi @imanol097 ,
that’s unfortunate. Of course we’d like to scale well out of the box, but I’m still not sure what’s going on exactly.

What’s the total number of single-cluster TTrees you are using? How does a scaling plot inside a single NUMA domain look like (and are you sure threads only run on either even-numbered or odd-numbered CPUs)?

Can you share a reproducer that we can run? I’d like to check whether scaling is ok on a 64-core machine that does not have multiple NUMA domains.

Cheers,
Enrico

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.