I am trying to process some xAOD ntuples using RDataFrame with multithreading enabled. However, the multithreaded performance is at least 10 times slower than the single-threaded performance, despite showing much higher CPU usage. The problem seems to be platform-dependent: I ran a test script with and without multithreading, using the same ROOT version (6.30/04) and gcc version (12.3.0), on Perlmutter (Linux 5.14.21-150400.24.81_12.0.87-cray_shasta_c) and on lxplus. The multithreading issue appears on Perlmutter but not on lxplus. Below are the details of my checks:
Thanks for the post and measurements. Let me add in the loop @mczurylo and @vpadulan.
Looking at the flame graph and code, my first hypothesis would be I/O latency on Perlmutter. You are basically just counting entries, without much deserialisation or decompression.
On a local machine I would expect the same behaviour as on lxplus: did you check that?
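Independent of ROOT, one quick way to probe whether the filesystem sustains concurrent reads is to time chunked reads of a file with 1 vs many threads. A minimal sketch (the stand-in file and chunking scheme are assumptions; the real test would point at the xAOD file on the Perlmutter filesystem):

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

def read_chunk(path, offset, size):
    """Read one chunk of the file at the given offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return len(f.read(size))

def time_reads(path, n_chunks, n_threads):
    """Time reading the file in n_chunks pieces with n_threads workers."""
    size = os.path.getsize(path)
    chunk = size // n_chunks
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        total = sum(pool.map(lambda i: read_chunk(path, i * chunk, chunk),
                             range(n_chunks)))
    return time.perf_counter() - start, total

# 16 MiB stand-in file; on Perlmutter one would use the real input file instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(16 * 1024 * 1024))
    path = tmp.name

t1, _ = time_reads(path, 16, 1)    # sequential baseline
t16, _ = time_reads(path, 16, 16)  # 16 concurrent readers
print(f"1 thread: {t1:.3f}s, 16 threads: {t16:.3f}s")
os.unlink(path)
```

If the 16-thread time is much worse than the single-thread time on the network filesystem but not on local storage, that points at storage latency rather than ROOT.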
I tested on my institute's local cluster (which runs AlmaLinux) and it behaves similarly to lxplus. I don’t have local machines with the same architecture as Perlmutter, so I can’t test that.
Thanks: that helps.
I do not think it’s the node architecture that matters here but the way it accesses storage. I am not familiar with Perlmutter, but could it be that you are reading the ROOT file from some network-attached storage that cannot sustain the reads coming from the many cores on the node? Could you perhaps try moving the file(s) to the local storage of the Perlmutter node?
In this particular case I am not sure ROOT can help a lot on the software side.
Is the number of cores used on the Perlmutter node and your University cluster similar?
One option I may suggest is to contact the Perlmutter helpdesk to ask about this performance regression.
Hi @Danilo ,
Thanks for the reply! It gave me some insights into what the problem might be. Each node on Perlmutter has 64 cores (or more). So it makes me think that maybe we are using too many threads. Since we can set the number of threads used when calling ROOT.EnableImplicitMT, I did some tests and indeed the run time increases with the number of threads. I don’t know what default values ROOT sets it to (tbb::task_scheduler_init::default_num_threads() from my understanding) but the default gives an extremely long runtime of > 100 s. Below is the benchmark plot:
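For reference, the thread-count scan can be sketched without ROOT; this is a generic timing harness with a placeholder workload (the real script instead calls ROOT.EnableImplicitMT(n) before building the RDataFrame and re-runs the Count for each n):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count_entries(n_items):
    # Stand-in workload for the real RDataFrame Count(): just counts a range.
    return sum(1 for _ in range(n_items))

def run_with_threads(n_threads, n_items=200_000):
    # Split the "dataset" into one slice per thread, as IMT would.
    per = n_items // n_threads
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        total = sum(pool.map(count_entries, [per] * n_threads))
    return time.perf_counter() - start, total

for n in (1, 2, 4, 8):
    elapsed, total = run_with_threads(n)
    print(f"{n:2d} threads: {elapsed:.4f}s ({total} entries counted)")
```

Plotting elapsed time against the thread count, as done for the benchmark above, makes the scaling (or anti-scaling) immediately visible.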
The default number of threads is the number of hardware cores on the node, as per TBB functionality.
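From Python, the count TBB's default typically matches can be inspected with the standard library; comparing it against the process affinity mask (which honours taskset/batch-system pinning on Linux) shows whether the node exposes more cores than the job may actually use:

```python
import os

# All hardware threads the node exposes (what TBB's default typically matches):
print("os.cpu_count():", os.cpu_count())

# CPUs this process is actually allowed to run on (Linux only; respects pinning):
if hasattr(os, "sched_getaffinity"):
    print("affinity:", len(os.sched_getaffinity(0)))
```

A large gap between the two numbers would already hint at the oversubscription discussed below.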
I am sorry you are experiencing this issue on Perlmutter. If I had to summarise the current status, according to the information we have at the moment, I do not think this behaviour is an issue of ROOT but of the node configuration (ROOT works fine on a variety of other configurations, as you confirmed).
Apart from I/O latency (which could still be the case if /tmp is a network filesystem), there is one common cause that could slow your program down, and that’s the CPU scheduler limiting the amount of CPU time.
Basically, if you asked for a job with 8 CPUs and you land on a node that has 64 cores, ROOT might well start 64 threads (the number of cores exposed by TBB, as you noted). If your process runs in a cgroup with CPU time limits, e.g. in Docker containers or on some batch systems, ROOT might be able to detect that you are only allowed to use 8 cores, so it would limit itself to 8 threads.
Nevertheless, this detection might fail … In this case, the CPU scheduler will run the 64 threads for a bit, and stop them once they have consumed the CPU time of 8 threads. Then, they wait 7/8 of the scheduler interval, get woken up, run for 1/8, get stopped, sleep, get woken up, and so on.
When this happens, your program will get slower as you increase the number of threads without asking for more CPUs.
Whether this is the case is hard to say from the flame graph, but you should be able to check your job options for the number of cores that’s assigned.
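The quota arithmetic above can be checked by hand from the cgroup interface files; a small sketch of the calculation (the file locations, /sys/fs/cgroup/cpu.max for cgroup v2 and cpu.cfs_quota_us/cpu.cfs_period_us for v1, are the usual Linux defaults):

```python
def allowed_cpus_from_quota(quota_us, period_us):
    """CPUs allowed by a cgroup v1 CFS quota; -1 means no limit."""
    if quota_us <= 0:
        return None
    return quota_us / period_us

def parse_cgroup_v2_cpu_max(text):
    """Parse the contents of /sys/fs/cgroup/cpu.max, e.g. '800000 100000'."""
    quota, period = text.split()
    if quota == "max":
        return None  # no limit configured
    return int(quota) / int(period)

# An 8-CPU batch slot on a 64-core node: 64 threads share 8 CPUs' worth of time.
print(parse_cgroup_v2_cpu_max("800000 100000"))  # -> 8.0
print(allowed_cpus_from_quota(-1, 100000))       # -> None (no limit)
```

If the quota works out to 8 CPUs while 64 threads are running, each thread gets roughly 1/8 of the scheduler interval, which matches the wake/throttle pattern described above.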
Thanks for the detailed explanation so far. I will ask a question that I have not seen yet, what is the size of this file? For just counting the entries, if the file is anywhere less than O(10GB), using 64+ cores to parallelize the counting might be counterproductive, just the extra scheduling of the TBB threads might be too much.
Also, the CPU usage reported in the original post is suspicious even on lxplus: 300% for the single-thread case and 250% for the IMT case? In any case, feel free to share the file with me privately and I will try to reproduce.
I can measure with/without IMT and with an increasing number of files (given by the input argument to the constructor of std::vector). All runs have been repeated many times so we are sure the file is in the filesystem cache.
1 file, single thread
real 0m1.706s
user 0m1.336s
sys 0m0.366s
1 file, 2 threads
real 0m2.460s
user 0m2.481s
sys 0m0.694s
1 file, 16 threads
real 0m3.310s
user 0m11.456s
sys 0m4.871s
10 files, single thread
real 0m7.680s
user 0m6.438s
sys 0m1.218s
10 files, 16 threads
real 0m3.292s
user 0m17.377s
sys 0m5.405s
100 files, single thread
real 1m7.463s
user 0m57.759s
sys 0m9.484s
100 files, 16 threads
real 0m8.461s
user 1m14.260s
sys 0m17.038s
As you can see, the dataset in this file is just too small to benefit from IMT for the simple task of counting its entries. The larger the dataset, the less visible the overhead of parallel execution becomes, and the more its benefits show. As usual, one needs either a very compute-intensive graph, a very large dataset, or both, to really benefit from parallelising the application.
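The trend is easier to see as speedup factors computed from the wall-clock ("real") times measured above:

```python
# Wall-clock times from the measurements above, keyed by (n_files, n_threads).
real = {
    (1, 1): 1.706,    (1, 16): 3.310,
    (10, 1): 7.680,   (10, 16): 3.292,
    (100, 1): 67.463, (100, 16): 8.461,
}

for n_files in (1, 10, 100):
    speedup = real[(n_files, 1)] / real[(n_files, 16)]
    print(f"{n_files:3d} files: 16-thread speedup = {speedup:.2f}x")
```

With 1 file the 16-thread run is actually about twice as slow, with 10 files it is roughly 2.3x faster, and only at 100 files does the speedup approach 8x: the per-thread scheduling overhead is amortised only once each thread has enough work.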