Specifying number of CPUs with EnableImplicitMT slows down runtime

I have been experimenting with multithreading using ROOT’s EnableImplicitMT and RDataFrame. I am running a Python program that uses the Filter method to create a new RDataFrame based on some condition. Before adding ROOT.EnableImplicitMT to the code, the program had a runtime of about 2.5 seconds. When I added ROOT.EnableImplicitMT to specify the number of CPUs, the runtime increased. I have summarized the runtimes (wall and CPU) for different numbers of CPUs below. For timing I am using the time module in Python: time.time() for wall time and time.process_time() for CPU time. The computer I am using has 256 CPUs.
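In case it matters, the measurement pattern is just the following (a minimal sketch, with a dummy CPU-bound workload standing in for the actual analysis):

```python
import time

def timed(fn):
    # Measure wall time and CPU time around fn(), as in the numbers below
    wall0 = time.time()
    cpu0 = time.process_time()
    result = fn()
    cpu = time.process_time() - cpu0
    wall = time.time() - wall0
    print(f"Walltime: {wall:.2f} s. CPU time: {cpu:.2f} s")
    return result, wall, cpu

# Dummy single-threaded workload in place of the RDataFrame event loop
_, wall, cpu = timed(lambda: sum(i * i for i in range(10**6)))
```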

Not using EnableImplicitMT
Walltime: 2.62 s. CPU time: 2.61 s
Using EnableImplicitMT(1)
Walltime: 11.99 s. CPU time: 11.95 s
Using EnableImplicitMT(2)
Walltime: 18.3 s. CPU time: 21.9 s
Using EnableImplicitMT(4)
Walltime: 30.0 s. CPU time: 41.4 s
Using EnableImplicitMT(8)
Walltime: 103.9 s. CPU time: 211.2 s

I also tried using all 256 CPUs, but I killed the program after it had run for about 20 minutes.

Why is the program so much faster before adding the EnableImplicitMT?
And why does the runtime seem to increase proportionally to the number of CPUs?

Thank you in advance!

ROOT Version: 6.24/06
Platform: Red Hat 8.5
Compiler: GCC 9.4.0

Welcome to the ROOT Forum! I’m sure @eguiraud will be interested in your report

Hi @olangrek ,
and welcome to the ROOT forum. Needless to say, this should not happen :slight_smile: and it does not happen with any of the benchmarks we have at https://github.com/root-project/rootbench . Could you provide a reproducer that we can play with?


Hi @eguiraud
Thanks for the quick reply!

Below is a very simplified version of the program I’m running. I have checked that the runtimes still increase with the number of CPUs as before. As you can see, there is not much going on in the code :slight_smile:

import ROOT
import time

# ROOT.EnableImplicitMT(N)  # uncommenting this (N = number of CPUs) triggers the slowdown

tot_start_time = time.process_time()
start_wall_time = time.time()

# Build the dataframe and count events with exactly one electron
df = ROOT.RDataFrame("CollectionTree", "DAOD_PHYSLITE.stringIndexBig.pool.root")
df_new = df.Filter('AnalysisElectronsAuxDyn.pt.size() == 1', 'Exactly one electron')
df_count = df_new.Count().GetValue()

tot_end_time = time.process_time() - tot_start_time
end_wall_time = time.time() - start_wall_time
print(f'Total wall time: {end_wall_time} s')
print(f'Total CPU time: {tot_end_time} s')

I just tested running the program on a different computer (8 CPUs) with a different (and much smaller) dataset. In that case there was no substantial difference between running on 1, 2, 4 or 8 CPUs, and the code was only slightly faster without EnableImplicitMT.
It is possible that the problem is on my end :slight_smile:

RDataFrame parallelizes over “TTree clusters” of events. Very small datasets might only have one cluster or very few, and that likely explains the lack of scaling for the small-scale test: there is not enough “meat” to parallelize over (you can check with tree->Print("clusters") how many clusters are present, for example).
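For completeness, the same check can be done from PyROOT. A sketch, assuming the file and tree names from your reproducer (the try/except is only there so the script degrades gracefully where ROOT or the input file is missing):

```python
try:
    import ROOT
    # Open the input file and inspect the cluster layout of the tree
    f = ROOT.TFile.Open("DAOD_PHYSLITE.stringIndexBig.pool.root")
    tree = f.Get("CollectionTree")
    tree.Print("clusters")  # prints one entry range per TTree cluster
    checked = True
except Exception:
    # ROOT (or the input file) is not available in this environment
    checked = False
```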

The behavior you describe in the first post is still worrying, but if it is not reproducible outside of that particular machine and setup there might be something wrong there. E.g. you could try using our pre-compiled binaries or the conda installation of 6.26 and see if there is a change.
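The conda route would look roughly like this (the environment name is arbitrary and the version pin is an assumption; adjust as needed):

```shell
# Create a fresh environment with ROOT 6.26 from conda-forge and activate it
conda create -c conda-forge --name root626 root=6.26
conda activate root626
```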


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.