Hello,
I have been experimenting with multithreading using ROOT’s EnableImplicitMT and RDataFrame. I am running a Python program that uses the Filter method to create a new RDataFrame based on some condition. Before I added ROOT.EnableImplicitMT to the code, the program had a runtime of about 2.5 seconds. When I added ROOT.EnableImplicitMT to specify the number of CPUs, the runtime increased. I have summarized the runtimes (wall and CPU) for different numbers of CPUs below. For timing I am using Python's time module: time.time() for wall time and time.process_time() for CPU time. The computer I am using has 256 CPUs.
Not using EnableImplicitMT: wall time 2.62 s, CPU time 2.61 s
Using EnableImplicitMT(1): wall time 11.99 s, CPU time 11.95 s
Using EnableImplicitMT(2): wall time 18.3 s, CPU time 21.9 s
Using EnableImplicitMT(4): wall time 30.0 s, CPU time 41.4 s
Using EnableImplicitMT(8): wall time 103.9 s, CPU time 211.2 s
I also tried using all 256 CPUs, but I killed the program after it had run for about 20 minutes.
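For anyone trying to reproduce these numbers, here is a minimal sketch of the timing approach described above (time.time() for wall time, time.process_time() for CPU time). The file name, tree name and cut passed to the hypothetical count_passing helper are placeholders, not the original program. Note that time.process_time() sums user and system CPU time over all threads of the process, which is why CPU time can exceed wall time once implicit MT is on.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, cpu_seconds).

    time.process_time() counts CPU time of the whole process (all threads),
    so with several busy threads it grows faster than wall time.
    """
    wall0 = time.time()
    cpu0 = time.process_time()
    result = fn(*args, **kwargs)
    return result, time.time() - wall0, time.process_time() - cpu0


def count_passing(filename, treename, cut, nthreads=None):
    """Hypothetical benchmark: count entries passing `cut` with RDataFrame.

    nthreads=None leaves implicit MT disabled; otherwise EnableImplicitMT
    is called before the RDataFrame is constructed.
    """
    import ROOT  # imported lazily so measure() works without ROOT installed

    if nthreads is not None:
        ROOT.EnableImplicitMT(nthreads)
    df = ROOT.RDataFrame(treename, filename)
    # Filter is lazy; GetValue() on the Count result triggers the event loop.
    return df.Filter(cut).Count().GetValue()
```

A possible usage, with placeholder names: `n, wall, cpu = measure(count_passing, "data.root", "events", "x > 0", 4)`. Since EnableImplicitMT changes process-wide state, it is probably simplest to time each thread count in a separate Python process.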
Why is the program so much faster without EnableImplicitMT?
And why does the runtime seem to increase proportionally to the number of CPUs?
Thank you in advance!
ROOT Version: 6.24/06
Platform: Red Hat 8.5
Compiler: GCC 9.4.0
Below is a very simplified version of the program I’m running. I have checked that the runtimes still increase with the number of CPUs as before. As you can see, there is not much going on in the code.
I just tested running the program on another computer (8 CPUs) with a different (and much smaller) dataset. In that case there was no substantial difference between running on 1, 2, 4 or 8 CPUs, and the code was only slightly faster when not using EnableImplicitMT.
It is possible that the problem is on my end.
RDataFrame parallelizes over “TTree clusters” of events. Very small datasets might have only one cluster or very few, and that likely explains the lack of scaling in the small-scale test: there is not enough “meat” to parallelize over (you can check how many clusters are present with, for example, tree->Print("clusters")).
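If you prefer to get the cluster count programmatically from Python, a sketch along these lines should work. TTree::GetClusterIterator and calling the returned TClusterIterator (its operator()) are the C++ API; the assumption here is that PyROOT forwards operator() as an ordinary Python call, which is how cppyy normally maps it. The helper name count_clusters is made up for illustration.

```python
def count_clusters(tree):
    """Count the clusters of a TTree by walking its cluster iterator.

    Each call to the iterator returns the first entry of the next cluster;
    once that value reaches GetEntries(), all clusters have been visited.
    """
    n_entries = tree.GetEntries()
    it = tree.GetClusterIterator(0)  # start from entry 0
    n_clusters = 0
    start = it()
    while start < n_entries:
        n_clusters += 1
        start = it()
    return n_clusters
```

For example, with placeholder names: `f = ROOT.TFile.Open("data.root"); print(count_clusters(f.Get("events")))`. If this prints 1, the event loop has essentially a single chunk of work, so extra threads cannot help much and only add overhead.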
The behavior you describe in the first post is still worrying, but if it is not reproducible outside of that particular machine and setup there might be something wrong there. E.g. you could try using our pre-compiled binaries or the conda installation of 6.26 and see if there is a change.