RDataFrame Multithreading questions

eneb · May 25, 2024, 8:22am

Hello experts,

I have some questions regarding the EnableImplicitMT() when using RDataFrame. I have a C++ executable in which I do

ROOT::EnableImplicitMT(numCores);

where numCores is just an integer indicating the number of cores I want to run with. While the executable was running, I checked with htop what was happening on the machine.

For the test I used numCores=12. In htop the command is visible 12 times, as I would expect, but just 1 of these is actually in the Running state (R), while the other 11 are in the “uninterruptible sleep” state (D). I am not very familiar with this, but a quick Google search suggests this is the state a thread is in while I/O is performed. Is this expected? Furthermore, just 2 cores seem to be active (100% usage) while all the others seem to be idle (~2%). As the machine has 20 cores, I was expecting 12 cores to show significant usage.
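The R/D states that htop displays can also be read directly from /proc. The following is a minimal sketch, assuming a Linux machine with /proc mounted; ParseThreadState and CountThreadStates are hypothetical helper names and are not part of ROOT. It counts the threads of a process by their one-letter state, so a result dominated by 'D' would be consistent with threads waiting on I/O.

```cpp
#include <dirent.h>

#include <fstream>
#include <map>
#include <string>

// Extract the one-letter state from a /proc/<pid>/task/<tid>/stat line.
// The line has the form "pid (comm) state ...". The command name may itself
// contain spaces or ')', so we look for the last ')' and take the first
// non-space character after it. Returns '?' if the line cannot be parsed.
char ParseThreadState(const std::string &statLine)
{
   const auto close = statLine.rfind(')');
   if (close == std::string::npos)
      return '?';
   const auto pos = statLine.find_first_not_of(' ', close + 1);
   return pos == std::string::npos ? '?' : statLine[pos];
}

// Count the threads of a process by state, e.g. {'R': 1, 'D': 11}.
// `pid` is a process id as a string, or "self" for the calling process.
// Returns an empty map if the task directory cannot be opened.
std::map<char, int> CountThreadStates(const std::string &pid = "self")
{
   std::map<char, int> counts;
   const std::string taskDir = "/proc/" + pid + "/task";
   DIR *dir = opendir(taskDir.c_str());
   if (!dir)
      return counts;
   while (const dirent *entry = readdir(dir)) {
      const std::string tid = entry->d_name;
      if (tid == "." || tid == "..")
         continue;
      std::ifstream stat(taskDir + "/" + tid + "/stat");
      std::string line;
      if (std::getline(stat, line))
         ++counts[ParseThreadState(line)];
   }
   closedir(dir);
   return counts;
}
```

Calling CountThreadStates with the pid of the running executable a few times during the event loop gives a rough picture of whether the worker threads are mostly computing (R) or mostly blocked (D).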

Using EnableImplicitMT(12) on ~17 million events from 11 input files combined, the job takes 7 minutes to run. I define about 20 new variables and create ~50 histograms using Histo1D (no Snapshot!). I have no feeling for whether this run time is to be expected or not.
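To put the 7 minutes into numbers, the quoted figures can be turned into an event rate. This is a small arithmetic sketch only; EstimateThroughput is a hypothetical helper, and the result says nothing by itself about whether the rate is good for this particular workload.

```cpp
#include <cassert>

// Overall and per-thread event rates for one run.
struct Throughput {
   double eventsPerSecond;
   double eventsPerSecondPerThread;
};

// nEvents: number of processed events, wallSeconds: wall-clock run time,
// nThreads: number of threads requested for the run.
Throughput EstimateThroughput(double nEvents, double wallSeconds, unsigned nThreads)
{
   assert(wallSeconds > 0 && nThreads > 0);
   const double rate = nEvents / wallSeconds;
   return {rate, rate / nThreads};
}
```

With the numbers from the post, EstimateThroughput(17e6, 7 * 60, 12) gives roughly 40 thousand events per second overall, or about 3.4 thousand events per second per requested thread.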

As I also read some posts in this forum, as to why the internal parallelization might be slow(er), I checked one of the input files I use. The output of tree->Print("clusters") is

******************************************************************************
*Tree    :nominal   : tree                                                   *
*Entries :  2578115 : Total =      2078727229 bytes  File  Size =  579280161 *
*        :          : Tree compression factor =   3.59                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                2578114           1000

But I do not really know what the output tells me.
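The numbers in that printout can be checked by hand. The sketch below assumes that the "Size" column is the number of entries per cluster and that the header's "Total" and "File Size" are the uncompressed and compressed sizes in bytes; CountClusters and CompressionFactor are hypothetical helper names, not ROOT functions.

```cpp
#include <cassert>
#include <cstdint>

// Number of clusters in one cluster range, using ceiling division so that a
// final, partially filled cluster is counted as well.
std::int64_t CountClusters(std::int64_t firstEntry, std::int64_t lastEntry,
                           std::int64_t entriesPerCluster)
{
   assert(entriesPerCluster > 0 && lastEntry >= firstEntry);
   const std::int64_t nEntries = lastEntry - firstEntry + 1;
   return (nEntries + entriesPerCluster - 1) / entriesPerCluster;
}

// Ratio of uncompressed to compressed size.
double CompressionFactor(double uncompressedBytes, double compressedBytes)
{
   assert(compressedBytes > 0);
   return uncompressedBytes / compressedBytes;
}
```

Under these assumptions, CountClusters(0, 2578114, 1000) gives 2579 clusters for this file, and CompressionFactor(2078727229., 579280161.) gives about 3.59, matching the printed value. Since multithreaded RDataFrame splits a TTree along cluster boundaries, a few thousand clusters per file would suggest that the clustering itself leaves plenty of independent chunks for 12 threads.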

Now I am just wondering whether the run time and CPU usage make sense. Thanks for any input on this!

Danilo · May 25, 2024, 9:35am

Hi,

RDataFrame is battle tested, up to hundreds of cores - even in a distributed manner on many machines with DistRDF.

If you are reading 11 files, I expect good scaling up to 10-12 cores and, depending on your workload and the structure of the files, even beyond that. Slowdowns can happen if the disks are too slow or if you are reading remotely over a poor network connection, but again these are corner cases with this low number of cores.

In some corner cases, multithreading can be slower. This is not due to ROOT in particular, but is generic behaviour when the overhead is too large with respect to the work actually being done. For realistic workloads, this never happens.

Let us know how your analysis goes, we are always encouraging our community members to share their analysis experiences with the others!

Cheers,
Danilo

eneb · May 25, 2024, 6:30pm

Hi Danilo,

thanks for the answer. I am not sure whether the filesystem might be a problem (it is not lxplus but an lxplus-like machine). I tested it today, and there is no difference between using EnableImplicitMT(1) and EnableImplicitMT(12): the executable needs the same time.
I also tried EnableImplicitMT(50) just to see what happens, but then it takes even longer than before (over 11 min instead of 7 min), which baffles me. But maybe it also points to other problems? I also tested

root [1] ROOT::EnableImplicitMT(0)
root [2] ROOT::GetThreadPoolSize()
20

which confused me even more as to how it was possible to request 50 cores…
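On the 20 versus 50: according to the ROOT documentation, passing 0 to EnableImplicitMT lets the implementation choose the number of threads, which here evidently resolved to the 20 cores of the machine. An explicit request is a separate matter, and a thread pool can in general hold more threads than there are cores. The following is a generic sketch of that convention, not ROOT's actual implementation; ResolveThreadCount and IsOversubscribed are hypothetical names.

```cpp
// Generic "0 means automatic" convention for thread-pool sizes.
// requested:       the number passed by the user, 0 meaning "choose for me"
// hardwareThreads: logical cores available, e.g. the value returned by
//                  std::thread::hardware_concurrency()
unsigned ResolveThreadCount(unsigned requested, unsigned hardwareThreads)
{
   if (requested == 0)
      return hardwareThreads > 0 ? hardwareThreads : 1u; // automatic choice
   return requested; // an explicit request is taken as-is, even above core count
}

// True if more threads are requested than there are logical cores. The extra
// threads then compete for the same cores, which adds scheduling overhead
// without adding any computing capacity.
bool IsOversubscribed(unsigned requested, unsigned hardwareThreads)
{
   return ResolveThreadCount(requested, hardwareThreads) > hardwareThreads;
}
```

In this picture, a request for 50 threads on a 20-core machine is accepted but oversubscribed, which would be one plausible reason for the longer run time compared to 12 threads.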

To give some more information: I am using a Python script with subprocess.Popen to call my executable, which is part of a larger C++ framework. This framework is not based entirely on my own code, so maybe I am missing something, but I also checked that just one event loop is run. The idea is that all Defines, Histo1D, Histo2D, etc. are booked first and the event loop is then triggered just once (which I think is the idea of the lazy actions).

I also read some further posts in the forum, and it seems like there are several reasons why performance can be worse than expected - also from a point of view how much speed-up is realistic.
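One common way to reason about how much speed-up is realistic is Amdahl's law, which relates the speed-up to the fraction of the run time that can actually run in parallel. The sketch below is a textbook model, not anything specific to RDataFrame; AmdahlSpeedup and ParallelFraction are hypothetical helper names.

```cpp
#include <cassert>

// Amdahl's law: expected speed-up on nThreads when a fraction p of the
// single-threaded run time is perfectly parallelisable (0 <= p <= 1).
double AmdahlSpeedup(double p, unsigned nThreads)
{
   assert(p >= 0.0 && p <= 1.0 && nThreads > 0);
   return 1.0 / ((1.0 - p) + p / nThreads);
}

// Inverse relation: the parallel fraction implied by a measured speed-up.
// Requires nThreads > 1 and measuredSpeedup > 0.
double ParallelFraction(double measuredSpeedup, unsigned nThreads)
{
   assert(nThreads > 1 && measuredSpeedup > 0.0);
   return (1.0 - 1.0 / measuredSpeedup) / (1.0 - 1.0 / nThreads);
}
```

For example, AmdahlSpeedup(0.9, 12) is about 5.7, so even a job that is 90% parallel does not get close to a factor of 12. Conversely, identical run times with 1 and 12 threads correspond to ParallelFraction(1.0, 12) == 0 in this model, which would point to a serial bottleneck (such as waiting on I/O) rather than to a modest parallel inefficiency.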

eneb · June 1, 2024, 11:44am

Hello again,

I tested it a bit more to find out what is going on. I tried running the executable directly (./bin/Selection.exe) instead of using Python's subprocess.Popen, but it did not change anything.

I double-checked that I call EnableImplicitMT(12) before I construct the RDataFrame.

When constructing it, I do

ROOT::RDataFrame df(*fileChain.release());
std::cout << "SLOTS df: " << df.GetNSlots() << std::endl;
ROOT::RDF::RNode runFrame = df;
std::cout << "SLOTS runFrame: " << runFrame.GetNSlots() << std::endl;

The GetNSlots() call was used to check that it prints 12, which it does.

When constructing new variables, I use lambda functions and not strings for the definition. Depending on the variables in these lambdas I construct e.g. a ROOT::Math::PtEtaPhiEVector and I also use some helper class functions like

float obj_eta = HelperClass::maxabs(obj1.size() > 0 ? obj1.at(0).eta() : 0, obj2.size() > 0 ? obj2.at(0).eta() : 0);

This should not be the reason for just “ignoring” the EnableImplicitMT() call, right?

I even managed once to get it to run on 12 cores for a second or so (checking with htop), before it fell back to using just 1 core again. So my guess is that something is restricting RDataFrame to just one core (or is preventing it from using more). What else could I test to find out what the problem is?

Danilo · June 1, 2024, 1:55pm

Hi,

This can depend on a plethora of factors: the I/O bandwidth of your storage, the number of files you are running on, and how they are clustered (i.e. how many “compressed sub-units” they contain).
Are you able to provide us with a reproducer of the behaviour?

Best,
Danilo

eneb · June 1, 2024, 5:42pm

Hi,

regarding the clusters, I did some tests and posted the result in the original post. As stated there, I am not sure I understand the output completely. As for the I/O bandwidth, I still need to do some checks, but I also need to figure out how.

As for a reproducer, as I am using a quite big framework which is the backbone of this executable, I need to think about how I could do this (and to not leave out functionalities of the code).

If there are any other tests I could do regarding RDataFrame or so, please let me know! Thanks!

Danilo · June 1, 2024, 5:52pm

Hi,

Thanks a lot for the follow up.
We cannot really associate the issue you are reporting with a known scaling issue of RDF (actually, we are not aware of any such issues).
While creating the reproducer and taking measurements, could you use the latest ROOT version, ROOT 6.32.00?

We are very interested in understanding what is limiting you and, if it is ROOT, in fixing it.

Best,
D
