Hello, I’m trying out TDataFrame for my analysis class after reading about how easy it is to enable multithreading.
I’ve attached my Analysis class; it doesn’t do anything fancy except create a TDataFrame, split it based on some filters, and create some histograms. My problem is that with multithreading enabled the code runs about 2-3x slower. I’ve checked this using TStopwatch.
I’m using ROOT 6.11/02 on a Mac with 8 cores. ROOT should be compiled with imt on by default. Thanks.
TDataFrame parallelises over “clusters”: roughly, the bunches of rows of the tree that are compressed together, so that each task can decompress its own data without duplicating work.
How many clusters does your dataset have?
Do you see a speedup when enabling parallelism on fewer cores, e.g. ROOT::EnableImplicitMT(4) or ROOT::EnableImplicitMT(2)?
Would it be possible for us to access your ntuple in order to reproduce the issue?
Cheers,
Danilo
You can measure them like this:
// t is a pointer to a TTree
auto it = t->GetClusterIterator(0);
Long64_t start = 0;
int nClusters = 0;
// Next() returns the start entry of each cluster, and the total
// number of entries once the last cluster has been reached
while ((start = it.Next()) < t->GetEntries()) {
  std::cout << "Cluster starts at entry " << start << std::endl;
  ++nClusters;
}
std::cout << "Number of clusters: " << nClusters << std::endl;
thanks. It’s expected to see some surprising effects when increasing the pool size, especially on laptops (frequency scaling and all sorts of power-saving mechanisms, which differ depending on whether you are on battery power, etc.).
Still, I’d like to understand this better. I have two other proposals for you: the first is a small action on your side, namely to try the same with ROOT 6.12/04, which includes several improvements with respect to the 6.11 development release. The second is an action on my side, which is benchmarking your workflow on your data: I’ll contact you privately about that.
meanwhile, we were analysing your code.
One thing which would give you a tangible speedup is the way in which you are initialising the histograms which are a data member of your class.
Your preprocessor macro, DFHisto, immediately calls GetValue. This triggers one event loop per histogram. What we’d suggest is to save the TResultProxy objects as intermediate, local results, set up your entire calculation, and only then trigger the single event loop by invoking GetValue on them to fill the data members. Alternatively, you could keep the TResultProxy instances themselves as data members: they behave like histograms, you can invoke GetValue on them, and your class will inherit the desirable laziness of TDF.
Yes! I removed the “.GetValue()” parts from the macro, saved the TResultProxy objects as intermediates, and only call .GetValue() at the very end to assign the output to my member histograms. It’s definitely faster: down from 8.5 seconds to 2 seconds. I also updated to ROOT 6.12; that made no difference on its own.
this is very good: thanks for diving into this. The numbers are what we expected: we went from 4 event loops to 1 and gained a factor of 4.
Now, about the parallelism: I think that with a few cores you can still gain something, but parallelising a workflow that takes 2 seconds might not always be gratifying.
Did you see any speedup perhaps using 2 or 4 cores?
Hi, I’m still having trouble. I’ve tried my TDataFrame on larger files, ones that take ~5 minutes without multithreading, and whenever I activate EnableImplicitMT it takes much, much longer…
Yes, this is with version 6.12. And trying a smaller number of threads (2 or 4) does not help, though it does make it slightly faster than with 8 threads.