TDataFrame and EnableImplicitMT

Hello, I’m trying out TDataFrame for my analysis class after reading about how easy it is to use multithreading with it.

I’ve attached my Analysis class; it doesn’t do anything fancy except create a TDataFrame, split it based on some filters, and create some histograms. My problem is that with multithreading on, the code runs about 2-3x slower. I’ve checked this using TStopwatch.
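In outline, the class does something like this (a trimmed-down sketch with placeholder tree, file and branch names, not the actual Analyzer.C):

#include "ROOT/TDataFrame.hxx"
#include "TROOT.h"

void AnalysisSketch()
{
   ROOT::EnableImplicitMT(); // use the implicit thread pool; called
                             // before the TDataFrame is constructed

   ROOT::Experimental::TDataFrame d("T", "bigger_test.root");

   // Split the dataset with filters and book histograms (all lazy so far).
   auto h_all = d.Histo1D("theta");
   auto h_fwd = d.Filter("theta < 0.1").Histo1D("theta");
   auto h_bwd = d.Filter("theta >= 0.1").Histo1D("theta");

   // Accessing any result triggers the (possibly parallel) event loop.
   h_all->Draw();
}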

I’m using ROOT 6.11/02 on a Mac with 8 cores. ROOT should have been built with imt on by default. Thanks.

Analyzer.C (7.1 KB)

Hi,

TDataFrame parallelises over “clusters”, which are, roughly speaking, the bunches of rows of the tree that are compressed together, so that the work needed to decompress them is not duplicated.
How many clusters does your dataset have?
Do you see a speedup when enabling parallelism on fewer cores, e.g. ROOT::EnableImplicitMT(4) or ROOT::EnableImplicitMT(2)?
Would it be possible to reproduce the issue by accessing your ntuple?

Cheers,
Danilo

You can measure them like this:

// t is a pointer to a TTree
auto it = t->GetClusterIterator(0);
Long64_t e = it.Next(); // start entry of the first cluster
Long64_t p = -1;        // previous start entry, used to detect the end

std::cout << "Cluster starts at entry " << e << std::endl;
// Stop once two consecutive calls to Next() return the same entry.
while ((e = it.Next()) && e != p) {
   std::cout << "Cluster starts at entry " << e << std::endl;
   p = e;
}

Hello, I ran the code you provided on my TTree, which I think is a bit too large to upload, but here is the output.

root [3] auto it = T->GetClusterIterator(0)
(TTree::TClusterIterator &) @0x110c0e7d0
root [4] Long64_t e = it.Next();
root [5] Long64_t p = -1;
root [6] std::cout << "Cluster starts at entry " << e << std::endl;
Cluster starts at entry 0
root [7] while ( (e = it.Next()) && e!=p) {
root (cont'ed, cancel with .@) [8]std::cout << "Clusters starts at entry " << e << std::endl; p = e; 
root (cont'ed, cancel with .@) [9]}
Clusters starts at entry 74369
Clusters starts at entry 148738
Clusters starts at entry 223107
Clusters starts at entry 297476
Clusters starts at entry 371845
Clusters starts at entry 446214
Clusters starts at entry 459780

Additionally, using 2 cores is faster than 4 cores, and 4 cores is faster than 8 cores. Seems to be working in reverse :stuck_out_tongue:

Hi Noah,

thanks. It’s expected to see some counter-intuitive effects when increasing the pool size, especially on laptops (frequency scaling and all sorts of power-saving mechanisms, which differ depending on whether you are on battery power, etc.).
Still, I’d like to understand this better. I have two further proposals for you: the first one is a tiny action on your side, namely to try the same with ROOT 6.12/04, which includes several improvements with respect to the 6.11 development release. The second one is an action on my side, which is to benchmark your workflow on your data: I’ll contact you privately about that.

Cheers,
D

Hi,

in the meantime we have been analysing your code.
One thing that would give you a tangible speedup is the way in which you are initialising the histograms that are data members of your class.
Your preprocessor macro, DFHisto, immediately calls GetValue. This triggers one event loop per histogram. What we’d suggest is to store the TResultProxy objects as intermediate, local results, and only once your entire calculation is set up trigger the (single) event loop by invoking GetValue on them to fill the data members. Alternatively, you could keep TResultProxy instances as data members: they behave like histograms, you can invoke GetValue on them, and your class inherits the desirable laziness of TDF.
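For instance, something along these lines (just a sketch: the tree, file and branch names are placeholders, not the ones in Analyzer.C):

#include "ROOT/TDataFrame.hxx"
#include "TH1D.h"

void FillMemberHistos()
{
   ROOT::Experimental::TDataFrame d("T", "bigger_test.root");

   // 1) Book everything first: these are lazy TResultProxy<TH1D> objects,
   //    no event loop has run yet.
   auto r_all   = d.Histo1D("theta");
   auto r_tgt   = d.Filter("theta < 0.1").Histo1D("theta");
   auto r_notgt = d.Filter("theta >= 0.1").Histo1D("theta");

   // 2) Only now trigger the event loop: the first GetValue runs it once,
   //    the others just read the already-filled results. Copy them into
   //    your data members at this point.
   TH1D h_all   = r_all.GetValue();
   TH1D h_tgt   = r_tgt.GetValue();
   TH1D h_notgt = r_notgt.GetValue();
}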

Cheers,
D

Hi,

I was trying to reproduce your analysis. How exactly did you invoke it (cut values, etc.)? The exact command line would be great.

Cheers,
Danilo

Hi, sorry I took so long to get back to you.

root [0] .L Analyzer.C++
Info in <TMacOSXSystem::ACLiC>: creating shared library /Users/noahsteinberg/Physics/anamuse-build/./Analyzer_C.so

root [1] Analysis analysis("title", "muon", "115", "MeV", 20, 100, "bigger_test.root", true, 30.0, 300, 100000, "chamber")
We are looping over 964688 events.
  Norm: tgt yield = 0.694398 per 10^6 beam particles
  TGT processes: Coulomb = 273683 eIoni = 75 eBrem = 318

The code took 8.39542 seconds.
(Analysis &) @0x1185ca0a0
root [2] .ls
 OBJ: TH1D	h_thetaMeV	 : 0 at: 0x1185ca158
 OBJ: TH1D	h_theta_tgt_MeV	 : 0 at: 0x1185ca540
 OBJ: TH1D	h_theta_no_tgt_MeV	 : 0 at: 0x1185ca928
 OBJ: TH1D	h_dtheta_MeV	 : 0 at: 0x1185cb8c8
root [3] h_thetaMeV->Draw()

Hi, thanks.
Any luck with removing the extra event loops? Those are a huge penalty.

Cheers,
D

Yes! I removed the “.GetValue()” calls from the macro, saved the TResultProxy objects as intermediates, and only call .GetValue at the very end to assign the output to my member histograms. It’s definitely faster: down from 8.5 seconds to 2 seconds. I also updated to ROOT 6.12; that made no further difference :slight_smile:

Hi,

this is very good: thanks for diving into this. The numbers are what we expected: we went from four event loops to one and gained a factor of four.
Now, about the parallelism: I think that with a few cores you can still gain something, but parallelising a workflow that takes 2 seconds might not always be gratifying.
Did you see any speedup, perhaps using 2 or 4 cores?
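If you want to quantify it, something like the following would do (again a sketch with placeholder names; best run in a fresh ROOT session for each pool size):

#include "ROOT/TDataFrame.hxx"
#include "TStopwatch.h"
#include "TROOT.h"
#include <iostream>

void TimeEventLoop(unsigned int nThreads)
{
   if (nThreads > 0) ROOT::EnableImplicitMT(nThreads);

   ROOT::Experimental::TDataFrame d("T", "bigger_test.root");
   auto h = d.Histo1D("theta"); // booked lazily

   TStopwatch sw;
   sw.Start();
   h.GetValue(); // triggers the event loop
   sw.Stop();
   std::cout << nThreads << " thread(s): " << sw.RealTime() << " s" << std::endl;
}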

Cheers,
D

Hi, I’m still having trouble. I’ve tried my TDataFrame on larger files, ones that take ~5 minutes without multithreading, and whenever I activate EnableImplicitMT it takes much, much longer.

and this is with v6.12?

Yes, this is with version 6.12. And trying a smaller number of threads (2 or 4) does not help, though it does make it slightly faster than with 8 threads.

Hi,

Can we reproduce the issue, i.e. have the input data and code you are running?

Cheers,
D
