Slow performance when using multithreading with TTaskGroup

mtakacs · January 3, 2025, 2:30pm

Hello,

I’m currently trying to figure out how to take advantage of the multithreading offered in ROOT 6 for my existing analysis code. As a first exercise, I wanted to write a short macro which reads two ROOT trees from separate files and simply goes through the entries using GetEntry(). (I’m aware that there are better, more modern options for looping through ROOT trees; however, the logic of my existing analysis code relies heavily on the GetEntry() procedure.) The files “Test1.root” and “Test2.root” are 3 GB in size. Both contain a single tree with six branches holding typical event-by-event measurement data.

In the first version of the code, the analysis is done sequentially, processing one tree at a time.

void singlecore2(){
	Long64_t TimeStamp;
	Long64_t n1, n2;
	TStopwatch watch;
	watch.Start();
	
	TFile *f1 = new TFile("Test1.root", "read");
	TTree *Traw1 = (TTree*)f1->Get("RawData/Tglobal");
	Traw1->SetBranchAddress("TimeStampGlobal",&TimeStamp);
	TFile *f2 = new TFile("Test2.root", "read");
	TTree *Traw2 = (TTree*)f2->Get("RawData/Tglobal");
	Traw2->SetBranchAddress("TimeStampGlobal",&TimeStamp);
	n1 = Traw1->GetEntries();
	n2 = Traw2->GetEntries();

	for(Long64_t j=0; j<n1; j++){
		Traw1->GetEntry(j);	
	}
	printf("File 1 - done\n");
	
	for(Long64_t j=0; j<n2; j++){
		Traw2->GetEntry(j);	
	}
	printf("File 2 - done\n");
	f1->Close();
	f2->Close();
	
  cout << "(Processing time: " << watch.RealTime() << ")" <<endl;
}

In the second version of the code, I wanted to do the same thing but analyze the two trees in parallel on two different CPU cores using multithreading. For this, I used TTaskGroup, following the tutorial mt301_TTaskGroupSimple.C.

void multicore2(){
	TStopwatch watch;
	watch.Start();
	ROOT::EnableImplicitMT(2);
	
	ROOT::Experimental::TTaskGroup tg;
	
	// Task 1: loop over all entries of the tree in Test1.root
	tg.Run([]() {
		Long64_t TimeStamp1;
		Long64_t n1;
		TFile *f1 = new TFile("Test1.root", "read");
		TTree *Traw1 = (TTree*)f1->Get("RawData/Tglobal");
		Traw1->SetBranchAddress("TimeStampGlobal",&TimeStamp1);
		n1 = Traw1->GetEntries();
		for(Long64_t j=0; j<n1; j++){
			Traw1->GetEntry(j);
		}
		f1->Close();
		cout << TimeStamp1 << endl;
		printf("File 1 - done\n");
	});
	// Task 2: loop over all entries of the tree in Test2.root
	tg.Run([]() {
		Long64_t TimeStamp2;
		Long64_t n2;
		TFile *f2 = new TFile("Test2.root", "read");
		TTree *Traw2 = (TTree*)f2->Get("RawData/Tglobal");
		Traw2->SetBranchAddress("TimeStampGlobal",&TimeStamp2);
		n2 = Traw2->GetEntries();
		for(Long64_t j=0; j<n2; j++){
			Traw2->GetEntry(j);
		}
		f2->Close();
		cout << TimeStamp2 << endl;
		printf("File 2 - done\n");
	});
		
   tg.Wait();
   
  ROOT::DisableImplicitMT();
  cout << "(Processing time: " << watch.RealTime() << ")" <<endl;
}

Unfortunately, I’m getting considerably worse performance with multithreading enabled than in sequential mode. Moreover, the processing time gets longer as I assign more cores to the multithreading:
Multithreading disabled: 68 s
EnableImplicitMT(2): 247 s
EnableImplicitMT(4): 429 s
While the code is running, the Task Manager shows that the load on the requested number of cores does indeed go to 100%.

Any hint or suggestion would be highly appreciated.

ROOT Version: 6.34.02
Platform: Windows 11
Compiler: Visual Studio 17.12.2

Danilo · January 4, 2025, 9:51am

Hi Marcell,

Thanks for the report: interesting.
TTaskGroup is a tool we will not actively support in the future, even though it is functional for the moment (you can see a sign of this here: the tutorial has been moved to the legacy category).

I cannot exclude that this is an effect of threading on Windows: do you have the possibility to try the same code on Linux/mac or to run some profiling to see where the program spends time?

More generally, my suggestion would be to move to RDataFrame for all parallel data processing.

I hope this helps a bit.

Best,
D

mtakacs · January 6, 2025, 10:14am

Hi Danilo,

I tried to run the code on Ubuntu 24.04 and the results were basically the same. Once again, the processing time is much longer with multithreading enabled. I should add, though, that I could only test Ubuntu under WSL (Windows Subsystem for Linux) due to our IT policy.

Could you perhaps give me a hint as to what the RDataFrame equivalent of my macro above would be?

Thanks,
Marcell

Danilo · January 6, 2025, 12:05pm

Hi Marcell,

If you happen to have a profile to share, I am happy to have a look.
About the RDF example: sure! Your code is very good, but it is more of a technical test. I would like to propose this minimal example, which fills a histogram:

// Fill a TH1D with the "MET" branch
ROOT::RDataFrame d("myTree", "file.root");
auto h = d.Histo1D("MET");
h->Draw();

It comes from this page.

Just out of curiosity, what happens if you take your example and run the two lambdas in two std::threads that you then join? (I am just trying to get a hint about where the time is being spent.)

Cheers,
D

mtakacs · January 6, 2025, 4:54pm

Hi Danilo,

I rewrote the code using std::thread as follows:

#include <TTree.h>
#include <TFile.h>
#include <TString.h>
#include <iostream>
#include <fstream>
#include <TSystem.h>
#include <TStopwatch.h>
#include <thread>

void task1(){
	Long64_t TimeStamp1;
	Long64_t n1;
	TFile *f1 = new TFile("Test1.root", "read");
	TTree *Traw1 = (TTree*)f1->Get("RawData/Tglobal");
	Traw1->SetBranchAddress("TimeStampGlobal",&TimeStamp1);
	n1 = Traw1->GetEntries();
	for(Long64_t j=0; j<n1; j++){
		Traw1->GetEntry(j);	
	}
	f1->Close();
	cout << TimeStamp1 << endl;
	printf("File 1 - done\n");	
}

void task2(){
	 Long64_t TimeStamp2;
	 Long64_t n2;
	TFile *f2 = new TFile("Test2.root", "read");
	TTree *Traw2 = (TTree*)f2->Get("RawData/Tglobal");
	Traw2->SetBranchAddress("TimeStampGlobal",&TimeStamp2);
	n2 = Traw2->GetEntries();
	for(Long64_t j=0; j<n2; j++){
		Traw2->GetEntry(j);	
	}
	f2->Close();
	cout << TimeStamp2 << endl;
	printf("File 2 - done\n");
}


void multicore3(){
	TStopwatch watch;
	watch.Start();
		
	std::thread t1(task1);
	cout << "Task 1 running" << endl;	
	std::thread t2(task2);
	cout << "Task 2 running" << endl;
		
   t1.join();
   t2.join();

  cout << "(Processing time: " << watch.RealTime() << ")" <<endl;
}

However, ROOT either exits abruptly or crashes with the following error message:

root [0] .L multicore3.C+
root [1] multicore3()
Task 1 running
Task 2 running
input_line_12:1:10: fatal error: error opening file 'D:\ROOT\etc\plugins\TArchiveFile\P010_TZIPFile.C':
#include "D:\ROOT\etc\plugins\TArchiveFile\P010_TZIPFile.C"
         ^
input_line_11: error: unknown type name 'include'
input_line_12: error: expected ';' after top level declarator
warning: Failed to call `P010_TZIPFile()` to execute the macro.
Add this function or rename the macro. Falling back to `.L`.

Do you have any idea what is going on with “P010_TZIPFile()”?

Best,
Marcell

mtakacs · January 6, 2025, 5:16pm

EDIT:
I figured it out: I forgot to call ROOT::EnableThreadSafety() before creating the threads.

#include <TTree.h>
#include <TFile.h>
#include <TString.h>
#include <iostream>
#include <fstream>
#include <TROOT.h>
#include <TSystem.h>
#include <TStopwatch.h>
#include <thread>

void task1(){
	Long64_t TimeStamp1;
	Long64_t n1;
	TFile *f1 = new TFile("Test1.root", "read");
	TTree *Traw1 = (TTree*)f1->Get("RawData/Tglobal");
	Traw1->SetBranchAddress("TimeStampGlobal",&TimeStamp1);
	n1 = Traw1->GetEntries();
	for(Long64_t j=0; j<n1; j++){
		Traw1->GetEntry(j);	
	}
	f1->Close();
	cout << TimeStamp1 << endl;
	printf("File 1 - done\n");	
}

void task2(){
	 Long64_t TimeStamp2;
	 Long64_t n2;
	TFile *f2 = new TFile("Test2.root", "read");
	TTree *Traw2 = (TTree*)f2->Get("RawData/Tglobal");
	Traw2->SetBranchAddress("TimeStampGlobal",&TimeStamp2);
	n2 = Traw2->GetEntries();
	for(Long64_t j=0; j<n2; j++){
		Traw2->GetEntry(j);	
	}
	f2->Close();
	cout << TimeStamp2 << endl;
	printf("File 2 - done\n");
}


void multicore3(){
	TStopwatch watch;
	watch.Start();
	
	ROOT::EnableThreadSafety();
		
	std::thread t1(task1);
	cout << "Task 1 running" << endl;	
	std::thread t2(task2);
	cout << "Task 2 running" << endl;
		
   t1.join();
   t2.join();

  cout << "(Processing time: " << watch.RealTime() << ")" <<endl;
}

With this modification, std::thread seems to work fine. The processing time went down to 41 s, which is faster than the single-core version of the code (68 s).
