Confusion on how to run processes in parallel

scheng5134 · December 8, 2022, 7:21am

I don’t know how to tell ROOT to run some for loops on parallel threads.
Currently my code:

Grabs a data file with a TTree called “Data_R”
Assigns a variable “rt” to hook onto the “Time” branch values
Then loops through the data where I just try to find the time difference between adjacent counts.

But I want it to:

Grab a data file with a TTree called “Data_R”
Assign a variable “rt” to hook onto the “Time” branch values
Split the dataframe into 6 pieces (if someone knows a better way to do this, please let me know)
Then loop through the separate pieces on individual threads, finding the time difference between adjacent counts in each, ideally lowering the processing time.

Currently the code looks like this:

TFile *g0 = new TFile(fname.Data());
TTree *kListData = (TTree*)g0->Get("Data_R");
ULong64_t rt{0};
kListData->SetBranchAddress("Time",&rt);
ROOT::RDataFrame data("Data_R",fname.Data());

auto Task1 = data.Range(0*count/6,1*count/6);
auto Task2 = data.Range(1*count/6,2*count/6);
auto Task3 = data.Range(2*count/6,3*count/6);
auto Task4 = data.Range(3*count/6,4*count/6);
auto Task5 = data.Range(4*count/6,5*count/6);
auto Task6 = data.Range(5*count/6,6*count/6);

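// Stop one entry early: the body reads entry i+1, which does not exist for the last entry.
const Long64_t nEntries = kListData->GetEntries();
for (Long64_t i = 0; i + 1 < nEntries; i++)
{
	kListData->GetEntry(i);
	const double r1 = rt;

	kListData->GetEntry(i+1);
	const double diff = rt - r1; // time difference between adjacent counts
}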

I’m not sure how to tell ROOT to run the for loop for each of the “Task” dataframes on an individual thread. Can anyone tell me how?

eguiraud · December 8, 2022, 10:54am

Hi @scheng5134 ,

in your example code you create an RDataFrame object but never use it to loop over the data: the loop instead calls GetEntry directly on the TTree.

RDataFrame parallelizes work internally by splitting the dataset as it sees fit and then executing the operations you register on it for every dataset chunk. You can find more information in the RDF user guide.

If you want/need to be in control of how the dataset is split, your best bet is to use TThreadExecutor or another multi-thread (or multi-processing) scheduling mechanism and schedule, on different threads, the processing of each dataset split. The most important thing you have to take care of is that every thread must open and close its own TFile and use a different copy of the TTree extracted from the thread-local TFile.
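
As a rough illustration of the manual-splitting idea, here is a minimal sketch in standard C++ only, with no ROOT calls. It assumes the “Time” values have already been copied into a std::vector, and it uses std::async as a stand-in for TThreadExecutor or whichever scheduler you pick. The names adjacentDiffs and parallelAdjacentDiffs are hypothetical helpers made up for this example. In a real ROOT job, each task would open its own TFile and fetch its own TTree, as described above, instead of sharing an in-memory vector. Note that each chunk reads one element past its end, so the difference that straddles a chunk boundary is not lost.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical helper: computes times[i+1] - times[i] for every i in [begin, end).
// The caller must make sure that index `end` itself is still a valid element.
std::vector<double> adjacentDiffs(const std::vector<std::uint64_t>& times,
                                  std::size_t begin, std::size_t end)
{
    assert(begin <= end && end < times.size());
    std::vector<double> diffs;
    diffs.reserve(end - begin);
    for (std::size_t i = begin; i < end; ++i) {
        // Same arithmetic as the loop in the question: convert to double, then subtract.
        diffs.push_back(static_cast<double>(times[i + 1]) - static_cast<double>(times[i]));
    }
    return diffs;
}

// Hypothetical helper: splits the work into nChunks ranges, runs each range in its
// own asynchronous task, and concatenates the results in the original order.
std::vector<double> parallelAdjacentDiffs(const std::vector<std::uint64_t>& times,
                                          std::size_t nChunks)
{
    if (times.size() < 2 || nChunks == 0) {
        return {};
    }
    const std::size_t nDiffs = times.size() - 1; // N values give N-1 differences

    std::vector<std::future<std::vector<double>>> tasks;
    tasks.reserve(nChunks);
    for (std::size_t c = 0; c < nChunks; ++c) {
        // Same integer split as Range(c*count/6, (c+1)*count/6) in the question.
        const std::size_t begin = c * nDiffs / nChunks;
        const std::size_t end = (c + 1) * nDiffs / nChunks;
        tasks.push_back(std::async(std::launch::async, [&times, begin, end] {
            return adjacentDiffs(times, begin, end);
        }));
    }

    std::vector<double> all;
    all.reserve(nDiffs);
    for (auto& task : tasks) {
        const std::vector<double> part = task.get(); // blocks until that chunk is done
        all.insert(all.end(), part.begin(), part.end());
    }
    return all;
}
```

Calling parallelAdjacentDiffs(times, 6) would mirror the six slices from the question. Whether this is actually faster than a single loop depends on how expensive the per-entry work is; for a plain subtraction, reading the file is likely to dominate.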

Cheers,
Enrico
