main.cpp (1.2 KB)
ROOT Version: 6.18/04
Platform: Ubuntu linux
Compiler: GCC 9.2.1
Hi everyone,
I was updating my code to run it multi-threaded way using OpenMP, and noticed that in a scenario with more than 1 thread, instead of running faster the code slows down a lot.
I ended up producing a close-to-minimal working example, which is attached. Note that it is a standalone C++ program and not a ROOT macro.
The code essentially samples random numbers in a loop (from 0 to Ncasc-1), via TF1::GetRandom(), with the TF1 object (TF1* flt [nthreads] ) created based on ff_lt function.
To make the code parallel, I create an array of TF1 objects, one object per thread.
double ff_lt(double* x, double* par)
{
double lt2 = x[0]x[0]; // l_t^2
double& mg = par[0]; // m_g,min
double& mu = par[1]; // \mu
return x[0] * log(1. + lt2/(exp(1.0) * mgmg)) / pow(lt2 + mu*mu, 2);
}int main(int argc, char **argv)
{
int nthreads = atoi(argv[1]);
std::cout << “running with " << nthreads << " threads\n”;
int rseed = 438468301;
double mg = 0.3, mu = 0.4;
const int Ncasc = 10000;
omp_set_num_threads(nthreads);
ROOT::EnableThreadSafety();
// initializing the TF1 objects for all threads
TF1 * flt [nthreads];
for(int i=0; i<nthreads; i++) {
char flt_ROOT_name [20];
sprintf(flt_ROOT_name, “dsigmadlt_%i”, i);
flt[i] = new TF1(flt_ROOT_name, ff_lt, 0., 10.0, 2);
}
std::cout << “init done\n”;
// parallel loop
#pragma omp parallel for schedule(static, 1000)
for(int icasc=0; icasc<Ncasc; icasc++) {
int thread_id = omp_get_thread_num();
flt[thread_id]->SetParameters(mg, mu);
double lt = flt[thread_id]->GetRandom();
} // end parallel loop
return 0;
}
on Core i5-8xxx desktop/laptop, such code with nthreads==1 runs for 6.7 sec, whereas with nthreads==4 (4 threads) the execution time is 22.8 sec.
So, parallelizing the main loop into 4 threads makes the code almost 4 times slower, instead of 4 times faster!
To my understanding supported by a quick debugging with gdb, what happens when the loop is parallelized using OpenMP is the following:
all the created TF1 objects flt[thread_id] call the same shared random number generator (presumably gRandom ?). Presumably, gRandom can’t be used in parallel, which is ensured after the call to ROOT::EnableThreadSafety().
So in the main loop, each thread spends most of its time waiting when the gRandom is unlocked (unused by other threads) to call gRandom->Rndm()
Perhaps it works like that by design. The question is: can one make each of the TF1 objects to use its own instance of TRandom3 class to do the random sampling completely separately?
I couldn’t find anything like SetRandom() in TF1 class.
There seems to be a newer class ROOT::Math::DistSampler which allows to set an individual TRandom instance for each instance of the sampler. However, I wasn’t able to find or construct a minimal working example, completely equivalent to TF1::GetRandom() but using the DistSampler class, not in a standalone C++ program (not in a ROOT macros).