Activating multi thread slows down file reading speed

LongHoa · March 25, 2020, 7:34am

Dear expert,

I read a topic about activating multi-thread in ROOT using ROOT::EnableImplicitMT(nthreads).
I tested my new “discovery” on a tree I created myself. The tree is simple, it contains 4 branches. I fill the tree with 10 million events.
I wrote a simple code to read the tree and calculate the execution time. My code allows me to either activate or deactivate the multi-thread mode.
The result is unexpected (to me) as it cost 22 seconds to read the tree with multi-thread on (4 threads), while the mode is off, it took only 1 second.

I used top command to check the cpu usage, it was 400% when I activated the multi-thread as expected.

I use two version of root, one is from CMSSW_9_4_0 and one sourced from: /cvmfs/sft.cern.ch/lcg/views/dev3/latest/x86_64-centos7-gcc7-opt/setup.sh

Here is the code I used to generate the tree:

#include <stdio.h>
#include <iostream>
#include <pthread.h>


float myPi = 3.1415927;


void  xTest_Wmulticore (bool usenthread = 1,  int nthreads = 4)
{
	if (usenthread==true)   ROOT::EnableImplicitMT(nthreads);
	
	//system ("mkdir -p  /home/hoa/Task_TestCode/File_Root/");
	system ("mkdir -p  /afs/cern.ch/work/l/lcaophuc/Task_Test/Output/");
	
	
	TRandom *rand = new TRandom3 (0);
	
	//TFile *file = new TFile ("/home/hoa/Task_TestCode/File_Root/multiThread_Write.root", "recreate");
	TFile *file = new TFile ("/afs/cern.ch/work/l/lcaophuc/Task_Test/Output/multiThread_Write.root", "recreate");
	TTree *tree = new TTree ("tree", "my out tree");
	
	float  mass   = 0.0;
	float  energy = 0.0;
	float  theta  = 0.0;
	float  phi    = 0.0;
	
	tree -> Branch ("mass",    &mass);
	tree -> Branch ("energy",  &energy);
	tree -> Branch ("theta",   &theta);
	tree -> Branch ("phi",     &phi);
	
	
	printf ("Clock starts\n");
	TStopwatch* clock = new TStopwatch();
	clock -> Start();
	
	
	for (long ev=0; ev<10000000; ev++)
	{
		mass   = rand -> Gaus (90, 1);
		energy = rand -> Landau (40, 5);
		
		float tmptheta = rand -> Gaus (myPi/2.0, 0.5);
		float tmpphi   = rand -> Rndm ();
		
		float sinetheta = sin(tmptheta);
		
		theta = abs(asin(sinetheta));
		phi = tmpphi * myPi;
		
		tree -> Fill ();
	}
	
	
	float mytime = clock -> RealTime();
	
	int    nHour, nMinute;
	float  nSecond;
	
	nHour   = mytime / 3600;
	nMinute = (mytime - 3600*nHour)/60;
	nSecond = (mytime - 3600*nHour - 60*nMinute);
	
	printf ("it takes %02d hour(s) %02d minute(s) and %010.7f second(s)\n", nHour, nMinute, nSecond);
	
	tree ->Write("", TObject::kOverwrite);
}

and here is the one I used to read:

#include <stdio.h>
#include <iostream>

void  xTest_Rmulticore (bool usenthread)
{
	int nthreads = 4;
	
	if (usenthread==true)   ROOT::EnableImplicitMT(nthreads);
	
	//TFile *file = new TFile ("/home/hoa/Task_ScaleFactor/Input/TnPTreeZ_SingleMuon_Run2016.root", "read");
	TFile *file = new TFile ("/afs/cern.ch/work/l/lcaophuc/Task_Test/Output/multiThread_Write.root", "read");
	TTree *tree = (TTree*)file -> Get ("tree");
	
	long nEvent = tree -> GetEntries();
	long nSum = 0;
	
	
	printf ("There are %ld events, Clock starts\n", nEvent);
	TStopwatch* clock = new TStopwatch();
	clock -> Start();
	
	
	for (long i=0; i<nEvent; i++)
	{
		tree -> GetEntry (i);
		nSum ++;
	}
	
	
	Float_t mytime = clock -> RealTime();
	
	int    nHour, nMinute;
	float  nSecond;
	
	nHour   = mytime / 3600;
	nMinute = (mytime - 3600*nHour)/60;
	nSecond = (mytime - 3600*nHour - 60*nMinute);
	
	printf ("File is read. The sum has been added %ld times\n", nSum);
	printf ("it takes %02d hour(s) %02d minute(s) and %010.7f second(s)\n", nHour, nMinute, nSecond);
}

Are the processing times expected. If multithread should run faster, then what could be wrong with my code?

Thank you and best regards,
Hoa.

ROOT Version: 6.10/09
Platform: Not Provided
Compiler: Not Provided

bellenot · March 25, 2020, 7:40am

Maybe using RDataframe would help improving the situation. @eguiraud could most probably help you better than me

eguiraud · March 25, 2020, 7:56am

I don’t think this is expected and it would be interesting to profile where time is spent e.g. with perf or vtune. What ROOT version are you on?

@pcanal might be interested to know you have observed this and/or might have an explanation.

RDataFrame provides a higher-level interface that typically scales well, in which each thread processes a different TTree cluster, so as long as your file has many clusters you should see an improvement there.

Cheers,
Enrico

LongHoa · March 25, 2020, 8:16am

Thank you for your replies.
The ROOT version I used is 6.10/09. How can I use perf to profile the time spending?

Hoa.

eguiraud · March 25, 2020, 8:46am

That’s a very old ROOT version, the issue might have been fixed since then. You should try again with ROOT v6.20.

How can I use perf to profile the time spending?

You need to compile ROOT with debug symbols, i.e. specify cmake’s -DCMAKE_BUILD_TYPE=RelWithDebInfo when building. This is a tutorial on perf+flamegraphs which are one of my favourite tools to investigate runtime bottlenecks.

LongHoa · March 25, 2020, 8:56am

I’ve just tested with 6.20/02 (on lxplus). It yields the same result. Multithread still takes around 23 second.

As for perf, I’ll need time to learn, as I’m very new to linux/computing.

amadio · March 25, 2020, 2:53pm

Can you try outside of afs?

LongHoa · March 26, 2020, 6:14am

I had tested my code on my department server before I ran it on lxplus (the server uses Scientific Linux release 6.10 (Carbon)). The result is the same.

I also tested on my laptop (I’m using Ubuntu 18.04, Root 6.18/04). The result is still similar. Going through 1 millions events took me 00.9776149 second with multithread on, and only 00.1313970 second without multithread.

eguiraud · March 26, 2020, 9:02am

Hi,
is there any way you can share a little bit of the data? So we can reproduce and investigate. Just a few representative events would be enough, we can just copy them n times to make more if needed.

Cheers,
Enrico

LongHoa · March 26, 2020, 9:27am

Hi Enrico,

I have tested my code on 2 files. One is a file from CMS Muon group, I’m not sure if I can share the file to public or not. Another file I used is a file I generated myself. Both gives me the similar timing, so I think using my generated root file would be ok. the description is below:

I have 1 code to create the root file with a tree in it, I fill the tree with some random number and then read the file I created with another code.

Here are the codes I used:
For creating the tree:

#include <stdio.h>
#include <iostream>
#include <pthread.h>


float myPi = 3.1415927;


void  xTest_Wmulticore (bool usenthread = 1,  int nthreads = 4)
{
	if (usenthread==true)   ROOT::EnableImplicitMT(nthreads);
	
	//system ("mkdir -p  /home/hoa/Task_TestCode/File_Root/");
	system ("mkdir -p  /home/longhoa/ROOT_Work/Task_Test/Output/");
	
	
	TRandom *rand = new TRandom3 (0);
	
	//TFile *file = new TFile ("/home/hoa/Task_TestCode/File_Root/multiThread_Write.root", "recreate");
	//TFile *file = new TFile ("/afs/cern.ch/work/l/lcaophuc/Task_Test/Output/multiThread_Write.root", "recreate");
	TFile *file = new TFile ("/home/longhoa/ROOT_Work/Task_Test/Output/multiThread_Write.root", "recreate");
	TTree *tree = new TTree ("tree", "my out tree");
	
	float  mass   = 0.0;
	float  energy = 0.0;
	float  theta  = 0.0;
	float  phi    = 0.0;
	
	tree -> Branch ("mass",    &mass);
	tree -> Branch ("energy",  &energy);
	tree -> Branch ("theta",   &theta);
	tree -> Branch ("phi",     &phi);
	
	
	printf ("Clock starts\n");
	TStopwatch* clock = new TStopwatch();
	clock -> Start();
	
	// Genrated random value and store in the tree
	for (long ev=0; ev<1000000; ev++)
	{
		mass   = rand -> Gaus (90, 1);
		energy = rand -> Landau (40, 5);
		
		float tmptheta = rand -> Gaus (myPi/2.0, 0.5);
		float tmpphi   = rand -> Rndm ();
		
		float sinetheta = sin(tmptheta);
		
		theta = abs(asin(sinetheta));
		phi = tmpphi * myPi;
		
		tree -> Fill ();
	}
	
	// Print the processing time
	float mytime = clock -> RealTime();
	
	int    nHour, nMinute;
	float  nSecond;
	
	nHour   = mytime / 3600;
	nMinute = (mytime - 3600*nHour)/60;
	nSecond = (mytime - 3600*nHour - 60*nMinute);
	
	printf ("it takes %02d hour(s) %02d minute(s) and %010.7f second(s)\n", nHour, nMinute, nSecond);
	
	tree ->Write("", TObject::kOverwrite);
}

For reading the tree:

#include <stdio.h>
#include <iostream>

void  xTest_Rmulticore (bool usenthread)
{
	int nthreads = 4;
	
	if (usenthread==true)   ROOT::EnableImplicitMT(nthreads);
	
	//TFile *file = new TFile ("/home/hoa/Task_ScaleFactor/Input/TnPTreeZ_SingleMuon_Run2016.root", "read");
	//TFile *file = new TFile ("/afs/cern.ch/work/l/lcaophuc/Task_Test/Output/multiThread_Write.root", "read");
	TFile *file = new TFile ("/home/longhoa/ROOT_Work/Task_Test/Output/multiThread_Write.root", "read");
	TTree *tree = (TTree*)file -> Get ("tree");
	
	long nEvent = tree -> GetEntries();
	long nSum = 0;
	
	
	printf ("There are %ld events, Clock starts\n", nEvent);
	TStopwatch* clock = new TStopwatch();
	clock -> Start();
	
	
	for (long i=0; i<nEvent; i++)
	{
		tree -> GetEntry (i);
		nSum ++;
	}
	
	
	Float_t mytime = clock -> RealTime();
	
	int    nHour, nMinute;
	float  nSecond;
	
	nHour   = mytime / 3600;
	nMinute = (mytime - 3600*nHour)/60;
	nSecond = (mytime - 3600*nHour - 60*nMinute);
	
	printf ("File is read. The sum has been added %ld times\n", nSum);
	printf ("it takes %02d hour(s) %02d minute(s) and %010.7f second(s)\n", nHour, nMinute, nSecond);
}

You can test on any machine as it take only a few second to generate the root file.
I will ask for permission to share the CMS file. So I hope the generated file can help for the time being.

Best,
Hoa.

eguiraud · March 26, 2020, 9:28am

My bad for not remembering you shared this in your first post! I’ll take a look, thanks!

LongHoa · March 26, 2020, 9:30am

Thank you very much!!!

eguiraud · March 27, 2020, 6:03pm

Hi,
so the problem is easy to reproduce and here are two flamegraphs, one for the no multi-thread case and one for the IMT case: https://eguiraud.web.cern.ch/eguiraud/flamegraphs/imt_getentry

Long story short, I think we are using a TBB feature (TBB is the multi-thread task scheduler ROOT uses internally) that has a pretty big runtime cost and, at least in this benchmark, that cost is larger than the runtime of the actual TBB tasks.

@pcanal might be able to comment further. If not, I’ll open a bug report next week and we’ll go from there.

Cheers,
Enrico

LongHoa · March 30, 2020, 6:14am

Hi,

Thank you very much for investigating my case.
I’m just curious, if this issue is really caused by the current TBB feature, would there be any work around?

By the way, when I generated the tree, multithread doesn’t seem to do much. I see 120% of CPU was used instead of 400% as I set. Is it also unexpected.

Best,
Hoa.

eguiraud · March 31, 2020, 7:39am

Hi,
I do not know of a workaround for IMT GetEntry, but you might be able to divide the workload differently, e.g. one TTree per core instead of all cores working on one TTree.

The issue with IMT TTree::GetEntry is now https://sft.its.cern.ch/jira/browse/ROOT-10649.

@pcanal please give us an opinion

Cheers,
Enrico

pcanal · April 2, 2020, 5:06pm

In this case the task are just simply too small … each task is the GetEntry reading a single float … so the overhead is (of course) way too large. For example using:

	float  mass   = 0.0;
	float  energy = 0.0;
	float  theta  = 0.0;
	float  phi    = 0.0;

   std::vector<float> mass_vec(1000, 1.0);
   std::vector<float> energy_vec(1000, 2.0);
   std::vector<float> theta_vec(1000, 3.0);
   std::vector<float> phi_vec(1000, 4.0);

	tree -> Branch ("mass",    &mass);
	tree -> Branch ("energy",  &energy);
	tree -> Branch ("theta",   &theta);
	tree -> Branch ("phi",     &phi);

   tree -> Branch ("mass_vec",    &mass_vec);
   tree -> Branch ("energy_vec",  &energy_vec);
   tree -> Branch ("theta_vec",   &theta_vec);
   tree -> Branch ("phi_vec",     &phi_vec);

you can see a speedup from the use of thread.

For the original file, using RDF style parallelism will be much efficient.

In term of IMT and GetEntry, this is indeed technically a deficiency. We should extent the logic to regroup “small” branches or to process them sequentially.

Cheers,
Philippe.