Cannot set more than 5 threads for EnableImplicitMT

sandor.lokos · March 21, 2023, 1:39pm

Dear Experts,

I am experimenting with ROOT’s parallel tree processing. A skeleton of the macro is below; it is based on the imt101__parTreeProcessing_8C example macro. I successfully extended it to do my task; it compiles and works well with up to 5 cores. But if I pass a number larger than 5 to ROOT::EnableImplicitMT(nthreads), it still uses 5 cores. I checked in top: with nthreads=10 or any other larger value, only 5 cores are used. I also plotted the average speed of the task, which is quite heavy: the differences between 1, 2, 3, 4 and 5 cores are clearly there, but beyond 5 no change can be seen. The desktop I’m running on has 24 cores.
Of course, there are other ways to parallelize further, e.g., simply in bash, but it would be more convenient to use just one method: ROOT.

I’m using ROOT 6.26/10 on Ubuntu 22.04.2 LTS.

Thanks in advance, Sandor

The Makefile:

CC=g++
CPPFLAGS=-std=c++20 -lpthread
TARGET=<macroname>
OBJS=<macroname>.o
ROOTLIBS=`${ROOTSYS}/bin/root-config --cflags --glibs`


.SUFFIXES   : .o .cc
.SUFFIXES   : .o .C

.cc.o :
	$(CC) $(FFLAGS) $(ROOTLIBS) -c $<
.C.o :
	$(CC) $(FFLAGS) $(ROOTLIBS) -c $<


all: $(OBJS) $(HEADERS)
	$(CC) $(OBJS) $(CPPFLAGS) $(ROOTLIBS) -o $(TARGET) 

clean:
	-rm -rf *.o *.d $(TARGET)

#include <random>
#include <iostream>
#include <future>
#include "TROOT.h"
#include "TCanvas.h"
#include "TLegend.h"
#include <TFile.h>
#include <TF1.h>
#include <TH2F.h>
#include <TTree.h>
#include <cstdlib>
#include <TLorentzVector.h>
#include <vector>
#include "ROOT/TExecutor.hxx"
#include "ROOT/TThreadedObject.hxx"
#include "ROOT/TTreeProcessorMT.hxx"
#include "TTreeReader.h"
#include <TTreeReaderArray.h>
#include <chrono>

using namespace std;
using namespace std::chrono;


const int NFILE = 1000;

TLorentzVector signal(vector<TLorentzVector> a, vector<TLorentzVector> b)
{
	/* Do stuff */
}


int main(int argc, char** argv)
{
    if ( argc < 2 )
	{
		std::cerr << "Need one argument!" << std::endl;
		return 1;
	}
	auto start1 = high_resolution_clock::now();
	TFile * theFile;
	int nthreads = atoi(argv[1]);
	ROOT::EnableImplicitMT(nthreads);
	std::cerr << "Number of threads: " << nthreads << std::endl;
	ROOT::TThreadedObject<TH1F> some_histo_merged("some_histo",";)",300,0.6,0.9);
	some_histo_merged->Sumw2();
	
	for( int ifile = 0 ; ifile < NFILE ; ifile++)
	{
		ROOT::TThreadedObject<TH1F> some_histo("some_histo",";)",300,0.6,0.9);
		some_histo->Sumw2();
		ROOT::TTreeProcessorMT tp(Form("<path>", ifile), "<theTree>"); // "<path>" stands for a format string that takes the file index
		
		auto myFunction = [&](TTreeReader &myReader)
		{
			TTreeReaderArray<double> some_array(myReader, "some_array");
			TTreeReaderValue<int> some_int(myReader, "some_int");
					
			while(myReader.Next())
			{
				/* Do stuff with signal() function and fill some_histo */
			}
		};
	
		tp.Process(myFunction);
		auto MergedHisto = some_histo.Merge();
		// Merge() returns a shared_ptr; Add() copies the contents, so no Clone() is needed
		some_histo_merged->Add(MergedHisto.get());
		
	}
	
	TFile * outfile = new TFile("<path>","recreate");
	some_histo_merged->Write();
	outfile->Close();
	delete outfile;
	auto stop1 = high_resolution_clock::now();
	
	auto duration1 = duration_cast<microseconds>(stop1 - start1);
	std::cout << duration1.count() << std::endl;

	
	return 0;
}

Wile_E_Coyote · March 21, 2023, 1:46pm

Have you tried with 0?

sandor.lokos · March 21, 2023, 1:58pm

Interesting. If I give ROOT::EnableImplicitMT(0) it also uses 5 cores.

Wile_E_Coyote · March 21, 2023, 2:06pm

So, it looks like something on your system allows you to use only 5 cores.
Check “nproc” and “top -1” (for the current “%Cpu” usage).
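The core count can also be checked from inside the program. The following is a minimal sketch in plain C++, not ROOT code: `resolve_thread_count` and `logical_cores` are hypothetical helpers, and the policy they encode (0 means "use all logical cores", larger requests are capped at the core count) is an assumption based on how EnableImplicitMT is described in this thread; ROOT's actual behaviour may differ in detail.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper (not part of ROOT): the number of worker threads one
// would expect for a requested count on a machine with `hw` logical cores.
// Assumption: 0 means "use all logical cores"; requests above `hw` are capped.
unsigned resolve_thread_count(unsigned requested, unsigned hw)
{
    if (hw == 0) hw = 1;            // hardware_concurrency() may return 0 when unknown
    if (requested == 0) return hw;  // 0 -> let the runtime use every core
    return std::min(requested, hw);
}

// Logical cores as seen by the C++ runtime; usually comparable to `nproc`.
unsigned logical_cores()
{
    return std::thread::hardware_concurrency();
}
```

Calling `resolve_thread_count(atoi(argv[1]), logical_cores())` and printing the result next to the `nproc` output shows whether the request itself is being limited, or whether the limit comes from somewhere else.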

sandor.lokos · March 21, 2023, 2:12pm

Thanks for the answer.

nproc returns 24. In top I can see that only 5 cores are in use if nthreads=0 or more than 4.

Interestingly, if I run with a bash script like this:

for i in {0..3}; do
    ./macro 5 $i &
done
wait

20 cores are used according to top. So it’s not really the system, I think, but I could be wrong.

Wile_E_Coyote · March 21, 2023, 2:17pm

So, maybe the ROOT::TTreeProcessorMT or the TTreeReader thinks it doesn’t make sense to use more than 5 cores?

sandor.lokos · March 21, 2023, 2:42pm

That’s possible, but I don’t know why. It doesn’t seem to be the optimal distribution.

I have now watched top more closely: for a brief moment 6 cores were active, but not at 100% usage. Then, I guess, ROOT::TTreeProcessorMT or the TTreeReader rearranged the distribution, and a few seconds after launch the program runs on 5 cores.

So maybe this is really the optimal number of cores. So, after all, I learned that the maximum number of threads is not defined by the user but is determined by ROOT. I think this makes sense.

Thanks for the comments!

Wile_E_Coyote · March 21, 2023, 2:57pm

You could try to test it … use the “hadd” utility to sum, e.g., 10 or 100 files, and then try your “macro”.

eguiraud · March 21, 2023, 3:01pm

Hi @sandor.lokos ,

and welcome to the ROOT forum!

My best guess is that the input tree only contains 5 or 6 TTree clusters, so TTreeProcessorMT does not create more than 5 tasks. If this is correct, running on a larger input (e.g. a TChain that concatenates your original input 100 times) should make the CPU utilization go up.

Cheers,
Enrico

sandor.lokos · March 21, 2023, 3:08pm

Thanks, this is the solution, and @eguiraud pointed out the reason.

I have 1700 ROOT files of around 110 MB each, with small trees. I have now merged 10 of them and run on the single resulting file, about 1.4 GB in size, and the code now indeed uses the 20 cores I requested.

So TTreeProcessorMT smartly figures out whether the specified number of cores is really necessary or not.

Thanks to both of you for the answers!

Wile_E_Coyote · March 21, 2023, 3:12pm

So, create a TChain with all your (small) files, and then run the ROOT::TTreeProcessorMT on it.
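How many small files need to be processed together follows from simple arithmetic. The sketch below is plain C++ with hypothetical helper names (`files_per_batch`, `make_batches`); it assumes every file holds roughly the same number of clusters (about 5 in this thread) and that one cluster corresponds to at most one task.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: smallest number of files per batch such that one batch
// offers at least one cluster per thread. Assumes each file holds roughly
// `clusters_per_file` clusters.
std::size_t files_per_batch(std::size_t n_threads, std::size_t clusters_per_file)
{
    if (clusters_per_file == 0) return 1;
    // Ceiling division: n_threads / clusters_per_file, rounded up.
    return std::max<std::size_t>(1, (n_threads + clusters_per_file - 1) / clusters_per_file);
}

// Split the full file list into consecutive batches of `batch_size` files.
std::vector<std::vector<std::string>> make_batches(const std::vector<std::string> &files,
                                                   std::size_t batch_size)
{
    std::vector<std::vector<std::string>> batches;
    if (batch_size == 0) batch_size = 1;
    for (std::size_t i = 0; i < files.size(); i += batch_size) {
        const std::size_t end = std::min(files.size(), i + batch_size);
        batches.emplace_back(files.begin() + i, files.begin() + end);
    }
    return batches;
}
```

With 20 threads and about 5 clusters per file, `files_per_batch(20, 5)` gives 4, so merging 10 files leaves ample work for every thread. With ROOT available, the whole list (or one batch) can be handed to a single ROOT::TTreeProcessorMT, which has a constructor taking a vector of file names and a tree name, instead of creating one processor per file.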

eguiraud · March 21, 2023, 3:14pm

If you check with a debugger you will most probably see that (up to the available number of logical cores) the number of threads spawned is equal to what you specified as an argument to EnableImplicitMT. What’s happening is that some of the threads are idle as they don’t have any task to pick up. The smallest granularity of a multi-thread task in TTreeProcessorMT is that of a single TTree cluster (tree->Print("clusters") gives information about the number of clusters in a tree).

Cheers,
Enrico
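The effect of idle threads can be reproduced with a toy model. The sketch below uses plain std::thread and is not ROOT's scheduler: `run_toy_pool` and `RunStats` are hypothetical names, each task stands in for one TTree cluster, and the 5 ms sleep is a placeholder for real work.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

struct RunStats {
    std::size_t tasks_done;    // tasks processed in total
    std::size_t busy_threads;  // threads that processed at least one task
};

// Toy model: `n_threads` workers pull task indices from a shared counter.
// A worker that finds no task left exits without ever doing any work.
RunStats run_toy_pool(std::size_t n_tasks, std::size_t n_threads)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::size_t> done(n_threads, 0);  // per-thread task counts
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&next, &done, n_tasks, t] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= n_tasks) break;  // nothing left to pick up
                std::this_thread::sleep_for(std::chrono::milliseconds(5));  // fake work
                ++done[t];  // each thread writes only its own slot
            }
        });
    }
    for (auto &w : workers) w.join();

    RunStats stats{0, 0};
    for (std::size_t d : done) {
        stats.tasks_done += d;
        if (d > 0) ++stats.busy_threads;
    }
    return stats;
}
```

With `run_toy_pool(5, 10)` all ten threads are spawned, yet no more than five of them can ever be busy, which matches the picture seen in top when a file contains only about 5 clusters.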
