Need Help with TTreeProcessorMT: Increasing Number of Threads Beyond 7

Hi everyone,

I’m working with TTreeProcessorMT and have come across some behavior that I’m having trouble understanding. I’ve provided a simplified version of my implementation below:

void treeProcessor::filter_events_v2(likelihoodNet &lnet)
{
    int numberOfThreads = 7;
    ROOT::EnableImplicitMT(numberOfThreads);

    // Create a TThreadedObject to hold a TGraph for each thread
    ROOT::TThreadedObject<TGraph> threadedScatter;

    // Create a TTreeProcessorMT: specify the file and the tree in it
    ROOT::TTreeProcessorMT processor(fileName, "myTree");

    std::atomic<int> taskCounter{0};

    // Define the function that will process a subrange of the tree.
    // The function must receive only one parameter, a TTreeReader,
    // and it must be thread safe. To enforce the latter requirement,
    // a TThreadedObject<TGraph> is used.
    auto processFunction = [&](TTreeReader &reader)
    {
        // Access the event branch using TTreeReaderValue
        TTreeReaderValue<ProcessedEvent> processed_event(reader, "event");

        int taskNumber = taskCounter.fetch_add(1);

        // For performance reasons, keep a local copy of the pointer to
        // this thread's TGraph on the stack
        auto localThreadedScatter = threadedScatter.Get();

        int localGraphPointCount = 0;

        // Process each entry in the current task's range
        while (reader.Next())
        {
            if (/*some filtering logic*/)
            {
                localThreadedScatter->SetPoint(localThreadedScatter->GetN(), xval, yval);
                localGraphPointCount++;
            }
        }
        std::cout << "Task " << taskNumber << " added " << localGraphPointCount << " points." << std::endl;
    };

    // Launch the parallel processing of the tree
    processor.Process(processFunction);

    // Use the TThreadedObject::Merge method to merge the thread-private scatter plots
    // into the final result
    auto scatterMerged = threadedScatter.Merge();

    // Copy the merged graph into the scatter TGraph
    *scatter = *scatterMerged;
}

In this function, I noticed that regardless of the value assigned to numberOfThreads, there are always 7 clusters. Interestingly, the minimum execution time is achieved when numberOfThreads = 7. As I increase the number of threads from 1 to 7, the execution time decreases. However, going beyond 7 threads has no effect other than random fluctuations in execution time.

My computer has 30 available threads, and I’d like to make full use of them. Is there a way to manually set the number of clusters or configure the function to use more than 7 threads?

Any help or suggestions would be greatly appreciated!

Thanks in advance!

In this particular case my guess would be that your input tree has 7 TTree entry clusters :)

EDIT:
The reason why a TTree entry cluster is the smallest granularity for parallelism is that if two threads processed different parts of the same cluster, each thread would have to read and decompress the same data, resulting in redundant work.
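
To illustrate why the timing flattens out at 7 threads, here is a rough sketch in plain C++ (no ROOT needed). It is only a toy model, not how TTreeProcessorMT schedules work internally: the function name `wallTimeInClusterUnits` is made up for this example, and it assumes every cluster costs the same to process and that a cluster is never split between threads.

```cpp
// Toy model (assumption: all clusters cost the same and are indivisible).
// Returns the wall time in units of "time to process one cluster".
// With nThreads workers, the clusters are processed in rounds of at most
// nThreads clusters each, so the time is ceil(nClusters / nThreads).
long wallTimeInClusterUnits(long nClusters, long nThreads)
{
    if (nClusters <= 0 || nThreads <= 0)
        return 0; // nothing to do, or no workers: treated as 0 in this sketch
    return (nClusters + nThreads - 1) / nThreads; // integer ceiling division
}
```

For 7 clusters this gives 7 units with 1 thread, 4 with 2 threads, and 1 with 7 threads. With 8 or 30 threads it is still 1, because the extra threads have no cluster left to work on. If the guess about 7 clusters is right, that matches what you observe. You can inspect the cluster layout of your tree with `TTree::GetClusterIterator`.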