TChain/TClass Poor Multi-Threaded Scaling

Attila_Krasznahorkay · October 23, 2018, 11:24am

Dear All,

I’ve spent the last week or so defining a custom data source for ROOT::RDataFrame, and came across the following issue while developing that code.

What I observed is that certain operations are very inefficient when run in multiple threads. I was trying to push as much of my custom data source's initialisation as possible into the multi-threaded execution of RDataFrame. But to my surprise, this made my tests a lot slower than executing the same initialisation in a single thread before the rest of the code runs multi-threaded. (This means that all of these jobs currently have an unavoidable initialisation time of ~8 seconds.)

To demonstrate the issue in a piece of code that only uses ROOT’s own classes, I wrote this example:

https://gitlab.cern.ch/akraszna/xAODDataSource/blob/master/xAODDataFrameTests/util/threadChainTest.cxx

Since the repo is not public, the relevant code from this file is:

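// ROOT and standard headers used below. The xAOD and xDFT headers of the
// (non-public) repository are omitted from this excerpt.
#include <ROOT/TThreadExecutor.hxx>
#include <TBranchElement.h>
#include <TChain.h>
#include <TClass.h>
#include <TObjArray.h>
#include <TROOT.h>

#include <string>
#include <vector>

// Forward declaration, since main() uses the function before its definition.
void scanFiles( const std::vector< std::string >& fileNames );

int main( int argc, char* argv[] ) {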

   // Read the command line options.
   const xDFT::CommandLineOptions cmdl( argc, argv );

   // Set up the runtime environment.
   ROOT::EnableThreadSafety();
   RETURN_CHECK( APP_NAME, xAOD::Init() );

   // Execute the file scanning using N parallel threads, X times.
   const std::vector< std::vector< std::string > >
      args( 50, cmdl.inputFiles() );
   ROOT::TThreadExecutor pool( cmdl.nThreads() );
   pool.Foreach( scanFiles, args );

   // Return gracefully.
   return 0;
}

void scanFiles( const std::vector< std::string >& fileNames ) {

   // Set up a TChain for reading the files.
   TChain chain( "CollectionTree" );
   for( const std::string& fname : fileNames ) {
      chain.Add( fname.c_str() );
   }

   // Load the first entry/file.
   chain.LoadTree( 0 );

   // Scan the branches of the tree.
   TObjArray* branches = chain.GetListOfBranches();
   for( Int_t i = 0; i < branches->GetEntries(); ++i ) {
      TBranchElement* br = dynamic_cast< TBranchElement* >( branches->At( i ) );
      if( ! br ) {
         continue;
      }
      TClass::GetClass( br->GetClassName() );
   }

   return;
}

Now, when I run this test with different numbers of threads, I see the following scaling behaviour:

I.e. beyond a certain number of threads, the internal locks of ROOT start to hurt the execution pretty badly.
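The contention pattern can be illustrated without ROOT. The following is only a minimal sketch under stated assumptions, not ROOT code: LockedRegistry is a hypothetical stand-in for any lookup guarded by a single global lock, and LocalCache and countLocks are invented names for this illustration. It counts how often the global lock is taken when every thread repeats the same lookups, with and without a per-thread cache in front of the registry.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a registry guarded by one global lock.
// This is NOT ROOT's TClass machinery, just an illustration.
class LockedRegistry {
public:
   // Returns a stable id for the name; every call takes the global lock.
   int lookup( const std::string& name ) {
      std::lock_guard< std::mutex > guard( m_mutex );
      ++m_lockCount;
      auto result = m_ids.emplace( name, static_cast< int >( m_ids.size() ) );
      return result.first->second;
   }
   // Number of times the global lock was acquired so far.
   int lockCount() const { return m_lockCount.load(); }
private:
   std::mutex m_mutex;
   std::unordered_map< std::string, int > m_ids;
   std::atomic< int > m_lockCount{ 0 };
};

// Per-thread cache: only the first lookup of each name reaches the registry.
class LocalCache {
public:
   explicit LocalCache( LockedRegistry& registry ) : m_registry( registry ) {}
   int lookup( const std::string& name ) {
      auto itr = m_cache.find( name );
      if( itr != m_cache.end() ) {
         return itr->second;
      }
      const int id = m_registry.lookup( name );
      m_cache.emplace( name, id );
      return id;
   }
private:
   LockedRegistry& m_registry;
   std::unordered_map< std::string, int > m_cache;
};

// Runs nThreads workers, each making nRepeats passes over the names, and
// returns how many times the global lock was acquired in total.
int countLocks( bool useCache, int nThreads, int nRepeats,
                const std::vector< std::string >& names ) {
   LockedRegistry registry;
   std::vector< std::thread > workers;
   for( int t = 0; t < nThreads; ++t ) {
      workers.emplace_back( [ &registry, &names, useCache, nRepeats ]() {
         LocalCache cache( registry );
         for( int r = 0; r < nRepeats; ++r ) {
            for( const std::string& name : names ) {
               const int id = ( useCache ? cache.lookup( name )
                                         : registry.lookup( name ) );
               assert( id >= 0 );
               ( void ) id;
            }
         }
      } );
   }
   for( std::thread& worker : workers ) {
      worker.join();
   }
   return registry.lockCount();
}
```

With 4 threads, 10 passes and 3 names, the uncached variant takes the lock 120 times, while the cached one takes it 12 times (once per name per thread). Whether something like this is applicable to the real lookups depends on the returned objects staying valid for the lifetime of the cache, which is an assumption here.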

Just to show one more thing, this is the profile I get from GPerfTools when running the executable with 8 threads:

threadChainTest_t8.pdf (17.7 KB)

I thought I’d write this up on the forum instead of opening a Jira ticket, since it’s not really a bug in the code. I just wanted to discuss whether it would be possible to improve on this situation…

Cheers,
Attila

ROOT Version: 6.14/04
Platform: x86_64-slc6-gcc62-opt
Compiler: GCC 6.2

amadio · October 23, 2018, 2:09pm

Hi Attila, Danilo and I have identified TClass::GetBaseClassOffset() as a problem on a few different occasions. At the time I was using TBufferMerger benchmarks and VTune to optimize ROOT I/O. This already received some attention a couple of months ago (see, e.g., commit 9ded3b85), and even before that. We plan to revisit it later to optimize things further. We want to identify when TClass::GetBaseClassOffset() simply returns 0 (the vast majority of cases), and not take a lock at all if possible. This may also be interesting for you to look at. Specifically the comment right after the one linked above. @Danilo may have more to add on the optimization. We were discussing this on the plane on the way back from CHEP.
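To make the "do not take a lock when the offset is 0" idea concrete, here is a minimal sketch of one way such a fast path could look. It is not ROOT's implementation and not what commit 9ded3b85 does: the class BaseOffset, its members, and the use of a stored value as a stand-in for the expensive offset computation are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical illustration of a lock-free fast path for a result that is
// usually 0. This is NOT TClass::GetBaseClassOffset(), only a sketch.
class BaseOffset {
public:
   // slowValue stands in for the result of the expensive, locked computation.
   explicit BaseOffset( long slowValue ) : m_slowValue( slowValue ) {}

   long get() {
      // Fast path: once the offset is known to be 0, skip the lock entirely.
      if( m_knownZero.load( std::memory_order_acquire ) ) {
         return 0;
      }
      // Slow path: take the lock and "compute" the offset.
      std::lock_guard< std::mutex > guard( m_mutex );
      ++m_lockedCalls;
      const long offset = m_slowValue;
      if( offset == 0 ) {
         // Publish the result so that later calls can use the fast path.
         m_knownZero.store( true, std::memory_order_release );
      }
      return offset;
   }

   // Number of calls that had to take the lock.
   int lockedCalls() const { return m_lockedCalls.load(); }

private:
   const long m_slowValue;
   std::mutex m_mutex;
   std::atomic< bool > m_knownZero{ false };
   std::atomic< int > m_lockedCalls{ 0 };
};
```

In this sketch only the first call for a zero offset takes the lock; non-zero offsets still go through the locked path every time. A real implementation would also have to decide how to cache non-zero results and how to handle classes whose information is loaded later.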

amadio · October 23, 2018, 2:17pm

BTW, any chance you could share one of the files with us, so that we can use it when working on optimization in ROOT?
