OpenMP and TTree::GetEntry()

preimer · February 26, 2020, 3:40am

I would like to use OpenMP to parallelize the processing of an ntuple. I chose OpenMP because it is already used to parallelize the routine that processes each entry. My problem is that I can’t find the correct way to tell OpenMP how to access the tree: none of the “copyin”, “shared”, or “private” clauses seems to work with “myTree”. I would like to do something like this:

  // Tell root that I do want to do this safely in parallel
  ROOT::EnableThreadSafety();
  int nthreads = 4;
  ROOT::EnableImplicitMT(nthreads);

  // open the file and get the ntuple object
  TTree *evtTree;
  TFile *inFile = new TFile("inputFile.root", "read", "inFile", 0);
  inFile->GetObject("theTree", myTree);
  myTree->SetImplicitMT(false);

  static long int nEvts;
  nEvts = myTree->GetEntries();

// tell the compiler that we want to parallelize this for loop and how the
// threads should access the variables
#pragma omp parallel for \
  num_threads(4),    \
  default(none),        \
  copyin(nEvts),	\
  shared(evtTree),	\
  reduction(+:result)\

  for (long int iEvt = 0; iEvt<nEvts; iEvt++) {
      Int_t          iTree;
      Int_t          jTree;
      Double_t       xTree;
      Double_t        yTree;
      
      // Set branch addresses.
      evtTree->SetBranchAddress("i",&iTree);
      evtTree->SetBranchAddress("j",&jTree);
      evtTree->SetBranchAddress("x",&xTree);
      evtTree->SetBranchAddress("y",&yTree);
  
      evtTree->GetEntry(iEvt);

      /*
      The routine processEvent will do many things to the event and already uses 
       openmp.  I now have access to a machine with the ability to use more cores than 
       processEvent can efficiently use
      */
      double localResult = processEvent(iTree, jTree, xTree, yTree);
      result += localResult;
    }

Without the “#pragma omp parallel for” the code does what I expect, although using fewer cores and slower than I would like.

Thanks in advance for your advice,

Paul

ROOT Version: 6.19/01
Platform: OSX and Ubuntu
Compiler: Not Provided

Axel · February 26, 2020, 8:24pm

Hi Paul,

We’re still discussing how to help you. The underlying issue is that a TTree cannot be accessed from multiple threads. Instead you need to create one TFile object per thread, each reading one TTree, even if they are all reading the same file on disk. But how to get there with OMP isn’t clear to us yet. Ideas?

Cheers, Axel.

preimer · February 26, 2020, 9:17pm

I thought this would have been a common problem for which I had not yet stumbled upon the correct solution. My present thought is that if

/*
      The routine processEvent will do many things to the event and already uses 
       openmp.  I now have access to a machine with the ability to use more cores than 
       processEvent can efficiently use
      */
      double localResult = processEvent(iTree, jTree, xTree, yTree);

takes a macroscopic amount of time compared to

      Int_t          iTree;
      Int_t          jTree;
      Double_t       xTree;
      Double_t        yTree;
      
      // Set branch addresses.
      evtTree->SetBranchAddress("i",&iTree);
      evtTree->SetBranchAddress("j",&jTree);
      evtTree->SetBranchAddress("x",&xTree);
      evtTree->SetBranchAddress("y",&yTree);
  
      evtTree->GetEntry(iEvt);

I could make evtTree shared (at some points in this example I see that I called it myTree) and put a lock around the latter section, thus ensuring that evtTree is never accessed in parallel. The SetBranchAddress calls are necessary in each thread every time, since other threads will change the addresses to point to their own versions of iTree, jTree, xTree, and yTree. Thinking like a FORTRAN programmer: if the point in memory occupied by evtTree can be accessed by all threads, just not at the same time, might it work?

I haven’t tried it yet, but was writing a small example to see if it might.

Paul

StephanH · February 27, 2020, 2:10pm

What about pulling the opening of the files and SetBranchAddress (including the value buffers) etc into the parallel OMP section?
It will open the file four times, and you can do whatever you want with each instance, because it’s thread-local. Now, the challenge is “just” to divide the range to be processed into four chunks. You can ask OMP for the task ID, you ask the file for how many events it has, and compute the range to process.

It’s not the most efficient way to do it, since you might have to decompress clusters in the file multiple times when two threads read inside the same cluster, but that’s not a problem if processEvent takes longer than loading the data anyway.
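
The "compute the range to process" step could look roughly like the sketch below. The helper name entryRange is hypothetical (it is not part of ROOT or OpenMP), and it only does the arithmetic: given the number of entries, the number of chunks, and a chunk index, it returns a half-open entry range.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not part of ROOT or OpenMP): returns the half-open
// entry range [first, second) that chunk `chunk` out of `nChunks` should
// process. The first (nEntries % nChunks) chunks receive one extra entry,
// so chunk sizes differ by at most one and every entry is covered once.
std::pair<long long, long long> entryRange(long long nEntries, int nChunks, int chunk) {
    assert(nEntries >= 0 && nChunks > 0 && chunk >= 0 && chunk < nChunks);
    const long long base = nEntries / nChunks;   // minimum entries per chunk
    const long long extra = nEntries % nChunks;  // chunks that get one more
    const long long begin = chunk * base + (chunk < extra ? chunk : extra);
    const long long end = begin + base + (chunk < extra ? 1 : 0);
    return {begin, end};
}
```

Inside a "#pragma omp parallel" region, each thread would presumably open its own TFile, fetch its own TTree, set branch addresses on thread-local variables, and then call entryRange(tree->GetEntries(), omp_get_num_threads(), omp_get_thread_num()) to find the entries it should loop over.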

preimer · February 27, 2020, 9:38pm

I have two solutions that both seem to work on my trivial test case. The first is what I suggested. That is to use

omp_set_lock(&getEventLock);

and

omp_unset_lock(&getEventLock);

around the GetEntry call.
The second is what StephanH suggested: divide the data into n chunks, one for each thread, and then open the input ROOT file in each thread with a new, local instance of the TTree.
I find StephanH’s solution better, as the locking could hypothetically have significant speed costs. Attached is a tar file, testTree.tar (17.5 KB), with my code that (1) writes a ROOT file on which to test things (writeTestTree.cc), (2) reads the ROOT file using locks (testTree.cc), and (3) reads the ROOT file using multiple instances of the TTree (testTree2.cc).

Paul

pcanal · February 27, 2020, 9:55pm

[That is probably what you already did] The lock needs to be around the SetBranchAddress and the GetEntry at the same time.
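
A minimal sketch of that locking pattern follows. It runs without ROOT: ToyTree is a hypothetical stand-in that only mimics the "one destination address per branch" behaviour of a TTree, and processEvent is a placeholder. It also uses a named "#pragma omp critical" section rather than the omp_set_lock / omp_unset_lock pair from the earlier post, purely so that the same source compiles with or without OpenMP enabled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a TTree with a single branch: like a real tree,
// it remembers only ONE destination address, which is why setting the
// address and reading the entry must not be interleaved between threads.
struct ToyTree {
    std::vector<double> values;
    double *address = nullptr;
    void SetBranchAddress(double *p) { address = p; }
    void GetEntry(long long i) { *address = values[static_cast<std::size_t>(i)]; }
};

// Placeholder for the expensive per-event routine (assumption).
double processEvent(double x) { return 2.0 * x; }

double sumAll(ToyTree &tree) {
    const long long nEvts = static_cast<long long>(tree.values.size());
    double result = 0.0;
#pragma omp parallel for reduction(+ : result)
    for (long long iEvt = 0; iEvt < nEvts; ++iEvt) {
        double x = 0.0;  // buffer private to this iteration (and thread)
#pragma omp critical(getEventLock)
        {
            // Both calls sit in the same critical section, so no other
            // thread can redirect the address before the entry is read.
            tree.SetBranchAddress(&x);
            tree.GetEntry(iEvt);
        }
        result += processEvent(x);  // the expensive part runs unlocked
    }
    return result;
}
```

With this layout only the read is serialized, so the approach pays off only when processEvent dominates the time spent inside the critical section.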

Note that the ‘right’ iteration length is the cluster size, which is a variable number of entries. See TTree::GetClusterIterator.
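
One way to act on this is to cut the per-thread ranges only at cluster boundaries, so that no two threads read inside the same cluster. The sketch below assumes the first entry of each cluster has already been collected into a vector (in ROOT that list would presumably be filled by looping over the iterator returned by TTree::GetClusterIterator); the helper name clusterAlignedRange is hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: clusterStarts holds the first entry of every cluster,
// in ascending order and beginning with 0. Returns the half-open entry range
// [first, second) for `chunk`, always cut at a cluster boundary.
std::pair<long long, long long> clusterAlignedRange(
    const std::vector<long long> &clusterStarts, long long nEntries,
    int nChunks, int chunk) {
    assert(!clusterStarts.empty() && clusterStarts.front() == 0);
    assert(nChunks > 0 && chunk >= 0 && chunk < nChunks);
    const long long nClusters = static_cast<long long>(clusterStarts.size());
    // Split the cluster indices evenly between the chunks; an index equal
    // to nClusters stands for "end of the tree".
    const long long firstCluster = nClusters * chunk / nChunks;
    const long long lastCluster = nClusters * (chunk + 1) / nChunks;
    const long long begin = firstCluster < nClusters
        ? clusterStarts[static_cast<std::size_t>(firstCluster)] : nEntries;
    const long long end = lastCluster < nClusters
        ? clusterStarts[static_cast<std::size_t>(lastCluster)] : nEntries;
    return {begin, end};
}
```

This balances the number of clusters per thread rather than the number of entries, and a thread gets an empty range when there are more threads than clusters.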

Wile_E_Coyote · February 28, 2020, 7:21am

It seems to me that this is a good candidate for a PROOF-Lite usage (i.e. no OpenMP at all).

See: ${ROOTSYS}/tutorials/proof
or maybe also: ${ROOTSYS}/tutorials/multicore
