Speed up branch->GetEntry() with implicit MT?

Dear all,

I have an event loop application that reads a TTree with multiple branches and fills histograms. I compile the code with gcc8 and I’m using ROOT 6.18.04. Timing the code with std::chrono, I determined that most of the time (>90%) is spent calling branch->GetEntry(entry).
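
The measurement is roughly of this shape (a sketch with illustrative names, not my exact code):

#include <chrono>

// accumulated wall time spent inside the branch reads, in microseconds
long long readTimeUs = 0;

for (long long i = first; i <= last; ++i) {
    const long long localEntry = m_tree->LoadTree(i);
    const auto t0 = std::chrono::steady_clock::now();
    for (auto *br : m_branches)
        br->GetEntry(localEntry);
    const auto t1 = std::chrono::steady_clock::now();
    readTimeUs += std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}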

Is it possible to speed this up with implicit multithreading? I followed the example here: https://root.cern/doc/master/imt001__parBranchProcessing_8C.html, but I see no change in performance when I add ROOT::EnableImplicitMT(nThread); to my code. I also tried options like TTree::SetParallelUnzip(), which also had no effect on performance.

My general configuration for the TTree object looks like this:

m_tree->SetCacheSize(-1);                // enable the TTreeCache with the default size
for (auto *br : m_branches) {
    m_tree->AddBranchToCache(br, true);  // cache only the branches I actually read
}
m_tree->SetCacheLearnEntries(1);
m_tree->SetCacheEntryRange(first, last);
m_tree->StopCacheLearningPhase();        // freeze the cache to the branches added above

Then during the event loop:

int64_t localEntry = m_tree->LoadTree(m_current_event);
for (auto *br : m_branches) {
    br->GetEntry(localEntry);
}

Does something like this make sense if I want to reach the maximum possible reading speed? Is it possible to speed it up further with MT features?

Best,
Miha


ROOT Version: 6.18.04
Platform: SUSE Linux Enterprise Server 15
Compiler: gcc 8.2.0



You can try:

m_tree->SetCacheSize(-1);
m_tree->SetBranchStatus("*", kFALSE);
for (auto *br : m_branches) {
    m_tree->AddBranchToCache(br, true);
    m_tree->SetBranchStatus(br->GetName(), kTRUE);
}
....
int64_t localEntry = m_tree->LoadTree(m_current_event);
m_tree->GetEntry(m_current_event);

and turn on implicit MT.
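
e.g. (a sketch):

ROOT::EnableImplicitMT();   // no argument: use all available cores; pass a thread count to limit it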

Cheers,
Philippe.

Dear Philippe,

thanks for looking into this! I tried explicitly activating the branches as you suggested, but it had no effect; I think that is because I had already activated only the branches that I need.

I still cannot seem to get implicit MT working. Does it matter that I compile my code, i.e. it is not a ROOT macro? At which point in the code should ROOT::EnableImplicitMT() be called? Is there a way to test whether it is working or not when I read the branches?

Best,
Miha

No, definitely not: it does not matter that your code is compiled rather than run as a macro.

Note that TBranch::GetEntry does not take advantage of implicit multi-threading, while TTree::GetEntry does.
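
Schematically (names are placeholders):

// one branch at a time: always sequential, with or without IMT
for (auto *br : branches)
    br->GetEntry(localEntry);

// all active branches at once: with IMT enabled, ROOT can decompress and
// deserialize the branches in parallel via a TBB thread pool
tree->GetEntry(entry);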

Dear Enrico,

Ah, sorry, I did not realize that in Philippe’s script we replace TBranch::GetEntry with TTree::GetEntry.

I replaced it with TTree::GetEntry, but I still cannot figure out whether MT is working or not. For example, the walltime stays the same with ROOT::EnableImplicitMT(1) and ROOT::EnableImplicitMT(12). Sorry for being so slow with this…

Also, because of my specific setup, the overall time is slower with TTree::GetEntry. Before, I would read the tree like this:

for (auto *br : few_branches)
    br->GetEntry(localEntry);

// perform basic event selection

for (auto *br : all_other_branches)
    br->GetEntry(localEntry);

// calculate all quantities and fill histograms

So I would only read all branches for events that pass my analysis selection, while with TTree::GetEntry I always read all branches, which ends up being slower.

Hi all,

now, playing a bit with the positioning of ROOT::EnableImplicitMT(1); in the code, I get a bus error if ROOT::EnableImplicitMT(1); is called before I create my ‘event loop class’ that initializes the TTree setup:

*** Break *** bus error

===========================================================
There was a crash.
This is the entire stack trace of all threads:

#0 0x00002aaaaf797e0a in waitpid () from /lib64/libc.so.6
#1 0x00002aaaaf7154af in do_system () from /lib64/libc.so.6
#2 0x00002aaaab295093 in TUnixSystem::Exec (shellcmd=, this=0x63ab80) at /global/common/software/atlas/root-6.18.04/core/unix/src/TUnixSystem.cxx:2106
#3 TUnixSystem::StackTrace (this=0x63ab80) at /global/common/software/atlas/root-6.18.04/core/unix/src/TUnixSystem.cxx:2400
#4 0x00002aaaab2978d4 in TUnixSystem::DispatchSignals (this=0x63ab80, sig=kSigBus) at /global/common/software/atlas/root-6.18.04/core/unix/src/TUnixSystem.cxx:3631
#5 <signal handler called>
#6 std::vector<float, std::allocator<float> >::resize (__new_size=<optimized out>, this=<optimized out>) at /global/common/software/atlas/build_root-6.18.04/include/Bytes.h:323
#7 TStreamerInfoActions::VectorLooper::ReadCollectionBasicType (buf=…, addr=<optimized out>, conf=0x1771fd0) at /global/common/software/atlas/root-6.18.04/io/io/src/TStreamerInfoActions.cxx:1881
#8 0x00002aaaab8bdf85 in TStreamerInfoActions::TConfiguredAction::operator() (this=0x1771f10, this=0x1771f10, object=0x732e636f72506974, buffer=…) at /global/common/software/atlas/build_root-6.18.04/include/TStreamerInfoActions.h:124
#9 TBufferFile::ApplySequence (this=0x48fe570, sequence=…, obj=0x732e636f72506974) at /global/common/software/atlas/root-6.18.04/io/io/src/TBufferFile.cxx:3564
#10 0x00002aaaad2bd31d in TBranchElement::ReadLeavesMember (this=0x2735050, b=…) at /global/common/software/atlas/root-6.18.04/tree/tree/src/TBranchElement.cxx:4421
#11 0x00002aaaad2b5417 in TBranch::GetEntry (this=this@entry=0x2735050, entry=entry@entry=10, getall=getall@entry=0) at /global/common/software/atlas/root-6.18.04/tree/tree/src/TBranch.cxx:1626
#12 0x00002aaaad2c5529 in TBranchElement::GetEntry (this=0x2735050, entry=10, getall=0) at /global/common/software/atlas/root-6.18.04/tree/tree/src/TBranchElement.cxx:2652
#13 0x00002aaaad31e639 in TTree::<lambda()>::operator()(void) const (__closure=0x7fffffff6450) at /global/common/software/atlas/root-6.18.04/tree/tree/src/TTree.cxx:5486
#14 0x00002aaaab5ec6df in std::function<void (unsigned int)>::operator()(unsigned int) const (__args#0=, this=) at /opt/gcc/8.2.0/snos/include/g++/bits/std_function.h:260
#15 tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>::operator()(tbb::blocked_range const&) const (r=…, this=0x2aaabe70fd58) at /global/common/software/atlas/build_root-6.18.04/include/tbb/parallel_for.h:178
#16 tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>::run_body(tbb::blocked_range&) (r=…, this=0x2aaabe70fd40) at /global/common/software/atlas/build_root-6.18.04/include/tbb/parallel_for.h:116
#17 tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type> >::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>, tbb::blocked_range >(tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>&, tbb::blocked_range&) (range=…, start=…, this=0x2aaabe70fd68) at /global/common/software/atlas/build_root-6.18.04/include/tbb/partitioner.h:454
#18 tbb::interface9::internal::partition_type_base<tbb::interface9::internal::auto_partition_type>::execute<tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>, tbb::blocked_range >(tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>&, tbb::blocked_range&) (range=…, start=…, this=0x2aaabe70fd68) at /global/common/software/atlas/build_root-6.18.04/include/tbb/partitioner.h:257
#19 tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>::execute() (this=0x2aaabe70fd40) at /global/common/software/atlas/build_root-6.18.04/include/tbb/parallel_for.h:143
#20 0x00002aaab03955c5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2aaabe70b200, parent=…, child=) at …/…/include/tbb/machine/gcc_ia32_common.h:100
#21 0x00002aaab0392ac8 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2aaabe70b200, first=0x2aaabe70fd40, next=@0x2aaabe70fd38: 0x0) at …/…/src/tbb/scheduler_utility.h:45
#22 0x00002aaaab5eab78 in tbb::task::spawn_root_and_wait (root=…) at /global/common/software/atlas/build_root-6.18.04/include/tbb/task.h:921
#23 tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>::run(tbb::blocked_range const&, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int> const&, tbb::auto_partitioner const&) (partitioner=…, body=…, range=…) at /global/common/software/atlas/build_root-6.18.04/include/tbb/parallel_for.h:96
#24 tbb::parallel_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int> >(tbb::blocked_range const&, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int> const&, tbb::auto_partitioner const&) (partitioner=…, body=…, range=…) at /global/common/software/atlas/build_root-6.18.04/include/tbb/parallel_for.h:216
#25 tbb::strict_ppl::parallel_for_impl<unsigned int, std::function<void (unsigned int)>, tbb::auto_partitioner const>(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&, tbb::auto_partitioner const&) (first=0, last=, step=1, f=…, partitioner=…) at /global/common/software/atlas/build_root-6.18.04/include/tbb/parallel_for.h:284
#26 0x00002aaaab5eabde in tbb::strict_ppl::parallel_for_impl<unsigned int, std::function<void (unsigned int)>, tbb::auto_partitioner const>(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&, tbb::auto_partitioner const&) (partitioner=…, f=…, step=, last=, first=) at /global/common/software/atlas/root-6.18.04/core/imt/src/TThreadExecutor.cxx:150
#27 tbb::strict_ppl::parallel_for<unsigned int, std::function<void (unsigned int)> >(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&) (f=…, step=, last=, first=) at /global/common/software/atlas/build_root-6.18.04/include/tbb/parallel_for.h:291
#28 ROOT::TThreadExecutor::<lambda()>::operator() (__closure=) at /global/common/software/atlas/root-6.18.04/core/imt/src/TThreadExecutor.cxx:150
#29 tbb::interface7::internal::delegated_function<const ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, const std::function<void(unsigned int)>&)::<lambda()>, void>::operator()(void) const (this=) at /global/common/software/atlas/build_root-6.18.04/include/tbb/task_arena.h:94
#30 0x00002aaab038f443 in tbb::interface7::internal::isolate_within_arena (d=…, reserved=reserved@entry=0) at …/…/src/tbb/arena.cpp:994
#31 0x00002aaaab5ebd45 in tbb::interface7::internal::isolate_impl<void, const ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, const std::function<void(unsigned int)>&)::<lambda()> > (f=…) at /global/common/software/atlas/root-6.18.04/core/imt/src/TThreadExecutor.cxx:149
#32 tbb::interface7::this_task_arena::isolate<ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, const std::function<void(unsigned int)>&)::<lambda()> > (f=…) at /global/common/software/atlas/build_root-6.18.04/include/tbb/task_arena.h:381
#33 ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&) (this=this@entry=0x7fffffff6420, start=<optimized out>, start@entry=0, end=<optimized out>, step=<optimized out>, step@entry=1, f=…) at /global/common/software/atlas/root-6.18.04/core/imt/src/TThreadExecutor.cxx:149
#34 0x00002aaaad322bde in ROOT::TThreadExecutor::Foreach<TTree::GetEntry(Long64_t, Int_t)::<lambda()> > (nChunks=0, nTimes=, func=…, this=0x7fffffff6420) at /opt/gcc/8.2.0/snos/include/g++/new:169
#35 TTree::GetEntry (this=0x13cca30, entry=, getall=) at /global/common/software/atlas/root-6.18.04/tree/tree/src/TTree.cxx:5497
#36 0x00002aaaaad14273 in Charm::EventLoopBase::loop(long long, long long) () from /global/homes/m/mmuskinj/work/build/charmpp/src/libCharmpp.so
#37 0x00002aaaaad12d6c in Charm::EventLoopBase::run(long long, long long) () from /global/homes/m/mmuskinj/work/build/charmpp/src/libCharmpp.so
#38 0x00000000004017e2 in main ()

To get close to this, you would need to call SetBranchStatus inside the loop. Something like:

m_tree->SetBranchStatus("*", kFALSE);
for (auto *br : few_branches)
    m_tree->SetBranchStatus(br->GetName(), kTRUE);
m_tree->GetEntry(...);

// perform basic event selection

m_tree->SetBranchStatus("*", kFALSE);
for (auto *br : all_other_branches)
    m_tree->SetBranchStatus(br->GetName(), kTRUE);
m_tree->GetEntry(...);

Right … sorry for not mentioning this … a TTree picks up the ‘IMT global state’ at the time it is created … so indeed, if you called ROOT::EnableImplicitMT only after the TTree was created, you did not enable IMT for that TTree.
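
In other words, the order matters (a sketch; the thread count, file and tree names are placeholders):

#include "TROOT.h"
#include "TFile.h"
#include "TTree.h"
#include <iostream>

ROOT::EnableImplicitMT(4);          // must run before the TTree is created

TFile *file = TFile::Open("input.root");
TTree *tree = nullptr;
file->GetObject("mytree", tree);    // this tree now sees IMT as enabled

// quick way to check that IMT is really active:
if (ROOT::IsImplicitMTEnabled())
    std::cout << "IMT pool size: " << ROOT::GetImplicitMTPoolSize() << std::endl;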

Not what we hope for :(.

Can you share the file so that we can try to reproduce it?
Can you run valgrind on the failing test?

Dear Philippe,

I will think about the best way to share my code, or try to make a small snippet that reproduces the crash. I also tried running it on another (lxplus-like) platform, which gave the same bus error. For now, here is the valgrind output:

valgrind.txt (326.7 KB)

Dear all,

looking more carefully at the valgrind output, I realized that some member objects (e.g. std::vector<…>) of the event loop class were not initialized before branches were read into them. Apparently this did not cause a crash in the sequential version, but it did in MT.
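
Concretely, the branch addresses need to point at initialized pointers before the first read, something like (a sketch; the branch name is hypothetical):

// member in the event loop class: must be initialized (e.g. to nullptr),
// not left uninitialized; a garbage pointer here is what crashed in MT
std::vector<float> *m_jet_pt = nullptr;

// during setup; ROOT allocates the vector on the first read if the pointer is null
m_tree->SetBranchAddress("jet_pt", &m_jet_pt);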

I confirm that the crash is gone, and I see “100%” CPU usage in top with ROOT::EnableImplicitMT(1) and >100% CPU usage with ROOT::EnableImplicitMT(>1). However, the total walltime remains roughly the same in both cases. I will investigate further…

Looks like the time is not spent in the part that is being parallelized.

Hi, thanks again for your help, it was very useful. Just one more question regarding ‘TTree::SetParallelUnzip’: I cannot see an effect from enabling it. In what context is it supposed to be used?

TTree::SetParallelUnzip would help, compared to a single thread, when the time for decompression is large compared to the time for deserialization. Compared to IMT (which parallelizes both decompression and deserialization) it would fare worse (unless the use of IMT requires or leads to more branches being read).
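
For completeness, enabling it is just (a sketch):

// spawns helper threads that pre-decompress the baskets held in the TTreeCache;
// deserialization still happens in the reading thread
m_tree->SetParallelUnzip(kTRUE);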

Dear Philippe,

what I’m after is indeed running multiple processes on the same node (e.g. one input file per process). This is what my profile looks like in VTune. To me it seems like there is not much room for improvement, since I am already reading only the branches that I need. And it looks like these jobs are completely I/O limited?

I’m just not sure whether decompression happens in parallel or not… I was running this example with SetParallelUnzip(true);.

It is not clear from this picture whether anything is happening in parallel. But indeed it seems that you are spending 37s in ‘raw I/O’ and 5s in decompression, and the rest is negligible … The most you could gain is to overlap some of the 5s of decompression …
