ROOT 6.26/00 issue with multi-threaded RDataFrame and RVec

Dear ROOT experts,

I recently switched from ROOT 6.24/06 to 6.26/00 and ran into a problem I had never seen before. It happens when combining RDataFrame, RVec, and EnableImplicitMT().

The code basically looks like this:

  ROOT::EnableImplicitMT();
  ROOT::RDataFrame df00{"t", "./rc_*MC.root"};
  auto df0 = df00
    .Define("flag0",            "(truthCaloPt > 50.) && (truthCaloPt < 1000.)")
    .Define("flag1",            "flag0 && (truthCaloPt > 80.)")
    .Define("flag2",            "flag0 && (truthCaloPt > 90.)")
    .Define("flag3",            "flag0 && (truthCaloPt > 100.)")
    .Define("pJetWeight",       "flag0 * flag1 * flag2 * flag3");

  df0.Snapshot("t", "plots_rc.root", {"pJetWeight"});
  // Or Histo1D(), similar error

This code is only problematic in ROOT 6.26/00, when running with multiple threads and RVec columns. It may or may not crash, and the error message is not always the same. I couldn’t reproduce the errors after skimming the tree down to only one branch.

One example error message:

Processing plot_rc.cpp...
RDataFrame::Run: event loop was interrupted
terminate called after throwing an instance of 'std::runtime_error'
  what():  Cannot call operator && on vectors of different sizes.

Another one, longer and rarer, is attached at the end.
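
For reference, here is a minimal sketch of why the first message looks more like a thread-safety symptom than a mistake in the analysis code. It uses plain std::vector and hypothetical helpers (elementwiseAnd, greaterThan, andThrows, demo) as stand-ins; it is not ROOT's RVec implementation and only mimics the size check behind the error text. The point is that flags derived from the same event's truthCaloPt should always have the same length, so the check should only fire if one operand changed size in between.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an element-wise logical AND of two per-event vectors.
// This is not ROOT's code; it only mimics the size check behind the error message.
std::vector<int> elementwiseAnd(const std::vector<int>& a, const std::vector<int>& b) {
  if (a.size() != b.size())
    throw std::runtime_error("Cannot call operator && on vectors of different sizes.");
  std::vector<int> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = (a[i] != 0 && b[i] != 0) ? 1 : 0;
  return out;
}

// Builds the per-element flag "pt > lo" from one event's pt values.
std::vector<int> greaterThan(const std::vector<double>& pt, double lo) {
  std::vector<int> out(pt.size());
  for (std::size_t i = 0; i < pt.size(); ++i)
    out[i] = pt[i] > lo ? 1 : 0;
  return out;
}

// Returns true if combining the two flag vectors throws the size-mismatch error.
bool andThrows(const std::vector<int>& a, const std::vector<int>& b) {
  try {
    elementwiseAnd(a, b);
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}

// Small self-check: flags built from the same event always have equal sizes.
void demo() {
  const std::vector<double> pt{40., 85., 120.};
  const auto flag0 = greaterThan(pt, 50.);
  const auto flag1 = greaterThan(pt, 80.);
  assert(!andThrows(flag0, flag1));  // same event, same length: no error
  assert(andThrows(flag0, {1, 0}));  // lengths 3 vs 2: the error is thrown
}
```

Calling demo() from a main() exercises both cases. In the real job both operands come from the same branch, so a mismatch suggests the underlying buffer was modified while the expression was being evaluated.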

Am I misusing some ROOT feature, or could this be a ROOT bug? Thread safety might be broken. I’m not sure whether this is related to RDataFrame or to RVec (the new buffer).

Thank you so much!


Second error:

Processing plot_rc.cpp...
Fatal in <TEmulatedCollectionProxy>: Resize> Logic error - no proxy object set.
aborting

Thread 4 (Thread 0x7f9db95ee640 (LWP 900893) "root.exe"):
#0  0x00007f9dce94e4cf in wait4 () from /usr/lib/libc.so.6
#1  0x00007f9dce8c009b in do_system () from /usr/lib/libc.so.6
#2  0x00007f9dcefb8e81 in TUnixSystem::StackTrace() () from /usr/lib/root/libCore.so
#3  0x00007f9dcee756db in DefaultErrorHandler(int, bool, char const*, char const*) () from /usr/lib/root/libCore.so
#4  0x00007f9dcef3aa63 in ErrorHandler () from /usr/lib/root/libCore.so
#5  0x00007f9dcef3b8fa in Fatal(char const*, char const*, ...) () from /usr/lib/root/libCore.so
#6  0x00007f9dce2fd599 in TEmulatedCollectionProxy::Destructor(void*, bool) const () from /usr/lib/root/libRIO.so
#7  0x00007f9dbd4865b3 in TBranchElement::ReleaseObject() () from /usr/lib/root/libTree.so
#8  0x00007f9dbd4897ff in TBranchElement::ResetAddress() () from /usr/lib/root/libTree.so
#9  0x00007f9dbd4898a6 in TBranchElement::~TBranchElement() () from /usr/lib/root/libTree.so
#10 0x00007f9dbd489e7e in TBranchElement::~TBranchElement() () from /usr/lib/root/libTree.so
#11 0x00007f9dcef11285 in TObjArray::Delete(char const*) () from /usr/lib/root/libCore.so
#12 0x00007f9dbd4fbefe in TTree::~TTree() () from /usr/lib/root/libTree.so
#13 0x00007f9dbd4fc52e in TTree::~TTree() () from /usr/lib/root/libTree.so
#14 0x00007f9dcef0ae11 in TList::Delete(char const*) () from /usr/lib/root/libCore.so
#15 0x00007f9dcef00a5a in THashList::Delete(char const*) () from /usr/lib/root/libCore.so
#16 0x00007f9dce303726 in TDirectoryFile::Close(char const*) () from /usr/lib/root/libRIO.so
#17 0x00007f9dce3221c7 in TFile::Close(char const*) () from /usr/lib/root/libRIO.so
#18 0x00007f9dce322717 in TFile::~TFile() () from /usr/lib/root/libRIO.so
#19 0x00007f9dce322a0e in TFile::~TFile() () from /usr/lib/root/libRIO.so
#20 0x00007f9dbd4ac277 in TChain::~TChain() () from /usr/lib/root/libTree.so
#21 0x00007f9dbd4ac49e in TChain::~TChain() () from /usr/lib/root/libTree.so
#22 0x00007f9dbd8c00af in ROOT::Internal::TTreeView::MakeChain(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ROOT::Internal::TreeUtils::RFriendInfo const&, std::vector<long long, std::allocator<long long> > const&, std::vector<std::vector<long long, std::allocator<long long> >, std::allocator<std::vector<long long, std::allocator<long long> > > > const&) () from /usr/lib/root/libTreePlayer.so
#23 0x00007f9dbd8c0f16 in ROOT::Internal::TTreeView::GetTreeReader(long long, long long, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ROOT::Internal::TreeUtils::RFriendInfo const&, TEntryList const&, std::vector<long long, std::allocator<long long> > const&, std::vector<std::vector<long long, std::allocator<long long> >, std::allocator<std::vector<long long, std::allocator<long long> > > > const&) () from /usr/lib/root/libTreePlayer.so
#24 0x00007f9dbd8c3275 in ?? () from /usr/lib/root/libTreePlayer.so
#25 0x00007f9dcf1a45d4 in tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::d1::parallel_for_body_wrapper<std::function<void (unsigned int)>, unsigned int>, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) () from /usr/lib/root/libImt.so
#26 0x00007f9dce13f5b3 in ?? () from /usr/lib/libtbb.so.12
#27 0x00007f9dcf1a21b7 in ?? () from /usr/lib/root/libImt.so
#28 0x00007f9dce12d06f in tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, long) () from /usr/lib/libtbb.so.12
#29 0x00007f9dcf1a1e96 in ?? () from /usr/lib/root/libImt.so
#30 0x00007f9dce12c7e2 in ?? () from /usr/lib/libtbb.so.12
#31 0x00007f9dcf1a3b02 in ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&) () from /usr/lib/root/libImt.so
#32 0x00007f9dbd8c2966 in ?? () from /usr/lib/root/libTreePlayer.so
#33 0x00007f9dcf1a45d4 in tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::d1::parallel_for_body_wrapper<std::function<void (unsigned int)>, unsigned int>, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) () from /usr/lib/root/libImt.so
#34 0x00007f9dce137abe in ?? () from /usr/lib/libtbb.so.12
#35 0x00007f9dce139d44 in ?? () from /usr/lib/libtbb.so.12
#36 0x00007f9dce8fe5c2 in start_thread () from /usr/lib/libc.so.6
#37 0x00007f9dce983584 in clone () from /usr/lib/libc.so.6
  
Thread 3 (Thread 0x7f9db99ef640 (LWP 900892) "root.exe"):
#0  0x00007f9db9e0d993 in ?? ()
#1  0x0000564e6fc9c268 in ?? ()
#2  0x0000000000000002 in ?? ()
#3  0x000000000211988e in ?? ()
#4  0x00007f9db9e0c93d in ?? ()
#5  0x00007f9dac019770 in ?? ()
#6  0x00007f9db9e0da30 in ?? ()
#7  0x00007f9db99e9af8 in ?? ()
#8  0x0000564e72006980 in ?? ()
#9  0x00007f9db9e224c0 in ?? ()
#10 0x00007f9db9e2cc80 in ?? ()
#11 0x00007f9db9e323f0 in ?? ()
#12 0x0000564e725f0b78 in ?? ()
#13 0x0000564e725f0a10 in ?? ()
#14 0x00007f9db99e9b08 in ?? ()
#15 0x0000000c00000000 in ?? ()
#16 0x0000000000000000 in ?? ()
  
Thread 2 (Thread 0x7f9db9df0640 (LWP 900891) "root.exe"):
#0  0x00007f9dbcf0bb30 in ROOT::VecOps::RVec<decltype (({parm#1}[0])*({parm#2}[0]))> ROOT::VecOps::operator*<int, int>(ROOT::VecOps::RVec<int> const&, ROOT::VecOps::RVec<int> const&) () from /usr/lib/root/libROOTVecOps.so
#1  0x00007f9db9e10bc3 in ?? ()
#2  0x00007f9db9e224c0 in ?? ()
#3  0x00007f9db9e2cc80 in ?? ()
#4  0x00007f9dbcf0bb30 in ?? () from /usr/lib/root/libROOTVecOps.so
#5  0x00007f9db9deac28 in ?? ()
#6  0x0000564e6fa22560 in ?? ()
#7  0x00007f9db9deab08 in ?? ()
#8  0x0000000c00000000 in ?? ()
#9  0x0000000100000001 in ?? ()
#10 0x0000564e722db5d0 in ?? ()
#11 0x0000564e71c34ce0 in ?? ()
#12 0x00007f9db9deab68 in ?? ()
#13 0x0000000000000000 in ?? ()
  
Thread 1 (Thread 0x7f9dce5e2a40 (LWP 900868) "root.exe"):
#0  0x00007f9db9e1bd4a in ?? ()
#1  0x0000000000000000 in ?? ()


ROOT Version: 6.26/00
Platform: arch linux
Compiler: GCC 11.2.0



Hi @Kevin1 ,
thank you for the report. There were significant changes to RVec in v6.26, and this might indeed be a bug that escaped our test infrastructure. The code seems fine.

Could you please share a reproducer with me, even privately? I.e. data + code that I can use to debug this.

The second error might or might not be related; it might have to do with RVec I/O.

Cheers,
Enrico

Hi @eguiraud, thank you! I’m trying to make a reproducer.

Those ROOT files don’t belong to me, so I need to make a modified copy. It seems that if I make too many changes I can’t reproduce the issue any more, so this may take me a little while to figure out.

Thanks again.

Hi @eguiraud, I just shared the files (along with the code) with you privately on cernbox.

The files have only one branch truthCaloPt and the code is basically what I posted, with the file name pattern modified.

(One strange thing I noticed while generating the reproducer: if I merge the files, or use 6.26/00 to skim the original file, I can’t reproduce the errors. The issue involves threads, so this observation may not be fully deterministic, but this is what I saw.)

Thank you!

Thank you @Kevin1 ,
I will take a look as soon as possible.

Cheers,
Enrico

Update: I can reproduce the problem, debugging… :)

Hi @eguiraud! Have you figured out what’s going on behind this bug? Thank you!

Hi @Kevin1 ,
unfortunately not yet. This turns out to be a complicated problem deep in the belly of ROOT I/O, and disentangling it requires a stretch of continuous focus that I haven’t had in the last few days. It’s at the top of my to-do list though, and I appreciate the keepalive ping! :)

Cheers,
Enrico

Thank you @eguiraud! Yes just wanted to keep the thread alive.

So far I have at least been able to exclude Snapshot as the culprit, and I have this minimal reproducer, which reads half of the original files and uses only 2 threads:

#include "ROOT/RDataFrame.hxx"
#include "ROOT/RVec.hxx"

int main() {
  ROOT::EnableImplicitMT(2);
  ROOT::RDataFrame df("t", "./test*_PbPb.root");
  df.Sum<ROOT::RVecD>("truthCaloPt").GetValue();
}

I see what the problem is, but I still have to figure out how it can happen :)

Thank you @eguiraud. This reproducer is impressively simple.

This is now tracked as GitHub issue #10357 in root-project/root: "[I/O] Race condition when reading vectors with custom allocators with TTreeProcessorMT".

Thank you! The solution seems reasonable and clear.

Just to confirm: is it expected that this error doesn’t happen in v6.24? (Was this collection proxy changed recently, in v6.26.00?)

The problem should only be present if you Snapshot RVecs with RDataFrame in ROOT <v6.26 and then read them back with ROOT v6.26.00 and multi-threading activated. Any other combination (e.g. write/read with the same ROOT version) should already work fine, and the upcoming v6.26.02 will fix this problem too.
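
As a rough illustration, the condition described above can be written down as a small predicate. This is only a sketch: affectedByRVecReadRace and demo are hypothetical helpers, not part of ROOT, and the integer version codes (e.g. 62406 for 6.24/06, 62600 for 6.26/00) are an assumption that mirrors the convention I believe TROOT::GetVersionInt() uses.

```cpp
#include <cassert>

// ROOT-style integer version code for 6.26/00.
// Assumption: major*10000 + minor*100 + patch, as in TROOT::GetVersionInt().
constexpr int kV62600 = 62600;

// Hypothetical helper, not part of ROOT. Encodes the condition stated above:
// RVecs written by Snapshot with ROOT < 6.26, read back with exactly 6.26/00,
// with implicit multi-threading enabled.
bool affectedByRVecReadRace(bool fileHasSnapshottedRVecs, int writerVersion,
                            int readerVersion, bool implicitMT) {
  const bool writtenBefore626 = writerVersion < kV62600;
  const bool readWith62600 = readerVersion == kV62600;
  return fileHasSnapshottedRVecs && writtenBefore626 && readWith62600 && implicitMT;
}

// Small self-check covering the combinations discussed in this thread.
void demo() {
  assert(affectedByRVecReadRace(true, 62406, 62600, true));    // old file, 6.26/00, MT on
  assert(!affectedByRVecReadRace(true, 62406, 62600, false));  // single-threaded read
  assert(!affectedByRVecReadRace(true, 62600, 62600, true));   // same version write/read
  assert(!affectedByRVecReadRace(true, 62406, 62602, true));   // reader with the fix
}
```

Under these assumptions, reading the affected files without EnableImplicitMT(), or re-writing them with the same version used for reading, would sidestep the problem until v6.26.02 is available.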

Thank you so much!
