Segmentation fault when doing Minimisation with RDataFrame

clementhelsens · November 13, 2020, 9:06am

Dear Experts,

following this minimizer example:

I implemented a version of it in my RDataFrame code. It runs, but eventually crashes with a segmentation fault, never at the same event.

Below is the stack trace from running with valgrind --tool=massif. I also attach the full log.

Is there anything obvious you could spot? Is running a minimisation in RDF something natural?
Thanks,
Clement

ROOT/compiler version /cvmfs/sw.hsf.org/spackages/linux-centos7-broadwell/gcc-8.3.0/root-6.20.04-oml2a44t4nifq7zor3jxyi3zzttkumil/bin/root
Running on CentOS7 at CERN

Minimum: f(-0.0552392,-0.10813,-0.0204248): -0.971498
Minimum: f(-0.0626526,0.0114545,0.12024): -0.970525
==32678==
==32678== Process terminating with default action of signal 11 (SIGSEGV)
==32678== Access not within mapped region at address 0x6661656C2D86
==32678== at 0xF3CC21A: TDirectory::RegisterContext(TDirectory::TContext*) (TDirectory.cxx:1296)
==32678== by 0xEB37BAB: TContext (TDirectory.h:78)
==32678== by 0xEB37BAB: TDirectoryFile::Save() (TDirectoryFile.cxx:1490)
==32678== by 0xEB3787B: Close (TDirectoryFile.cxx:551)
==32678== by 0xEB3787B: TDirectoryFile::Close(char const*) (TDirectoryFile.cxx:544)
==32678== by 0xEB53185: TFile::Close(char const*) (TFile.cxx:895)
==32678== by 0xF3A20B8: (anonymous namespace)::R__ListSlowClose(TList*) (TROOT.cxx:1121)
==32678== by 0xF3A2B53: TROOT::CloseFiles() (TROOT.cxx:1169)
==32678== by 0xF3A3221: TROOT::EndOfProcessCleanups() (TROOT.cxx:1248)
==32678== by 0xF4D765E: TUnixSystem::Exit(int, bool) (TUnixSystem.cxx:2141)
==32678== by 0xF4DC5ED: TUnixSystem::DispatchSignals(ESignals) (TUnixSystem.cxx:3631)
==32678== by 0x57B862F: ??? (in /usr/lib64/libpthread-2.17.so)
==32678== by 0x24EF704D: Read (TBranchProxy.h:138)
==32678== by 0x24EF704D: GetCP (TTreeReaderArray.cxx:111)
==32678== by 0x24EF704D: (anonymous namespace)::TCollectionLessSTLReader::GetSize(ROOT::Detail::TBranchProxy*) (TTreeReaderArray.cxx:125)
==32678== by 0x2F9A6CE7: ???
==32678== by 0x2F99E0DF: ???
==32678== by 0x2F973B28: ???
==32678== by 0x2F97395B: ???
==32678== by 0x2F4FF3C7: ???
==32678== by 0x2F515395: ???
==32678== by 0x2F516A9C: ???
==32678== by 0x2F4FD004: ???
==32678== by 0x24BCFF09: ROOT::Detail::RDF::RLoopManager::RunAndCheckFilters(unsigned int, long long) (RLoopManager.cxx:384)
==32678== by 0x24BD0259: operator() (RLoopManager.cxx:296)
==32678== by 0x24BD0259: std::_Function_handler<void (TTreeReader&), ROOT::Detail::RDF::RLoopManager::RunTreeProcessorMT()::{lambda(TTreeReader&)#1}>::_M_invoke(std::_Any_data const&, TTreeReader&) (std_function.h:297)
==32678== by 0x24F19C2F: ROOT::TTreeProcessorMT::Process(std::function<void (TTreeReader&)>)::{lambda(unsigned long)#1}::operator()(unsigned long) const::{lambda(ROOT::Internal::EntryCluster const&)#1}::operator()(ROOT::Internal::EntryCluster const) const [clone .isra.441] (std_function.h:687)
==32678== by 0xE53A5B1: operator() (std_function.h:687)
==32678== by 0xE53A5B1: operator() (parallel_for.h:177)
==32678== by 0xE53A5B1: run_body (parallel_for.h:115)
==32678== by 0xE53A5B1: work_balance<tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void(unsigned int)>, unsigned int>, const tbb::auto_partitioner>, tbb::blocked_range > (partitioner.h:423)
==32678== by 0xE53A5B1: execute<tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void(unsigned int)>, unsigned int>, const tbb::auto_partitioner>, tbb::blocked_range > (partitioner.h:256)
==32678== by 0xE53A5B1: tbb::interface9::internal::start_for<tbb::blocked_range, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>::execute() (parallel_for.h:142)
==32678== by 0xFE3711C: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper&, tbb::task*, long) (custom_scheduler.h:474)
==32678== by 0xFE3740C: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) (custom_scheduler.h:636)
==32678== by 0xFE30F4E: tbb::internal::arena::process(tbb::internal::generic_scheduler&) (arena.cpp:196)
==32678== by 0xFE2F932: tbb::internal::market::process(rml::job&) (market.cpp:667)
==32678== by 0xFE2BD0B: tbb::internal::rml::private_worker::run() (private_server.cpp:266)
==32678== by 0xFE2BF18: tbb::internal::rml::private_worker::thread_routine(void*) (private_server.cpp:219)
==32678== by 0x57B0EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==32678== by 0x61CC8DC: clone (in /usr/lib64/libc-2.17.so)
==32678== If you believe this happened as a result of a stack
==32678== overflow in your program’s main thread (unlikely but
==32678== possible), you can try to increase the size of the
==32678== main thread stack using the --main-stacksize= flag.
==32678== The main thread stack size used in this run was 8388608.
==32678==

Thread 1 (process 32678):
#0 0x00000000580ca5a8 in ?? ()
#1 0x00000000580632ca in ?? ()
#2 0x000000005805f72b in ?? ()
#3 0x0000000058061c07 in ?? ()
#4 0x00000000580ca6bb in ?? ()
#5 0x0000000000000000 in ?? ()
Segmentation fault (core dumped)
log.txt (758.0 KB)

eguiraud · November 13, 2020, 9:37am

Hi @clementhelsens,
a segmentation fault most likely indicates a bug in user code that leads to accesses to invalid memory regions (e.g. dereferencing a null pointer, out-of-bounds access of an array). Another possibility, if you are running with EnableImplicitMT, is that some of the operations that you schedule with RDF are not thread-safe (e.g. a Define that performs a TH1 fit is not thread-safe).
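
To illustrate the thread-safety point, here is a minimal sketch in plain C++ (no ROOT required) of the usual remedy: give each worker its own minimizer instead of sharing one. `ToyMinimizer` and `MinimizePerSlot` are hypothetical names invented for this example, and the gradient-descent loop is only a stand-in for a real minimizer. The same idea maps onto RDataFrame's `DefineSlot`, whose callable receives the processing slot number as its first argument and can use it to index per-slot state.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a stateful minimizer. It keeps scratch state
// in a data member, so one instance must not be used by two threads at once.
class ToyMinimizer {
public:
    // Minimise f(x) = (x - target)^2 with fixed-step gradient descent.
    double Minimize(double target) {
        x_ = 0.0;  // mutable member state: this is what would race if shared
        for (int i = 0; i < 200; ++i) {
            x_ -= 0.1 * 2.0 * (x_ - target);  // step along -f'(x)
        }
        return x_;
    }

private:
    double x_ = 0.0;
};

// Run one minimisation per entry of `targets` on `nSlots` threads.
// Each thread only touches minimizers[slot] and its own result entries,
// so no minimizer state is shared between threads.
std::vector<double> MinimizePerSlot(const std::vector<double>& targets,
                                    unsigned int nSlots) {
    assert(nSlots > 0);
    std::vector<ToyMinimizer> minimizers(nSlots);  // one instance per slot
    std::vector<double> results(targets.size());
    std::vector<std::thread> workers;
    for (unsigned int slot = 0; slot < nSlots; ++slot) {
        workers.emplace_back([&, slot] {
            // This thread handles entries slot, slot + nSlots, ...
            for (std::size_t i = slot; i < targets.size(); i += nSlots) {
                results[i] = minimizers[slot].Minimize(targets[i]);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return results;
}
```

Calling `MinimizePerSlot({1.0, -2.5, 3.0}, 2)` returns values close to the inputs, since each toy function has its minimum at its target. If all threads instead called `Minimize` on one shared `ToyMinimizer`, the concurrent writes to `x_` would be a data race, which can show up as wrong results or crashes that move from run to run.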

massif records who allocates how much memory, but we need a different kind of log to figure out what’s going wrong. If you use a build of ROOT with debug symbols, it should print a stack trace right after the crash, which shows exactly where things went wrong. If no stack trace is printed, we’ll need a small reproducer that we can run to debug this on our side.

Cheers,
Enrico

clementhelsens · November 13, 2020, 9:42am

Hi @eguiraud, yes, it seems the problem is on my side. I thought I was testing this on one thread only, but that was not the case. It runs well with one thread, so it’s my implementation that is not thread-safe.
Sorry for the noise, and thanks for the quick reply.
Clement
