Segmentation Fault when running large data file

Did you add the missing #include statements?

This is what I am trying to figure out, as this does not happen when I try to run a smaller file, e.g. of 20 MB.

There are two ways to go further. One is to add print statements to figure out ‘where’ and ‘when’ the pointer turns into a nullptr. The other is to fix the compilation errors and use valgrind to pinpoint the memory error (if any).
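
As a rough illustration of the print-statement approach, here is a minimal sketch in plain C++. The helper name checkPointer and its message format are assumptions made for this example; nothing like it exists in the original scripts or in ROOT.

```cpp
#include <iostream>
#include <string>

// Hypothetical helper (not part of the original scripts or of ROOT):
// reports whether a pointer is null at a labelled checkpoint and
// returns true only when the pointer is usable.
template <typename T>
bool checkPointer(const T* ptr, const std::string& label, int line) {
    if (ptr == nullptr) {
        // std::cerr is flushed after every write, so the message is
        // still visible if the program crashes shortly afterwards.
        std::cerr << "[line " << line << "] " << label << " is a nullptr\n";
        return false;
    }
    std::cerr << "[line " << line << "] " << label << " is valid\n";
    return true;
}
```

Calling something like checkPointer(D2PimumuTree, "D2PimumuTree", __LINE__) after each step that assigns or uses the pointer should narrow down the first place where it is null.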

Hello Daniel,

In my opinion, you really have to fix the includes and compile the macro. If we don’t manage to do that, we will always be in the dark as to where exactly it crashes.
For every class the compiler is complaining about (such as RooAddPdf, TMath), add #include "RooAddPdf.h", #include "TMath.h" etc.
Please also run the script as

root.exe -b -l -q  code/ModelFixing.C+g

or from within root

.x code/ModelFixing.C+g

The g tells root to add debug symbols, so we can get line numbers when it crashes. When you get this to work, we might be 70% towards the solution of the problem.

Thank you all for the support.

As I am trying to reproduce the analysis (i.e. it is not my code), I was a bit reluctant to make too many changes to it, but I’ll proceed with all the suggestions you have made. I’ll keep you posted.

Hello dear root experts and enthusiasts,

I added all the missing includes to the respective scripts and uploaded the tracebacks that I get from each file to a debugging branch. As far as I can tell, they show identical errors.

The scripts still work like a charm for the small files.

Hi,

I see 3 stack traces in https://github.com/dprelipcean/reana-demo-lhcb-d2pimumu/tree/debugging_large_files/debug_tracebacks

In all 3 cases, the crash happens in the scripts themselves:

#5  <signal handler called>
#6  0x00007fb191c001e1 in ModelFixing(char const*, char const*) () from /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/code/ModelFixing_C.so
#5  <signal handler called>
#6  0x00007fc1fa77335e in Optimise(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/code/Optimise_C.so
#5  <signal handler called>
#6  0x00007f4504b66687 in OSMassFit(char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/code/OSMassFit_C.so

One way to better understand what is going on is to build the script in debug mode by adding ‘g’ to the ACLiC suffix. You should either remove the *_C.so files to force a rebuild or use two pluses, i.e.

rm *_C.so
root.exe .... myscript.C+g

or

root.exe .... myscript.C++g

The stack trace should now contain the line number at which the failure happens. You can then examine the code there, or you can run in a debugger, e.g.:

gdb --args root.exe ... myscript.C+g ....

Cheers,
Philippe

Have done this and the stack trace is identical.

I guess the reana part needs to be recompiled with debug symbols. Is there a Makefile or some kind of other build step that has to be executed when you set this up?

That is odd. At the very least the line:

#6  0x00007fc1fa77335e in Optimise(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () from /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/code/Optimise_C.so

should now indicate a line number in Optimise.C. Has the library really been deleted and regenerated with debug symbols?

I am compiling locally on my machine, i.e. outside reana.

Yes, I have followed these instructions

Did you download reana and compile it yourself, or do you link against a precompiled one, e.g. from cvmfs?

For this debugging, I am running root locally on my machine, outside reana. My version is:

ROOT Version: 6.18/00
Built for linuxx8664gcc

But to answer your question:

I am running reana on production, i.e. using a token from the reana team on their clusters.

Hi,

as Philippe mentioned in a previous post, the crashes always seem to happen in the reana part, not in root. Could you ask the reana people if they have a debug version of reana? We are still not getting any line numbers, so it’s hard to drill down to the cause of this issue.

Hello,
Thank you so much for your support. I would like to stress that for the debugging, I am running my analysis locally on my machine (i.e. I am not using reana!).

To reiterate, after removing all the *_C from previous runs, I am running the command:
root.exe -b -l -q 'code/ModelFixing.C+("data/1GB/D2PiMuMuOS.root", "PhiModels.root")'

and I still get the same traceback

Are there other debugging options (if this does not seem to work) that we could explore?

Try:

 root.exe -b -l -q 'code/ModelFixing.C++g("data/1GB/D2PiMuMuOS.root", "PhiModels.root")'

Identical output, namely:

$ root.exe -b -l -q 'code/ModelFixing.C++g("data/1GB/D2PiMuMuOS.root", "PhiModels.root")'

Processing code/ModelFixing.C++g("data/1GB/D2PiMuMuOS.root", "PhiModels.root")...
Info in <TUnixSystem::ACLiC>: creating shared library /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/./code/ModelFixing_C.so

RooFit v3.60 -- Developed by Wouter Verkerke and David Kirkby 
                Copyright (C) 2000-2013 NIKHEF, University of California & Stanford University
                All rights reserved, please read http://roofit.sourceforge.net/license.txt

Running ModelFixing
Start analysis

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f5e30592687 in __GI___waitpid (pid=8935, stat_loc=stat_loc@entry=0x7ffc17b20a68, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007f5e304fd067 in do_system (line=<optimised out>) at ../sysdeps/posix/system.c:149
#2  0x00007f5e311a1763 in TUnixSystem::Exec (shellcmd=<optimised out>, this=0x559c0bfa0b80) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:2106
#3  TUnixSystem::StackTrace (this=0x559c0bfa0b80) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:2400
#4  0x00007f5e311a4154 in TUnixSystem::DispatchSignals (this=0x559c0bfa0b80, sig=kSigSegmentationViolation) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:3631
#5  <signal handler called>
#6  0x00007f5e1f88ec1c in ModelFixing (inputfilename=0x7f5e3196d000 "data/1GB/D2PiMuMuOS.root", phimodels_filename=0x7f5e3196d019 "PhiModels.root") at /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/./code/ModelFixing.C:193
#7  0x00007f5e3196e07e in ?? ()
#8  0x0000559c0c058ea0 in ?? ()
#9  0x000000013196e000 in ?? ()
#10 0x00007f5e2b6772d0 in ?? () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#11 0x00007ffc17b24ab0 in ?? ()
#12 0x00007f5e3196e000 in ?? ()
#13 0x00007f5e2b6508c0 in cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#14 0x00007f5e2b5e6177 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#15 0x00007f5e2b5e77df in cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#16 0x00007f5e2b5e7a87 in cling::Interpreter::process(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::Value*, cling::Transaction**, bool) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#17 0x00007f5e2b6a7f4d in cling::MetaProcessor::process(llvm::StringRef, cling::Interpreter::CompilationResult&, cling::Value*, bool) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#18 0x00007f5e2b56660e in HandleInterpreterException (metaProcessor=0x559c0c576e30, input_line=<optimised out>, compRes=@0x7ffc17b24a9c: cling::Interpreter::kSuccess, result=result@entry=0x7ffc17b24ab0) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/metacling/src/TCling.cxx:2123
#19 0x00007f5e2b57a94f in TCling::ProcessLine (this=0x559c0c003660, line=<optimised out>, error=0x7ffc17b25c1c) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/metacling/src/TCling.cxx:2240
#20 0x00007f5e2b56f2f7 in TCling::ProcessLineSynch (this=0x559c0c003660, line=0x559c0c573990 ".X  /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/./code/ModelFixing.C++g(\"data/1GB/D2PiMuMuOS.root\", \"PhiModels.root\")", error=0x7ffc17b25c1c) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/metacling/src/TCling.cxx:3147
#21 0x00007f5e3104fbc8 in TApplication::ExecuteFile (file=<optimised out>, error=0x7ffc17b25c1c, keep=<optimised out>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TApplication.cxx:1162
#22 0x00007f5e3104f36c in TApplication::ProcessLine (this=0x559c0bfef680, line=<optimised out>, sync=<optimised out>, err=0x7ffc17b25c1c) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TApplication.cxx:1007
#23 0x00007f5e31533902 in TRint::ProcessLineNr (this=this@entry=0x559c0bfef680, filestem=filestem@entry=0x7f5e31545bf7 "ROOT_cli_", line=line@entry=0x7ffc17b25c20 ".x code/ModelFixing.C++g(\"data/1GB/D2PiMuMuOS.root\", \"PhiModels.root\")", error=error@entry=0x7ffc17b25c1c) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/rint/src/TRint.cxx:761
#24 0x00007f5e315351f9 in TRint::Run (this=0x559c0bfef680, retrn=<optimised out>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/rint/src/TRint.cxx:421
#25 0x0000559c0b65da2c in main (argc=<optimised out>, argv=0x7ffc17b27da8) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/main/src/rmain.cxx:30
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  0x00007f5e1f88ec1c in ModelFixing (inputfilename=0x7f5e3196d000 "data/1GB/D2PiMuMuOS.root", phimodels_filename=0x7f5e3196d019 "PhiModels.root") at /home/dprelipc/project/reana/reana-demo-lhcb-d2pimumu/./code/ModelFixing.C:193
#7  0x00007f5e3196e07e in ?? ()
#8  0x0000559c0c058ea0 in ?? ()
#9  0x000000013196e000 in ?? ()
#10 0x00007f5e2b6772d0 in ?? () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#11 0x00007ffc17b24ab0 in ?? ()
#12 0x00007f5e3196e000 in ?? ()
#13 0x00007f5e2b6508c0 in cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#14 0x00007f5e2b5e6177 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#15 0x00007f5e2b5e77df in cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#16 0x00007f5e2b5e7a87 in cling::Interpreter::process(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::Value*, cling::Transaction**, bool) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#17 0x00007f5e2b6a7f4d in cling::MetaProcessor::process(llvm::StringRef, cling::Interpreter::CompilationResult&, cling::Value*, bool) () from /home/dprelipc/Documents/root/root_from_source/build/lib/libCling.so
#18 0x00007f5e2b56660e in HandleInterpreterException (metaProcessor=0x559c0c576e30, input_line=<optimised out>, compRes=@0x7ffc17b24a9c: cling::Interpreter::kSuccess, result=result@entry=0x7ffc17b24ab0) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/metacling/src/TCling.cxx:2123
===========================================================


Root > 
 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f5e30592687 in __GI___waitpid (pid=9033, stat_loc=stat_loc@entry=0x7ffc17b22ee8, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007f5e304fd067 in do_system (line=<optimised out>) at ../sysdeps/posix/system.c:149
#2  0x00007f5e311a1763 in TUnixSystem::Exec (shellcmd=<optimised out>, this=0x559c0bfa0b80) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:2106
#3  TUnixSystem::StackTrace (this=0x559c0bfa0b80) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:2400
#4  0x00007f5e311a4154 in TUnixSystem::DispatchSignals (this=0x559c0bfa0b80, sig=kSigSegmentationViolation) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:3631
#5  <signal handler called>
#6  0x00007f5e3104101b in (anonymous namespace)::R__ListSlowClose (files=0x559c0bfbe870) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TROOT.cxx:1123
#7  0x00007f5e31041cac in TROOT::CloseFiles (this=this@entry=0x7f5e31505200 <ROOT::Internal::GetROOT1()::alloc>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TROOT.cxx:1171
#8  0x00007f5e310423c2 in TROOT::EndOfProcessCleanups (this=0x7f5e31505200 <ROOT::Internal::GetROOT1()::alloc>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TROOT.cxx:1250
#9  0x00007f5e3119e6d1 in TUnixSystem::Exit (this=<optimised out>, code=129, mode=<optimised out>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:2141
#10 0x00007f5e310502ad in TApplication::Terminate (this=0x559c0bfef680, status=129) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TApplication.cxx:1252
#11 0x00007f5e31535295 in TRint::Run (this=0x559c0bfef680, retrn=<optimised out>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/rint/src/TRint.cxx:442
#12 0x0000559c0b65da2c in main (argc=<optimised out>, argv=0x7ffc17b27da8) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/main/src/rmain.cxx:30
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  0x00007f5e3104101b in (anonymous namespace)::R__ListSlowClose (files=0x559c0bfbe870) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TROOT.cxx:1123
#7  0x00007f5e31041cac in TROOT::CloseFiles (this=this@entry=0x7f5e31505200 <ROOT::Internal::GetROOT1()::alloc>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TROOT.cxx:1171
#8  0x00007f5e310423c2 in TROOT::EndOfProcessCleanups (this=0x7f5e31505200 <ROOT::Internal::GetROOT1()::alloc>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TROOT.cxx:1250
#9  0x00007f5e3119e6d1 in TUnixSystem::Exit (this=<optimised out>, code=129, mode=<optimised out>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/unix/src/TUnixSystem.cxx:2141
#10 0x00007f5e310502ad in TApplication::Terminate (this=0x559c0bfef680, status=129) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/base/src/TApplication.cxx:1252
#11 0x00007f5e31535295 in TRint::Run (this=0x559c0bfef680, retrn=<optimised out>) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/core/rint/src/TRint.cxx:442
#12 0x0000559c0b65da2c in main (argc=<optimised out>, argv=0x7ffc17b27da8) at /home/dprelipc/Documents/root/root_from_source/root-6.18.00/main/src/rmain.cxx:30
===========================================================


Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1

And if I run it on a 20MB file, then it works and I added its stdout in this file.

Seems it died in: /…/./code/ModelFixing.C:193

Try to replace the line:

TTree* D2PimumuTree = (TTree*) f1.Get(tree_location);

with:

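if (f1.IsZombie()) return; // just a precaution: the file could not be opened
TTree *D2PimumuTree = nullptr;
f1.GetObject(tree_location, D2PimumuTree); // pointer stays null if no TTree is found at tree_location
if (!D2PimumuTree) return; // just a precaution: avoid dereferencing a null pointer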

Thank you so much for all the help!
I finally identified the issue: as you hinted, my tree_location variable, taken from the ReadTreeLocationFromFileName function, did not hold the correct location of the tree inside my ROOT file, and that made everything else fail. Now it runs for all file sizes!

Big applause for the root team,
Daniel
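
For readers who hit the same kind of problem, here is a minimal sketch of a file-name-to-tree-location helper that fails loudly instead of returning an unusable location. The real ReadTreeLocationFromFileName is not shown in this thread, so everything here is an assumption: the function name treeLocationFor, its signature, and the placeholder tree path are all made up for illustration.

```cpp
#include <string>

// Hypothetical stand-in for a helper such as ReadTreeLocationFromFileName.
// The real function is not shown in the thread; the tree path used below
// is a placeholder, not the actual location inside the ROOT file.
bool treeLocationFor(const std::string& fileName, std::string& location) {
    // Strip any directory part, so that "data/1GB/D2PiMuMuOS.root"
    // and "D2PiMuMuOS.root" are treated the same way.
    const std::string::size_type slash = fileName.find_last_of('/');
    const std::string base =
        (slash == std::string::npos) ? fileName : fileName.substr(slash + 1);

    if (base == "D2PiMuMuOS.root") {
        location = "ExampleDir/ExampleTree";  // placeholder path (assumption)
        return true;
    }

    // Unknown file name: clear the output and report failure rather
    // than handing back a location that does not exist in the file.
    location.clear();
    return false;
}
```

The caller can then stop with a clear message when the function returns false, rather than passing an empty or wrong location on to the code that fetches the tree.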