Intermittent *** Break *** bus error

paco_uk · February 4, 2014, 3:45pm

Hi,

I have a problem which is difficult to reproduce. It occurs in about 30% of the jobs I submit to the lxplus batch system via bsub, but I’ve only seen it once or twice when running locally. I’ve tried running my program locally over a small number of events many times, but I haven’t been able to trigger the problem that way. I guess it has something to do with the shell environment, or maybe my linking is wrong, but I don’t know where to start looking.

Typically the program crashes on startup, although occasionally it crashes at the start of my event loop. Here is the error message I receive. Is it possible to tell from it what is causing the problem?

 *** Break *** bus error



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00002b3b86d258be in waitpid () from /lib64/libc.so.6
#1  0x00002b3b86cb7909 in do_system () from /lib64/libc.so.6
#2  0x00002b3b80faf1a0 in TUnixSystem::StackTrace() () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libCore.so
#3  0x00002b3b80fb19f3 in TUnixSystem::DispatchSignals(ESignals) () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libCore.so
#4  <signal handler called>
#5  0x00002b3b83e1f5cd in G__set_cpp_environmentG__Tree () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libTree.so
#6  0x00002b3b83ed6d99 in G__cpp_setupG__Tree () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libTree.so
#7  0x00002b3b8180f783 in G__call_setup_funcs () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libCint.so
#8  0x00002b3b83edb01d in G__cpp_setup_initG__Tree::G__cpp_setup_initG__Tree() () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libTree.so
#9  0x00002b3b83ed71bf in __static_initialization_and_destruction_0(int, int) () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libTree.so
#10 0x00002b3b83ed71f4 in _GLOBAL__sub_I_G__Tree.cxx () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libTree.so
#11 0x00002b3b83edf066 in __do_global_ctors_aux () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libTree.so
#12 0x00002b3b83d95a8b in _init () from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libTree.so
#13 0x00002b3b86c78000 in ?? ()
#14 0x00002b3b7fa2c555 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#15 0x00002b3b7fa1eb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#16 0x0000000000000009 in ?? ()
#17 0x00007fffda64034e in ?? ()
#18 0x00007fffda64035d in ?? ()
#19 0x00007fffda640367 in ?? ()
#20 0x00007fffda64037d in ?? ()
#21 0x00007fffda640387 in ?? ()
#22 0x00007fffda640395 in ?? ()
#23 0x00007fffda6403a6 in ?? ()
#24 0x00007fffda6403fd in ?? ()
#25 0x00007fffda64040d in ?? ()
#26 0x0000000000000000 in ?? ()
===========================================================

The crash is most likely caused by a problem in your script.
Try to compile it (.L myscript.C+g) and fix any errors.
If that does not help then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.

My code is compiled C++ linked to ROOT like this:

COMPILE PHASE

./build/ccd-gcc -O3 -pg -Wall -fPIC -std=c++0x -I/afs/cern.ch/sw/lcg/external/Boost/1.50.0_python2.7/x86_64-slc6-gcc46-opt/include/boost-1_50/ -I-pthread -m64 -I/afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/include -I/afs/cern.ch/user/f/fnewson/work/hnu/frog/code/event/inc -Iapps/inc  -Isubscriber/inc  -Ievent/inc  -Itools/inc  -Ireco/inc  -Iparse/inc  -Ianalyses/inc  -Ifactory/inc -o apps/obj/hnureco.o -c apps/src/hnureco.cpp

LINK PHASE

./build/ccd-gcc -O3 -pg -o apps/hnureco apps/obj/hnureco.o subscriber/subscriber.a event/event.a tools/tools.a reco/reco.a parse/parse.a analyses/analyses.a factory/factory.a -L/afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib -lCore -lCint -lRIO -lNet -lHist -lGraf -lGraf3d -lGpad -lTree -lRint -lPostscript -lMatrix -lPhysics -lMathCore -lThread -pthread -lm -ldl -rdynamic -lMathMore -L/afs/cern.ch/sw/lcg/external/Boost/1.50.0_python2.7/x86_64-slc6-gcc46-opt/lib -lboost_program_options-gcc46-mt-1_50 -lboost_system-gcc46-mt-1_50 -lboost_serialization-gcc46-mt-1_50 -lboost_filesystem-gcc46-mt-1_50

./build/ccd-gcc is just a wrapper for gcc.

Danilo · February 5, 2014, 7:45am

Hi,

could Valgrind or gdb tell you something more about the crash? It looks like a memory corruption caused by your code.

Cheers,
Danilo

paco_uk · February 5, 2014, 9:56am

Hi Danilo,

Thanks for your advice. The last couple of jobs I submitted have worked, so it’s difficult to know what difference my changes are making, but these are my results from trying Valgrind.

I ran:

 valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --track-fds=yes --track-origins=yes ./apps/hnureco -c p5.data.kp.q11t --auto -m input/halo_plots.info -n 1000 > output/valgrind_check2 2>&1

It produced this error during the start-up phase:

==769== Conditional jump or move depends on uninitialised value(s)
==769==    at 0x416CED: std::basic_istream<char, std::char_traits<char> >& boost::io::detail::operator>><char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, boost::io::detail::quoted_proxy<std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char> const&) (quoted_manip.hpp:131)
==769==    by 0x41724F: std::back_insert_iterator<std::vector<boost::filesystem::path, std::allocator<boost::filesystem::path> > > std::copy<std::istream_iterator<boost::filesystem::path, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<boost::filesystem::path, std::allocator<boost::filesystem::path> > > >(std::istream_iterator<boost::filesystem::path, char, std::char_traits<char>, long>, std::istream_iterator<boost::filesystem::path, char, std::char_traits<char>, long>, std::back_insert_iterator<std::vector<boost::filesystem::path, std::allocator<boost::filesystem::path> > >) (path.hpp:667)
==769==    by 0x418DA5: main (hnureco.cpp:209)
==769==  Uninitialised value was created by a stack allocation
==769==    at 0x416CB0: std::basic_istream<char, std::char_traits<char> >& boost::io::detail::operator>><char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, boost::io::detail::quoted_proxy<std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char> const&) (quoted_manip.hpp:125)
==769==

It turned out I could make this error disappear by replacing this line:

std::copy( std::istream_iterator<path>( ifs ), std::istream_iterator<path>(), std::back_inserter( filenames ) );

with these two lines:

std::vector<std::string> temp_filenames{ std::istream_iterator<std::string>( ifs ),
                                         std::istream_iterator<std::string>() };

std::transform( temp_filenames.begin(), temp_filenames.end(),
                std::back_inserter( filenames ), []( const std::string& s ){ return path( s ); } );

suggesting that std::istream_iterator<path> does something odd that std::istream_iterator<std::string> doesn’t.

The rest of my program ran without any messages from Valgrind until it exited, at which point I received 1000000 (!) lines of errors, beginning like this:

==30633==
==30633== FILE DESCRIPTORS: 5 open at exit.
==30633== Open file descriptor 12: /var/lib/sss/mc/group
==30633==    at 0xBC2CF80: __open_nocancel (in /lib64/libpthread-2.12.so)
==30633==    by 0xE685250: ???
==30633==    by 0xE685A00: ???
==30633==    by 0xE683C49: ???
==30633==    by 0xBEE562C: getgrgid_r@@GLIBC_2.2.5 (in /lib64/libc-2.12.so)
==30633==    by 0xBEE4D6E: getgrgid (in /lib64/libc-2.12.so)
==30633==    by 0x1909260D: getDefaultForGlobal (in /usr/lib64/libcastorclient.so.2.1.13.5)
==30633==    by 0x18C1A9A7: rfioTURLFromString (in /usr/lib64/libcastorrfio.so.2.1.13.5)
==30633==    by 0x18BF9268: rfio_parseln (in /usr/lib64/libcastorrfio.so.2.1.13.5)
==30633==    by 0x18BFA0B4: rfio_parse (in /usr/lib64/libcastorrfio.so.2.1.13.5)
==30633==    by 0x189D4C87: TCastorFile::FindServerAndPath() (in /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.14/x86_64-slc6-gcc46-opt/root/lib/libRCastor.so)

and ending like this:

==30633== LEAK SUMMARY:
==30633==    definitely lost: 1,695 bytes in 12 blocks
==30633==    indirectly lost: 5,530 bytes in 99 blocks
==30633==      possibly lost: 22,743 bytes in 265 blocks
==30633==    still reachable: 8,159,959 bytes in 92,584 blocks
==30633==         suppressed: 0 bytes in 0 blocks
==30633==
==30633== For counts of detected and suppressed errors, rerun with: -v
==30633== ERROR SUMMARY: 227 errors from 227 contexts (suppressed: 1668 from 14)

Do you know if it’s normal to get messages like this when using ROOT or does that mean I’ve broken something?
