Seg fault between terminate and destructor

Hi

I’m trying to adapt an analysis skeleton we had, made using MakeClass, to do the same thing with a TSelector so that it can run with PROOF. It seems to run normally, but it crashes between the Terminate method and the destructor (a segmentation violation is displayed on the remote machine I’m running on).

In the end the output file has plots that are filled, but not with the correct number of events (one or several files are always missing, as if the last file on every worker had not been written).

I’ve tried running with valgrind, but apart from "indirectly lost" blocks (fixed now) I didn’t find anything suspicious in the logs (the normal one or valgrind’s).

Here is an example of the stack trace at the end of the job on the remote machine:

===========================================================
There was a crash (kSigSegmentationViolation).
This is the entire stack trace of all threads:
===========================================================
#0  0x00002b4551bba115 in waitpid () from /lib64/libc.so.6
#1  0x00002b4551b5c481 in do_system () from /lib64/libc.so.6
#2  0x00002b4549123e69 in TUnixSystem::Exec (this=0x1175d3a0, 
    shellcmd=0x13bb8628 "/afs/cern.ch/sw/lcg/app/releases/ROOT/5.28.00g/x86_64-slc5-gcc43-dbg/root/etc/gdb-backtrace.sh 31370 1>&2")
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:2031
#3  0x00002b4549123022 in TUnixSystem::StackTrace (this=0x1175d3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:2253
#4  0x00002b454912655e in TUnixSystem::DispatchSignals (this=0x1175d3a0, 
    sig=kSigSegmentationViolation)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:1157
#5  0x00002b4549126688 in SigHandler (sig=kSigSegmentationViolation)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:357
#6  0x00002b454911b79c in sighandler (sig=11)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:3521
#7  <signal handler called>
#8  0x00002b4548b1908e in AnalysisSkel::Terminate() ()
   from /afs/cern.ch/user/j/jblancha/JBExample/lib/libJBB.so.0.0
#9  0x00002b455574c1bb in TProofPlayerLite::Finalize (this=0x13534190, 
    force=false, sync=true)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TProofPlayerLite.cxx:291
#10 0x00002b455574d4ac in TProofPlayerLite::Process (this=0x13534190, dset=
    0x7fffb470d660, selector_file=0x135424b8 "MyAnalysis/AnalysisSkel.C++", 
    option=0x40a2a4 "", nentries=-1, first=0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TProofPlayerLite.cxx:230
#11 0x00002b4550df4636 in TProofLite::Process (this=0x134d62f0, 
    dset=0x7fffb470d660, selector=0x40a32f "MyAnalysis/AnalysisSkel.C++", 
    option=0x40a2a4 "", nentries=-1, first=0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofLite.cxx:1123
#12 0x0000000000406b51 in main ()
===========================================================

To be more specific, I looked at the log of every worker (4 workers for 20000 events in this test) and there is always the same stack trace on the workers, but this time after the destructor (I’ve put in a lot of cout statements to see where it crashes):

===========================================================
There was a crash (kSigSegmentationViolation).
This is the entire stack trace of all threads:
===========================================================
#0  0x00002ac237f3f115 in waitpid () from /lib64/libc.so.6
#1  0x00002ac237ee1481 in do_system () from /lib64/libc.so.6
#2  0x00002ac235b65e69 in TUnixSystem::Exec (this=0x912a3a0, 
    shellcmd=0xb4390a8 "/afs/cern.ch/sw/lcg/app/releases/ROOT/5.28.00g/x86_64-slc5-gcc43-dbg/root/etc/gdb-backtrace.sh 14637 1>&2")
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:2031
#3  0x00002ac235b65022 in TUnixSystem::StackTrace (this=0x912a3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:2253
#4  0x00002ac235b6855e in TUnixSystem::DispatchSignals (this=0x912a3a0, 
    sig=kSigSegmentationViolation)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:1157
#5  0x00002ac235b68688 in SigHandler (sig=kSigSegmentationViolation)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:357
#6  0x00002ac235b5d79c in sighandler (sig=11)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:3521
#7  <signal handler called>
#8  0x0000000000000051 in ?? ()
#9  0x00002ac235af6ec7 in TList::Delete (this=0x9c18500, 
    option=0x2ac235f54f70 "")
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/cont/src/TList.cxx:414
#10 0x00002ac235af61c1 in TList::Clear (this=0x9c18500, 
    option=0x2ac235f54f70 "")
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/cont/src/TList.cxx:350
#11 0x00002ac235af7075 in TList::~TList (this=0x9c18500, 
    __in_chrg=<value optimized out>)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/cont/src/TList.cxx:83
#12 0x00002ac239703bbd in TSelectorList::~TSelectorList (this=0x9c18500, 
    __in_chrg=<value optimized out>) at include/TSelectorList.h:33
#13 0x00002ac239702f12 in TSelector::~TSelector (this=0x9d28130, 
    __in_chrg=<value optimized out>)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/tree/tree/src/TSelector.cxx:98
#14 0x00002ac23a432494 in AnalysisSkel::~AnalysisSkel() ()
   from /afs/cern.ch/user/j/jblancha/JBExample/lib/libJBB.so
#15 0x00002ac240159140 in TProofPlayer::~TProofPlayer (this=0x9c14c70, 
    __in_chrg=<value optimized out>)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TProofPlayer.cxx:226
#16 0x00002ac240170fdd in TProofPlayerSlave::~TProofPlayerSlave (
    this=0x9c14c70, __in_chrg=<value optimized out>)
    at include/TProofPlayer.h:337
#17 0x00002ac239b7f33a in TProofServ::DeletePlayer (this=0x96db480)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:6003
#18 0x00002ac239b8b844 in TProofServ::HandleProcess (this=0x96db480, mess=
    0x96f48f0, slb=0x0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:3825
#19 0x00002ac239ba1105 in TProofServ::HandleSocketInput (this=0x96db480, 
    mess=0x96f48f0, all=true)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1595
#20 0x00002ac239b92b49 in TProofServ::HandleSocketInput (this=0x96db480)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1328
#21 0x00002ac239baae5b in TProofServLiteInputHandler::Notify (this=0x96dcfc0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:162
#22 0x00002ac239badfa0 in TProofServLiteInputHandler::ReadNotify (
    this=0x96dcfc0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:154
#23 0x00002ac235b678e5 in TUnixSystem::CheckDescriptors (this=0x912a3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:1259
#24 0x00002ac235b68057 in TUnixSystem::DispatchOneEvent (this=0x912a3a0, 
    pendingOnly=false)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:966
#25 0x00002ac235aaf89a in TSystem::InnerLoop (this=0x912a3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:406
#26 0x00002ac235abf2b0 in TSystem::Run (this=0x912a3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:356
#27 0x00002ac235a2e73f in TApplication::Run (this=0x96db480, retrn=false)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TApplication.cxx:1052
#28 0x00002ac239b904ac in TProofServ::Run (this=0x96db480, retrn=false)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:2472
#29 0x0000000000402348 in main (argc=5, argv=0x7fffc6094568)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/main/src/pmain.cxx:314
===========================================================


The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#8  0x0000000000000051 in ?? ()
#9  0x00002ac235af6ec7 in TList::Delete (this=0x9c18500, 
    option=0x2ac235f54f70 "")
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/cont/src/TList.cxx:414
#10 0x00002ac235af61c1 in TList::Clear (this=0x9c18500, 
    option=0x2ac235f54f70 "")
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/cont/src/TList.cxx:350
#11 0x00002ac235af7075 in TList::~TList (this=0x9c18500, 
    __in_chrg=<value optimized out>)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/cont/src/TList.cxx:83
#12 0x00002ac239703bbd in TSelectorList::~TSelectorList (this=0x9c18500, 
    __in_chrg=<value optimized out>) at include/TSelectorList.h:33
#13 0x00002ac239702f12 in TSelector::~TSelector (this=0x9d28130, 
    __in_chrg=<value optimized out>)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/tree/tree/src/TSelector.cxx:98
#14 0x00002ac23a432494 in AnalysisSkel::~AnalysisSkel() ()
   from /afs/cern.ch/user/j/jblancha/JBExample/lib/libJBB.so
#15 0x00002ac240159140 in TProofPlayer::~TProofPlayer (this=0x9c14c70, 
    __in_chrg=<value optimized out>)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TProofPlayer.cxx:226
#16 0x00002ac240170fdd in TProofPlayerSlave::~TProofPlayerSlave (
    this=0x9c14c70, __in_chrg=<value optimized out>)
    at include/TProofPlayer.h:337
#17 0x00002ac239b7f33a in TProofServ::DeletePlayer (this=0x96db480)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:6003
#18 0x00002ac239b8b844 in TProofServ::HandleProcess (this=0x96db480, mess=
    0x96f48f0, slb=0x0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:3825
#19 0x00002ac239ba1105 in TProofServ::HandleSocketInput (this=0x96db480, 
    mess=0x96f48f0, all=true)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1595
#20 0x00002ac239b92b49 in TProofServ::HandleSocketInput (this=0x96db480)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1328
#21 0x00002ac239baae5b in TProofServLiteInputHandler::Notify (this=0x96dcfc0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:162
#22 0x00002ac239badfa0 in TProofServLiteInputHandler::ReadNotify (
    this=0x96dcfc0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:154
#23 0x00002ac235b678e5 in TUnixSystem::CheckDescriptors (this=0x912a3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:1259
#24 0x00002ac235b68057 in TUnixSystem::DispatchOneEvent (this=0x912a3a0, 
    pendingOnly=false)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:966
#25 0x00002ac235aaf89a in TSystem::InnerLoop (this=0x912a3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:406
#26 0x00002ac235abf2b0 in TSystem::Run (this=0x912a3a0)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:356
#27 0x00002ac235a2e73f in TApplication::Run (this=0x96db480, retrn=false)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TApplication.cxx:1052
#28 0x00002ac239b904ac in TProofServ::Run (this=0x96db480, retrn=false)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:2472
#29 0x0000000000402348 in main (argc=5, argv=0x7fffc6094568)
    at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/main/src/pmain.cxx:314
===========================================================

followed by an error I don’t understand:

on nodes 0 and 2

18:27:48 14614 Wrk-0.0 | Error in <TProofServLite::HandleException>: caugth exception triggered by signal '1' <undef>

on node 1

18:27:42 14628 Wrk-0.1 | Error in <TProofServLite::HandleException>: caugth exception triggered by signal '1' while processing dset:'TDSet:physics', file:'/tmp/jblancha/Test/NTUP_SMWZ.591363._000162.root', event:8213 - check logs for possible stacktrace

on node 3

18:27:42 14637 Wrk-0.3 | Error in <TProofServLite::HandleException>: caugth exception triggered by signal '1' while processing dset:'TDSet:physics', file:'/tmp/jblancha/Test/NTUP_SMWZ.591363._000162.root', event:9999 - check logs for possible stacktrace

ideas are more than welcome
many thanks
jb

Hi

I’m sorry to insist, but I can’t find anything that explains this, and it seems to be outside my code.
Does anyone have any idea?

cheers
jb

In your selector, are you deleting the objects that you put into the output list?
Can you post the essential parts of your selector?
Or, can you post the simplest code reproducing the problem?
Please, also specify the ROOT version which you are using.

G. Ganis

Hi

Many thanks for your questions: I spent the night brainstorming about what you said and I’ve rewritten 50% of my PROOF code. I was deleting objects in my loop (objects that hold histos)…

btw i’m running with 5.30.05

Now the code works in part: it splits the dataset and sends jobs to every worker. Every worker does its job, creating and filling histograms and transmitting them to the master; once this is done it crashes, printing an enormous number of lines, and then it goes on to the next worker. Once this has happened for every worker, it merges the output, creates the output file and finally complains about the previous crashes. This does not prevent the output file from being correctly filled, but I’d be glad to fix it…

Do you have any idea about that?

I’m attaching what is printed on the screen (the worker log files seem error free).

Many thanks in any case for the ideas and comments
regards
jb

Edit: I’ve tested my code on my own laptop and it works fine… Could this come from a particular ROOT version or an environment I didn’t set up correctly?
On my laptop the ROOT version is 5.26.00 and I’m using 5.30.05 at CERN.
ProofPrompted.txt (192 KB)

Hi

I’ve been trying to localise the source of this error, but it only appears when I request more than 2 workers, so I can’t run valgrind, as its use is limited to 2 nodes.
Any idea?

cheers

Hi,

The crash is in the destruction of the output list: it finds an invalid pointer, probably to an object already deleted.
Please make sure that you do not delete by yourself any of the objects that you registered in the output list.

You can run valgrind with more workers: check
root.cern.ch/drupal/content/runn … to_workers

G. Ganis

Hi

Thanks for the reply

This is what I was afraid of, and I didn’t find anything by myself (even with valgrind on more than 2 nodes).
I’m attaching to this post the smallest example of what I need, as you asked previously. I’ve made it as automatic as possible so that you can run it easily.

You just have to source the setup (which should set up ROOT and 2 environment variables) and then run make (which creates the libraries and the needed PAR files).

The data are in the tar file. The only thing to do then is ./bin/main to run normally, or ./bin/main -proof true to run with PROOF.

Hope you’ll have time to look at it

many thanks in advance
jb
Example.tar (1.45 MB)

Hi

Do you have time to look at the small example?
I’ve been trying to identify a possibly removed object in the output list, but I haven’t found anything so far.

thanks
jb

Hi,

I am not sure what I need or should do to run your code, but can you try using pointers to histograms rather than histograms? I mean

vector<TH1F *> m_histArray1D;
vector<TProfile *> m_profArray;
vector<TH2F *> m_histArray2D;

and all related changes …
Otherwise 1) you always duplicate all object creation, and 2) - even worse - when the vector is deleted the histograms, which you have added to the output list, are also deleted.
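
To make points 1) and 2) concrete, here is a minimal sketch in plain C++ (no ROOT needed). `Hist` is a hypothetical stand-in for a histogram class such as TH1F, and all function names are made up for illustration; the sketch only counts copies and destructions and is not taken from the attached example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a histogram class; it only counts how often
// objects of this type are copied and destroyed.
struct Hist {
    static int copies;
    static int destroyed;
    Hist() {}
    Hist(const Hist&) { ++copies; }
    ~Hist() { ++destroyed; }
};
int Hist::copies = 0;
int Hist::destroyed = 0;

// Point 1: with a vector of objects, every push_back stores a copy.
int copiesWhenStoredByValue(std::size_t n) {
    Hist::copies = 0;
    std::vector<Hist> hists;
    hists.reserve(n);            // avoid extra copies from reallocation
    for (std::size_t i = 0; i < n; ++i) {
        Hist h;                  // local object, gone at the end of the iteration
        hists.push_back(h);      // the vector keeps a copy, not h itself
    }
    return Hist::copies;
}

// With a vector of pointers only the pointer is stored; no Hist is copied.
int copiesWhenStoredByPointer(std::size_t n) {
    Hist::copies = 0;
    std::vector<Hist*> hists;
    for (std::size_t i = 0; i < n; ++i) hists.push_back(new Hist());
    int copies = Hist::copies;
    for (std::size_t i = 0; i < hists.size(); ++i) delete hists[i];
    return copies;
}

// Point 2: destroying a vector of objects destroys the objects themselves.
int destroyedByVectorOfValues(std::size_t n) {
    int before = 0;
    {
        std::vector<Hist> hists(n);
        before = Hist::destroyed;
    }                            // vector goes out of scope here
    return Hist::destroyed - before;
}

// Destroying a vector of pointers leaves the pointed-to objects alone.
int destroyedByVectorOfPointers(std::size_t n) {
    std::vector<Hist*> owned;
    for (std::size_t i = 0; i < n; ++i) owned.push_back(new Hist());
    int before = Hist::destroyed;
    {
        std::vector<Hist*> hists(owned);   // copies the pointers only
    }                                      // vector destroyed, objects untouched
    int destroyedByVector = Hist::destroyed - before;
    for (std::size_t i = 0; i < owned.size(); ++i) delete owned[i];
    return destroyedByVector;
}
```

Calling these with n = 3 from a small main gives 3 copies and 3 destructions for the by-value vector, and none of either for the vector of pointers.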

G. Ganis

Hi

I’ll look with pointers…

When do I duplicate things?
I’m not deleting the histograms. I’ve looked at that: the vectors of histograms are attached to the selector, and selectors are created in SlaveBegin and not deleted until the destructor (or at least I think so)…

To run it you’ll have to:

* source setup.sh (for makefile generation)
* source rootsetup.sh (for the environment variables and ROOT, even though you know how to set up ROOT :smiley:)
* run make
* run ./bin/main to run without PROOF, or ./bin/main -proof true to run with PROOF

thanks anyway
jb

When you push_back the histogram into the vector: the vector stores a copy, otherwise the histogram would be gone once you go out of scope.

Yes, but the vectors are members, and are therefore destructed even if you do not do it explicitly; when the vectors are destructed their content is destructed too; and since they hold histos and not pointers to histos, the histos are destructed.

Try with pointers, and then we see what to do.

G. Ganis

Wonderful !

It works perfectly now… but I don’t understand why:
even though the histos were not pointers, I thought that they would belong to the selector. What is the difference in the way memory is organised?

in any case many thanks

cheers
jb

[quote=“jb”]but I don’t understand why:
even though the histos were not pointers, I thought that they would belong to the selector. What is the difference in the way memory is organised?[/quote]
In one case the selector will delete the histograms; this is invalid because, as members of the output list, the histograms have already been deleted; the double deletion ends up in a crash.
In the other case the selector deletes pointers to histograms, which is perfectly valid, and the operation will succeed.
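
The two cases can be simulated safely in plain C++ without actually triggering the double deletion. This is only a sketch under assumptions: `Hist`, `OwningList`, `liveIds` and the two scenario functions are hypothetical names, `OwningList` merely imitates a list that deletes what was added to it, and the sketch assumes the member vector is destroyed before the list is cleaned up, as the worker stack trace suggests. Whichever side runs first, the second one touches an object that is already gone.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Ids of all Hist objects currently alive (bookkeeping for this demo only).
std::set<int>& liveIds() {
    static std::set<int> ids;
    return ids;
}

// Hypothetical stand-in for a histogram; every object gets a unique id.
struct Hist {
    int id;
    Hist() : id(nextId()) { liveIds().insert(id); }
    Hist(const Hist&) : id(nextId()) { liveIds().insert(id); }
    Hist& operator=(const Hist&) { return *this; }   // keeps its own id
    ~Hist() { liveIds().erase(id); }
    static int nextId() { static int counter = 0; return counter++; }
};

// Hypothetical stand-in for an output list that owns what is added to it.
struct OwningList {
    std::vector<std::pair<Hist*, int> > entries;
    void Add(Hist* h) { entries.push_back(std::make_pair(h, h->id)); }
    // Deletes every entry that is still alive and returns how many entries
    // had already been destroyed by someone else. A real list has no such
    // check: it would delete the dangling entries too and crash.
    int DeleteAll() {
        int dangling = 0;
        for (std::size_t i = 0; i < entries.size(); ++i) {
            if (liveIds().count(entries[i].second)) delete entries[i].first;
            else ++dangling;
        }
        entries.clear();
        return dangling;
    }
};

// Case 1: histograms held by value in a vector and also registered in the list.
int danglingWithValues(std::size_t n) {
    OwningList output;
    {
        std::vector<Hist> hists(n);
        for (std::size_t i = 0; i < hists.size(); ++i) output.Add(&hists[i]);
    }   // the vector dies and takes the registered histograms with it
    return output.DeleteAll();
}

// Case 2: the vector holds pointers; only the list deletes the histograms.
int danglingWithPointers(std::size_t n) {
    OwningList output;
    {
        std::vector<Hist*> hists;
        for (std::size_t i = 0; i < n; ++i) {
            hists.push_back(new Hist());
            output.Add(hists.back());
        }
    }   // the vector dies, but only the pointers go away
    return output.DeleteAll();
}
```

With n = 3, the by-value case reports 3 dangling entries (each of which would be a double deletion in a real owning list), while the pointer case reports 0 and leaves no histogram alive afterwards.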

G. Ganis