PROOF intermittent segfault

Hi everyone,

I have been trying to use PROOF to run some code. When I run the code locally (RunMode="LOCAL" ProofServer="lite://") it works fine. When I run it using PoD (RunMode="PROOF" ProofServer="pmullen@lxplus440.cern.ch:21004") it crashes. The crash is intermittent: for instance, when I ran it on 20 PROOF nodes all of them crashed, but at a later date I ran the same code on 10 nodes and 5 crashed while 5 completed without a problem. The error message given is:

Worker ‘lxbsq2008.cern.ch-0.2’ has been removed from the active list
Worker ‘lxbsq2008.cern.ch-0.4’ has been removed from the active list
Worker ‘lxbsq2008.cern.ch-0.3’ has been removed from the active list
Worker ‘lxbsq2008.cern.ch-0.5’ has been removed from the active list

+++ Message from top master at lxplus440.cern.ch:21004 : marking lxbsq2008.cern.ch:21006 (0.2) as bad
+++ Reason: received kPROOF_FATAL

+++ Most likely your code crashed on worker 0.2 at lxbsq2008.cern.ch:21006.
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root [] TProof::Mgr("lxplus440.cern.ch:21004")->GetSessionLogs()->Display("0.2",0)

The second part of the message (from "+++ Message from top master" onwards) repeats for each node that crashes. The log returned to my machine from a crashed node looks like this:

fraction b changed: 0.000114142
fraction c changed: 1
fraction l changed: 0.0216113
( ERROR ) TUnixSystem::Di… : segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:

Thread 2 (Thread 0x403e6940 (LWP 11807)):
#0 0x0000003d950ca366 in poll () from /lib64/libc.so.6
#1 0x000000355d01ec56 in XrdClientSock::RecvRaw(void*, int, int, int*) ()
from /usr/lib64/libXrdClient.so.0
#2 0x000000355d037615 in XrdClientPhyConnection::ReadRaw(void*, int, int, int*) () from /usr/lib64/libXrdClient.so.0
#3 0x000000355d034928 in XrdClientMessage::ReadRaw(XrdClientPhyConnection*)
() from /usr/lib64/libXrdClient.so.0
#4 0x000000355d036dbc in XrdClientPhyConnection::BuildMessage(bool, bool) ()
from /usr/lib64/libXrdClient.so.0
#5 0x000000355d0373f2 in SocketReaderThread(void*, XrdClientThread*) ()
from /usr/lib64/libXrdClient.so.0
#6 0x00002ac70d91c067 in XrdSysThread_Xeq (myargs=)
at /build/hegner/LCGCMT/work/xrootd-3.1.0p2/src/XrdSys/XrdSysPthread.cc:87
#7 0x0000003d9600677d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003d950d325d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2ac70bc632c0 (LWP 11735)):
#0 0x0000003d95098e2f in waitpid () from /lib64/libc.so.6
#1 0x0000003d9503c491 in do_system () from /lib64/libc.so.6
#2 0x0000003d9503c7e7 in system () from /lib64/libc.so.6
#3 0x00002ac70a38b7d6 in TUnixSystem::StackTrace() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#4 0x00002ac70a38b0ac in TUnixSystem::DispatchSignals(ESignals) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#5
#6 0x00002aaab642731c in FatJetStore::Delete() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#7 0x00002aaab63cb193 in AnalysisManager::EndInputData(SInputData const&) ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#8 0x00002aaaabbd8901 in SCycleBaseExec::SlaveTerminate() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libSFrameCore.so
#9 0x00002aaaab91240f in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so
#10 0x00002ac70d365a7f in TProofServ::HandleProcess(TMessage*, TString*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#11 0x00002ac70d367f0c in TProofServ::HandleSocketInput(TMessage*, bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#12 0x00002ac70d35d611 in TProofServ::HandleSocketInput() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#13 0x00002ac70d6b8f19 in TXProofServ::HandleInput(void const*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#14 0x00002ac70d6c93cd in TXSocketHandler::Notify() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#15 0x00002ac70a389704 in TUnixSystem::CheckDescriptors() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#16 0x00002ac70a389d21 in TUnixSystem::DispatchOneEvent(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#17 0x00002ac70a300b46 in TSystem::InnerLoop() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#18 0x00002ac70a302dfc in TSystem::Run() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#19 0x00002ac70a29696f in TApplication::Run(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#20 0x0000000000401c55 in main ()

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.

#6 0x00002aaab642731c in FatJetStore::Delete() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#7 0x00002aaab63cb193 in AnalysisManager::EndInputData(SInputData const&) ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#8 0x00002aaaabbd8901 in SCycleBaseExec::SlaveTerminate() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libSFrameCore.so
#9 0x00002aaaab91240f in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so
#10 0x00002ac70d365a7f in TProofServ::HandleProcess(TMessage*, TString*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#11 0x00002ac70d367f0c in TProofServ::HandleSocketInput(TMessage*, bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#12 0x00002ac70d35d611 in TProofServ::HandleSocketInput() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#13 0x00002ac70d6b8f19 in TXProofServ::HandleInput(void const*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#14 0x00002ac70d6c93cd in TXSocketHandler::Notify() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#15 0x00002ac70a389704 in TUnixSystem::CheckDescriptors() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#16 0x00002ac70a389d21 in TUnixSystem::DispatchOneEvent(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#17 0x00002ac70a300b46 in TSystem::InnerLoop() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#18 0x00002ac70a302dfc in TSystem::Run() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#19 0x00002ac70a29696f in TApplication::Run(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#20 0x0000000000401c55 in main ()

( ERROR ) TXProofServ::Ha… : caugth exception triggered by signal ‘1’ while processing dset:‘TDSet:physics’, file:’/afs/cern.ch/work/p/pmullen/VH-llbb-files/NTUP_SMWZ.00923979._000021.root.1’ - check logs for possible stacktrace - last event: 9803

I am looking at the location in my code where the segfault happens (FatJetStore::Delete()), but I don't think that is the cause, given how intermittent the crash is. Does anyone have any idea what is causing this issue?

Thanks,
Paul

Hi Paul,

Sorry for the late reply.
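The crash is inside code called from SFrame: the backtrace shows it in FatJetStore::Delete(), reached via AnalysisManager::EndInputData() from SCycleBaseExec::SlaveTerminate().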
Since you are taking ROOT from AFS, can you change the ROOT path in your scripts and use the one at

    /afs/cern.ch/user/g/ganis/work/public/root/5.34.00-patches/x86_64-slc5-gcc43-dbg/root

so that we can obtain more information from the crash backtrace? It would also be good to compile SFrame and the rest of your code in debug mode.
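
While waiting for the debug backtrace, one thing that may be worth checking: a cleanup method that is not safe to call on an empty or already-cleaned store can crash on some workers and not others, depending on how many events each worker processed and on what happens to be in memory. This is only a guess, since the FatJetStore source is not shown in this thread. The sketch below is plain C++ with hypothetical names (Jet, JetStore, ProcessAndTearDown are stand-ins, not the real SFrame or analysis classes) and shows a Delete() that can be called any number of times, including when nothing was ever stored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a jet object; not the real analysis class.
struct Jet {
    double pt;
};

// Hypothetical store that owns heap-allocated jets.
class JetStore {
public:
    JetStore() = default;
    // Copying would give two owners of the same pointers, so forbid it.
    JetStore(const JetStore&) = delete;
    JetStore& operator=(const JetStore&) = delete;
    ~JetStore() { Delete(); }

    void Add(double pt) { jets_.push_back(new Jet{pt}); }
    std::size_t Size() const { return jets_.size(); }

    // Safe to call repeatedly: the pointers are dropped from the vector
    // right after being deleted, so a second call finds nothing to free.
    void Delete() {
        for (Jet* jet : jets_) {
            delete jet;
        }
        jets_.clear();
    }

private:
    std::vector<Jet*> jets_;
};

// Mimics one worker: fill the store for nEvents events (nEvents >= 0),
// then tear down twice, as could happen if both an end-of-input hook
// and a destructor clean up. Returns the number of jets left.
std::size_t ProcessAndTearDown(int nEvents) {
    JetStore store;
    for (int i = 0; i < nEvents; ++i) {
        store.Add(20.0 + i);
    }
    assert(store.Size() == static_cast<std::size_t>(nEvents));
    store.Delete();
    store.Delete();  // second call must be harmless
    return store.Size();
}
```

If the real Delete() frees raw pointers without clearing or nulling them, or touches members that are only set up once the first event has been seen, a worker that gets few or no events could behave differently from one that gets many. Again, this is an assumption to check against the actual code, not a diagnosis.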

Gerri Ganis