Hi everyone,
I have been trying to use PROOF to run some code. When I run the code locally (RunMode="LOCAL", ProofServer="lite://") it works fine. When I run it using PoD (RunMode="PROOF", ProofServer="pmullen@lxplus440.cern.ch:21004") it crashes. The crash is intermittent: for instance, I ran it on 20 PROOF nodes and all of them crashed, but at a later date I ran the same code on 10 nodes and 5 crashed while 5 completed without a problem. The error message given is:
Worker 'lxbsq2008.cern.ch-0.2' has been removed from the active list
Worker 'lxbsq2008.cern.ch-0.4' has been removed from the active list
Worker 'lxbsq2008.cern.ch-0.3' has been removed from the active list
Worker 'lxbsq2008.cern.ch-0.5' has been removed from the active list
+++ Message from top master at lxplus440.cern.ch:21004 : marking lxbsq2008.cern.ch:21006 (0.2) as bad
+++ Reason: received kPROOF_FATAL
+++ Most likely your code crashed on worker 0.2 at lxbsq2008.cern.ch:21006.
+++ Please check the session logs for error messages either using
+++ the 'Show logs' button or executing
+++
+++ root [] TProof::Mgr("lxplus440.cern.ch:21004")->GetSessionLogs()->Display("0.2",0)
The second part of the message (the lines starting with +++) repeats for each node that crashes. The error log returned to my machine from the crashed nodes looks like this:
fraction b changed: 0.000114142
fraction c changed: 1
fraction l changed: 0.0216113
( ERROR ) TUnixSystem::Di… : segmentation violation
===========================================================
There was a crash.
This is the entire stack trace of all threads:
Thread 2 (Thread 0x403e6940 (LWP 11807)):
#0 0x0000003d950ca366 in poll () from /lib64/libc.so.6
#1 0x000000355d01ec56 in XrdClientSock::RecvRaw(void*, int, int, int*) ()
from /usr/lib64/libXrdClient.so.0
#2 0x000000355d037615 in XrdClientPhyConnection::ReadRaw(void*, int, int, int*) () from /usr/lib64/libXrdClient.so.0
#3 0x000000355d034928 in XrdClientMessage::ReadRaw(XrdClientPhyConnection*)
() from /usr/lib64/libXrdClient.so.0
#4 0x000000355d036dbc in XrdClientPhyConnection::BuildMessage(bool, bool) ()
from /usr/lib64/libXrdClient.so.0
#5 0x000000355d0373f2 in SocketReaderThread(void*, XrdClientThread*) ()
from /usr/lib64/libXrdClient.so.0
#6 0x00002ac70d91c067 in XrdSysThread_Xeq (myargs=)
at /build/hegner/LCGCMT/work/xrootd-3.1.0p2/src/XrdSys/XrdSysPthread.cc:87
#7 0x0000003d9600677d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003d950d325d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ac70bc632c0 (LWP 11735)):
#0 0x0000003d95098e2f in waitpid () from /lib64/libc.so.6
#1 0x0000003d9503c491 in do_system () from /lib64/libc.so.6
#2 0x0000003d9503c7e7 in system () from /lib64/libc.so.6
#3 0x00002ac70a38b7d6 in TUnixSystem::StackTrace() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#4 0x00002ac70a38b0ac in TUnixSystem::DispatchSignals(ESignals) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#5 <signal handler called>
#6 0x00002aaab642731c in FatJetStore::Delete() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#7 0x00002aaab63cb193 in AnalysisManager::EndInputData(SInputData const&) ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#8 0x00002aaaabbd8901 in SCycleBaseExec::SlaveTerminate() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libSFrameCore.so
#9 0x00002aaaab91240f in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so
#10 0x00002ac70d365a7f in TProofServ::HandleProcess(TMessage*, TString*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#11 0x00002ac70d367f0c in TProofServ::HandleSocketInput(TMessage*, bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#12 0x00002ac70d35d611 in TProofServ::HandleSocketInput() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#13 0x00002ac70d6b8f19 in TXProofServ::HandleInput(void const*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#14 0x00002ac70d6c93cd in TXSocketHandler::Notify() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#15 0x00002ac70a389704 in TUnixSystem::CheckDescriptors() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#16 0x00002ac70a389d21 in TUnixSystem::DispatchOneEvent(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#17 0x00002ac70a300b46 in TSystem::InnerLoop() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#18 0x00002ac70a302dfc in TSystem::Run() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#19 0x00002ac70a29696f in TApplication::Run(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#20 0x0000000000401c55 in main ()
The lines below might hint at the cause of the crash.
If they do not help you then pleae submit a bug report at
root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
#6 0x00002aaab642731c in FatJetStore::Delete() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#7 0x00002aaab63cb193 in AnalysisManager::EndInputData(SInputData const&) ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libAnalysisBase.so
#8 0x00002aaaabbd8901 in SCycleBaseExec::SlaveTerminate() ()
from /afs/cern.ch/user/p/pmullen/VH/trunk/SFrame/lib/libSFrameCore.so
#9 0x00002aaaab91240f in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so
#10 0x00002ac70d365a7f in TProofServ::HandleProcess(TMessage*, TString*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#11 0x00002ac70d367f0c in TProofServ::HandleSocketInput(TMessage*, bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#12 0x00002ac70d35d611 in TProofServ::HandleSocketInput() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProof.so
#13 0x00002ac70d6b8f19 in TXProofServ::HandleInput(void const*) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#14 0x00002ac70d6c93cd in TXSocketHandler::Notify() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libProofx.so
#15 0x00002ac70a389704 in TUnixSystem::CheckDescriptors() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#16 0x00002ac70a389d21 in TUnixSystem::DispatchOneEvent(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#17 0x00002ac70a300b46 in TSystem::InnerLoop() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#18 0x00002ac70a302dfc in TSystem::Run() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#19 0x00002ac70a29696f in TApplication::Run(bool) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.00/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#20 0x0000000000401c55 in main ()
( ERROR ) TXProofServ::Ha… : caugth exception triggered by signal '1' while processing dset:'TDSet:physics', file:'/afs/cern.ch/work/p/pmullen/VH-llbb-files/NTUP_SMWZ.00923979._000021.root.1' - check logs for possible stacktrace - last event: 9803
I am looking at the location in my code where the segfault happens (FatJetStore::Delete(), called from AnalysisManager::EndInputData() during SlaveTerminate), but I don't think that is the cause, given how intermittent the crash is. Does anyone have any idea what is causing this issue?
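For reference, this is the kind of guard I am checking for while I look at Delete(). It is only a minimal sketch: FatJet, JetStoreSketch and their members are hypothetical stand-ins, not my actual FatJetStore code. It illustrates a cleanup that tolerates being called when the store was never initialised, or was already cleaned up, which is one situation that could plausibly differ from worker to worker.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a jet object; not the real class from the analysis.
struct FatJet {
    double pt;
    explicit FatJet(double p) : pt(p) {}
};

// Hypothetical store that owns its jets through raw pointers.
class JetStoreSketch {
public:
    JetStoreSketch() : fJets(nullptr) {}
    ~JetStoreSketch() { Delete(); }

    // Non-copyable, so two stores can never own (and delete) the same jets.
    JetStoreSketch(const JetStoreSketch&) = delete;
    JetStoreSketch& operator=(const JetStoreSketch&) = delete;

    // Allocate the container; assumed to be called at the start of input data.
    void Init() {
        if (!fJets) fJets = new std::vector<FatJet*>();
    }

    void Add(double pt) {
        assert(fJets != nullptr && "Init() must be called before Add()");
        fJets->push_back(new FatJet(pt));
    }

    std::size_t Size() const { return fJets ? fJets->size() : 0; }

    // Safe to call repeatedly, and safe if Init() never ran.
    void Delete() {
        if (!fJets) return;                    // nothing was ever allocated
        for (std::size_t i = 0; i < fJets->size(); ++i) {
            delete (*fJets)[i];
        }
        delete fJets;
        fJets = nullptr;                       // a second call becomes a no-op
    }

private:
    std::vector<FatJet*>* fJets;
};
```

If the real Delete() frees a pointer that is only set once events have been processed, or frees it twice, the outcome would depend on what each worker happened to process, which might explain why the crash does not show up on every node.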
Thanks,
Paul