Dear all,
I’m running PROOF on a PoD cluster from the lxbatch system and reading data from the CERN EOS disk.
Sometimes I experience a crash in the merging phase (see below). I’m using ROOT 5.34.01 and xrootd 3.2.2.
From the worker log (see below) it seems the crash happens after:
121016 15:36:47 27513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
and, from the stack trace (see below), after the XrdClientPhyConnection::ReadRaw and XrdClientSock::RecvRaw calls.
I thought this xrootd error was a known issue that had been fixed in 3.2.2, but apparently that is not the case.
Any help is much appreciated. Many thanks in advance.
Best,
Max
…
Wrk-0.7: Compiling src/Wmunu.cxx
Wrk-0.7: Generating dictionary src/AnalysisWmunu_Dict.cxx
Wrk-0.7: Compiling src/AnalysisWmunu_Dict.cxx
Wrk-0.7: Making shared library: libAnalysisWmunu.so
( INFO ) SCycleController : Processing input data type: Wmunu version: mc11_7TeV.p833_v0115_signals_plus
(WARNING) AnalysisManager : Property “UsePileupReweighting” is getting set multiple times
(WARNING) AnalysisManager : Now taking value: true
( INFO ) AnalysisManager : Created instance "wmunu_pt20_mt40_isoid40"of tool "Wmunu"
Looking up for exact location of files: OK (1150 files)
Looking up for exact location of files: OK (1150 files)
Validating files: OK (1150 files)
[TProof:] Total 22996963 events workers|====================| 100.00 % [38356.0 evts/s, 256.6 MB/s]
Worker ‘lxbsp3026.cern.ch-0.93’ has been removed from the active list
Worker ‘lxbsq2943.cern.ch-0.67’ has been removed from the active list
Worker ‘lxbsq2939.cern.ch-0.64’ has been removed from the active list
Worker ‘lxbsp0818.cern.ch-0.94’ has been removed from the active list
Worker ‘lxbse08c04.cern.ch-0.65’ has been removed from the active list
Worker ‘lxbsu2712.cern.ch-0.69’ has been removed from the active list
Worker ‘lxbsu0740.cern.ch-0.66’ has been removed from the active list
Worker ‘lxbrd12c06.cern.ch-0.96’ has been removed from the active list
Worker ‘lxbst1139.cern.ch-0.98’ has been removed from the active list
Worker ‘lxbsq2205.cern.ch-0.95’ has been removed from the active list
Worker ‘lxbsp1209.cern.ch-0.97’ has been removed from the active list
Worker ‘lxbst0739.cern.ch-0.100’ has been removed from the active list
Worker ‘lxbsp1337.cern.ch-0.99’ has been removed from the active list
Worker ‘lxbse11c02.cern.ch-0.70’ has been removed from the active list
Worker ‘lxbsp1241.cern.ch-0.101’ has been removed from the active list
Worker ‘lxbsp1314.cern.ch-0.102’ has been removed from the active list
Worker ‘lxbsu0828.cern.ch-0.103’ has been removed from the active list
Worker ‘lxbsp0848.cern.ch-0.104’ has been removed from the active list
Worker ‘lxbsp2410.cern.ch-0.105’ has been removed from the active list
Worker ‘lxbst2240.cern.ch-0.106’ has been removed from the active list
Worker ‘lxbsq0742.cern.ch-0.71’ has been removed from the active list
Worker ‘lxbsq0522.cern.ch-0.108’ has been removed from the active list
Worker ‘lxbrd16c01.cern.ch-0.107’ has been removed from the active list
Worker ‘lxbrd48c02.cern.ch-0.72’ has been removed from the active list
Worker ‘lxbsu1241.cern.ch-0.74’ has been removed from the active list
Worker ‘lxbsq2305.cern.ch-0.73’ has been removed from the active list
Worker ‘lxbst1205.cern.ch-0.109’ has been removed from the active list
Worker ‘lxbsq2942.cern.ch-0.75’ has been removed from the active list
Worker ‘lxbrf47c05.cern.ch-0.76’ has been removed from the active list
Worker ‘lxbst2305.cern.ch-0.2’ has been removed from the active list
Worker ‘lxbst2314.cern.ch-0.3’ has been removed from the active list
Worker ‘lxbsq2110.cern.ch-0.0’ has been removed from the active list
Worker ‘lxbse08c11.cern.ch-0.77’ has been removed from the active list
Worker ‘lxbst2312.cern.ch-0.13’ has been removed from the active list
Worker ‘lxbst2316.cern.ch-0.6’ has been removed from the active list
Worker ‘lxbsp2817.cern.ch-0.78’ has been removed from the active list
Worker ‘lxbst2314.cern.ch-0.15’ has been removed from the active list
Worker ‘lxbst2305.cern.ch-0.14’ has been removed from the active list
Worker ‘lxbsq2105.cern.ch-0.1’ has been removed from the active list
Worker ‘lxbst2312.cern.ch-0.10’ has been removed from the active list
Worker ‘lxbst2316.cern.ch-0.17’ has been removed from the active list
Worker ‘lxbst2316.cern.ch-0.16’ has been removed from the active list
Worker ‘lxbsq2028.cern.ch-0.9’ has been removed from the active list
Worker ‘lxbsq2023.cern.ch-0.4’ has been removed from the active list
Worker ‘lxbsq2024.cern.ch-0.7’ has been removed from the active list
Worker ‘lxbsq2032.cern.ch-0.8’ has been removed from the active list
Worker ‘lxbsq2010.cern.ch-0.5’ has been removed from the active list
Worker ‘lxbse09c11.cern.ch-0.79’ has been removed from the active list
Worker ‘lxbst2314.cern.ch-0.18’ has been removed from the active list
Worker ‘lxbst2223.cern.ch-0.19’ has been removed from the active list
Worker ‘lxbsq2008.cern.ch-0.11’ has been removed from the active list
Worker ‘lxbsq2108.cern.ch-0.12’ has been removed from the active list
Worker ‘lxbsp1340.cern.ch-0.20’ has been removed from the active list
Worker ‘lxbse09c06.cern.ch-0.80’ has been removed from the active list
Worker ‘lxbre40c05.cern.ch-0.21’ has been removed from the active list
Worker ‘lxbre66c03.cern.ch-0.22’ has been removed from the active list
Worker ‘lxbrk60c05.cern.ch-0.23’ has been removed from the active list
Worker ‘lxbse07c08.cern.ch-0.26’ has been removed from the active list
Worker ‘lxbsu1528.cern.ch-0.27’ has been removed from the active list
Worker ‘lxbse06c05.cern.ch-0.25’ has been removed from the active list
Worker ‘lxbsq2119.cern.ch-0.28’ has been removed from the active list
Worker ‘lxbsu2012.cern.ch-0.81’ has been removed from the active list
Worker ‘lxbsp1324.cern.ch-0.24’ has been removed from the active list
Worker ‘lxbrk62c08.cern.ch-0.29’ has been removed from the active list
Worker ‘lxbse12c08.cern.ch-0.30’ has been removed from the active list
Worker ‘lxbst0834.cern.ch-0.33’ has been removed from the active list
Worker ‘lxbsq2303.cern.ch-0.31’ has been removed from the active list
Worker ‘lxbrd06c04.cern.ch-0.32’ has been removed from the active list
Worker ‘lxbst2205.cern.ch-0.35’ has been removed from the active list
Worker ‘lxbsu1316.cern.ch-0.36’ has been removed from the active list
Worker ‘lxbrd20c06.cern.ch-0.34’ has been removed from the active list
Worker ‘lxbrd20c06.cern.ch-0.110’ has been removed from the active list
Worker ‘lxbsp1242.cern.ch-0.37’ has been removed from the active list
Worker ‘lxbsq0639.cern.ch-0.40’ has been removed from the active list
Worker ‘lxbre48c02.cern.ch-0.111’ has been removed from the active list
Worker ‘lxbsq2226.cern.ch-0.39’ has been removed from the active list
Worker ‘lxbst1109.cern.ch-0.112’ has been removed from the active list
Worker ‘lxbsp1237.cern.ch-0.38’ has been removed from the active list
Worker ‘lxbsu2410.cern.ch-0.42’ has been removed from the active list
Worker ‘lxbsp2619.cern.ch-0.41’ has been removed from the active list
Worker ‘lxbrd08c03.cern.ch-0.82’ has been removed from the active list
Worker ‘lxbsu0817.cern.ch-0.43’ has been removed from the active list
Worker ‘lxbsp1148.cern.ch-0.83’ has been removed from the active list
Worker ‘lxbre64c08.cern.ch-0.44’ has been removed from the active list
Worker ‘lxbsp0926.cern.ch-0.113’ has been removed from the active list
Worker ‘lxbse07c03.cern.ch-0.45’ has been removed from the active list
Worker ‘lxbsp0610.cern.ch-0.84’ has been removed from the active list
Worker ‘lxbsq2403.cern.ch-0.114’ has been removed from the active list
Worker ‘lxbre58c08.cern.ch-0.46’ has been removed from the active list
Worker ‘lxbrk54c07.cern.ch-0.115’ has been removed from the active list
Worker ‘lxbsp2220.cern.ch-0.116’ has been removed from the active list
Worker ‘lxbrd42c04.cern.ch-0.86’ has been removed from the active list
Worker ‘lxbsu0833.cern.ch-0.85’ has been removed from the active list
Worker ‘lxbst0916.cern.ch-0.87’ has been removed from the active list
Worker ‘lxbrd60c02.cern.ch-0.47’ has been removed from the active list
Worker ‘lxbst2310.cern.ch-0.54’ has been removed from the active list
Worker ‘lxbrg2610.cern.ch-0.48’ has been removed from the active list
Worker ‘lxbsp3015.cern.ch-0.51’ has been removed from the active list
Worker ‘lxbse12c02.cern.ch-0.50’ has been removed from the active list
Worker ‘lxbsq0512.cern.ch-0.55’ has been removed from the active list
Worker ‘lxbrd14c01.cern.ch-0.117’ has been removed from the active list
Worker ‘lxbsp1312.cern.ch-0.52’ has been removed from the active list
Worker ‘lxbsu1529.cern.ch-0.49’ has been removed from the active list
Worker ‘lxbsq0537.cern.ch-0.88’ has been removed from the active list
Worker ‘lxbsp3006.cern.ch-0.118’ has been removed from the active list
Worker ‘lxbrd52c03.cern.ch-0.53’ has been removed from the active list
Worker ‘lxbst2313.cern.ch-0.57’ has been removed from the active list
Worker ‘lxbsu1333.cern.ch-0.89’ has been removed from the active list
Worker ‘lxbsq0729.cern.ch-0.60’ has been removed from the active list
Worker ‘lxbsp2336.cern.ch-0.56’ has been removed from the active list
Worker ‘lxbsu0820.cern.ch-0.119’ has been removed from the active list
Worker ‘lxbst0620.cern.ch-0.59’ has been removed from the active list
Worker ‘lxbrd38c07.cern.ch-0.90’ has been removed from the active list
Worker ‘lxbrd16c08.cern.ch-0.58’ has been removed from the active list
Worker ‘lxbsq3019.cern.ch-0.62’ has been removed from the active list
Worker ‘lxbst2326.cern.ch-0.91’ has been removed from the active list
Worker ‘lxbsq3016.cern.ch-0.61’ has been removed from the active list
Worker ‘lxbrk40c07.cern.ch-0.63’ has been removed from the active list
Worker ‘lxbsu1540.cern.ch-0.92’ has been removed from the active list
Worker ‘lxbst0635.cern.ch-0.68’ has been removed from the active list
| session: mbellomo.default.24050.status terminated by peer
( INFO ) TXSlave::Handle… : 0x14a15c90:lxplus435.cern.ch:0 got called … fProof: 0x144b5180, fSocket: 0x14a15f30 (valid: 1)
( INFO ) TXSlave::Handle… : 0x14a15c90: proof: 0x144b5180
TXSlave::HandleError: 0x14a15c90: DONE …
( INFO ) TProof::MarkBad :
( INFO ) TProof::MarkBad : +++ Message from local session : marking lxplus435.cern.ch:21002 (0) as bad
( INFO ) TProof::MarkBad : +++ Reason: received kPROOF_FATAL
+++ Message from local session : marking lxplus435.cern.ch:21002 (0) as bad
+++ Reason: received kPROOF_FATAL
+++ Most likely your code crashed
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root [] TProof::Mgr("mbellomo@lxplus435.cern.ch:21002")->GetSessionLogs()->Display("*")
(WARNING) TProofPlayerRem… : current TQueryResult object is undefined!
(WARNING) TProof::GetMiss… : no (last) query found: do nothing
(WARNING) SCycleController : Cycle statistics not received from: AnalysisManager
(WARNING) SCycleController : Printed statistics will not be correct!
( INFO ) SCycleController : Writing output of “AnalysisManager” to: /tmp/$USER/AnalysisManager.Wmunu.mc11_7TeV.p833_v0115_signals_plus.root
( INFO ) SCycleController : Processing input data type: Wmunu version: mc11_7TeV.p833_v0115_signals_minus
( ERROR ) TUnixSystem::Di… : floating point exception
===========================================================
There was a crash.
This is the entire stack trace of all threads:
Thread 3 (Thread 0x417dc940 (LWP 24029)):
#0 0x00002b377fc6c221 in nanosleep () from /lib64/libc.so.6
#1 0x00002b377fc6c044 in sleep () from /lib64/libc.so.6
#2 0x00002b3787bf2824 in GarbageCollectorThread (arg=0x14a2db40,
thr=<value optimized out>)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientConnMgr.cc:73
#3 0x00002b37879701ff in XrdSysThread_Xeq (myargs=<value optimized out>)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdSys/XrdSysPthread.cc:67
#4 0x00002b377f9bd77d in start_thread () from /lib64/libpthread.so.0
#5 0x00002b377fca625d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x42693940 (LWP 24031)):
#0 0x00002b377fc9d366 in poll () from /lib64/libc.so.6
#1 0x00002b3787bd9b56 in XrdClientSock::RecvRaw (this=0x144b4d00,
buffer=0x14417690, length=8, substreamid=-1, usedsubstreamid=0x0)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientSock.cc:133
#2 0x00002b3787bfbe90 in XrdClientPhyConnection::ReadRaw (this=0x144b3430,
buf=0x14417690, len=8, substreamid=-1, usedsubstreamid=0x42692d28)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientPhyConnection.cc:359
#3 0x00002b3787bff28c in XrdClientMessage::ReadRaw (this=0x14417650,
phy=0x144b3430)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientMessage.cc:152
#4 0x00002b3787bfa7aa in XrdClientPhyConnection::BuildMessage (
this=0x144b3430, IgnoreTimeouts=true, Enqueue=true)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientPhyConnection.cc:440
#5 0x00002b3787bfe89a in SocketReaderThread (arg=0x144b3430,
thr=<value optimized out>)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientPhyConnection.cc:57
#6 0x00002b37879701ff in XrdSysThread_Xeq (myargs=<value optimized out>)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdSys/XrdSysPthread.cc:67
#7 0x00002b377f9bd77d in start_thread () from /lib64/libpthread.so.0
#8 0x00002b377fca625d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b37823c3f10 (LWP 23648)):
#0 0x00002b377fc6be2f in waitpid () from /lib64/libc.so.6
#1 0x00002b377fc0f491 in do_system () from /lib64/libc.so.6
#2 0x00002b377fc0f7e7 in system () from /lib64/libc.so.6
#3 0x00002b3779b4d7d6 in TUnixSystem::StackTrace() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.01/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#4 0x00002b3779b4d0ac in TUnixSystem::DispatchSignals(ESignals) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.01/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#5 <signal handler called>
#6 0x00002b377964d0f1 in SCycleController::ExecuteNextCycle() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#7 0x00002b37796490ca in SCycleController::ExecuteAllCycles() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#8 0x000000000040189c in main ()
The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
#6 0x00002b377964d0f1 in SCycleController::ExecuteNextCycle() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#7 0x00002b37796490ca in SCycleController::ExecuteAllCycles() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#8 0x000000000040189c in main ()
Below you can find the log from one of the workers.
// --------- Start of element log -----------------
// Ordinal: 0.107 (role: worker)
// Path: mbellomo@lxbrd16c01.cern.ch:21001//tmp/PoD_vEivv25468/proof/mbellomo/session-lxplus435-1350391407-24050/worker-0.107-lxbrd16c01-1350391414-16025.log
// # of retrieved lines: 639
(displaying lines: 630 -> 639)
// ------------------------------------------------
121016 15:34:16 12412 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:35:16 23442 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:16 25176 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:46 26651 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:47 27513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
Received SIGTERM: terminating
( INFO ) TXProofServ::Te… : starting session termination operations …
( INFO ) TXProofServ::Te… : process memory footprint: 1737252/-1 kB virtual, 1439684/-1 kB resident
( INFO ) TXProofServ::Te… : data directory ‘/tmp/PoD_vEivv25468/proof/mbellomo/data/0.107/lxbrd16c01-1350391414-16025’ has been removed
Terminate: termination operations ended: quitting!
// --------- End of element log -------------------
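In case it helps anyone checking their own worker logs for the same pattern, here is a minimal sketch in Python that pulls out the "Failed to read header" lines and their timestamps. It assumes the line format shown in the log above (date as YYMMDD, then time, then a numeric field which I take to be a process or thread id). The names `find_readraw_failures` and `SAMPLE` are only illustrative and are not part of ROOT or xrootd.

```python
import re
from datetime import datetime

# Matches lines of the form seen in the worker log above, e.g.
# "121016 15:36:47 27513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes)."
# Assumption: the number after the timestamp is a process/thread id.
READRAW_RE = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2}) (\d+) Xrd: XrdClientMessage::ReadRaw: "
    r"Failed to read header \((\d+) bytes\)\."
)


def find_readraw_failures(lines):
    """Return a list of (timestamp, numeric_id) for each ReadRaw header failure."""
    hits = []
    for line in lines:
        match = READRAW_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%y%m%d %H:%M:%S")
            hits.append((stamp, int(match.group(2))))
    return hits


# The tail of the worker log quoted above, used here as sample input.
SAMPLE = """\
121016 15:34:16 12412 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:35:16 23442 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:16 25176 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:46 26651 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:47 27513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
Received SIGTERM: terminating
"""

if __name__ == "__main__":
    failures = find_readraw_failures(SAMPLE.splitlines())
    for stamp, numeric_id in failures:
        print(stamp.isoformat(sep=" "), numeric_id)
    print("ReadRaw header failures:", len(failures))
```

To run it on a real log, the lines of the file could be passed in place of `SAMPLE.splitlines()`. This only lists the failures and says nothing about their cause.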