Xrootd crash

Dear all,

I’m running PROOF on a PoD cluster from the lxbatch system and reading data from the CERN EOS disk.
Sometimes I experience a crash in the merging phase (see below). I’m using ROOT 5.34.01 and xrootd 3.2.2.
From the worker log (see below), the crash seems to happen after:

121016 15:36:47 27513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).

and, from the stack trace (see below), after the XrdClientPhyConnection::ReadRaw and XrdClientSock::RecvRaw calls.
I thought the above xrootd error was a known issue that had been fixed in 3.2.2, but apparently this is not the case.

Any help is much appreciated. Many thanks in advance.

best,
Max


Wrk-0.7: Compiling src/Wmunu.cxx
Wrk-0.7: Generating dictionary src/AnalysisWmunu_Dict.cxx
Wrk-0.7: Compiling src/AnalysisWmunu_Dict.cxx
Wrk-0.7: Making shared library: libAnalysisWmunu.so
( INFO ) SCycleController : Processing input data type: Wmunu version: mc11_7TeV.p833_v0115_signals_plus
(WARNING) AnalysisManager : Property “UsePileupReweighting” is getting set multiple times
(WARNING) AnalysisManager : Now taking value: true
( INFO ) AnalysisManager : Created instance "wmunu_pt20_mt40_isoid40"of tool "Wmunu"
Looking up for exact location of files: OK (1150 files)
Looking up for exact location of files: OK (1150 files)
Validating files: OK (1150 files)
[TProof:] Total 22996963 events workers|====================| 100.00 % [38356.0 evts/s, 256.6 MB/s]
Worker ‘lxbsp3026.cern.ch-0.93’ has been removed from the active list
Worker ‘lxbsq2943.cern.ch-0.67’ has been removed from the active list
Worker ‘lxbsq2939.cern.ch-0.64’ has been removed from the active list
Worker ‘lxbsp0818.cern.ch-0.94’ has been removed from the active list
Worker ‘lxbse08c04.cern.ch-0.65’ has been removed from the active list
Worker ‘lxbsu2712.cern.ch-0.69’ has been removed from the active list
Worker ‘lxbsu0740.cern.ch-0.66’ has been removed from the active list
Worker ‘lxbrd12c06.cern.ch-0.96’ has been removed from the active list
Worker ‘lxbst1139.cern.ch-0.98’ has been removed from the active list
Worker ‘lxbsq2205.cern.ch-0.95’ has been removed from the active list
Worker ‘lxbsp1209.cern.ch-0.97’ has been removed from the active list
Worker ‘lxbst0739.cern.ch-0.100’ has been removed from the active list
Worker ‘lxbsp1337.cern.ch-0.99’ has been removed from the active list
Worker ‘lxbse11c02.cern.ch-0.70’ has been removed from the active list
Worker ‘lxbsp1241.cern.ch-0.101’ has been removed from the active list
Worker ‘lxbsp1314.cern.ch-0.102’ has been removed from the active list
Worker ‘lxbsu0828.cern.ch-0.103’ has been removed from the active list
Worker ‘lxbsp0848.cern.ch-0.104’ has been removed from the active list
Worker ‘lxbsp2410.cern.ch-0.105’ has been removed from the active list
Worker ‘lxbst2240.cern.ch-0.106’ has been removed from the active list
Worker ‘lxbsq0742.cern.ch-0.71’ has been removed from the active list
Worker ‘lxbsq0522.cern.ch-0.108’ has been removed from the active list
Worker ‘lxbrd16c01.cern.ch-0.107’ has been removed from the active list
Worker ‘lxbrd48c02.cern.ch-0.72’ has been removed from the active list
Worker ‘lxbsu1241.cern.ch-0.74’ has been removed from the active list
Worker ‘lxbsq2305.cern.ch-0.73’ has been removed from the active list
Worker ‘lxbst1205.cern.ch-0.109’ has been removed from the active list
Worker ‘lxbsq2942.cern.ch-0.75’ has been removed from the active list
Worker ‘lxbrf47c05.cern.ch-0.76’ has been removed from the active list
Worker ‘lxbst2305.cern.ch-0.2’ has been removed from the active list
Worker ‘lxbst2314.cern.ch-0.3’ has been removed from the active list
Worker ‘lxbsq2110.cern.ch-0.0’ has been removed from the active list
Worker ‘lxbse08c11.cern.ch-0.77’ has been removed from the active list
Worker ‘lxbst2312.cern.ch-0.13’ has been removed from the active list
Worker ‘lxbst2316.cern.ch-0.6’ has been removed from the active list
Worker ‘lxbsp2817.cern.ch-0.78’ has been removed from the active list
Worker ‘lxbst2314.cern.ch-0.15’ has been removed from the active list
Worker ‘lxbst2305.cern.ch-0.14’ has been removed from the active list
Worker ‘lxbsq2105.cern.ch-0.1’ has been removed from the active list
Worker ‘lxbst2312.cern.ch-0.10’ has been removed from the active list
Worker ‘lxbst2316.cern.ch-0.17’ has been removed from the active list
Worker ‘lxbst2316.cern.ch-0.16’ has been removed from the active list
Worker ‘lxbsq2028.cern.ch-0.9’ has been removed from the active list
Worker ‘lxbsq2023.cern.ch-0.4’ has been removed from the active list
Worker ‘lxbsq2024.cern.ch-0.7’ has been removed from the active list
Worker ‘lxbsq2032.cern.ch-0.8’ has been removed from the active list
Worker ‘lxbsq2010.cern.ch-0.5’ has been removed from the active list
Worker ‘lxbse09c11.cern.ch-0.79’ has been removed from the active list
Worker ‘lxbst2314.cern.ch-0.18’ has been removed from the active list
Worker ‘lxbst2223.cern.ch-0.19’ has been removed from the active list
Worker ‘lxbsq2008.cern.ch-0.11’ has been removed from the active list
Worker ‘lxbsq2108.cern.ch-0.12’ has been removed from the active list
Worker ‘lxbsp1340.cern.ch-0.20’ has been removed from the active list
Worker ‘lxbse09c06.cern.ch-0.80’ has been removed from the active list
Worker ‘lxbre40c05.cern.ch-0.21’ has been removed from the active list
Worker ‘lxbre66c03.cern.ch-0.22’ has been removed from the active list
Worker ‘lxbrk60c05.cern.ch-0.23’ has been removed from the active list
Worker ‘lxbse07c08.cern.ch-0.26’ has been removed from the active list
Worker ‘lxbsu1528.cern.ch-0.27’ has been removed from the active list
Worker ‘lxbse06c05.cern.ch-0.25’ has been removed from the active list
Worker ‘lxbsq2119.cern.ch-0.28’ has been removed from the active list
Worker ‘lxbsu2012.cern.ch-0.81’ has been removed from the active list
Worker ‘lxbsp1324.cern.ch-0.24’ has been removed from the active list
Worker ‘lxbrk62c08.cern.ch-0.29’ has been removed from the active list
Worker ‘lxbse12c08.cern.ch-0.30’ has been removed from the active list
Worker ‘lxbst0834.cern.ch-0.33’ has been removed from the active list
Worker ‘lxbsq2303.cern.ch-0.31’ has been removed from the active list
Worker ‘lxbrd06c04.cern.ch-0.32’ has been removed from the active list
Worker ‘lxbst2205.cern.ch-0.35’ has been removed from the active list
Worker ‘lxbsu1316.cern.ch-0.36’ has been removed from the active list
Worker ‘lxbrd20c06.cern.ch-0.34’ has been removed from the active list
Worker ‘lxbrd20c06.cern.ch-0.110’ has been removed from the active list
Worker ‘lxbsp1242.cern.ch-0.37’ has been removed from the active list
Worker ‘lxbsq0639.cern.ch-0.40’ has been removed from the active list
Worker ‘lxbre48c02.cern.ch-0.111’ has been removed from the active list
Worker ‘lxbsq2226.cern.ch-0.39’ has been removed from the active list
Worker ‘lxbst1109.cern.ch-0.112’ has been removed from the active list
Worker ‘lxbsp1237.cern.ch-0.38’ has been removed from the active list
Worker ‘lxbsu2410.cern.ch-0.42’ has been removed from the active list
Worker ‘lxbsp2619.cern.ch-0.41’ has been removed from the active list
Worker ‘lxbrd08c03.cern.ch-0.82’ has been removed from the active list
Worker ‘lxbsu0817.cern.ch-0.43’ has been removed from the active list
Worker ‘lxbsp1148.cern.ch-0.83’ has been removed from the active list
Worker ‘lxbre64c08.cern.ch-0.44’ has been removed from the active list
Worker ‘lxbsp0926.cern.ch-0.113’ has been removed from the active list
Worker ‘lxbse07c03.cern.ch-0.45’ has been removed from the active list
Worker ‘lxbsp0610.cern.ch-0.84’ has been removed from the active list
Worker ‘lxbsq2403.cern.ch-0.114’ has been removed from the active list
Worker ‘lxbre58c08.cern.ch-0.46’ has been removed from the active list
Worker ‘lxbrk54c07.cern.ch-0.115’ has been removed from the active list
Worker ‘lxbsp2220.cern.ch-0.116’ has been removed from the active list
Worker ‘lxbrd42c04.cern.ch-0.86’ has been removed from the active list
Worker ‘lxbsu0833.cern.ch-0.85’ has been removed from the active list
Worker ‘lxbst0916.cern.ch-0.87’ has been removed from the active list
Worker ‘lxbrd60c02.cern.ch-0.47’ has been removed from the active list
Worker ‘lxbst2310.cern.ch-0.54’ has been removed from the active list
Worker ‘lxbrg2610.cern.ch-0.48’ has been removed from the active list
Worker ‘lxbsp3015.cern.ch-0.51’ has been removed from the active list
Worker ‘lxbse12c02.cern.ch-0.50’ has been removed from the active list
Worker ‘lxbsq0512.cern.ch-0.55’ has been removed from the active list
Worker ‘lxbrd14c01.cern.ch-0.117’ has been removed from the active list
Worker ‘lxbsp1312.cern.ch-0.52’ has been removed from the active list
Worker ‘lxbsu1529.cern.ch-0.49’ has been removed from the active list
Worker ‘lxbsq0537.cern.ch-0.88’ has been removed from the active list
Worker ‘lxbsp3006.cern.ch-0.118’ has been removed from the active list
Worker ‘lxbrd52c03.cern.ch-0.53’ has been removed from the active list
Worker ‘lxbst2313.cern.ch-0.57’ has been removed from the active list
Worker ‘lxbsu1333.cern.ch-0.89’ has been removed from the active list
Worker ‘lxbsq0729.cern.ch-0.60’ has been removed from the active list
Worker ‘lxbsp2336.cern.ch-0.56’ has been removed from the active list
Worker ‘lxbsu0820.cern.ch-0.119’ has been removed from the active list
Worker ‘lxbst0620.cern.ch-0.59’ has been removed from the active list
Worker ‘lxbrd38c07.cern.ch-0.90’ has been removed from the active list
Worker ‘lxbrd16c08.cern.ch-0.58’ has been removed from the active list
Worker ‘lxbsq3019.cern.ch-0.62’ has been removed from the active list
Worker ‘lxbst2326.cern.ch-0.91’ has been removed from the active list
Worker ‘lxbsq3016.cern.ch-0.61’ has been removed from the active list
Worker ‘lxbrk40c07.cern.ch-0.63’ has been removed from the active list
Worker ‘lxbsu1540.cern.ch-0.92’ has been removed from the active list
Worker ‘lxbst0635.cern.ch-0.68’ has been removed from the active list

| session: mbellomo.default.24050.status terminated by peer
( INFO ) TXSlave::Handle… : 0x14a15c90:lxplus435.cern.ch:0 got called … fProof: 0x144b5180, fSocket: 0x14a15f30 (valid: 1)
( INFO ) TXSlave::Handle… : 0x14a15c90: proof: 0x144b5180
TXSlave::HandleError: 0x14a15c90: DONE …
( INFO ) TProof::MarkBad :
( INFO ) TProof::MarkBad : +++ Message from local session : marking lxplus435.cern.ch:21002 (0) as bad
( INFO ) TProof::MarkBad : +++ Reason: received kPROOF_FATAL

+++ Message from local session : marking lxplus435.cern.ch:21002 (0) as bad
+++ Reason: received kPROOF_FATAL

+++ Most likely your code crashed
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root [] TProof::Mgr("mbellomo@lxplus435.cern.ch:21002")->GetSessionLogs()->Display("*")

(WARNING) TProofPlayerRem… : current TQueryResult object is undefined!
(WARNING) TProof::GetMiss… : no (last) query found: do nothing
(WARNING) SCycleController : Cycle statistics not received from: AnalysisManager
(WARNING) SCycleController : Printed statistics will not be correct!
( INFO ) SCycleController : Writing output of “AnalysisManager” to: /tmp/$USER/AnalysisManager.Wmunu.mc11_7TeV.p833_v0115_signals_plus.root
( INFO ) SCycleController : Processing input data type: Wmunu version: mc11_7TeV.p833_v0115_signals_minus
( ERROR ) TUnixSystem::Di… : floating point exception

===========================================================
There was a crash.
This is the entire stack trace of all threads:

Thread 3 (Thread 0x417dc940 (LWP 24029)):
#0 0x00002b377fc6c221 in nanosleep () from /lib64/libc.so.6
#1 0x00002b377fc6c044 in sleep () from /lib64/libc.so.6
#2 0x00002b3787bf2824 in GarbageCollectorThread (arg=0x14a2db40,
thr=)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientConnMgr.cc:73
#3 0x00002b37879701ff in XrdSysThread_Xeq (myargs=)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdSys/XrdSysPthread.cc:67
#4 0x00002b377f9bd77d in start_thread () from /lib64/libpthread.so.0
#5 0x00002b377fca625d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x42693940 (LWP 24031)):
#0 0x00002b377fc9d366 in poll () from /lib64/libc.so.6
#1 0x00002b3787bd9b56 in XrdClientSock::RecvRaw (this=0x144b4d00,
buffer=0x14417690, length=8, substreamid=-1, usedsubstreamid=0x0)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientSock.cc:133
#2 0x00002b3787bfbe90 in XrdClientPhyConnection::ReadRaw (this=0x144b3430,
buf=0x14417690, len=8, substreamid=-1, usedsubstreamid=0x42692d28)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientPhyConnection.cc:359
#3 0x00002b3787bff28c in XrdClientMessage::ReadRaw (this=0x14417650,
phy=0x144b3430)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientMessage.cc:152
#4 0x00002b3787bfa7aa in XrdClientPhyConnection::BuildMessage (
this=0x144b3430, IgnoreTimeouts=true, Enqueue=true)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientPhyConnection.cc:440
#5 0x00002b3787bfe89a in SocketReaderThread (arg=0x144b3430,
thr=)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdClient/XrdClientPhyConnection.cc:57
#6 0x00002b37879701ff in XrdSysThread_Xeq (myargs=)
at /build/hegner/LCGCMT/work/xrootd-3.2.2/src/XrdSys/XrdSysPthread.cc:67
#7 0x00002b377f9bd77d in start_thread () from /lib64/libpthread.so.0
#8 0x00002b377fca625d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b37823c3f10 (LWP 23648)):
#0 0x00002b377fc6be2f in waitpid () from /lib64/libc.so.6
#1 0x00002b377fc0f491 in do_system () from /lib64/libc.so.6
#2 0x00002b377fc0f7e7 in system () from /lib64/libc.so.6
#3 0x00002b3779b4d7d6 in TUnixSystem::StackTrace() ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.01/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#4 0x00002b3779b4d0ac in TUnixSystem::DispatchSignals(ESignals) ()
from /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.01/x86_64-slc5-gcc43-opt/root/lib/libCore.so
#5
#6 0x00002b377964d0f1 in SCycleController::ExecuteNextCycle() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#7 0x00002b37796490ca in SCycleController::ExecuteAllCycles() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#8 0x000000000040189c in main ()

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.

#6 0x00002b377964d0f1 in SCycleController::ExecuteNextCycle() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#7 0x00002b37796490ca in SCycleController::ExecuteAllCycles() ()
from /afs/cern.ch/user/m/mbellomo/wk/ElectroweakBosons.next/trunk/SFrame/lib/libSFrameCore.so
#8 0x000000000040189c in main ()

Below you can find the log from one of the workers.

// --------- Start of element log -----------------

// Ordinal: 0.107 (role: worker)

// Path: mbellomo@lxbrd16c01.cern.ch:21001//tmp/PoD_vEivv25468/proof/mbellomo/session-lxplus435-1350391407-24050/worker-0.107-lxbrd16c01-1350391414-16025.log
// # of retrieved lines: 639
(displaying lines: 630 -> 639)

// ------------------------------------------------

121016 15:34:16 12412 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:35:16 23442 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:16 25176 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:46 26651 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:47 27513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
Received SIGTERM: terminating
( INFO ) TXProofServ::Te… : starting session termination operations …
( INFO ) TXProofServ::Te… : process memory footprint: 1737252/-1 kB virtual, 1439684/-1 kB resident
( INFO ) TXProofServ::Te… : data directory ‘/tmp/PoD_vEivv25468/proof/mbellomo/data/0.107/lxbrd16c01-1350391414-16025’ has been removed
Terminate: termination operations ended: quitting!
// --------- End of element log -------------------
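
To see how the ReadRaw failures are spaced in time, a worker log like the one above can be scanned with a few lines of Python. This is only a rough sketch: the line format (YYMMDD HH:MM:SS, a numeric id, then the Xrd message) is assumed from the excerpt above, the function names are made up for illustration, and I have not verified whether the numeric field is a process or a thread id.

```python
import re
from datetime import datetime

# Assumed line format, copied from the worker log excerpt:
#   YYMMDD HH:MM:SS <id> Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
HEADER_FAILURE = re.compile(
    r"^(?P<stamp>\d{6} \d{2}:\d{2}:\d{2}) (?P<ident>\d+) Xrd: "
    r"XrdClientMessage::ReadRaw: Failed to read header"
)


def read_header_failures(lines):
    """Return (timestamp, id) pairs for every ReadRaw header failure line."""
    found = []
    for line in lines:
        match = HEADER_FAILURE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group("stamp"), "%y%m%d %H:%M:%S")
            found.append((stamp, int(match.group("ident"))))
    return found


def gaps_in_seconds(failures):
    """Seconds elapsed between consecutive failures, in log order."""
    stamps = [stamp for stamp, _ in failures]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# The last lines of the worker 0.107 log shown above.
sample = """\
121016 15:34:16 12412 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:35:16 23442 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:16 25176 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:46 26651 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
121016 15:36:47 27513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
Received SIGTERM: terminating
""".splitlines()

failures = read_header_failures(sample)
print(len(failures), gaps_in_seconds(failures))  # prints: 5 [60, 60, 30, 1]
```

For a full log saved to disk, the same function can be fed the lines of the file instead of the sample, for example `read_header_failures(open(path))` with `path` pointing to the saved worker log. In the excerpt the failures come 60, 60, 30 and 1 seconds apart, with the last one a moment before the SIGTERM.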

Dear all,

I’m still stuck with this error; any help would be much appreciated. Any hints on how to get more information from the log files would also be very useful. Thanks in advance!

best,
Max