Hi Gerri,
thank you for joining the discussion. I reproduced the problem and did a backtrace.
This time wrk-0.15 got stuck:
12:34:40 2114 Wrk-0.15 | Info in MySelektor::Process: There are 0 negative tracks
12:34:40 2114 Wrk-0.15 | Info in TProofPlayerSlave::Process: Call Process(179387)
12:34:40 2114 Wrk-0.15 | Info in MySelektor::Process: Process is running on event 179387
12:34:44 2114 Wrk-0.15 | Info in TXProofServ::HandleUrgentData: got interrupt: 0
12:34:44 2114 Wrk-0.15 | Info in TXProofServ::HandleUrgentData: *** Ping
12:34:44 2114 Wrk-0.15 | Info in TXProofServ::UpdateSessionStatus: status (=1) update in path: /pool/admin/.xproofd.1093/activesessions/ubuntu.default.2114.status
12:35:14 2114 Wrk-0.15 | Info in TXProofServ::HandleUrgentData: got interrupt: 0
strace -p 2114
futex(0x7ff5c836f720, FUTEX_WAIT_PRIVATE, 2, NULL
gdb -p 2114
(gdb) thread apply all bt
Thread 4 (Thread 0x7ff5c3893700 (LWP 2122)):
#0 0x00007ff5c809fb03 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c40e42dc in XrdClientSock::RecvRaw (this=0x2b079a0, buffer=0x7ff5bc0009f0, length=8,
substreamid=, usedsubstreamid=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientSock.cc:133
#2 0x00007ff5c41095c3 in XrdClientPhyConnection::ReadRaw (this=0x2b06270, buf=0x7ff5bc0009f0, len=8, substreamid=-1,
usedsubstreamid=0x7ff5c389297c) at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:359
#3 0x00007ff5c4110fdd in XrdClientMessage::ReadRaw (this=0x7ff5bc0009b0, phy=0x2b06270)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientMessage.cc:152
#4 0x00007ff5c41050b5 in XrdClientPhyConnection::BuildMessage (this=0x2b06270, IgnoreTimeouts=true, Enqueue=true)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:440
#5 0x00007ff5c41074fa in SocketReaderThread (arg=0x2b06270, thr=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:57
#6 0x00007ff5c436fb2f in XrdSysThread_Xeq (myargs=0x2b07910)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdSys/XrdSysPthread.cc:67
#7 0x00007ff5c837de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007ff5c80ab4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()
Thread 3 (Thread 0x7ff5c1b14700 (LWP 2222)):
#0 0x00007ff5c807703d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c8076edc in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ff5c40fc594 in GarbageCollectorThread (arg=0x2e64480, thr=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientConnMgr.cc:73
#3 0x00007ff5c436fb2f in XrdSysThread_Xeq (myargs=0x2e66730)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdSys/XrdSysPthread.cc:67
#4 0x00007ff5c837de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007ff5c80ab4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7ff5c0d11700 (LWP 2224)):
#0 0x00007ff5c809fb03 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c40e42dc in XrdClientSock::RecvRaw (this=0x2e6cae0, buffer=0x7ff5b4041f80, length=8,
substreamid=, usedsubstreamid=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientSock.cc:133
#2 0x00007ff5c41095c3 in XrdClientPhyConnection::ReadRaw (this=0x2e6b590, buf=0x7ff5b4041f80, len=8, substreamid=-1,
usedsubstreamid=0x7ff5c0d1097c) at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:359
#3 0x00007ff5c4110fdd in XrdClientMessage::ReadRaw (this=0x7ff5b4041f40, phy=0x2e6b590)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientMessage.cc:152
#4 0x00007ff5c41050b5 in XrdClientPhyConnection::BuildMessage (this=0x2e6b590, IgnoreTimeouts=true, Enqueue=true)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:440
#5 0x00007ff5c41074fa in SocketReaderThread (arg=0x2e6b590, thr=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:57
#6 0x00007ff5c436fb2f in XrdSysThread_Xeq (myargs=0x2e6cda0)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdSys/XrdSysPthread.cc:67
#7 0x00007ff5c837de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007ff5c80ab4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7ff5c966e740 (LWP 2114)):
#0 0x00007ff5c80b91bb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c803dcb1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ff5c803ba37 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ff5c885eded in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ff5c885ef09 in operator new[](unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ff5c8c74859 in TString::Replace(int, int, char const*, int) () from /opt/root/lib/libCore.so
#6 0x00007ff5c8c45e30 in TEnv::Getvalue(char const*) () from /opt/root/lib/libCore.so
#7 0x00007ff5c8c463dc in TEnv::GetValue(char const*, int) () from /opt/root/lib/libCore.so
#8 0x00007ff5c48f701e in TShutdownTimer::Notify() () from /opt/root/lib/libProof.so
#9 0x00007ff5c8c95c5d in TTimer::CheckTimer(TTime const&) () from /opt/root/lib/libCore.so
#10 0x00007ff5c8d01216 in TUnixSystem::DispatchTimers(bool) () from /opt/root/lib/libCore.so
#11 0x00007ff5c8d01417 in TUnixSystem::DispatchSignals(ESignals) () from /opt/root/lib/libCore.so
#12 <signal handler called>
#13 0x00007ff5c80377e5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x00007ff5c8038ec6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00007ff5c803ba45 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x00007ff5c885eded in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007ff5c885ef09 in operator new[](unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007ff5c8c726d6 in TStorage::Alloc(unsigned long) () from /opt/root/lib/libCore.so
#19 0x00007ff5c8cb858a in TObjArray::Init(int, int) () from /opt/root/lib/libCore.so
#20 0x00007ff5c8cb8942 in TObjArray::TObjArray(int, int) () from /opt/root/lib/libCore.so
#21 0x00007ff5c8cafb54 in TClonesArray::TClonesArray(char const*, int, bool) () from /opt/root/lib/libCore.so
#22 0x00007ff5c2e7a0d5 in myEventType::myEventType (this=0x2f95b00) at …/src/myEventType.cpp:26
#23 0x00007ff5c2e7265e in ROOT::new_myEventType (p=0x0) at …/src/myEventDict.cpp:237
#24 0x00007ff5c8cd2fad in TClass::New(TClass::ENewType) const () from /opt/root/lib/libCore.so
#25 0x00007ff5c4cd2957 in TBranchElement::SetAddress(void*) () from /opt/root/lib/libTree.so
#26 0x00007ff5c4ccd0d6 in TBranchElement::GetEntry(long long, int) () from /opt/root/lib/libTree.so
#27 0x00007ff5c4d0bd2c in TTree::GetEntry(long long, int) () from /opt/root/lib/libTree.so
#28 0x00007ff5c2e6ef44 in MySelektor::Process (this=0x2d6d1b0, entry=179387) at …/src/MySelektor.cpp:81
#29 0x00007ff5c1dbcb92 in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) ()
from /opt/root/lib/libProofPlayer.so
#30 0x00007ff5c4915b5d in TProofServ::HandleProcess(TMessage*, TString*) () from /opt/root/lib/libProof.so
#31 0x00007ff5c490f6df in TProofServ::HandleSocketInput(TMessage*, bool) () from /opt/root/lib/libProof.so
#32 0x00007ff5c4902c37 in TProofServ::HandleSocketInput() () from /opt/root/lib/libProof.so
#33 0x00007ff5c45e44d2 in TXProofServ::HandleInput(void const*) () from /opt/root/lib/libProofx.so
#34 0x00007ff5c45f318d in TXSocketHandler::Notify() () from /opt/root/lib/libProofx.so
#35 0x00007ff5c8d0035c in TUnixSystem::CheckDescriptors() () from /opt/root/lib/libCore.so
#36 0x00007ff5c8d01b06 in TUnixSystem::DispatchOneEvent(bool) () from /opt/root/lib/libCore.so
#37 0x00007ff5c8c85556 in TSystem::InnerLoop() () from /opt/root/lib/libCore.so
#38 0x00007ff5c8c87224 in TSystem::Run() () from /opt/root/lib/libCore.so
#39 0x00007ff5c8c2c4af in TApplication::Run(bool) () from /opt/root/lib/libCore.so
#40 0x0000000000401999 in main ()
I found a similar problem on Stack Overflow:
http://stackoverflow.com/questions/15477385/segmentation-fault-while-calling-malloc-and-program-in-deadlock-futex
It looks like the signal handler was invoked while the main thread was inside malloc() (frames #13 to #15 of thread 1), and the handler then called malloc() itself (frame #2, reached via TShutdownTimer::Notify()). As stated in the Stack Overflow post, this can cause a deadlock. By the way, the ROOT version is 5.34/01 and I am using Ubuntu 12.04.
Cheers
Fabian
Update: The problem seems to be the thread-safety of malloc(). I recompiled and relinked with -pthread; that might solve the issue.
Update: I just realized that this is not about the thread-safety of malloc(). The problem is that malloc() is not reentrant (it is not async-signal-safe), so as far as I understand it should never be called inside a signal handler. Or am I getting something wrong here?
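To illustrate what I mean, here is a minimal sketch of the usual way around this, as far as I understand it: the handler only sets a flag, and anything that may allocate runs later from the normal event loop. This is not ROOT code. The names on_signal, handle_timer and run_events are made up for the example, SIGUSR1 merely stands in for whatever signal drives the timers, and the "events" are a plain loop.

```cpp
#include <csignal>
#include <string>
#include <vector>

// Flag written by the handler and read by the main loop. Writing a
// volatile sig_atomic_t is one of the few things a handler may safely do.
static volatile std::sig_atomic_t g_timer_pending = 0;

// Hypothetical handler: it only records that the signal arrived.
// No malloc(), no operator new, no string handling in here.
extern "C" void on_signal(int) { g_timer_pending = 1; }

// Hypothetical deferred work. It runs in normal context, so it may allocate.
std::string handle_timer() { return std::string("shutdown timer checked"); }

// Simulated event loop (assumption: one iteration per processed event).
// Returns a log of the deferred work that was actually executed.
std::vector<std::string> run_events(int n_events, int raise_at) {
    std::vector<std::string> log;
    std::signal(SIGUSR1, on_signal);
    for (int i = 0; i < n_events; ++i) {
        if (i == raise_at) std::raise(SIGUSR1);  // stands in for the timer signal
        std::vector<int> scratch(1024, i);       // per-event allocation, like GetEntry()
        if (g_timer_pending) {                   // checked between events, outside the handler
            g_timer_pending = 0;
            log.push_back(handle_timer());
        }
    }
    return log;
}
```

For example, run_events(5, 2) delivers the signal once during the third iteration and performs the deferred work exactly once, after the handler has already returned. The handler never touches the heap, so it cannot block on the allocator lock that the interrupted malloc() may be holding.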