PROOF gets stuck during merging

Hi folks,

With my setup, PROOF sometimes gets stuck just before the end of the session, during merging. This is not systematic: it happens with varying frequency even though the same data and the same PROOF setup are used.
I am processing one or several datasets, and the more datasets I process, the more likely the problem is to occur.
With large datasets it happens in roughly 50% of the attempts.

Basically everything is fine during the data processing. Then the merging begins, many workers send their output, and when only 2-3 workers still remain, nothing happens anymore. The progress bar stops moving, stays at ~99.9% and never reaches 100%.
Inspecting the machines, CPU usage drops close to 0% on both the workers and the master.
The last message PROOF prints on the client is:
Mst-0: merging output objects … / (3 workers still sending)

As I said, the problem doesn’t happen at every try. Two consecutive tries can result in one success and one failure, for example. I couldn’t find any pattern in it.

I attach a typical example of the master and worker logs.

Has anybody ever encountered this problem? Does the log file look suspicious?
Thanks for any help.

Cheers,
Renaud
ccapl0008-1284480650-15334.txt (6.44 KB)

Hi Renaud,

The log does not say much.
Can you set

gProof->SetLogLevel(3)

and post a case when it hangs?
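
For example, something like this on the client (just a sketch; the master URL, the dataset name and the selector are placeholders):

TProof::Open("master.mydomain");                 // gProof now points to the new session
gProof->SetLogLevel(3);                          // verbose logging for this session
gProof->Process("MyDataSet", "MySelector.C+");   // run the query that usually hangs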

More generally, can you specify the version that you are using and describe a bit more what your job does, especially in terms of output?

Gerri

Hello Gerri,

Sorry for the lack of precision in my previous email.

  • You’ll find in the attached tarball the client logs and the logs from 2 workers.
    The worker logs were very large, so I truncated them in the middle.
    I didn’t touch the client logs. The only suspicious message I can see is this one:
    15:01:57 31400 Wrk-0.13 | Info in TXProofServ::SendResults: message has 1299868 bytes: limit of 1000000 bytes reached - sending …

  • I use ROOT v5-26-00 on the client and v5-27-04 on the nodes. (I also checked with v5-27-04 on the client and the problem is still there.)

  • I don’t really know what the code does. The output objects seem to be only histograms. (The owner of the code may update this thread and comment on that point.)

Should you need anything else, let me know.
Thanks a lot in advance,
Renaud.
output.tgz (190 KB)

Hello Gerri et al.

Sorry, I have no news from the owner of the code I was talking about. Let’s forget about it for now.

In the meantime I have managed to reproduce the problem with an ALICE analysis task, reading AODs that are staged on my local xrootd server. This server is outside the PROOF cluster, i.e. the data is not stored on the PROOF nodes but on ‘external’ disks. The output contains 2 TLists (see the sketch after the list below for how such output is typically registered in a TSelector):

  • 1 list of TObjString’s
  • 1 list of TH1’s and TH2’s
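
For reference, this is roughly how such output is registered in the TSelector so that PROOF can merge it (a minimal sketch with made-up member names fStrings and fHistos, not the actual ALICE task code):

void MySelector::SlaveBegin(TTree * /*tree*/)
{
   // Sketch only: lists added to fOutput are sent back by each worker
   // and merged by the master at the end of the query.
   fStrings = new TList();
   fStrings->SetName("strings");   // list of TObjString's
   fHistos = new TList();
   fHistos->SetName("histos");     // list of TH1's and TH2's
   fOutput->Add(fStrings);
   fOutput->Add(fHistos);
}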

Apparently a few workers stop their duty for some reason and stay stuck forever, which prevents the whole task from ending.

My config is 1 master + 6 slaves, each host having 16 cores. The per-user limit I have set is 64 workers: what I observe is that 4 slaves out of 6 are ‘chosen’ (randomly?) at each new session, and each ‘chosen’ slave uses its 16 cores.
(4 chosen slaves x 16 cores = 64 workers).

In my example, I find ONE worker in the whole set that seems weird. Let’s see:

I launch my session at 16:17.
63 workers finish writing their logs at about 16:40.
1 worker finishes at 16:19.

On the corresponding slave, all workers end their log with something like:

W-TObjectTable::Remove: 0xb0934020 not found at 8185

except ONE worker, whose log file seems to be incomplete:

Run number: 117098
AliAnalysisTaskEMCALPi0CalibSelection Will use fLogWeight 4.500 .
AliAnalysisTaskEMCALPi0CalibSelection - N CaloClusters: 7
Std : i 1, E 0.591122, dispersion 0.820510, m02 0.568117, m20 0.105119
Std : i 1, x 18.174370, y 439.815125, z 33.738663
Recal: i 1, E 0.591122, dispersion 0.820510, m02 0.568117, m20 0.105119
Recal: i 1, x 18.17

The log is cut there.

Checking the system PIDs, this weird worker is found to be consuming much less CPU than the others (obviously):

vernet@ccapl0014:/scratch/proofbox/vernet/last-worker-session$ ps aux|grep proofserv | grep -v grep
vernet 15146 52.5 0.2 319376 138068 ? Sl 16:17 19:39 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15148 52.5 0.2 327408 147576 ? Sl 16:17 19:41 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15151 51.8 0.2 314972 133280 ? Sl 16:17 19:25 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15153 52.3 0.2 319232 142064 ? Sl 16:17 19:33 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15155 50.8 0.3 326532 149976 ? Sl 16:17 18:59 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15157 51.4 0.2 328404 147928 ? Sl 16:17 19:13 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15159 52.9 0.3 330236 162024 ? Sl 16:17 19:46 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15161 51.8 0.3 384220 157968 ? Sl 16:17 19:21 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15163 52.4 0.3 338052 161684 ? Sl 16:17 19:34 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15165 53.6 0.2 319596 147556 ? Sl 16:17 20:01 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15167 52.5 0.3 333352 170148 ? Sl 16:17 19:36 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15169 52.6 0.2 319824 147180 ? Sl 16:17 19:38 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15171 53.1 0.2 323392 140488 ? Sl 16:17 19:51 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15173 51.7 0.3 341540 160908 ? Sl 16:17 19:17 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15175 1.9 0.2 290592 121204 ? Sl 16:17 0:43 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 15177 51.8 0.3 328552 151676 ? Sl 16:17 19:20 /usr/local/root/root_v5.27.04_gcc3.4/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0

Now the problem is that I have NO CLUE what can cause this worker to stop processing.
Has anybody got an idea of what else I could check?

Thanks,
renaud

Hi Renaud,

Quickly about the output of the previous post: the worker for which the output was not received (causing the stuck state) was on ‘ccapl0008.in2p3.fr’, while the logs you posted were for workers on ‘ccapl0009’ and ‘ccapl0010’.
You do not have the logs for the workers on ccapl0008, right?
(The message that you see is normal: it just means that the worker sends its output in chunks of about 1 MB …).

Now the new report.

Not sure to fully understand: how do you start PROOF? Or what do you have in the config file?

This message is (probably) due to a long-standing issue with object streamer info, which was possibly fixed in mid-September.
Unfortunately, because of other problems, the fix could not be thoroughly tested. Martin Vala managed to run some tests and it seems that the messages disappeared.
But it is not just a question of warnings or error messages: the problem fixed was serious, and ALICE relies heavily on the streamer info.
The fix is in 5.27/06. Please try with this version, if you can.

That said, was there anything else in the logs of the worker that was not processing?

Gerri

Hello Gerri,

[quote]Quickly about the output of the previous post: the worker for which the output was not received (causing the stuck state) was on ‘ccapl0008.in2p3.fr’, while the logs you posted were for workers on ‘ccapl0009’ and ‘ccapl0010’.
You do not have the logs for the workers on ccapl0008, right?[/quote]

OK, ccapl0008 is the master host.
I have posted new logs, see below.

I mean that the 64 workers are not spread over the 6 slaves, but always over 4 of them. I didn’t choose that myself, this is just what I observe. I don’t know if that is of interest anyway.
In order to avoid confusion, I have set wmx:-1, so that from now on all the slaves are always used (96 workers now).
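
For reference, this is roughly the line I mean in the xproofd configuration file (a sketch only: the path to the PROOF resource file is a placeholder, and the exact form of the directive may differ on other installations):

xpd.resource static /etc/proof/proof.conf wmx:-1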

OK, I have just set up this new version. That’s the one I use in the following results.
(No ALICE code yet, I now have a crash with that ROOT version, so I will deal with ATLAS code below.)


===> So, please find the tarball here:
cern.ch/rvernet/full_debug.tgz

Inside it, you can find the logs of a session run with SetLogLevel(3) as you suggested, using the ATLAS code (not mine):

  • xrd.cf
  • client.log : logs from the client side
  • laf_end_atlas_logs.txt : a view of the head and tail of all node logs once the merging should be done, but one worker remains
  • laf_end_atlas_psaux.txt : a view of the PROOF connections on each node once the merging should be done, but one worker remains
  • has_finished.log : the list of worker logs showing the occurrence of the string “has finished”
  • worker-0.71-ccapl0013-1286188839-5126.log : the full logs of the problematic worker
  • df.txt : a “df -h” launched on every node

As you can notice in has_finished.log, only ONE worker doesn’t show the string “has finished” in its logs (ccapl0013, ID 5126, worker 0.71).
In laf_end_atlas_psaux.txt, you can see that this worker’s CPU usage is different from the others’ (which are rather homogeneous).
I hope you’ll be able to find something suspicious in the full worker logs (because I don’t :( )

Besides, after I created those different files, I killed the PID of the problematic process by hand.
At that exact moment, the client log showed this string:
“Worker ‘ccapl0013.in2p3.fr-0.71’ has been removed from the active list”
and the merging tried to resume.
In the end I got an error about no space left on device. It may be related to the current problem. Unfortunately I don’t see any machine running out of space … please see the attached df.txt.

Thanks a lot for looking into that problem! :)
Cheers,
Renaud

Hello Gerri,
do you have any news on that issue?
Thanks for any piece of information :)

ciao
renaud

Hello again,

Additional info:
I am looking at all the processes belonging to a given PROOF session. Process #10173 is hanging (see ps aux below):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
vernet 10171 34.4 1.3 882348 670632 ? Rl 12:39 5:16 /usr/local/root/root_v5.28.00/root/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 10173 3.8 0.7 716200 378944 ? Sl 12:39 0:35 /usr/local/root/root_v5.28.00/root/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0
vernet 10175 33.7 1.3 875832 680852 ? Sl 12:39 5:09 /usr/local/root/root_v5.28.00/root/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0

Here is what “strace -p 10173” tells me:

— SIGALRM (Alarm clock) @ 0 (0) —
rt_sigaction(SIGALRM, {0x2b04ddf9ae70, [], SA_RESTORER|SA_RESTART, 0x2b04dfe46b10}, NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
rt_sigreturn(0x1626db20) = -1 EINTR (Interrupted system call)
poll([{fd=64, events=POLLIN}], 1, 5000) = -1 EINTR (Interrupted system call)
— SIGALRM (Alarm clock) @ 0 (0) —
rt_sigaction(SIGALRM, {0x2b04ddf9ae70, [], SA_RESTORER|SA_RESTART, 0x2b04dfe46b10}, NULL, 8) = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}}, NULL) = 0
rt_sigreturn(0x1626db20) = -1 EINTR (Interrupted system call)
poll([{fd=64, events=POLLIN}], 1, 5000) = -1 EINTR (Interrupted system call)

It keeps displaying these messages, nothing else happens, and the other processes give completely different output.

So, as far as I understand, the process is put to sleep, then receives an alarm signal, then goes back to sleep, in an infinite loop.

What can cause such behaviour?
Thanks for your answer.

Renaud

Hi,

Some news here that may be helpful to others.

After playing around with the config, it appears that both

  • increasing the number of file descriptors in the daemon script and
  • reducing the number of workers in the session

make PROOF finally end correctly with a much higher probability.

So it looks reasonable to increase the ulimits. In the xproofd startup script I have set:
ulimit -m 40000000 -v 45000000 -n 50000
and it seems to be enough for me in most cases.
This is at the admin level.

If the stuck-worker problem appears, the user may also ask for a lower number of workers; that should hopefully make the job more successful.
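
One way for a user to do that from the client is to reduce the parallelism right after opening the session, for example (a sketch; the master URL is a placeholder and 32 is an arbitrary number):

TProof::Open("master.mydomain");
gProof->SetParallel(32);   // use at most 32 workers for the following queries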

Cheers,

Renaud

Hi,

I know this is quite an old topic, but I am experiencing the same problem right now. I have an experimental PROOF cluster running with 30 worker nodes. The analysis randomly gets stuck in the merging phase.
It seems that at the end the master is only waiting for one worker:
Master log:
15:54:15 31014 Mst-0 | Info in TProof::Collect: 2 node(s) still active:
15:54:15 31014 Mst-0 | Info in TProof::Collect: ip-10-0-216-223 (0.23)
15:54:15 31014 Mst-0 | Info in TProof::Collect: ip-10-0-166-133 (0.7)
15:54:15 31014 Mst-0 | Info in TProof::HandleInputMessage: got type 1005 from '0.7’
15:54:15 31014 Mst-0 | Info in TProof::HandleInputMessage: 0.7: kPROOF_LOGDONE: status 0 parallel 1
15:54:15 31014 Mst-0 | Info in TProof::Collect: deactivating 0x176af10 (active: 1, 0x1745b70)
15:54:15 31014 Mst-0 | Info in TProof::Collect: 1 node(s) still active:
15:54:15 31014 Mst-0 | Info in TProof::Collect: ip-10-0-216-223 (0.23)
15:54:15 31014 Mst-0 | Info in TPacketizerAdaptive::HandleTimer: fProgress: 0x1aec970, isDone: 0
15:54:15 31014 Mst-0 | Info in TPacketizerAdaptive::HandleTimer: ent:1499057, bytes:5621022935, proct:655.239014, evtrti:1346.211304, mbrti:0.000000 (5360.625205,5360.625205)

The last four lines are then repeated again and again.
The log of the worker shows that it did not finish its analysis, but instead got stuck after an interrupt:
Corresponding worker log:
15:53:03 14991 Wrk-0.23 | Info in TProofPlayerSlave::Process: Call Process(278505)
15:53:03 14991 Wrk-0.23 | Info in MySelektor::Process: Process is running on event 278505
15:53:21 14991 Wrk-0.23 | Info in TXProofServ::HandleUrgentData: got interrupt: 0

15:53:21 14991 Wrk-0.23 | Info in TXProofServ::HandleUrgentData: *** Ping
15:53:21 14991 Wrk-0.23 | Info in TXProofServ::UpdateSessionStatus: status (=1) update in path: /pool/admin/.xproofd.1093/activesessions/ubuntu.default.14991.status
15:53:51 14991 Wrk-0.23 | Info in TXProofServ::HandleUrgentData: got interrupt: 0

15:53:51 14991 Wrk-0.23 | Info in TXProofServ::HandleUrgentData: *** Ping
15:53:51 14991 Wrk-0.23 | Info in TXProofServ::UpdateSessionStatus: status (=1) update in path: /pool/admin/.xproofd.1093/activesessions/ubuntu.default.14991.status
15:54:20 14991 Wrk-0.23 | Info in TXProofServ::HandleUrgentData: got interrupt: 0

Looks like it did not continue the analysis after answering a Ping message.

As far as I can see the process got stuck at a mutex operation:
ubuntu@ip-10-0-216-223:~$ sudo strace -p 14991
Process 14991 attached - interrupt to quit
futex(0x7f609308f720, FUTEX_WAIT_PRIVATE, 2, NULL

Any help would be appreciated.
Fabian

Hi Fabian, have you tried my suggestions above? Doesn’t it help?
Cheers
renaud

Hi Renaud,

yes, I already increased the ulimit. I had pretty much the same experience as you did. The problem happens randomly, with increasing probability when I increase the number of workers or the input size.
As far as I understand your last comment, increasing the ulimit did not solve the problem completely but made it less probable. I would prefer to get rid of it completely ;)

Fabian

OK, in my case the final settings made the problem so unlikely that I could live with it.
But I couldn’t get rid of it either. I haven’t investigated further since then.
Cheers
renaud

Hi Fabian,

The ping is in principle handled in a separate thread, and the messages indicate that it was treated correctly, updating the time stamp on the status file.
There are two possibilities: either the main thread(s) is (are) stuck somewhere, or the last message from the worker got lost. The result of strace may point towards the first hypothesis. When it happens again, could you do a backtrace of the process?
Just attach with gdb and run ‘thread apply all bt’:

$ gdb -p <pid>
 (gdb)  thread apply all bt

This may give hints about where it is stuck and where the race condition is.

Cheers,
Gerri

Hi Gerri,

thank you for joining the discussion. I reproduced the problem and did a backtrace.
This time wrk-0.15 got stuck:

12:34:40 2114 Wrk-0.15 | Info in MySelektor::Process: There are 0 negative tracks
12:34:40 2114 Wrk-0.15 | Info in TProofPlayerSlave::Process: Call Process(179387)
12:34:40 2114 Wrk-0.15 | Info in MySelektor::Process: Process is running on event 179387
12:34:44 2114 Wrk-0.15 | Info in TXProofServ::HandleUrgentData: got interrupt: 0

12:34:44 2114 Wrk-0.15 | Info in TXProofServ::HandleUrgentData: *** Ping
12:34:44 2114 Wrk-0.15 | Info in TXProofServ::UpdateSessionStatus: status (=1) update in path: /pool/admin/.xproofd.1093/activesessions/ubuntu.default.2114.status
12:35:14 2114 Wrk-0.15 | Info in TXProofServ::HandleUrgentData: got interrupt: 0

strace -p 2114
futex(0x7ff5c836f720, FUTEX_WAIT_PRIVATE, 2, NULL

gdb -p 2114
(gdb) thread apply all bt

Thread 4 (Thread 0x7ff5c3893700 (LWP 2122)):
#0 0x00007ff5c809fb03 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c40e42dc in XrdClientSock::RecvRaw (this=0x2b079a0, buffer=0x7ff5bc0009f0, length=8,
substreamid=, usedsubstreamid=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientSock.cc:133
#2 0x00007ff5c41095c3 in XrdClientPhyConnection::ReadRaw (this=0x2b06270, buf=0x7ff5bc0009f0, len=8, substreamid=-1,
usedsubstreamid=0x7ff5c389297c) at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:359
#3 0x00007ff5c4110fdd in XrdClientMessage::ReadRaw (this=0x7ff5bc0009b0, phy=0x2b06270)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientMessage.cc:152
#4 0x00007ff5c41050b5 in XrdClientPhyConnection::BuildMessage (this=0x2b06270, IgnoreTimeouts=true, Enqueue=true)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:440
#5 0x00007ff5c41074fa in SocketReaderThread (arg=0x2b06270, thr=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:57
#6 0x00007ff5c436fb2f in XrdSysThread_Xeq (myargs=0x2b07910)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdSys/XrdSysPthread.cc:67
#7 0x00007ff5c837de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007ff5c80ab4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7ff5c1b14700 (LWP 2222)):
#0 0x00007ff5c807703d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c8076edc in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ff5c40fc594 in GarbageCollectorThread (arg=0x2e64480, thr=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientConnMgr.cc:73
#3 0x00007ff5c436fb2f in XrdSysThread_Xeq (myargs=0x2e66730)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdSys/XrdSysPthread.cc:67
#4 0x00007ff5c837de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007ff5c80ab4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7ff5c0d11700 (LWP 2224)):
#0 0x00007ff5c809fb03 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c40e42dc in XrdClientSock::RecvRaw (this=0x2e6cae0, buffer=0x7ff5b4041f80, length=8,
substreamid=, usedsubstreamid=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientSock.cc:133
#2 0x00007ff5c41095c3 in XrdClientPhyConnection::ReadRaw (this=0x2e6b590, buf=0x7ff5b4041f80, len=8, substreamid=-1,
usedsubstreamid=0x7ff5c0d1097c) at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:359
#3 0x00007ff5c4110fdd in XrdClientMessage::ReadRaw (this=0x7ff5b4041f40, phy=0x2e6b590)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientMessage.cc:152
#4 0x00007ff5c41050b5 in XrdClientPhyConnection::BuildMessage (this=0x2e6b590, IgnoreTimeouts=true, Enqueue=true)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:440
#5 0x00007ff5c41074fa in SocketReaderThread (arg=0x2e6b590, thr=)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdClient/XrdClientPhyConnection.cc:57
#6 0x00007ff5c436fb2f in XrdSysThread_Xeq (myargs=0x2e6cda0)
at /tmp/xrootd-3.2.0-19891/xrootd-3.2.0/src/XrdSys/XrdSysPthread.cc:67
#7 0x00007ff5c837de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8 0x00007ff5c80ab4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ff5c966e740 (LWP 2114)):
#0 0x00007ff5c80b91bb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ff5c803dcb1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ff5c803ba37 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ff5c885eded in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ff5c885ef09 in operator new[](unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ff5c8c74859 in TString::Replace(int, int, char const*, int) () from /opt/root/lib/libCore.so
#6 0x00007ff5c8c45e30 in TEnv::Getvalue(char const*) () from /opt/root/lib/libCore.so
#7 0x00007ff5c8c463dc in TEnv::GetValue(char const*, int) () from /opt/root/lib/libCore.so
#8 0x00007ff5c48f701e in TShutdownTimer::Notify() () from /opt/root/lib/libProof.so
#9 0x00007ff5c8c95c5d in TTimer::CheckTimer(TTime const&) () from /opt/root/lib/libCore.so
#10 0x00007ff5c8d01216 in TUnixSystem::DispatchTimers(bool) () from /opt/root/lib/libCore.so
#11 0x00007ff5c8d01417 in TUnixSystem::DispatchSignals(ESignals) () from /opt/root/lib/libCore.so
#12 <signal handler called>
#13 0x00007ff5c80377e5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x00007ff5c8038ec6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00007ff5c803ba45 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x00007ff5c885eded in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007ff5c885ef09 in operator new[](unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007ff5c8c726d6 in TStorage::Alloc(unsigned long) () from /opt/root/lib/libCore.so
#19 0x00007ff5c8cb858a in TObjArray::Init(int, int) () from /opt/root/lib/libCore.so
#20 0x00007ff5c8cb8942 in TObjArray::TObjArray(int, int) () from /opt/root/lib/libCore.so
#21 0x00007ff5c8cafb54 in TClonesArray::TClonesArray(char const*, int, bool) () from /opt/root/lib/libCore.so
#22 0x00007ff5c2e7a0d5 in myEventType::myEventType (this=0x2f95b00) at …/src/myEventType.cpp:26
#23 0x00007ff5c2e7265e in ROOT::new_myEventType (p=0x0) at …/src/myEventDict.cpp:237
#24 0x00007ff5c8cd2fad in TClass::New(TClass::ENewType) const () from /opt/root/lib/libCore.so
#25 0x00007ff5c4cd2957 in TBranchElement::SetAddress(void*) () from /opt/root/lib/libTree.so
#26 0x00007ff5c4ccd0d6 in TBranchElement::GetEntry(long long, int) () from /opt/root/lib/libTree.so
#27 0x00007ff5c4d0bd2c in TTree::GetEntry(long long, int) () from /opt/root/lib/libTree.so
#28 0x00007ff5c2e6ef44 in MySelektor::Process (this=0x2d6d1b0, entry=179387) at …/src/MySelektor.cpp:81
#29 0x00007ff5c1dbcb92 in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) ()
from /opt/root/lib/libProofPlayer.so
#30 0x00007ff5c4915b5d in TProofServ::HandleProcess(TMessage*, TString*) () from /opt/root/lib/libProof.so
#31 0x00007ff5c490f6df in TProofServ::HandleSocketInput(TMessage*, bool) () from /opt/root/lib/libProof.so
#32 0x00007ff5c4902c37 in TProofServ::HandleSocketInput() () from /opt/root/lib/libProof.so
#33 0x00007ff5c45e44d2 in TXProofServ::HandleInput(void const*) () from /opt/root/lib/libProofx.so
#34 0x00007ff5c45f318d in TXSocketHandler::Notify() () from /opt/root/lib/libProofx.so
#35 0x00007ff5c8d0035c in TUnixSystem::CheckDescriptors() () from /opt/root/lib/libCore.so
#36 0x00007ff5c8d01b06 in TUnixSystem::DispatchOneEvent(bool) () from /opt/root/lib/libCore.so
#37 0x00007ff5c8c85556 in TSystem::InnerLoop() () from /opt/root/lib/libCore.so
#38 0x00007ff5c8c87224 in TSystem::Run() () from /opt/root/lib/libCore.so
#39 0x00007ff5c8c2c4af in TApplication::Run(bool) () from /opt/root/lib/libCore.so
#40 0x0000000000401999 in main ()

I found a similar problem at stackoverflow:
http://stackoverflow.com/questions/15477385/segmentation-fault-while-calling-malloc-and-program-in-deadlock-futex

Looks like the signal handler was called while the main thread was inside malloc, and the handler did a malloc itself. As stated in the stackoverflow post, this can cause a deadlock. By the way, the ROOT version is 5.34/01 and I am using Ubuntu 12.04.
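
To illustrate the pattern I mean, here is a minimal sketch (not the actual ROOT code) of a program that can hang the same way: SIGALRM keeps interrupting a thread that is often inside malloc(), and the handler allocates again, so it can block on the allocator lock that the interrupted code already holds. It deliberately loops forever.

#include <csignal>
#include <cstdlib>
#include <sys/time.h>

static void handler(int)
{
   void *p = std::malloc(64);   // unsafe: malloc is not async-signal-safe
   std::free(p);
}

int main()
{
   std::signal(SIGALRM, handler);
   itimerval tv = {{0, 1000}, {0, 1000}};   // fire SIGALRM every millisecond
   setitimer(ITIMER_REAL, &tv, 0);
   for (;;) {                               // allocate in a tight loop so the signal
      void *p = std::malloc(256);           //   often arrives while inside malloc;
      std::free(p);                         //   runs until it (possibly) deadlocks
   }
   return 0;
}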

Cheers
Fabian

Update: The problem seems to be the thread-safety of malloc(). I recompiled and relinked with -pthread. That might solve the issue.

Update: I just realized that it is not about the thread-safety of malloc(). The problem is that it is not reentrant. So, as far as I understand, it should never be called inside a signal handler, or am I getting something wrong here?
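
As far as I understand, the usual safe pattern looks like this minimal sketch (not ROOT code): the handler only sets a flag, and the real work, including any allocation, is done later from the normal program flow.

#include <csignal>
#include <cstdio>

static volatile std::sig_atomic_t gAlarmSeen = 0;

static void handler(int)
{
   gAlarmSeen = 1;                 // async-signal-safe: only set a flag
}

int main()
{
   std::signal(SIGALRM, handler);
   std::raise(SIGALRM);            // simulate the alarm for this example
   if (gAlarmSeen) {               // the real work happens here, in normal flow,
      gAlarmSeen = 0;              //   where calling malloc/new is fine
      std::printf("alarm handled outside the signal handler\n");
   }
   return 0;
}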

Hi Fabian,

Thank you very much for your posts. Right, signal handlers should never do anything but simple settings.
I will revise all that and try to provide a fix for this ASAP.

I let you know,

Gerri

Hi Gerri,

thanks for looking into it. Looking forward to your reply.
Since the problem did not appear earlier, I suspect that Ubuntu (a Debian derivative) handles
the nested malloc call differently from Scientific Linux (a RedHat derivative), which might be the
more common platform for PROOF users. But this is just a guess, since I am not an expert in
Linux distributions and I have no overview of the distributions used by other PROOF users.

Maybe other forum users have more experience and can comment on this.

Cheers
Fabian

Hi Fabian,

I had a closer look at the problem, and some of those bad behaviours should already be fixed in 5.34/04.
There are some other ‘Notify()’ calls of asynchronous timers that should be ‘cleaned’ (and I am doing that now), but the most crucial one should be OK.
Would it be possible for you to give it a try with >= 5.34/04?
You will also get a very important set of fixes for the daemon.

Cheers, Gerri

Hi Gerri,
I won’t have time before the beginning of next week, but as soon as I have given it a try I will let you know.
Regards,
Fabian

Hi Gerri,
I finally managed to update my configuration and ran a few tests. So far the problem has not
reappeared. Thanks for looking into it.
Cheers,
Fabian