I’m having an issue with 5.24 sl4_32 binaries:
I run the stressProof test from a remote client, and at some point an input file cannot be accessed. So far so good, but the analysis then gets stuck and I have to manually kill the root session on that client.
And then, when I kill it, the Proof Master dies with this error:
Is this systematic? I mean, can you reproduce it? If yes, can you attach with gdb to the xrootd process, let it crash, and get a backtrace?
The ‘proofserv’ processes are presumably in some long waiting loop (waiting for the unavailable inputs, I guess). They should get cleaned up (killed) the next time you start xrootd.
Yes, it’s reproducible. What I did to get the gdb trace was:
start the xrootd daemon on the master
start gdb and attach it to the daemon’s pid
start the proof session from the client
kill the client’s root.exe
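For anyone who wants to repeat these steps without an interactive gdb session, here is a rough sketch. It assumes the daemon's process name is exactly `xrootd` and that `pgrep` and `gdb` are available on the master; the helper name `gdb_attach_cmd` is made up for this example. It only uses standard gdb options (`-p`, `-batch`, `-ex`).

```shell
#!/bin/sh
# Sketch only: attach gdb to a running xrootd, let it continue, and print
# the backtrace of every thread once the process stops on a signal.

# Print the gdb command line for a given pid, so it can be checked first.
gdb_attach_cmd() {
    printf 'gdb -p %s -batch -ex continue -ex "thread apply all bt"\n' "$1"
}

# Attach only when the script is called with "run" as its first argument.
if [ "${1:-}" = "run" ]; then
    # Oldest process whose name is exactly "xrootd" (assumed daemon name).
    pid=$(pgrep -o -x xrootd) || { echo "xrootd is not running" >&2; exit 1; }
    eval "$(gdb_attach_cmd "$pid")"
fi
```

With gdb attached this way, start the proof session from the client and kill root.exe as in the steps above; gdb should print the backtraces when the SIGSEGV arrives.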
This is the master’s trace:
[code]continue
Continuing.
Detaching after fork from child process 22286.
[New Thread -1236092000 (LWP 22288)]
[New Thread -1236882528 (LWP 22298)]
Detaching after fork from child process 22300.
[New Thread -1239417952 (LWP 22302)]
Detaching after fork from child process 22303.
[New Thread -1240208480 (LWP 22305)]
Detaching after fork from child process 22306.
[New Thread -1240999008 (LWP 22308)]
Detaching after fork from child process 22309.
Detaching after fork from child process 22311.
Detaching after fork from child process 22313.
[New Thread -1241789536 (LWP 22315)]
[New Thread -1242580064 (LWP 22316)]
[New Thread -1243370592 (LWP 22317)]
Detaching after fork from child process 22318.
[New Thread -1244161120 (LWP 22319)]
[New Thread -1246532704 (LWP 22320)]
[New Thread -1247323232 (LWP 22321)]
[New Thread -1248113760 (LWP 22322)]
Detaching after fork from child process 22323.
Detaching after fork from child process 22324.
Detaching after fork from child process 22325.
Detaching after fork from child process 22326.
Detaching after fork from child process 22327.
Detaching after fork from child process 22328.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1236092000 (LWP 22288)]
0xb79dd16b in XrdProofdResponse::Send () from /opt/root/lib/libXrdProofd.so[/code]
It doesn’t mean anything to me, but hopefully it does to you!
I’ve tried with version 5.22.00d and it doesn’t crash. I’ve also tried compiling 5.24 myself and it crashes the same way.
This time I killed the client “too soon” and the master didn’t crash. Try running stressProof.cxx interactively and do a killall root.exe once a progress bar has popped up.
Here’s the backtrace:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1241789536 (LWP 23723)]
0xb797316b in XrdProofdResponse::Send () from /opt/root/lib/libXrdProofd.so
(gdb) where
#0 0xb797316b in XrdProofdResponse::Send () from /opt/root/lib/libXrdProofd.so
#1 0xb794240f in XrdProofdProofServ::SendDataN () from /opt/root/lib/libXrdProofd.so
#2 0xb7968303 in XrdProofdProtocol::SendDataN () from /opt/root/lib/libXrdProofd.so
#3 0xb796bd99 in XrdProofdProtocol::SendMsg () from /opt/root/lib/libXrdProofd.so
#4 0xb796c5fc in XrdProofdProtocol::Process2 () from /opt/root/lib/libXrdProofd.so
#5 0xb796d6d2 in XrdProofdProtocol::Process () from /opt/root/lib/libXrdProofd.so
#6 0x0807647a in XrdLink::DoIt ()
#7 0x0807a83f in XrdScheduler::Run ()
#8 0x0807a98b in XrdStartWorking ()
#9 0x08088936 in XrdSysThread_Xeq ()
#10 0x41fb93cc in start_thread () from /lib/tls/libpthread.so.0
#11 0x41db419e in clone () from /lib/tls/libc.so.6
It will take me some time… we now have 5.22d working and it needs to go into production. However, I plan to set up a preproduction environment in the short term, and I’ll try your patch there.