Master crashes after client kill

Hi,

I’m having an issue with 5.24 sl4_32 binaries:
I run the stressProof test from a remote client, and at some point an input file cannot be accessed. So far so good, but the analysis then gets stuck and I have to manually kill the root session on that client.

And then, when I kill it, the Proof Master dies with this error:

[code]090929 11:13:57 17929 xpd-I: pfernandez.16496:34@ui04: Protocol::recycle: user pfernandez disconnected; type: ClientMaster

*** Break *** segmentation violation
[/code]

Some more info: when the xrootd daemon dies, it leaves these two processes orphaned:

18470 ? Sl 0:01 /opt/root/bin/proofserv.exe proofserv xpd xpdpath:/tmp/.xproofd.1093 0
18481 ? Sl 0:02 /opt/root/bin/proofserv.exe proofslave xpd xpdpath:/tmp/.xproofd.1093 0

Does anyone have an idea on what’s going on?

Thanks a lot in advance,
Pablo

Dear Pablo,

You mean the master xrootd dies, right?

Is this systematic? I mean, can you reproduce it? If yes, can you attach with gdb to the xrootd process, let it crash, and get a backtrace?

The ‘proofserv’ processes are presumably in some long waiting loop (waiting for the unavailable inputs, I guess). They should get cleaned up (killed) the next time you start xrootd.
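
To check whether any such leftovers are still around before or after restarting xrootd, something like the following could be used. This is only a sketch: the function name `orphaned_proofserv` is made up for illustration, and it assumes that an orphaned process has been re-parented to init (PPID 1), which is the usual behaviour but not guaranteed on every system.

```shell
#!/bin/sh
# Hypothetical helper: read "PID PPID ARGS" lines on stdin and print the PIDs
# of proofserv.exe processes whose parent is init (PPID 1).
# Assumption: orphaned processes are re-parented to PID 1.
orphaned_proofserv() {
    awk '$2 == 1 && /proofserv\.exe/ { print $1 }'
}

# Usage (not run here):
#   ps -eo pid,ppid,args | orphaned_proofserv
```

The filter only lists PIDs; whether to kill them by hand or let the next xrootd start clean them up is a separate decision.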

Gerri

Hi,

Yes, it’s reproducible. What I did to get the gdb trace was:

  • start the xrootd daemon on the master
  • start gdb and attach it to the daemon’s pid
  • start the proof session from the client
  • kill the client’s root.exe
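
The attach steps above can also be scripted so that the call stack is printed automatically when the daemon stops on a signal. This is a minimal sketch, not the procedure used in this thread: the helper name `gdb_attach_cmd` is made up, and it only builds the command line from standard gdb options (`-p`, `-batch`, `-ex`) so that it can be inspected before being run.

```shell
#!/bin/sh
# Hypothetical helper: print a gdb command line that attaches to the given
# PID, resumes the process, and dumps the backtrace of every thread once the
# process stops (for example on SIGSEGV). In batch mode gdb exits afterwards.
gdb_attach_cmd() {
    pid="$1"
    printf 'gdb -p %s -batch -ex continue -ex "thread apply all bt"\n' "$pid"
}

# Usage (not run here; assumes a single xrootd process on the master):
#   eval "$(gdb_attach_cmd "$(pgrep -o -x xrootd)")"
```

The gdb process has to be running before the client is killed, otherwise the daemon crashes without anything attached to catch the signal.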

This is the master’s trace:

[code](gdb) continue
Continuing.
Detaching after fork from child process 22286.
[New Thread -1236092000 (LWP 22288)]
[New Thread -1236882528 (LWP 22298)]
Detaching after fork from child process 22300.
[New Thread -1239417952 (LWP 22302)]
Detaching after fork from child process 22303.
[New Thread -1240208480 (LWP 22305)]
Detaching after fork from child process 22306.
[New Thread -1240999008 (LWP 22308)]
Detaching after fork from child process 22309.
Detaching after fork from child process 22311.
Detaching after fork from child process 22313.
[New Thread -1241789536 (LWP 22315)]
[New Thread -1242580064 (LWP 22316)]
[New Thread -1243370592 (LWP 22317)]
Detaching after fork from child process 22318.
[New Thread -1244161120 (LWP 22319)]
[New Thread -1246532704 (LWP 22320)]
[New Thread -1247323232 (LWP 22321)]
[New Thread -1248113760 (LWP 22322)]
Detaching after fork from child process 22323.
Detaching after fork from child process 22324.
Detaching after fork from child process 22325.
Detaching after fork from child process 22326.
Detaching after fork from child process 22327.
Detaching after fork from child process 22328.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1236092000 (LWP 22288)]
0xb79dd16b in XrdProofdResponse::Send () from /opt/root/lib/libXrdProofd.so[/code]

It doesn’t tell me anything, but I hope it does to you!

I’ve tried with version 5.22.00d and it doesn’t crash. I’ve also tried compiling 5.24 myself and it crashes the same way.

BR/Pablo

Hi,

I am not able to reproduce it: when I kill the client, the daemon just reports the disconnection.

Can you type ‘where’ or ‘bt’ in gdb when you get the segv, so that we get all the calling stack?

Also, can you post the configuration file?

Gerri

At first I killed the client “too soon” and the master didn’t crash. Try running stressProof.cxx interactively and run killall root.exe once the progress bar has popped up.

Here’s the backtrace:

[code]Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1241789536 (LWP 23723)]
0xb797316b in XrdProofdResponse::Send () from /opt/root/lib/libXrdProofd.so
(gdb) where
#0 0xb797316b in XrdProofdResponse::Send () from /opt/root/lib/libXrdProofd.so
#1 0xb794240f in XrdProofdProofServ::SendDataN () from /opt/root/lib/libXrdProofd.so
#2 0xb7968303 in XrdProofdProtocol::SendDataN () from /opt/root/lib/libXrdProofd.so
#3 0xb796bd99 in XrdProofdProtocol::SendMsg () from /opt/root/lib/libXrdProofd.so
#4 0xb796c5fc in XrdProofdProtocol::Process2 () from /opt/root/lib/libXrdProofd.so
#5 0xb796d6d2 in XrdProofdProtocol::Process () from /opt/root/lib/libXrdProofd.so
#6 0x0807647a in XrdLink::DoIt ()
#7 0x0807a83f in XrdScheduler::Run ()
#8 0x0807a98b in XrdStartWorking ()
#9 0x08088936 in XrdSysThread_Xeq ()
#10 0x41fb93cc in start_thread () from /lib/tls/libpthread.so.0
#11 0x41db419e in clone () from /lib/tls/libc.so.6[/code]

This is the config… it’s really simple:

[code]xrd.protocol xproofd:1093 libXrdProofd.so

xpd.tmp /var/proof/tmp
xpd.workdir /var/proof
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker worker[01-03]
xpd.worker worker proof
xpd.worker worker proof
xpd.worker worker proof
xpd.worker worker proof
xpd.worker worker proof
xpd.worker worker proof
xpd.worker master proof

xpd.allow proof.ft.uam.es
if proof.ft.uam.es
xpd.role any
else
xpd.role worker
fi[/code]

Thanks,
BR/Pablo

Dear Pablo,

I have uploaded a patch to the trunk and to 5-24-00-patches which could/should fix this problem.
Are you in a position to try the fix out?

Gerri

Hi Gerri,

It will take me some time… we now have 5.22d working and it needs to be in production. However, I plan to set up a preproduction environment in the short term, and I’ll try your patch there.

Kind regards,
Pablo