Xproofd keeps crashing on proof slaves

Dear Expert:

Recently on the proof WNs I see the following error message:

[root@valtical09 tmp]# service proofd status
xproofd dead but subsys locked

I removed /var/lock/subsys/xproofd and restarted the service, but after some time the daemon crashes again.

I removed all of the following directories on both the WNs and the proof master:

  1. /localdisk/proofbox/
  2. /tmp/.xproofd.1093

Then I restarted the proofd daemon on all machines, but after some time proofd on the WNs crashed again, while it keeps working on the proof master.

From the logs on the proof WN I see a lot of connection error messages:
130508 19:29:28 6560 xpd-E: NetMgr::Broadcast: problems sending request to qing@valtical06.cern.ch:1093
130508 19:29:28 6560 Xrd: Connect: can’t open connection to [valtical07.cern.ch:1093]
130508 19:29:28 6560 xpd-E: Conn::Connect: failed to connect to qing@valtical07.cern.ch:1093//
130508 19:29:28 6560 xpd-E: XrdProofConn: XrdProofConn: severe error occurred while opening a connection to server [valtical07.cern.ch:1093]

From the logs on the proof master I also see a lot of connection error messages:

130508 17:23:36 9386 xpd-E: Conn::SendRecv: reading msg from connmgr (server [valtical05.cern.ch:1093])
130508 17:24:01 9386 xpd-I: ProofServCron: 1 sessions are currently active
130508 17:24:01 9386 xpd-I: ProofServCron: next sessions check in 30 secs
130508 17:24:04 9386 xpd-I: SchedCron: running regular checks
130508 17:24:07 9386 xpd-E: Conn::SendRecv: reading msg from connmgr (server [valtical05.cern.ch:1093])
130508 17:24:07 9386 xpd-E: Conn::SendReq: max number of retries reached - Abort
130508 17:24:07 9386 xpd-E: Conn::GetAccessToSrv: client could not login at [valtical05.cern.ch:1093]
130508 17:24:07 9386 xpd-E: Conn::Connect: access to server failed ()
130508 17:24:07 9386 xpd-E: Conn::Connect: failed to connect to ruanxf@valtical05.cern.ch:1093//
130508 17:24:07 9386 xpd-E: XrdProofConn: XrdProofConn: severe error occurred while opening a connection to server [valtical05.cern.ch:1093]
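
To see which hosts these errors concentrate on, the error lines can be tallied by target. What follows is only a minimal sketch in Python, assuming the log format shown above; `failed_targets` and the `sample` list are illustrative names, not part of xproofd.

```python
import re
from collections import Counter

# Matches the "[host:port]" target that appears in the error lines above.
TARGET = re.compile(r"\[([\w.-]+):(\d+)\]")


def failed_targets(lines):
    """Count xpd-E (error) lines per (host, port) target."""
    counts = Counter()
    for line in lines:
        if "xpd-E:" not in line:
            continue  # skip informational (xpd-I) and other lines
        match = TARGET.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# A few lines copied from the logs quoted in this post.
sample = [
    "130508 17:23:36 9386 xpd-E: Conn::SendRecv: reading msg from connmgr (server [valtical05.cern.ch:1093])",
    "130508 17:24:01 9386 xpd-I: ProofServCron: 1 sessions are currently active",
    "130508 17:24:07 9386 xpd-E: Conn::GetAccessToSrv: client could not login at [valtical05.cern.ch:1093]",
    "130508 19:29:28 6560 xpd-E: XrdProofConn: XrdProofConn: severe error occurred while opening a connection to server [valtical07.cern.ch:1093]",
]

for (host, port), n in failed_targets(sample).most_common():
    print(f"{host}:{port} {n}")
```

For a real log, the same function can be fed an open file object instead of the sample list. Lines without a bracketed target, such as the NetMgr::Broadcast message, are not counted by this sketch.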

The network connection between these machines should not be a problem, though. Any idea where the problem is?
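
A quick way to verify that assumption is to try opening a TCP connection to port 1093 from each node to the others. The following is a minimal sketch in Python; `can_connect` is an illustrative name, not part of PROOF or xrootd, and the port number is simply the one that appears in the logs. It only tests that the port accepts TCP connections, not that the xproofd login handshake succeeds.

```python
import socket
import sys

XPROOFD_PORT = 1093  # assumption: the port shown in the log lines of this post


def can_connect(host, port=XPROOFD_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hosts are taken from the command line; nothing is probed by default.
    for host in sys.argv[1:]:
        state = "reachable" if can_connect(host) else "NOT reachable"
        print(f"{host}:{XPROOFD_PORT} {state}")
```

Saved under a name of your choice, it could be run on each node with the other hosts as arguments, for example valtical05.cern.ch valtical06.cern.ch valtical07.cern.ch. If the port is reachable while xproofd still reports login failures, the problem is more likely in the daemon than in the network.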

Cheers,
Gang

Hi,

Which version was this with? 5.34.01?
Can you check if moving to 5.34.07 helps?
There have been significant improvements in daemon-related issues from 5.34.05 on …

G. Ganis