Hi,
I’m using PROOF with xrootd. The ROOT version on both the client and the remote machines is the current trunk (5.29/1). The configuration consists of 7 workers and 1 master on the same physical machine.
The problem is that some workers eventually die while processing a dataset. Sometimes it happens after 2 minutes, sometimes after 10, but it always happens.
All the dead workers have the following in the log:
110222 20:46:29 8684 xpd-I: ProofServMgr::Create: srvtype = 0
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserOwnerships: enter
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserOwnerships: done
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserEnvironment: enter
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserEnvironment: done
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: psid: 2, log: 0
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: ROOT dir: /home/proof/root
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: session rootrc file: /home/proof/myproof/workdir/proof/session-prfserver01-1298378788-9104/worker-0.1-prfserver01-1298378789-9117.rootrc
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: environment file: /home/proof/myproof/workdir/proof/session-prfserver01-1298378788-9104/worker-0.1-prfserver01-1298378789-9117.env
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: creating symlink
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: done
20:46:47 9117 Wrk-0.1 | Info in <TEventIterTree::GetTrees>: the tree cache is in learning phase
20:46:50 9117 Wrk-0.1 | Info in <TXProofServ::RestartComputeTime>: compute time restarted after 0.349464 secs (100 entries)
110222 20:47:16 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:16 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:16 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:16 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:17 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:17 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:18 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:18 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:19 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:19 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:20 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:20 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:21 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:21 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:22 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:22 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:23 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:23 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:24 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:24 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:25 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:25 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:25 001 Proofx-E: Conn::SendReq: max number of retries reached - Abort
prfserver01.ihep.ac.cn: SendMsg: INT: session is reconnecting: retry later
20:47:25 9117 Wrk-0.1 | Error in <TXSocket::SendRaw>: prfserver01.ihep.ac.cn: problems sending 153 bytes to server
20:47:25 9117 Wrk-0.1 | Error in <TXProofServ::GetNextPacket>: Send() failed, returned -1
110222 20:47:26 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:26 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
...
xrootd.log:
110222 20:47:57 8684 xpd-I: proof.9127:55@localhost.localdomain: Protocol::Process: sid: 1, req id: 3112 (XP_sendmsg), dlen: 8066
110222 20:47:57 8684 xpd-I: proof.9127:55@localhost.localdomain: Protocol::Process2: req id: 3112 (XP_sendmsg)
110222 20:47:57 8684 xpd-E: ProofServ::SendData: client ID not found (cid: 0, size: 0)
110222 20:47:57 8684 xpd-E: proof.9127:55@localhost.localdomain: Protocol::SendData: INT: client ID: 0, problems sending: 8066 bytes to client
110222 20:47:57 8684 xpd-I: 0100 proof.9127:55@localhost.localdomain: xrd->0.6: Response::Send:12: sending err 3114: SendMsg: INT: session is reconnecting: retry later
...
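In case it helps to compare with other logs: a worker log can be scanned for the number of "retry later" replies seen before the abort. This is only a minimal sketch. The function name `count_retries_before_abort` is my own, it is not part of ROOT or xrootd, and it assumes the message texts are exactly those in the worker log above.

```python
# Message fragments copied from the worker log above (assumed stable).
CHECK = "Conn::CheckErrorStatus"
RETRY = "session is reconnecting: retry later"
ABORT = "max number of retries reached - Abort"


def count_retries_before_abort(lines):
    """Count CheckErrorStatus 'retry later' lines seen before the first abort.

    Returns None if the log contains no abort line.
    """
    retries = 0
    for line in lines:
        if ABORT in line:
            return retries
        if CHECK in line and RETRY in line:
            retries += 1
    return None


if __name__ == "__main__":
    import sys

    # Usage: python count_retries.py <worker log file>
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            print(count_retries_before_abort(log))
```

Only the `CheckErrorStatus` lines are counted, so the paired `CheckResp` lines do not double the result.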
Because the failures occur at random times, it is unlikely that the problem is related to my analysis code. The same happens with 5.28 stable.
–
Best wishes,
Eugeny Boger, JINR Dubna