Random worker deaths

Hi,

I’m using PROOF with xrootd. The ROOT version on both the client and the remote machines is the current trunk (5.29/1). The configuration consists of 7 workers and 1 master on the same physical machine.

The problem is: some workers eventually die while processing a dataset. Sometimes it happens in 2 minutes, sometimes in 10, but it always happens.

All the dead workers have the following in the log:

110222 20:46:29 8684 xpd-I: ProofServMgr::Create: srvtype = 0
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserOwnerships: enter
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserOwnerships: done
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserEnvironment: enter
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserEnvironment: done
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: psid: 2, log: 0
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: ROOT dir: /home/proof/root
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: session rootrc file: /home/proof/myproof/workdir/proof/session-prfserver01-1298378788-9104/worker-0.1-prfserver01-1298378789-9117.rootrc
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: environment file: /home/proof/myproof/workdir/proof/session-prfserver01-1298378788-9104/worker-0.1-prfserver01-1298378789-9117.env
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: creating symlink
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: done
20:46:47  9117 Wrk-0.1 | Info in <TEventIterTree::GetTrees>: the tree cache is in learning phase
20:46:50  9117 Wrk-0.1 | Info in <TXProofServ::RestartComputeTime>: compute time restarted after 0.349464 secs (100 entries)
110222 20:47:16 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:16 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:16 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:16 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:17 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:17 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:18 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:18 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:19 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:19 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:20 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:20 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:21 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:21 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:22 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:22 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:23 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:23 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:24 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:24 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:25 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:25 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:25 001 Proofx-E: Conn::SendReq: max number of retries reached - Abort
prfserver01.ihep.ac.cn: SendMsg: INT: session is reconnecting: retry later
20:47:25  9117 Wrk-0.1 | Error in <TXSocket::SendRaw>: prfserver01.ihep.ac.cn: problems sending 153 bytes to server
20:47:25  9117 Wrk-0.1 | Error in <TXProofServ::GetNextPacket>: Send() failed, returned -1
110222 20:47:26 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:26 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later

...

xrootd.log:

110222 20:47:57 8684 xpd-I: proof.9127:55@localhost.localdomain: Protocol::Process: sid: 1, req id: 3112 (XP_sendmsg), dlen: 8066
110222 20:47:57 8684 xpd-I: proof.9127:55@localhost.localdomain: Protocol::Process2: req id: 3112 (XP_sendmsg)
110222 20:47:57 8684 xpd-E: ProofServ::SendData: client ID not found (cid: 0, size: 0)
110222 20:47:57 8684 xpd-E: proof.9127:55@localhost.localdomain: Protocol::SendData: INT: client ID: 0, problems sending: 8066 bytes to client
110222 20:47:57 8684 xpd-I: 0100 proof.9127:55@localhost.localdomain: xrd->0.6: Response::Send:12: sending err 3114: SendMsg: INT: session is reconnecting: retry later

...

Given the randomness, it seems unlikely that the problem is related to my analysis code. The same happens with the stable 5.28 release.


Best wishes,
Eugeny Boger, JINR Dubna

Hi,

could you run the server with the option ‘-d’ and report the part of the log around the time the problems with the worker start (in your example the worker starts having problems at 20:47:16, while the server log excerpt is from around 20:47:57)?

It may help to set

proof->SetLogLevel(2)

for the run, to have some more verbosity in the session logs.
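For example (just a sketch; the connection string below is the one from your setup):

TProof *proof = TProof::Open( "proof@prfserver01:2093" );
if (proof) proof->SetLogLevel(2);   // more verbose entries in the session logs
// ... then submit the query as usual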

G. Ganis

Hi,

It’s become much worse. Now every single session I try to start hangs while opening connections to the workers, stalling at 5 out of 7; sometimes it’s 3 out of 7. But it works well when I run the client under gdb, so it must be some timing issue. Steps to reproduce:

On master:
killall  -9 xrootd
killall  -9 proofserv.exe
rm -rf /tmp/.xproofd.2093/
rm -rf workdir/
xrootd -d -b -l xrootd.log -R proof -c xrootd.cfg

On client:

eugenyboger@lxslc18:~/panfs/bean/workdir$ killall  -9 bean.exe
bean.exe: no process killed
eugenyboger@lxslc18:~/panfs/bean/workdir$ ./bean.exe  -p "proof@prfserver01:2093" -g -u UserTest ds://test
Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: 1 out of 7 (14 %)

And after a while:

Info in <TProof::Collect>:  1 node(s) went in timeout:
Info in <TProof::Collect>:    prfserver01.ihep.ac.cn

You see, now it’s only 1 out of 7.

Attached is the xrootd.log.
xrootd.log.gz (17.6 KB)

Hi,

In your log there is certainly an issue with a process trying to reconnect.
Did your cleanup operations fully succeed? That is, did you check the result after issuing

killall  -9 xrootd
killall  -9 proofserv.exe
rm -rf /tmp/.xproofd.2093/
rm -rf workdir/

Can you also check the status of possible existing connections with ‘netstat -tap’?

G. Ganis

Sure.

There are no connections on either the client or the master after the cleanup.
I’ve repeated all the steps again:

Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: 5 out of 7 (71 %)

At this moment there is one connection on the client:

eugenyboger@lxslc18:~$ netstat -tap -p | grep prf
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 lxslc18.ihep.ac.cn:54306    prfserver01.ihep.ac.cn:2093 ESTABLISHED 12003/bean.exe      

On the master:

[proof@prfserver01 myproof]$ netstat -tap -p | grep lx
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 prfserver01.ihep.ac.:nbx-cc lxslc18.ihep.ac.cn:54306    ESTABLISHED 21590/xrootd        
[proof@prfserver01 myproof]$ netstat -tap -p | grep  xrootd
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 *:nbx-cc                    *:*                         LISTEN      21590/xrootd        
tcp        0      0 *:nbx-au                    *:*                         LISTEN      21590/xrootd        
tcp        0      0 prfserver01.ihep.ac.:nbx-cc prfserver01.ihep.ac.c:57389 ESTABLISHED 21590/xrootd        
tcp        0      0 prfserver01.ihep.ac.:nbx-cc lxslc18.ihep.ac.cn:54306    ESTABLISHED 21590/xrootd        

Hi,

It seems that the problem was caused by

TProofMgr * manager = TProofMgr::Create( clr.c_str() );

followed by

TProof::Open(clr.c_str() );

I’m just wondering why this might cause problems. I believed that manager instances were somehow cached.
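To illustrate what I expected (just a sketch; clr holds the same connection string as in my code):

const std::string clr = "proof@prfserver01:2093";
TProofMgr *m1 = TProofMgr::Create( clr.c_str() );
TProofMgr *m2 = TProofMgr::Create( clr.c_str() );
// what I expected: m2 comes back as the same cached instance as m1,
// and a later TProof::Open( clr.c_str() ) reuses it as well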

Hi,

You mean that TProof::Open(…) did not find the instance previously created by TProofMgr::Create(…)?
Uhm … this may be an issue with the matching …

Gerri

Not really.

I mean that TProof::Open just waits forever after 2 or 3 workers have connected. The problem disappears when I run the client under gdb or remove the TProofMgr call before TProof::Open.

UPDATE:
My fault. The code actually was

TProofMgr * manager = TProofMgr::Create( clr.c_str() );
<...>
delete manager;
<...>
proof = TProof::Open(clr_open.c_str(), param.c_str());

which does not seem to be correct.
Oddly, the same code works well on another machine, even with this “delete”.
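
For the record, a sketch of what looks like the safer pattern here (clr, clr_open, and param are the same variables as above; either drop the TProofMgr call entirely, or keep the manager alive instead of deleting it right before the Open):

TProofMgr * manager = TProofMgr::Create( clr.c_str() );
<...>
// no ‘delete manager;’ at this point: deleting the manager right
// before TProof::Open is what seemed to trigger the stalled workers
proof = TProof::Open( clr_open.c_str(), param.c_str() );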