Random worker deaths

Hi,

I’m using PROOF with xrootd. The ROOT version on both the client and the remote machines is the current trunk (5.29/1). The configuration consists of 7 workers and 1 master on the same physical machine.

The problem is: some workers eventually die while processing a dataset. Sometimes it happens in 2 minutes, sometimes in 10, but it always happens.

All the dead workers have the following in the log:

110222 20:46:29 8684 xpd-I: ProofServMgr::Create: srvtype = 0
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserOwnerships: enter
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserOwnerships: done
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserEnvironment: enter
110222 20:46:29 8684 xpd-I: ProofServMgr::SetUserEnvironment: done
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: psid: 2, log: 0
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: ROOT dir: /home/proof/root
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: session rootrc file: /home/proof/myproof/workdir/proof/session-prfserver01-1298378788-9104/worker-0.1-prfserver01-1298378789-9117.rootrc
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: environment file: /home/proof/myproof/workdir/proof/session-prfserver01-1298378788-9104/worker-0.1-prfserver01-1298378789-9117.env
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: creating symlink
110222 20:46:29 8684 xpd-I: ProofServMgr::SetProofServEnv: done
20:46:47  9117 Wrk-0.1 | Info in <TEventIterTree::GetTrees>: the tree cache is in learning phase
20:46:50  9117 Wrk-0.1 | Info in <TXProofServ::RestartComputeTime>: compute time restarted after 0.349464 secs (100 entries)
110222 20:47:16 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:16 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:16 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:16 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:17 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:17 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:18 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:18 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:19 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:19 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:20 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:20 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:21 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:21 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:22 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:22 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:23 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:23 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:24 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:24 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:25 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:25 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110222 20:47:25 001 Proofx-E: Conn::SendReq: max number of retries reached - Abort
prfserver01.ihep.ac.cn: SendMsg: INT: session is reconnecting: retry later
20:47:25  9117 Wrk-0.1 | Error in <TXSocket::SendRaw>: prfserver01.ihep.ac.cn: problems sending 153 bytes to server
20:47:25  9117 Wrk-0.1 | Error in <TXProofServ::GetNextPacket>: Send() failed, returned -1
110222 20:47:26 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110222 20:47:26 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later

...

xrootd.log:

110222 20:47:57 8684 xpd-I: proof.9127:55@localhost.localdomain: Protocol::Process: sid: 1, req id: 3112 (XP_sendmsg), dlen: 8066
110222 20:47:57 8684 xpd-I: proof.9127:55@localhost.localdomain: Protocol::Process2: req id: 3112 (XP_sendmsg)
110222 20:47:57 8684 xpd-E: ProofServ::SendData: client ID not found (cid: 0, size: 0)
110222 20:47:57 8684 xpd-E: proof.9127:55@localhost.localdomain: Protocol::SendData: INT: client ID: 0, problems sending: 8066 bytes to client
110222 20:47:57 8684 xpd-I: 0100 proof.9127:55@localhost.localdomain: xrd->0.6: Response::Send:12: sending err 3114: SendMsg: INT: session is reconnecting: retry later

...

Given the randomness, it seems unlikely that the problem is related to my analysis code. The same happens with the stable 5.28 release.


Best wishes,
Eugeny Boger, JINR Dubna

Hi,

could you run the server with the option ‘-d’ and report the part of the log around the time the problems with the worker start (in your example the worker starts having problems at 20:47:16, while the server log excerpt is from around 20:47:57)?

It may help to set

proof->SetLogLevel(2)

for the run, to have some more verbosity in the session logs.
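For example (just a sketch; the connection string below is the one from your setup):

TProof *proof = TProof::Open( "proof@prfserver01:2093" );
if (proof) proof->SetLogLevel(2);   // more verbose entries in the session logs
// ... then submit the query as usual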

G. Ganis

Hi,

It’s become much worse. Now every single session I try to start hangs while opening connections to the workers, stalling at 5 out of 7; sometimes it’s 3 out of 7. But it works well when I run the client under gdb, so it must be some timing issue. Steps to reproduce:

On master:
killall  -9 xrootd
killall  -9 proofserv.exe
rm -rf /tmp/.xproofd.2093/
rm -rf workdir/
xrootd -d -b -l xrootd.log -R proof -c xrootd.cfg

On client:

eugenyboger@lxslc18:~/panfs/bean/workdir$ killall  -9 bean.exe
bean.exe: no process killed
eugenyboger@lxslc18:~/panfs/bean/workdir$ ./bean.exe  -p "proof@prfserver01:2093" -g -u UserTest ds://test
Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: 1 out of 7 (14 %)

And after a while:

Info in <TProof::Collect>:  1 node(s) went in timeout:
Info in <TProof::Collect>:    prfserver01.ihep.ac.cn

You see, now it’s only 1 out of 7.

Attached is the xrootd.log.
xrootd.log.gz (17.6 KB)

Hi,

In your log there is certainly an issue with a process trying to reconnect.
Did your cleanup operations fully succeed? That is, did you check the result after issuing

killall  -9 xrootd
killall  -9 proofserv.exe
rm -rf /tmp/.xproofd.2093/
rm -rf workdir/

Can you also check the status of possible existing connections with ‘netstat -tap’?

G. Ganis

Sure.

There are no connections on either the client or the master after the cleanup.
I’ve repeated all the steps again:

Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: 5 out of 7 (71 %)

At this moment there is one connection on the client:

eugenyboger@lxslc18:~$ netstat -tap -p | grep prf
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 lxslc18.ihep.ac.cn:54306    prfserver01.ihep.ac.cn:2093 ESTABLISHED 12003/bean.exe      

On the master:

[proof@prfserver01 myproof]$ netstat -tap -p | grep lx
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 prfserver01.ihep.ac.:nbx-cc lxslc18.ihep.ac.cn:54306    ESTABLISHED 21590/xrootd        
[proof@prfserver01 myproof]$ netstat -tap -p | grep  xrootd
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 *:nbx-cc                    *:*                         LISTEN      21590/xrootd        
tcp        0      0 *:nbx-au                    *:*                         LISTEN      21590/xrootd        
tcp        0      0 prfserver01.ihep.ac.:nbx-cc prfserver01.ihep.ac.c:57389 ESTABLISHED 21590/xrootd        
tcp        0      0 prfserver01.ihep.ac.:nbx-cc lxslc18.ihep.ac.cn:54306    ESTABLISHED 21590/xrootd        

Hi,

It seems that the problem was caused by

TProofMgr * manager = TProofMgr::Create( clr.c_str() );

followed by

TProof::Open(clr.c_str() );

I’m just wondering why this might cause problems. I believed that manager instances were somehow cached.
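To illustrate what I expected (just a sketch; clr holds the same connection string as in my code):

const std::string clr = "proof@prfserver01:2093";
TProofMgr *m1 = TProofMgr::Create( clr.c_str() );
TProofMgr *m2 = TProofMgr::Create( clr.c_str() );
// what I expected: m2 comes back as the same cached instance as m1,
// and a later TProof::Open( clr.c_str() ) reuses it as well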

Hi,

You mean that TProof::Open(…) did not find the instance previously created by TProofMgr::Create(…)?
Uhm … this may be an issue with the matching …

Gerri

Not really.

I mean that TProof::Open just waits forever after 2 or 3 workers have connected. The problem disappears when I run the client under gdb or remove the TProofMgr call before TProof::Open.

UPDATE:
My fault. The code actually was

TProofMgr * manager = TProofMgr::Create( clr.c_str() );
<...>
delete manager;
<...>
proof = TProof::Open(clr_open.c_str(), param.c_str());

which does not seem to be correct.
Oddly, the same code works well on another machine, even with this “delete”.
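
For the record, a sketch of what looks like the safer pattern here (clr, clr_open, and param are the same variables as above; either drop the TProofMgr call entirely, or keep the manager alive instead of deleting it right before the Open):

TProofMgr * manager = TProofMgr::Create( clr.c_str() );
<...>
// no ‘delete manager;’ at this point: deleting the manager right
// before TProof::Open is what seemed to trigger the stalled workers
proof = TProof::Open( clr_open.c_str(), param.c_str() );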