TProof::Open hangs

Hello all,

I have a PROOF farm with a master and a few workers (using root-5.26 on linuxx8664gcc), which performs well most of the time, but from time to time (let's say every two days) any attempt to connect to the master fails. The daemons show “running” status on all the workers and the master.

When this happens I'm forced to restart the xrootd daemons and everything works again.

From the client

root [1] TProof::Open("master01")
Starting master: opening connection …
Starting master: OK

Info in TProof::Collect: 1 node(s) went in timeout:
Info in TProof::Collect: master01.pic.es

I looked at the master and worker log files and I couldn't see anything there that gives a clue.
The master log shows:

100618 10:33:31 4081 xpd-I: ProofServ::SetAdminPath: creation/assertion of the status path /tmp/.xproofd.1093/activesessions/cosuna.atlifae.10435.status was successful!
100618 10:33:31 4081 xpd-I: cosuna.10435:32@localhost.localdomain: ClientMgr::MapClient: user cosuna logged-in; type: Internal
100618 10:33:47 4081 xpd-I: SchedCron: running regular checks
100618 10:33:56 4081 xpd-I: ProofServCron: 1 sessions are currently active
100618 10:33:56 4081 xpd-I: ProofServCron: next sessions check in 30 secs
100618 10:34:17 4081 xpd-I: SchedCron: running regular checks

and goes on forever…

while from the worker logs it seems that they don't receive any connection from the master:

100618 10:35:19 18089 xpd-I: ProofServCron: 0 sessions are currently active
100618 10:35:19 18089 xpd-I: ProofServCron: next sessions check in 30 secs
100618 10:35:20 18089 xpd-I: SchedCron: running regular checks

(all worker logs look quite similar).

Can anyone tell me what to do in order to get more information to debug this problem?

thanks, carlos

Dear Carlos,

Can you check the status of the network connections with netstat when it gets stuck?
Also, would it be possible for you to use 5-26-00b instead of 5-26-00? It contains an upgraded
xrootd that fixes an issue with the pollers.
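
In the meantime, you could try to collect more information for the next occurrence by raising the verbosity on the client before connecting. A minimal sketch, assuming the standard Proof.DebugLevel rc setting (adjust the level and the master name to your setup):

// Raise PROOF verbosity before opening the session (assumed rc key: Proof.DebugLevel)
gEnv->SetValue("Proof.DebugLevel", 3);
gDebug = 1;                       // extra client-side printout from ROOT itself

TProof *p = TProof::Open("master01");

// If the session does start, the master/worker session logs can be inspected from the client
if (p && p->IsValid())
   TProof::Mgr("master01")->GetSessionLogs()->Display("*");

If Open hangs again this will of course not return; in that case the xproofd log files on the master and workers remain the main source of information.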

G. Ganis

Hello Ganis,

Yes, I checked the status of the connections on all machines, master and workers. The connections on the corresponding ports were alive.
I will upgrade the cluster to 5-26-00b to see if that makes it more stable.

thanks, carlos