Hello all,
I have a PROOF farm with a master and a few workers (using root-5.26 on linuxx8664gcc). It performs well most of the time, but every so often (roughly every two days) any attempt to connect to the master fails. The daemons show "running" status on the master and on all the workers.
When this happens I'm forced to restart the xrootd daemons, after which everything works again.
From the client:
root [1] TProof::Open("master01")
Starting master: opening connection …
Starting master: OK
Info in TProof::Collect: 1 node(s) went in timeout:
Info in TProof::Collect: master01.pic.es
I looked at the master and worker log files and couldn't see anything there that gives a clue.
The master log shows:
100618 10:33:31 4081 xpd-I: ProofServ::SetAdminPath: creation/assertion of the status path /tmp/.xproofd.1093/activesessions/cosuna.atlifae.10435.status was successful!
100618 10:33:31 4081 xpd-I: cosuna.10435:32@localhost.localdomain: ClientMgr::MapClient: user cosuna logged-in; type: Internal
100618 10:33:47 4081 xpd-I: SchedCron: running regular checks
100618 10:33:56 4081 xpd-I: ProofServCron: 1 sessions are currently active
100618 10:33:56 4081 xpd-I: ProofServCron: next sessions check in 30 secs
100618 10:34:17 4081 xpd-I: SchedCron: running regular checks
and it goes on like this forever…
The worker logs, on the other hand, suggest that the workers never receive any connection from the master:
100618 10:35:19 18089 xpd-I: ProofServCron: 0 sessions are currently active
100618 10:35:19 18089 xpd-I: ProofServCron: next sessions check in 30 secs
100618 10:35:20 18089 xpd-I: SchedCron: running regular checks
(all worker logs look quite similar).
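In case it helps anyone compare the logs side by side, here is a minimal Python sketch that pulls the ProofServCron session counts out of log lines. It assumes the lines look exactly like the excerpts above; the function name active_sessions and the regular expression are my own, not part of ROOT or xrootd.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpts in this post:
# 100618 10:33:56 4081 xpd-I: ProofServCron: 1 sessions are currently active
LINE_RE = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2}) (\d+) xpd-I: "
    r"ProofServCron: (\d+) sessions are currently active"
)


def active_sessions(lines):
    """Return (timestamp, pid, session_count) for each ProofServCron report."""
    reports = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%y%m%d %H:%M:%S")
            reports.append((stamp, int(match.group(2)), int(match.group(3))))
    return reports


# Sample lines copied from the master and worker logs quoted above.
master_log = [
    "100618 10:33:47 4081 xpd-I: SchedCron: running regular checks",
    "100618 10:33:56 4081 xpd-I: ProofServCron: 1 sessions are currently active",
]
worker_log = [
    "100618 10:35:19 18089 xpd-I: ProofServCron: 0 sessions are currently active",
    "100618 10:35:20 18089 xpd-I: SchedCron: running regular checks",
]

for name, log in (("master", master_log), ("worker", worker_log)):
    for stamp, pid, count in active_sessions(log):
        print(name, stamp.isoformat(), "pid", pid, "active sessions:", count)
```

Run over the full log files, this shows the times at which the master reports an active session while the workers report none, which is the mismatch described above.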
Can anyone tell me what to do in order to get more information to debug this problem?
Thanks,
Carlos