Hi,
I have attempted to set up a PROOF cluster using two desktop computers. Both have ROOT (5.34/36) and xrootd (4.7.1) installed. I have also installed PoD (3.16) on both machines and set it up, apparently successfully, according to the instructions at http://pod.gsi.de/.
I am able to start the PoD server, submit workers using pod-ssh, and verify their status:
rafal@120-D11:~$ pod-server start
Starting PoD server...
updating xproofd configuration file...
starting xproofd...
starting PoD agent...
preparing PoD worker package...
selecting pre-compiled bins to be added to worker package...
PoD worker package: /home/rafal/.PoD/wrk/PoDWorker.sh
------------------------
XPROOFD [19462] port: 21002
PoD agent [19492] port: 22002
PROOF connection string: rafal@120-D11:21002
------------------------
rafal@120-D11:~$ pod-ssh -c /usr/pod_ssh.cfg submit --debug
** [Thu, 09 Nov 2017 23:59:14 +0100] preparing PoD worker package...
** [Thu, 09 Nov 2017 23:59:14 +0100] selecting pre-compiled bins to be added to worker package...
** [Thu, 09 Nov 2017 23:59:14 +0100] PoD worker package: /home/rafal/.PoD/wrk/PoDWorker.sh
** [Thu, 09 Nov 2017 23:59:14 +0100] pod-ssh config contains an inline shell script. It will be injected it into wrk. package
** [Thu, 09 Nov 2017 23:59:14 +0100] preparing PoD worker package...
** [Thu, 09 Nov 2017 23:59:14 +0100] inline shell script is found and will be added to the package...
** [Thu, 09 Nov 2017 23:59:14 +0100] selecting pre-compiled bins to be added to worker package...
** [Thu, 09 Nov 2017 23:59:14 +0100] PoD worker package: /home/rafal/.PoD/wrk/PoDWorker.sh
** [Thu, 09 Nov 2017 23:59:14 +0100] There are 5 threads in the tread-pool.
** [Thu, 09 Nov 2017 23:59:14 +0100] Number of PoD workers: 2
** [Thu, 09 Nov 2017 23:59:14 +0100] Number of PROOF workers: 10
** [Thu, 09 Nov 2017 23:59:14 +0100] Workers list:
** [Thu, 09 Nov 2017 23:59:14 +0100] [kompRafal] with 6 workers at rafal@120-D11:/tmp/kompRafal
** [Thu, 09 Nov 2017 23:59:14 +0100] [kompStar] with 4 workers at rafal@star:/tmp/kompStar
kompRafal [czw, 09 lis 2017 23:59:14 +0100] pod-ssh-submit-worker is started for rafal@120-D11 (dir: /tmp/kompRafal, nworkers: 6, sshopt: -p 22)
kompStar [czw, 09 lis 2017 23:59:14 +0100] pod-ssh-submit-worker is started for rafal@star (dir: /tmp/kompStar, nworkers: 4, sshopt: -p 22)
** [Thu, 09 Nov 2017 23:59:15 +0100]
*******************
Successfully processed tasks: 2
Failed tasks: 0
*******************
rafal@120-D11:~$ pod-ssh status
PoD worker "kompRafal": RUN
PoD worker "kompStar": RUN
However, when I open ROOT and create a TProof object:
TProof *p = TProof::Open("pod://")
I get the following:
Starting master: opening connection ...
Starting master: OK
Opening connections to workers: OK (2 workers)
Note: File "iostream" already loaded
171109 23:59:43 20079 Proofx-E: Conn::Connect: failed to connect to proof://rafal:default@localhost:20000//
171109 23:59:43 20079 Proofx-E: XrdProofConn: XrdProofConn: severe error occurred while opening a connection to server [localhost:20000]
23:59:43 20079 Mst-0 | Warning in <TProof::AddWorkers>: worker '0.0' is invalid
171109 23:59:51 20079 Proofx-E: Conn::Connect: failed to connect to proof://rafal:default@localhost:20001//
171109 23:59:51 20079 Proofx-E: XrdProofConn: XrdProofConn: severe error occurred while opening a connection to server [localhost:20001]
23:59:51 20079 Mst-0 | Warning in <TProof::AddWorkers>: worker '0.1' is invalid
PROOF set to sequential mode
(class TProof*)0x2714960
root [1] *** No workers left: cannot continue! Terminating ... ***
| session: rafal.default.20079.status terminated by peer
Info in <TXSlave::HandleError>: 0x27f5f40:120-D11:0 got called ... fProof: 0x2714960, fSocket: 0x27f6170 (valid: 1)
Info in <TXSlave::HandleError>: 0x27f5f40: proof: 0x2714960
TXSlave::HandleError: 0x27f5f40: DONE ...
Info in <TProof::MarkBad>:
+++ Message from local session : marking 120-D11:21002 (0) as bad
+++ Reason: received kPROOF_FATAL
+++ Message from local session : marking 120-D11:21002 (0) as bad
+++ Reason: received kPROOF_FATAL
+++ Most likely your code crashed
+++ Please check the session logs for error messages either using
+++ the 'Show logs' button or executing
+++
+++ root [] TProof::Mgr("rafal@120-D11:21002")->GetSessionLogs()->Display("*")
Info in <TXSocket::Reconnect>: 0x270e310: reconnection attempts explicitly disabled!
I would be extremely grateful for any help in solving this problem. I can provide additional logs or file contents if needed.
Best regards,
Rafal