Problem setting up PoD with ssh

Hi,
I am trying to use PoD in my analysis.
As a start, I am just trying to build a PoD cluster
out of 2 machines using ssh.
I installed PoD on my machine and then ran
pod-server start, which gives

Starting PoD server...
updating xproofd configuration file...
starting xproofd...
starting PoD agent...
preparing PoD worker package...
selecting pre-compiled bins to be added to worker package...
PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
------------------------
XPROOFD [23249] port: 21001
PoD agent [23281] port: 22001
PROOF connection string: chinmay@localhost.localdomain:21001
------------------------

Then I ran
pod-ssh -c pod_ssh.cf --debug submit (pod_ssh.cf is attached here):

**	[Thu, 14 Sep 2017 14:33:02 +0530]	preparing PoD worker package...
**	[Thu, 14 Sep 2017 14:33:02 +0530]	selecting pre-compiled bins to be added to worker package...
**	[Thu, 14 Sep 2017 14:33:02 +0530]	PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
**	[Thu, 14 Sep 2017 14:33:02 +0530]	pod-ssh config contains an inline shell script. It will be injected it into wrk. package
**	[Thu, 14 Sep 2017 14:33:02 +0530]	preparing PoD worker package...
**	[Thu, 14 Sep 2017 14:33:02 +0530]	inline shell script is found and will be added to the package...
**	[Thu, 14 Sep 2017 14:33:02 +0530]	selecting pre-compiled bins to be added to worker package...
**	[Thu, 14 Sep 2017 14:33:02 +0530]	PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
**	[Thu, 14 Sep 2017 14:33:02 +0530]	There are 5 threads in the tread-pool.
**	[Thu, 14 Sep 2017 14:33:02 +0530]	Number of PoD workers: 1
**	[Thu, 14 Sep 2017 14:33:02 +0530]	Number of PROOF workers: 4
**	[Thu, 14 Sep 2017 14:33:02 +0530]	Workers list:
**	[Thu, 14 Sep 2017 14:33:02 +0530]	[nilay] with 4 workers at chinmay@10.159.63.110:/home/chinmay/tmp/nilay
nilay	[Thu, 14 Sep 2017 14:33:02 +0530]	pod-ssh-submit-worker is started for chinmay@10.159.63.110 (dir: /home/chinmay/tmp/nilay, nworkers: 4, sshopt: -X)
**	[Thu, 14 Sep 2017 14:33:03 +0530]	
*******************
Successfully processed tasks: 1
Failed tasks: 0
*******************

After this, when I run root and try to open a PROOF session, I get the following error

[chinmay@localhost 3.16]$ root -l
root [0] TProof *proof = TProof::Open("chinmay@localhost.localdomain:21001")
Starting master: opening connection ...
Starting master: OK                                                 
no resource currently available for this session: please retry later
Error in <TProof::StartSlaves>: no resources available or problems setting up workers (check logs)
Error in <TProof::Open>: new session could not be created
(TProof *) nullptr

The pod.agent.client.log file shows the following error

2017-09-14 14:32:10.277 INF 0 [LOG singleton:thread-12171] LOG singleton has been initialized.
2017-09-14 14:32:10.277 INF 0 [PROOFAgent:thread-12171] pod-agent v.3.16
2017-09-14 14:32:10.277 INF 0 [CORE:thread-12171] Bringing >>> AgentClient <<< to life...
2017-09-14 14:32:10.277 INF 0 [AgentClient:thread-12171] Detected xpd [12119] on port 21001
2017-09-14 14:32:10.277 INF 0 [AgentClient:thread-12171] starting a monitor
2017-09-14 14:32:10.277 DBG 0 [AgentClient:thread-12171] Creating a PROOF configuration file...
2017-09-14 14:32:10.280 INF 0 [AgentClient:thread-12171] looking for PROOFAgent server to connect...
2017-09-14 14:32:10.280 ERR 1 [AgentClient:thread-12171] Can't connect to the server
Error on Socket<:55156>: Transport endpoint is not connected
2017-09-14 14:32:10.280 INF 0 [CORE:thread-12171] Shutting down >>> AgentClient <<<
2017-09-14 14:32:10.280 INF 0 [CORE:thread-12171] Shutting down >>> PROOFAgent <<<

Can someone help?

The PoD xpd.log is as follows

++++++ xrootd anon@ubuntu initialization started.
Config using configuration file /home/chinmay/tmp/nilay/xpd.cf
=====> xrd.adminpath /tmp/PoDWorker_4KKJlANMkh
170914 14:32:09 12119 XrdSetIF: Skipping duplicate private interface [::192.168.122.1]
Config maximum number of connections restricted to 4096
Plugin No such file or directory loading protocol /home/chinmay/root-6.08.04/PROOF_INSTALLATION/lib/libXrdProofd-4.so
Config Falling back to using /home/chinmay/root-6.08.04/PROOF_INSTALLATION/lib/libXrdProofd.so
Plugin loaded 
170914 14:32:09 12119 xpd.port 21001
170914 14:32:09 12119 xpd.port 21001
170914 14:32:09 12119 xpd.sockpathdir /tmp/PoDWorkerSockets_ByIczmogjO
170914 14:32:09 12119 xpd.tmp /tmp/PoDWorker_4KKJlANMkh
170914 14:32:09 12119 xpd.workdir /tmp/PoDWorker_gd2rDfNIKB/proof
170914 14:32:09 12119 xpd.role worker
170914 14:32:09 12119 xpd-I: Manager::Config: configuring
170914 14:32:09 12119 xpd-I: Manager::Config: listening on port 21001
170914 14:32:09 12119 xpd-I: Manager::Config: using temp dir: /tmp/PoDWorker_4KKJlANMkh
170914 14:32:09 12119 xpd-I: Manager::Config: role set to: worker
170914 14:32:09 12119 xpd-I: Manager::Config: admin path set to: /tmp/PoDWorker_4KKJlANMkh/.xproofd.21001
170914 14:32:09 12119 xpd-I: Manager::Config: unix sockets under: /tmp/PoDWorkerSockets_ByIczmogjO
170914 14:32:09 12119 xpd-I: Manager::Config: working directories under: /tmp/PoDWorker_gd2rDfNIKB/proof
170914 14:32:09 12119 xpd-I: Manager::Config: masters allowed to connect: any
170914 14:32:09 12119 xpd-I: Manager::Config: PROOF pool: root://ubuntu
170914 14:32:09 12119 xpd-I: Manager::Config: PROOF pool namespace: /proofpool
170914 14:32:09 12119 xpd-I: Manager::Config: no dataset sources defined
170914 14:32:09 12119 xpd-I: Manager::Config: list of superusers: chinmay
170914 14:32:09 12119 xpd-I: Manager::Config: bare lib path for proofserv (full LD_LIBRARY_PATH): /home/chinmay/root-6.08.04/PROOF_INSTALLATION/lib:/home/chinmay/tmp/nilay:/home/chinmay/xrootd-4.5.0//lib64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/home/chinmay/root-6.08.04/PROOF_INSTALLATION/bin
170914 14:32:09 12119 xpd-I: Group::Print: +++ Group: default
170914 14:32:09 12119 xpd-I: Group::Print: +++ Priority: -1, fraction: -1
170914 14:32:09 12119 xpd-I: Group::Print: +++ End of Group: default
170914 14:32:09 12119 xpd-I: Admin::Config: configuring
170914 14:32:09 12119 xpd-I: Admin::Config: allowed/supported copy commands: root:xrdcp,https:wget,file:cp,http:wget,xrd:xrdcp
170914 14:32:09 12119 xpd.resource static /home/chinmay/tmp/nilay/proof.conf
170914 14:32:09 12119 xpd-I: NetMgr::Config: configuring
170914 14:32:09 12119 xpd-I: NetMgr::Config: 0 worker nodes defined at start-up
170914 14:32:09 12119 xpd-I: PriorityMgr::Config: configuring
170914 14:32:09 12119 xpd-I: PriorityMgr::Config: no priority changes requested
170914 14:32:09 12119 xpd-I: PriorityMgr::Config: poller thread started
170914 14:32:09 12119 xpd-I: ROOTMgr::SetLogDir: rootsys log validation path: /tmp/PoDWorker_4KKJlANMkh/.xproofd.21001/rootsysvalidation
170914 14:32:09 12119 xpd-I: ROOTMgr::Config: configuring
170914 14:32:10 12119 xpd-I: ROOTMgr::Config: ROOT dist: '6.08/04 6.08/04 /home/chinmay/root-6.08.04/PROOF_INSTALLATION 37' validated
170914 14:32:10 12119 xpd-I: ROOTMgr::Config: ROOT version details: git: 'v6-08-04', code: 0, {mnp} = {6,8,4}
170914 14:32:10 12119 xpd-I: ClientMgr::Config: configuring
170914 14:32:10 12119 xpd-I: ClientMgr::Config: clients admin path set to: /tmp/PoDWorker_4KKJlANMkh/.xproofd.21001/clients
170914 14:32:10 12119 xpd-I: ClientMgr::Config: XRD seclib not specified; strong authentication disabled
170914 14:32:10 12119 xpd-I: ClientMgr::Config: cron thread started
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: configuring
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: setting internal timeout to 10 secs
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: client sessions kept idle for 0 secs after disconnection
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: active sessions admin path set to: /tmp/PoDWorker_4KKJlANMkh/.xproofd.21001/activesessions
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: terminated sessions admin path set to /tmp/PoDWorker_4KKJlANMkh/.xproofd.21001/terminatedsessions
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: RC settings: 0
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: ENV settings: 0
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: using fork() to start proofserv sessions
170914 14:32:10 12119 xpd-I: ProofServMgr::Config: cron thread started
170914 14:32:10 12147 xpd-I: ProofServCron: next full sessions check in 30 secs
Plugin No such file or directory loading (null) <>
170914 14:32:10 12119 xpd-E: Manager::LoadXrootd: could not find 'XrdgetProtocol()' in <>
170914 14:32:10 12119 xpd-I: Manager::Config: file serving (protocol: 'rootd://') explicitly disabled
170914 14:32:10 12119 xpd-I: Manager::Config: manager cron thread started
170914 14:32:10 12119 xpd-I: Protocol::Configure: global manager created
170914 14:32:10 12119 xpd-I: Protocol::Configure: xproofd protocol version 0.7 build v4.5.0 successfully loaded
------ xrootd anon@ubuntu:21001 initialization completed.

Dear Chinmay,

I suspect a firewall problem: is port 21001 of the worker machine reachable from the machine where you start the server?

Btw, I cannot find your pod_ssh.cf among the attachments.

G Ganis

Hi,
I am using the pod_ssh.cf file you showed in another thread. It is as follows:

@bash_begin@    
    # Temp dir
    export TMPDIR=/tmp
    # ROOT
    . /home/chinmay/root-6.08.04/PROOF_INSTALLATION/bin/thisroot.sh
    # XROOTD
    . /home/chinmay/root-6.08.04/PROOF_INSTALLATION/bin/setxrd.sh /home/chinmay/xrootd-4.5.0/
@bash_end@

nilay,chinmay@10.159.63.110,-X,/home/chinmay/tmp/,4

The port should be reachable, since the firewall is off on both machines.
How can I check whether it is accessible?

After issuing

  pod-ssh -c pod_ssh.cf --debug submit

you can try

  $ nmap -v -p20000-22000 cernvm14.cern.ch
  
  Starting Nmap 7.01 ( https://nmap.org ) at 2017-09-14 18:54 CEST
  Initiating Ping Scan at 18:54
  Scanning cernvm14.cern.ch (137.138.234.68) [2 ports]
  Completed Ping Scan at 18:54, 0.00s elapsed (1 total hosts)
  Initiating Parallel DNS resolution of 1 host. at 18:54
  Completed Parallel DNS resolution of 1 host. at 18:54, 0.00s elapsed
  Initiating Connect Scan at 18:54
  Scanning cernvm14.cern.ch (137.138.234.68) [2001 ports]
  Discovered open port 21001/tcp on 137.138.234.68
  Completed Connect Scan at 18:54, 0.04s elapsed (2001 total ports)
  Nmap scan report for cernvm14.cern.ch (137.138.234.68)
  Host is up (0.00026s latency).
  Not shown: 2000 closed ports
  PORT      STATE SERVICE
  21001/tcp open  unknown
    
  Read data files from: /usr/bin/../share/nmap
  Nmap done: 1 IP address (1 host up) scanned in 0.09 seconds

or

   $ telnet cernvm14.cern.ch 21001
   Trying 137.138.234.68...
   Connected to cernvm14.cern.ch.
   Escape character is '^]'.
   Connection closed by foreign host.

If you do not get similar output, or you get errors, the port is likely closed.
You may need to install telnet or nmap first.
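
If neither tool is available, bash alone can do a rough check through its /dev/tcp pseudo-device. This is only a sketch: the function name check_port is made up for illustration, and it assumes bash was built with network redirections and that the coreutils timeout command is installed.

```shell
# check_port HOST PORT -> prints "open" or "closed"
# Hypothetical helper for illustration; it is not part of PoD.
check_port() {
    # Try to open a TCP connection to HOST:PORT and give up after 3 seconds.
    # A refused connection, a timeout (e.g. a firewall silently dropping
    # packets) or a missing tool are all reported as "closed".
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Example, using the worker address and the PoD ports seen in this thread:
#   for p in 21001 22001; do
#       echo "$p: $(check_port 10.159.63.110 "$p")"
#   done
```

Unlike nmap, this cannot tell a closed port from a filtered one, so treat "closed" as "not reachable for some reason".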

G Ganis

Dear Ganis,
It does indeed seem that the port is closed. I get the following output.

[chinmay@localhost 3.16]$ pod-ssh -c pod_ssh.cf --debug submit
**	[Fri, 15 Sep 2017 10:12:27 +0530]	preparing PoD worker package...
**	[Fri, 15 Sep 2017 10:12:27 +0530]	selecting pre-compiled bins to be added to worker package...
**	[Fri, 15 Sep 2017 10:12:27 +0530]	PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
**	[Fri, 15 Sep 2017 10:12:27 +0530]	pod-ssh config contains an inline shell script. It will be injected it into wrk. package
**	[Fri, 15 Sep 2017 10:12:28 +0530]	preparing PoD worker package...
**	[Fri, 15 Sep 2017 10:12:28 +0530]	inline shell script is found and will be added to the package...
**	[Fri, 15 Sep 2017 10:12:28 +0530]	selecting pre-compiled bins to be added to worker package...
**	[Fri, 15 Sep 2017 10:12:28 +0530]	PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
**	[Fri, 15 Sep 2017 10:12:28 +0530]	There are 5 threads in the tread-pool.
**	[Fri, 15 Sep 2017 10:12:28 +0530]	Number of PoD workers: 1
**	[Fri, 15 Sep 2017 10:12:28 +0530]	Number of PROOF workers: 4
**	[Fri, 15 Sep 2017 10:12:28 +0530]	Workers list:
**	[Fri, 15 Sep 2017 10:12:28 +0530]	[nilay] with 4 workers at chinmay@10.159.63.110:/home/chinmay/tmp/nilay
nilay	[Fri, 15 Sep 2017 10:12:28 +0530]	pod-ssh-submit-worker is started for chinmay@10.159.63.110 (dir: /home/chinmay/tmp/nilay, nworkers: 4, sshopt: -X)
**	[Fri, 15 Sep 2017 10:12:30 +0530]	
*******************
Successfully processed tasks: 1
Failed tasks: 0
*******************
[chinmay@localhost 3.16]$ nmap -v -p20000-22000 10.159.63.110

Starting Nmap 6.47 ( http://nmap.org ) at 2017-09-15 10:13 IST
Initiating Ping Scan at 10:13
Scanning 10.159.63.110 [2 ports]
Completed Ping Scan at 10:13, 0.00s elapsed (1 total hosts)
Initiating Parallel DNS resolution of 1 host. at 10:13
Completed Parallel DNS resolution of 1 host. at 10:13, 0.00s elapsed
Initiating Connect Scan at 10:13
Scanning 10.159.63.110 [2001 ports]
Completed Connect Scan at 10:13, 0.07s elapsed (2001 total ports)
Nmap scan report for 10.159.63.110
Host is up (0.00063s latency).
All 2001 scanned ports on 10.159.63.110 are closed

Read data files from: /usr/bin/../share/nmap
Nmap done: 1 IP address (1 host up) scanned in 0.13 seconds

How can I resolve this problem?
Should I use a different port?
Thanks.

Dear Chinmay,

So you have a firewall running on the machine. If you cannot turn it off, at least for a range of ports, then we have to play with SSH tunnels. I need to try a few things and will come back to you.

G Ganis

Dear Ganis,
I tried swapping the master and worker machines.
In this case pod-ssh status reports “RUN” and port 21001 is open.
However, port 22001 is still closed.

[chinmay@ubuntu INSTALLATION]$ pod-server start
Starting PoD server...
updating xproofd configuration file...
starting xproofd...
starting PoD agent...
preparing PoD worker package...
selecting pre-compiled bins to be added to worker package...
PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
------------------------
XPROOFD [17026] port: 21001
PoD agent [17059] port: 22001
PROOF connection string: chinmay@ubuntu:21001
------------------------
[chinmay@ubuntu INSTALLATION]$ pod-ssh -c pod_ssh.cf --debug submit
**	[Fri, 15 Sep 2017 15:40:24 +0530]	preparing PoD worker package...
**	[Fri, 15 Sep 2017 15:40:24 +0530]	selecting pre-compiled bins to be added to worker package...
**	[Fri, 15 Sep 2017 15:40:24 +0530]	PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
**	[Fri, 15 Sep 2017 15:40:24 +0530]	pod-ssh config contains an inline shell script. It will be injected it into wrk. package
**	[Fri, 15 Sep 2017 15:40:24 +0530]	preparing PoD worker package...
**	[Fri, 15 Sep 2017 15:40:24 +0530]	inline shell script is found and will be added to the package...
**	[Fri, 15 Sep 2017 15:40:24 +0530]	selecting pre-compiled bins to be added to worker package...
**	[Fri, 15 Sep 2017 15:40:24 +0530]	PoD worker package: /home/chinmay/.PoD/wrk/PoDWorker.sh
**	[Fri, 15 Sep 2017 15:40:24 +0530]	There are 5 threads in the tread-pool.
**	[Fri, 15 Sep 2017 15:40:24 +0530]	Number of PoD workers: 1
**	[Fri, 15 Sep 2017 15:40:24 +0530]	Number of PROOF workers: 8
**	[Fri, 15 Sep 2017 15:40:24 +0530]	Workers list:
**	[Fri, 15 Sep 2017 15:40:24 +0530]	[chinmay] with 8 workers at chinmay@10.159.63.240:/tmp/.pod_test/chinmay
chinmay	[Fri, 15 Sep 2017 15:40:24 +0530]	pod-ssh-submit-worker is started for chinmay@10.159.63.240 (dir: /tmp/.pod_test/chinmay, nworkers: 8, sshopt: -X)
**	[Fri, 15 Sep 2017 15:40:25 +0530]	
*******************
Successfully processed tasks: 1
Failed tasks: 0
*******************
[chinmay@ubuntu INSTALLATION]$ pod-ssh status
PoD worker "chinmay": RUN
[chinmay@ubuntu INSTALLATION]$ nmap -v -p21001,22001 10.159.63.240

Starting Nmap 6.40 ( http://nmap.org ) at 2017-09-15 15:40 IST
Initiating Ping Scan at 15:40
Scanning 10.159.63.240 [2 ports]
Completed Ping Scan at 15:40, 0.00s elapsed (1 total hosts)
Initiating Parallel DNS resolution of 1 host. at 15:40
Completed Parallel DNS resolution of 1 host. at 15:40, 0.00s elapsed
Initiating Connect Scan at 15:40
Scanning 10.159.63.240 [2 ports]
Discovered open port 21001/tcp on 10.159.63.240
Completed Connect Scan at 15:40, 0.00s elapsed (2 total ports)
Nmap scan report for 10.159.63.240
Host is up (0.00066s latency).
PORT      STATE  SERVICE
21001/tcp open   unknown
22001/tcp closed unknown

Read data files from: /usr/bin/../share/nmap
Nmap done: 1 IP address (1 host up) scanned in 0.02 seconds
[chinmay@ubuntu INSTALLATION]$ root -l
root [0] TProof *proof = TProof::Open("chinmay@ubuntu:21001")
Starting master: opening connection ...
Starting master: OK                                                 
no resource currently available for this session: please retry later
Error in <TProof::StartSlaves>: no resources available or problems setting up workers (check logs)
Error in <TProof::Open>: new session could not be created
(TProof *) nullptr
root [1] .q

Thanks.

After submitting the workers you should have a proof.conf file under $HOME/.PoD: can you check which port is used in it?
Look for lines such as:

    worker ganis@cernvm14.cern.ch port=21001 pref=100
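
A quick way to pull the worker entries and their ports out of that file is sketched below. The function name list_worker_ports is hypothetical, and the parsing assumes the line layout of the example above (the keyword worker, then user@host, then key=value options); it is not a PoD tool.

```shell
# list_worker_ports FILE -> prints "user@host port" for every worker line
# Hypothetical helper for illustration; assumes lines shaped like
#   worker user@host port=NNNNN pref=NNN
list_worker_ports() {
    awk '$1 == "worker" {
        port = "(none)"
        # Look for a port=... option among the remaining fields.
        for (i = 3; i <= NF; i++)
            if ($i ~ /^port=/) port = substr($i, 6)
        print $2, port
    }' "$1"
}

# Example:
#   list_worker_ports "$HOME/.PoD/proof.conf"
```

If this prints nothing, the file holds no worker lines at all, which would fit the "no resource currently available" message from TProof::Open.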

G Ganis

Btw, are you at CERN by any chance?

G Ganis

Okay. Do I have to write this proof.conf file myself?
In the first case it is there, but it has only one line:

master localhost.localdomain

In the second case there is no proof.conf file in $HOME/.PoD/.
I am not from the HEP community, so this is the only way for me to communicate :frowning:
The frustrating part (more for you) is that I understand very little about networking.

Hi,
In the reversed configuration it worked after I added a proof.conf file to $HOME/.PoD,
though it is not yet clear to me why that was needed.
Anyway, thanks a lot for your patience and help. :slightly_smiling_face:

Good that this works now.
However, the proof.conf file should be created (and deleted) automatically by pod-ssh ... submit, and it gets updated dynamically as workers come and go.

This is what happens in my case; it is strange that it does not happen for you, and it may well explain why it was not working. I think this whole process is handled by the PoD agents, which are a sort of monitor.
It may be that they are not able to communicate.

If you run netstat -tap, do you get a line like this?

   tcp        0      0 pcphsft64.dyndns.:22001 cernvm14.cern.ch:41612  ESTABLISHED 13632/pod-agent
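
To narrow the netstat output down to that kind of line, a small filter like the one below may help. It is only a sketch: agent_connections is a made-up name, and it simply keeps lines that contain both ESTABLISHED and pod-agent, as in the sample above.

```shell
# agent_connections -> reads `netstat -tap` output on stdin and keeps only
# established connections owned by a pod-agent process.
# Hypothetical helper for illustration; it is not part of PoD.
agent_connections() {
    awk '/ESTABLISHED/ && /pod-agent/'
}

# Example (run on the server machine after pod-ssh ... submit):
#   netstat -tap 2>/dev/null | agent_connections
```

Note that netstat only shows the program name for processes owned by the current user unless it is run as root. No output here would suggest that the worker's agent never reached the server's agent port (22001 in this thread).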

G Ganis
