Problems accepting callback: process killed

Hello,

I am getting the error below after I start PoD (using pod-server start). I downloaded PoD-3.10-Source.tar.gz, and I am running ROOT 5.34.05 with xrootd 3.2.4. Would you know how to get rid of the error? I suppose I am missing some settings. Thanks.

All the best,

David

root [2] p = TProof::Open("pod://");
130305 11:26:51 22586 Proofx-E: Conn::LowWrite: sending header to server [lxplus436.cern.ch:21003] (rc=-3)
130305 11:26:51 22586 Proofx-E: Conn::SendRecv: problems sending request to server [lxplus436.cern.ch:21003]
Starting master: opening connection …
Starting master: connection open: setting up server …

| Error condition occured: message from server:
| mst-0:problems accepting callback: process killed - pid: 22628

Info in TXSlave::HandleError: 0x1d118d30:lxplus436.cern.ch:0 got called … fProof: 0x1d1177d0, fSocket: 0x1d118fe0 (valid: 1)
Info in TXSlave::HandleError: 0x1d118d30: proof: 0x1d1177d0
Error in TXSlave::SetupServ: failed to receive slave startup message
TXSlave::HandleError: 0x1d118d30: DONE …
Starting master: failure
PROOF set to sequential mode

Hi,

Sorry for the late reply.
Which LSF plugin do you use?
Do you get workers up? What is the result of 'pod-info -n'?
Also, is there a reason why you cannot use PoD 3.12? (Not sure it matters, but you will get better support for the latest version.)

G. Ganis

Of course I meant ‘Which PoD plugin do you use?’ …

Hi,

I switched to PoD 3.12 and nothing has changed. First, I do:

pod-server start
pod-submit -r lsf -q 1nh -n 20

Then the workers are up:

salekd@lxplus412:~/work/SFrame/Monojet/PoD$ pod-server status
XPROOFD [27452] port: 21001
PoD agent [27545] port: 22001
PROOF connection string: salekd@lxplus412.cern.ch:21001
salekd@lxplus412:~/work/SFrame/Monojet/PoD$ pod-info -n
20

What do you mean by PoD plugin?

Cheers,

David

Hi,

By PoD plugin I meant which batch backend. I see it is LSF at CERN.

It looks like the workers are not able to connect back. At CERN working dirs are by default on AFS and there are some issues with that. Please look at

pod.gsi.de/doc/3.12/Known_Issues.html#PoDonAFS

and try again after moving the work_dir under your local temporary directory.
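
For reference, the workaround boils down to pointing work_dir at node-local scratch in the [server] section of ~/.PoD/PoD.cfg, e.g. (the path is only an example):

[code][server]
work_dir=/tmp/$USER/.PoD[/code]

and then restarting with pod-server stop / pod-server start before resubmitting the workers.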

G. Ganis

Hi,

I was already running with work_dir and logfile_dir set to /tmp/salekd. I also set the access rights on my AFS .PoD subdirectories as recommended in the manual. The error still remains.

Do you have any other ideas? Thanks.

All the best,

David

Hey David!

Can you please send me the PoD logs from a failed PoD session? They are under
~/.PoD/log

I also need your ~/.PoD/PoD.cfg.

thx!

Hi,

In fact, I tried starting from scratch and reinstalled PoD 3.12. I again tried to start the PoD server, but I no longer see any workers running ("pod-info -n" returns 0). I do "pod-server start" and "pod-submit -r lsf -q 1nh -n 20". Then "pod-info -ls" gives me:
XPROOFD [17995] port: 21002
PoD agent [18141] port: 22002

All my 20 jobs are running, but I am still getting 0 from "pod-info -n". I have no idea what the difference is, since all I did was reinstall the PoD package.

Here is my PoD.cfg file:

[code]# PoD user defaults
# version: 0.8
# Please use PoD User's Manual to find out more details on
# keys and values of this configuration file.
# PoD User's Manual can be found in $POD_LOCATION/doc folder or
# by the following address: http://pod.gsi.de/documentation.html

[server]
#work_dir=$HOME/.PoD
#sandbox_dir=$HOME/.PoD
#logfile_dir=$HOME/.PoD/log
work_dir=/tmp/$USER/.PoD
sandbox_dir=/tmp/$USER/.PoD
logfile_dir=/tmp/$USER/.PoD/log
logfile_overwrite=yes
log_level=1
agent_shutdown_if_idle_for_sec=86400
agent_local_client_port_min=20000
agent_local_client_port_max=25000
xproof_ports_range_min=21001
xproof_ports_range_max=22000
agent_ports_range_min=22001
agent_ports_range_max=23000
agent_threads=5
agent_node_readbuffer=5000
packet_forwarding=auto
proof_cfg_entry_pattern=worker %user%@%host% port=%port% pref=100
[worker]
work_dir=$POD_LOCATION/
logfile_dir=$POD_LOCATION/
logfile_overwrite=yes
log_level=1
set_my_rootsys=yes
my_rootsys=$ROOTSYS
agent_shutdown_if_idle_for_sec=86400
xproof_ports_range_min=21001
xproof_ports_range_max=22000
agent_node_readbuffer=5000
[lsf_plugin]
email_job_output=no
upload_job_log=no
[pbs_plugin]
upload_job_log=no
[ge_plugin]
upload_job_log=no
options_file=$POD_LOCATION/etc/Job.oge.option
[condor_plugin]
upload_job_log=no
options_file=$POD_LOCATION/etc/Job.condor.option
[/code]

Here is my user_worker_env.sh file:

[code]#! /usr/bin/env bash
echo "Setting user environment for workers ..."
source /afs/cern.ch/sw/lcg/contrib/gcc/4.3.5/x86_64-slc5-gcc34-opt/setup.sh
export LD_LIBRARY_PATH=/afs/cern.ch/sw/lcg/external/qt/4.4.2/x86_64-slc5-gcc43-opt/lib:\
/afs/cern.ch/sw/lcg/external/Boost/1.47.0_python2.6/x86_64-slc5-gcc43-opt//lib:\
/afs/cern.ch/sw/lcg/external/xrootd/3.2.4/x86_64-slc5-gcc43-opt/lib64:\
/afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.05/x86_64-slc5-gcc43-opt/root/lib:\
/afs/cern.ch/sw/lcg/external/Python/2.6.2/x86_64-slc5-gcc43-opt/lib:\
/afs/cern.ch/sw/lcg/contrib/gcc/4.3.5/x86_64-slc5-gcc43-opt/lib64:\
/afs/cern.ch/sw/lcg/contrib/mpfr/2.3.1/x86_64-slc5-gcc43-opt/lib:\
/afs/cern.ch/sw/lcg/contrib/gmp/4.2.2/x86_64-slc5-gcc43-opt/lib
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"[/code]

Here is my user_xpd.cf file:

xpd.rootd allow
xpd.intwait 600

All three files are under my $HOME/.PoD directory. At the start of each session, I do "source $HOME/PoD/3.12/PoD_env.sh".

Finally, here are my log files:

/tmp/salekd/.PoD/log/pod-agent.server.log

2013-03-18 16:15:01.770 INF 0 [LOG singleton:thread-16820] LOG singleton has been initialized.
2013-03-18 16:15:01.770 INF 0 [PROOFAgent:thread-16820] pod-agent v.3.12
2013-03-18 16:15:01.771 INF 0 [CORE:thread-16820] Bringing >>> AgentServer <<< to life...
2013-03-18 16:15:01.771 INF 0 [CORE:thread-16820] Bringing >>> ThreadPool <<< to life...
2013-03-18 16:15:01.771 INF 0 [ThreadPool:thread-16822] starting a thread worker.
2013-03-18 16:15:01.771 INF 0 [ThreadPool:thread-16823] starting a thread worker.
2013-03-18 16:15:01.771 INF 0 [ThreadPool:thread-16824] starting a thread worker.
2013-03-18 16:15:01.771 INF 0 [ThreadPool:thread-16825] starting a thread worker.
2013-03-18 16:15:01.771 INF 0 [ThreadPool:thread-16826] starting a thread worker.
2013-03-18 16:15:01.771 INF 0 [AgentServer:thread-16820] Detected xpd [16285] on port 21001
2013-03-18 16:15:01.771 INF 0 [AgentServer:thread-16820] starting a monitor
2013-03-18 16:15:01.773 INF 0 [AgentServer:thread-16820] Entering into the main 'select' loop...
2013-03-18 16:15:28.210 INF 0 [AgentServer:thread-16820] Accepting the connetion from PoD UI: lxplus415.cern.ch:50266
2013-03-18 16:15:28.210 INF 0 [AgentServer:thread-16820] Client requests a list of available workers.
2013-03-18 16:15:28.210 INF 0 [AgentServer:thread-16820] Client lxplus415.cern.ch:50266 has just dropped the connection
2013-03-18 16:15:28.220 INF 0 [AgentServer:thread-16820] Accepting the connetion from PoD UI: lxplus415.cern.ch:50267
2013-03-18 16:15:28.220 INF 0 [AgentServer:thread-16820] Client requests a list of available workers.
2013-03-18 16:15:28.220 INF 0 [AgentServer:thread-16820] Client lxplus415.cern.ch:50267 has just dropped the connection
2013-03-18 16:15:38.437 INF 0 [AgentServer:thread-16820] Accepting the connetion from PoD UI: lxplus415.cern.ch:50291
2013-03-18 16:15:38.437 INF 0 [AgentServer:thread-16820] Client requests a list of available workers.
2013-03-18 16:15:38.437 INF 0 [AgentServer:thread-16820] Client lxplus415.cern.ch:50291 has just dropped the connection
2013-03-18 16:15:38.478 INF 0 [AgentServer:thread-16820] Accepting the connetion from PoD UI: lxplus415.cern.ch:50292
2013-03-18 16:15:38.478 INF 0 [AgentServer:thread-16820] Client requests a list of available workers.
2013-03-18 16:15:38.478 INF 0 [AgentServer:thread-16820] Client lxplus415.cern.ch:50292 has just dropped the connection

/tmp/salekd/.PoD/log/PoDServer/xpd.log

130318 16:14:56 16285 Scalla is starting. . .
Copr. 2010 Stanford University, xrd version v3.2.4
++++++ xproofd PoDServer@lxplus415.cern.ch initialization started.
Config using configuration file /tmp/salekd/.PoD/etc/xpd.cf
=====> xrd.adminpath /tmp/PoD_ZRLhP16134
Config maximum number of connections restricted to 1024
130318 16:14:56 16285 xpd.port 21001
130318 16:14:56 16285 xpd.port 21001
130318 16:14:56 16285 xpd.tmp /tmp/PoD_ZRLhP16134
130318 16:14:56 16285 xpd.role master
130318 16:14:56 16285 xpd-I: Manager::DoDirectiveRootd: val: allow
130318 16:14:56 16285 xpd.rootd allow
130318 16:14:56 16285 xpd-I: Manager::Config: configuring
130318 16:14:56 16285 xpd-I: Manager::Config: listening on port 21001
130318 16:14:56 16285 xpd-I: Manager::Config: using temp dir: /tmp/PoD_ZRLhP16134
130318 16:14:56 16285 xpd-I: Manager::Config: role set to: master
130318 16:14:56 16285 xpd-I: Manager::Config: admin path set to: /tmp/PoD_ZRLhP16134/PoDServer//.xproofd.21001
130318 16:14:56 16285 xpd-I: Manager::Config: unix sockets under: /tmp/PoD_ZRLhP16134/PoDServer//.xproofd.21001/socks
130318 16:14:56 16285 xpd-I: Manager::Config: PROOF pool: root://lxplus415.cern.ch
130318 16:14:56 16285 xpd-I: Manager::Config: PROOF pool namespace: /proofpool
130318 16:14:56 16285 xpd-I: Manager::LoadScheduler: scheduler loaded: type: default
130318 16:14:56 16285 xpd-I: Manager::Config: user config files are disabled
130318 16:14:56 16285 xpd-I: Manager::Config: no dataset sources defined
130318 16:14:56 16285 xpd-I: Manager::Config: list of superusers: salekd
130318 16:14:56 16285 xpd-I: Manager::Config: bare lib path for proofserv (full LD_LIBRARY_PATH): /afs/cern.ch/user/s/salekd/PoD/3.12/lib:/afs/cern.ch/user/s/salekd/work/SFrame/RootCore/lib:/afs/cern.ch/user/s/salekd/work/SFrame/SFrame/lib:/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/xrootd/3.2.4-x86_64-slc5/v3.2.4/lib:/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/5.34.05-x86_64-slc5-gcc4.3/lib:/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/python/2.6.5-x86_64-slc5-gcc43/sw/lcg/external/Python/2.6.5/x86_64-slc5-gcc43-opt/lib:/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc435_x86_64_slc5/gcc-alt-435/x86_64-slc5-gcc43-opt/lib64:/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc435_x86_64_slc5/gcc-alt-435/x86_64-slc5-gcc43-opt/lib:
130318 16:14:56 16285 xpd-I: Group::Print: +++ Group: default
130318 16:14:56 16285 xpd-I: Group::Print: +++ Priority: -1, fraction: -1
130318 16:14:56 16285 xpd-I: Group::Print: +++ End of Group: default
130318 16:14:56 16285 xpd-I: Admin::Config: configuring
130318 16:14:56 16285 xpd-I: Admin::Config: allowed/supported copy commands: root:xrdcp,https:wget,file:cp,http:wget,xrd:xrdcp
130318 16:14:56 16285 xpd.resource static /tmp/salekd/.PoD/proof.conf
130318 16:14:56 16285 xpd-I: NetMgr::Config: configuring
130318 16:14:56 16285 xpd-I: NetMgr::Config: PROOF config file: /tmp/salekd/.PoD/proof.conf
130318 16:14:56 16285 xpd-I: NetMgr::Config: PROOF config file will be reloaded upon change
130318 16:14:56 16285 xpd-I: NetMgr::Config: 0 worker nodes defined at start-up
130318 16:14:56 16285 xpd-I: PriorityMgr::Config: configuring
130318 16:14:56 16285 xpd-I: PriorityMgr::Config: no priority changes requested
130318 16:14:56 16285 xpd-I: PriorityMgr::Config: poller thread started
130318 16:14:56 16285 xpd-I: ROOTMgr::SetLogDir: rootsys log validation path: /tmp/PoD_ZRLhP16134/PoDServer//.xproofd.21001/rootsysvalidation
130318 16:14:56 16285 xpd-I: ROOTMgr::Config: configuring
130318 16:15:01 16285 xpd-I: ROOTMgr::Config: ROOT dist: '5.34/05 5.34/05 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/5.34.05-x86_64-slc5-gcc4.3 35' validated
130318 16:15:01 16285 xpd-I: ROOTMgr::Config: ROOT version details: svn: 48568, code: 336389, {mnp} = {5,34,5}
130318 16:15:01 16285 xpd-I: ClientMgr::Config: configuring
130318 16:15:01 16285 xpd-I: ClientMgr::Config: clients admin path set to: /tmp/PoD_ZRLhP16134/PoDServer//.xproofd.21001/clients
130318 16:15:01 16285 xpd-I: ClientMgr::Config: XRD seclib not specified; strong authentication disabled
130318 16:15:01 16285 xpd-I: ClientMgr::Config: cron thread started
130318 16:15:01 16285 xpd.intwait 600
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: configuring
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: setting internal timeout to 600 secs
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: client sessions kept idle for 0 secs after disconnection
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: active sessions admin path set to: /tmp/PoD_ZRLhP16134/PoDServer//.xproofd.21001/activesessions
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: terminated sessions admin path set to /tmp/PoD_ZRLhP16134/PoDServer//.xproofd.21001/terminatedsessions
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: RC settings: 0
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: ENV settings: 0
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: using fork() to start proofserv sessions
130318 16:15:01 16285 xpd-I: ProofServMgr::Config: cron thread started
130318 16:15:01 16285 xpd.resource static /tmp/salekd/.PoD/proof.conf
130318 16:15:01 16285 xpd-I: Sched::Config: cron thread started
130318 16:15:01 16285 xpd-I: Manager::Config: unix socket path for rootd call backs: /tmp/PoD_ZRLhP16134/PoDServer//.xproofd.21001/socks/xpd.21001.16285.rootd
130318 16:15:01 16285 xpd-I: Manager::Config: serving files with: '/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/root/5.34.05-x86_64-slc5-gcc4.3/bin/rootd' (protocol: 'rootd://') to ALL hosts
130318 16:15:01 16285 xpd-I: Manager::Config: manager cron thread started
130318 16:15:01 16285 xpd-I: Protocol::Configure: global manager created
130318 16:15:01 16285 xpd-I: Protocol::Configure: xproofd protocol version 0.7 build v3.2.4 successfully loaded
130318 16:15:01 16775 xpd-I: ProofServCron: next full sessions check in 30 secs
------ xproofd PoDServer@lxplus415.cern.ch:21001 initialization completed.
130318 16:15:31 16776 xpd-I: SchedCron: running regular checks
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServMgr::CleanupLostProofServ: pidpath: /tmp/.xproofd.1093/xrootd.pid
130318 16:15:31 16775 xpd-I: ProofServCron: 0 sessions are currently active
130318 16:15:31 16775 xpd-I: ProofServCron: next sessions check in 30 secs
130318 16:16:01 16776 xpd-I: SchedCron: running regular checks
130318 16:16:01 16775 xpd-I: ProofServCron: 0 sessions are currently active
130318 16:16:01 16775 xpd-I: ProofServCron: next sessions check in 30 secs
130318 16:16:31 16776 xpd-I: SchedCron: running regular checks
130318 16:16:31 16775 xpd-I: ProofServCron: 0 sessions are currently active
130318 16:16:31 16775 xpd-I: ProofServCron: next sessions check in 30 secs
130318 16:17:01 16776 xpd-I: SchedCron: running regular checks
130318 16:17:01 16775 xpd-I: ProofServCron: 0 sessions are currently active
130318 16:17:01 16775 xpd-I: ProofServCron: next sessions check in 30 secs
130318 16:17:31 16776 xpd-I: SchedCron: running regular checks
130318 16:17:31 16775 xpd-I: ProofServCron: 0 sessions are currently active
130318 16:17:31 16775 xpd-I: ProofServCron: next sessions check in 30 secs
130318 16:18:01 16776 xpd-I: SchedCron: running regular checks
130318 16:18:01 16775 xpd-I: ProofServCron: 0 sessions are currently active
130318 16:18:01 16775 xpd-I: ProofServCron: next sessions check in 30 secs
130318 16:18:31 16776 xpd-I: SchedCron: running regular checks
130318 16:18:31 16775 xpd-I: ProofServCron: 0 sessions are currently active
130318 16:18:31 16775 xpd-I: ProofServCron: next sessions check in 30 secs

When LSF says that jobs are running, it does not necessarily mean they have really started. It could also mean that they are only scheduled and the resources are reserved.
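
A quick way to see what LSF itself reports (assuming the standard LSF client tools are on your PATH):

[code]# wide listing of your jobs: STAT should be RUN and EXEC_HOST should point to
# a real batch node before the workers have any chance to connect back
bjobs -w[/code]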

Anyway, can you please switch on logging on the WNs? In your ~/.PoD/PoD.cfg:
[lsf_plugin]
upload_job_log=yes

Then stop the PoD server if it is running:
pod-server stop

Make sure you don't have any PoD WNs still running.

Start the server:
pod-server start

Submit 1-2 WNs:
pod-submit -r lsf -q 1nh -n 2

Check the LSF job statuses.

When they are in RUN, wait a couple of minutes.
Check pod-info -n.
If it returns 0, check the status of the LSF jobs again.
If the jobs still run, wait at most 4-5 minutes to see whether they go away. If not, stop the PoD server.
Check that the log dir contains the logs from the jobs: there must be a subdirectory named after the JOB ID.
I need those logs.

If pod-info -n returns something other than 0, then try to connect to the PROOF cluster. If that fails, stop the server and check the job logs (as above). The whole sequence is sketched below.
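
To have it all in one place, the sequence roughly looks like this (a sketch only; the queue name and worker count are just the values used earlier in this thread, and the log path is wherever logfile_dir points in your PoD.cfg):

[code]# make sure nothing is left over from a previous session
pod-server stop

# fresh server plus two test workers on LSF
pod-server start
pod-submit -r lsf -q 1nh -n 2

# watch the LSF status until the jobs are really in RUN
bjobs -w

# after a couple of minutes, check how many workers connected back
pod-info -n

# if it stays at 0, stop the server and collect the per-job logs;
# with upload_job_log=yes there should be a subdirectory named after the LSF job ID
pod-server stop
ls ~/.PoD/log[/code]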

Hi,

I started the PoD server and submitted two jobs. I saw the jobs running, and they got killed after a few minutes. The "pod-info -n" command always returned 0. I collected the log files from the two jobs; please have a look at the output of one of them below:

LSFJOB

[code]#! /bin/sh
$LSB_TRAPSIGS
$LSB_RCP1
$LSB_RCP2
$LSB_RCP3

# LSBATCH: User input
/usr/local/lsf/etc/job_starter '/tmp/salekd/PoD/Job.lsf'
ExitStat=$?
wait
# LSBATCH: End user input

true
exit `expr $? "|" $ExitStat`[/code]

STDERR

sh: /tmp/salekd/PoD/Job.lsf: No such file or directory

STDOUT

[code]@(#)CERN job starter $Date: 2010/06/23 14:22:16 $
Working directory is </pool/lsf/salekd/379347506.1> on <lxbsj3516.cern.ch>

Job finished at Mon Mar 18 17:16:51 CET 2013 on node
under linux version Scientific Linux CERN SLC release 5.8 (Boron)

CERN statistics: This job used 0:00:01 NCU hours (1 NCU seconds)

CERN statistics: This job used 0:00:01 KSI2K hours (1 KSI2K seconds)

KSI2K = kilo-SpecInt2000 benchmark units = 1.00 NCU
[/code]

Cheers,

David

this is the problem:

sh: /tmp/salekd/PoD/Job.lsf: No such file or directory

LSF can't find the PoD job scripts, because your /tmp is not available on the WNs.
You were only supposed to update work_dir, to work around the pipe issue on AFS. You updated too much :slight_smile:
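
If you want to convince yourself, you can ask LSF to run a short interactive check on a batch node (just an illustration, assuming interactive jobs are allowed on the queue; the path is the one from your LSFJOB file):

[code]# the directory exists on the submit host but not on the batch node
bsub -I -q 1nh "hostname; ls -l /tmp/salekd/PoD"[/code]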

Please update your PoD.cfg accordingly:

work_dir=/tmp/$USER/.PoD
sandbox_dir=$HOME/.PoD
logfile_dir=$HOME/.PoD/log
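
After the change, re-running the earlier test should show the workers connecting back (same queue and a couple of workers, just to confirm):

[code]pod-server stop
pod-server start
pod-submit -r lsf -q 1nh -n 2
# once the LSF jobs are in RUN, this should report 2 instead of 0
pod-info -n[/code]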

Let me know if it doesn’t work.

Hi,

Thanks a lot, it works! I can now see the workers running.

root [0] p = TProof::Open("pod://");

Starting master: opening connection …
Starting master: OK
Opening connections to workers: OK (2 workers)
Setting up worker servers: OK (2 workers)
PROOF set to parallel mode (2 workers)

All the best,

David

you are very welcome :wink: