Master does not give all workers

Hello all,

The problem is that the master is not giving us all the idle workers of the farm (96 workers).
For example, only one person is running right now, with 18 workers. If I open a session, I only get 10 workers…

Do you know what could be the reason? Let me know if you need more information.

Thanks!

Jordi Nadal

Hi,

Can you post the daemon configuration file on the master?
Which ROOT version are you running?

G. Ganis

Hi,

Since I am a beginner, I may not be giving the right information. If so, please let me know.

Thanks,

Jordi

ROOT version:

root-5.30.01_slc5_gcc4.1_x86-64

Here is the xrootd configuration file (/etc/init.d/xrootd):

#!/bin/sh
#
# chkconfig: 345 99 0
# description: The xrootd daemon is used as a file server and starter of
#              the PROOF worker processes.
#
# xrootd    Start/Stop the XROOTD daemon
#
# processname: xrootd
# pidfile: /var/run/xrootd.pid
# config:

NFSCONFIG=/software/at3/admin/proof_nodes/root/

# Specify here the full path to the configuration file to be used
XRDCF=$NFSCONFIG/etc/conf/xpd.cf

. $NFSCONFIG/setup.sh

XROOTD=$ROOTSYS/bin/xrootd
XRDLIBS=$ROOTSYS/lib

# Get xrootd config
. $NFSCONFIG/etc/sysconfig/xrootd

# Source function library.
. /etc/init.d/functions

# Get config.
. /etc/sysconfig/network

# Abort if the log file variable is empty
if [[ "X$XRDLOG" == "X" ]]; then
    echo "Error: XRDLOG not defined."
    exit 1
fi

if [ ! -d `dirname $XRDLOG` ] ; then
    mkdir -p `dirname $XRDLOG`
    touch $XRDLOG
fi

if [ ! -f $XRDLOG ] ; then
    touch $XRDLOG
fi

# Same check for the monitoring log
if [[ "X$XPDMONLOG" == "X" ]]; then
    echo "Error: XPDMONLOG not defined."
    exit 1
fi

if [ ! -d `dirname $XPDMONLOG` ] ; then
    mkdir -p `dirname $XPDMONLOG`
    touch $XPDMONLOG
fi

if [ ! -f $XPDMONLOG ] ; then
    touch $XPDMONLOG
fi

xpdmonowner=`ls -l $XPDMONLOG | awk '{print $3}'`
if [[ $xpdmonowner != "xrootd" ]] ; then
    chown xrootd.xrootd $XPDMONLOG
fi

xpdmon_permit=`ls -l $XPDMONLOG | awk '{print $1}'`
if [[ $xpdmon_permit != "-rw-rw-rw-" ]]; then
    chmod 666 $XPDMONLOG
fi

# Read user config
[ ! -z "$XRDUSERCONFIG" ] && [ -f "$XRDUSERCONFIG" ] && . $XRDUSERCONFIG

# Check that networking is up.
if [ ${NETWORKING} = "no" ]; then
    exit 0
fi

[ -x $XROOTD ] || exit 0

RETVAL=0
prog="xrootd"

export DAEMON_COREFILE_LIMIT=unlimited

start() {
    echo -n $"Starting $prog: "
    # Options are specified in /etc/sysconfig/xrootd .
    # See $ROOTSYS/etc/daemons/xrootd.sysconfig for an example.
    # $XRDUSER must be the name of an existing non-privileged user.
    export LD_LIBRARY_PATH=$XRDLIBS:$LD_LIBRARY_PATH
    cd /var/log/root/xrootd
    # workaround change xroot.log access rights
    if [ ! -f xrootd.log ]; then
        touch xrootd.log
    fi
    chown xrootd:xrootd xrootd.log
    # limit on 1 GB resident memory, and 2 GB virtual memory
    ulimit -m 1048576 -v 2097152 -n 65000
    echo "daemon $XROOTD -b -l $XRDLOG -R $XRDUSER -c $XRDCF $XRDDEBUG"
    daemon $XROOTD -b -l $XRDLOG -R $XRDUSER -c $XRDCF $XRDDEBUG
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/xrootd
    return $RETVAL
}

stop() {
    [ ! -f /var/lock/subsys/xrootd ] && return 0 || true
    echo -n $"Stopping $prog: "
    # killproc proofserv.exe
    # Assuming proofd is on port 1093
    # killproc xrootd will also kill the standalone xrootd
    # So we prefer here to look for the xrootd process associated with proofd on 1093
    proofdpid=`netstat -ptln | grep ":1093" | grep xrootd | awk '{print $7}' | sed 's/\/xrootd//g'`
    RETVAL=0
    if [[ "X$proofdpid" != "X" ]] ; then
       kill $proofdpid
       # killproc xrootd
       RETVAL=$?
    fi
    echo
    [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/xrootd
    return $RETVAL
}

# See how we were called.
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status xrootd
        RETVAL=$?
        ;;
    restart|reload)
        stop
        start
        ;;
    condrestart)
        if [ -f /var/lock/subsys/xrootd ]; then
            stop
            start
        fi
        ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart|reload|condrestart}"
        exit 1
esac

exit $RETVAL

One thing I forgot to mention: a complete restart of the farm solved the problem temporarily…

Hi,

OK, the file I was referring to is the one pointed to by $XRDCF in your startup script.
Can you post that?

G. Ganis

Sorry for that. Here is the file (admin/proof_nodes/root/etc/conf/xpd.cf):

#all.export /home/proof r/w

# Load the XrdProofdProtocol to serve PROOF sessions
# COSUNA
# I can not change xrootd port due to this
# Using a non default xrootd port

xrd.port 11094

#xrd.debug all
#xpd.debug all
#xrd.timeout hail 30 idle 0 kill 3 read 15

xpd.intwait 500
xpd.putenv LOCALDATASERVER=root://:11094/

if exec xrootd
xrd.protocol xproofd:1093 libXrdProofd.so
fi

all.export /tmp r/o
all.export /home/proof r/w
#all.export /home/proof r/o

#xpd.groupfile /software/at3/root/root-5.26b_slc5_gcc4.1.2_x86-64_PROOF/etc/conf/groupfile
xpd.worker worker pf00[2,3,4,5,6,7,8,9].pic.es port=1093 repeat=12
xpd.master pf001.pic.es
xpd.tmp /home/proof/tmp
xpd.workdir /home/proof

#xpd.putrc Proof.DynamicStartup 1

# For a full description of scheduling options
# root.cern.ch/drupal/content/conf … schedparam

#xpd.schedparam queue:fifo mxrun:5 mxsess:2 selopt:random

# Disconnect idle user sessions
xpd.putrc ProofServ.IdleTimeout 600
xpd.schedparam queue:fifo minforquery:0 selopt:load optnwrks:3

# Log to syslog
xpd.putrc ProofServ.LogToSysLog m1

if pf001.pic.es
xpd.putrc ProofServ.Monitoring SQL mysql://localhost/proofdb proofmon pfpmon proofquerylog
fi

# Truncate large log files that crash daemons
# Protection against large log files
xpd.putrc ProofServ.LogFileMaxSize 100M
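
As a quick sanity check on the worker definition in this file, the sketch below recomputes the farm size from the `xpd.worker` line. It assumes that `repeat=12` means twelve worker slots on each of the eight listed hosts; the variable names are made up for illustration.

```shell
# Values copied from the xpd.worker line: pf00[2,3,4,5,6,7,8,9] and repeat=12
hosts="2 3 4 5 6 7 8 9"
repeat=12

# Count the hosts in the list
nhosts=0
for h in $hosts; do
    nhosts=$((nhosts + 1))
done

# Total worker slots, under the assumption stated above
total=$((nhosts * repeat))
echo "$nhosts hosts x $repeat = $total workers"
```

Under that assumption the result matches the 96 workers mentioned in the first post.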

Hi,

The file looks OK. I see, though, that you are using load-based scheduling: this is a somewhat experimental feature, especially for daemons that have been running for a while …
Can you check what happens when the related line is commented out? I.e. with

# The '#' comments out the line
# xpd.schedparam queue:fifo minforquery:0 selopt:load optnwrks:3
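
For anyone who prefers to script the change, here is a minimal sketch of commenting out the directive with `sed`. It works on a scratch copy rather than the live xpd.cf; the temporary file and its two lines are placeholders for illustration, not the real configuration.

```shell
# Scratch copy standing in for xpd.cf (hypothetical content for illustration)
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
xpd.putrc ProofServ.IdleTimeout 600
xpd.schedparam queue:fifo minforquery:0 selopt:load optnwrks:3
EOF

# Prefix every active xpd.schedparam line with '# '; a .bak backup is kept
sed -i.bak 's/^xpd\.schedparam /# &/' "$cfg"

# Show the edited file
cat "$cfg"
```

Lines that are already commented out are left alone, because the pattern is anchored at the start of the line.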

G. Ganis

Hi Ganis,

What kind of queue do we have if we remove that line? Still FIFO? Based on what?

TIA,
Arnau

You can still get FIFO behavior with

xpd.schedparam queue:fifo mxrun:5

(for a queue of length 5).
Anyhow, I suggested commenting out the line only to locate the part responsible for the unexpected behavior, and to see whether and what solution we can provide.

G. Ganis

Hi,

Sorry for the late answer. After changing the line, the behavior was as follows:

The master gave each session all the workers in the farm, so you always had 60 workers for a run, independently of the number of users running on PROOF.

Let me know if you need more information.

Thanks!

Hi,

Just adding a couple of lines to Jordi’s reply.
We had to go back to the previous configuration because the farm became a mess: every user got the entire farm, and the nodes reached high load values.

So we could not evaluate the results of removing that line.

Sorry,
Arnau

Well, you can limit the number of workers per session (user) with the other switches. This

xpd.schedparam queue:fifo mxrun:5 mxw:15 selopt:roundrobin

should start sessions of 15 workers each with a round-robin assignment.
See also root.cern.ch/drupal/content/conf … schedparam .
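
A rough back-of-the-envelope check of these numbers, as a sketch only: it assumes the 96-worker farm from the first post and that `mxrun` caps the number of concurrently running sessions. The variable names are illustrative.

```shell
# Assumed inputs: farm size from the first post, limits from the schedparam line
total=96
mxw=15     # workers per session
mxrun=5    # assumed: maximum concurrently running sessions

busy=$((mxw * mxrun))    # workers in use when mxrun sessions are running
idle=$((total - busy))   # workers left unassigned at that point
fit=$((total / mxw))     # full 15-worker sessions that fit in the farm

echo "busy=$busy idle=$idle sessions_that_fit=$fit"
```

Under these assumptions the farm is never oversubscribed, since five sessions of 15 workers use fewer workers than the farm provides.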

G. Ganis