Is it possible to send jobs to a dedicated WN?

Dear Proof expert:

I have one machine, valtical, acting as the proofd manager, and 6 machines (valtical04, valtical05, valtical06, valtical07, valtical08, valtical09, each with 16 cores) acting as slaves, so this cluster has 96 workers. Is it possible to choose the destination of a job? E.g. if I want to send 16 jobs to run on the machine valtical04, how can I express this in the job submission configuration file?

Cheers,Gang

Hi,

The concept of ‘assigning a job’ to a subset of workers is not really a PROOF concept.
I do not know exactly what you are trying to achieve, but I see two possible ways you may want to proceed:

  • Starting the session and deactivating the workers you do not want with TProof::DeactivateWorker
  • Setting up the daemon in such a way that it gives sessions a limited number of workers following a round-robin policy modulo 16; you should be able to achieve this using this card:
xpd.schedparam wmx:16 selopt:roundrobin
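
As a rough illustration of the first option, the sketch below selects the ordinals of every worker that is not on the machine under test. The helper name `ordinals_to_deactivate` and the (ordinal, host) input format are assumptions made for this example. In a live PyROOT session the pairs would presumably come from `TProof::GetListOfSlaveInfos()`, and each returned ordinal would be passed to `TProof::DeactivateWorker`.

```python
def ordinals_to_deactivate(workers, keep_host):
    """Return ordinals of all workers that do not run on keep_host.

    workers: iterable of (ordinal, hostname) pairs,
             e.g. ("0.3", "valtical04.cern.ch").
    """
    return [ordinal for ordinal, host in workers if host != keep_host]


# Hypothetical session layout: two workers on each of two machines.
workers = [
    ("0.0", "valtical04.cern.ch"),
    ("0.1", "valtical05.cern.ch"),
    ("0.2", "valtical04.cern.ch"),
    ("0.3", "valtical05.cern.ch"),
]
to_drop = ordinals_to_deactivate(workers, "valtical04.cern.ch")

# With a live session `p` (not run here), the calls would look like:
#   for ordinal in to_drop:
#       p.DeactivateWorker(ordinal)
```

Keeping the selection separate from the PROOF calls means the host-matching logic can be checked without a running cluster.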

G. Ganis

Dear Ganis:

Thanks for the reply. We have 6 slave machines (each machine has 16 cores) in the cluster, and I want to check whether PROOF is running fine on each machine. Currently I request 96 parallel workers in the job submission (this job runs every 10 minutes via crond as a service availability monitoring job):

    from ROOT import TProof

    worker = 16 * 6  # 6 machines with 16 cores each
    p = TProof.Open("valtical.cern.ch", "workers=" + str(worker))
    p.SetParameter("PROOF_RateEstimation", "average")
    p.SetParallel(worker)

But if one fails, I don’t know which machine it happened on. I want to submit 16 jobs to machine_1, then to machine_2, and so on in sequence. That way, if something goes wrong, I can restart the proofd daemon on that machine directly instead of restarting the proofd daemons everywhere. I would also have a better view of the overall state of the cluster.

About the solution:

  • Starting the session and deactivating the workers you do not want with TProof::DeactivateWorker

    • when submitting jobs to machine_1, how can I deactivate the other workers on machine_2 - machine_6?
  • Setting up the daemon in such a way that it gives sessions a limited number of workers following a round-robin policy modulo 16; you should be able to achieve this using this card: xpd.schedparam wmx:16 selopt:roundrobin

    • with this, in every round the 16 workers could be spread across the 6 machines instead of all being on one.

    Cheers,Gang

Ok,
But then you should have 6 separate PROOF clusters each with 16 workers.
You have probably already set it up: what happens if you do:

p = TProof.Open("valtical04.cern.ch");

Does it open a session with 16 workers?
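
The per-node check being asked about can be sketched as a loop over the hosts. This is only an outline under stated assumptions: `check_nodes` and `fake_count` are hypothetical names, the worker counts returned by the stand-in are made up, and the real probe would have to open a session on each node, for instance with `TProof.Open(host)` followed by `GetParallel()` in PyROOT.

```python
def check_nodes(hosts, count_workers, expected=16):
    """Return {host: problem} for every node that fails the check.

    count_workers(host) must return the number of workers the session got,
    or raise if the session cannot be opened.
    """
    problems = {}
    for host in hosts:
        try:
            got = count_workers(host)
        except Exception as exc:  # keep going so every node is probed
            problems[host] = f"could not open session: {exc}"
            continue
        if got != expected:
            problems[host] = f"{got} workers instead of {expected}"
    return problems


def fake_count(host):
    """Stand-in for a real PROOF session, with invented results."""
    if host == "valtical05.cern.ch":
        raise RuntimeError("connection refused")
    return 12 if host == "valtical04.cern.ch" else 16


hosts = [f"valtical{i:02d}.cern.ch" for i in range(4, 10)]
report = check_nodes(hosts, fake_count)
```

Passing the probe in as a function keeps the loop testable without a cluster, and the returned dictionary names exactly the machines whose daemons would need attention.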

Gerri

Dear Ganis:

I still want to configure all the workers in one cluster with valtical.cern.ch as redirector, so that users can submit to one 96-worker cluster instead of to six 16-worker clusters. And when I open valtical04.cern.ch, it shows:

root [0] TProof *p = TProof::Open("valtical04.cern.ch")
Starting master: opening connection …
Starting master: OK
Opening connections to workers: OK (96 workers)

Cheers,Gang

Hi,

Yes, you can do that, we just need to play a bit with your config file.
The following may work:

# Set of workers depends on the node
if valtical.cern.ch
xpd.worker worker valtical[04-09].cern.ch repeat=16
elif valtical04.cern.ch
xpd.worker worker valtical04.cern.ch repeat=16
elif valtical05.cern.ch
xpd.worker worker valtical05.cern.ch repeat=16
elif valtical06.cern.ch
xpd.worker worker valtical06.cern.ch repeat=16
elif valtical07.cern.ch
xpd.worker worker valtical07.cern.ch repeat=16
elif valtical08.cern.ch
xpd.worker worker valtical08.cern.ch repeat=16
elif valtical09.cern.ch
xpd.worker worker valtical09.cern.ch repeat=16
fi

In this way, when you enter via valtical.cern.ch you get the full cluster, while addressing one of the nodes directly gets you only the 16 workers on that node.
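
Since the block repeats the same two lines for every node, it could also be generated from a host list rather than typed by hand. The following is a minimal sketch, not part of any PROOF tooling; the function name `worker_directives` and its parameters are invented for this example, and the output simply reproduces the directives shown above.

```python
def worker_directives(master, cluster_pattern, nodes, per_node=16):
    """Build the if/elif/fi block: full cluster on the master,
    a single node's workers when that node is addressed directly."""
    lines = [
        f"if {master}",
        f"xpd.worker worker {cluster_pattern} repeat={per_node}",
    ]
    for node in nodes:
        lines.append(f"elif {node}")
        lines.append(f"xpd.worker worker {node} repeat={per_node}")
    lines.append("fi")
    return "\n".join(lines)


nodes = [f"valtical{i:02d}.cern.ch" for i in range(4, 10)]
block = worker_directives("valtical.cern.ch", "valtical[04-09].cern.ch", nodes)
print(block)
```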
Please try and let me know.

Gerri

Dear Ganis:

Thanks for the idea. I just added the following lines to valtical and valtical04:

if valtical.cern.ch
xpd.worker worker valtical[04-09].cern.ch repeat=16
elif valtical04.cern.ch
xpd.worker worker valtical04.cern.ch repeat=16
elif valtical05.cern.ch
xpd.worker worker valtical05.cern.ch repeat=16
elif valtical06.cern.ch
xpd.worker worker valtical06.cern.ch repeat=16
elif valtical07.cern.ch
xpd.worker worker valtical07.cern.ch repeat=16
elif valtical08.cern.ch
xpd.worker worker valtical08.cern.ch repeat=16
elif valtical09.cern.ch
xpd.worker worker valtical09.cern.ch repeat=16
fi

And after restarting the proofd daemons on these 2 machines, I made a test on valtical05:

Enclose multiple statements between { }.
root [0] TProof *proof = TProof::Open("valtical04.cern.ch")
Starting master: opening connection …
Starting master: OK
Opening connections to workers: OK (96 workers)

It seems it is still trying to open 96 workers, but it ran into a problem connecting to valtical05:

120131 11:10:43 001 Proofx-E: Conn::CheckResp: server [valtical05.cern.ch:1093] did not return OK replying to last request
120131 11:10:43 001 Proofx-E: Conn::CheckErrorStatus: error 3006: 'master not allowed to connect - request ignored'
120131 11:10:43 001 Proofx-I: Conn::Login: valtical05.cern.ch: master not allowed to connect - request ignored
120131 11:10:43 001 Proofx-E: Conn::GetAccessToSrv: client could not login at [valtical05.cern.ch:1093]
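
For the monitoring job, the failing machine can be read off error lines like these automatically. The sketch below is an assumption-laden example: the regular expression is fitted only to the four log lines quoted here, and `failing_servers` is a hypothetical helper name.

```python
import re

# Matches the "[host:port]" part of Proofx-E lines like the ones quoted above.
SERVER_RE = re.compile(r"Proofx-E: .*\[([\w.-]+):(\d+)\]")


def failing_servers(log_text):
    """Return the sorted list of hosts named in Proofx-E lines."""
    hosts = set()
    for line in log_text.splitlines():
        match = SERVER_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return sorted(hosts)


sample = """\
120131 11:10:43 001 Proofx-E: Conn::CheckResp: server [valtical05.cern.ch:1093] did not return OK replying to last request
120131 11:10:43 001 Proofx-E: Conn::CheckErrorStatus: error 3006: 'master not allowed to connect - request ignored'
120131 11:10:43 001 Proofx-I: Conn::Login: valtical05.cern.ch: master not allowed to connect - request ignored
120131 11:10:43 001 Proofx-E: Conn::GetAccessToSrv: client could not login at [valtical05.cern.ch:1093]
"""
print(failing_servers(sample))
```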

Here are all the configuration files:
A: /opt/root/etc/xrootd.cfg on valtical, valtical04-09:
###PROOF Config

# Load the XrdProofd protocol:

if exec xrootd
xrd.protocol xproofd:1093 ${rootlocation}/lib/libXrdProofd.so
fi

# ROOTSYS

xpd.rootsys ${rootlocation}

#xpd.intwait 20

xpd.workdir /localdisk/proofbox

xpd.resource static ${rootlocation}/etc/proof/proof.conf

xpd.role worker

if valtical.cern.ch
xpd.role master
fi

xpd.allow valtical.cern.ch

xpd.maxoldlogs 2

#xpd.namespace /localdisk/proofpool

xpd.poolurl root://valtical.cern.ch

#xpd.schedparam selopt:load queue:fifo optntwks:16

#xpd.putrc Proof.DynamicStartup 1

And for valtical and valtical04 the following lines are added:

if valtical.cern.ch
xpd.worker worker valtical[04-09].cern.ch repeat=16
elif valtical04.cern.ch
xpd.worker worker valtical04.cern.ch repeat=16
elif valtical05.cern.ch
xpd.worker worker valtical05.cern.ch repeat=16
elif valtical06.cern.ch
xpd.worker worker valtical06.cern.ch repeat=16
elif valtical07.cern.ch
xpd.worker worker valtical07.cern.ch repeat=16
elif valtical08.cern.ch
xpd.worker worker valtical08.cern.ch repeat=16
elif valtical09.cern.ch
xpd.worker worker valtical09.cern.ch repeat=16
fi

B: /opt/root/etc/proof/proof.conf, which is the same on valtical, valtical04-09:
[root@valtical ~]# cat /opt/root/etc/proof/proof.conf
master valtical.cern.ch workdir=/localdisk/proofbox

#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical06.cern.ch workdir=/localdisk/proofbox
worker valtical07.cern.ch workdir=/localdisk/proofbox
worker valtical08.cern.ch workdir=/localdisk/proofbox
worker valtical09.cern.ch workdir=/localdisk/proofbox
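
A listing this long is hard to check by eye, so a small tally of active worker lines per host may help confirm that each machine has the intended number of entries. This is a sketch only: `workers_per_host` is a hypothetical helper, and the sample text is a shortened stand-in for the real file.

```python
from collections import Counter


def workers_per_host(conf_text):
    """Count active 'worker' lines per host, skipping comments and blanks."""
    counts = Counter()
    for raw in conf_text.splitlines():
        fields = raw.split()
        # "#worker ..." lines have "#worker" as first field and are skipped.
        if len(fields) >= 2 and fields[0] == "worker":
            counts[fields[1]] += 1
    return counts


sample = """\
master valtical.cern.ch workdir=/localdisk/proofbox
#worker valtical00.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
worker valtical05.cern.ch workdir=/localdisk/proofbox
worker valtical04.cern.ch workdir=/localdisk/proofbox
"""
counts = workers_per_host(sample)
for host, n in sorted(counts.items()):
    print(host, n)
```

Run against the real proof.conf, any host whose count differs from 16 would stand out immediately.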

Any idea? Or, when you have time, could I pass by your office to run some real tests?

Cheers,Gang

Please try again by commenting out the ‘xpd.resource’ line.
Also remove the ‘xpd.role worker’ line, because you want the machines to have both roles (when you connect to them directly they act as masters in the first instance), and for now also comment out the ‘xpd.allow’ line.
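
To apply these changes identically on all seven machines, the edit could be scripted. The sketch below is an illustration only: `comment_out` is a hypothetical helper, it comments out all three directives (including ‘xpd.role worker’, so the change is easy to revert), and it leaves the conditional ‘xpd.role master’ block untouched because nothing above says to change it.

```python
def comment_out(cfg_text, prefixes):
    """Prefix with '#' every line that starts with one of the given directives."""
    out = []
    for line in cfg_text.splitlines():
        stripped = line.lstrip()
        if any(stripped.startswith(p) for p in prefixes):
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out)


# Shortened copy of the configuration quoted earlier in the thread.
cfg = """\
xpd.resource static ${rootlocation}/etc/proof/proof.conf
xpd.role worker
if valtical.cern.ch
xpd.role master
fi
xpd.allow valtical.cern.ch
xpd.maxoldlogs 2"""
edited = comment_out(cfg, ("xpd.resource", "xpd.role worker", "xpd.allow"))
print(edited)
```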

G. Ganis