Hello,
I have configured a second proof farm at our site. When I run a job I have the impression that it is running on workers belonging to another proof farm.
I set the new proofmaster:
$ grep "proofmaster =" src/runHistogrammer.cxx
TString proofmaster = "tditaller025.pic.es";
I run the job:
$ bin/main Protos/Protos_bb.txt Protos/JO_protos_bb.txt Muon
I get the output:
…
Starting master: opening connection …
Starting master: connection open: setting up server … ^MStarting master: OK
…
but then I get this output: f001.pic.es
runHistogrammer: Package Histogrammer enabled…
*** Enabled packages on slave 0.91 on pf004.pic.es
HISTOGRAMMERPKG
*** Enabled packages on slave 0.92 on pf008.pic.es
HISTOGRAMMERPKG
*** Enabled packages on slave 0.93 on pf008.pic.es
pf00* are slaves from the other proof farm we have. The only slave should be tditaller026.pic.es.
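To see at a glance which hosts a session actually landed on, the "Enabled packages on slave" lines can be reduced to a unique host list. This is only a rough sketch: the function name `workers_in_output` is made up for illustration, and it assumes the host is always the last field of such lines, as in the output above.

```shell
#!/bin/sh
# workers_in_output: read job output on stdin and print each distinct host
# that appears on an "Enabled packages on slave ..." line.
# Hypothetical helper; assumes the host name is the last field of the line.
workers_in_output() {
    awk '/Enabled packages on slave/ { print $NF }' | sort -u
}
```

It could be fed the saved job output, for example `workers_in_output < job.log`, where `job.log` is a placeholder file name.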
In the logs of the new proof master I do receive the request:
==> /var/log/root/xrootd/xrootd.log <==
111214 10:37:03 11623 xpd-I: ProofServCron: 1 sessions are currently active
111214 10:37:03 11623 xpd-I: ProofServCron: next sessions check in 30 secs
111214 10:37:32 11623 xpd-I: SchedCron: running regular checks
111214 10:37:33 11623 xpd-I: ProofServCron: 1 sessions are currently active
111214 10:37:33 11623 xpd-I: ProofServCron: next sessions check in 30 secs
111214 10:37:34 11623 xpd-I: cborrego.17907:31@ui02: Protocol: user cborrego disconnected; type: ClientMaster
111214 10:38:02 11623 xpd-I: SchedCron: running regular checks
111214 10:38:03 11623 xpd-I: ProofServCron: 1 sessions are currently active
111214 10:38:03 11623 xpd-I: ProofServCron: next sessions check in 30 secs
111214 10:38:03 11623 xpd-I: cborrego.8580:33@localhost.localdomain: Protocol: user cborrego disconnected; type: Int
In the configuration file from the proofmaster I have just one slave:
[root@tditaller025 ~]# grep taller /software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root//etc/conf/xpd.cf
xpd.worker worker tditaller026.pic.es port=1093 repeat=12
xpd.master tditaller025.pic.es
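For comparison with the hosts reported by the job, the worker hosts that a configuration file declares can be listed directly. A minimal sketch, assuming the `xpd.worker worker HOST ...` line form shown in the grep output above; `list_workers` is a hypothetical helper name, not part of ROOT.

```shell
#!/bin/sh
# list_workers FILE: print the host of every "xpd.worker worker HOST ..." line.
# Hypothetical helper; only the directive layout shown above is assumed.
list_workers() {
    awk '$1 == "xpd.worker" && $2 == "worker" { print $3 }' "$1" | sort -u
}
```

Called as `list_workers /path/to/xpd.cf` (path is a placeholder), it prints one host per line, which makes a mismatch with the session output easy to spot.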
Why is it arriving at the other pf00* slaves? What am I missing?
Thanks so much!
Carlos
Thanks so much ganis for your quick response.
The job is an analysis framework for the whole physics group at my institute. Perhaps you could give me a simpler test that I can submit to our new proof farm?
Thanks so much again
Carlos
PS: if you manage to run the CPU bench, can you post or send me the output file? I am trying to collect some stats about the specs of clusters, and your information would be much appreciated …
If I run your job I get:
[cborrego@ui02.pic.es]#root -l
*** DISPLAY not set, setting it to 80.174.167.41.dyn.user.ono.com:0.0
root [0] TProofBench pb("tditaller025.pic.es")
| ignoring (apparently) non-responding session(s): 8919
Starting master: opening connection …
Starting master: OK
Opening connections to workers: OK (12 workers)
Setting up worker servers: OK (12 workers)
PROOF set to parallel mode (12 workers)
Info in TProofBench::SetOutFile: using default output file: ‘proofbench-tditaller025.pic.es-12w-20111214-1224.root’
Then if I run the other command:
root [1] pb.RunCPU()
0.9: caught exception triggered by signal ‘1’
0.10: caught exception triggered by signal ‘1’
0.4: caught exception triggered by signal ‘1’
0.6: caught exception triggered by signal ‘1’
Worker ‘tditaller026.pic.es-0.4’ has been removed from the active list
+++ Most likely your code crashed on worker 0.4 at tditaller026.pic.es:1093.
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root [] TProof::Mgr("tditaller025.pic.es:1093")->GetSessionLogs()->Display("0.4",0)
Worker ‘tditaller026.pic.es-0.3’ has been removed from the active list
+++ Most likely your code crashed on worker 0.3 at tditaller026.pic.es:1093.
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root [] TProof::Mgr("tditaller025.pic.es:1093")->GetSessionLogs()->Display("0.3",0)
In the log file of the slave I see this:
==> /var/log/root/xrootd/xrootd.log <==
111214 12:22:26 31467 xpd-I: SchedCron: running regular checks
111214 12:22:33 31467 xpd-I: cborrego.8919:31@tditaller025: ClientMgr::MapClient: user cborrego logged-in; type: MasterWorker
111214 12:22:33 31467 xpd-I: cborrego.8919:31@tditaller025: ProofServMgr::Create: use of fork() enforced: calling CreateFork()
111214 12:22:33 31467 xpd-I: ProofServ::SetAdminPath: creation/assertion of the status path /tmp/.xproofd.1093/activesessions/cborrego.default.7178.status was successful!
111214 12:22:34 31467 xpd-I: cborrego.7178:33@localhost.localdomain: ClientMgr::MapClient: user cborrego logged-in; type: Internal
111214 12:22:34 31467 xpd-I: cborrego.8919:31@tditaller025: ProofServMgr::Create: use of fork() enforced: calling CreateFork()
111214 12:22:34 31467 xpd-I: ProofServ::SetAdminPath: creation/assertion of the status path /tmp/.xproofd.1093/activesessions/cborrego.default.7183.status was successful!
111214 12:22:34 31467 xpd-I: cborrego.7183:34@localhost.localdomain: ClientMgr::MapClient: user cborrego logged-in; type: Internal
111214 12:22:34 31467 xpd-I: cborrego.8919:31@tditaller025: ProofServMgr::Create: use of fork() enforced: calling CreateFork()
111214 12:22:34 31467 xpd-I: ProofServ::SetAdminPath: creation/assertion of the status path /tmp/.xproofd.1093/activesessions
Uhmm … that’s strange.
But at least the workers are on the machine that you expect (tditaller026.pic.es).
Now, why you get this failure is a different thing.
Can you try basic operations, like:
Hi,
How do you install ROOT? Is this a common installation?
You should locate the prefix of the installation (binaries should be under PREFIX/bin and libs under PREFIX/lib). The tutorials should then be under PREFIX/share/doc/root/tutorials.
You may need to copy the tutorials dir to some place where you can write.
[root@tditaller025 root]# ls /software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root/tutorials/proof/ProofSimple.C*
/software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root/tutorials/proof/ProofSimple.C
[cborrego@ui02.pic.es]#root -l
*** DISPLAY not set, setting it to deic-173.uab.es:0.0
root [0] TSelector *sel = TSelector::GetSelector("/software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root/tutorials/proof/ProofSimple.C+")
Warning in : /software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root/tutorials/proof is not writeable!
Warning in : Output will be written to /tmp/cborrego
Info in TUnixSystem::ACLiC: creating shared library /tmp/cborrego//software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root/tutorials/proof/ProofSimple_C.so
And no, users cannot write in /software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root/. Should they? Isn’t it dangerous? A user could change the binaries, right?
[quote=“cborrego”]I have just seen that the tutorial is under:
[cborrego@tditaller025.pic.es]#pwd
/software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root
[cborrego@tditaller025.pic.es]#find share/|grep ProofSimple.h
share/doc/root/tutorials/proof/ProofSimple.h[/quote]
Ok, as expected after ‘make install’.
Yes, that’s ok. But the tutorials need you to be able to write; that’s why I was telling you to copy the full tutorials directory (at least tutorials/proof and tutorials/tree) somewhere in your local area and run them from there.
Try that and try to run the tutorials (have a look at tutorials/proof/runProof.C). Then we will see how to solve your original problem.
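The copy step described above could look roughly like the following. This is a sketch under assumptions: `copy_tutorials` is a made-up helper name, and the source and destination paths are whatever applies on your site (the thread later uses /tmp/tutorials).

```shell
#!/bin/sh
# copy_tutorials SRC DEST: copy the proof and tree tutorial directories from
# SRC into a writable DEST, creating DEST if needed.
# Hypothetical helper; SRC and DEST are supplied by the caller.
copy_tutorials() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp -R "$src/proof" "$src/tree" "$dest"/
}
```

For example: `copy_tutorials "$ROOTSYS/share/doc/root/tutorials" /tmp/tutorials`, assuming the tutorials were installed under share/doc/root as shown in the quoted find output.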
ROOT 5.30/04 (branches/v5-30-00-patches@41803, Dec 05 2011, 11:42:00 on linuxx8664gcc)
CINT/ROOT C/C++ Interpreter version 5.18.00, July 2, 2010
Type ? for help. Commands must be C++ statements.
Enclose multiple statements between { }.
root [0] TProof *p = TProof::Open("tditaller025.pic.es")
Starting master: opening connection …
Starting master: OK
Opening connections to workers: OK (12 workers)
Setting up worker servers: OK (12 workers)
PROOF set to parallel mode (12 workers)
root [1] p->SetLogLevel(2)
root [2] p->Process("/tmp/tutorials/proof/ProofSimple.C+", 1000000)
(Long64_t)(-1)
root [3] TProofLog *pl = TProof::Mgr("tditaller025.pic.es")->GetSessionLogs()
Retrieving logs: 1 ok, 0 not ok (0 % processed)
Retrieving logs: 2 ok, 0 not ok (0 % processed)
Retrieving logs: 3 ok, 0 not ok (0 % processed)
Retrieving logs: 4 ok, 0 not ok (0 % processed)
Retrieving logs: 5 ok, 0 not ok (0 % processed)
Retrieving logs: 6 ok, 0 not ok (0 % processed)
Retrieving logs: 7 ok, 0 not ok (0 % processed)
Retrieving logs: 8 ok, 0 not ok (0 % processed)
Retrieving logs: 9 ok, 0 not ok (0 % processed)
Retrieving logs: 10 ok, 0 not ok (0 % processed)
Retrieving logs: 11 ok, 0 not ok (0 % processed)
Retrieving logs: 12 ok, 0 not ok (0 % processed)
Retrieving logs: 13 ok, 0 not ok (100 % processed)
Thanks G.,
I think it is a problem with the installation. It is not finding some files which are under the src directory and which make install probably did not copy (it did not complain at all).
Concerning the commands you suggest, the ls command works well, but the other does not:
It has complained about a directory called plugins which was under:
src/root/etc/plugins/
and I have copied it to
and it does not complain anymore.
Now instead:
[code]root [1] p->SetLogLevel(2)
root [2] p->Load("/tmp/tutorials/proof/ProofSimple.C++")
Error in TPluginManager::FindHandler: Cannot find plugin handler for TVirtualStreamerInfo! Does $ROOTSYS/etc/plugins/TVirtualStreamerInfo exist?
*** Break *** segmentation violation
Error in TUnixSystem::StackTrace script /etc/root/gdb-backtrace.sh is missing
Root > .q
[/code]
The files are there, but not under $ROOTSYS but $ROOTSYS/src/root/etc…
export ROOTSYS=/software/at3/root/root-v5.30.04_slc57_gcc4.1.2_x86-64/root/
cd $ROOTSYS
mkdir src
cd src
wget ftp://root.cern.ch/root/root_v5.30.04.source.tar.gz
tar xvfz root_v5.30.04.source.tar.gz
cd root
./configure --prefix=$ROOTSYS
make
make install
Everything should then be under $ROOTSYS, but apparently it is not. I have manually copied:
tutorials to /tmp
and
$ROOTSYS/src/root/etc/plugins/ to $ROOTSYS/etc/
Is there something I am doing wrong?
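A quick way to see what else might be missing from the prefix is to test for the directories mentioned in this thread. A rough sketch only: `check_layout` is a hypothetical helper, and the list of directories (bin, lib, etc/plugins) is taken from the messages above, not from any official list of what an install must contain.

```shell
#!/bin/sh
# check_layout PREFIX: print every expected subdirectory that is absent under
# PREFIX and return non-zero if any is missing.
# Hypothetical helper; the directory list is an assumption based on this thread.
check_layout() {
    prefix=$1
    missing=0
    for d in bin lib etc/plugins; do
        if [ ! -d "$prefix/$d" ]; then
            echo "missing: $prefix/$d"
            missing=1
        fi
    done
    return $missing
}
```

Running `check_layout "$ROOTSYS"` right after the install step would show whether etc/plugins ended up under the prefix before any manual copying.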
Thanks so much!
Carlos