We here at GSI are trying to set up a PROOF cluster on top of an LSF cluster
(ROOT version 3.10.02).
We submit the proofd daemons as standard batch jobs in their own queue,
with the foreground flag set.
Starting the proofd daemons causes no problems, but when we try to
connect to them ( gROOT->Proof("lxb108:1095") ) only a few of
them respond. The others fail during authentication:
Warning in TAuthenticate::ClearAuth on master:
Potential problems: got msg type: 2038 value: 1 (expecting: 0 0)
*** Break *** on master: write on a pipe with no one to read it
SysError in TUnixSystem::UnixSend on master: send (Broken pipe)
Error in TUnixSystem::SendRaw on master: cannot send buffer
…
The behaviour of the slaves changes as we test different configurations:
both the number of slaves we can connect to and the error messages vary.
We tried working with .rootdpass and .rootnetrc, but that works no better
than without them.
Can you make it work without having the proofd daemons started by LSF?
Cheers, Fons.
Hi Fons,
yes, we can. We set up a PROOF cluster in a local network before
and everything worked fine.
I think that is because we don't need .rootrc etc. to authenticate in a local
network (the username and password prompted for by PROOF are all we need there).
It seems to be a problem with password transmission; the strange
thing is that it affects only some of the slaves.
To figure out more precisely what’s going on we need some more
debugging printouts.
Could you please re-run with ‘Root.Debug: 3’ on the client and master?
Also, if you have access to the proofd outputs, could you launch
them with the option ‘-d 3’ in addition?
Gerri Ganis
Thanks very much for your help,
I started the proofd daemons with debug level 3 and saved the output, but I deleted a few lines to
reduce the text (deleted lines are marked with …).
I am sending you the printouts as an attachment.
They are in this order:
master (ok)
slave (ok)
slave (not ok)
slave (not ok)
slave (not ok)
slave (not ok)
Unfortunately the output of the second slave differs from that of the third, fourth and fifth slaves.
Thanks for the outputs, which are indeed very useful.
The output of slave 2 is different because it is the one causing the
problem that makes slaves 3, 4 and 5 fail.
The problem on slave 2 is caused by something that should
not happen: the master tries to reuse the same authentication context
it used for slave 1. This is strange because the hosts are different (lxb109
and lxb110). To understand why this happens, I need the output
on the "client" side, i.e. what you get on your screen and/or in the
$HOME/proof/master_***/master.log file, when you run with
"Root.Debug: 3" (in your ".rootrc" and in the one seen by the master).
As a consequence, slave 2 fails because it tries to read a file with the
key, which is not there. This makes TAuthenticate on the master think
that the password used is wrong, and it no longer uses it for
the other slaves, which explains the behaviour of slaves 3, 4 and 5.
To force the master not to reuse any authentication, you should set
UsrPwd.ReUse: 0
in the .rootrc seen by the master: according to your output, it should be
(if you cannot modify this file, create a .rootrc in the dvlambda $HOME
on the master containing the above line).
This is what I can suggest for now.
If you can, please also send me the outputs from the client side so
that I can try to understand what is going wrong.
I have been able to reproduce your problem on my setup and have also
found the bug causing it.
To cure it, there are two possibilities:
1: the cleanest would be to recompile ROOT with the corrected version
of TAuthenticate.cxx; if you are in a position to recompile your
ROOT installation (or to ask someone to do it), you can find attached
the CVS patch or the normal diff for net/src/TAuthenticate.cxx
2: the less clean way: create a file called .rootauthrc in the $HOME
of the account from which you start the PROOF session as client; put
the line
(if you already have a $HOME/.rootauthrc, add this line to it).
It is important that you specify the slave names in full, i.e. no
wildcard character (*) should appear (the problem comes exactly from
there).
Please let me know whether it works with either of these two fixes.
Sorry for the inconvenience and thanks for having found the problem.
Gerri Ganis
Although there was indeed a bug in TAuthenticate, as explained in the
previous post, the problem occurred only if no authentication
directive was given in the proof.conf file.
Looking again at the output that you attached to your last post, I have
just realized that in your proof.conf file you have
slave lxb109 port=1095 UsrPwd
which, unfortunately, is equivalent to ‘slave lxb109 port=1095’ since
the check on the method name is case sensitive, so one should use
‘usrpwd’ as indicated in proof/etc/proof.conf.sample.
If you change ‘UsrPwd’ in your lines to ‘usrpwd’ the problem should
disappear.
Could you please let me know if any of these recipes works?
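In case it helps to check a longer proof.conf for the same slip, here is a minimal standalone C++ sketch (it does not use any ROOT classes). It only illustrates the case-sensitivity point above: it flags "slave" lines in which a token spells "usrpwd" in a different letter case, such as "UsrPwd". The function names IsMiscasedUsrPwd and LineHasMiscasedUsrPwd are hypothetical helpers made up for this example, and splitting the line on whitespace is an assumption about the file layout, not a description of how PROOF itself parses the file.

```cpp
#include <algorithm>
#include <cassert>   // only needed if you check the helpers with assert()
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: true if `token` spells "usrpwd" in some other
// letter case (e.g. "UsrPwd"), so a case-sensitive comparison against
// "usrpwd" would not match it.
bool IsMiscasedUsrPwd(const std::string &token)
{
   std::string lower(token);
   std::transform(lower.begin(), lower.end(), lower.begin(),
                  [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
   return lower == "usrpwd" && token != "usrpwd";
}

// Hypothetical helper: true if `line` starts with the word "slave" and
// contains a mis-cased "usrpwd" token. Assumes whitespace-separated
// fields, as in the line quoted above.
bool LineHasMiscasedUsrPwd(const std::string &line)
{
   std::istringstream in(line);
   std::string word;
   if (!(in >> word) || word != "slave")
      return false;
   while (in >> word) {
      if (IsMiscasedUsrPwd(word))
         return true;
   }
   return false;
}
```

With these helpers, LineHasMiscasedUsrPwd("slave lxb109 port=1095 UsrPwd") returns true, while the same line written with ‘usrpwd’ returns false.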