Problems with setting up a PROOF cluster

Hello,

We at GSI are trying to set up a PROOF cluster on top of an LSF cluster
(ROOT version 3.10.02).

We submit the proofd's as standard batch jobs in their own queue,
with the foreground flag set.

Starting the proofd's causes no problems, but when we try to
connect to them ( gROOT->Proof("lxb108:1095") ) only a few of
them respond. The others fail during authentication:

Warning in TAuthenticate::ClearAuth on master:
Potential problems: got msg type: 2038 value: 1 (expecting: 0 0)
*** Break *** on master: write on a pipe with no one to read it
SysError in TUnixSystem::UnixSend on master: send (Broken pipe)
Error in TUnixSystem::SendRaw on master: cannot send buffer

Warning in TAuthenticate::PromptPasswd on master: not tty:
cannot prompt for passwd, returning -1
Info in TAuthenticate::Authenticate on master:
failure: list of attempted methods: UsrPwd
Error in TSlave::TSlave on master:
authentication failed for dvlambda@lxb110

The behaviour of the slaves changes as we test different
configurations, so the number of slaves we can connect to and the
error messages change as well.

We tried working with .rootdpass and .rootnetrc, but that works no better
than without them.

We appreciate your help.

Thanks,
Carsten Preuss @ GSI.

Hi Carsten,

can you make it work without the proofd's being started by LSF?

Cheers, Fons.

Dear Carsten,

It looks like a problem with password transmission; the strange
thing is that it affects only some of the slaves.

To figure out more precisely what’s going on we need some more
debugging printouts.

Could you please re-run with 'Root.Debug: 3' on the client and master?

Also, if you have access to the proofd outputs, could you launch
them with the option '-d 3' in addition?
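
For reference, a minimal sketch of the two settings. The .rootrc line is the one quoted above; the proofd command line is only an assumption about your setup: -f (foreground) and -p 1095 (port) are guessed from the description and the connect string in the first post, so adapt them to whatever your batch job actually uses.

```
# .rootrc read by the client, and the one read by the master
Root.Debug:  3

# proofd started with debug level 3 (-f and -p 1095 are assumed here)
proofd -f -d 3 -p 1095
```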

Gerri Ganis

[quote=“rdm”]Hi Carsten,

can you make it work without the proofd's being started by LSF?

Cheers, Fons.[/quote]

Hi Fons,
yes, we are able to. We started a PROOF cluster in a local network before,
and everything worked fine.
I think that is because we don't need .rootrc, etc. to authenticate in a local
network (the username and password prompted for by PROOF are all we need there).

Ciao, Carsten.

[quote=“ganis”]Dear Carsten,

It looks like a problem with password transmission; the strange
thing is that it affects only some of the slaves.

To figure out more precisely what’s going on we need some more
debugging printouts.

Could you please re-run with 'Root.Debug: 3' on the client and master?

Also, if you have access to the proofd outputs, could you launch
them with the option '-d 3' in addition?

Gerri Ganis[/quote]

Thanks very much for your help,

I started the proofd's with debug level 3 and saved the output, but I deleted a few lines to
reduce the text (deleted lines are marked with …).
I am sending you the printouts as an attachment.
They are in this order:
master (ok)
slave (ok)
slave (not ok)
slave (not ok)
slave (not ok)
slave (not ok)

Unfortunately, the output of the second slave differs from that of the third, fourth and fifth slaves.

Carsten.
debug_err.txt (31.5 KB)

Hi Carsten,

Thanks for the outputs, which are indeed very useful.
The output of slave 2 is different because it is the one causing the
problem that makes slaves 3, 4 and 5 fail.

The problem on slave 2 is caused by something strange that should
not happen: the master tries to reuse the authentication context
used for slave 1, even though the hosts are different (lxb109
and lxb110). To understand why this happens, I need the output
on the "client" side, i.e. what you get on your screen and/or in the
$HOME/proof/master_***/master.log file, when you run with
"Root.Debug: 3" (in your ".rootrc" and in the one seen by the master).

As a consequence, slave 2 fails because it tries to read a key file
that is not there. This makes TAuthenticate on the master think
that the password used is wrong, so it no longer uses it for
the other slaves, which explains the behaviour of slaves 3, 4 and 5.

To force the master not to reuse any authentication, you should set

  UsrPwd.ReUse:  0

in the .rootrc seen by the master; according to your output, that should be

/d/alice04/alisoft/PPR/root/gcc295-04/v3-10-02/etc/system.rootrc

(if you cannot modify this file, create a .rootrc in the dvlambda $HOME
on the master containing the above line).

This is what I can suggest for now.
Please, if you can, also send me the outputs from the client side so
that I can try to understand what is going wrong.

Cheers, Gerri

Hi Carsten,

I have been able to reproduce your problem on my setup and have also
found the bug that causes it.

To cure it, there are two possibilities:

  1: the cleanest would be to recompile ROOT with the corrected version
      of TAuthenticate.cxx; if you are in a position to recompile your
      ROOT installation (or to ask someone to do it), you can find attached
      the CVS patch or the normal diff for net/src/TAuthenticate.cxx

  2: the less clean way: create a file called .rootauthrc in the $HOME
      of the account where you start the PROOF session as client; put
      the line

proofserv lxb108:dvlambda:0 lxb109:dvlambda:0 lxb110:dvlambda:0

      (if you already have a $HOME/.rootauthrc, add this line to it).
      It is important that you specify the slave names in full, i.e. no
      wildcard character (*) should appear (that is exactly where the
      problem comes from).
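
If the list of slaves is long, the line can also be generated rather than typed. The following is only a sketch in standalone C++11 (not a ROOT macro); the helper name makeProofservLine is hypothetical and not part of ROOT, and the trailing 0 is simply copied from the line above. It refuses host names containing a wildcard, since those are what trigger the bug.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of ROOT): builds a "proofserv" line for
// $HOME/.rootauthrc in the format shown above, with one
// host:user:method entry per node.
std::string makeProofservLine(const std::vector<std::string>& hosts,
                              const std::string& user,
                              int method = 0)  // 0 copied from the line above
{
    std::string line = "proofserv";
    for (const std::string& host : hosts) {
        // Every host must be named in full: reject empty names and wildcards.
        if (host.empty() || host.find('*') != std::string::npos) {
            throw std::invalid_argument(
                "host name must be fully specified: '" + host + "'");
        }
        line += " " + host + ":" + user + ":" + std::to_string(method);
    }
    return line;
}
```

Called with the hosts lxb108, lxb109, lxb110 and the user dvlambda, this returns the proofserv line given above; a name such as lxb* makes it throw instead of writing a line that would hit the bug.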

 Please let me know whether either of these two fixes works.

 Sorry for the inconvenience and thanks for having found the problem.

 Gerri Ganis

diff.TAuthenticate.cxx (332 Bytes)
cvs-patch.TAuthenticate.cxx (541 Bytes)

Hi Carsten,

Though there was actually a bug in TAuthenticate, as explained in the
previous post, the problem occurred only if no authentication
directive was given in the proof.conf file.
Looking again at the output that you attached to your last post, I have
just realized that in your proof.conf file you have

slave lxb109 port=1095 UsrPwd

which, unfortunately, is equivalent to 'slave lxb109 port=1095', since
the check on the method name is case sensitive, so one should use
'usrpwd' as indicated in proof/etc/proof.conf.sample.
If you change 'UsrPwd' in your lines to 'usrpwd', the problem should
disappear.
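
To illustrate why the capitalisation matters, here is a small standalone C++11 sketch. It is not the actual ROOT code: matchesExactly is a hypothetical stand-in for a case-sensitive check of the method name, and toLower shows one way a token could be normalised before it is written to proof.conf.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical stand-in for a case-sensitive check of the method name;
// an illustration only, not the check as implemented in ROOT.
bool matchesExactly(const std::string& token)
{
    return token == "usrpwd";
}

// Lower-case a token, e.g. to normalise a method name before it goes
// into proof.conf.
std::string toLower(std::string token)
{
    std::transform(token.begin(), token.end(), token.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return token;
}
```

With such a check, "UsrPwd" is not recognised while "usrpwd" is, which is why the corrected line would read 'slave lxb109 port=1095 usrpwd'.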

Could you please let me know if any of these recipes works?

Cheers, Gerri

Hallo Gerri,

A thousand thanks for your help.
We changed the line in .rootrc to

UsrPwd.ReUse 0

and edited the .rootauthrc file.

Now everything is working fine. We don't need to use the authentication-method line in the .proof.conf at this time.

Thanks Carsten@GSI