PROOF issues

Hi,

I’m trying to install PROOF on a tiny 12-slave farm. No security involved, as the nodes are in a private subnet only reachable through a server. So all nodes (called node-1…node-6, 2 CPUs each) are in the hosts.equiv file, which means no passwords are needed: whoever has managed to get onto the server is also allowed to log on to the nodes.

First question: I don’t know how to enable hosts.equiv “authentication” using proof. I haven’t seen a switch for that (I was guessing at the “uidgid” param for proof.conf, but apparently that’s not it). Would that be difficult to implement? What does “uidgid” do? There’s no doc on it in README.PROOF.

Second problem. We have krb5 on our system. We use it for FNAL access, but not for accessing our small local farm. But with the current proofd.cxx there’s no way to a) disable krb5 initialization, or b) make proofd continue after krb5 initialization failed. Would it be possible to change the fatal error in case of unsuccessful krb5 initialization into a warning? I did that for our local setup, to be able to continue - to the…

Third problem. I created a pwd hash file /home/naumann/.rootdpass ($HOME is shared between server and all nodes via NFS). I learned that an empty password has a special meaning for TAuth* (would it make sense to change that?). Okay, so I selected an “almost empty” password. But now I get stuck here (“tail -n relevant” from “gDebug=7; gROOT->Proof(“node-6”)” output):

naumann@node-6 password:
Info in <TAuthenticate::SecureSend>: local: enter ... (key: 1)
Info in <TAuthenticate::SecureSend>: local: sent 22 bytes (expected: 22)
Info in <TAuthenticate::ClearAuth>: after kROOTD_PASS: kind= 2001, stat= 9
Info in <TAuthenticate::ClearAuth>: received from server: user: naumann, offset: 0 (naumann 0)
Info in <TAuthenticate::SecureRecv>: got len '22' 22 (msg kind: 2039)
Info in <TAuthenticate::SecureRecv>: local: decoded string is 8 bytes long
Info in <TAuthenticate::ClearAuth>: received from server: token: 'oStuK5ob'
Info in <TAuthenticate::DecodeDetails>: analyzing ... pt:1 ru:1 cp:1 us:naumann
Info in <TAuthenticate::DecodeDetails>: Pt:1, Ru:1, Us:naumann
Info in <TAuthenticate::GetRemoteLogin>: details:pt:1 ru:1 cp:1 us:naumann
Info in <TAuthenticate::GetRemoteLogin>: returning: naumann
Info in <TAuthenticate::GetOffSet>: analyzing: Method:0, Details:pt:1 ru:1 cp:1 us:naumann
Info in <TAuthenticate::DecodeDetails>: analyzing ... pt:1 ru:1 cp:1 us:naumann
Info in <TAuthenticate::DecodeDetails>: Pt:1, Ru:1, Us:naumann
Info in <TAuthenticate::GetOffSet>: found Nw: 1, Wd: naumann (null) (null) (null)
Info in <TAuthenticate::GetOffSet>: found entry: met:0 det:pt:1 ru:1 cp:1 us:naumann off:0
Info in <TAuthenticate::GetOffSet>: returning: 0
Info in <TAuthenticate::SecureSend>: local: enter ... (key: 1)
Info in <TAuthenticate::SecureSend>: local: sent 22 bytes (expected: 22)
Info in <TPluginManager::FindHandler>: did not find plugin for class TSystem and uri /home/naumann/.rootauthrc

This is the last line of output I get; root doesn’t return to its prompt. Looking at the “proofd -d 3” server’s output in /var/log/messages:

Mar 10 16:33:46 node-6 proofd[18275]: ProofdLogin: user naumann authenticated
Mar 10 16:33:46 node-6 proofd[18275]: Authenticate: kind:2000 -- Meth:0 -- gAuth:1 -- gNumLeft:1
Mar 10 16:33:46 node-6 proofd[18275]: ProofdExec: send Okay (gSockFd: 0)
Mar 10 16:33:46 node-6 proofd[18275]: ProofdExec: execv(/usr/local/root/bin/proofserv, proofserv, /usr/local/root, /usr/tmp, d0-diskserver.d0farm.nijmegen, 9065, naumann)
Mar 10 16:33:59 node-6 proofd[18290]: Can't open/find Kerberos configuration file while initializing krb5
Mar 10 16:33:59 node-6 proofd[18290]: main: pid = 18290, gInetdFlag = 1
Mar 10 16:33:59 node-6 proofd[18290]: RpdSetDebugFlag: gDebug set to 3
Mar 10 16:33:59 node-6 proofd[18290]: RpdDefaultAuthAllow: Enter
Mar 10 16:33:59 node-6 proofd[18290]: RpdCheckDaemon: Enter ... sshd
Mar 10 16:33:59 node-6 kernel: kmod: failed to exec -s -k block-major-107, errno = 2
Mar 10 16:33:59 node-6 kernel: kmod: failed to exec -s -k block-major-107, errno = 2
Mar 10 16:33:59 node-6 proofd[18290]: RpdCheckDaemon: read 0 lines
Mar 10 16:33:59 node-6 proofd[18290]: RpdCheckSshd: Enter ...
Mar 10 16:33:59 node-6 proofd[18290]: RpdCheckSshd: cannot connect to local port 22
Mar 10 16:33:59 node-6 proofd[18290]: RpdDefaultAuthAllow: default list of secure methods available: 0 2
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: Enter: rcfile:
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: checking system: /usr/local/root/etc/system.rootrc
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: Proofd.Authentication:: 0 (0)
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: system: -1: /usr/local/root/etc/system.rootrc
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: user: -1:
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: checking local: .rootrc
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: local: -1:
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: checking system proof.conf: /usr/local/root/proof/etc/proof.conf
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-1 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-1 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-2 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-2 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-3 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-3 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-4 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-4 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-5 image=nfs uidgid # usrpwd
Mar 10 16:33:59 node-6 proofd[18290]: CheckGlobus: slave node-5 image=nfs uidgid # usrpwd
Mar 10 16:34:00 node-6 proofd[18290]: CheckGlobus: slave node-6 image=nfs uidgid # usrpwd
Mar 10 16:34:00 node-6 proofd[18290]: CheckGlobus: slave node-6 image=nfs uidgid # usrpwd
Mar 10 16:34:00 node-6 proofd[18290]: CheckGlobus: proof.conf: -1: /usr/local/root/proof/etc/proof.conf
Mar 10 16:34:00 node-6 proofd[18290]: CheckGlobus: exit: -1: .rootrc
Mar 10 16:34:00 node-6 proofd[18290]: ProofdExec: gGlobus: -1, gRcFile: .rootrc
Mar 10 16:34:00 node-6 proofd[18290]: ProofdExec: gOpenHost = node-6.d0farm.nijmegen
Mar 10 16:34:00 node-6 proofd[18290]: ProofdExec: gConfDir = /usr/local/root
Mar 10 16:34:00 node-6 proofd[18290]: ProofdExec: master/slave = slave
Mar 10 16:34:00 node-6 proofd[18290]: Authenticate got: 2012 --
Mar 10 16:34:00 node-6 proofd[18290]: RpdGuessClientProt: Enter: buf: '', kind: 2012
Mar 10 16:34:00 node-6 proofd[18290]: RpdGuessClientProt: guess for gClientProtocol is 9

So - who’s waiting for what?

Last comment: in README/README.PROOF line 81 it says

Should that read

(";" instead of “,”)?

I’d appreciate help with this boring, non-fancy and thus pretty unique setup.
Axel.

Hi,
I have some more info on issue 3, the biggest one. As soon as I reduce the nodes to the PROOF server node everything works. But when adding e.g. node-2, I get

Mar 11 09:48:57 node-2 proofd[12567]: Can't open/find Kerberos configuration file while initializing krb5
Mar 11 09:48:57 node-2 kernel: kmod: failed to exec -s -k block-major-107, errno = 2
Mar 11 09:48:57 node-2 kernel: kmod: failed to exec -s -k block-major-107, errno = 2
Mar 11 09:48:57 node-2 proofd[12567]: RpdCheckSshd: cannot connect to local port 22

and then gROOT->Proof(“node-6”) is stuck again, with the same gDebug=7 output as above. The server’s /var/log/messages says

Mar 11 09:48:56 node-6 proofd[28210]: ProofdLogin: user naumann authenticated
Mar 11 09:48:56 node-6 proofd[28210]: Authenticate: kind:2000 -- Meth:0 -- gAuth:1 -- gNumLeft:1
Mar 11 09:48:56 node-6 proofd[28210]: ProofdExec: send Okay (gSockFd: 0)
Mar 11 09:48:56 node-6 proofd[28210]: ProofdExec: execv(/usr/local/root/bin/proofserv, proofserv, /usr/local/root, /usr/tmp, d0-diskserver.d0farm.nijmegen, 30602, naumann)

(d0-diskserver is the server which allows access to the nodes, i.e. the machine I issued gROOT->Proof() from). Do I need to disable ssh somehow? Where is the ProofExec command sent to?
Axel.

Hi,

The authentication parts behave differently depending on the version of ROOT. For ROOT 4.00.xx see the talk at the ROOT User’s Workshop. You should be able to explicitly request that your PROOF cluster NOT use Kerberos (you can also simply build with Kerberos disabled).

Cheers,
Philippe

Heehee, sorry :-] You’re right, I forgot to specify the version number. It’s root-cvs on linux.

Looking at the code it seems that as soon as root is built with krb5 it will stop with a fatal error if krb5 fails to initialize. There’s a (paraphrased) #ifdef R__KRB5 … if (failed_to_initialize) Error(ErrFatal) #endif - I don’t see how to get out of that at runtime. I still believe this behaviour is inappropriate. I believe the “disable krb5 for proof” switch only comes into play after krb5 was initialized.
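
To make the request concrete, here is a minimal sketch of the "warn and continue" behaviour. It is not the actual proofd.cxx code: `InitKerberos`, `SetupKerberos` and `AuthSetup` are hypothetical names, and the "initialization" is reduced to checking that a config file can be opened.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical result type: tells the caller whether krb5 may be offered.
enum class AuthSetup { kKrb5Available, kKrb5Unavailable };

// Hypothetical stand-in for the real krb5 setup: here "initialization"
// only means that the configuration file can be opened.
bool InitKerberos(const std::string& configPath) {
    std::ifstream conf(configPath);
    return conf.good();
}

// Instead of aborting with a fatal error, log a warning and report
// that Kerberos is unavailable, so other methods can still be used.
AuthSetup SetupKerberos(const std::string& configPath) {
    if (!InitKerberos(configPath)) {
        std::fprintf(stderr,
                     "warning: krb5 initialization failed (%s), "
                     "continuing without Kerberos\n",
                     configPath.c_str());
        return AuthSetup::kKrb5Unavailable;
    }
    return AuthSetup::kKrb5Available;
}
```

The caller would then simply leave krb5 out of the list of offered methods when it gets `kKrb5Unavailable`.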

I now tried it with root built without krb5. It takes a while, but then gROOT->Proof says “PROOF set to parallel mode (8 slaves)”. Yippee! So your hint was a good one, Philippe. Thanks!

Now there are still four issues left:

  • unreachable slaves make proof freeze,
  • hosts.equiv is ignored,
  • empty passwords are misinterpreted,
  • failed krb5 init makes proof abort.

I’m willing to help by implementing hosts.equiv authentication, if someone points me to where to add it. The krb5 thing is trivial, and the empty pw might have side effects I don’t realize.

Axel.

Hello,
if you don’t need krb5 authentication, I think you can use the standard method
UsrPwd. It’s not necessary to put this into the proof.conf file, but you must put it
into the .rootrc and .rootauthrc files.

The problem that PROOF freezes if there are unreachable slaves is strange, because it
often happens that slaves die during processing. But then PROOF puts these slaves on a bad-slave list and the rest keeps working fine.
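
The bad-slave idea can be sketched roughly as follows. This only illustrates the bookkeeping and is not PROOF's implementation: `WorkerPool`, `PruneUnreachable` and the `isAlive` callback are hypothetical names.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bookkeeping: workers that still respond and workers that failed.
struct WorkerPool {
    std::vector<std::string> active;
    std::vector<std::string> bad;
};

// Move every worker for which isAlive() returns false onto the bad list;
// the remaining workers stay active and keep processing.
void PruneUnreachable(WorkerPool& pool,
                      const std::function<bool(const std::string&)>& isAlive) {
    std::vector<std::string> stillActive;
    for (const auto& name : pool.active) {
        if (isAlive(name)) {
            stillActive.push_back(name);
        } else {
            pool.bad.push_back(name);
        }
    }
    pool.active = std::move(stillActive);
}
```

With `active = {node-1, node-2, node-3}` and a check that fails only for node-2, the pool ends up with two active workers and node-2 on the bad list.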

Cheers Carsten@GSI

Hi,
thanks, Carsten, I learned that in the meantime from Gerardo: having uidgid in proof.conf is not enough; after setting it as default in .rootauthrc everything works nicely. And you’re right: it even works with slaves in proof.conf which are down! Perfect. I also found a problem which makes the PROOF start-up slower (/dev/random blocking; in my case it took up to 20 minutes), so I’m paying back for your (and Gerardo’s!) help at least a bit ;-)
Cheers, Axel.
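
For anyone hitting the same /dev/random stall: one common way around it on Linux is to read seed bytes from /dev/urandom, which normally does not wait for the entropy pool to fill. The sketch below only illustrates that idea and is not necessarily what ROOT does; `ReadSeedBytes` is a hypothetical helper.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical helper: read n bytes from a random device.
// Defaults to /dev/urandom to avoid the long waits /dev/random can cause.
// Returns an empty vector if the device cannot be opened or read.
std::vector<unsigned char> ReadSeedBytes(std::size_t n,
                                         const char* device = "/dev/urandom") {
    std::vector<unsigned char> buf(n);
    std::ifstream in(device, std::ios::binary);
    if (!in || !in.read(reinterpret_cast<char*>(buf.data()),
                        static_cast<std::streamsize>(n))) {
        buf.clear();
    }
    return buf;
}
```

Whether /dev/urandom is acceptable depends on how the random bytes are used; that is a judgment call for the authentication code, not something this sketch settles.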