I’m running into some weird problems while trying to set up a PROOF cluster at our Tier3. I should say upfront that I’ve been configuring and using simple PROOF clusters for a long time now. What’s new is that this one will be a properly big cluster.
Authentication on the cluster is managed by LDAP. I mention this first, as I think it might be the source of my problems. Ideally, direct user access to the node running the PROOF master and to the nodes running the slaves would not be allowed. The node running the PROOF master is also responsible for running some pretty important services in the cluster, so we don’t want regular users poking around on it.
I compiled a custom version of ROOT 5.28d on the cluster. (This is to have a version of ROOT compatible with the system-default GCC 4.1. This has always made my life a lot easier in the past than having to set up GCC 4.3 for the PROOF daemon.) But I guess I’ll try 5.26 later on as well, as that performed very well on the small cluster I’ve been using so far.
I think I managed to configure the cluster more or less correctly. The configuration files (they’re not really private at this point) are attached to the post. (If I manage to upload them.) If I try to connect to the cluster as the special “xrootd” user (which is not supposed to be a real user on the cluster), then the connection is successful. PROOF tells me that it managed to connect to 16 workers. But when I try to connect to the cluster as myself (user “krasznaa”), I get this output:
root  p = TProof::Open( "t3head" );
Starting master: opening connection ...
: mst-0:failure setting up proofserv: timed-out receiving status-of-setup from pipe
Error in <TXSocket::Create>: 4 creation/attachment attempts failed: no attempts left
Error in <TXSocket::Create:>: problems creating or attaching to a remote server (mst-0:failure setting up proofserv: timed-out receiving status-of-setup from pipe|log:/local/home/proof/krasznaa/session-t3head-1306012903-16687/master-0-t3head-1306012903-16687.log)
Error in <TXSocket::TXSocket>: create or attach failed (mst-0:failure setting up proofserv: timed-out receiving status-of-setup from pipe|log:/local/home/proof/krasznaa/session-t3head-1306012903-16687/master-0-t3head-1306012903-16687.log)
110521 17:22:03 001 Proofx-E: Conn::CheckResp: server [t3head.physics.nyu.edu:1093] did not return OK replying to last request
110521 17:22:03 001 Proofx-E: Conn::CheckErrorStatus: error 3006: 'session ID not found'
: session ID not found
Error in <TProof::Open>: new session could not be created
This is what’s in the logfile of the master node:
110521 17:20:43 16631 xpd-I: krasznaa.1826:34@t3int0: ClientMgr::MapClient: user krasznaa logged-in; type: ClientMaster
110521 17:20:43 16631 xpd-E: Aux::ChangeOwn: could not get privileges to change ownership
110521 17:20:43 16631 xpd-E: ProofServ::SetAdminPath: unable to give ownership of the status file /tmp/.xproofd.1093/activesessions/krasznaa.default.16658.status to user; errno = 0
My first idea was that the problem must be that my user is not allowed to log into the node. But I even tried with the special (non-root) user that’s allowed to log into all the nodes. No difference.
As I wrote in the beginning, none of these users are local users. Their information all comes from an LDAP server. But for all other intents and purposes this has worked very well so far. (CERN is using a sort of LDAP mechanism on lxplus as well, after all.)
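One way to double-check that the LDAP accounts are visible on a given node is to ask the system's account database directly. The following is only a minimal sketch, not part of the actual setup: `lookup_account` is a hypothetical helper name, and it assumes Python is available on the node and that LDAP is hooked into the normal system account lookup, which is what Python's standard `pwd` module queries.

```python
import pwd


def lookup_account(username):
    """Return basic account info as the system resolves it, or None if unknown.

    Hypothetical helper for illustration; it only wraps pwd.getpwnam().
    """
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        # The account could not be resolved on this node.
        return None
    return {
        "name": entry.pw_name,
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "home": entry.pw_dir,
    }


if __name__ == "__main__":
    # The two accounts mentioned in this post.
    for user in ("krasznaa", "xrootd"):
        info = lookup_account(user)
        print(user, "->", info if info else "not resolvable on this node")
```

Running it on the master and on a worker would show whether both accounts resolve to the same UID and GID everywhere. It says nothing about whether the PROOF daemon itself has the privileges it needs to change file ownership.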
So now I’m stumped. Any help would be very welcome, as I’d really like to get PROOF up and running on our cluster.
proof.cfg.txt (1.21 KB)
PROOF node list:
proof.conf.txt (1.21 KB)