PROOF cluster setup on a US Tier3

Hi,

I’m running into some weird problems while trying to set up a PROOF cluster at our Tier3. I should say upfront that I’ve been configuring/using simple PROOF clusters for a long time now. The change now is that this will be a properly big cluster.

Authentication on the cluster is managed by LDAP. I mention this first, as I think it might be the source of my problems. Ideally, direct user access to the node running the PROOF master and to the nodes running the slaves would not be allowed. The node running the PROOF master is also responsible for running some pretty important services in the cluster, so we don’t want regular users poking around on it.

I compiled a custom version of ROOT 5.28d on the cluster. (This is to have a version of ROOT compatible with the system-default GCC 4.1. This has always made my life a lot easier in the past than having to set up GCC 4.3 for the PROOF daemon.) But I guess I’ll try 5.26 later on as well, as that performed very well on the small cluster I’ve been using so far.

I think I managed to configure the cluster more or less correctly. The configuration files (they’re not really private at this point) are attached to the post. (If I manage to upload them.) If I try to connect to the cluster as the special “xrootd” user (which is not supposed to be a real user on the cluster), then the connection is successful. PROOF tells me that it managed to connect to 16 workers. But when I try to connect to the cluster as myself (user “krasznaa”), I get this output:

root [0] p = TProof::Open( "t3head" );
Starting master: opening connection ... : mst-0:failure setting up proofserv: timed-out receiving status-of-setup from pipe
Error in <TXSocket::Create>: 4 creation/attachment attempts failed: no attempts left
Error in <TXSocket::Create:>: problems creating or attaching to a remote server (mst-0:failure setting up proofserv: timed-out receiving status-of-setup from pipe|log:/local/home/proof/krasznaa/session-t3head-1306012903-16687/master-0-t3head-1306012903-16687.log)
Error in <TXSocket::TXSocket>: create or attach failed (mst-0:failure setting up proofserv: timed-out receiving status-of-setup from pipe|log:/local/home/proof/krasznaa/session-t3head-1306012903-16687/master-0-t3head-1306012903-16687.log)
110521 17:22:03 001 Proofx-E: Conn::CheckResp: server [t3head.physics.nyu.edu:1093] did not return OK replying to last request
110521 17:22:03 001 Proofx-E: Conn::CheckErrorStatus: error 3006: 'session ID not found' : session ID not found
Error in <TProof::Open>: new session could not be created

This is what’s in the logfile of the master node:

110521 17:20:43 16631 xpd-I: krasznaa.1826:34@t3int0: ClientMgr::MapClient: user krasznaa logged-in; type: ClientMaster
110521 17:20:43 16631 xpd-E: Aux::ChangeOwn: could not get privileges to change ownership
110521 17:20:43 16631 xpd-E: ProofServ::SetAdminPath: unable to give ownership of the status file /tmp/.xproofd.1093/activesessions/krasznaa.default.16658.status to user; errno = 0

My first idea was that the problem must be that my user is not allowed to log into the node. But I even tried with the special (non-root) user that’s allowed to log into all the nodes. No difference.

As I wrote at the beginning, none of these users are local users. Their information all comes from an LDAP server. But for all other intents and purposes this has worked very well so far. (CERN is using a sort of LDAP mechanism on lxplus as well, after all.)

So now I’m stumped. Any help would be very welcome, as I’d really like to get PROOF up and running on our cluster.

Cheers,
Attila

PROOF configuration:
proof.cfg.txt (1.21 KB)

PROOF node list:
proof.conf.txt (1.21 KB)

Hi,

Trying ROOT 5.26.00e didn’t make any difference. I get the exact same results using this version.

I tried compiling ROOT with “--enable-ldap”, but I don’t think this has anything to do with the way PROOF does authentication. (ROOT compiled successfully with this option, but it didn’t change the results of my tests.)

Help would still be very much appreciated…

Attila

Hi Attila,

PROOF requires the ‘getpwnam’ call to succeed for the username that you want to use. At CERN the LDAP mechanism that you mention is set up to do this.
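A quick way to check this on the master and on each worker is to do the same lookup by hand. The snippet below is only a minimal sketch using Python's standard `pwd` module, which wraps the same passwd lookup; the helper name `lookup_user` is made up for illustration, and the username is the one from the logs above.

```python
import pwd


def lookup_user(username):
    """Return the passwd entry for username, or None if the lookup fails.

    pwd.getpwnam goes through the same name service lookup (files, LDAP, ...)
    that the C-level getpwnam call uses on this machine.
    """
    try:
        return pwd.getpwnam(username)
    except KeyError:
        # Raised when the name service does not know the user.
        return None


if __name__ == "__main__":
    name = "krasznaa"  # username taken from the logs in this thread
    entry = lookup_user(name)
    if entry is None:
        print(f"getpwnam failed for {name!r} on this node")
    else:
        print(f"{name}: uid={entry.pw_uid} gid={entry.pw_gid} home={entry.pw_dir}")
```

If the lookup fails on any node when run as the account that starts the daemon, the LDAP name service setup on that node is the first thing to look at.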

That said, how do you start the daemon? As a normal user or as a privileged user?

There may also be some issues with the permissions of some directories. Also, I advise running ‘xproofd’ directly, since it seems that you have another xrootd system for data serving on the machines.

Gerri

Hi Gerri,

I see. I’ll try to find out what’s different at CERN later on.

The daemon is running as a regular user, local to each machine. (So no LDAP involved there.)

The ‘xrootd’ user (who is running the xrootd process which runs PROOF) seems to have the correct privileges for all the local directories involved. When I connect to the cluster as the xrootd user itself, then everything goes fine. We ran some larger tests today, and those seemed to go fine.

I had a quick try with the xproofd daemon, but as I didn’t have quick success, I just stayed with the current configuration. It’s not in a usable state now, so I don’t want to invest too much effort for no apparent gain.

Cheers,
Attila

Hi Gerri,

The problem has shifted a bit in the meantime. When I try to connect as a “regular” (LDAP) user, I get this output on the client:

110529 03:59:16 001 Proofx-E: Conn::CheckResp: server [t3head.physics.nyu.edu:1093] did not return OK replying to last request
110529 03:59:16 001 Proofx-E: Conn::CheckErrorStatus: error 3006: 'unable to instantiate object for client krasznaa'
110529 03:59:16 001 Proofx-I: Conn::Login: t3head.physics.nyu.edu: unable to instantiate object for client krasznaa
110529 03:59:16 001 Proofx-E: Conn::GetAccessToSrv: client could not login at [t3head.physics.nyu.edu:1093]
110529 03:59:16 001 Proofx-E: Conn::Connect: failure: unable to instantiate object for client krasznaa
110529 03:59:16 001 Proofx-E: XrdProofConn: XrdProofConn: severe error occurred while opening a connection to server [t3head.physics.nyu.edu:1093]

And this is in my server’s log:

110528 21:59:16 12037 xpd-E: Aux::AssertDir: could not get privileges to create dir
110528 21:59:16 12037 xpd-E: XrdProofdSandbox: unable to create work dir: /local/home/proof/krasznaa
110528 21:59:16 12037 xpd-E: ClientMgr::GetClient: instance for {client, group} = {krasznaa, default} is invalid
110528 21:59:16 12037 xpd-E: ClientMgr::Login: unable to instantiate object for client krasznaa
110528 21:59:17 12037 xpd-I: Protocol::Recycle: user disconnected; type: ClientMaster

I tried to look at the code itself, but didn’t quite manage to understand what’s going on. Note that I’ve made /local/home/proof writable by everybody, just to avoid problems possibly arising from my username not being part of some group. But this didn’t change the “unable to create work dir” error message.

This seems to be a completely separate issue from authentication. I did manage to set up GSI authentication for the cluster in one test (I think), but I still bumped into this same error.

Cheers,
Attila

Dear All,

I figured out yesterday what was going wrong. I was forcing the xrootd daemon to run as the user “xrootd”. However, I wasn’t doing this with the “-R” parameter, which tells the daemon to assume an “effective” identity. Instead, I used the init.d script to explicitly start the daemon as this user.

After switching back to starting the daemon as root, and letting it take on the xrootd identity itself (while keeping root privileges in the background), the problem disappeared. I actually found a set of other problems next, but those could be sorted out with a bit of effort as well. I’ll be contacting Gerri directly about these issues later on.
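For anyone hitting the same “could not get privileges” messages, here is a rough sketch of the difference between the two ways of starting the daemon, in terms of plain Unix user IDs. It assumes (as described above) that a daemon started as root keeps root as its real or saved UID after taking on the xrootd identity; the function names and the example UID 500 are made up for illustration, and this is not the actual xrootd code.

```python
import os


def can_regain_root(ruid, euid, suid):
    """Return True if a process with these UIDs can act as root.

    An unprivileged process may only switch its effective UID to its real
    or saved UID, so root must appear in at least one of the three.
    """
    return euid == 0 or ruid == 0 or suid == 0


def describe_current_process():
    """Report the UIDs of the current process (Unix only)."""
    ruid, euid, suid = os.getresuid()
    return {
        "real": ruid,
        "effective": euid,
        "saved": suid,
        "can_regain_root": can_regain_root(ruid, euid, suid),
    }


if __name__ == "__main__":
    # Started as root, then switched the effective identity (hypothetical UID 500):
    print(can_regain_root(0, 500, 0))      # True
    # Started directly as the unprivileged user from an init.d script:
    print(can_regain_root(500, 500, 500))  # False
    print(describe_current_process())
```

In the second case the daemon has no way to change file ownership or create directories on behalf of another user, which matches the errors in the server log.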

Cheers,
Attila