PROOF cluster crashes with network traffic

Hi PROOF experts,

We have been having a lot of trouble with our cluster crashing during instances of high network traffic (often during merging) and in particular if someone is downloading data to the cluster at the same time. These crashes are typified by xrootd logs which indicate that the user running their job has disconnected from the node, then various other error messages and finally a segmentation fault in the xproofd plugin which usually takes down the xrootd daemon on that node.

I should note that we have the option “intwait 20” in our xrootd.cf file. Could this internal timeout be too small? (I’m not sure what it controls.) We are using ROOT 5.28f, and we had this problem with 5.28a as well.

In discussions with a BNL sysadmin, they apparently may have a similar issue with their cluster which they address via a cron job which restarts the daemon, though this of course doesn’t prevent the crashes from occurring. Below is a sample xrootd log file from one of the crashes.

Thanks,

Bart

Hi,

Sorry for the late reply.
In the ALICE cluster we also observed cases where the PROOF connections suffered severely from high network traffic from other activities. There was no real solution, except to try to coordinate such activities so as to minimize the collisions.
Since the non-PROOF activity in the ALICE case was xrootd downloading data, it was suggested to limit the bandwidth available to port 1094 so that the PROOF port 1093 could get the minimal share it needed. But this has never been tried.

ALICE also experienced problems with submergers but these were typically due to high memory usage. Are you sure you are affected by network in such cases?

From your message it looks like you are running one xrootd daemon for both data serving and PROOF. If that is the case, I suggest going for the two-daemon setup, with the ‘xproofd’ binary for PROOF and xrootd for data serving. We observed much better stability with such a configuration, and it makes it easier to disentangle the problems.

G. Ganis

Thanks Gerri, we will try this.

As for the sub-merging, it seems to be working as well as non-sub-merging now, that is, they both sometimes crash, seemingly due to user disconnections like the log posted above. We’ll see what happens when we take out the PROOF plugin.

Thanks,

Bart

Hi Gerri,

My understanding of your suggestion is that we start xproofd separately, alongside xrootd and cmsd. If so, then we see the following problem:

110830 16:51:12 14820 ofs_remove: bcbutler.32706:34@atlprf01 Unable to remove /atlas/proof/bcbutler/session-atlprf01-1314745789-32706/worker-0.21-atlprf02-1314745790-14951/test.GbbG800L100.susy1004.root.merger; Permission denied

Basically, xproofd runs as root and changes file ownerships in /atlas/proof/, while xrootd runs as non-root. So when the PROOF master (atlproof01) asks one of its workers to delete a file (above), the request goes to the xrootd process, which, running as a daemon user, can’t delete the file. How do we deal with this situation?
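
To gauge how often this happens, the failures can be pulled out of the xrootd log. Below is a minimal sketch in Python, assuming every failure line has the same layout as the one quoted above. The field names (tid, pid, fd) are only my reading of that line, and the helper names are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the single ofs_remove line quoted above.
# Assumption: other failure lines follow the same layout.
REMOVE_FAILURE_RE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<tid>\d+) "
    r"ofs_remove: (?P<user>[^.\s]+)\.(?P<pid>\d+):(?P<fd>\d+)@(?P<host>\S+) "
    r"Unable to remove (?P<path>.+); (?P<reason>.+)$"
)


def parse_remove_failure(line):
    """Return a dict of fields for an ofs_remove failure line, else None."""
    match = REMOVE_FAILURE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_failures(lines):
    """Count failures per (user, reason) over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        record = parse_remove_failure(line)
        if record is not None:
            counts[(record["user"], record["reason"])] += 1
    return counts
```

Feeding an open log file to count_failures() gives a per-user tally, which should show whether the errors are confined to merger output files.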

Wei

Hi Wei,

Uhmm … I see the problem. This is while merging via file with sub-mergers, right? But it will affect other cases too.
There is no obvious solution for this right now, except forcing everything to go via PROOF. In 5.28 we introduced the possibility to use rootd via the same PROOF port to access the files in the sandbox; that solution should not suffer from this permission issue. I do not know, though, if you can try it out.

In general, the fact that xproofd is running as ‘root’ is a problem, and we are working on a solution which should remove this need. In this context we would like to rationalize all these permission issues.

Gerri

Hi Gerri,

Are you referring to the TProofMgr (root.cern.ch/drupal/content/accessing-sandbox)? I was also thinking of this and will ask Bart if he can delete some of the old files left there (especially those that can’t be deleted by xrootd). My main concern is files accumulating there and eating up all the space over time.
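
To keep an eye on the accumulation, something like the following could list the candidates for cleanup. It is only a sketch: the function name and the 30-day threshold in the usage note are my own assumptions, and it only reports files, it does not delete anything.

```python
import os
import stat
import time


def find_stale_files(root, max_age_days, now=None):
    """Yield (path, age_in_days) for regular files under root whose last
    modification is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                yield path, (now - info.st_mtime) / 86400.0
```

For example, find_stale_files("/atlas/proof", 30) would walk the sandbox area seen in the log above. Whoever runs it needs enough privilege to read the directories that xproofd created.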

Wei

Hi Wei,

That’s a possibility but, as you say, requires offline action.

I was referring to ‘xpd.rootd’: root.cern.ch/drupal/content/conf … uide#rootd . This will be active by default only for the files handled via TProofOutputFile and in principle it should not suffer from the permission problem, because ‘rootd’ is started by xproofd.
It requires at least 5.28c; remember to comment out any LOCALDATASERVER setting in xpd.cf .

Let me know if you try.

Gerri


Hi Gerri,

I just tried this, and it didn’t seem to make any difference…the errors still showed up in the xrootd log file. Also, I checked and despite those errors, the files are getting removed somehow, so this probably isn’t a big issue. Is there a way I can be sure rootd is being used? I commented out the LOCALDATASERVER line, tried xpd.rootd allow, xpd.rootd allow mode:rw, the works, nothing changed.

-Bart

Fixed: I was running an older version of the proofd servers (5.28a). rootd worked, but it seemed to increase the error rate on network-intensive jobs, so we are back to xrootd.