PROOF cluster network issues

Hi PROOF experts,

We have observed at least three recurring, seemingly random, intermittent error messages during job processing on the SLAC PROOF cluster (originally 7 worker machines with 49 worker cores, recently upgraded to 15 worker machines with 105 worker cores). These errors have been occurring for almost a full year (https://root-forum.cern.ch/t/proof-cluster-crashes-with-network-traffic/12940/1). In that time, we have spent considerable effort diagnosing the problems, but have found only partial work-arounds, which reduce the frequency of the crashes.

Problem 1: Master disconnects

This problem seems to be a weakness in xproofd, and the error, though random, produces a telltale signature in the xproofd log file. At some point during running, the worker registers a disconnect of the master:

Then, the master apparently reconnects and sends some signal to start processing:

However, the processes from before the disconnection apparently continue attempting to communicate via an invalid xrootd link (ignore the timestamp, this is from a different session, with debug output on for xproofd):

This continues for maybe 10 minutes before the xproofd daemon crashes with no error message, and is then restarted by a cron job:

This seems to happen more frequently with higher network traffic, as one might expect, but the fundamental problem seems to be the inability of the master to reconnect and recover the running worker processes successfully. Another solution would perhaps be an increase in whatever time-out period is used to determine a disconnect.
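
To illustrate the time-out idea above, here is a minimal sketch of how a longer time-out would tolerate longer network stalls before a disconnect is declared. It is a toy model only: the function `count_disconnects`, the arrival times and the time-out values are all hypothetical and are not taken from xproofd or from our logs.

```python
def count_disconnects(arrival_times, timeout):
    """Count gaps between consecutive messages that exceed `timeout` seconds."""
    gaps = (later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:]))
    return sum(1 for gap in gaps if gap > timeout)


# Hypothetical message arrival times in seconds; the 12 s and 35 s gaps
# stand in for stalls under heavy network traffic.
arrivals = [0, 5, 10, 22, 27, 62, 67]

for timeout in (10, 30, 60):
    print(f"timeout={timeout:>2}s -> {count_disconnects(arrivals, timeout)} disconnect(s)")
```

In this toy model a longer time-out only hides the stalls; it does nothing about a reconnection that fails to recover the running worker processes.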

Problems 2 & 3: ROOT file read errors

I think this is essentially one problem that manifests in two ways. The first happens when a worker that has finished processing its own local files starts processing files from another machine, in particular if the data file source machine is under heavy I/O load. What seems to happen is that the data arrives corrupted:

The source file in this case is NOT corrupted, just this particular network read. This can also manifest itself as missing branches. This also happens during merging (or sub-merging), if other workers on the source node are still running and saturating the disk I/O.

In this case, atlprf08 was still processing data when atlprf09 attempted the copy. I believe the local copy of the output file was corrupted in the same way the data file was in the earlier example.

Certain work-arounds help reduce the frequency of these errors. Local copies for merging, as seen above, help avoid network disconnects (Problem 1) during merging. Likewise, the “ForceLocal” option prevents workers from getting data from other nodes when they exhaust their local data, which helps prevent network read errors (Problem 2), but slows down the job depending on homogeneity of the dataset and distribution of the files on the cluster. “ForceLocal” also helped us notice that the mergers only failed when the data was being read from the node that still had processing workers on it. Restricting sub-mergers to individual physical machines also helps avoid network read errors, though to my knowledge this is not a built-in feature (I had to hack 5.28 to do it, though I have not tried it in 5.32, the version we now use).
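
The speed cost of “ForceLocal” mentioned above can be pictured with a small toy model. This is only a sketch under loud assumptions: one worker per host, one time unit per local file, a made-up `remote_cost` for remote reads, and a greedy rule for picking remote files. It is not PROOF's packetizer, and the function name `job_time` is hypothetical.

```python
import heapq


def job_time(files_per_host, force_local, remote_cost=1.5):
    """Toy estimate of the total job time.

    With force_local, a worker stops once its own files are done. Otherwise
    an idle worker takes a file from the host with the most files left and
    pays remote_cost (an assumed value) instead of 1.0 for it.
    """
    remaining = dict(files_per_host)
    # Heap of (time at which the worker becomes free, host name).
    free_at = [(0.0, host) for host in sorted(remaining)]
    heapq.heapify(free_at)
    finish = 0.0
    while free_at:
        now, host = heapq.heappop(free_at)
        if remaining[host] > 0:
            source, cost = host, 1.0
        elif force_local:
            continue  # nothing local left: this worker is done
        else:
            source = max(remaining, key=remaining.get)
            if remaining[source] == 0:
                continue  # no files left anywhere
            cost = remote_cost
        remaining[source] -= 1
        finish = max(finish, now + cost)
        heapq.heappush(free_at, (now + cost, host))
    return finish


uneven = {"atlprf08": 6, "atlprf09": 2}
even = {"atlprf08": 4, "atlprf09": 4}
print(job_time(uneven, force_local=True), job_time(uneven, force_local=False))
print(job_time(even, force_local=True), job_time(even, force_local=False))
```

With files spread evenly the two modes take the same time in this model; with an uneven spread the forced-local job is limited by the most loaded host. The model covers timing only and says nothing about the read errors themselves.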

Any ideas for what could be going wrong here?

Thanks,

Bart Butler

Hi Bart,

Sorry for the somewhat late reply.

For problem 1, we are aware of the weaknesses of the XrdProofd plug-in and we are in the process of re-implementing it; in particular, we aim to improve the handling of connections to the proofserv processes and the overall stability. Reconnections are fragile, as you are experiencing. However, if I understand correctly, you are using ROOT 5.28/… . There have been some fixes since then, so it may be worth trying a more recent version, possibly 5.32.
Also, I have added to the trunk and to the patch branches 5.28, 5.30 and 5.32 the possibility to disable the reconnection attempts. This may help avoid the xproofd crashes. If you are able to work with 5-28-00-patches, that may be worth trying.
But one should also try to understand why the connections go down under heavy network load. It looks like the underlying XrdClient connection breaks. In this respect too, moving to 5.32, which builds with xrootd 3.1.0, may help; there were several fixes, also in the client.
Note that in principle you could just upgrade the daemon and run PROOF 5.28 inside (see xpd.rootsys), though in such a case you will not get the improvements in XrdClient.

The I/O problems look closely related to the network TFile class used, which appears to be TXNetFile and therefore again XrdClient. LHCb and (I think) CMS observed similar problems, which were traced back to subtle bugs in XrdClient. Some of these were fixed in 5.30 and 5.32. The LHCb one is fixed in the Xrootd repository but, unfortunately, not yet tagged (I have already asked for a patch tag).

So, it would be really good if you could see whether the problems are at least reduced with 5.32. From the last sentence of your post, I understand that this should be possible.

G. Ganis

Ps:
For what relates to

yes, you are right, it is not built-in right now, but we have already received the request from another channel and we plan to implement it.

Hi Gerri,

Thanks for the reply.

[quote=“ganis”]Hi Bart,

Sorry for the somewhat late reply.

For problem 1, we are aware of the weaknesses of the XrdProofd plug-in and we are in the process of re-implementing it; in particular, we aim to improve the handling of connections to the proofserv processes and the overall stability. Reconnections are fragile, as you are experiencing. However, if I understand correctly, you are using ROOT 5.28/… . There have been some fixes since then, so it may be worth trying a more recent version, possibly 5.32.
[/quote]

Sorry for any confusion–we are using 5.32, and all of these problems are observed using 5.32.

As I stated above, we are using 5.32, but I think we are using an xrootd version ~3.0.5. Are there xrootd fixes in 3.1.0 that address this?

So this would be a patch for xrootd itself, not ROOT?

[quote=“ganis”]

So, it would be really good if you could see whether the problems are at least reduced with 5.32. From the last sentence of your post, I understand that this should be possible.

G. Ganis

Ps:
For what relates to

yes, you are right, it is not built-in right now, but we have already received the request from another channel and we plan to implement it.[/quote]

Do you know what the timescale for this is? In the meantime I will port my hack implementing this to 5.32, as this hack + ForceLocal are the only way I can run on large datasets reliably.

Hi Gerri,

If this helps, attached is my implementation of “MergersByHost”. The modifications are on top of the 5.32 production release. It seems to work pretty well. Maybe it’ll save you guys some time?
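
For readers without the attachments, the idea behind restricting sub-mergers to individual physical machines can be sketched roughly as follows. This is a standalone illustration and not the attached TProof code: the function `mergers_by_host`, the worker names and the rule of taking the first worker on each host as its merger are all assumptions made for the example.

```python
from collections import defaultdict


def mergers_by_host(workers):
    """Group (worker, host) pairs by host and pick one sub-merger per host.

    The first worker listed for a host is arbitrarily chosen as its merger;
    the other workers on that host send their output to it, so sub-merging
    involves no reads from another machine.
    """
    by_host = defaultdict(list)
    for name, host in workers:
        by_host[host].append(name)
    return {
        host: {"merger": names[0], "sends_to_merger": names[1:]}
        for host, names in by_host.items()
    }


# Hypothetical worker names on two of the cluster nodes.
workers = [
    ("0.0", "atlprf08"), ("0.1", "atlprf08"), ("0.2", "atlprf08"),
    ("0.3", "atlprf09"), ("0.4", "atlprf09"),
]
for host, plan in mergers_by_host(workers).items():
    print(host, plan)
```

Only the final merge of the per-host results on the master then has to cross the network.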

-Bart
TProof.h (45.9 KB)
TProof.cxx (385 KB)

Hi Bart,

Ok, sorry, I misunderstood.

[quote=“bbutler”]I think we are using an xrootd version ~3.0.5. Are there xrootd fixes in 3.1.0 that address this?
[/quote]
At least some issues were addressed.

Yes, that’s right. But for the client you just need to set up the new LD_LIBRARY_PATH. The ‘xproofd’ binary needs to be linked again with the new version of the libs. I do not know if you are using that …

Not exactly, but it should be short. I am discussing the details with the person responsible for the tags. We are waiting for a couple of confirmations … I hope we will be able to tag at the end of this week or the beginning of next.

It certainly will. Thanks a lot!

Gerri

All issues still present with xrootd 3.1.0. Any progress on those fixes/patches?

Hi Bart,

A slightly modified (aesthetics only) version of your patch is in the trunk (r43109). Would it be possible for you to check that it works as you expect? 5.32/01 is going to be tagged on Monday and there is still a chance that I port it there.

3.1.1-rc1 was cut last week and is under test by a few groups. Perhaps you could give it a try?

Gerri

Similar disconnect problems are observed with 3.1.1-rc1, though the error messages seem to have changed:

[quote=“ganis”]Hi Bart,

A slightly modified (aesthetics only) version of your patch is in the trunk (r43109). Would it be possible for you to check that it works as you expect? 5.32/01 is going to be tagged on Monday and there is still a chance that I port it there.

3.1.1-rc1 was cut last week and is under test by a few groups. Perhaps you could give it a try?

Gerri[/quote]

As an update, xrootd 3.1.1 has helped somewhat. We think the frequency of disconnect/reconnect-related crashes has decreased and, as far as we are aware, the read errors are gone. The MergersByHost in the trunk seems to work well too. So the only outstanding issue is the disconnects, and they are much less urgent than they once were.

-Bart