FIXED: xproofd crashes on disconnect, failures to reconnect

bbutler · June 23, 2012, 3:21am

Hi Gerri,

Sorry in advance for the long-winded explanation of what was going wrong–I know you wrote the code, so you know how it works, but it’s easier for me to explain what the problems were and what I did to fix them if I explain how the system worked before.

We have documented our long-standing issues (2+ years) with disconnects, and how they a) often crash the xproofd daemon, killing all jobs currently being run in addition to the one that disconnected b) do not reconnect, ever. As this is kind of my white whale of PROOF problems, and I will hopefully be graduating soon, and I decided to find out what was going on.

Previous thread:
[url]PROOF cluster network issues

1. Disconnects

These seem to be real, and happen occasionally, especially under high load. They can be triggered either by XrdProofdClientMgr::CheckClients(), if the admin file hasn’t been touched recently, or by disconnects detected directly by the XrdLink/Xrd poller. In both cases, XrdProofdProcotol::Recycle() is called, and this send an asynchronous message to XrdProofdClientMgr (which forwards it to XrdProofdProofServMgr) to remove any references to either the disconnected protocol or any associated XrdProofdResponses. These XrdProofdResponses are then deleted in XrdProofdProtocol::Reset.

2. xproofd Crashes

All the xproofd crashes we experienced are due to dangling pointers to XrdProofdResponses which had been deleted. These arise after disconnects, and cause crashes more often in memory-heavy jobs where the memory used by the XrdProofdResponse often gets re-allocated. The most common reason for this was a strange “continue” line in the kClientDisconnect block of XrdProofdClientCron, which skipped the removal of the dangling pointers if the admin directory could not be successfully removed. Considering the admin directory got deleted already in most cases (those called from CheckClients) this always failed, the pointers were left, and often xproofd seg faulted and died.

Another much rarer failure mode was based on the asynchronous nature of the dangling pointer cleanup–sometimes, a the responses would be deleted, reallocated by another object/process, and accessed via a now-dangling pointer all before the asynchronous message could be processed by XrdProofdClientCron. While rare, it makes clear that cleaning up pointers asynchronously is a bad idea.

3. Crash Solution

I moved the asynchronous elements from the ClientMgr and ProofServMgr cron jobs and called them directly from XrdProofdProtocol::Recycle, before the responses are deleted. And of course, made this cleanup independent of whether or not the admin directory could be successfully removed. XrdProofdProtocol::Recycle can grab the relevant ProofServMgr mutex directly, so everything should be safe in the multi-threaded sense.

4. Reconnects

Even if xproofd didn’t crash, ‘MasterWorker’ reconnects never work. The reason for this is two-fold. One, the admin directory was always deleted, both by CheckClients and XrdClientCron when the message kClientDisconnect was sent, so the client ID informational was lost. The proofserv processes are given a specific cid when they are forked, and if a client reconnects with a different one, they cannot send messages to it. The second is that the CheckActiveSessions call to the ProofServMgr automatically kills all proofservs without valid clients, and what makes a valid client is a non-NULL response and protocol pointer! But we just cleared those to avoid crashes!

A way around this (already built into the class) is to call SetReconnectTime on the ProofServMgr, which delays the killing of the worker proofserv processes. I set this based on the client having at least one proofserv process which is not in kXPD_idle state. This is again set in XrdProofdProtocol::Recycle.

For the admin paths, I remove any automatic removal of the admin path from CheckClients or Recycle and its calls. I also write the “disconnected” file from Recycle, so both types of disconnects are on equal footing. Old admin paths are cleaned up periodically anyway, so we don’t have to delete on disconnect.

With those changes, reconnects seem to now work 100%. As a last safeguard, I modified XrdProofdClient::GetClientID to not give out freed XrdClientIDs too soon after they have been freed–this is to prevent the rare case where another connection by the same user (maybe an Admin connection) reconnects before the disconnected client and steals its cid. I also added a few lines so that the existing functions to clear the dangling pointers do a more complete job–the XrdProofdProofServs were not getting their fProtocol, fResponse, or fParent pointers cleared before, though these seem to be called seldom and thus I never traced these directly to any one crash.

5. Summary

My SVN diff relative to the trunk (Friday June 22nd) from the proof/proofd direcory is attached. All the changes in it I have found or consider necessary, so please let me know if anything looks extraneous and I will try to explain it. I have been running these fixes on the SLAC PROOF cluster (105 worker cores, 16 machines, 220 TB) using a slightly older version of the trunk (~2 weeks ago) for a week now with zero xproofd crashes and all recoverable disconnects reconnecting successfully, so it should be well validated, but we will continue to validate with today’s trunk to make sure.

Hope this helps, and thanks,

Bart

Savannah thread started here:
https://savannah.cern.ch/bugs/index.php?95612
svn_diff.txt (15.6 KB)

ganis · June 25, 2012, 12:31am

Hi Bartl,

I am officially offline until next Saturday, but I could connect and have a look at your patch and explanations.
Thanks a lot for the thorough and deep analysis, the patch and the detailed explanation of your findings. The patch makes sense, as far as I can see, and I am going to commit it in the trunk now so that other people can try it.
As soon as I am back I will try to have it installed on one of the ALICE facilities where the disconnections where observed, though less frequently.

Thanks again for this invaluable contribution.

Gerri

bbutler · June 26, 2012, 6:24am

Hi Gerri,

Glad to help! It’s been an extremely frustrating problem for a long time for us…glad to be rid of it! I have updated SLAC to the trunk and will continue testing it as well.

Bart