5.30: Side effects of PROOF sandbox purging via xproofd

fprelz · December 1, 2011, 3:56pm

Hi,

after upgrading the PROOF cluster that serves our local community of ATLAS users
from ROOT 5.28 to 5.30, we started observing two issues that I believe can all be traced back to
the new handling of TProof::DisablePackages via admin commands to xproofd:

6571 if (fManager) {
6572 // Try to do it via XROOTD (new way)
6573 if (fManager->Rm("~/packages/*", “-rf”, “all”) != -1) {
6574 done = kTRUE;
6575 st = 0;
6576 }
6577 }

The broadcast requests to remove all files in the user sandbox (our users are issuing a TProof::ClearPackages()
at the beginning of each PROOF run) are piling up fast with time in the xproofds. Looking at the message exchanges
between a given pair of xproofds, I see a ratio of roughly twelve ‘Rm’ requests to two ‘OK’ replies in the
same amount of time:
$ grep 'sendto.47.rf’ /tmp/clearp.trace
[pid 31318] sendto(47, "all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/*|-rf”, 20, 0, NULL, 0) = 20
$grep ‘recvfrom(47’ /tmp/clearp.trace
[pid 2048] recvfrom(47, “\3\0\0\0\0\0\0\3”, 8, 0, NULL, NULL) = 8
[pid 2048] recvfrom(47, “OK\0”, 3, 0, NULL, NULL) = 3
[pid 2048] recvfrom(47, “\3\0\0\0\0\0\0\3”, 8, 0, NULL, NULL) = 8
[pid 2048] recvfrom(47, “OK\0”, 3, 0, NULL, NULL) = 3
[pid 2048] recvfrom(47, <unfinished …>
[pid 2048] recvfrom(47, <unfinished …>
This has the effect of having the user ‘packages’ areas wiped out at random times by
the xproofds, which I believe is the problem described by one of our users in the thread ‘disappearing files’.
We (obviously) also see a correspondingly large number of command aborts in the log files:
111201 00:00:06 31068 xpd-E: Conn::SendRecv: reading msg from connmgr (server[t2.wn.10.mi.infn.it:1093])
111201 00:00:06 31068 xpd-E: Conn::SendReq: max number of retries reached - Abort
There is an increasing number of leftover, forked xproofds, all waiting to acquire the mutex to write
to the xproofd log file. This is not really surprising, as the behaviour of pthread_mutexes
upon forking is ‘undefined’, and the chance of the log mutex to be locked by another thread
in the parent process is sizable:
#0 0x000000384b00d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x000000384b008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x000000384b008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004fadde in XrdSysError::TBeg ()
#4 0x0000000000486fe1 in XrdProofdProofServMgr::SetUserEnvironment ()
#5 0x00000000004330ac in XrdProofdAdmin::ExecCmd ()
#6 0x0000000000436ff0 in XrdProofdAdmin::Exec ()
#7 0x000000000043df66 in XrdProofdAdmin::Process ()
#8 0x000000000045ab30 in XrdProofdManager::Process ()
#9 0x000000000049fda3 in XrdProofdProtocol::Process2 ()
#10 0x00000000004a0262 in XrdProofdProtocol::Process ()
#11 0x00000000004be990 in XrdLink::DoIt ()
#12 0x00000000004c2d73 in XrdScheduler::Run ()
#13 0x00000000004c2ef9 in XrdStartWorking ()
#14 0x00000000004fd0a7 in XrdSysThread_Xeq ()
#15 0x000000384b00673d in start_thread () from /lib64/libpthread.so.0
#16 0x000000384a8d44bd in clone () from /lib64/libc.so.6

I’m looking for suggestions as to what the best workarounds (short of rolling back) might be.
For (1): is there a way to administratively manage/purge the command queue in the xrootds ? Should I just ask the users to refrain from routinely calling TProof::ClearPackages() ?
For (2): I’m currently running a script that kills leftover children that are waiting for the log mutex at the exact program counter of the Exec call log. There’s little else I can do. Probably the XrdProofdTrace object should be reset in the child forking (assuming this is the only parent mutex used by the forked child).

Thanks.
Francesco Prelz
Computing Support, INFN Milan

ganis · December 2, 2011, 11:00am

Hi Francesco,

Thanks for your report.
We did not see this problem in the farms that we follow regularly, but there the usage of ClearPackages is not so systematic as it seems in your case.Of course it may also depend of the amount of data to be cleaned and on the underlying fs.
The idea behind packages is that they contain relatively stable code to be rebuild only when needed, not on each query. So invoking systematically ClearPackages goes against the original intentions.
But , of course, this should not freeze the daemon. I will try to reproduce the problem and investigate it.
In the meantime, yes, I would ask users to refrain from running blindly ClearPackages in macros. If they really need to do that, then we need to understand why.

Gerri Ganis

fprelz · December 2, 2011, 11:52am

The underlying FS for the sandboxes is ext3 over a local software RAID, so nothing exotic. From fork to ‘OK’ reply, the remove command takes normally about 25 ms to execute. The nodes do tend to become busy with proofserv processing, but it looks like the xproofd is scheduling pending ‘Exec’ commands sparingly even when the load is low. How often are they supposed to be scheduled ?: I find this hard to understand in either xrootd documentation or code, and there may as well be some xrd config parameters to tweak for this case.

I believe that the use of parent process mutexes for the log file after fork is an issue that could bite in other cases as well and would better be addressed.

As for ClearPackages, users are now warned to use it sparingly, though it was executed promptly via the
proofservs before the upgrade.

Thanks for your reply and assistance.
Francesco P.