Hi,
after upgrading the PROOF cluster that serves our local community of ATLAS users
from ROOT 5.28 to 5.30, we started observing two issues that I believe can all be traced back to
the new handling of TProof::DisablePackages via admin commands to xproofd:
6571 if (fManager) {
6572 // Try to do it via XROOTD (new way)
6573 if (fManager->Rm("~/packages/*", “-rf”, “all”) != -1) {
6574 done = kTRUE;
6575 st = 0;
6576 }
6577 }
-
The broadcast requests to remove all files in the user sandbox (our users are issuing a TProof::ClearPackages()
at the beginning of each PROOF run) are piling up fast with time in the xproofds. Looking at the message exchanges
between a given pair of xproofds, I see a ratio of roughly twelve ‘Rm’ requests to two ‘OK’ replies in the
same amount of time:
$ grep 'sendto.47.rf’ /tmp/clearp.trace
[pid 31318] sendto(47, "all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, “all|~/packages/|-rf", 20, 0, NULL, 0) = 20
[pid 1962] sendto(47, "all|~/packages/|-rf”, 20, 0, NULL, 0) = 20
[pid 31318] sendto(47, “all|~/packages/*|-rf”, 20, 0, NULL, 0) = 20
$grep ‘recvfrom(47’ /tmp/clearp.trace
[pid 2048] recvfrom(47, “\3\0\0\0\0\0\0\3”, 8, 0, NULL, NULL) = 8
[pid 2048] recvfrom(47, “OK\0”, 3, 0, NULL, NULL) = 3
[pid 2048] recvfrom(47, “\3\0\0\0\0\0\0\3”, 8, 0, NULL, NULL) = 8
[pid 2048] recvfrom(47, “OK\0”, 3, 0, NULL, NULL) = 3
[pid 2048] recvfrom(47, <unfinished …>
[pid 2048] recvfrom(47, <unfinished …>
This has the effect of having the user ‘packages’ areas wiped out at random times by
the xproofds, which I believe is the problem described by one of our users in the thread ‘disappearing files’.
We (obviously) also see a correspondingly large number of command aborts in the log files:
111201 00:00:06 31068 xpd-E: Conn::SendRecv: reading msg from connmgr (server[t2.wn.10.mi.infn.it:1093])
111201 00:00:06 31068 xpd-E: Conn::SendReq: max number of retries reached - Abort -
There is an increasing number of leftover, forked xproofds, all waiting to acquire the mutex to write
to the xproofd log file. This is not really surprising, as the behaviour of pthread_mutexes
upon forking is ‘undefined’, and the chance of the log mutex to be locked by another thread
in the parent process is sizable:
#0 0x000000384b00d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x000000384b008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x000000384b008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004fadde in XrdSysError::TBeg ()
#4 0x0000000000486fe1 in XrdProofdProofServMgr::SetUserEnvironment ()
#5 0x00000000004330ac in XrdProofdAdmin::ExecCmd ()
#6 0x0000000000436ff0 in XrdProofdAdmin::Exec ()
#7 0x000000000043df66 in XrdProofdAdmin::Process ()
#8 0x000000000045ab30 in XrdProofdManager::Process ()
#9 0x000000000049fda3 in XrdProofdProtocol::Process2 ()
#10 0x00000000004a0262 in XrdProofdProtocol::Process ()
#11 0x00000000004be990 in XrdLink::DoIt ()
#12 0x00000000004c2d73 in XrdScheduler::Run ()
#13 0x00000000004c2ef9 in XrdStartWorking ()
#14 0x00000000004fd0a7 in XrdSysThread_Xeq ()
#15 0x000000384b00673d in start_thread () from /lib64/libpthread.so.0
#16 0x000000384a8d44bd in clone () from /lib64/libc.so.6
I’m looking for suggestions as to what the best workarounds (short of rolling back) might be.
For (1): is there a way to administratively manage/purge the command queue in the xrootds ? Should I just ask the users to refrain from routinely calling TProof::ClearPackages() ?
For (2): I’m currently running a script that kills leftover children that are waiting for the log mutex at the exact program counter of the Exec call log. There’s little else I can do. Probably the XrdProofdTrace object should be reset in the child forking (assuming this is the only parent mutex used by the forked child).
Thanks.
Francesco Prelz
Computing Support, INFN Milan