Hi Gerri, all,
Over at SLAC we’re trying to transition our PROOF cluster to PoD to get more fine-grained control over our system (and to sandbox the daemon between users, so that one bad crash doesn’t kill everyone). This is a 36-machine system (1 master, 35 slaves, 7 to 11 workers on each slave). I’ve gotten PoD with pod-ssh working for the most part, except for one problem: I cannot get the connection to work for more than 24 machines. With 24, I can connect to all workers, no problem. With 25 machines (even if I reduce the number of workers per machine), I always get:
root  TProof pod("email@example.com:21001")
Starting master: opening connection ...
Starting master: OK
Opening connections to workers: 27 out of 245 (11 %)
| session: swiatlow.default.13412.status terminated by peer
Info in <TXSlave::HandleError>: 0xed92e0:atlprf01.slac.stanford.edu:0 got called ... fProof: 0xe025a0, fSocket: 0xee0b50 (valid: 1)
Info in <TXSlave::HandleError>: 0xed92e0: proof: 0xe025a0
TXSlave::HandleError: 0xed92e0: DONE ...
I was debugging this with Anar on this thread (github.com/AnarManafov/PoD/issues/3) but he thinks it’s a PROOF issue at this point, since all the workers have direct proof connections (as revealed by pod-info -l).
Do you have any thoughts? Anar’s suspicion that I am running into file/process limits seemed reasonable, but I increased these, and the problems still persist with these values for ulimit:
~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 191968
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 16384
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 16384
virtual memory (kbytes, -v) 2097152
file locks (-x) unlimited
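Since these values come from my interactive shell, and limits are per-process and inherited, the processes started over ssh may not see the same numbers. A minimal sketch of a check that could be run on each node through the same ssh path is below; it only uses Python's standard `resource` module, and the choice of which three limits to print is my own assumption about what is relevant here.

```python
import resource

# Limits to inspect, labelled like the matching `ulimit -a` rows.
# Note: RLIMIT_AS is reported in bytes, while `ulimit -v` prints kbytes.
LIMITS = {
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "virtual memory (-v)": resource.RLIMIT_AS,
}


def fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the process running this script."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


def report():
    """Build one line per limit showing the soft and hard values."""
    lines = []
    for label, (soft, hard) in current_limits().items():
        lines.append(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

Saved under a hypothetical name such as `limits.py`, it could be run non-interactively on a worker node (for example via `ssh <node> python3 limits.py`) and the output compared with the table above.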
I would appreciate any insight into this. It seems like some small configuration problem, or at least I hope so!