Crash with PoD when using >= 25 worker machines

Hi Gerri, all,

Over at SLAC we’re trying to transition our PROOF cluster to PoD, both to get more fine-grained control over the system and to sandbox the daemons between users, so that one bad crash doesn’t take everyone down. This is a 36-machine system (1 master, 35 slaves, 7 to 11 workers on each slave). I’ve gotten PoD with pod-ssh working for the most part, except for one problem: I cannot get the connection to work for more than 24 machines. With 24, I can connect to all workers, no problem. With 25 machines (even if I reduce the workers per machine), I always get:

[code]
root [0] TProof pod("swiatlow@atlprf01.slac.stanford.edu:21001")
Starting master: opening connection ...
Starting master: OK
Opening connections to workers: 27 out of 245 (11 %)
 | session: swiatlow.default.13412.status terminated by peer
Info in <TXSlave::HandleError>: 0xed92e0:atlprf01.slac.stanford.edu:0 got called ... fProof: 0xe025a0, fSocket: 0xee0b50 (valid: 1)
Info in <TXSlave::HandleError>: 0xed92e0: proof: 0xe025a0
TXSlave::HandleError: 0xed92e0: DONE ...
[/code]

I was debugging this with Anar on this thread (github.com/AnarManafov/PoD/issues/3) but he thinks it’s a PROOF issue at this point, since all the workers have direct proof connections (as revealed by pod-info -l).
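For context, this is roughly the sequence I use to bring the cluster up and inspect it; treat it as a sketch, since pod_ssh.cfg is just a placeholder name for our pod-ssh configuration file:

[code]
# Sketch of the PoD workflow used here (pod_ssh.cfg is a placeholder name)
pod-server start                 # start the local PoD server on the client node
pod-ssh -c pod_ssh.cfg submit    # start PoD agents on the worker hosts via ssh
pod-info -n                      # number of PROOF workers currently available
pod-info -l                      # list of the PROOF workers that connected back
pod-info -c                      # connection string to pass to PROOF
[/code]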

Do you have any thoughts? Anar’s suspicion that I am running into file/process limits seemed reasonable, but I increased them, and the problem still persists with the following ulimit values:

[code]
~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191968
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) 2097152
file locks                      (-x) unlimited
[/code]
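(For what it’s worth, making the higher limits persistent for non-interactive ssh logins on the nodes would typically mean putting something along these lines into /etc/security/limits.conf; this is only a sketch, and the exact entries on our machines may differ:)

[code]
# /etc/security/limits.conf (sketch; requires pam_limits to be active for ssh sessions)
*    soft    nofile    16384
*    hard    nofile    16384
*    soft    nproc     16384
*    hard    nproc     16384
[/code]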
I would appreciate any insight into this; it seems like a small configuration problem, or at least I hope so!

Best,
Max

One additional bit of info: when I retrieve the master log after this crash, I see:

[code]
150312 16:18:20 7709 xpd-I: ProofServMgr::CreateFork: srvtype = 2
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
[/code]

This looks like a bad_alloc right after the fork call, but I don’t know what could be causing it.
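If it helps, the limits the running daemon actually has can be read from /proc, since a login-shell ulimit does not necessarily apply to a daemon started another way. This is just a sketch and assumes the PoD-started PROOF daemon shows up as xproofd for this user:

[code]
# Sketch: inspect the effective limits of the running daemon
# (assumes the PoD-started PROOF daemon appears as "xproofd" for this user)
pid=$(pgrep -u "$USER" xproofd | head -n1)
cat /proc/"$pid"/limits
[/code]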

(And to be specific, I’m using ROOT 5.34/05 right now, but I can switch to something newer if necessary.)

Hi Max,

Uhmm … there is only one fork on the master (the master process), and it initially works, since you do start getting workers.
Is there anything in the master process log file (not xpd.log, but the proofserv one; it is on the node where you start pod-ssh)?
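In case it is not obvious where to look for it, something along these lines usually turns it up; this is only a sketch, assuming the default PoD working directory ~/.PoD and the default PROOF sandbox ~/.proof, and the layout may differ on your setup:

[code]
# Sketch: locate recent PROOF/PoD session logs on the node where pod-ssh was started
# (assumes the default PoD work dir ~/.PoD and the default PROOF sandbox ~/.proof)
find ~/.PoD ~/.proof -name '*.log' -mmin -60 2>/dev/null
[/code]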

It would certainly help debugging if you could give it a try with a more recent ROOT. For example, if you have cvmfs on the nodes, you could use a current version that I am using for other tests. This is what I use to set it up:

# Gcc
source /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/setup.sh

# ROOT
ROOTDIST="/cvmfs/sft.cern.ch/lcg/dev/root-v5-34-26-proof-test/x86_64-slc6-gcc48-opt"
# ROOTDIST="/afs/cern.ch/work/g/ganis/public/root/root-v5-34-26-proof-test/x86_64-slc6-gcc48-dbg"
source $ROOTDIST/bin/thisroot.sh

# Xrootd
source $ROOTDIST/bin/setxrd.sh /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/xrootd/3.3.6.p1-x86_64-slc6-gcc48-opt/v3.3.6
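After sourcing the above, a quick sanity check that the intended builds are picked up could look like this (just a sketch):

[code]
# Quick sanity check of the environment set up above
root-config --version     # should report 5.34/26
root-config --prefix      # should point into the cvmfs ROOT area
which xrdcp               # should point into the cvmfs xrootd area
[/code]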

Gerri

Hi Gerri,

Thanks very much for the suggestions! I am now using your recommended ROOT version, but I still get the same issue. The log contents are:

[code]
150320 14:07:25 10332 xpd-I: ProofServMgr::Create: *** spawned child process 10332 ***
150320 14:07:25 10332 xpd-I: ProofServMgr::Create: srvtype = 2
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
[/code]

I’ll keep digging, but any tips are highly appreciated. Thanks!

Best,
Max