Xrootd : cmsd daemon crashing after a few minutes

Hello,

I am setting up a small PROOF cluster with one master and 2 nodes (4 workers).
I would like to have xrootd set up on the same machines, so as I understand it
I have to launch cmsd on the master (manager node) and on the other nodes
(server nodes). I used the startup script from the CAF example. I tried
different ways of starting cmsd on the manager node, and it keeps crashing.

I tried with both ROOT 5.22 and 5.23, compiled in place on the machines as 64-bit builds:

./configure linuxx8664gcc --disable-mathmore --disable-asimage --prefix=/opt/root

Here is a log extract (run from the xrootd account). The cmsd daemon crashed
without any message about 2 minutes after the last message in the log:

-sh-3.2$ time /opt/root/bin/cmsd -c /opt/proof.cnf -d
090625 09:20:48 001 Scalla is starting. . .
Copr.  2007 Stanford University, xrd version 20090217-0500
++++++ cmsd anon@nanxrd07.in2p3.fr initialization started.
Config using configuration file /opt/proof.cnf
=====> xrd.port 1094
=====> all.adminpath /pool/admin
Config maximum number of connections restricted to 65535
090625 09:20:48 001 XrdConfig: sendfile enabled.
090625 09:20:48 001 XrdSched: scheduling underused thread monitor in 780 seconds090625 09:20:48 23732 XrdXeq: Time scheduler thread started
090625 09:20:48 001 XrdSched: Starting with 2 workers
090625 09:20:48 001 XrdLink: Allocating 8 link objects at a time
090625 09:20:48 001 XrdPoll: Starting poller 0
090625 09:20:48 23732 XrdXeq: Worker thread started
090625 09:20:48 23732 XrdXeq: Poller thread started
090625 09:20:48 23732 XrdXeq: Buffer Manager reshaper thread started
090625 09:20:48 23732 XrdXeq: Worker thread started
090625 09:20:48 001 XrdPoll: Starting poller 1
090625 09:20:48 23732 XrdXeq: Poller thread started
090625 09:20:48 001 XrdPoll: Starting poller 2
090625 09:20:48 23732 XrdXeq: Poller thread started
090625 09:20:48 001 XrdProtocol: getting port from protocol cmsd
Copr.  2007 Stanford University/SLAC cmsd.
++++++ anon@nanxrd07.in2p3.fr phase 1 initialization started.
=====> all.adminpath /pool/admin
=====> all.export /pool/data
=====> all.role manager
=====> all.manager nanxrd07.in2p3.fr 3121
------ anon@nanxrd07.in2p3.fr phase 1 manager initialization completed.
090625 09:20:48 001 XrdConfig: LCL port 3121 wsz=1048576 (1048576)
090625 09:20:48 001 XrdProtocol: getting protocol object cmsd
++++++ anon@nanxrd07.in2p3.fr phase 2 manager initialization started.
090625 09:20:48 23732 XrdXeq: Cache Clock thread started
090625 09:20:48 001 Replenish old free 0 + 4096 = 4096
090625 09:20:48 001 Configure2 Global System Identification: anon-m 3121nanxrd07.in2p3.fr
Config round robin scheduling in effect.
090625 09:20:48 23732 XrdXeq: Performance monitor thread started
090625 09:20:48 23732 XrdXeq: Refcount monitor thread started
090625 09:20:48 23732 XrdXeq: Request Responder thread started
090625 09:20:48 001 Update FrontEnd Parm1=1 Parm2=0
090625 09:20:48 001 XrdSched: Set min_Workers=8 max_Workers=200
090625 09:20:48 001 XrdSched: Set stk_Workers=160 max_Workidl=780
------ anon@nanxrd07.in2p3.fr phase 2 manager initialization completed.
090625 09:20:48 23732 XrdSched: running cmsd startup inq=0
------ cmsd anon@nanxrd07.in2p3.fr:3121 initialization completed.
090625 09:20:48 23732 XrdXeq: Request Timeout thread started
090625 09:20:48 23732 XrdXeq: Prep handler thread started
090625 09:20:48 23732 XrdXeq: Admin traffic thread started
090625 09:20:48 23732 XrdXeq: State monitor thread started
090625 09:20:49 001 XrdInet: Accepted connection from 17@nanxrd07.in2p3.fr
090625 09:20:49 23732 XrdSched: Now have 3 workers
090625 09:20:49 23732 XrdSched: running ?:17@nanxrd07 inq=0
090625 09:20:49 23732 XrdProtocol: matched protocol cmsd
090625 09:20:49 23732 ?:17@nanxrd07 XrdPoll: FD 17 attached to poller 0; num=1
090625 09:20:49 23732 Protocol: redirector.23341:17@nanxrd07 logged in.
090625 09:20:49 23732 Admit_Redirector redirector.23341:17@nanxrd07 assigned slot 1
090625 09:20:49 23732 XrdXeq: Worker thread started
090625 09:22:18 23732 Update Stage Parm1=-1 Parm2=0
090625 09:22:18 23732 Update Active Parm1=-1 Parm2=0
090625 09:22:18 23732 Config: manager service enabled.
090625 09:22:18 23732 State: Status changed to suspended + nostaging
090625 09:22:18 23732 Send status to redirector.23341:17@nanxrd07
Terminated

real    4m12.535s
user    0m0.002s
sys     0m0.003s
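
As a side note, the log timestamps and the `time` output above are enough to estimate when the process went away. The sketch below is only an illustration: it assumes the first log line (09:20:48) is close to the actual process start, which the log cannot confirm to better than a second, and the variable names are made up for the example.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the cmsd log lines: YYMMDD HH:MM:SS
FMT = "%y%m%d %H:%M:%S"

first_log = datetime.strptime("090625 09:20:48", FMT)  # first line of the log
last_log = datetime.strptime("090625 09:22:18", FMT)   # last line before "Terminated"
elapsed = timedelta(minutes=4, seconds=12.535)         # "real" value reported by time

# Assumption: the process started at (about) the first log timestamp.
approx_end = first_log + elapsed
silent_gap = approx_end - last_log

print("approximate termination time:", approx_end.strftime("%H:%M:%S"))
print("silent gap before termination:", silent_gap.total_seconds(), "s")
```

Under that assumption the termination lands at about 09:25:00, roughly 2 minutes 40 seconds after the last log line. A single run proves nothing, but a termination on a round clock time may be worth keeping in mind when looking for the cause.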

Hi,

Never seen this, but I will ask A. Hanushevsky to have a look.
Could you post the config file /opt/proof.cnf?

Thanks,
Gerri

Thanks Gerardo,

Here is the proof.cnf

I also tried running cmsd under strace:

xrootd@manager:strace /opt/root/bin/cmsd -c /opt/proof.cnf -d
090626 13:35:20 16011 XrdSched: running ?:20@nanxrd05 inq=0
090626 13:35:20 16011 XrdProtocol: matched protocol cmsd
090626 13:35:20 16011 ?:20@nanxrd05 XrdPoll: FD 20 attached to poller 2; num=1
090626 13:35:20 16011 Add server.20888:20@nanxrd05:1094 to cluster 1 slot 1.3 (n=2 m=1) ID=3121nanxrd07.in2p3.fr anon-s
090626 13:35:20 16011 Update Counts Parm1=1 Parm2=0
090626 13:35:20 16011 Admit nanxrd05 TSpace=105GB NumFS=1 FSpace=99346MB MinFR=10240MB Util=0
090626 13:35:20 16011 Admit nanxrd05 adding path: w /pool/data
090626 13:35:20 16011 Protocol: server.20888:20@nanxrd05:1094 logged in.
090626 13:35:20 16011 XrdXeq: Worker thread started
090626 13:36:47 16011 Update Stage Parm1=-1 Parm2=0
090626 13:36:47 16011 Update Active Parm1=-1 Parm2=0
090626 13:36:47 16011 Config: manager service enabled.
090626 13:36:47 16011 State: Status changed to active + nostaging
090626 13:36:47 16011 Send status to redirector.23341:19@nanxrd07
0x7fff79688d90, [1578491878184058896]) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
+++ killed by SIGTERM +++

proof.cnf.txt (3.1 KB)

Hi Jean-Michel,

A. Hanushevsky had a look but did not find anything wrong with the logs or the config file.

One thing that we both noticed is that in both of your logs the program appears to receive a SIGTERM rather than crash.
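
To illustrate the difference, here is a minimal sketch of how a parent process can tell an externally sent SIGTERM from a crash. It uses `sleep` as a stand-in for the daemon (an assumption for the example; nothing here is specific to cmsd) and relies on Python's `subprocess` convention that a child killed by signal N reports a return code of -N.

```python
import signal
import subprocess

# "sleep" is only a stand-in for a long-running daemon.
proc = subprocess.Popen(["sleep", "60"])

# Terminate it from outside, as some other process apparently does to cmsd.
proc.send_signal(signal.SIGTERM)
returncode = proc.wait()

if returncode < 0:
    # Negative return code: the child was killed by signal -returncode.
    name = signal.Signals(-returncode).name
    print(f"killed by {name}")
else:
    print(f"exited normally with status {returncode}")
```

A genuine crash would typically show up as SIGSEGV or SIGABRT instead, which are the signals that can leave a core file behind. SIGTERM is normally sent by another process or by a script, and its default action terminates the process without a core dump.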

Anyhow, further investigation would require a closer look at the core file and the executable.
Would it be possible for you to post these files in some public place?
The OS is SLC5 on amd64, right?

Gerri

Hello Gerardo,

The OS is SL5.2 x86_64

I do not have core files. I do not know whether the config file properly defines where
they should be saved but, as you noticed, this is a SIGTERM termination, not a crash,
so maybe we cannot expect to find core files…
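
Independently of the config file, whether a core file can be written at all depends on the core size limit of the environment the daemon is started from. A small sketch for checking it, to be run from the same account and shell that start cmsd (the helper name `show` is just for this example):

```python
import resource

# Core file size limits inherited by processes started from this environment.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

def show(limit):
    """Render an rlimit value in a readable form."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"

print("core size soft limit:", show(soft))
print("core size hard limit:", show(hard))

if soft == 0:
    print("core dumps are disabled for processes started from here")
```

Even with an unlimited core size, a process ended by SIGTERM would not be expected to leave a core file, since the default action for that signal is plain termination.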

I put the cmsd executable here; please tell me when you have it so that I
can remove it:
www-subatech.in2p3.fr/~infosr/cmsd

JM

Dear Jean-Michel,

Indeed what’s strange is that the daemon gets the termination signal.
Andy added this comment to my request for help (if you give me your email I will add it to that email-thread):

We will have a look at your binary.

I attach a script that starts a very basic xrootd+cmsd system on a local machine (I use it to test very basic functionality): it starts a redirector on the standard port 1094 and two data servers on ports 11094 and 21094.

Just run ‘startXRD.sh’ (and ‘stopXRD.sh’ to stop); all the files are under /tmp/testxrd. This may reveal problems with the build. In any case, it will help locate the problem.
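
Once the test system is running, a quick way to see whether the three daemons stay up is to poll their ports. The sketch below is only a rough check under a few assumptions: it runs on the same machine (hence `localhost`), it uses the three port numbers mentioned above, and the helper `port_open` is a made-up name. It only tests that something accepts TCP connections and says nothing about whether the xrootd protocol itself works.

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports taken from the description of the test setup above.
services = [
    ("redirector", 1094),
    ("data server 1", 11094),
    ("data server 2", 21094),
]

for name, port in services:
    state = "listening" if port_open("localhost", port) else "not reachable"
    print(f"{name:14s} port {port:5d}: {state}")
```

Running it again after a few minutes would show whether any of the daemons has disappeared in the meantime.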

Cheers, Gerri
start-stop-XRD.tar (10 KB)