Xproofd just... stops

gwatts · September 20, 2010, 6:57am

Hi,
I’m running an xproofd server on a single Linux machine right now (this is with version 5.26). When I first start one and attempt to do something that causes a crash, I often see the following in the window where I typed “xproofd”:

[code]100919 23:45:59 15751 xpd-I: gwatts.15824:54@localhost.localdomain: ClientMgr::MapClient: user gwatts logged-in (privileged); type: Internal
100919 23:45:59 15751 xpd-I: ProofServ::SetAdminPath: creation/assertion of the status path /tmp/.xproofd.1093/activesessions/gwatts.default.15827.status was successful!
100919 23:45:59 15751 xpd-I: gwatts.15827:56@localhost.localdomain: ClientMgr::MapClient: user gwatts logged-in (privileged); type: Internal
100919 23:45:59 15751 xpd-I: ProofServ::SetAdminPath: creation/assertion of the status path /tmp/.xproofd.1093/activesessions/gwatts.default.15829.status was successful!
100919 23:45:59 15751 xpd-I: gwatts.15829:58@localhost.localdomain: ClientMgr::MapClient: user gwatts logged-in (privileged); type: Internal
100919 23:46:04 15751 xpd-I: ProofServCron: 1 sessions are currently active
100919 23:46:04 15751 xpd-I: ProofServCron: next sessions check in 30 secs
100919 23:46:29 15751 xpd-I: SchedCron: running regular checks
100919 23:46:34 15751 xpd-I: ProofServCron: 1 sessions are currently active
100919 23:46:34 15751 xpd-I: ProofServCron: next sessions check in 30 secs
100919 23:46:59 15751 xpd-I: SchedCron: running regular checks
100919 23:47:04 15751 xpd-I: ProofServCron: 1 sessions are currently active
100919 23:47:04 15751 xpd-I: ProofServCron: next sessions check in 30 secs
100919 23:47:14 15751 xpd-I: ProofServMgr::BroadcastClusterInfo: tot: 1, act: 1

[1]+ Stopped xproofd
bash-3.2$ [/code]

It just stops. The client is usually hung… If I type “fg” in the window, everything picks up, and the crash (or whatever it is) is transmitted back to the client.

On the server, a worker’s log looks like this:

[code]23:46:45 15818 Wrk-0.9 | Info in <TXProofServ::HandleCache>: loading macro GlobalCacheReset.cpp+ ...
23:46:52 15818 Wrk-0.9 | Info in <TXProofServ::HandleCache>: loading macro BasicPlotMaker.cxx+ ...
23:47:00 15818 Wrk-0.9 | Info in <TXProofServ::HandleCache>: loading macro JetKinematicPlots.cpp+ ...
23:47:05 15818 Wrk-0.9 | Info in <TXProofServ::HandleCache>: loading macro EMJESfix.cpp+ ...
23:47:12 15818 Wrk-0.9 | Info in <TXProofServ::HandleCache>: loading macro MakeJetCollection.cpp+ ...
Flow base was created!!
FlowSequential was created
Flow base was created!!
Creating a GlobalCacheReset
Flow base was created!!
Creating a Make jets collectin
Flow base was created!!
Creating a jet kinematic plots
Flow base was created!!
FlowSequential was created
Flow base was created!!
Creating a GlobalCacheReset
23:47:14 15818 Wrk-0.9 | *** Break ***: segmentation violation
[/code]

(the print-outs are my poor-man’s attempt at debugging). And the master’s log:

[code]23:47:12 15800 Wrk-0.3 | Info in TXProofServ::HandleCache: loading macro MakeJetCollection.cpp+ …
Flow base was created!!
FlowSequential was created
Flow base was created!!
Creating a GlobalCacheReset
Flow base was created!!
Creating a Make jets collectin
Flow base was created!!
Creating a jet kinematic plots
Flow base was created!!
FlowSequential was created
Flow base was created!!
Creating a GlobalCacheReset
Flow base was created!!
Creating a Make jets collectin
Flow base was created!!
Creating a jet kinematic plots

23:47:14 15773 Mst-0 | Info in TXProofServ::SetQueryRunning: starting query: 1
23:47:14 15773 Mst-0 | Info in TXProofServ::HandleInput: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
23:47:14 15773 Mst-0 | Info in TPacketizerAdaptive::TPacketizerAdaptive: fraction of remote files 1.000000
23:47:14 15773 Mst-0 | SvcMsg in TProofPlayerRemote::Process: Start merging Memory information
[/code]

After I do the fg, the crash dump shows up in the worker’s log file, and it looks like this is probably my problem:

[code]The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.

#6 0x00002aaab14789a8 in ROOT::delete_FlowBase ()
from /phys/users/gwatts/.proof/session-tev2-1284965156-15773/worker-0.9-tev2-1284965158-15818/./FlowBase_cpp.so
#7 0x00002b070bdf623a in TClass::Destructor ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libCore.so
#8 0x00002b070d485eb8 in TBufferFile::ReadFastArray ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libRIO.so
#9 0x00002b070d5334ed in TStreamerInfo::ReadBuffer<char**> ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libRIO.so
#10 0x00002b070d487071 in TBufferFile::ReadClassBuffer ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libRIO.so
#11 0x00002b070d4b0aa4 in TGenCollectionStreamer::ReadObjects ()
[/code]

(FlowBase is the base object that actually does the work for me on the cluster).

But at the very least - why on earth do I get the “Stopped”? I’ve never seen a Linux process that can “stop” itself before. And of course, this means I have to log into the xproofd server to get things started again!
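For context on the “Stopped” line, here is a minimal sketch of the general Unix mechanism, not a diagnosis of xproofd itself. A shell reports a job as stopped when the process receives a stop signal (SIGSTOP, SIGTSTP, or SIGTTIN/SIGTTOU when a background job touches the terminal), and “fg” resumes it by sending SIGCONT. Which of these signals, if any, xproofd receives here is an assumption that this thread does not confirm. The sleeping child below is a hypothetical stand-in for the daemon, and `stop_and_resume` is a name made up for illustration.

```python
import os
import signal
import subprocess
import sys


def stop_and_resume():
    """Stop a child with SIGSTOP, then resume it with SIGCONT.

    Returns (stopped, continued) as observed through waitpid().
    """
    # Hypothetical stand-in for the daemon: a child that only sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        # A stop signal suspends the process; the shell would now
        # print "[1]+ Stopped ..." for this job.
        os.kill(child.pid, signal.SIGSTOP)
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        # "fg" (or "bg") resumes the job by sending SIGCONT.
        os.kill(child.pid, signal.SIGCONT)
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
        return stopped, continued
    finally:
        # Clean up the stand-in process.
        child.kill()
        child.wait()


if __name__ == "__main__":
    print(stop_and_resume())
```

If something similar is happening to xproofd, `ps -o pid,stat,cmd -p <pid>` would show state `T` for the stopped process, and `kill -CONT <pid>` should resume it the same way “fg” does.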

This is also present with 5.27/04, but I’ve not tried it with the source tree head.

Cheers, Gordon.

pcanal · January 21, 2011, 10:12pm

Hi Gordon,

Do you still have this problem with v5.28?

Philippe.

gwatts · January 21, 2011, 10:40pm

Thanks for asking. I will be moving my PROOF code to 5.28 soon - I’ve not done it yet. I’ll check back when I’ve done that to see if it still happens.
