TProofBench crash

I’m trying to use TProofBench::Run with ROOT 5.30/04, but at a certain point it crashes with the usual message:

Info in <TProofBenchRunCPU::Run>: Running CPU-bound tests with 49 active worker(s); trial 4/4
Worker 'proof-01.mi.infn.it-0.96' has been removed from the active list

The corresponding log is this:

17:11:38  9341 Mst-0 | Info in <TXProofServ::SetQueryRunning>: starting query: 196
17:11:38  9341 Mst-0 | Info in <TProofQueryResult::SetRunning>: nwrks: 49
17:11:38  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:38  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:38  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:38  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:38  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:38  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:38  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:40  9341 Mst-0 | Info in <TProof::HandleInputMessage>: finalization on Mst-0 started ...
17:11:40  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:40  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:40  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:41  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:41  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:41  9341 Mst-0 | Info in <TXProofServ::HandleInput>: kXPD_clusterinfo: tot: 1, act: 1, eff: 1.000000
17:11:42  9341 Mst-0 | Error in <TXSocket::ProcessUnsolicitedMsg>: 0x2aaab0001db0: async semaphore taken by Close()! Should not be here!
17:11:42  9341 Mst-0 | Error in <TXSocket::ProcessUnsolicitedMsg>: 0x2aaab00091e0: async semaphore taken by Close()! Should not be here!
 
| session: turra.default.11715.status terminated by peer
17:12:17  9341 Mst-0 | Info in <TXSlave::HandleError>: 0x2aaab0033b30:proof-01.mi.infn.it:0.96 got called ... fProof: 0x1780dd90, fSocket: 0x2aaab0033cb0 (valid: 1)
17:12:17  9341 Mst-0 | Info in <TXSlave::HandleError>: 0x2aaab0033b30: proof: 0x1780dd90
17:12:17  9341 Mst-0 | Info in <TProof::MarkBad>: 
 +++ Message from top master at proof-06.mi.infn.it:1093 : marking proof-01.mi.infn.it:1093 (0.96) as bad
 +++ Reason: received kPROOF_FATAL
TXSlave::HandleError: 0x2aaab0033b30: DONE ... 
120321 17:12:17 9341 Proofx-E: Conn::LowWrite: sending header to server [proof-01.mi.infn.it:1093] (rc=-3)
120321 17:12:17 9341 Proofx-E: Conn::SendRecv: problems sending request to server [proof-01.mi.infn.it:1093]
// --------- End of element log -------------------


Retrieving logs: 1 ok, 0 not ok (100 % processed) 


// --------- Start of element log -----------------

// Ordinal: 0.96 (role: worker)

// Path: turra@proof-01.mi.infn.it:1093//proof/workingdirs/turra/session-proof-06-1332345582-9341/worker-0.96-proof-01-1332345585-12593.log 
// # of retrieved lines: 23 


// ------------------------------------------------

120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: log file: /proof/workingdirs/turra/session-proof-06-1332345582-9341/worker-0.96-proof-01-1332345585-12593.log
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: child process 12593
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: admin path: /proof/proofadmin/.xproofd.1093/activesessions/turra.default.12593
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: UNIX sock path: /proof/proofadmin/.xproofd.1093/socks/xpd.1093.12593
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: srvtype = 0
120321 16:59:45 10786 xpd-I: ProofServMgr::SetUserOwnerships: enter
120321 16:59:45 10786 xpd-I: ProofServMgr::SetUserOwnerships: done
120321 16:59:45 10786 xpd-I: ProofServMgr::SetUserEnvironment: enter
120321 16:59:45 10786 xpd-I: ProofServMgr::SetUserEnvironment: done
120321 16:59:45 10786 xpd-I: ProofServMgr::SetProofServEnv: psid: 12, log: 0
120321 16:59:45 10786 xpd-I: ProofServMgr::SetProofServEnv: ROOT dir: /gpfs/storage_4/users/home/proof/root
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateProofServRootRc: session rootrc file: /proof/workingdirs/turra/session-proof-06-1332345582-9341/worker-0.96-proof-01-1332345585-12593.rootrc
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateProofServEnvFile: environment file: /proof/workingdirs/turra/session-proof-06-1332345582-9341/worker-0.96-proof-01-1332345585-12593.env
120321 16:59:45 10786 xpd-I: ProofServMgr::SetProofServEnv: creating symlink
120321 16:59:45 10786 xpd-I: ProofServMgr::SetProofServEnv: done
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: 12593: proofserv env set up
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: 12593: log file path communicated
120321 16:59:45 10786 xpd-I: ProofServMgr::CreateFork: 12593: user: turra, uid: 11547, euid:11547, psrv: /gpfs/storage_4/users/home/proof/root/bin/proofserv
Received SIGTERM: terminating
17:14:53 12593 Wrk-0.96 | Info in <TXProofServ::Terminate>: starting session termination operations ...
17:14:53 12593 Wrk-0.96 | Info in <TXProofServ::Terminate>: process memory footprint: 138008/-1 kB virtual, 28016/-1 kB resident 
17:14:55 12593 Wrk-0.96 | Info in <TXProofServ::Terminate>: data directory '/proof/workingdirs/turra/data/0.96/proof-01-1332345585-12593' has been removed
Terminate: termination operations ended: quitting!
// --------- End of element log -------------------
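
Since the log above is long and mostly repeated kXPD_clusterinfo lines, here is a minimal sketch of how the relevant entries could be pulled out automatically. It is not part of ROOT or PROOF: the function name `summarize_log` and the regular expressions are my own assumptions, derived only from the line formats visible in this excerpt, and other ROOT versions may print them differently. It does not catch the `Proofx-E:` lines.

```python
import re

# Patterns inferred from the log excerpt in this thread (assumption:
# other ROOT/PROOF versions may format these lines differently).
ERROR_RE = re.compile(r"Error in <(?P<where>[^>]+)>: (?P<msg>.*)")
BAD_RE = re.compile(r"marking (?P<host>\S+) \((?P<ordinal>[\d.]+)\) as bad")
REASON_RE = re.compile(r"\+\+\+ Reason: (?P<reason>.*)")


def summarize_log(lines):
    """Collect 'Error in <...>' messages and workers marked bad.

    Hypothetical helper, not a ROOT API. Returns (errors, bad) where
    errors is a list of (location, message) tuples and bad is a list of
    dicts with the keys host, ordinal and reason.
    """
    errors = []
    bad = []
    for line in lines:
        line = line.rstrip()
        m = ERROR_RE.search(line)
        if m:
            errors.append((m.group("where"), m.group("msg")))
            continue
        m = BAD_RE.search(line)
        if m:
            bad.append({"host": m.group("host"),
                        "ordinal": m.group("ordinal"),
                        "reason": None})
            continue
        # A 'Reason' line is attached to the most recent bad worker
        # that does not have a reason yet.
        m = REASON_RE.search(line)
        if m and bad and bad[-1]["reason"] is None:
            bad[-1]["reason"] = m.group("reason")
    return errors, bad


if __name__ == "__main__":
    import sys
    # Usage (file name is whatever you saved the log as):
    #   python summarize_log.py master.log
    with open(sys.argv[1]) as f:
        errs, bad_workers = summarize_log(f)
    for where, msg in errs:
        print("ERROR", where, "->", msg)
    for w in bad_workers:
        print("BAD", w["host"], w["ordinal"], w["reason"])
```

On the master log above this should report the two TXSocket::ProcessUnsolicitedMsg errors and worker 0.96 on proof-01.mi.infn.it being marked bad after kPROOF_FATAL.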

Hi,

Where does it crash? On the client? Or is it the xproofd in control of worker 0.96?
Can you say a bit more about the setup? How many machines, cores/machine, …

G. Ganis

Hello, I got the crash message (the first one) on the client. The log is from one of the workers. The cluster has 192 cores, and the client is one of the cluster machines.