Received SIGTERM: terminating

During a computation with PROOF I got:

0.16: caught exception triggered by signal '1' while processing dset:'TDSet:photon', file:'/gpfs/storage_2/atlas/atlasgroupdisk/phys-sm/mc10_7TeV/NTUP_PHOTON/e598_s933_s946_r2215_r2260_p548/mc10_7TeV.106367.McAtNlo_Jimmy_H120gamgam.merge.NTUP_PHOTON.e598_s933_s946_r2215_r2260_p548_tid326925_00/NTUP_PHOTON.326925._000001.root.1', event:1000 - check logs for possible stacktrace
Worker 't2-wn-10.mi.infn.it-0.16' has been removed from the active list

 +++ Message from top master at t2-wn-10.mi.infn.it:1093 : marking t2-wn-10.mi.infn.it:1093 (0.16) as bad
 +++ Reason: received kPROOF_FATAL

 +++ Most likely your code crashed on worker 0.16 at t2-wn-10.mi.infn.it:1093.
 +++ Please check the session logs for error messages either using
 +++ the 'Show logs' button or executing
 +++
 +++ root [] TProof::Mgr("t2-wn-10.mi.infn.it:1093")->GetSessionLogs()->Display("0.16",0)

and in the log

Event pass the preselection, simple_mass = 115838corrected_mass = 117906
Received SIGTERM: terminating
14:40:14  6729 Wrk-0.10 | Info in <TXProofServ::Terminate>: starting session termination operations ...
14:40:14  6729 Wrk-0.10 | Info in <TXProofServ::Terminate>: process memory footprint: 320592/-1 kB virtual, 93348/-1 kB resident 
14:40:15  6729 Wrk-0.10 | Info in <TXProofServ::Terminate>: data directory '/proof/workingdirs/turra/data/0.10/t2-wn-10-1304685584-6729' has been removed
Terminate: termination operations ended: quitting!
14:40:15  6729 Wrk-0.10 | Info in <TProofPlayerSlave::Process>: received stop-process signal
14:40:15  6729 Wrk-0.10 | Warning in <TClass::TClass>: no dictionary for class egammaOQ is available
14:40:15  6729 Wrk-0.10 | Warning in <TClass::TClass>: no dictionary for class OffsetEtaJES is available
110506 14:40:15 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:15 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 14:40:16 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:16 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 14:40:17 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:17 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 14:40:18 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:18 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 14:40:19 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:19 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 14:40:20 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:20 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 14:40:21 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:21 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 14:40:22 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 14:40:22 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
// --------- End of element log -------------------

With PROOF-Lite everything works. I don’t get any backtrace or segmentation fault, nothing! What’s the problem? ROOT version: 5.28/00c

Hi,

Why do you check the log of 0.10? The crash was on 0.16 …
Signal 1 is usually a segv …

Gerri

Sorry, I copied the wrong message, but the PROOF master reported a crash on 0.10 as well. In the end all the workers crashed. Another example:


Info in <TProof::Collect>:    t2-wn-10.mi.infn.it (0)
Info in <TProof::HandleInputMessage>: got type 1044 from '0'
Info in <TProof::HandleInputMessage>: kPROOF_MESSAGE: enter

 +++ Message from top master at t2-wn-10.mi.infn.it:1093 : marking t2-wn-11.mi.infn.it:1093 (0.14) as bad
 +++ Reason: could not send kPROOF_GROUPVIEW message

 +++ Most likely your code crashed on worker 0.14 at t2-wn-11.mi.infn.it:1093.
 +++ Please check the session logs for error messages either using
 +++ the 'Show logs' button or executing
 +++
 +++ root [] TProof::Mgr("t2-wn-10.mi.infn.it:1093")->GetSessionLogs()->Display("0.14",0)


Info in <TProof::Collect>:  1 node(s) still active:

on the worker:

15:20:04 21442 Wrk-0.14 | Info in <TProofPlayerSlave::Process>: Call Process(1578)
Event pass the preselection, simple_mass = 118927corrected_mass = 118593
Received SIGTERM: terminating
15:20:04 21442 Wrk-0.14 | Info in <TXProofServ::Terminate>: starting session termination operations ...
15:20:04 21442 Wrk-0.14 | Info in <TXProofServ::Terminate>: process memory footprint: 391844/-1 kB virtual, 120704/-1 kB resident 
15:20:05 21442 Wrk-0.14 | Info in <TXProofServ::Terminate>: data directory '/proof/workingdirs/turra/data/0.14/t2-wn-11-1304687955-21442' has been removed
Terminate: termination operations ended: quitting!
15:20:05 21442 Wrk-0.14 | Info in <TProofPlayerSlave::Process>: 579 events processed
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `histo_PV_n[8]'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `histo_PV_z_resolution'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `histo_PV_x_resolution'
[...]
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `b_MET_Truth_NonInt_sumet'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fInput'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fOutput'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: Found 2848 data members.
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: Data member `histo_run_counter_all' corresponds to output `histo_run_counter_all'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: Data member `histo_selection_counter_run' corresponds to output `histo_selection_counter_run'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: Data member `histo_selection_counter' corresponds to output `histo_selection_counter'
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: Data member `histo_run_counter' corresponds to output `histo_run_counter'
[...]
15:20:05 21442 Wrk-0.14 | Info in <TOutputListSelectorDataMap::Init()>: Data member `profile_PV_sumPt_vs_i_bign' corresponds to output `profile_PV_sumPt_vs_i_bign'
15:20:05 21442 Wrk-0.14 | Info in <TProofPlayerSlave::Process>: Call SlaveTerminate()
15:20:05 21442 Wrk-0.14 | Info in <TXProofServ::TProofServ::Handleprocess>: worker 0.14 has finished processing with 191 objects in output list
15:20:05 21442 Wrk-0.14 | Info in <TXProofServ::HandleProcess>: merging mode check: 0
15:20:05 21442 Wrk-0.14 | Info in <TXProofServ::HandleProcess>: sending result directly to master
15:20:05 21442 Wrk-0.14 | Info in <TXProofServ::SendResults>: enter
15:20:05 21442 Wrk-0.14 | Info in <TXProofServ::SendResults>: message has 1075660 bytes: limit of 1000000 bytes reached - sending ...
110506 15:20:05 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 15:20:05 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 15:20:06 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 15:20:06 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 15:20:07 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 15:20:07 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
110506 15:20:08 001 Proofx-E: Conn::CheckResp: server [:0] did not return OK replying to last request
110506 15:20:08 001 Proofx-E: Conn::CheckErrorStatus: SendMsg: INT: session is reconnecting: retry later
15:20:09 21442 Wrk-0.14 | Info in <TXProofServ::SendResults>: message has 1106798 bytes: limit of 1000000 bytes reached - sending ...
15:20:09 21442 Wrk-0.14 | Info in <TXProofServ::SendResults>: message has 1146840 bytes: limit of 1000000 bytes reached - sending ...
15:20:09 21442 Wrk-0.14 | Info in <TXProofServ::SendResults>: done
15:20:09 21442 Wrk-0.14 | Info in <TXProofServ::SendLogFile>: kPROOF_LOGDONE sent
15:20:09 21442 Wrk-0.14 | Info in <TXProofServ::HandleProcess>: done
// --------- End of element log -------------------

The error you had on 0.16 (exception following signal 1) is probably causing all the rest.
In the second case, 0.14 was marked bad because sending a control message (type kPROOF_GROUPVIEW) failed. This is probably a consequence of some other problem.
Can you post a tarball with all the logs of a failing session?

G

Sure! Here: precision-turra.mi.infn.it/log_Efficiency.tar.gz