fedino
June 14, 2011, 2:20pm
1
Dear Proof experts,
I am using ROOT 5.28/00 (trunk@37585).
I am facing a very strange problem, and I do not understand whether it depends on PROOF or not.
I have written a trivial TSelector, which does nothing at all. It runs fine over some ROOT files, but the same code crashes when run over a newer set of data, and I do not understand why.
Attached below are an example of the crash output and the corresponding log. This is from PROOF-Lite, but the problem is the same with “full” PROOF.
The output:
Looking up for exact location of files: OK (16 files)
Looking up for exact location of files: OK (16 files)
Info in TPacketizerAdaptive::TPacketizerAdaptive : Setting max number of workers per node to 3
Validating files: OK (16 files)
Info in TPacketizerAdaptive::InitStats : fraction of remote files 1.000000
0.1: caught exception triggered by signal ‘1’ …| 0.00 %
Info in TProofLite::MarkBad :
+++ Message from master at gridui1.pi.infn.it : marking 0.1-gridui1-1308058668-11115:-1 (0.1) as bad
+++ Reason: undefined message in TProof::CollectInputFrom(…)
+++ Most likely your code crashed
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root TProof::Mgr("gridui1.pi.infn.it")->GetSessionLogs()->Display("*")
Error in TPacketizerAdaptive::SplitPerHost : Error removing a missing file
Info in TPacketizerAdaptive::InitStats : fraction of remote files 1.000000
0.2: caught exception triggered by signal ‘1’
Info in TProofLite::MarkBad :
+++ Message from master at gridui1.pi.infn.it : marking 0.2-gridui1-1308058668-11117:-1 (0.2) as bad
+++ Reason: undefined message in TProof::CollectInputFrom(…)
+++ Most likely your code crashed
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root TProof::Mgr("gridui1.pi.infn.it")->GetSessionLogs()->Display("*")
Error in TPacketizerAdaptive::SplitPerHost : Error removing a missing file
Info in TPacketizerAdaptive::InitStats : fraction of remote files 1.000000
0.0: caught exception triggered by signal ‘1’
Info in TProofLite::MarkBad :
+++ Message from master at gridui1.pi.infn.it : marking 0.0-gridui1-1308058668-11113:-1 (0.0) as bad
+++ Reason: undefined message in TProof::CollectInputFrom(…)
+++ Most likely your code crashed
+++ Please check the session logs for error messages either using
+++ the ‘Show logs’ button or executing
+++
+++ root TProof::Mgr("gridui1.pi.infn.it")->GetSessionLogs()->Display("*")
Error in TPacketizerAdaptive::SplitPerHost : Error removing a missing file
Info in TPacketizerAdaptive::InitStats : fraction of remote files 1.000000
The log:
===========================================================
There was a crash (kSigSegmentationViolation).
This is the entire stack trace of all threads:
#0 0x0000003f75c99fc5 in waitpid () from /lib64/libc.so.6
#1 0x0000003f75c3c331 in do_system () from /lib64/libc.so.6
#2 0x00002ab1aa79b94b in TUnixSystem::Exec (this=0x6392a0,
shellcmd=0x1126678 “/afs/cern.ch/sw/lcg/app/releases/ROOT/5.28.00/x86_64-slc5-gcc43-dbg/root/etc/gdb-backtrace.sh 11113 1>&2”)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:2005
#3 0x00002ab1aa79ab04 in TUnixSystem::StackTrace (this=0x6392a0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:2227
#4 0x00002ab1aa79e040 in TUnixSystem::DispatchSignals (this=0x6392a0,
sig=kSigSegmentationViolation)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:1131
#5 0x00002ab1aa79e16a in SigHandler (sig=kSigSegmentationViolation)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:352
#6 0x00002ab1aa79330c in sighandler (sig=11)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:3496
#7 <signal handler called>
#8 0x00002ab1b0f0adbc in TEventIterTree::GetTrees (this=0x111cd40, elem=
0x1139640)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:519
#9 0x00002ab1b0f0b6b0 in TEventIterTree::GetNextEvent (this=0x111cd40)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:753
#10 0x00002ab1b0f53885 in TProofPlayer::Process (this=0x11120e0,
dset=0xdc29b0, selector_file=0x1084ee8 “test”, option=0x2ab1aaf1fe78 “”,
nentries=-1, first=-1)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TProofPlayer.cxx:863
#11 0x00002ab1ada0026f in TProofServ::HandleProcess (this=0xbe4ca0, mess=
0xdb32e0, slb=0x0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:3676
#12 0x00002ab1ada15ee9 in TProofServ::HandleSocketInput (this=0xbe4ca0,
mess=0xdb32e0, all=true)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1556
#13 0x00002ab1ada08171 in TProofServ::HandleSocketInput (this=0xbe4ca0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1290
#14 0x00002ab1ada1fc27 in TProofServLiteInputHandler::Notify (this=0xbe5070)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:162
#15 0x00002ab1ada22d60 in TProofServLiteInputHandler::ReadNotify (
this=0xbe5070)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:154
#16 0x00002ab1aa79d3c7 in TUnixSystem::CheckDescriptors (this=0x6392a0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:1233
#17 0x00002ab1aa79db39 in TUnixSystem::DispatchOneEvent (this=0x6392a0,
pendingOnly=false)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:940
#18 0x00002ab1aa6e544a in TSystem::InnerLoop (this=0x6392a0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:406
#19 0x00002ab1aa6f4e60 in TSystem::Run (this=0x6392a0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:356
#20 0x00002ab1aa66444b in TApplication::Run (this=0xbe4ca0, retrn=false)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TApplication.cxx:1052
#21 0x00002ab1ada05b4c in TProofServ::Run (this=0xbe4ca0, retrn=false)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:2431
#22 0x0000000000402348 in main (argc=5, argv=0x7fff00625828)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/main/src/pmain.cxx:314
The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
root.cern.ch/bugs . Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
#8 0x00002ab1b0f0adbc in TEventIterTree::GetTrees (this=0x111cd40, elem=
0x1139640)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:519
#9 0x00002ab1b0f0b6b0 in TEventIterTree::GetNextEvent (this=0x111cd40)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:753
#10 0x00002ab1b0f53885 in TProofPlayer::Process (this=0x11120e0,
dset=0xdc29b0, selector_file=0x1084ee8 “test”, option=0x2ab1aaf1fe78 “”,
nentries=-1, first=-1)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TProofPlayer.cxx:863
#11 0x00002ab1ada0026f in TProofServ::HandleProcess (this=0xbe4ca0, mess=
0xdb32e0, slb=0x0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:3676
#12 0x00002ab1ada15ee9 in TProofServ::HandleSocketInput (this=0xbe4ca0,
mess=0xdb32e0, all=true)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1556
#13 0x00002ab1ada08171 in TProofServ::HandleSocketInput (this=0xbe4ca0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:1290
#14 0x00002ab1ada1fc27 in TProofServLiteInputHandler::Notify (this=0xbe5070)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:162
#15 0x00002ab1ada22d60 in TProofServLiteInputHandler::ReadNotify (
this=0xbe5070)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServLite.cxx:154
#16 0x00002ab1aa79d3c7 in TUnixSystem::CheckDescriptors (this=0x6392a0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:1233
#17 0x00002ab1aa79db39 in TUnixSystem::DispatchOneEvent (this=0x6392a0,
pendingOnly=false)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/unix/src/TUnixSystem.cxx:940
#18 0x00002ab1aa6e544a in TSystem::InnerLoop (this=0x6392a0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:406
#19 0x00002ab1aa6f4e60 in TSystem::Run (this=0x6392a0)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TSystem.cxx:356
#20 0x00002ab1aa66444b in TApplication::Run (this=0xbe4ca0, retrn=false)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/core/base/src/TApplication.cxx:1052
#21 0x00002ab1ada05b4c in TProofServ::Run (this=0xbe4ca0, retrn=false)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proof/src/TProofServ.cxx:2431
#22 0x0000000000402348 in main (argc=5, argv=0x7fff00625828)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/main/src/pmain.cxx:314
Is it possible that data prepared with a different version of ROOT behave differently under PROOF?
I am quite sure that the dataset that makes PROOF crash is fine, since other people are using it with plain ROOT, and opening it in the browser raises no exceptions or warnings.
I have also tried ROOT versions 28, 28c, 28d, 30rc, but nothing changed.
Thank you very much in advance.
Cheers,
federico
fedino
June 14, 2011, 3:01pm
2
Dear all,
I would like to link this post from the PROOF section of RootTalk: “Proof precess problem with ATLAS D3PD data”.
It seems that my problem is almost the same: my toy TSelector now runs after applying the suggested workaround, and I am checking further. Anyhow, it would be useful to understand how to interpret the logs, since I could not have imagined how to solve the problem without reading the linked post.
Thank you again,
federico
ganis
June 17, 2011, 1:03pm
3
Dear Federico,
Debugging these issues is always complicated, especially when there are many layers involved.
In this case the master log gave no hints, I agree, except that something bad happened on (all) the workers.
On the other hand, the worker logs gave the exact location of the segmentation violation, which helped to understand and fix the problem in the trunk, and to provide the workaround.
Gerri
fedino
June 17, 2011, 1:43pm
4
Hi Gerri,
thank you very much for your answer. Sorry, I did not mean to be impolite; I just wanted to point out that I found it very difficult to find my way out of this.
By the way, I do not understand where the stack trace in the log should have helped me; could you please post the relevant lines? In the next few days the data will change again, so I may well get trapped in something similar.
Thanks a lot.
cheers,
federico
ganis
June 17, 2011, 2:06pm
5
Hi Federico,
Sure, I did not interpret it that way.
The point is that these problems are difficult to follow even for the ‘experts’, especially if they happen in the PROOF code and not in yours.
In the case at hand, the stack trace in the worker log indicated that a ‘segmentation violation’ occurred at line 519 of proof/proofplayer/src/TEventIter.cxx:
The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
root.cern.ch/bugs . Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
#8 0x00002ab1b0f0adbc in TEventIterTree::GetTrees (this=0x111cd40, elem=
0x1139640)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:519
#9 0x00002ab1b0f0b6b0 in TEventIterTree::GetNextEvent (this=0x111cd40)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:753
#10 0x00002ab1b0f53885 in TProofPlayer::Process (this=0x11120e0,
dset=0xdc29b0, selector_file=0x1084ee8 “test”, option=0x2ab1aaf1fe78 “”,
nentries=-1, first=-1)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TProofPlayer.cxx:863
#11 0x00002ab1ada0026f in TProofServ::HandleProcess (this=0xbe4ca0, mess=
0xdb32e0, slb=0x0)
The only thing we can ask you in these cases is to post the logs, as you have done. But, of course, if you want to dig into the code yourself you are free to do so …
Gerri
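For readers who want to automate the reading of the worker log described above, here is a minimal Python sketch. It is not part of ROOT or PROOF: the helper names (`split_frames`, `crash_site`) are hypothetical, and it assumes the worker log contains the marker line "The lines below might hint at the cause of the crash." followed by gdb-style frames, as in the log posted in this thread. It reports the first frame after that marker that carries both a function name and a source location.

```python
import re

# Marker line that ROOT prints before repeating the frames of interest
# (taken from the log in this thread).
HINT = "The lines below might hint at the cause of the crash."

FRAME_START = re.compile(r"^#(\d+)\s+")           # e.g. "#8 0x0000..."
FUNCTION = re.compile(r"\bin\s+([\w:~]+)\s*\(")   # e.g. "in TEventIterTree::GetTrees ("
LOCATION = re.compile(r"\bat\s+(\S+):(\d+)\s*$")  # e.g. "at /path/TEventIter.cxx:519"


def split_frames(lines):
    """Group raw lines into frames: a frame starts at '#N', and any
    following non-empty line is treated as its continuation."""
    frames = []
    for line in lines:
        line = line.strip()
        if FRAME_START.match(line):
            frames.append(line)
        elif frames and line:
            frames[-1] += " " + line
    return frames


def crash_site(log_text):
    """Return (function, source_file, line_number) for the first frame after
    the hint marker that has a source location, or None if there is none."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if HINT in line:
            lines = lines[i + 1:]
            break
    else:
        return None  # marker not found: nothing to report
    for frame in split_frames(lines):
        func = FUNCTION.search(frame)
        loc = LOCATION.search(frame)
        if func and loc:
            return func.group(1), loc.group(1), int(loc.group(2))
    return None


# Excerpt copied from the worker log posted above.
SAMPLE_LOG = """The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
root.cern.ch/bugs . Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
#8 0x00002ab1b0f0adbc in TEventIterTree::GetTrees (this=0x111cd40, elem=
0x1139640)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:519
#9 0x00002ab1b0f0b6b0 in TEventIterTree::GetNextEvent (this=0x111cd40)
at /build/bellenot/SPI/x86_64-slc5-gcc43-dbg/root/proof/proofplayer/src/TEventIter.cxx:753
"""

if __name__ == "__main__":
    site = crash_site(SAMPLE_LOG)
    if site is not None:
        function, source_file, line_number = site
        print(f"{function} at {source_file}:{line_number}")
```

On the excerpt above this points at TEventIterTree::GetTrees, line 519 of TEventIter.cxx, which is the location Gerri quotes. A crash site inside the PROOF sources rather than in your own selector is a reasonable cue to post the full worker logs, as suggested in the thread.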