Tracking down a crash on PROOF

gwatts · September 19, 2010, 11:13pm

Hi,
When I run on PROOF on a single dataset, everything seems to work well. But when I try to run again in the same PROOF session, I get a fairly hard crash. I’m almost positive this is my bug, but I’m having some trouble chasing it down.

Here is the log from one of the worker nodes. It happens when I re-run the same job a second time - it is the identical job, on the same dataset.

[code]The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.

#6 0x00002ad19aa9f8f0 in vtable for TString ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libCore.so
#7 0x00002ad19a37318a in TClass::Destructor ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libCore.so
#8 0x00002ad19ba04fc8 in TBufferFile::ReadFastArray ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libRIO.so
#9 0x00002ad19bab4ce4 in TStreamerInfo::ReadBuffer<char**> ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libRIO.so
#10 0x00002ad19ba05f91 in TBufferFile::ReadClassBuffer ()
from /phys/groups/tev/scratch1/users/gwatts/root/lib/libRIO.so
#11 0x00002ad19ba30c24 in TGenCollectionStreamer::ReadObjects ()
[/code]

I tried to turn on logging using SetLogLevel, and the crash dump went away, though things still crashed.

I’ve been able to run multiple times locally using the same TSelector, even after taking my fInputList, writing it to a file, and reading it back. And the crash only happens the second time…

Many thanks for anyone’s help!

Cheers, Gordon.

gwatts · September 21, 2010, 3:20pm

With help from the ROOT discussion board I’ve now found this error - it was an uninitialized pointer in an object that was one of PROOF’s inputs. I don’t know if this will help anyone, but here is what I did to track it down:

Make my scripts trivially switch between running locally and remotely.
Generate the input list (TProof::AddInput) locally even when not on PROOF, then write the objects out to a file, delete them, read them back from the file, and use those. This ensures that you’ve not made a ROOT I/O error (this is, I think, the same mechanism that PROOF uses to transmit those objects across the wire).
Use valgrind when running locally.
Create a small tarball sample and upload it to the ROOT discussion board.
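
For anyone hitting the same thing, here is a minimal sketch of the failure mode and of the write/read-back check, in plain C++ with no ROOT calls. It assumes (as the stack trace above suggests, with TClass::Destructor called from inside the read) that the reader frees whatever a pointer member currently holds before filling it. InputConfig, writeTo, readInto and roundTripMatches are hypothetical names made up for this illustration, and the string stream stands in for the file.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for an object placed on the PROOF input list.
struct InputConfig {
    std::string* label;  // owned pointer member

    // A reader typically builds the empty object with the default
    // constructor, so the pointer must start out null here.
    InputConfig() : label(nullptr) {}
    explicit InputConfig(const std::string& s) : label(new std::string(s)) {}
    ~InputConfig() { delete label; }

    // Owning raw pointer: forbid copies to avoid double deletes.
    InputConfig(const InputConfig&) = delete;
    InputConfig& operator=(const InputConfig&) = delete;
};

// Write the object to a stream (stand-in for writing it to a file).
void writeTo(std::ostream& out, const InputConfig& c) {
    out << (c.label ? *c.label : std::string()) << '\n';
}

// Read into an existing object. The old pointee is freed first, which is
// only safe if the pointer is null or valid; deleting an uninitialized
// pointer is undefined behavior and usually crashes.
void readInto(std::istream& in, InputConfig& c) {
    delete c.label;
    c.label = nullptr;
    c.label = new std::string;
    std::getline(in, *c.label);
}

// Round-trip check: write, read back into a default-constructed object,
// and compare with what was written.
bool roundTripMatches(const std::string& text) {
    InputConfig original(text);
    std::stringstream buffer;
    writeTo(buffer, original);

    InputConfig copy;  // label is nullptr thanks to the default constructor
    readInto(buffer, copy);
    return copy.label != nullptr && *copy.label == text;
}
```

If the default constructor left label unset, the delete in readInto would act on garbage. That kind of error can go unnoticed in one run and crash in the next, depending on what happens to be in memory, which is why valgrind is useful here.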

In general, is there some way to debug these sorts of things if you are seeing the crashes on the PROOF server?

Many thanks, Gordon.
