Segfault in TTree->Fill() on Solaris, not OS X

Hello,

I have a very puzzling and disturbing problem in ROOT when filling trees based on a custom event class. Running either ROOT 4.00/02 or 4.02/00 on my Solaris box (compiled from source with CC 5.6) causes a segfault after a few TTree->Fill() calls. The same code runs fine, with no crashes, on my OS X box (OS X 10.3, g++ 3.3) with the fink-supplied 4.02/00 distribution. I therefore suspect that something is misconfigured in my ROOT build on the Sun. Here are some details:

The segfault occurs in the TBasket::WriteBuffer method, as illustrated in the following backtrace from dbx:
signal SEGV (no mapping at the fault address) in TBasket::WriteBuffer at 0xfdfb60cc
0xfdfb60cc: WriteBuffer+0x0478: ld [%l3 + 256], %l2
Current function is GSacq
213 gsTree->Fill();
(dbx) where
[1] TBasket::WriteBuffer(0x17037b88, 0x17037d68, 0x17037cc0, 0xfdfd8fa4, 0x16f0c38e, 0xffbdb208), at 0xfdec1294
[2] TBranch::Fill(0x7891, 0x654, 0x0, 0x170376f0, 0x17037b88, 0x17037d68), at 0xfdec2d04
[3] TBranchElement::Fill(0x170376f0, 0x0, 0xfdfdc118, 0xfdfcf720, 0x4, 0x1), at 0xfdecb974
[4] TBranchElement::Fill(0x17003f40, 0xe, 0xfdfdc118, 0xfdfcf720, 0xfe9f4e3c, 0xc), at 0xfdecb940
[5] TBranchElement::Fill(0x16f8cd20, 0x24, 0xfd9edb20, 0xfdfcf720, 0xfe9f4e3c, 0xb), at 0xfdecb940
[6] TTree::Fill(0x16f0e6a0, 0x0, 0x1426fd40, 0xfdfcf720, 0x1, 0xfdfae85e), at 0xfdeec5d8
=>[7] GSacq(ChatFileName = 0xffbfed54 “testchat”), line 211 in “UserEv.h”
[8] main(argc = 3, argv = 0xffbfee4c), line 250 in “GSSort.cxx”

My environment:
11:53am> CC -V
CC: Sun C++ 5.6 2004/07/15
11:53am> uname -a
SunOS wigner 5.9 Generic_118558-02 sun4u sparc SUNW,Sun-Blade-100

I have placed a gObjectTable->Print() in front of the Fill() call and see that the table is identical for all calls, so nothing is obviously wrong there, as I might expect from the fact that it runs fine on my other computer.

Any ideas what may cause this behavior? Are there particular configure options I should en(dis)able?

Thanks,
-Don

It could be that in one of your classes, you have one or more uninitialized variables. Your traceback is typical of this case.
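
To illustrate what is meant by an uninitialized data member, here is a minimal sketch. `SketchEvent` and its members are hypothetical and are not taken from the class discussed in this thread; the sketch only shows the pattern of giving every member a defined value in the constructor and resetting per-event state before the next fill.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical event class (not the poster's GSFMAEvent): every data
// member is given a defined value in the constructor.
class SketchEvent {
public:
    static const int kMaxHits = 16;

    SketchEvent() : fNHits(0), fTimestamp(0.0) {
        std::memset(fEnergy, 0, sizeof(fEnergy));
    }

    // Reset per-event state so that stale values from the previous
    // event are never carried over into the next one.
    void Clear() {
        fNHits = 0;
        fTimestamp = 0.0;
        std::memset(fEnergy, 0, sizeof(fEnergy));
    }

    // Returns false instead of writing past the end of the array.
    bool AddHit(float energy) {
        if (fNHits >= kMaxHits) return false;
        fEnergy[fNHits++] = energy;
        return true;
    }

    int NHits() const { return fNHits; }
    double Timestamp() const { return fTimestamp; }
    float Energy(int i) const {
        assert(i >= 0 && i < fNHits);
        return fEnergy[i];
    }

private:
    int    fNHits;             // number of valid entries in fEnergy
    double fTimestamp;         // illustrative scalar member
    float  fEnergy[kMaxHits];  // fixed-size payload
};
```

A counter such as `fNHits` is the member to watch most closely: if it controls how many array elements get written out, a garbage value there can send the writer far outside the object.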

Rene

Rene,

Thanks for that tip. Closer examination revealed that I had recently added a couple of variables to one of my classes and had forgotten to initialize them in the constructor. However, this did not solve the segfault problem. I still get the same segfault at the same spot. (The 19th Fill()).

Any other ideas?
-Don

Rerun your failed case after setting the global variable gDebug:

gDebug=3;

and send us the result.
Cheers,
Philippe.

Philippe,

Looking at the output, it’s clear that the crash occurs at the first tree write (when enough bytes have been filled to fill a “buffer”), so I redefined my branch call to be
gsTree->Branch("EventBranch","GSFMAEvent",&myevent,16000,2) in order to crash more quickly and have smaller output files. An interesting feature I’ve discovered: the crash does not always occur. For some input files, I can call Fill() many more times before a crash occurs (if it occurs at all).

The attached tarball has 4 files:
run03* are from a run that did not segfault
run21* ended in a segfault
the *screen files contain text that was output to the screen, but not redirected to the file (presumably these were stderr info?)

One difference I notice between the two is that in run03 the STRIP class counts up to ~1700 bytes and then resets to start counting from 1 again, whereas in run21 it just creeps up and up until it hits 16k. Perhaps this points to an error in the wrapper program I’m using to decode the data stream?

Thanks again for any pointers, and any advice on other things to look for in the debug output will be welcome.

AH! The BB didn’t like the .tgz extension. I changed to .tar.gz and it now appears I can attach the files.
gDebug.tar.gz (545 KB)

Don,

This confirms my first diagnosis that you have an uninitialized variable
somewhere in one of your branches. It is likely in a data member at
the top level of your class because the crash appears only when you write
the basket in memory.
The best way to locate this type of error is to run your application
under valgrind. Valgrind will tell you where the problem happens.

Rene

Valgrind is only available for x86 architectures, to which I don’t have access at the moment. However, I will play with SunStudio and see if I can find the culprit.

The crash occurs under Solaris, but not OS X. This seems odd. I would think if it were an uninitialized variable I might get hosed under ANY platform. Oh well, I’ll keep digging.
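
For what it’s worth, here is one way to picture why the same bug can behave differently on two machines: an uninitialized variable simply holds whatever bytes were left at that address, and those leftovers depend on the compiler, the stack layout, and the allocator. The sketch below imitates this without actually reading uninitialized memory (which would be undefined behavior). It copies a made-up “leftover” byte pattern into a counter and checks whether that counter would still be a safe array length. The function name, the array size, and the byte patterns are all illustrative assumptions.

```cpp
#include <cassert>
#include <cstring>

const int kArraySize = 16;

// Interpret sizeof(int) leftover bytes as the int counter that a
// constructor forgot to set, and report whether using it as an array
// length would stay inside an array of kArraySize elements.
// 'leftover' must point to at least sizeof(int) bytes.
bool CounterIsInBounds(const unsigned char* leftover) {
    assert(leftover != 0);
    int counter = 0;
    std::memcpy(&counter, leftover, sizeof(counter));
    return counter >= 0 && counter <= kArraySize;
}
```

If the leftover bytes happen to be all zero, the counter is harmless and nothing crashes; if they are, say, all 0x7F, the counter is enormous and any code that trusts it walks off the end of the array. Same source, different outcome.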

Thanks,
-Don

Ok. Sorry to drag this out, but I have finally brought this down to a very small test case. This still runs, but the bad memory access occurs at the same place—when the branch is initialized.

I’ve attached a sample class header and implementation file, as well as a test program. When run with dbx’s memory access checking, here is the trace:
Read from uninitialized (rui):
Attempting to read 1 byte at address 0xffbf9e1d
which is 621 bytes above the current stack pointer
stopped in G__parse_parameter_link at 0xed9a409c
0xed9a409c: G__parse_parameter_link+0x0184: ba,a 0xed7a5bac ! 0xed7a5bac
Current function is main
35 gsTree->Branch(“EventBranch”,“GSFMAEvent”,&myevent,16000,2);
(dbx) where
[1] G__parse_parameter_link(0xef6eb7eb, 0x14, 0x4f, 0x0, 0x10, 0xeda7b590), at 0xed9a409c
[2] G__memfunc_setup(0x0, 0x0, 0xb7af8, 0x19, 0x320, 0x5), at 0xed9a3dcc
[3] G__setup_memfuncTObject(0x1, 0x1, 0x2, 0xff8, 0x3fade8, 0xef49fcc0), at 0xef49ffcc
[4] G__incsetup_memfunc(0xedac3db4, 0x70, 0x24, 0xeda86cec, 0xeda70adc, 0xeda7b408), at 0xed9a71d4
[5] G__get_methodhandle(0xef6bfbfd, 0xffffffff, 0xb7af8, 0xffbfdb84, 0xffbfdbf4, 0x1), at 0xed9854fc
[6] G__ClassInfo::GetMethod(0xffbfdbf8, 0x6b4980, 0xef6bfbfd, 0xef6bfc09, 0xffbfdbf4, 0x1), at 0xeda242e0
[7] TClass::BuildRealData(0x6b5a48, 0x6a9600, 0x4f148, 0xffbfdc20, 0x6a9600, 0xef734dfc), at 0xef2a05d8
[8] TClass::BuildRealData(0x6b54d8, 0x6a9600, 0x6b5dd8, 0x6b5a48, 0x6a9600, 0xef734dfc), at 0xef2a0708
[9] TTree::BuildStreamerInfo(0x6a9358, 0x6b54d8, 0x6a9600, 0xe6d5ea04, 0xef766ea0, 0xf3521d30), at 0xe6d5f0b0
[10] TTree::Bronch(0x6a9358, 0x1c451, 0x1c45d, 0xffbfed88, 0x3e80, 0x2), at 0xe6d5ea14
=>[11] main(argc = 1, argv = 0xffbfee14), line 35 in “test.cxx”

I would be most grateful for an explanation of why this doesn’t work.

Thanks again,
-Don

P.S. When compiling I get several Warnings of the type:
“/dk/bgo37/dpeterson/include/root/BaseCls.h”, line 53: Warning: G__BaseClassInfo::Init hides the function G__ClassInfo::Init().
Should I be concerned at all?
test.cxx (1.02 KB)
GSFMAEvent.cxx (606 Bytes)
GSFMAEvent.h (744 Bytes)

I tested your program without any problem on Linux and Solaris.
My Solaris machine has CC 5.2. I cannot test with 5.4, but I have doubts
that the compiler version could be a problem.

Are you generating the dictionary for your class?
Could you test the following
root > .L GSFMAEvent.cxx+
root > .x mytest.C

where mytest.C is

[code]void mytest() {
   // Output file and tree
   TFile *fConv = new TFile("TestCase.root","RECREATE");
   TTree *gsTree = new TTree("GSFMA","GSFMA Data");

   // Declare an instance of our event class
   GSFMAEvent *myevent = new GSFMAEvent();

   // Build a branch from our event class
   gsTree->Branch("EventBranch","GSFMAEvent",&myevent,16000,2);

   cout << "First Fill call\n";
   gsTree->Fill();

   myevent->print();
   cout << "Try Filling Tree Again\n" << flush;
   gsTree->Fill();
   cout << "done!\n";

   // Write the tree to the file and close it
   fConv->Write();
   fConv->Close();
}
[/code]

Hi Rene,

Yes, I am creating a dictionary for the class. As I mentioned, the above code runs, but is still accessing bad memory. My actual full class would also run for a simple one-pass fill like this.

I ran valgrind on it on my home linux box last night, and it shows several leaks in libCint.so and libCore.so. What’s interesting is that the amount of leaked memory is different depending on whether I link against my class’ shared objects created by ACLiC in root or using the rootcint command line tool. (ACLiC leaks 50% more).

If you see that valgrind is not reporting errors on your end, then it must be something in how I’ve built ROOT from source.

What else can we do to discover the flaw?

Thanks again for your time,
-Don

What you call leaks are not leaks. They are simply normal creation
of dynamic data structures that we do not delete when the job terminates
because it does not make sense to waste time at the end of the job.
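
The distinction can be sketched as follows. This is not ROOT’s actual code; `GlobalRegistry` and `DistinctRegistries` are hypothetical names used only to show a structure that is allocated once, stays reachable through a static pointer, and is deliberately left for the operating system to reclaim at exit. Valgrind reports such blocks as “still reachable” rather than “definitely lost”.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical registry, allocated once on first use and intentionally
// never deleted: the operating system reclaims it when the process exits.
std::vector<int>* GlobalRegistry() {
    static std::vector<int>* registry = new std::vector<int>();
    return registry;
}

// Counts how many distinct registry objects are seen over n calls.
// A real leak would allocate again on every call; here the count stays 1.
std::size_t DistinctRegistries(int n) {
    assert(n >= 0);
    std::vector<std::vector<int>*> seen;
    for (int i = 0; i < n; ++i) {
        std::vector<int>* r = GlobalRegistry();
        bool known = false;
        for (std::size_t j = 0; j < seen.size(); ++j) {
            if (seen[j] == r) known = true;
        }
        if (!known) seen.push_back(r);
    }
    return seen.size();
}
```

The amount of memory held this way is fixed by what was set up during the job and does not grow with the number of calls, which is what separates it from a leak.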

As I said, I could not reproduce your problem with your file.
Please send a concrete example that fails.

Rene

Hi Rene,

The simple example I sent before DID fail if filled enough times. However, I spent some time this weekend debugging the “stock” data stream unpacking program here that was feeding my tree. After cleaning up some reads from uninitialized memory in that program, I have not seen a crash. So it seems that it wasn’t a real problem with my ROOT class or TTree, but rather this wrapper program overwriting something.
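
For anyone hitting something similar, here is a minimal sketch of the kind of guard that avoids decoding stale buffer contents. It is not the unpacking program mentioned above; `ReadRecord` is a hypothetical helper, and the fixed-capacity buffer is an assumption. The buffer is zeroed before each read, and the function returns the number of bytes the stream actually delivered so that the caller only decodes those.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record reader. The buffer is cleared first, and the return
// value tells the caller how many bytes were really read, so a short read
// never leaves undefined bytes to be decoded as data.
std::size_t ReadRecord(std::istream& in, unsigned char* buffer, std::size_t capacity) {
    assert(buffer != 0);
    for (std::size_t i = 0; i < capacity; ++i) buffer[i] = 0;
    in.read(reinterpret_cast<char*>(buffer), static_cast<std::streamsize>(capacity));
    return static_cast<std::size_t>(in.gcount());
}
```

With a three-byte input and an eight-byte buffer, the function returns 3 and the remaining five bytes are zero rather than whatever was left over from the previous record.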

It’s still a bit strange that we didn’t crash on the Mac as well, but I can see that different compilers on different architectures handle memory differently, and that I was getting “lucky”. Or it could be that gcc 3.3 actually initializes variables as they’re declared in some instances. Anyhow, it seems that was a red herring. Thanks again, all seems to be working now.

Cheers,
-Don

OK Don. So my initial guess was correct.

Rene