Segmentation Fault in TObject::TObject(TObject const&)

Dear ROOTers,

I am having a tough problem and I would be extremely grateful if you could help me solve it.

I am getting a segmentation fault in my analysis program (under ROOT 5.34/34). It happens rather randomly, but only after a significant number of events have been processed (on the order of a few million). The stack trace is below:

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f90dfafcb4c in __libc_waitpid (pid=3111, stat_loc=stat_loc@entry=0x7ffd5142fc80, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:31
#1  0x00007f90dfa822e2 in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:148
#2  0x00007f90e079b073 in TUnixSystem::StackTrace() () from /usr/cern/root_v5.34.34/lib/libCore.so
#3  0x00007f90e079cd7c in TUnixSystem::DispatchSignals(ESignals) () from /usr/cern/root_v5.34.34/lib/libCore.so
#4  <signal handler called>
#5  0x00007f90e06f3177 in TObject::TObject(TObject const&) () from /usr/cern/root_v5.34.34/lib/libCore.so
#6  0x00007f90db8addb1 in TVector3::TVector3(TVector3 const&) () from /usr/cern/root_v5.34.34/lib/libPhysics.so
#7  0x00007f90dabe8f80 in StMuRpsTrack::thetaRp(unsigned int) const () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./StMuRpsTrack_cxx.so
#8  0x00007f90d9d5a879 in StMuRpsUtil::recalcTrack(StMuRpsTrack const*, StMuRpsTrack*) () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./StMuRpsUtil_cxx.so
#9  0x00007f90d9d5ac8d in StMuRpsUtil::process() () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./StMuRpsUtil_cxx.so
#10 0x00007f90d8e96980 in star::RpTrackSelection() const () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./star_C.so
#11 0x00007f90d8ea064b in star::Process(long long) () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./star_C.so
#12 0x00007f90d912c289 in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) () from /usr/cern/root_v5.34.34/lib/libProofPlayer.so
#13 0x00007f90dc2148c6 in TProofServ::HandleProcess(TMessage*, TString*) () from /usr/cern/root_v5.34.34/lib/libProof.so
#14 0x00007f90dc218fe8 in TProofServ::HandleSocketInput(TMessage*, bool) () from /usr/cern/root_v5.34.34/lib/libProof.so
#15 0x00007f90dc201e1f in TProofServ::HandleSocketInput() () from /usr/cern/root_v5.34.34/lib/libProof.so
#16 0x00007f90dc21b6a1 in TProofServLiteInputHandler::Notify() () from /usr/cern/root_v5.34.34/lib/libProof.so
#17 0x00007f90e079c545 in TUnixSystem::CheckDescriptors() () from /usr/cern/root_v5.34.34/lib/libCore.so
#18 0x00007f90e079d06a in TUnixSystem::DispatchOneEvent(bool) () from /usr/cern/root_v5.34.34/lib/libCore.so
#19 0x00007f90e071ffe6 in TSystem::InnerLoop() () from /usr/cern/root_v5.34.34/lib/libCore.so
#20 0x00007f90e0720bf0 in TSystem::Run() () from /usr/cern/root_v5.34.34/lib/libCore.so
#21 0x00007f90e06c636f in TApplication::Run(bool) () from /usr/cern/root_v5.34.34/lib/libCore.so
#22 0x0000000000401bbe in main ()
===========================================================


The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00007f90e06f3177 in TObject::TObject(TObject const&) () from /usr/cern/root_v5.34.34/lib/libCore.so
#6  0x00007f90db8addb1 in TVector3::TVector3(TVector3 const&) () from /usr/cern/root_v5.34.34/lib/libPhysics.so
#7  0x00007f90dabe8f80 in StMuRpsTrack::thetaRp(unsigned int) const () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./StMuRpsTrack_cxx.so
#8  0x00007f90d9d5a879 in StMuRpsUtil::recalcTrack(StMuRpsTrack const*, StMuRpsTrack*) () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./StMuRpsUtil_cxx.so
#9  0x00007f90d9d5ac8d in StMuRpsUtil::process() () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./StMuRpsUtil_cxx.so
#10 0x00007f90d8e96980 in star::RpTrackSelection() const () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./star_C.so
#11 0x00007f90d8ea064b in star::Process(long long) () from /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1455994104-30508/worker-0.2/./star_C.so
#12 0x00007f90d912c289 in TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) () from /usr/cern/root_v5.34.34/lib/libProofPlayer.so
#13 0x00007f90dc2148c6 in TProofServ::HandleProcess(TMessage*, TString*) () from /usr/cern/root_v5.34.34/lib/libProof.so
#14 0x00007f90dc218fe8 in TProofServ::HandleSocketInput(TMessage*, bool) () from /usr/cern/root_v5.34.34/lib/libProof.so
#15 0x00007f90dc201e1f in TProofServ::HandleSocketInput() () from /usr/cern/root_v5.34.34/lib/libProof.so
#16 0x00007f90dc21b6a1 in TProofServLiteInputHandler::Notify() () from /usr/cern/root_v5.34.34/lib/libProof.so
#17 0x00007f90e079c545 in TUnixSystem::CheckDescriptors() () from /usr/cern/root_v5.34.34/lib/libCore.so
#18 0x00007f90e079d06a in TUnixSystem::DispatchOneEvent(bool) () from /usr/cern/root_v5.34.34/lib/libCore.so
#19 0x00007f90e071ffe6 in TSystem::InnerLoop() () from /usr/cern/root_v5.34.34/lib/libCore.so
#20 0x00007f90e0720bf0 in TSystem::Run() () from /usr/cern/root_v5.34.34/lib/libCore.so
#21 0x00007f90e06c636f in TApplication::Run(bool) () from /usr/cern/root_v5.34.34/lib/libCore.so
#22 0x0000000000401bbe in main ()
===========================================================

To me it looks like ROOT crashes in the TObject copy constructor. The code that leads to the segmentation fault is as follows:

double StMuRpsTrack::thetaRp(unsigned int coordinate) const
{
	if(coordinate>rpsAngleTheta) return 0.0;
	if(mType==rpsLocal) return theta(coordinate);
	TVector3 deltaVector = trackPoint(1)->positionVec() - trackPoint(0)->positionVec();
	return atan((coordinate<rpsAngleTheta ? deltaVector[coordinate] : deltaVector.Perp())/abs(deltaVector.z()));
}

The culprit is the line where deltaVector is initialized with the difference of two vectors. I checked that at the moment of the crash these two vectors “are OK”. Any guess what could be wrong?

Thank you,
Rafal

Hi Rafal,

can you reproduce the error in a standalone program?

Danilo

Check first that your pointers aren’t NULL before creating deltaVector.

if(!trackPoint(1) || !trackPoint(0))
{
	cout << "null error" << endl;
	return 0.0;
}

Hi,

[quote]
Hi Rafal,
can you reproduce the error in a standalone program?[/quote]

it’ll be a bit troublesome, but I’ll try.

I’ve done that check

Cheers,
Rafal

Ok, I see.

Then try the following; maybe you can detect the problem that way:

valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp --leak-check=full --log-file=output.log root.exe -n -l -b yourscript.cpp+ -q

Hi,

I used valgrind to track down the problem. Since I’m running the analysis with TProof, I enabled valgrind through the following set of commands:

(...)
TProof::AddEnvVar("PROOF_WRAPPERCMD", "valgrind_opts:--leak-check=full");
TProof *p = TProof::Open("workers=5", "valgrind=workers");
(...)

I attached the valgrind log file for one of the workers, the one on which processing failed first. Here’s the part of the valgrind output which refers to TObject:

[quote]==21831== 1 errors in context 1 of 49:
==21831== Invalid read of size 4
==21831== at 0x4FF1177: TObject::TObject(TObject const&) (in /usr/cern/root_v5.34.34/lib/libCore.so)
==21831== by 0xC64CDB0: TVector3::TVector3(TVector3 const&) (in /usr/cern/root_v5.34.34/lib/libPhysics.so)
==21831== by 0xEB3AF7F: StMuRpsTrack::thetaRp(unsigned int) const (in /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1456171684-21803/worker-0.4/StMuRpsTrack_cxx.so)
==21831== by 0xFDCF878: StMuRpsUtil::recalcTrack(StMuRpsTrack const*, StMuRpsTrack*) (in /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1456171684-21803/worker-0.4/StMuRpsUtil_cxx.so)
==21831== by 0xFDCFC8C: StMuRpsUtil::process() (in /home/rafal/.proof/STAR_analysis-QA_2015-CEP_analysis/session-120-D11-1456171684-21803/worker-0.4/StMuRpsUtil_cxx.so)
==21831== by 0x1109AECF: star::RpTrackSelection() const (in /home/rafal/.proof/cache/star_C.so)
==21831== by 0x110A538A: star::Process(long long) (in /home/rafal/.proof/cache/star_C.so)
==21831== by 0x10DE2288: TProofPlayer::Process(TDSet*, char const*, char const*, long long, long long) (in /usr/cern/root_v5.34.34/lib/libProofPlayer.so)
==21831== by 0xB0D28C5: TProofServ::HandleProcess(TMessage*, TString*) (in /usr/cern/root_v5.34.34/lib/libProof.so)
==21831== by 0xB0D6FE7: TProofServ::HandleSocketInput(TMessage*, bool) (in /usr/cern/root_v5.34.34/lib/libProof.so)
==21831== by 0xB0BFE1E: TProofServ::HandleSocketInput() (in /usr/cern/root_v5.34.34/lib/libProof.so)
==21831== by 0xB0D96A0: TProofServLiteInputHandler::Notify() (in /usr/cern/root_v5.34.34/lib/libProof.so)
==21831== Address 0x1c is not stack'd, malloc'd or (recently) free'd[/quote]

Any guess?

Best regards,
Rafal
worker-0.4.valgrind.log.txt (111 KB)

It seems that your library has some uninitialised values, e.g. in the constructors of your classes deriving from TObject. This may be the reason, as explained here:

http://stackoverflow.com/a/23802904

Try compiling your library with the -g option and no optimization (no -O2) to get source line information for the error, then rerun and repost the ROOT crash log and the valgrind log.

Does the problem only appear when using PROOF?

Hi,
after a lot of investigation I found what has been causing the segmentation fault.

I checked that the problem was not specific to running ROOT with PROOF. The program always crashed after a significant number of processed events, and that is the core of the problem. Running the program under gdb helped a lot in analysing the code.

Description of the problem and final solution:
For each event I run a kind of reconstruction algorithm, during which many objects (tracks) are created. Tracks contain other objects, let’s call them subtracks, which are also created in large numbers during reconstruction. Subtracks are stored in tracks through TRef objects.
What I figured out with gdb was that the program crashed when I tried to invoke a method on a pointer to a particular subtrack. This pointer was obtained from a track by calling TRef::GetObject() and casting the returned pointer to (subtrack *). Unfortunately, this pointer was NULL. However, I had access to the same subtrack “externally” (independently of the track), and there the pointer was valid (not NULL).
So the problem turned out to be related to the TRef class. Each time I created a track object, TRef objects were also created as part of the track class. And each time a TRef object is created, some counter is incremented in TProcessID. This counter is limited, and probably once its maximum value was exceeded (too many TRefs, which is why it took a significant number of processed events to happen) TRef::GetObject() returned a NULL pointer.
The solution is described at https://root.cern.ch/doc/master/classTRef.html: before running the reconstruction I store the current number of TRef objects via

int currentNum = TProcessID::GetObjectCount();

and after the reconstruction is done I restore the count to its value from before the reconstruction via

TProcessID::SetObjectCount(currentNum);

Now the program executes without any problems. Thank you for your help.
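
For anyone who finds this thread later: the save/restore pair above can be wrapped in a small RAII guard, so the count is also restored when the per-event code returns early. The following is only a minimal C++11 sketch under stated assumptions. FakeProcessID is a hypothetical stand-in that mimics nothing more than the two static TProcessID calls used above, so the example compiles without ROOT; ObjectCountGuard, reconstructEvent and processEvents are illustrative names, not part of ROOT or of the analysis code. Resetting the count is appropriate only if TRefs created inside one event are never dereferenced after that event is finished.

```cpp
#include <cassert>

// Hypothetical stand-in for ROOT's TProcessID, reduced to the two static
// calls used in this post. With ROOT available, TProcessID itself would be
// used as the template argument of the guard below.
class FakeProcessID {
public:
    static unsigned int GetObjectCount() { return fgNumber; }
    static void SetObjectCount(unsigned int number) { fgNumber = number; }
    // Mimics the side effect of registering one referenced object.
    static unsigned int AssignID() { return ++fgNumber; }
private:
    static unsigned int fgNumber;
};
unsigned int FakeProcessID::fgNumber = 0;

// RAII guard: remembers the object count on construction and restores it
// on destruction, so every exit path of the per-event code resets it.
template <typename ProcessID>
class ObjectCountGuard {
public:
    ObjectCountGuard() : fSaved(ProcessID::GetObjectCount()) {}
    ~ObjectCountGuard() { ProcessID::SetObjectCount(fSaved); }
    ObjectCountGuard(const ObjectCountGuard&) = delete;
    ObjectCountGuard& operator=(const ObjectCountGuard&) = delete;
private:
    unsigned int fSaved;
};

// Hypothetical per-event reconstruction that registers nRefs objects.
// Returns the count seen at the end of the event, before the guard resets it.
unsigned int reconstructEvent(unsigned int nRefs) {
    ObjectCountGuard<FakeProcessID> guard;
    for (unsigned int i = 0; i < nRefs; ++i) {
        FakeProcessID::AssignID();
    }
    return FakeProcessID::GetObjectCount();
}

// Runs nEvents events and returns the object count left over afterwards.
unsigned int processEvents(unsigned int nEvents, unsigned int nRefsPerEvent) {
    for (unsigned int i = 0; i < nEvents; ++i) {
        const unsigned int before = FakeProcessID::GetObjectCount();
        reconstructEvent(nRefsPerEvent);
        // The guard must have put the count back where it started.
        assert(FakeProcessID::GetObjectCount() == before);
    }
    return FakeProcessID::GetObjectCount();
}
```

With the guard in place the count no longer grows with the number of processed events: processEvents leaves it at the value it had before the loop started.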

Best regards,
Rafal