Root crashes system with NFS shares?

Hi Fellow Rooters,

I’ll cut right to the chase. I am running on a workgroup with Leopard server providing LDAP access and some NFS shares to some other Leopard workstations. If I take the following program:

#include <iostream>
#include "TH1F.h"

int main()
{
  TH1F *g = NULL;
  g->Draw();
  return 0;
}

and compile it in the usual way

source $ROOTSYS/bin/thisroot.sh
g++ -o crash `root-config --cflags --libs` crash.cc

Then the following occurs:

  1. If I log in to a workstation as a local user, then compile and link against a local copy of the ROOT libraries, the program crashes (as it should), but nothing else exciting happens.

  2. If I log in to a workstation as an LDAP user (so my home directory is on an NFS mount) and link against a local copy of ROOT, the crash freezes my whole system, and I have to do a hard reboot.

  3. If I log in to the workstation as a local user and link against a copy of the root libs which is on an NFS mount, again the crash freezes my system.

  4. If I ssh into the server as an LDAP user, and link against a copy of root on the server (so that both my home directory and root are on local disks), the crash brings down the whole server.

Clearly, a non-privileged user should not be able to crash the whole system with a simple root-linked binary, so this is a problem with Apple (and don’t even get me started on problems with Apple).

However, so far I can’t replicate this problem with any other libraries except root. So at this point I’m working under the assumption that something in root is accidentally exploiting some fatal flaw in Leopard’s handling of NFS shares. Or that god just hates me.

I’m running root 5.20/00, the pre-compiled Leopard binary, although I’ve also tried compiling from source with no real effect.

Cheers,
~Ben

Hi,

Can you try this variation:
[code]
#include <iostream>
#include "TH1F.h"
#include "TSystem.h"

int main()
{
  gSystem->ResetSignal(kSigSegmentationViolation);
  TH1F *g = NULL;
  g->Draw();
  return 0;
}
[/code]
This will tell us if the problem is with the segmentation fault itself or with the signal handler that writes the stack trace of the problem.

Cheers,
Philippe.

Hi Philippe,

No change. I see *** Break *** Bus Error, and then the system freezes.

Trying to debug this further, I found another “fun” feature. I can write a simple bad program which uses no root whatsoever, compile it, generate a bus error, and recover. If I take that same program and link it to some root libraries, it will then freeze when the bus error comes, even though nothing ever calls the root stuff.

For whatever diagnostic good it’s worth, the freeze happens whenever I link to any of -lCore, -lRIO, -lNet, -lHist, -lGraf, -lGpad, or -lTree. It DOESN’T happen when I link to -lCint. I haven’t had a chance to test any of the other libs.

~Ben

Hi,

I am guessing that the ‘random’ behavior is caused by the seg fault ‘randomly’ modifying the initialization (or de-initialization) behavior of the ROOT libraries. Short of being able to capture some sort of stack trace of your process at the time the system freezes (humm … I have no clue how to do that), I don’t see how to proceed.

Anyway (not that it helps :) ), this is clearly a macos bug; a seg fault (any seg fault) in a user level process should not be freezing the system.

Cheers,
Philippe.

Well, Apple is, as usual, ignoring this bug report, but meanwhile I’ve found a hack to at least fix root’s problem. On the advice of someone on the macos-x-server mailing list, I replaced TUnixSystem::StackTrace() with an empty function that just returns immediately, and this fixed the problem. So it seems that StackTrace() goes looking for information about where it’s running, or where it can dump its info, and trips over NFS somehow?

Unfortunately I don’t have the time or expertise to do a thorough debugging to sort out the exact issue, but hopefully this will point someone more capable in the right direction.

Cheers,
~Ben

[quote] I replaced TUnixSystem::StackTrace() with an empty function that just returns immediately, and this fixed the problem. So it seems that StackTrace() goes looking for information about where it’s running, or where it can dump its info, and trips over NFS somehow? [/quote]Humm … interesting. A priori the problem is not in StackTrace, since it should not have been called when you try with "gSystem->ResetSignal(kSigSegmentationViolation);".

Cheers,
Philippe.