ROOT segfaults in __cxa_finalize in XrdCl::File via TNetXNGFile when using a client plug-in

Hello,
I am testing an XRootD client plug-in (XRootD 4.2.2) with ROOT v5.34/34 (gcc 4.9.2).
Everything works as expected if I run a ROOT macro like this: f=new TNetXNGFile("root://xrd-manager:1094://roottest/data/test.root"); f->ls(); delete f;
However, unless I explicitly delete the file, as in: f=new TNetXNGFile("root://xrd-manager:1094://roottest/data/test.root"); f->ls();
I run into a segmentation violation in XrdCl::File::IsOpen() on exit (in __cxa_finalize).
This happens after Close() has been run on my plug-in file, but before its destructor is called:
[code]#4  <signal handler called>
#5  0x00007f235290f5dc in XrdCl::File::IsOpen (this=0xf9faa0) at /tmp/xrootd-4.2.2/XrdClFile.cc:389
#6  0x00007f2352e1089e in TNetXNGFile::~TNetXNGFile() () from /lustre/jknedre/root/5.34.34/lib/root/libNetxNG.so
#7  0x00007f2352e10be9 in TNetXNGFile::~TNetXNGFile() () from /lustre/jknedre/root/5.34.34/lib/root/libNetxNG.so
#8  0x00007f2358e567e5 in TList::Delete(char const*) () from /lustre/jknedle/root/5.34.34/lib/root/libCore.so
#9  0x00007f2358de03aa in TROOT::~TROOT (this=0x7f2359596400 ) at /tmp/root/core/base/src/TROOT.cxx:506
#10 0x00007f235811ceaf in __cxa_finalize (d=0x7f2359592c80) at cxa_finalize.c:56
#11 0x00007f2358dcbf13 in __do_global_dtors_aux () from /lustre/jknedlik/root/5.34.34/lib/root/libCore.so
#12 0x00007fff1b797860 in ?? ()
#13 0x00007f23595dcfca in _dl_fini () at dl-fini.c:252[/code]

Using valgrind (and some effort), I found that the problem is that XrdCl::DefaultEnv is "finalized" (which includes its plug-in manager calling dlclose() on the plug-in shared library) before the file object(s) are deleted.

[code]TNetXNGFile::~TNetXNGFile()
{
   if (IsOpen())
      Close();
   delete fFile;
   delete fUrl;
   delete fInitCondVar;
}

// ~TNetXNGFile() calls:
Bool_t TNetXNGFile::IsOpen() const
{
   return fFile->IsOpen();
}

// which in turn calls:
bool XrdCl::File::IsOpen() const
{
   // With a client plug-in configured, the call is forwarded into the plug-in library
   if( pPlugIn )
      return pPlugIn->IsOpen();

   return pStateHandler->IsOpen();
}[/code]

The subsequent call to the TNetXNGFile destructor therefore ends up calling pPlugIn->IsOpen(), which fails because the plug-in library has already been unloaded.
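To make the ordering problem concrete, here is a small self-contained C++ model. It is not ROOT or XRootD code: PlugInModel, FileModel and TeardownSucceeds are hypothetical names. The "plug-in" refuses to run once it has been marked as unloaded; in the real case the code is simply gone after dlclose(), which shows up as a segfault or "pure virtual method called" rather than a clean exception.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the plug-in implementation. After "unload"
// (dlclose() in the real case) its code must not be called any more.
struct PlugInModel {
    bool unloaded = false;
    bool IsOpen() const {
        if (unloaded)
            throw std::logic_error("plug-in code called after unload");
        return false;  // the file was already closed earlier
    }
};

// Hypothetical stand-in for XrdCl::File: holds a non-owning pointer to the
// plug-in, like pPlugIn, and forwards IsOpen() to it.
struct FileModel {
    const PlugInModel *plugin;
    explicit FileModel(const PlugInModel *p) : plugin(p) {}
    bool IsOpen() const { return plugin->IsOpen(); }
};

// Replays one teardown order and reports whether the IsOpen() check done
// by the file destructor would have succeeded.
bool TeardownSucceeds(bool envFinalizedFirst) {
    PlugInModel plugin;
    FileModel file(&plugin);
    if (envFinalizedFirst)
        plugin.unloaded = true;   // ROOT 5 order: environment finalized first
    try {
        (void)file.IsOpen();      // what ~TNetXNGFile() does via IsOpen()
    } catch (const std::logic_error &) {
        return false;
    }
    plugin.unloaded = true;       // environment finalized afterwards
    return true;
}
```

TeardownSucceeds(false) mimics the order seen in ROOT 6 and returns true; TeardownSucceeds(true) mimics the ROOT 5 order and returns false.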

Valgrind output:
[code]Discarding syms at 0xf1876f0-0xf18f140 in /home/jknedlik/xplug/plug.so due to munmap( ...
pure virtual method called
terminate called without an active exception
Aborted[/code]

Strangely, in ROOT 6 everything works fine, because __cxa_finalize does not finalize DefaultEnv first.

The ROOT 6 valgrind output is otherwise very similar (up to that point, of course).
Unfortunately, my use case requires this to work in ROOT 5.

Is there perhaps an option in ROOT 5 that controls the order in which ROOT garbage-collects objects at exit?

Any help is welcome

Regards
JK

PS: Sorry if this is the wrong place to post this. Feel free to move it or point me to the right place.

https://github.com/xrootd/xrootd/issues/338

Hi,

Are you using the 'root.exe' executable or your own? If you are using your own, creating a TApplication object in your main() might solve the problem.

If you are using root.exe or already have a TApplication object, then we need to investigate a bit further. On your machine, using v6, can you set a breakpoint on the TNetXNGFile destructor to find out where and when it is called?

Thanks,
Philippe.

Hi,

I am already using root.exe, so I set a breakpoint on the destructor. Here is the backtrace with v6:

[code]Breakpoint 1, 0x00007f47abe89ac0 in TNetXNGFile::~TNetXNGFile() () from /lustre/jknedlik/software/root/6.04.14/lib/root/libNetxNG.so
(gdb) bt
#0  0x00007f47abe89ac0 in TNetXNGFile::~TNetXNGFile() () from /lustre/jknedlik/software/root/6.04.14/lib/root/libNetxNG.so
#1  0x00007f47b70dba1d in TList::Delete(char const*) () from /lustre/jknedlik/software/root/6.04.14/lib/root/libCore.so
#2  0x00007f47b706aff2 in TROOT::~TROOT (this=0x7f47b7467400 <ROOT::GetROOT1()::alloc>, __in_chrg=<optimized out>) at /tmp/root-6.04.14/core/base/src/TROOT.cxx:658
#3  0x00007f47b7069074 in at_exit_of_TROOT () at /tmp/root-6.04.14/core/base/src/TROOT.cxx:267
#4  0x00007f47b615bb29 in __run_exit_handlers (status=0, listp=0x7f47b64c95a8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#5  0x00007f47b615bb75 in __GI_exit (status=<optimized out>) at exit.c:104
#6  0x00007f47b7131afd in TUnixSystem::Exit(int, bool) () from /lustre/jknedlik/software/root/6.04.14/lib/root/libCore.so
#7  0x00007f47b70a427f in TApplication::ProcessLine(char const*, bool, int*) () from /lustre/jknedlik/software/root/6.04.14/lib/root/libCore.so
#8  0x00007f47b6c29685 in TRint::ProcessLineNr(char const*, char const*, int*) () from /lustre/jknedlik/software/root/6.04.14/lib/root/libRint.so
#9  0x00007f47b6c298b1 in TRint::HandleTermInput() () from /lustre/jknedlik/software/root/6.04.14/lib/root/libRint.so
#10 0x00007f47b7135d4d in TUnixSystem::CheckDescriptors() () from /lustre/jknedlik/software/root/6.04.14/lib/root/libCore.so
#11 0x00007f47b7136e6a in TUnixSystem::DispatchOneEvent(bool) () from /lustre/jknedlik/software/root/6.04.14/lib/root/libCore.so
#12 0x00007f47b70947f9 in TSystem::InnerLoop (this=0x96a810) at /tmp/root-6.04.14/core/base/src/TSystem.cxx:410
#13 0x00007f47b7094594 in TSystem::Run (this=0x96a810) at /tmp/root-6.04.14/core/base/src/TSystem.cxx:360
#14 0x00007f47b70a1c8f in TApplication::Run(bool) () from /lustre/jknedlik/software/root/6.04.14/lib/root/libCore.so
#15 0x00007f47b6c2ad0b in TRint::Run(bool) () from /lustre/jknedlik/software/root/6.04.14/lib/root/libRint.so
#16 0x0000000000401010 in main ()[/code]

Hi,

Can you do the same for v5? I am trying to understand why it works in v6 but not in v5.

Thanks,
Philippe.

I thought you could see that in my valgrind output, but here is the gdb output for v5:

[code]Breakpoint 1, 0x00007fb3f0728be0 in TNetXNGFile::~TNetXNGFile() ()
from /lustre/nyx/rz/jknedlik/software/root/5.34.34_new/lib/root/libNetxNG.so
(gdb) bt
#0 0x00007fb3f0728be0 in TNetXNGFile::~TNetXNGFile() ()
from /lustre/nyx/rz/jknedlik/software/root/5.34.34_new/lib/root/libNetxNG.so
#1 0x00007fb3f8552eb5 in TList::Delete(char const*) ()
from /lustre/nyx/rz/jknedlik/software/root/5.34.34_new/lib/root/libCore.so
#2 0x00007fb3f84ebe77 in TROOT::~TROOT() ()
from /lustre/nyx/rz/jknedlik/software/root/5.34.34_new/lib/root/libCore.so
#3 0x00007fb3f7874eaf in __cxa_finalize (d=0x7fb3f8c47f40) at cxa_finalize.c:56
#4 0x00007fb3f84cad33 in __do_global_dtors_aux ()
from /lustre/nyx/rz/jknedlik/software/root/5.34.34_new/lib/root/libCore.so
#5 0x00007ffee8beb8b0 in ?? ()
#6 0x00007fb3f8c91fca in _dl_fini () at dl-fini.c:252

[/code]

Hi,

Thanks for pointing out the stack trace again.

One of the differences is that in v6 we introduced an explicit at_exit function to handle the destruction of TROOT. Unfortunately, a straightforward backport of this call did not work out and would need more extensive reworking (in particular of when and how gROOT is used). Such changes are too extensive to be made in v5, which we no longer actively develop (except for critical bugs without a work-around).
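The at_exit approach relies on a standard C++ ordering rule: a function registered with std::atexit after a static object has finished construction is called before that object's destructor. Below is a minimal sketch of "close the remaining files from an exit handler", assuming a hypothetical RemoteFile class and a hypothetical OpenFiles() registry standing in for the list of TFile objects that gROOT keeps. It only models the ordering inside one program; it is not a proposed patch for ROOT 5.

```cpp
#include <cstdlib>
#include <memory>
#include <vector>

// Hypothetical stand-in for a remote file object (not a ROOT class).
struct RemoteFile {
    bool open = true;
    void Close() { open = false; }
};

// Hypothetical registry of files still alive at exit.
std::vector<std::unique_ptr<RemoteFile>> &OpenFiles() {
    static std::vector<std::unique_ptr<RemoteFile>> files;
    return files;
}

RemoteFile *OpenRemoteFile() {
    OpenFiles().push_back(std::make_unique<RemoteFile>());
    return OpenFiles().back().get();
}

// Closes and deletes every registered file. Meant to run as an exit
// handler, i.e. while the registry (and anything constructed before the
// handler was registered) is still alive.
void CloseAllOpenFiles() {
    for (auto &f : OpenFiles())
        if (f->open)
            f->Close();
    OpenFiles().clear();
}

// Construct the registry first, then register the handler, so the handler
// is guaranteed to run before the registry's destructor.
bool InstallExitCleanup() {
    OpenFiles();
    return std::atexit(CloseAllOpenFiles) == 0;
}
```

Across shared libraries the order also depends on when each library's static objects were constructed and when the library is unloaded. The traces in this thread show exactly that difference: in v5, TROOT is destroyed from _dl_fini via __cxa_finalize, whereas in v6 it is destroyed from an exit handler (at_exit_of_TROOT).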

As a work-around, the simple solution is (as you pointed out) to explicitly delete the TFile objects before the end of the process.
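In compiled C++11 code, the explicit-delete work-around can also be expressed as scope-bound ownership. In an interpreted ROOT 5 macro, an explicit delete f; does the same job. ScopedRemoteFile and ListFile are hypothetical names: the class only counts live instances, whereas a real file class would close its connection in the destructor.

```cpp
#include <memory>

// Hypothetical stand-in for a file class; counts live instances so the
// example can show that nothing is left over for exit-time teardown.
struct ScopedRemoteFile {
    static int live;
    ScopedRemoteFile() { ++live; }
    ~ScopedRemoteFile() { --live; }  // a real class would Close() here
    void ls() const {}
};
int ScopedRemoteFile::live = 0;

// The file is owned by a unique_ptr, so it is deleted when this function
// returns rather than during process exit.
int ListFile() {
    auto f = std::make_unique<ScopedRemoteFile>();
    f->ls();
    return ScopedRemoteFile::live;  // still 1: f is alive here
}
```

ListFile() returns 1 because the file is still alive while the return value is computed; once the function has returned, the count is back to 0.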

Cheers,
Philippe.

Note that there is some circumstantial evidence that the success you see in v6 might just be 'luck', and that v6 may technically have the same problem (unloading of the libNetxNG.so library before the deletion of the TNetXNGFile objects).

Hi Philippe,

The comment at https://github.com/xrootd/xrootd/issues/338#issuecomment-194027770
suggested asking you about this: [quote]ROOT garbage collects the file objects after it dlcloses the relevant plugin library. You could ask Philippe (pcanal) to see if he can at least run a round of garbage collection atexit for the TFile objects in global scope, if the full backport is not possible.[/quote]

So that is my question here. :)
Thank you very much for your help, even if this is not possible.

JK