Program crashes in cling only when dictionaries are built

Hello,

I have bumped into a rather strange problem that I am unable to debug any further than I’ve tried so far…

The situation is as follows. The software I’m working with uses ROOT for

  1. serialization for backup/restore functionality
  2. the C++ notebooks

We generate, compile and link the dictionaries in one shared library (libbiodynamo.so), which contains all the core code of the software as well.

Then the problem occurs when we do the following in either a notebook, or in cling:

root [0]: .L libbiodynamo.so
root [1]: <macro with some calls into our library>
 *** Break *** segmentation violation
...

Now the stack trace differs depending on which calls we make. However, if there is a situation that causes a segfault, it segfaults deterministically.

The weird part is that when we compile the same program (with gcc) and then run it, there is no segfault (there are also no memory leaks / errors according to valgrind).

Even weirder: when we do not compile the dictionaries into libbiodynamo.so, there is also no segfault anymore in the interpreted execution… It seems that somehow the presence of the dictionaries in libbiodynamo.so is causing our program to crash in the interpreter.

A minimalistic reproducer unfortunately still is based on calls into our library, and I have been unable to isolate it further. I was hoping that someone in this forum recognizes this general issue and could give me tips on how to further debug this. Is there a way to use gdb in combination with the interpreter?

Cheers,
Ahmad


ROOT Version: v6.18.04
Platform: Ubuntu 16.04
Compiler: GCC 5.3


One possible next step is to take the same workflow but load/compile the macro with ACLiC (it might require some tweaks to the macro file to make it compilable). If the problem persists, valgrind will be able to help. If the problem does not persist with ACLiC, I would still run with valgrind to check whether ‘undefined’ behavior is present and just happens not to crash. If the problem does not show up with valgrind and ACLiC, I would turn to executing the macro line by line (and/or commenting out more and more of the macro) until I pinpoint the line(s) that provoke the crash.

Thanks for your suggestions.

I am able to run the macro in ACLiC-compiled mode, and the problem still persists. I found that a call to dynamic_cast that tries to cast a base class pointer to one of its derived types is the entry point of all this horror. The target class of the dynamic_cast inherits from two classes, by the way.

Do you know what the relation is between dynamic_cast and the ROOT dictionaries? I have created a situation where the stack trace leads to the Streamer function of the derived type upon calling dynamic_cast. This stack trace only occurs inside the interpreter. The ACLiC-compiled binary crashes on the dynamic_cast invocation. Are dictionaries in any way used for RTTI when dealing with C++ polymorphism? If so, how?

In gdb I noticed that dynamically casting from Base -> Derived ended up with the vtable pointer being null in some cases, which I believe is the reason for my segfaults.

I tried recreating a simple version of the classes to reproduce the problem, but so far I had no luck. Any more tips?

A ClassDef section and a dictionary are required only for classes inheriting from TObject. If this is not the case, the dictionary does not directly affect the virtual table.

Were you able to run valgrind on the failing case? (and when compiling with ACLiC in debug mode: .L filename.C+g)?

In gdb I noticed that dynamically casting from Base → Derived ended up with the vtable pointer being null in some cases,

This usually indicates that the memory being used has already been freed or has not been allocated yet or is somehow mis-aligned.

What are the 2 (sets of) classes your class inherits from? Where do the pointers you use for the dynamic_cast come from (i.e. where do you get the value or allocate it)?

This is intriguing. I think to make progress we’ll need the full reproducer. Ahmad, could we meet on Friday morning to debug this in my office?
Axel.

Were you able to run valgrind on the failing case? (and when compiling with ACLiC in debug mode: .L filename.C+g )?

Yes, but unfortunately didn’t get any wiser. Valgrind just told me the following:

==21978==    at 0x4EBFDEC: single_neuron_mode() (single_neuron_mode.C:99)
==21978==    by 0x4006BE: main (in /home/ahmad/bdm-paper-examples/bdm/pyramidal-cell/src/main)
==21978==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==21978== 
{
   <insert_a_suppression_name_here>
   Memcheck:Addr8
   fun:_Z18single_neuron_modev
   fun:main
}

 *** Break *** segmentation violation

On single_neuron_mode.C:99 I am doing the dynamic_cast call. There was no more valgrind output besides the above unfortunately.

What are the 2 (sets of) classes your class inherits from?

Those are two polymorphic classes of our own (i.e. no TObject derivations).

Where do the pointers you use for the dynamic_cast come from (i.e. where do you get the value or allocate it)?

The pointer we use for the dynamic_cast is defined in the C macro, but the corresponding object that it points to is allocated by our shared library (libbiodynamo.so). Essentially we do the following:

void single_neuron_mode() {
  auto *soma = new NeuronSoma();

  NeuriteElement neurite;
  SimObject *so = neurite.GetCopy();  // Returns a `new NeuriteElement()`

  Event event;

  // Here we do a `NeuriteElement* neurite = dynamic_cast<NeuriteElement*>(so);`
  // This call corrupts `so` somehow (_vptr SimObject becomes 0x0)
  soma->EventHandler(event, so);

  auto *casted = dynamic_cast<NeuriteElement *>(so);  // SEGFAULT
  std::cout << casted->GetUid() << std::endl;
  delete soma;
  delete so;
}

could we meet on Friday morning to debug this in my office?

That would be great Axel. I will contact you by e-mail.

Ahmad

The valgrind output seems to indicate that the value of so is 0x8, which indeed is not valid.
Can you print so both after the call to GetCopy and before and after the call to EventHandler?
Also what is the signature of EventHandler?

It indeed seems that 0x8 is the address of so, but when I print out so it is a normal address value (same before and after the call).

This is the signature:

void EventHandler(const Event& event, SimObject* other1, SimObject* other2 = nullptr)

The implementation is stripped down to just the following:

  std::cout << "other1  " << other1 << std::endl;
  std::cout << "nuid " << other1->GetUid() << std::endl;
  NeuriteElement* neurite = dynamic_cast<NeuriteElement*>(other1);
  std::cout << "neurite " << neurite << std::endl;
  std::cout << "nuid " << neurite->GetUid() << std::endl;

A new insight: Removing SimObject from the dictionary results in no more segfault. Attached I have the original dictionary libbiodynamo-dict.cc and the one without SimObject libbiodynamo_dict_filtered.cc.

I didn’t notice anything out of the ordinary when I do a diff between the two, but maybe you see something is off?

libbiodynamo_dict.cc (532.7 KB) libbiodynamo_dict_filtered.cc (529.1 KB)

Given the EventHandler signature, it cannot change the value of other1; the only thing that could go wrong is that it is deleted. With the printout (especially the "nuid " << other1->GetUid() part), can you re-run valgrind?

Is the SimObject part of the Event? Does EventHandler do any I/O on the Event?

A new insight: Removing SimObject from the dictionary results in no more segfault.

The main thing that this does is that, without the dictionary, the I/O would generate and use an emulation for the classes upon reading objects of that type. (So going without the dictionary should lead to more problems) …

Given the EventHandler signature it can not change the value of other1

Why not? Even though we do not do anything with other1, it’s not a const pointer.

Here is the more verbose output when I run it through valgrind:

[Before lib call] so = 0x28f0b000
[Before lib call] so->GetUid() = 1
[In lib call, before cast] other1 = 0x28f0b000
[In lib call, before cast] other1->GetUid() = 1
[In lib call, after cast] other1 = 0x28f0b000
[In lib call, after cast] other1->GetUid() = 1
[In lib call, after cast] neurite->GetUid() = 686862336
[After lib call] so = 0x28f0b000
[After lib call] so->GetUid() = 0
==19906== Invalid read of size 8
==19906==    at 0x5FB79B0: __dynamic_cast (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==19906==    by 0x4EBD2D1: single_neuron_mode() (single_neuron_mode.C:90)
==19906==    by 0x4006AE: main (in /home/ahmad/bdm-paper-examples/bdm/pyramidal-cell/src/main)
==19906==  Address 0xfffffffffffffff0 is not stack'd, malloc'd or (recently) free'd
==19906== 
{
   <insert_a_suppression_name_here>
   Memcheck:Addr8
   fun:__dynamic_cast
   fun:_Z18single_neuron_modev
   fun:main
}

As you can see, the so pointer points to the same address the whole time.

The pointer itself is passed by value. So the numerical value of the pointer cannot be changed by the function … however, the object’s content can be changed (up to deletion).

One more test, does the dynamic cast ‘work’ if done before the call to EventHandler?

So after a long debugging session with a colleague of mine, we pinpointed the exact cause of the problem. In retrospect, this should have been the first place we should have looked at, but it didn’t occur to us, since ‘it always worked before’.

In our code we have wrapped the ClassDef macro and its variants in our own macros. We generate a ‘dummy dictionary implementation’ ourselves (e.g. static TClass* Class() { return nullptr; }) when users of our software wish to compile libbiodynamo.so without dictionaries. We do this to handle the following case more gracefully: a user compiles libbiodynamo.so without dictionaries (and loses the backup/restore functionality), but later on builds a binary (linked with libbiodynamo.so) that uses some backup/restore functionality. In that case our dummy implementation throws a fatal error message saying that both libbiodynamo.so and the binary must have dictionaries enabled for these features to work.

Anyways, whether or not the compiler would use the rootcling-generated ClassDef implementations, or our dummy version, was decided by whether or not a certain macro (USE_DICT) was defined (in our case, this was set in CMake). However, cling and ACLiC did not see USE_DICT and would fall back on our dummy implementations. This messed up the virtual table pointer of one of the derived classes.

I have created a minimal reproducer [1]. In the end it was a matter of also defining a macro before we include any of our own headers into cling (see [1] for details). But it would be nice if Cling would have told us that memory-layout of a class as defined in libbiodynamo.so differs from what it could infer from including our headers.

Thanks @pcanal and @Axel for your help in debugging this! @Axel, let me know if the reproducer is clear enough in explaining the issue for us to dismiss our meeting tomorrow :wink:

Cheers,
Ahmad

[1] https://github.com/Senui/cling_dict_issue

Just to add one more thing: since we create our own rootlogon.C, we could just add #define USE_DICT in there, and then cling would pick it up automatically. Unfortunately, there is no convenient solution yet for the ROOT Notebooks, since rootlogon.C is not picked up there… This creates another ‘boilerplate’ line of code for our users (see rootlogon.C not picked up by Jupyter C++ Notebooks)

You may want to investigate the use of ClassDefInline (available since v6.12). This macro makes classes using a ClassDef fully functional (some cases of template classes using the ROOT typedefs are not supported) even without a dictionary … To get I/O to work properly, without the dictionary, you still need to find an alternative way to inject the header information (which is now essential).

However, cling and ACLiC did not see USE_DICT and would fall back on our dummy implementations. This messed up the virtual table pointer of one of the derived classes

Indeed, this breaks the “one definition rule” :frowning:

But it would be nice if Cling would have told us that memory-layout of a class as defined in libbiodynamo.so differs from what it could infer from including our headers.

That is actually very challenging (to impossible) to implement. This would require using/understanding the debug symbols (if available) to infer the layout used by the compiled code. [And even if it worked, it would likely prove to be a costly check at run-time.]

To have Cling (and thus ACLiC) always use the USE_DICT macro, you can place the following somewhere ‘central’:

#if defined(__CLING__) && !defined(USE_DICT)
#define USE_DICT
#endif

Cheers,
Philippe.

Well done, Ahmad and colleague - that’s a very impressive debug session you had!

I agree that there’s no point to meet tomorrow. Enjoy the extra hour in your day! :slight_smile:

We did check for consistency of class sizes in the old CINT days - but here we’re talking about different vtable entries, and that’s just not something I find reasonable to verify (let alone would know how to do). Please make sure you have a look at Philippe’s hint on the ClassDefInline!

Enjoy your holiday, and don’t hesitate to contact us again if you find another issue!

Axel.

It’s always nice to find your own question when you Google for more info on a matter; in this case ClassDefInline: How to use ClassDefInline? - #7 by ahesam :wink:

The documentation on ClassDefInline is still a bit scarce. I can appreciate the fact that no dictionaries need to be generated (this would reduce the compile time of our library significantly).

To get I/O to work properly, without the dictionary, you still need to find an alternative way to inject the header information (which is now essential).

Could you explain a bit more what you mean by this? What alternative ways would you suggest? If I were to write a non-ROOT object to file using ROOT I/O, how would I, for example, be able to read this file in the interpreter / notebook? Would including the header (which uses ClassDefInline) be enough? At the moment I need to load the dictionary shared object first and then I can properly read back.

In earlier release notes [1] it was mentioned that it’s especially useful for “scripts and other non-framework code”. Is this still the case? My intention would be to replace the regular ClassDef in our codebase with ClassDefInline to get rid of generating dictionaries. What would be the effect on the performance? Would every I/O call incur an extra penalty compared to having precompiled dictionaries? Are there any studies or benchmarks done on this matter?

To have Cling (and thus ACLiC) always use the USE_DICT macro, you place somewhere ‘central’:

This would be a convenient solution if it weren’t for the case where our library is compiled with USE_DICT off and users start loading it into cling where we would now force USE_DICT to be set and get again mismatches in memory layouts.

@Axel, @pcanal, the point on checking vtable entries in cling sounds indeed like something close to impossible. Although my point was not to have this check ‘on by default’, but to enable this in a kind of ‘cling debug mode’. Anyways, this is probably not a highly reoccurring issue I can imagine :wink:

I wish you both great holidays too!

Cheers,
Ahmad

[1] ROOT v6.10/00 has been released!

The 2 major performance differences are:

  • To get information about the class, ROOT/Cling needs to parse the header at run-time (or load the pcm); this is a ‘one time per process per header’ cost.
  • To create (and delete) objects in the I/O we need to pass through an interpreted function instead of a compiled function. There is both a one-time cost to ‘load’ the function and a per-allocation-of-those-objects-by-the-IO cost (because interpreted functions are not as optimized).

[At the moment the dictionary generator does not handle ClassDefInline, so you cannot generate a dictionary for those classes (to get autoloading, autoparsing and accelerator functions), but in principle this could be added.]

Would including the header (which uses ClassDefInline) be enough?

Yes (in addition to the library that implements those classes, of course), but since there is no dictionary, there wouldn’t be any autoloading of the library nor any automatic loading/parsing of the headers.
