"Trying to access the contents of a null RResultPtr" with GCC 11

After upgrading to GCC 11.1, and rebuilding ROOT 6.24 with the new compiler, I’m getting a runtime error due to “Trying to access the contents of a null RResultPtr” when calling GetValue() on the result of a call to Book.

This code works on the same file in CI (using LCG 98), and worked (ish, I made a few changes but not to this section of the code) with GCC 10.2, so I think this is an incompatibility with GCC 11.

Hi @beojan ,
it is fairly unlikely that this is a compiler issue, I added some more nullptr checks in RResultPtr lately so it might be that RDF is now warning about a pre-existing problem that previously went unnoticed.

You can check e.g. with gdb why that RResultPtr is null, or if you can share a self-contained reproducer I can take a look.

Cheers,
Enrico

It’s hard to provide a reproducer when I can’t reproduce it on CI. I got a backtrace from gdb but that just told me what I already knew, it’s the Book call that’s at fault. I can try again setting breakpoints and such (I went down a bit of a rabbit hole trying to get rr (Record and Replay) to work so I could attach the recording).

Tracing with log statements tells me the object I pass to the Book call gets constructed (which I do inline), but it never gets Initialized.

That makes sense because Initialize would be called at the beginning of the event loop, while the error is reported before even starting the event loop. The error means that you are dereferencing/accessing an empty RResultPtr – it could be a use-after-move, or given the hard-to-reproduce nature of the problem maybe a use-after-delete. Running the repro under valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp might dig up something interesting.

Running within gdb you don’t even need to set breakpoints, if an exception is thrown gdb will break there and you can check why the RResultPtr is null.

Cheers,
Enrico

Wouldn’t RResultPtrs always be null until the event loop is run to create the contents?

Not in a way that’s observable by the users, as they get “filled” as you dereference them.

Users would see they are null only if they are constructed using the default constructor and never assigned to or they are moved from.

In my case, this Book call writes a TTree to an output file, and returns a pointer to the tree. Before the event loop is run no tree exists (it’s created in the Finalize function using TTree::MergeTrees for trees filled in memory in the various workers) so I was returning a nullptr up to that point.

If I understand correctly, the check for null is done before the event loop is run, so it’s using this nullptr.

How do you tell the RResultPtr to “update” its contents from a nullptr to the actual tree object produced? If I remember correctly, the RResultPtr internal pointer-to-result is only assigned once, before running the event loop, at the moment the RResultPtr object is constructed.

I think what was happening was that in 6.22 and below, GetResultPtr was called after the event loop was run, so it returned a non-null shared pointer. In 6.24, it’s called before the event loop is run so it now gets a null pointer.

Still investigating though, and I haven’t looked into the ROOT code to check the above.

What should have changed is not that (GetResultPtr was always called before the event loop), but recently I introduced more nullptr checks, so it’s possible the problem went unnoticed.

Let me know if I can help with debugging.

I now see that the docs are not clear about the sequencing of these calls, I will fix that.

That makes sense, because I never actually used the value returned by GetValue (I just needed to trigger the event loop).

Ok that’s likely it then. Uhm, tricky.

A quick and dirty workaround is calling GetPtr() to trigger the event loop instead of GetValue or operator*: GetPtr does not do the nullptr check because it does not need to dereference the pointer, it just returns whatever it is.

To properly fix the Book helper, if you never need use the result, you can just return a pointer to a boolean that you set to true in Finalize.

I’ll think whether we can do something on our end without breaking backwards compatibility (for sure we need to mention in the Book docs that GetResultPtr is called before the event loop, at RResultPtr construction time).

P.S.
in older ROOT versions in which the nullptr dereference went unnoticed, valgrind should print an error

I decided to fix it by creating the TTree early and using tree->Merge instead of TTree::MergeTrees.

I then ran into a couple more problems.

  1. GCC 11 switched to DWARF 5 debugging info as the default. I think the backtrace printing in ROOT might not be compatible, since I just get a lot of ?? while running the same command with valgrind (updated to a version that supports DWARF 5) prints the symbol names.
  2. There seems to be some incompatibility with cling as included in ROOT 6.24 which causes a segfault when the JIT compiler is triggered. I’ve just spent a couple of hours trying to eliminate every use of JITting but I think this might be a fool’s errand.

Mmmh sorry for the trouble, indeed there are issues with the very recent gcc 11 that we still haven’t had the time to fix, e.g. Warnings compiling with g++-11.1.0 · Issue #8129 · root-project/root · GitHub .
I was not aware of problems with the backtrace printing or jitting, however. Just to double-check, is your ROOT installation built with gcc 11? Pinging @Axel in case he knows more.

About eliminating jitting: you can gauge the amount of jitted code you have by adding the following line before the creation of RDataFrame – the super verbose logging level kDebug + 10 tells RDF to dump the code it will jit:

#include <ROOT/RLogger.hxx>
// ...
auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kDebug + 10);

Let me know if I can help in any way.
Cheers,
Enrico

EDIT: now fixed the debug level from kInfo to kDebug + 10

I rebuilt ROOT while investigating the original null RResultPtr.

1 Like

Here’s the relevant (I think) bit of valgrind output. It seems to crash before the log gets printed so I don’t get the list of code to be JITted:

==912271== Conditional jump or move depends on uninitialised value(s)
==912271==    at 0x2A4F469E: ???
==912271==    by 0x2A4F308E: ???
==912271==    by 0x9F82E9D: cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const (in /opt/root/lib/libCling.so)
==912271==    by 0x9F0AEBA: cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) (in /opt/root/lib/libCling.so)
==912271==    by 0x9F0C62B: cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) (in /opt/root/lib/libCling.so)
==912271==    by 0x9E41A02: TCling::Calc(char const*, TInterpreter::EErrorCode*) (in /opt/root/lib/libCling.so)
==912271==    by 0x5ABCF79: ROOT::Internal::RDF::InterpreterCalc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x5AC9FED: ROOT::Detail::RDF::RLoopManager::Jit() (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x5AD199D: ROOT::Detail::RDF::RLoopManager::Run() (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x1BF737: TriggerRun (RResultPtr.hxx:350)
==912271==    by 0x1BF737: ROOT::RDF::RResultPtr<ROOT::RDF::RCutFlowReport>::Get() (RResultPtr.hxx:142)
==912271==    by 0x151DF1: operator-> (RResultPtr.hxx:205)
==912271==    by 0x151DF1: main (resolved-recon.cpp:1132)
==912271==
==912271== Use of uninitialised value of size 8
==912271==    at 0x2A4F5219: ???
==912271==    by 0x2A4F3DC5: ???
==912271==    by 0x2A4F46BA: ???
==912271==    by 0x2A4F308E: ???
==912271==    by 0x9F82E9D: cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const (in /opt/root/lib/libCling.so)
==912271==    by 0x9F0AEBA: cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) (in /opt/root/lib/libCling.so)
==912271==    by 0x9F0C62B: cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) (in /opt/root/lib/libCling.so)
==912271==    by 0x9E41A02: TCling::Calc(char const*, TInterpreter::EErrorCode*) (in /opt/root/lib/libCling.so)
==912271==    by 0x5ABCF79: ROOT::Internal::RDF::InterpreterCalc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x5AC9FED: ROOT::Detail::RDF::RLoopManager::Jit() (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x5AD199D: ROOT::Detail::RDF::RLoopManager::Run() (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x1BF737: TriggerRun (RResultPtr.hxx:350)
==912271==    by 0x1BF737: ROOT::RDF::RResultPtr<ROOT::RDF::RCutFlowReport>::Get() (RResultPtr.hxx:142)
==912271==
==912271== Invalid read of size 8
==912271==    at 0x2A4F5219: ???
==912271==    by 0x2A4F3DC5: ???
==912271==    by 0x2A4F46BA: ???
==912271==    by 0x2A4F308E: ???
==912271==    by 0x9F82E9D: cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const (in /opt/root/lib/libCling.so)
==912271==    by 0x9F0AEBA: cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) (in /opt/root/lib/libCling.so)
==912271==    by 0x9F0C62B: cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) (in /opt/root/lib/libCling.so)
==912271==    by 0x9E41A02: TCling::Calc(char const*, TInterpreter::EErrorCode*) (in /opt/root/lib/libCling.so)
==912271==    by 0x5ABCF79: ROOT::Internal::RDF::InterpreterCalc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x5AC9FED: ROOT::Detail::RDF::RLoopManager::Jit() (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x5AD199D: ROOT::Detail::RDF::RLoopManager::Run() (in /opt/root/lib/libROOTDataFrame.so)
==912271==    by 0x1BF737: TriggerRun (RResultPtr.hxx:350)
==912271==    by 0x1BF737: ROOT::RDF::RResultPtr<ROOT::RDF::RCutFlowReport>::Get() (RResultPtr.hxx:142)
==912271==  Address 0x5 is not stack'd, malloc'd or (recently) free'd
==912271==

 *** Break *** segmentation violation

I’m printing a cutflow report to trigger the event loop earlier than it otherwise would be, in order to figure out what’s triggering the crash.

It’s weird that you don’t get the log printed: there is a line at the very beginning of ROOT::Internal::RDF::InterpreterCalc, inside which the crash happens, that dumps the jitted code. Are you using log level kDebug + 10?

The invalid read (and the crash) happens inside just-in-time-compiled code, which does not have debug symbols, hence the ???.

This is nasty. I’ll spin up a docker container with Fedora 34 to try and reproduce the issue. If you have a small snippet that should reproduce the crash feel free to share it.

Cheers,
Enrico

I managed to fix this crash by replacing all my Sum calls with Sum<double> (which is appropriate for the weight variable I’m summing). I suspect a simple reproducer that uses Sum to sum some double would work.

EDIT:

Are you using log level kDebug + 10?

That was the issue with the log. I had kInfo before and changing it to kDebug + 10 tells me it’s the Sum call (that I changed back to the JIT version) that was being JITted.

1 Like

We see sizeof inconsistencies between JIT and compiled code, hinting at ABI issues between JIT and GCC11. Scary. This cannot be connected to DWARF5 of course - maybe the triplet isn’t guessed correctly. Anyway, a broken ABI would certainly explain crashes in the compiled/JIT boundary. I did not find the time to debug yet…

Ok jitting is currently broken with gcc11, this crashes:

#include <ROOT/RDataFrame.hxx>

int main() {
  ROOT::RDataFrame(10)
      .Define("x", [] { return 42; })
      .Sum("x")
      .GetValue();

  return 0;
}

We’ll try to solve the issue as soon as possible. I can’t think of any other workaround besides avoiding jitting or changing compiler :confused: