Segmentation fault with ROOT 6.28

Hello,

I was trying to add an integration test case for our bamboo package with the new ROOT 6.28, but the tests fail with a segmentation fault.

Bamboo uses RDataFrame to fill histograms and skims. The segfault happens after the event loop has run to completion and the results have been retrieved and written to a file.

The strange thing is that no stacktrace appears, which makes it quite difficult to debug. All I see is:

 *** Break *** segmentation violation

after which execution just hangs forever (the CI/CD tests don’t even fail, they just time out!)…

The tests in question worked perfectly fine up to now, e.g. with ROOT 6.26.04.

I don’t see anything obvious from the 6.28 release notes that would require changes from our side… Any pointers as to what might be going on would be appreciated!

Best,
Sébastien


LCG 103:
ROOT Version: 6.28.00
Platform: CentOS7
Compiler: GCC 12.1.0


As it’s a problem related to this “bamboo” software, you might ask the developers/maintainers of this software. But as it uses RDataFrame, @eguiraud might have some ideas about it.

I am the maintainer of this software, and I’m asking here because I’m dumbfounded by this :slight_smile:

Actually, after adding the newly available CLING_DEBUG=1, I got more information:

cppyy.ll.SegmentationViolation: Template method resolution failed:
  static unique_ptr<rdfhelpers::PrintProgress,default_delete<rdfhelpers::PrintProgress> > rdfhelpers::PrintProgress::addToNode(ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> df, int printFreq, int nThreads = 1) =>
    SegmentationViolation: segfault in C++; program state was reset
  static unique_ptr<rdfhelpers::PrintProgress,default_delete<rdfhelpers::PrintProgress> > rdfhelpers::PrintProgress::addToNode(ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>* df, int printFreq, int nThreads = 1) =>
    SegmentationViolation: segfault in C++; program state was reset
  static unique_ptr<rdfhelpers::PrintProgress,default_delete<rdfhelpers::PrintProgress> > rdfhelpers::PrintProgress::addToNode(ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> df, int printFreq, int nThreads = 1) =>
    SegmentationViolation: segfault in C++; program state was reset

While it’s still not clear to me why that gets broken all of a sudden, at least I have something to go by…

EDIT: removing the PrintProgress::addToNode call, I get:

cppyy.ll.SegmentationViolation: TH1D& ROOT::RDF::RResultPtr<TH1D>::operator*() =>
    SegmentationViolation: segfault in C++; program state was reset

which is a segfault from the event loop itself… :thinking:

However, is it expected that after the segfault the crashed program doesn’t exit and just hangs forever waiting for a stacktrace?

Hi @swertz ,

sorry for the trouble! Can you please try with v6.28.02? See the “release notes” section :grimacing:

No, that’s for @Axel I guess :confused:

Cheers,
Enrico

Hi @eguiraud ,

I’ve tried with the nightly build (LCG dev3) and I encounter the same issue… :confused:

Mmh I don’t know what this might be, can you please provide a self-contained reproducer?

We’ll try, but it might take a while since it’s a priori not clear at all where it is coming from… I was hoping this was somehow a known issue :smiley:

Unfortunately I’m not aware of any RDF bug that could cause otherwise correct code to crash other than [VecOps] Masking RVec<T> is broken for non-trivially-constructible Ts · Issue #12398 · root-project/root · GitHub , which was introduced at some point in v6.26 and fixed in v6.28.02 .

You could try running a debug build of v6.28.02 with the environment variable CLING_DEBUG=1 set and see whether you get a more helpful stacktrace – other than that, I would need a way to reproduce this to debug.
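
The suggestion above amounts to launching the job with CLING_DEBUG=1 in its environment. Below is a minimal Python sketch of that, which also enforces a timeout so that a hanging job fails quickly instead of stalling the CI. The helper name `run_with_cling_debug` and the stand-in command are hypothetical; only the CLING_DEBUG variable comes from this thread, and the real bamboo test command would have to be substituted.

```python
import os
import subprocess
import sys


def _as_text(data):
    """Normalize captured output, which may be None, bytes or str."""
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_with_cling_debug(cmd, timeout_s=600):
    """Run cmd with CLING_DEBUG=1 set and give up after timeout_s seconds.

    Returns (exit_code, stdout, stderr); exit_code is None on timeout.
    """
    # Copy the current environment and add the debug variable for the child only.
    env = dict(os.environ, CLING_DEBUG="1")
    try:
        proc = subprocess.run(
            cmd, env=env, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the direct child when the timeout expires.
        return None, _as_text(exc.stdout), _as_text(exc.stderr)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in command (an assumption): replace with the actual test invocation.
    demo_cmd = [sys.executable, "-c", "import os; print(os.environ['CLING_DEBUG'])"]
    code, out, err = run_with_cling_debug(demo_cmd, timeout_s=30)
    print("exit code:", code)
    print(out, end="")
```

Note that only the direct child is killed on timeout; any helper process it spawned (such as a debugger) may need to be cleaned up separately.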

Sorry again for the trouble!
Enrico

Indeed, with CLING_DEBUG=1 I do get stacktraces (see above), but whenever I remove the apparent cause, something else pops up.

The segfault actually happens right when the program exits, so I’m wondering if it couldn’t be related to SIGSEGV from Destructor of ROOT::RDF::RNode · Issue #12023 · root-project/root · GitHub … but we don’t have any uses of RInterface, so it must be something else.

Bamboo has the possibility of writing out the generated RDF analysis to a C++ file, compiling it, and running it: when doing that, things work fine. So the issue must come from something that is instantiated with pyROOT or with the jitting…

That’s often caused by gdb becoming unhappy, somehow: we attach gdb to the running process and have it spit out the backtrace of all threads.

When it hangs you can do ps -feH to see how ROOT fires up gdb, and you should be able to run that gdb command by hand and see why gdb’s “bt” (i.e. backtrace) command doesn’t work…
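
To make the `ps -feH` check above easier to repeat, here is a small Python sketch that filters the process list for entries whose command mentions gdb. The function names and the sample process listing are invented for illustration; in particular, the gdb command line in the sample is a placeholder and not how ROOT actually invokes gdb.

```python
import subprocess


def find_gdb_processes(ps_output):
    """Return the lines of `ps -feH` output whose command column mentions gdb."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if fields and "gdb" in fields[-1]:
            matches.append(line)
    return matches


def list_gdb_processes():
    """Run `ps -feH` on the current machine and filter its output."""
    result = subprocess.run(
        ["ps", "-feH"], capture_output=True, text=True, check=True
    )
    return find_gdb_processes(result.stdout)


# Invented sample listing; PIDs and command lines are placeholders.
SAMPLE = """\
UID        PID  PPID  C STIME TTY          TIME CMD
user      1000     1  0 10:00 pts/0    00:00:01 python run_tests.py
user      1001  1000  0 10:01 pts/0    00:00:00   gdb placeholder-arguments
"""

for entry in find_gdb_processes(SAMPLE):
    print(entry)
```

Calling `list_gdb_processes()` while the job hangs would show whether a gdb child exists at all; if the list is empty, gdb was either never started or has already exited.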

What you posted above is the Python-side error; what would be useful is the C++ stacktrace. But I guess it gets swallowed by PyROOT :confused: terrible.

I was actually not aware of SIGSEGV from Destructor of ROOT::RDF::RNode · Issue #12023 · root-project/root · GitHub , thanks! Hard to say for sure but the failure mode is indeed similar.

Mmmh I will try to take a look at that issue if I find the time, but before CHEP it might be difficult, sorry. Not sure how else to proceed.

Thanks @eguiraud , @oguz.guzel and I will keep digging in the meanwhile…

Thanks for the suggestion! Unfortunately I don’t see anything from the ps output, apart from the original bamboo command…
