Segmentation fault with ROOT 6.28

swertz · April 25, 2023, 7:31am

Hello,

I was trying to add an integration test case for our bamboo package with the new ROOT 6.28, but the tests fail with a segmentation fault.

Bamboo uses RDataFrame to fill histograms and skims. The segfault happens after the event loop has run to completion and the results have been retrieved and written to a file.

The strange thing is that no stacktrace appears, which makes it quite difficult to debug. All I see is:

 *** Break *** segmentation violation

after which execution just hangs forever (the CI/CD tests don’t even fail, they just time out!)…

The tests in question worked perfectly fine up to now, e.g. with ROOT 6.26.04.

I don’t see anything obvious from the 6.28 release notes that would require changes from our side… Any pointers as to what might be going on would be appreciated!

Best,
Sébastien

LCG 103:
ROOT Version: 6.28.00
Platform: CentOS7
Compiler: GCC 12.1.0

couet · April 25, 2023, 7:46am

As it’s a problem related to this “bamboo” software, you might ask the developers/maintainers of this software. But as it uses RDataFrame, @eguiraud might have some ideas about it.

swertz · April 25, 2023, 7:49am

I am the maintainer of this software, and I’m asking here because I’m dumbfounded by this.

swertz · April 25, 2023, 7:57am

Actually, after adding the newly available CLING_DEBUG=1, I got more information:

cppyy.ll.SegmentationViolation: Template method resolution failed:
  static unique_ptr<rdfhelpers::PrintProgress,default_delete<rdfhelpers::PrintProgress> > rdfhelpers::PrintProgress::addToNode(ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> df, int printFreq, int nThreads = 1) =>
    SegmentationViolation: segfault in C++; program state was reset
  static unique_ptr<rdfhelpers::PrintProgress,default_delete<rdfhelpers::PrintProgress> > rdfhelpers::PrintProgress::addToNode(ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>* df, int printFreq, int nThreads = 1) =>
    SegmentationViolation: segfault in C++; program state was reset
  static unique_ptr<rdfhelpers::PrintProgress,default_delete<rdfhelpers::PrintProgress> > rdfhelpers::PrintProgress::addToNode(ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> df, int printFreq, int nThreads = 1) =>
    SegmentationViolation: segfault in C++; program state was reset

While it’s still not clear to me why that gets broken all of a sudden, at least I have something to go by…

EDIT: removing the PrintProgress::addToNode call, I get:

cppyy.ll.SegmentationViolation: TH1D& ROOT::RDF::RResultPtr<TH1D>::operator*() =>
    SegmentationViolation: segfault in C++; program state was reset

which is a segfault from the event loop itself…

However, is it expected that after the segfault the crashed program doesn’t exit and just hangs forever waiting for a stacktrace?
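
Independently of the root cause, a hang like this can be turned into a visible test failure by running the test command under a hard timeout. The following is a minimal sketch, assuming a POSIX system; the function name `run_with_timeout` is hypothetical and not part of bamboo or ROOT:

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it was killed after timeout_s seconds."""
    # start_new_session puts the child in its own process group, so any helper
    # processes it spawns (for example a debugger) can be killed along with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # The child is the leader of its new group, so its pid is the group id.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

A `None` result then means "hung and killed", while a negative exit code means the child was terminated by a signal (for example a segmentation fault), so both cases can be reported as failures rather than as a CI timeout.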

eguiraud · April 25, 2023, 2:04pm

Hi @swertz ,

sorry for the trouble! Can you please try with v6.28.02? See the “release notes” section

No, that’s for @Axel I guess

Cheers,
Enrico

swertz · April 25, 2023, 5:13pm

Hi @eguiraud ,

I’ve tried with the nightly build (LCG dev3) and I encounter the same issue…

eguiraud · April 25, 2023, 5:34pm

Mmh I don’t know what this might be, can you please provide a self-contained reproducer?

swertz · April 25, 2023, 7:47pm

We’ll try, but it might take a while since it’s a priori not clear at all where it is coming from… I was hoping this was somehow a known issue

eguiraud · April 25, 2023, 8:31pm

Unfortunately I’m not aware of any RDF bug that could cause otherwise correct code to crash other than [VecOps] Masking RVec<T> is broken for non-trivially-constructible Ts · Issue #12398 · root-project/root · GitHub, which was introduced at some point in v6.26 and fixed in v6.28.02.

You could try running a debug build of v6.28.02 with the environment variable CLING_DEBUG=1 set and see whether you get a more helpful stacktrace – other than that, I would need a way to reproduce this to debug.

Sorry again for the trouble!
Enrico
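
For a test that is launched from a Python driver, the environment variable can be set for the child process only, without touching the parent environment. This is a sketch under that assumption; `run_with_cling_debug` is a hypothetical helper name, and only the variable name `CLING_DEBUG` comes from this thread:

```python
import os
import subprocess


def run_with_cling_debug(cmd):
    """Run cmd with CLING_DEBUG=1 set in a copy of the current environment."""
    env = dict(os.environ)    # copy, so the parent process is left unchanged
    env["CLING_DEBUG"] = "1"  # variable name as given in the thread
    # Output is not captured, so any stacktrace goes straight to the console or CI log.
    return subprocess.run(cmd, env=env).returncode
```

The command passed in would be whatever normally starts the failing test; nothing here depends on ROOT being importable in the driver itself.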

swertz · April 26, 2023, 2:03pm

Indeed, with CLING_DEBUG=1 I do get stacktraces (see above), but whenever I remove the apparent cause, something else pops up.

The segfault actually happens right when the program exits, so I’m wondering if it couldn’t be related to SIGSEGV from Destructor of ROOT::RDF::RNode · Issue #12023 · root-project/root · GitHub … but we don’t have any uses of RInterface, so it must be something else.

Bamboo has the possibility of writing out the generated RDF analysis to a C++ file, compiling it, and running it: when doing that, things work fine. So the issue must come from something that is instantiated with PyROOT or with the jitting…

Axel · April 26, 2023, 2:56pm

That’s often caused by gdb becoming unhappy, somehow: we attach gdb to the running process and have it spit out the backtrace of all threads.

When it hangs you can do ps -feH to see how ROOT fires up gdb, and you should be able to run it by hand and see why gdb’s “bt” (i.e. backtrace) doesn’t work…
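
The same check can be scripted, for instance to log the process tree from a CI job while the test is hanging. A minimal sketch, assuming a Linux system where `ps -feH` is available; both function names are hypothetical:

```python
import subprocess


def gdb_lines(ps_output):
    """Return the lines of a ps listing that mention gdb."""
    return [line for line in ps_output.splitlines() if "gdb" in line]


def find_gdb_processes():
    """Run `ps -feH` and keep only the lines that mention gdb."""
    result = subprocess.run(
        ["ps", "-feH"], capture_output=True, text=True, check=True
    )
    return gdb_lines(result.stdout)
```

An empty list from `find_gdb_processes()` while the program hangs would suggest that no gdb process is running at that moment.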

eguiraud · April 26, 2023, 2:58pm

What you posted above is the Python-side error; what would be useful is the C++ stacktrace. But I guess it gets swallowed by PyROOT.

I was actually not aware of SIGSEGV from Destructor of ROOT::RDF::RNode · Issue #12023 · root-project/root · GitHub, thanks! Hard to say for sure but the failure mode is indeed similar.

Mmmh I will try to take a look at that issue if I find the time, but before CHEP it might be difficult, sorry. Not sure how else to proceed.

swertz · April 26, 2023, 3:08pm

Thanks @eguiraud, @oguz.guzel and I will keep digging in the meantime…

Thanks for the suggestion! Unfortunately I don’t see anything from the ps output, apart from the original bamboo command…

system · May 10, 2023, 3:08pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.