Irreproducible crash with RDataFrames

Hi,

I am finding an irreproducible crash when using RDataFrames. The code can be found at
https://gitlab.cern.ch/lwilkins/ntuplesklimmer/blob/master/ntuplesklimmer.py.

When I run the code, it sometimes crashes with a log similar to the one attached. If I try again with exactly the same arguments, it runs fine. It can then run many more times with new arguments (different input files and configs) without problems, and then suddenly crash again. This has happened to 3 different people on 3 different machines.

Does anyone have an idea from the crash log what the issue could be? My gut feeling is that it is an issue with multi-threading, whereas others think it could be PyROOT.

All ideas welcome!

Lewis
ntuplesklimmer_crashlog.txt (136.7 KB)


ROOT Version: 6.16
Platform: Not Provided
Compiler: Not Provided


Hi,
there is a simple way to check if it’s a race condition (i.e. whether multi-threading is the culprit): does the code crash if you turn off multi-threading?

Looking at the stacktrace, however, there seems to be a problem in an invocation to Snapshot:

  File "ntuplesklimmer.py", line 16659, in <module>                                                                     
    df.Snapshot(options.tree_name, options.output_file_name, r_branch_list)                                             
TypeError: none of the 3 overloaded methods succeeded. Full details:                                                    
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Snapshot(experimental::basic_string_view<char,char_traits<char> >
    problem in C++; program state has been reset

Maybe check that the arguments are correct and that you can invoke Snapshot with the same or similar arguments in isolation…

Hope this helps!
Enrico

Taking a closer look, Python reports a problem within the Snapshot call, but that is also the call that runs the event loop, so the problem may well be a crash in the event loop and not in the invocation of Snapshot itself.

It would be great if you could confirm that the crash appears only if ROOT.ROOT.EnableImplicitMT has been invoked (i.e. it’s a threading issue).
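A minimal sketch of how such a check could be wired into a script, assuming a hypothetical `--no-mt` command-line flag and hypothetical helper names (`build_parser`, `configure_implicit_mt`); the actual script may organise its options differently. The only ROOT calls used are `ROOT.ROOT.EnableImplicitMT`, `ROOT.ROOT.DisableImplicitMT` and `ROOT.ROOT.IsImplicitMTEnabled`. The ROOT module is passed in as an argument so the helper itself does not depend on how ROOT is imported.

```python
import argparse


def build_parser():
    """Command-line options for the check (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Toggle ROOT implicit multi-threading")
    parser.add_argument("--no-mt", action="store_true",
                        help="run the event loop single-threaded")
    parser.add_argument("--threads", type=int, default=0,
                        help="number of threads; 0 lets ROOT choose")
    return parser


def configure_implicit_mt(root, no_mt, threads=0):
    """Enable or disable implicit MT on the given ROOT module.

    Must be called before any RDataFrame is constructed.
    Returns True if implicit MT is enabled afterwards.
    """
    if no_mt:
        if root.ROOT.IsImplicitMTEnabled():
            root.ROOT.DisableImplicitMT()
    else:
        root.ROOT.EnableImplicitMT(threads)
    return bool(root.ROOT.IsImplicitMTEnabled())


def main(argv=None):
    # Imported here so the helpers above can be read and tested without ROOT.
    import ROOT

    args = build_parser().parse_args(argv)
    mt_enabled = configure_implicit_mt(ROOT, args.no_mt, args.threads)
    print("Implicit MT enabled:", mt_enabled)
    # ... build the RDataFrame, apply Filters and call Snapshot here ...
```

Running the same input files and configs once with `--no-mt` and once without keeps everything else identical, so any difference in crash behaviour can be attributed to multi-threading.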

I’m looking at the stacktrace and I don’t see anything obvious, so debugging this might require a bit more work on your part: a standalone reproducer, preferably in C++, otherwise in very simple Python that we could easily translate to C++.

Hi Enrico,

Thanks for the reply.

I agree, I think the problem is in the event loop rather than in the call to Snapshot itself because, as I mentioned in the initial post, the crash happens with arguments that work fine on a retry.

I have tried it with MT disabled and have so far been unable to reproduce the crash. This could be luck: the script can sometimes run hundreds of times before I get a crash.
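To put a number on "this could be luck", here is a rough estimate of how many crash-free single-threaded runs would be needed before luck becomes an unlikely explanation. It assumes that runs are independent and that the crash probability per run is constant; the 1-in-100 rate is only an illustrative guess based on "hundreds of times", and the function names are made up for this sketch.

```python
import math


def chance_of_no_crash(crash_probability, runs):
    """Probability of seeing zero crashes in `runs` independent runs."""
    return (1.0 - crash_probability) ** runs


def crash_free_runs_needed(crash_probability, confidence=0.95):
    """Smallest n for which (1 - p)**n <= 1 - confidence.

    If the bug were still present with per-run crash probability p,
    n clean runs in a row would happen by chance at most (1 - confidence)
    of the time.
    """
    if not 0.0 < crash_probability < 1.0:
        raise ValueError("crash_probability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_probability))


# Illustrative numbers only: assume roughly one crash per 100 runs with MT on.
p = 0.01
print("Chance of 100 clean runs by luck:", round(chance_of_no_crash(p, 100), 3))
print("Clean runs needed for 95% confidence:", crash_free_runs_needed(p, 0.95))
```

With these assumed numbers, 100 clean runs would still occur by chance more than a third of the time, and roughly 300 clean runs are needed before the MT-off result is convincing at the 95% level.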

Have there been any other reported issues with using MT?

Thanks again!
Lewis

Have there been any other reported issues with using MT?

There are currently no known issues with RDF+multi-threading. With the current information it’s hard to tell whether the problem is in RDF or in the code you execute through it (e.g. are you sure all your Filter expressions are thread-safe?).

The crash you are seeing does indeed look suspicious though. The problem is that PyROOT executes just-in-time compiled C++ code, and that code is missing debug information, so it’s hard to tell what’s going on. It would be best to have a small standalone reproducer in C++ that we can compile with debug symbols and run through thread-sanitizer and valgrind. The process of creating a small reproducer often makes the problem emerge naturally.

Otherwise a small standalone reproducer in PyROOT will have to do – but in all cases we must be able to run it and it should reproduce the issue fairly consistently.

Cheers,
Enrico

I’m fairly sure the Filter expressions are all thread safe (they are just simple cuts on variables).

I can try to make a C++ reproducer, but the whole point of this issue is that I can’t easily reproduce it: the crash seems to happen at random.

Cheers,

Lewis

It’s fine if the reproducer crashes rarely, as long as it crashes.
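One way to work with a rarely crashing reproducer is to run it in a loop and keep the output of every failing run. The sketch below uses only the Python standard library; the function name `stress`, the `crash_logs` directory and the stand-in command are assumptions made for illustration, not part of the original script.

```python
import subprocess
import sys
from pathlib import Path


def stress(command, runs, log_dir="crash_logs"):
    """Run `command` `runs` times and save the output of each failing run.

    Returns a list of (run_index, returncode) for the runs that failed.
    """
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)
    failures = []
    for i in range(runs):
        # Merge stderr into stdout so the stack trace ends up in one file.
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        # Non-zero means failure; on POSIX a negative code means the process
        # was killed by a signal (for example -11 for a segmentation fault).
        if result.returncode != 0:
            failures.append((i, result.returncode))
            (log_path / f"run_{i:04d}.log").write_bytes(result.stdout)
    return failures


# Stand-in command that always exits cleanly; replace it with the real
# invocation of the reproducer and its arguments.
demo_failures = stress([sys.executable, "-c", "pass"], runs=3)
print("failed runs:", demo_failures)
```

The fraction of failing runs gives a crash rate that can be compared between runs with and without multi-threading, and the saved logs show whether the stack trace is the same each time.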

I really wouldn’t know how else to proceed: I reviewed the code but couldn’t see anything suspicious.
