ROOT::EnableThreadSafety() causing failures

NathanielTagg · February 9, 2019, 4:39pm

Dear ROOT,

I’ve been trying to integrate some code that uses ROOT into a Node.js framework, which necessitates the use of threads. Based on Multicore/multithreading, I’m explicitly keeping all ROOT operations confined locally to their own threads.

However, my test case keeps exploding. I have some rather complicated LArSoft framework code, but at its heart it’s doing this (all in one thread):
void my_thing() {
  TFile f("file.root");
  TTree* tree = nullptr;
  f.GetObject("tree", tree);  // typed lookup of the tree by name
  tree->SetBranchAddress( /* branch name and buffer address elided */ );
  tree->GetEntry(0);
}

This code creates a lot of side activity, requiring a lot of environment variables and other shared libraries. However, it fails at the GetEntry() call.

Now here’s the interesting thing:

It runs fine if the main thread calls my_thing()
It runs fine if my_thing() is in a sub-thread (with no other activity) as long as EnableThreadSafety() is NOT called.
If I call EnableThreadSafety in the main thread, and then call my_thing(), it segfaults.
(I get: “Warning in TClass::GetStreamer: For art::BranchDescription, the TClassStreamer passed does not properly implement the Generate method”, followed by the crash dump) The debugger says it’s crashing in one of the framework libraries, in a TClassStreamer::operator() method.
If I don’t call EnableThreadSafety(), and attempt a simultaneous my_thing() in another thread, it crashes… as expected from the discussion of ROOT internals.
If I launch both my_thing() calls from different threads, but not simultaneously, with EnableThreadSafety() off, everything works OK.

The puzzling thing is why EnableThreadSafety() would cause problems…? My guess was that the first launch of the my_thing code causes a lot of shared libraries to get loaded only in that particular thread. But I tried launching my_thing() from the main thread, letting it finish, and then launching it from a thread… and the thread crashes again.

I’m at my wits’ end here. I can’t figure out how my_thing() even KNOWS it’s in a thread!

(To those who say “don’t do this”… I’m trying to build a server that can reply to multiple data requests via http. Those requests are time-consuming, and so require either a new executable launch, a new fork, or a new thread for each request.)

ROOT Version: 6.12/06
Platform: macosx64
Compiler: clang

pcanal · February 9, 2019, 9:33pm

It looks like the initialization needed by the ART framework (on top of but not part of ROOT) is not done properly in this case. It is a bit surprising that ‘just’ enabling the ROOT locks would make a difference, though. (I would run the failing example with valgrind and see if it gives some more information.)

eguiraud · February 10, 2019, 1:56pm

Hi Nathaniel,
ROOT v6.12 is very old in terms of parallelism support. I strongly suggest that you upgrade to v6.16.

The my_thing() function as written in your post should be safe to execute concurrently if EnableThreadSafety() has been called (conversely, there is no chance it will execute correctly and concurrently if EnableThreadSafety() has not been invoked).
So there are important elements missing in the reproducer.

You can use gdb on a ROOT (and art) debug build to check what is “crashing” (is it a use after delete? an out of bound access? dereferencing of a null pointer or an invalid pointer? …).
As @pcanal suggests, valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ./your-program also might report problems – just as compiling everything with clang’s thread sanitizer or address sanitizer could.

In short…what is going wrong exactly? What is the minimal reproducer, with all non-relevant elements removed? (In fact, if you can share a minimal reproducer, we might be able to help debug the crash.)

NathanielTagg · February 10, 2019, 8:33pm

Alas, updating ROOT is not an option, since ART is very sensitive, and our build chain is massive.

The failure is a segfault in TClass::operator(), down deep inside GetEntry. Despite using the “Debug” version of the UPS package, the debugger can provide no line-by-line debugging (I suspect because it can’t associate the prebuilt binary with the source code). tmp.txt (9.7 KB)

The real mystery to me is what EnableThreadSafety actually DOES. Where is the source code? I couldn’t figure it out, since it’s some sort of dynamic library thing.

Another question: does the existence of a TApplication (or not) connect with this? It’s a bit of a black box.

Full error dump attached.

eguiraud · February 11, 2019, 8:21am

This is really one for @pcanal: could it be that TClass::StreamerExternal(TClass const*, void*, TBuffer&, TClass const*) has issues with ROOT::EnableThreadSafety?

NathanielTagg · February 11, 2019, 5:12pm

Very probably it’s an ART issue. I mistook it for ROOT because none of the ART code is being called directly, at least not in the failure case.

In the meantime, if I wrap all my gallery calls (the stuff my_thing() represents) with a mutex, I can make it all work OK without calling EnableThreadSafety, at least on one platform. It’s a fugly workaround, though, and I have other functions that also make ROOT TFile calls, so this is only good enough for early development.
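
The mutex workaround described above could look roughly like the sketch below. It is an illustration only: `library_mutex`, `with_library_lock`, and `run_serialized` are hypothetical names, and the counter update merely stands in for the gallery calls that my_thing() represents. Nothing here is art, gallery, or ROOT API.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One process-wide lock shared by every call into the non-thread-safe library.
inline std::mutex& library_mutex() {
    static std::mutex m;  // function-local static: initialised once, thread-safely (C++11)
    return m;
}

// Run any callable while holding the lock and forward its result (if any).
template <typename F>
auto with_library_lock(F&& f) -> decltype(std::forward<F>(f)()) {
    std::lock_guard<std::mutex> guard(library_mutex());
    return std::forward<F>(f)();
}

// Hypothetical stand-in for concurrent my_thing() calls. The plain int is
// only safe to update because every access goes through with_library_lock().
inline int run_serialized(int n_threads, int calls_per_thread) {
    assert(n_threads >= 0 && calls_per_thread >= 0);
    int shared_counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&shared_counter, calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) {
                with_library_lock([&shared_counter] { ++shared_counter; });
            }
        });
    }
    for (auto& w : workers) w.join();  // all updates are visible after join
    return shared_counter;
}
```

The approach only holds if every code path that touches the library takes the same lock; a single unguarded TFile call elsewhere in the program defeats it, which is the limitation noted above.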

pcanal · February 11, 2019, 5:39pm

What did the valgrind run say?

pcanal · February 11, 2019, 9:06pm

I heard from the ART development team and they recognize this issue as (likely to be) something they fixed quite a while ago. It is likely you are trying to use an (older) version of art/gallery that was not designed to be thread-safe. You might want to get in touch directly with them for further support/help.

Cheers,
Philippe.

NathanielTagg · February 11, 2019, 11:55pm

I’ve never successfully run valgrind on any platform. (I work on OS X so I can use the Apple debugger!)

pcanal · February 13, 2019, 6:59pm

No need to run valgrind as it is clear/likely that you must either not use multi-threading or upgrade to a newer version of ART that supports multi-threading.

Cheers,
Philippe.

NathanielTagg · February 14, 2019, 1:56pm

Thanks. Alas, that’s useless since the lead time of advancing our framework is months, minimum. I’m forever doomed to working with ancient software.

I hate frameworks.

It looks to me that the only reasonable way forward is to put mutex locks around every gallery and ROOT call, and not ever call EnableThreadSafety. If that doesn’t work, I need to fork a process to handle each file access, and pipe the data back to the main process. (I have code for that working, but it’s such a ham-fisted solution.)
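
For reference, the fork-and-pipe fallback mentioned above can be sketched as follows. This is not the code referred to in the post: `run_in_child`, `read_data_in_child`, and `example_usage` are hypothetical names, the payload string is a placeholder for real event data, and only standard POSIX calls (`pipe`, `fork`, `read`, `write`, `waitpid`) are assumed.

```cpp
#include <cassert>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the single-threaded file access done in the child.
inline std::string read_data_in_child() {
    return "entry 0 payload";
}

// Fork a child, run `work` there, and collect what it writes to the pipe.
// Returns false if pipe/fork fails or the child does not exit cleanly.
template <typename Work>
bool run_in_child(Work work, std::string& result) {
    int fds[2];
    if (pipe(fds) != 0) return false;

    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (pid == 0) {  // child: produce the data and write it to the pipe
        close(fds[0]);
        const std::string out = work();
        const char* p = out.data();
        size_t left = out.size();
        while (left > 0) {
            const ssize_t n = write(fds[1], p, left);
            if (n <= 0) _exit(1);
            p += n;
            left -= static_cast<size_t>(n);
        }
        close(fds[1]);
        _exit(0);  // skip atexit handlers and static destructors copied from the parent
    }

    // parent: read until the child closes its end of the pipe
    close(fds[1]);
    result.clear();
    char buf[4096];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0) {
        result.append(buf, static_cast<size_t>(n));
    }
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return false;
    return n == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Usage sketch: the parent receives the child's bytes as a string.
inline std::string example_usage() {
    std::string out;
    const bool ok = run_in_child(read_data_in_child, out);
    assert(ok);  // child exited cleanly
    return ok ? out : std::string();
}
```

One caveat with this route: when a multi-threaded process forks, only the calling thread exists in the child, so locks held by other threads at that moment stay locked there. Forking early, or from a dedicated helper process, sidesteps that.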

Can you tell me where the code for EnableThreadSafety is? What files have the C++ code?

–NJT

pcanal · February 15, 2019, 4:20pm

What’s the bottleneck there? (I.e., your bug is fixed in newer versions of ART and ROOT; if your work is helping the experiment, they need to move, and should do so on a regular basis.)

“Can you tell me where the code for EnableThreadSafety is? What files have the C++ code?”

How does it help? [It is in core/base/src/TROOT.cxx]

Cheers,
Philippe.

NathanielTagg · February 19, 2019, 5:42pm

My work is not critical to the experiment, and the lead time on incorporating and verifying software is months. We only do a major version upgrade once a year at most. So we are always likely to be two years behind.

No, core/base/src/TROOT.cxx just has a call to load a shared library, but I can’t figure out what code goes into that shared library. I’m curious if I can isolate the problem with gallery/art and perform a workaround in my own executable.

eguiraud · February 19, 2019, 8:20pm

Hi,
you can follow the sequence of calls to TThread::Init, in core/thread/src/TThread.cxx.

Fundamentally, it instantiates the global ROOT mutex that will be used to make ROOT operations on the application’s global state thread-safe.
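
The general pattern behind that description can be sketched in a few lines. This is not ROOT’s source: everything in the `sketch` namespace is a hypothetical name, and the choice of a recursive mutex is an assumption made for the illustration. The idea shown is a global mutex pointer that stays null until thread safety is enabled, plus a guard that locks only when the mutex exists.

```cpp
#include <cassert>
#include <mutex>

namespace sketch {

// Null until thread safety is switched on, so guards are no-ops before that.
inline std::recursive_mutex*& global_mutex() {
    static std::recursive_mutex* m = nullptr;
    return m;
}

// Analogue of "enable thread safety": instantiate the global mutex.
// Assumed to be called before any worker threads start.
inline void enable_thread_safety() {
    static std::recursive_mutex the_mutex;
    global_mutex() = &the_mutex;
}

// RAII guard: locks the global mutex if it exists, does nothing otherwise.
class GlobalLockGuard {
public:
    GlobalLockGuard() : m_(global_mutex()) { if (m_) m_->lock(); }
    ~GlobalLockGuard() { if (m_) m_->unlock(); }
    GlobalLockGuard(const GlobalLockGuard&) = delete;
    GlobalLockGuard& operator=(const GlobalLockGuard&) = delete;
    bool holds_lock() const { return m_ != nullptr; }
private:
    std::recursive_mutex* m_;
};

// Stand-in for a library routine that touches global state.
// Returns whether the call was actually protected by the lock.
inline bool touch_global_state() {
    GlobalLockGuard guard;
    return guard.holds_lock();
}

// Usage sketch: the same routine is unprotected before enabling, protected after.
inline void example_usage() {
    assert(!touch_global_state());
    enable_thread_safety();
    assert(touch_global_state());
}

}  // namespace sketch
```

In a design like this, code that was never written with the lock in mind (for example, a streamer supplied by another framework) takes a different path once the mutex exists, which is one way enabling thread safety can change behaviour even in a single worker thread.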
