Home | News | Documentation | Download

TRentrantRWlock thread lock, program freezes

I am developing a standalone CMake program that makes use of a TApplication to show a couple of GUI windows with some internal TThreads and TTimers. I call ROOT::EnableThreadSafety in the very beginning.

However, sometimes, when stopping the TThread, my program freezes. When running it through the debugger, I see the following (below).

Is there any way I can prevent this freeze? Note: RThread is a TThread derived class, following How to Use the ROOT TThread, that calls TThread::Printf() when stopped. Maybe I should avoid calling Printf ?

Thread 1:

1  futex_wait_cancelable                                                                  futex-internal.h    183 0x7ffff7482376 
2  __pthread_cond_wait_common                                                             pthread_cond_wait.c 508 0x7ffff7482376 
3  __pthread_cond_wait                                                                    pthread_cond_wait.c 638 0x7ffff7482376 
4  std::condition_variable::wait(std::unique_lock<std::mutex>&)                                                   0x7ffff7361e50 
5  ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::ReadLock()                         0x7ffff685d125 
6  THashTable::FindObject(const char *) const                                                                     0x7ffff7713bad 
7  TClass::GetClass(const char *, bool, bool, unsigned long, unsigned long)                                       0x7ffff776305a 
8  TClass::InheritsFrom(const char *) const                                                                       0x7ffff77641f5 
9  TGCommandPlugin::CheckRemote(const char *)                                                                     0x7ffff7c84655 
10 TGCommandPlugin::HandleTimer(TTimer *)                                                                         0x7ffff7c84880 
11 TTimer::Notify()                                                                                               0x7ffff76e5d06 
12 TTimer::CheckTimer(TTime const&)                                                                               0x7ffff76e5749 
13 TUnixSystem::DispatchTimers(bool)                                                                              0x7ffff77c1eaf 
14 TUnixSystem::DispatchOneEvent(bool)                                                                            0x7ffff77c2e1d 
15 TSystem::Run()                                                                                                 0x7ffff76cfa49 
16 TApplication::Run(bool)                                                                                        0x7ffff76674e3 
17 main                                                                                   main.cxx            121 0x55555555f315

Thread 2:

1 __GI___poll  poll.c           29  0x7ffff7048aff 
2 ??                                0x7fffdfd831a2 
3 start_thread pthread_create.c 477 0x7ffff747b609 
4 clone        clone.S          95  0x7ffff7055293

Thread 3:

1  futex_wait_cancelable                                  futex-internal.h    183  0x7ffff7482376 
2  __pthread_cond_wait_common                             pthread_cond_wait.c 508  0x7ffff7482376 
3  __pthread_cond_wait                                    pthread_cond_wait.c 638  0x7ffff7482376 
4  TThread::XARequest(const char *, int, void * *, int *)                          0x7ffff686cc6a 
5  TThread::Printf(const char *, ...)                                              0x7ffff686ce49 
6  RThread::Stop                                          RThread.cpp         118  0x7ffff7f263bb 
7  MainWindow::DoStopDAQ                                  MainWindow.cpp      1720 0x7ffff7f3a5ee 
8  MainWindow::SaveAndExit                                MainWindow.cpp      1206 0x7ffff7f37156 
9  ??                                                                              0x7fff9ec2702a 
10 ??

In other cases, the problem comes from another TTimer I use specifically, not the internal one in TGCommandPlugin. Here the gdb output:

Thread 1 "runGUI" received signal SIGINT, Interrupt.
futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555556ccb85c) at ../sysdeps/nptl/futex-internal.h:183
183	../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) thread apply all bt

Thread 3 (Thread 0x7ffedf519700 (LWP 417864)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555557d50f9c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5555568b8a80, cond=0x555557d50f70) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x555557d50f70, mutex=0x5555568b8a80) at pthread_cond_wait.c:638
#3  0x00007ffff686cc6a in TThread::XARequest(char const*, int, void**, int*) () from /opt/root/lib/libThread.so
#4  0x00007ffff686ce49 in TThread::Printf(char const*, ...) () from /opt/root/lib/libThread.so
#5  0x00007ffff7f26390 in RThread::Stop (this=0x5555573de458) at /home/user/zdt-daq/gui/RThread.cpp:117
#6  0x00007ffff7f3a561 in MainWindow::DoStopDAQ (this=0x7ffffffdd380) at /home/ferhue/zdt-daq/gui/MainWindow.cpp:1715
#7  0x00007ffff7f37156 in MainWindow::SaveAndExit (this=0x7ffffffdd380) at /home/ferhue/zdt-daq/gui/MainWindow.cpp:1206
#8  0x00007ffedec5502a in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fffdfd44700 (LWP 417863)):
#0  0x00007ffff7048aff in __GI___poll (fds=0x7fffdfd3ea80, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fffdfd831a2 in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#2  0x00007ffff747b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#3  0x00007ffff7055293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff521cb80 (LWP 417832)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555556ccb85c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5555567aa760, cond=0x555556ccb830) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x555556ccb830, mutex=0x5555567aa760) at pthread_cond_wait.c:638
#3  0x00007ffff7361e50 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff685c3a6 in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock() () from /opt/root/lib/libThread.so
#5  0x00007ffff76b03ea in TQConnection::LockSlot() const () from /opt/root/lib/libCore.so
#6  0x00007ffff7cc92a7 in TQConnection::SendSignal() () from /opt/root/lib/libGui.so
#7  0x00007ffff7f285ec in TQObject::EmitVA<>(char const*, int) (this=0x7ffffffdd6d0, signal_name=0x7ffff7928d53 "Timeout()") at /home/ferhue/builds/build-root_src-Desktop-Debug/include/TQObject.h:137
#8  0x00007ffff76e5cf1 in TTimer::Notify() () from /opt/root/lib/libCore.so
#9  0x00007ffff76e5749 in TTimer::CheckTimer(TTime const&) () from /opt/root/lib/libCore.so
#10 0x00007ffff77c1eaf in TUnixSystem::DispatchTimers(bool) () from /opt/root/lib/libCore.so
#11 0x00007ffff77c2e1d in TUnixSystem::DispatchOneEvent(bool) () from /opt/root/lib/libCore.so
#12 0x00007ffff76cfa49 in TSystem::Run() () from /opt/root/lib/libCore.so
#13 0x00007ffff76674e3 in TApplication::Run(bool) () from /opt/root/lib/libCore.so
#14 0x000055555556b315 in main (argc=3, argv=0x7fffffffe118) at /home/ferhue/zdt-daq/gui/main.cxx:121

If I remove the TThread::Printf() part from my Stop() function, then it does not freeze. Is there a better way to issue this Printf so that it does not cause a freeze?


ROOT Version: 6.25/01
Platform: Ubuntu 20
Compiler: gcc


I am not sure what is happening (do you have a standalone reproducer?).

The symptoms is that the thread 3 is waiting for the main thread to execute the Printf.
But the main thread is itself waiting for the main ROOT read-write lock to be released.

What is not clear is which thread is holding the lock that the main thread is waiting on (It could be the MainWindow code … or it could be bug where a thread that was holding the lock was kill before it could release it … or it could be something else).

Not calling the Printf will reduce the likelihood of this dead-lock happening (by avoiding the secondary lock) but won’t necessarily prevent it (well we can not tell since we do not yet understand where it comes from).

The software is a bit complex to produce a minimum standalone reproducer. I also do not get this race condition reproducibly, it depends a lot on the timing, how many instructions I place before stopping the thread, etc. But I can share the Thread class I am using in case you see something suspicious I am doing…
RThread.zip (2.3 KB)

Yes, the MainWindow creates a TTimer (single shot), which a private member of MainWindow, that after 100 milliseconds automatically calls MainWindow::DoStopDAQ() which itself calls “RThread->Stop()” and then Printf is blocking that.

The main program runs like this: TApplication app; Mainwindow w; app.Run(); and MainWindow has some update timers that itself interact with an internal DAQ RThread.

Thanks so much for the help.

As a side note, it seems that the flags (fStopFlag, fPauseFlag) can be modify and read by multiple threads (via the Stop function for example) and thus might need to be std::atomic to insure consistency.

I am still not enable to guess from your code where the problem might be.

One thing that might help is to find out which thread is blocking the main thread.
So, with a debug build of ROOT and in one the stuck state do

gdb> print ((ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts> * const)ROOT::gCoreMutex)->fMutex
gdb> info threads

Cheers,
Philippe.

Thanks so much.

(gdb) print ((ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts> * const)ROOT::gCoreMutex)->fMutex
A syntax error in expression, near `const)ROOT::gCoreMutex)->fMutex'.

The above does not work.

print (ROOT::gCoreMutex)
$1 = (class ROOT::TVirtualRWMutex *) 0x5555569d5550
(gdb) info threads
  Id   Target Id                                   Frame 
* 1    Thread 0x7ffff4961b80 (LWP 427777) "runGUI" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555569d55bc) at ../sysdeps/nptl/futex-internal.h:183
  2    Thread 0x7fffe404c700 (LWP 427808) "runGUI" 0x00007ffff6d98aff in __GI___poll (fds=0x7fffe4046a80, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  3    Thread 0x7ffee3827700 (LWP 427809) "runGUI" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555557b9600c) at ../sysdeps/nptl/futex-internal.h:183
(gdb) thread apply all bt

Thread 3 (Thread 0x7ffee3827700 (LWP 427869)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555557a0d8ac) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555557ba0990, cond=0x555557a0d880) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x555557a0d880, mutex=0x555557ba0990) at pthread_cond_wait.c:638
#3  0x00007ffff63c6ed1 in TPosixCondition::Wait (this=0x555557a0d870) at /opt/root_src/core/thread/src/TPosixCondition.cxx:65
#4  0x00007ffff63c4a92 in TThread::XARequest (xact=0x7ffff63d2200 "PRTF", nb=2, ar=0x7ffee3821310, iret=0x0) at /opt/root_src/core/thread/src/TThread.cxx:1061
#5  0x00007ffff63c4449 in TThread::Printf (fmt=0x7ffff7fa7106 "Stop:  %s") at /opt/root_src/core/thread/src/TThread.cxx:950
#6  0x00007ffff7f26439 in RThread::Stop (this=0x55555718c468) at /home/user/zdt-daq/gui/RThread.cpp:121
#7  0x00007ffff7f3a5df in MainWindow::DoStopDAQ (this=0x7ffffffdd590) at /home/user/zdt-daq/gui/MainWindow.cpp:1715
#8  0x00007ffff7f371d4 in MainWindow::SaveAndExit (this=0x7ffffffdd590) at /home/user/zdt-daq/gui/MainWindow.cpp:1206
#9  0x00007ffee2f2502a in ?? ()
#10 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7fffe404c700 (LWP 427868)):
#0  0x00007ffff6d98aff in __GI___poll (fds=0x7fffe4046a80, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fffe408b1a2 in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#2  0x00007ffff71cb609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#3  0x00007ffff6da5293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff4961b80 (LWP 427837)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555569d55b8) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5555569b1060, cond=0x5555569d5590) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x5555569d5590, mutex=0x5555569b1060) at pthread_cond_wait.c:638
#3  0x00007ffff70b1e50 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff63b31ab in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex> > (this=0x5555569d5590, __lock=...) at /usr/include/c++/9/condition_variable:273
#5  0x00007ffff63b1df4 in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex>, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}>(std::unique_lock<std::mutex>&, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}) (this=0x5555569d5590, __lock=..., __p=...) at /usr/include/c++/9/condition_variable:282
#6  0x00007ffff63adb8b in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x5555569d5558) at /opt/root_src/core/thread/src/TReentrantRWLock.cxx:190
#7  0x00007ffff63bc736 in ROOT::TRWMutexImp<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x5555569d5550) at /opt/root_src/core/thread/src/TRWMutexImp.cxx:42
#8  0x00007ffff771d55b in ROOT::TVirtualRWMutex::Lock (this=0x5555569d5550) at /home/user/builds/build-root_src-Desktop-Debug/include/TVirtualRWMutex.h:79
#9  0x00007ffff7556780 in TQConnection::LockSlot (this=0x555557490ff0) at /opt/root_src/core/base/src/TQConnection.cxx:664
#10 0x00007ffff7556bf0 in TQConnection::SendSignal (this=0x555557490ff0) at /opt/root_src/core/base/inc/TQConnection.h:71
#11 0x00007ffff7f2866a in TQObject::EmitVA<>(char const*, int) (this=0x7ffffffdd8e0, signal_name=0x7ffff7dfb9d3 "Timeout()") at /home/user/builds/build-root_src-Desktop-Debug/include/TQObject.h:137
#12 0x00007ffff7f9713e in TQObject::Emit (this=0x7ffffffdd8e0, signal=0x7ffff7dfb9d3 "Timeout()") at /home/user/builds/build-root_src-Desktop-Debug/include/TQObject.h:173
#13 0x00007ffff7c2b643 in TTimer::Timeout (this=0x7ffffffdd8d0) at /opt/root_src/core/base/inc/TTimer.h:96
#14 0x00007ffff758d5ba in TTimer::Notify (this=0x7ffffffdd8d0) at /opt/root_src/core/base/src/TTimer.cxx:145
#15 0x00007ffff758d582 in TTimer::CheckTimer (this=0x7ffffffdd8d0, now=...) at /opt/root_src/core/base/src/TTimer.cxx:132
#16 0x00007ffff76a6ad0 in TUnixSystem::DispatchTimers (this=0x5555555fe800, mode=true) at /opt/root_src/core/unix/src/TUnixSystem.cxx:2956
#17 0x00007ffff76a17f3 in TUnixSystem::DispatchOneEvent (this=0x5555555fe800, pendingOnly=false) at /opt/root_src/core/unix/src/TUnixSystem.cxx:1096
#18 0x00007ffff7577475 in TSystem::InnerLoop (this=0x5555555fe800) at /opt/root_src/core/base/src/TSystem.cxx:405
#19 0x00007ffff75771e3 in TSystem::Run (this=0x5555555fe800) at /opt/root_src/core/base/src/TSystem.cxx:355
#20 0x00007ffff74fbaaf in TApplication::Run (this=0x7ffffffdd1d0, retrn=false) at /opt/root_src/core/base/src/TApplication.cxx:1624
#21 0x000055555556b315 in main (argc=3, argv=0x7fffffffe328) at /home/user/zdt-daq/gui/main.cxx:121

I tried changing the fFlags to std::atomic. First it looked as if it worked. But then I added additional Printf statements in the Stop function, and got the freeze again (every time). So might have changed just the timing a bit of when it happened.

Thanks again.

Attached some helgrind output (it does not freeze with helgrind, probably because it runs slowlier).
helgrind.txt (286.0 KB)

Below another one where it did freeze, and I pressed CTRL+C
hel.txt (190.1 KB)

We need to find a way to tweak it to make it work.
try

print ((ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique> * const)ROOT::gCoreMutex)->fMutex

or

print ((ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique> * )ROOT::gCoreMutex)->fMutex

if none work, try getting the right syntax by setting a break point on

ROOT::TRWMutexImp<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock

and then doing

print this

this should give you the syntax for the type name that you version of gdb should like.

Ok, with single quotes, it worked. Here the output.

(gdb) print (('ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>' * const)ROOT::gCoreMutex)->fMutex
$2 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 1, __count = 1, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\001\000\000\000\001", '\000' <repeats 34 times>, __align = 4294967297}}, <No data fields>}

It gives the same result as print (('ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>' * const)ROOT::gCoreMutex)->fMutex

That seems to indicates an ‘odd’ problem. The mutex is locked but has no owner :frowning: … Most likely the thread that held it died unexpectedly or abruptly or something.

One way forward would be to instrument the ROOT mutex code to track the locking and unlocking of the mutex (in the hope that the printing does not change the behavior too much and we can still catch the issue) [This requires modifying the ROOT code and rebuilding]

1 Like

Thanks, I’ll try to create a minimized version of my program to see if the same issue happens, and see then if by hacking the ROOT build we can find out more.

From the Helgrind output attached above, I see some data race warnings between Printf and TTimers, e.g.

==430510== Possible data race during read of size 8 at 0x64DC540 by thread #1
==430510== Locks held: none
==430510==    at 0x64ABB82: TThread::XAction() (TThread.cxx:1079)
==430510==    by 0x64AC234: TThreadTimer::Notify() (TThread.cxx:1208)
==430510==    by 0x51B4581: TTimer::CheckTimer(TTime const&) (TTimer.cxx:132)
==430510==    by 0x52CDACF: TUnixSystem::DispatchTimers(bool) (TUnixSystem.cxx:2956)
==430510==    by 0x52C87F2: TUnixSystem::DispatchOneEvent(bool) (TUnixSystem.cxx:1096)
==430510==    by 0x519E474: TSystem::InnerLoop() (TSystem.cxx:405)
==430510==    by 0x519E1E2: TSystem::Run() (TSystem.cxx:355)
==430510==    by 0x5122AAE: TApplication::Run(bool) (TApplication.cxx:1624)
==430510==    by 0x11F30F: main (main.cxx:123)
==430510== 
==430510== This conflicts with a previous write of size 8 by thread #3
==430510== Locks held: 1, at address 0x167C9B50
==430510==    at 0x64AB927: TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1036)
==430510==    by 0x64AB448: TThread::Printf(char const*, ...) (TThread.cxx:950)
==430510==    by 0x487B46A: ThSFMC01::ThreadFunction() (ThSFMC01.cpp:57)
==430510==    by 0x487A9D8: RThread::ThreadHandle(void*) (RThread.cpp:174)
==430510==    by 0x64AAEC2: TThread::Function(void*) (TThread.cxx:828)
==430510==    by 0x4842B1A: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==430510==    by 0x563C608: start_thread (pthread_create.c:477)
==430510==    by 0x5AC5292: clone (clone.S:95)
==430510==  Address 0x64dc540 is 0 bytes inside data symbol "_ZN7TThread11fgXActCondiE"

Alright, I came up with a (more or less) small reproducer.
zdt.zip (73.2 KB)

By adding several artificial printfs and stuff like that, I managed to get reproducible ‘freezes’ in two different computers. It always get stuck, though sometimes with an internal TTimer from TGCommandPlugin, sometimes with the normal timers from TApplication.Run (TUnixsystem).

I do:

mkdir build && cd build
make
gdb --args ../build/gui/runGUI --prbs 26 -q
Gets stuck...
CTRL+C
info threads
thread apply all bt
print (('ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>' * const)ROOT::gCoreMutex)->fMutex

First one needs to clone the CLI11 repository as explained in the README file.

Thanks!!

Probably unrelated, but valgrind found a memleak in TThread::Printf valgrind TThread Init Printf definitely lost bytes · Issue #8297 · root-project/root · GitHub

@pcanal could you point out which class or files I should modify to keep track of the lock owners? Shall I do it by printing out some info, or by storing an array of strings that can be debugged interactively without calling printf?
Thanks.

→ EDIT, Using Helgrind, I think I am getting closer to the reason of the freeze, see Data races TThread TTimer TApplication · Issue #8365 · root-project/root · GitHub

The 4 points are:

  • Claiming the write lock: TReentrantRWLock.cxx:194
  • Actually release the write lock: TReentrantRWLock.cxx:230 (both line update fWriter)
  • Taking one of the read lock, right before the return statement: TReentrantRWLock.cxx:131
  • Releasing one of the read lock, TReentrantRWLock.cxx:149

You can record the return of std::this_thread::get_id() (and/or “print” (see std::thread::get_id - cppreference.com) its value). Keeping a map, might be the easiest … maybe.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.