Copying TRefs and accessing TRef data from multiple threads


ROOT Version: 6.12.06 and 6.14.00
Platform: Linux
Compiler: GCC 6.4.0


Hello, I’m working on improving core utilization in a program that uses ROOT to keep track of its data, and I would appreciate some help. At run-time, the program simulates events in parallel, one per thread/core. Event data is not shared between threads, but all threads call a lot of TRef functions.

While benchmarking with many threads (16 and above), I noticed that the majority of threads spend most of their time waiting to acquire a ROOT mutex in void TRef::operator=(TObject *obj) and TObject *TRef::GetObject() const. According to gdb backtraces, a ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>::WriteLock() is acquired, which, as I understand it, allows only a single thread into the critical section.

Looking at the definition of void TRef::operator=(TObject *obj), I found that setting obj->SetBit(kHasUUID) in advance allowed me to avoid one mutex lock, but fooling ROOT is probably not the best way around the issue, and I’m still left with the unconditional mutex lock in TRef::GetObject().
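
For reference, this is roughly the trick I mean; a minimal sketch (the helper name is mine, and it assumes TRef::operator= skips the TProcessID lookup whenever kHasUUID is already set):

#include <TObject.h>
#include <TRef.h>

// Hypothetical helper illustrating the workaround described above. Setting
// kHasUUID before the assignment makes TRef::operator= take the UUID branch,
// skipping TProcessID::GetProcessWithUID() and its WriteLock. This abuses an
// internal flag and is shown for illustration only.
void assignWithoutRegistration(TRef& ref, TObject* obj) {
    obj->SetBit(TObject::kHasUUID); // pretend the object already carries a TUUID
    ref = obj;                      // no longer registers with a TProcessID
}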

This is my first foray into ROOT, so I have a few questions: why is a WriteLock acquired instead of a ReadLock? Must the new TRef be registered internally? Why is TProcessID::IsValid(TProcessID *pid) called from TRef::GetObject()?
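
To make clear why I am asking about ReadLock vs. WriteLock: a read lock can be held by many threads at once, while a write lock is exclusive. A minimal illustration with the standard library’s std::shared_timed_mutex (not ROOT’s TReentrantRWLock, just the C++14 analogue):

#include <shared_mutex>
#include <mutex>

std::shared_timed_mutex rwlock;
int shared_state = 0;

int readState() {
    // Many threads may hold a shared (read) lock concurrently
    std::shared_lock<std::shared_timed_mutex> lock(rwlock);
    return shared_state;
}

void writeState(int value) {
    // A unique (write) lock excludes all readers and writers
    std::unique_lock<std::shared_timed_mutex> lock(rwlock);
    shared_state = value;
}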

These mutexes are write-locked in both listed ROOT versions.

Is it possible to avoid these mutex locks when using TRef from multiple threads?

Cheers.

I guess @pcanal may help you with this question.

Any updates on this?

Speaking to some people here, it seems the access patterns are being reworked in ROOT7. But is there really no way to work around this central mutex in ROOT6? We see that almost all of our wait time goes into waiting for ROOT6 to release the TReentrantRWLock.


Could you be more specific about the circumstance/operation/stack-trace in which this is happening in your case?

Thanks,
Philippe.

In general, because the separation between the read and the write lock is brand new and has been taken advantage of in only a few places in the code.

I do not (immediately) spot the exact location where they happen. Would you be able to provide a standalone example reproducing your workload (so that we can optimize the right code paths and contention points)?

If the TProcessID pointed to by fPID has become invalid, then the subsequent operations (e.g. fPID->GetObjectWithID(uid);) are at best useless and at worst lead to a segmentation fault.
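
Schematically, the resolution must do something like this (a simplified sketch, not the actual TRef implementation):

#include <TObject.h>
#include <TProcessID.h>

// Simplified sketch of the guard: before dereferencing the stored TProcessID,
// TRef::GetObject() must check that it is still alive.
TObject *ResolveRef(TProcessID *pid, UInt_t uid) {
    if (!TProcessID::IsValid(pid))    // pid may point to a deleted TProcessID
        return nullptr;               // dereferencing it could segfault
    return pid->GetObjectWithID(uid); // safe only after the validity check
}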

Here is a minimal example that simulates 100 “events” on every spawned thread. An event consists of a collection of Derived instances (derived from TObject). When running an event, we just resolve the underlying object of each event-unique TRef and do nothing with it. With a higher thread count (16, for example), most threads end up waiting for the global ROOT mutex. The locking occurs in Derived’s constructor and when calling Derived::getObject().

#include <TObject.h>
#include <TRef.h>
#include <TROOT.h>

#include <chrono>
#include <thread>
#include <vector>
#include <memory>
#include <string>
#include <iostream>

using namespace std::chrono_literals;

class Derived;

class Object : public TObject { // public inheritance, so the dynamic_cast in getObject() can succeed
    public:
        friend class Derived;
        /**
         * @brief Required default constructor
         */
        Object() = default;
        /**
         * @brief Required virtual destructor
         */
        ~Object() override = default;

        /// @{
        /**
         * @brief Use default copy behaviour
         */
        Object(const Object&) = default;
        Object& operator=(const Object&) = default;
        /// @}

        /// @{
        /**
         * @brief Use default move behaviour
         */
        Object(Object&&) = default;
        Object& operator=(Object&&) = default;
        /// @}
};


class Derived : public TObject {
    public:
        Derived(const Object* object) {
            // TRef::operator=(TObject*) registers the object with a TProcessID;
            // this is where the WriteLock in the first backtrace is taken
            object_ = const_cast<Object*>(object);
        }

        const Object* getObject() const {
            // TRef::GetObject() validates the TProcessID; this is where the
            // WriteLock in the second backtrace is taken
            return dynamic_cast<Object*>(object_.GetObject());
        }

    private:
        TRef object_;
};

// ---------------------------------------------------------------------------- //

class Event {
    public:
        Event(const Object* object) {
            for(int i = 0; i < 25; i++) {
                charges_.emplace_back(object);
            }
        }

        void run() {
            for(auto& charge : charges_) {
                auto object = charge.getObject();
                (void)object; // result intentionally unused; we only want the TRef lookup
            }
        }

    private:
        std::vector<Derived> charges_;
};

int main(int argc, char *argv[]) {
    // How many threads do we use?
    std::vector<std::string> args{argv + 1, argv + argc};
    int threads_num = args.size() > 0 ? std::stoi(args[0]) : 1;
    std::cout << "Using " << threads_num << " thread(s).\n";

    ROOT::EnableThreadSafety();

    // Time run-time
    auto start = std::chrono::steady_clock::now();

    // Start run
    std::vector<std::thread> threads;
    for(int i = 0; i < threads_num; i++) {
        auto object = std::make_shared<Object>();
        threads.emplace_back([object = std::move(object)]() {
            for(int i = 0; i < 100; i++) {
                Event event{object.get()};
                event.run();
            }
        });
    }

    for(auto &thread : threads) {
        thread.join();
    }

    auto end = std::chrono::steady_clock::now();
    std::cout << "Run finished in "
              << static_cast<std::chrono::duration<long double, std::milli>>(end - start).count()
              << "ms";
}

I build it with the following CMakeLists.txt:

cmake_minimum_required(VERSION 3.1) # CMAKE_CXX_STANDARD requires at least 3.1

# Set standard build flags
SET(COMPILER_FLAGS -pedantic -Wall -Wextra -Wcast-align -Wcast-qual -Wconversion -Wuseless-cast -Wzero-as-null-pointer-constant -Wdisabled-optimization -Wformat=2 -Winit-self -Wlogical-op -Wmissing-declarations -Wmissing-include-dirs -Wnoexcept -Wold-style-cast -Woverloaded-virtual -Wredundant-decls -Wsign-conversion -Wsign-promo -Wstrict-null-sentinel -Wstrict-overflow=5 -Wswitch-default -Wundef -Werror -Wshadow -Wformat-security -Wdeprecated -fdiagnostics-color=auto -Wheader-hygiene)

# Require a C++14 compiler and do not allow extensions
SET(CMAKE_CXX_STANDARD 14)
SET(CMAKE_CXX_STANDARD_REQUIRED ON)
SET(CMAKE_CXX_EXTENSIONS OFF)

# Include Threads
FIND_PACKAGE(Threads REQUIRED)

# ROOT is required for vector and persistency etc
FIND_PACKAGE(ROOT REQUIRED COMPONENTS Geom GenVector)
IF(NOT ROOT_FOUND)
    MESSAGE(FATAL_ERROR "Could not find ROOT, make sure to source the ROOT environment\n"
    "$ source YOUR_ROOT_DIR/bin/thisroot.sh")
ENDIF()

# Check ROOT version
IF (NOT ${ROOT_VERSION} VERSION_GREATER "6.0")
    MESSAGE(FATAL_ERROR "ROOT versions below 6.0 are not supported")
ENDIF()

# Set the dependencies
SET(ALLPIX_DEPS_INCLUDE_DIRS ${ROOT_INCLUDE_DIRS})
SET(ALLPIX_DEPS_LIBRARIES ${ROOT_LIBRARIES} ${ROOT_COMPONENT_LIBRARIES} Threads::Threads)

# include dependencies
INCLUDE_DIRECTORIES(SYSTEM ${ALLPIX_DEPS_INCLUDE_DIRS})

# create executable and link the libs
ADD_EXECUTABLE(root-test main.cpp)
TARGET_LINK_LIBRARIES(root-test ${ALLPIX_DEPS_LIBRARIES})

Here are some backtraces during a run with 16 threads.

When constructing an Event:

#0  0x00007ffff3d8b8cc in __lll_lock_wait ()
   from /nix/store/2kcrj1ksd2a14bm5sky182fv2xwfhfap-glibc-2.26-131/lib/libpthread.so.0
#1  0x00007ffff3d84a45 in pthread_mutex_lock ()
   from /nix/store/2kcrj1ksd2a14bm5sky182fv2xwfhfap-glibc-2.26-131/lib/libpthread.so.0
#2  0x00007ffff484ecf2 in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>::WriteLock() ()
   from /nix/store/x83lnr0shkcqi84l3fv4iihpzmrqn98v-root-6.12.06/lib/libThread.so
#3  0x00007ffff79c7ff7 in TProcessID::GetProcessWithUID(unsigned int, void const*) ()
   from /nix/store/x83lnr0shkcqi84l3fv4iihpzmrqn98v-root-6.12.06/lib/libCore.so
#4  0x00007ffff79d5b95 in TRef::operator=(TObject*) ()

When running an Event:

#0  0x00007ffff3d8858f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /nix/store/2kcrj1ksd2a14bm5sky182fv2xwfhfap-glibc-2.26-131/lib/libpthread.so.0
#1  0x00007ffff3aad3dc in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /nix/store/4zd34747fz0ggzzasy4icgn3lmy89pra-gcc-7.3.0-lib/lib/libstdc++.so.6
#2  0x00007ffff484e937 in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>::WriteLock() ()
   from /nix/store/x83lnr0shkcqi84l3fv4iihpzmrqn98v-root-6.12.06/lib/libThread.so
#3  0x00007ffff79c833c in TProcessID::IsValid(TProcessID*) ()
   from /nix/store/x83lnr0shkcqi84l3fv4iihpzmrqn98v-root-6.12.06/lib/libCore.so
#4  0x00007ffff79d6040 in TRef::GetObject() const ()


On my system, a run with a single thread takes 1-2ms, whereas a run with 16 threads takes 1100-1200ms. Since the Object instances are unique to each event, I would expect a run to take 1-2ms independent of the thread count.

Cheers.

See https://github.com/root-project/root/pull/2381


Wow, great, this seems to be the remedy! I have continued the discussion at the pull request.

I just ran some benchmarks with this patch. Eventually the program segfaults with the following stack trace:

#0  0x00007ffff4cf8406 in TProcessID::GetObjectWithID(unsigned int) () from /nix/store/zvcvwyzv1ywfvqsrm2akik74msnp4d18-root-6.16/lib/libCore.so
#1  0x00007ffff4d062b2 in TRef::GetObject() const () from /nix/store/zvcvwyzv1ywfvqsrm2akik74msnp4d18-root-6.16/lib/libCore.so

Presuming this is not reproducible in the minimal example above, I’ll see about amending it, but that might be a while.