"fStreamerImpl not properly initialized" with IMT and RDF.RunGraphs in PyRoot


Dear ROOT experts,

I am doing an analysis with O(100) computational graphs (each with a different RDF) running on a 12-core machine in parallel.

With fewer than 4 cores, the “fStreamerImpl not properly initialized (0)” error is very unlikely to happen. However, with IMT turned on, in my case using 24 threads, the process panics with the error almost every time!

I can’t help but think this might be some kind of race condition, but I am not sure where to start debugging… Could you please shed some light on this?

I have attached my script and the relevant source files needed to reproduce the error below, as well as a log with “RLogScopedVerbosity” set to “Info”.

err.log.txt (36.3 KB)
source.zip (87.2 KB)

Cheers, Dong

ROOT Version: 6.28.04
Platform: ArchLinux x86_64
Compiler: GCC 13


Dear @qidong ,

Thank you for the report and for the source code. To have a full reproducer, we would also need a simple data sample so that we can run your code. Would it be possible to provide us with some data (even privately)?

If you also want to try debugging the application yourself, a starting point would be to build ROOT with debug symbols, e.g. appending -DCMAKE_BUILD_TYPE=Debug to your cmake command. Then the stack traces would be a bit more informative. Also, since you are running Python code, you can tell cling to JIT-compile code with debug symbols via export CLING_DEBUG=1 in your environment before running the Python script.
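As a rough illustration of the second suggestion, the sketch below launches a command with `CLING_DEBUG=1` set only for that child process, which is equivalent to running `export CLING_DEBUG=1` in the shell first. The helper name `run_with_cling_debug` is hypothetical and not part of ROOT; the demo command simply echoes the variable back to show that it is set.

```python
import os
import subprocess
import sys


def run_with_cling_debug(cmd):
    """Run `cmd` with CLING_DEBUG=1 in its environment (hypothetical helper)."""
    env = dict(os.environ)      # copy, so the parent environment is untouched
    env["CLING_DEBUG"] = "1"    # same effect as `export CLING_DEBUG=1`
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Demo: a child Python process reports the variable it received.
demo = run_with_cling_debug(
    [sys.executable, "-c", "import os; print(os.environ['CLING_DEBUG'])"]
)
print(demo.stdout.strip())

# For the actual analysis one would pass the script instead, e.g.
# run_with_cling_debug([sys.executable, "./rdf_error.py"])
```

The variable has to be in the environment before the Python script starts, which is why it is set on the child process rather than inside the analysis script.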

Cheers,
Vincenzo

Hi @vpadulan,

Thanks a lot for your constructive reply and the instructions on debugging ROOT!

The data needed to reproduce the error are a bit too large to upload directly here, but they are readily available on Rucio: samples.txt (32.8 KB)
Please let me know if you have access, otherwise we could use CernBox or something else :wink:

I have uploaded a new zip file with everything required to reproduce the error:
rdf_error_repro.zip (85.5 KB)

Here are some instructions:

pip install atlasplots numpy

unzip rdf_error_repro.zip
cd rdf_error_repro

cd data;
rucio get `cat samples.txt` # may take long, ~100 GB data

cd ../
python ./rdf_error.py

Please let me know if you encounter any issue setting it up!

Cheers, Dong

Dear @qidong,

Thanks for the instructions. I have never used rucio before. I installed it, but then I get complaints about a missing configuration:

RuntimeError: Could not load Rucio configuration file. Rucio looked in the following paths for a configuration file, in order:
	/home/vpadulan/Projects/rootcode/forum-posts/56278/rucio-venv/etc/rucio.cfg
	/opt/rucio/etc/rucio.cfg

The online docs were not helping. Before proceeding, are we sure I can use the tool without being an ATLAS member?

Hi @vpadulan,

My apologies! I didn’t realise only ATLAS uses rucio…

I suppose in that case CERNbox should work? (Judging by the name CERNbox)

I have prepared a link here which should contain all the data needed. If you copy all the files from this directory and put them in the data folder, everything should work.

Sorry for the inconvenience!

Cheers, Dong

Dear @qidong ,

I can see the CERNBox link. I fear that each file has been stored in its own folder, but the dataset should be there somehow. I have started the download, which will probably take a while. I will let you know if I find out something interesting.

Unfortunately, it seems like the CERNBox link gives up after downloading ~10 GB of data, which I understand is not the full dataset size. I only get the following two folders

user.qidong.0823v4.data15_13TeV.periodAllYear.physics_Main.YS.grp15_v01_p5631.sv0_Zt
user.qidong.0823v4.data16_13TeV.periodAllYear.physics_Main.YS.grp16_v01_p5631.sv0_Zt

I have to understand how to work around this.

Hi @vpadulan and @qidong,

Have you maybe tried using the direct /eos/ path to share the files? I think this should work fine even for large datasets.

Cheers,
Marta

Hi @mczurylo ,

That’s a good suggestion indeed. In principle, the files should even be reachable via xrootd from the EOS path. Let me try that!

Hi @mczurylo and @vpadulan,

Thanks a lot for your enthusiasm in helping me despite so many technical issues.

Yeah, that is a great suggestion!

But I was not able to change the permissions of my folder for some reason:
chmod -R 755 /eos/home-q/qidong
chmod: changing permissions of '/eos/home-q/qidong': Operation not permitted

Please could you confirm you do have access to it?

Cheers, Dong

Hi @qidong,

You can change the access to your directory directly on the CERNBox website. You can even choose individual people to share it with. It should work then.

Cheers,
Marta

Dear @qidong ,

I confirm I can access the dataset and I am running your reproducer. Indeed, a properly-formed xrootd path to the folder you shared with me on EOS works (i.e. root://eosuser.cern.ch//eos/user/q/qidong/ntuple_share).
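To make the path format concrete, here is a minimal sketch of how an absolute EOS path maps to the xrootd URL quoted above. The function name `eos_to_xrootd` is hypothetical, and the default host is simply the one that appears in this thread. Other EOS instances may use a different host.

```python
def eos_to_xrootd(eos_path, host="eosuser.cern.ch"):
    """Build an xrootd URL from an absolute /eos/ path (hypothetical helper)."""
    if not eos_path.startswith("/eos/"):
        raise ValueError(f"expected an absolute /eos/ path, got {eos_path!r}")
    # The URL keeps the path's leading slash after the host separator,
    # which produces the characteristic double slash.
    return f"root://{host}/{eos_path}"


print(eos_to_xrootd("/eos/user/q/qidong/ntuple_share"))
# → root://eosuser.cern.ch//eos/user/q/qidong/ntuple_share
```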

For now, I cannot reproduce the fStreamerImpl error: the analysis runs fine after a few tries. But with this type of error it’s hard to tell at first glance, and I may need to repeat the test many times to find the data race. I will keep you updated.
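Repeating the test many times can be automated. The sketch below is one possible harness, not the one actually used here: it runs a command a given number of times and counts how many runs print the error message and how many exit with a non-zero status. The names `count_failures` and `ERROR_MARKER` are hypothetical.

```python
import subprocess
import sys

# Substring of the error message reported in this thread.
ERROR_MARKER = "fStreamerImpl not properly initialized"


def count_failures(cmd, runs, marker=ERROR_MARKER):
    """Run `cmd` `runs` times; return (runs showing marker, runs with non-zero exit)."""
    marker_hits = 0
    nonzero_exits = 0
    for _ in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # The message may land on either stream, so check both.
        if marker in proc.stdout or marker in proc.stderr:
            marker_hits += 1
        if proc.returncode != 0:
            nonzero_exits += 1
    return marker_hits, nonzero_exits


# For the reproducer one might call, for example:
# hits, crashes = count_failures([sys.executable, "./rdf_error.py"], runs=200)
```

Counting the two conditions separately helps distinguish the specific fStreamerImpl failure from unrelated crashes.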

Hi @vpadulan,

Super! Hmm, the same analysis crashes almost every time with ROOT.EnableImplicitMT(24) on my local machine.

I will try to run it on lxplus and see if I get the same error.

Thanks again for helping!!

Cheers, Dong

Hi @qidong ,

Small update. I have been trying your reproducer on a Fedora 38 docker image, because it has the same compiler version you originally posted (GCC 13). In this case too, I ran the reproducer for a few hours (a couple hundred times) and never got the fStreamerImpl error.

Let’s try to explore more possibilities. How did you install ROOT? Did you compile it yourself perhaps? If so, could you post the full cmake command you used with all the options?

Cheers,
Vincenzo

Hi @vpadulan, All,

Thanks a lot for the update! Good to know that this might be an OS- or compiler-specific issue.

I also tested the same code on my M1 Pro Mac with a conda env (ROOT 6.28.04); things also run fine there, even with IMT on.

To answer your question:
I used the ROOT packaged by Arch Linux [1], which does apply some minor patches to ROOT. If I read the code correctly, the environment variables and CMake command they use are fairly standard:

    # specify some custom flags
    # needed by vc to link properly
    CUSTOM_CMAKE_FLAGS="-DTARGET_ARCHITECTURE:STRING=generic"
    # make sure it finds python
    CUSTOM_CMAKE_FLAGS+=" -DPYTHON_EXECUTABLE:PATH=/usr/bin/python"
    # need to set install prefix like so
    CUSTOM_CMAKE_FLAGS+=" -DINSTALL_PREFIX=/usr"
    export CUSTOM_CMAKE_FLAGS

    # update system flags
    # don't let ROOT play around with lib paths
    export CPPFLAGS="${CPPFLAGS} -DIS_RPATH_BUILD=1"
    # make sure pthread gets detected
    CUSTOM_COMPILER_FLAGS="${CPPFLAGS} -pthread"
    export CFLAGS="${CFLAGS} ${CUSTOM_COMPILER_FLAGS}"
    export CXXFLAGS="${CXXFLAGS} ${CUSTOM_COMPILER_FLAGS}"
    export LDFLAGS="${LDFLAGS} ${CUSTOM_COMPILER_FLAGS}"

    # go flags for built-in clang
    export CGO_LDFLAGS="${LDFLAGS}"
    export GOFLAGS="-buildmode=pie -trimpath -modcacherw"
    ...
    cmake -C "${srcdir}/settings.cmake" \
        ${CUSTOM_CMAKE_FLAGS} \
        "${srcdir}/${pkgbase}-${pkgver}"
    make
    ...

Cheers, Dong

[1] PKGBUILD · main · Arch Linux / Packaging / Packages / root · GitLab