AVX illegal instruction in job submitted from PyROOT

Dear PyROOT experts,

This is an almost beautifully subtle bug.

I am using Python to submit PyROOT jobs to my local batch system. This worked very well, except that some of the jobs were crashing on startup with:

*** Break *** illegal instruction

Jobs only crashed on certain older worker nodes. Identical jobs worked on other worker nodes. Jobs submitted from the bash command-line did not crash. Jobs only crashed if submitted from a Python script that also used ROOT (prior import ROOT). Jobs only crashed if environment variables were sent with the job (like HTCondor’s getenv=True setting), rather than (less conveniently) setting up the environment for each job in a wrapper script.

Those clues allowed me to track down the cause. I was submitting from a modern CPU that supports AVX instructions. My local batch system still has some old CPUs which don’t support AVX. It seems that on an AVX CPU, PyROOT defines an environment variable

EXTRA_CLING_ARGS=' -O2 -mavx'

If that environment variable is defined on a CPU that doesn’t support AVX, then PyROOT (or just plain ROOT) crashes with an illegal instruction, presumably when it tries to use one of the AVX instructions. The -mavx setting is transferred with the EXTRA_CLING_ARGS environment variable that goes to the job.

Once discovered, the workaround to this bug is quite easy: just remove EXTRA_CLING_ARGS from Python’s os.environ before submitting the job, or unset it in bash before starting the job. But both workarounds are correcting a hidden side-effect of PyROOT - it would be better (clearer, fewer random crashes in future) if PyROOT did not export the problematic environment variable in the first place. Alternatively, each ROOT job could decide for itself if the -mavx setting was required.
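The Python-side workaround can be sketched roughly as follows. This is a minimal sketch that assumes the job is submitted by running an external command from the same script; `scrubbed_environment` and `submit` are hypothetical helper names, not part of PyROOT or of any batch system.

```python
import os
import subprocess


def scrubbed_environment(env=None, drop=("EXTRA_CLING_ARGS",)):
    """Return a copy of env (default: os.environ) without the variables in drop."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items() if key not in drop}


def submit(command, env=None):
    """Run a submission command with the scrubbed environment (hypothetical helper)."""
    return subprocess.run(command, env=scrubbed_environment(env), check=True)
```

For example, `submit(["condor_submit", "job.sub"])`, where `job.sub` is a placeholder submit file. Working on a copy rather than deleting the key from `os.environ` leaves the submitting process itself unchanged.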

I hope that this problem is rare, as many job submissions don’t send the environment with the job, and fewer farms keep old worker nodes around. But if present the problem may be insidiously causing “random” crashes that users discount if rerunning the job lands on a newer CPU. So it would be good to fix.

Thanks,
Tim.

PS. To reproduce in ROOT 6.22/06, on a modern CPU (grep -c ' avx ' /proc/cpuinfo is non-zero):

. /cvmfs/sft.cern.ch/lcg/views/setupViews.sh LCG_99 x86_64-centos7-gcc8-opt
python
>>> import os, ROOT
>>> print ("EXTRA_CLING_ARGS='"+os.environ["EXTRA_CLING_ARGS"]+"'")
EXTRA_CLING_ARGS=' -O2 -mavx'

(in other versions of ROOT, it may require using some ROOT objects before EXTRA_CLING_ARGS is set).

To see the crash, on an old CPU (grep -c ' avx ' /proc/cpuinfo = 0):

EXTRA_CLING_ARGS=' -O2 -mavx' root -b -q -e 'TH1D h("x","x",10,0,10);'

PPS. I’ve tested and see this problem with 6.20/02, 6.22/00, and 6.22/06 with Python 2.7, 3.7, and 3.8.

ROOT Version: 6.22/06
Platform: x86_64-centos7
Compiler: gcc8


It seems to me that you are trying to use a ROOT binary distribution built for a “modern” architecture (e.g., Haswell with “modern” SSE / AVX support).
Expect problems if your computers do not support any of the “modern” features.
It’s not just about python, which will fail when importing ROOT. The standard ROOT binaries / libraries will break, too.
Any application that you link against ROOT libraries will die as soon as any “modern” SSE / AVX feature is used from inside of any ROOT library.
The only solution is to build ROOT so that its binaries will not require any “modern” SSE / AVX features.
Note: The standard binary distributions provided by the ROOT Team are fine (they are also available via the “cvmfs”). Maybe the “lcg/views” provides such binary distributions, too.

Hi @adye,

Yes, as @Wile_E_Coyote said in his post, it seems that your ROOT version was compiled with AVX support, so expect failures as soon as an AVX instruction is executed. Removing the -mavx in EXTRA_CLING_ARGS disables AVX for JITted code, which might work in your specific use case, but nothing guarantees that you will not experience failures when using other parts of ROOT.

Cheers,
J.

Hi @Wile_E_Coyote, @jalopezg,

Thanks for the encouraging feedback. Fortunately I don’t think it is specific to the ROOT build I am using. That’s fortunate, because these builds are used by much of the ATLAS software, which would have been failing on all our old Grid CPUs if it didn’t need EXTRA_CLING_ARGS to crash.

My previous testing was with the lcg/views builds, which don’t list -mavx in their configuration (root-config --cflags) and I have been able to run much larger ROOT and PyROOT programs on the old CPUs, as long as I don’t submit via PyROOT.

Thanks to your suggestion, I now tested using one of the standard binary distributions from the ROOT team:

. /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.06/x86_64-centos7-gcc48-opt/bin/thisroot.sh

This version shows the same behaviour I mentioned in my OP: EXTRA_CLING_ARGS=' -O2 -mavx' is set when I use PyROOT on a new CPU, and crashes on an old CPU on startup when that environment variable is set.

So I really think this behaviour is independent of the ROOT build settings. It is being set somewhere in PyROOT based on the runtime CPU flags.

Thanks,
Tim.

In this case, it seems to me that it is the “cppyy backend” that improperly [gs]ets the actual CPU features on the machine on which ROOT is started (currently, the “-mavx” should appear only if the actual CPU has “avx” in “/proc/cpuinfo”, which is independent of how ROOT itself was built).

But, looking at your first post again … it seems that this flag can be “inherited” from the “master node”, on which you start your HTCondor job (as you explicitly ask getenv=True, if you first "export EXTRA_CLING_ARGS=-O2", then it will run fine on any node).
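The same export can be done from inside the submitting Python script, as long as it happens before `import ROOT`. A minimal sketch, assuming the loader leaves an already-defined EXTRA_CLING_ARGS alone; `pin_cling_args` is a hypothetical helper name:

```python
import os


def pin_cling_args(env, value="-O2"):
    """Define EXTRA_CLING_ARGS only if it is missing; return the value in effect."""
    return env.setdefault("EXTRA_CLING_ARGS", value)


# Intended use (must run before `import ROOT`):
#     pin_cling_args(os.environ)
```

Taking the mapping as an argument keeps the helper easy to check on a plain dictionary before pointing it at `os.environ`.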

After a quick look at cppyy sources, it only sets EXTRA_CLING_ARGS if not already defined. Can you please provide the following information:

  1. EXTRA_CLING_ARGS is not previously defined in your environment (via .bashrc or any other means), i.e. $ echo $EXTRA_CLING_ARGS should be empty.

  2. The output of $ grep -c avx /proc/cpuinfo on a machine on which you are seeing the failures should be 0. If not, please attach the output.

Hi @jalopezg,

  1. Yes, $EXTRA_CLING_ARGS is not set before I run the first Python script. It is also not set for the batch job on the old worker node, unless passed as part of the environment from the PyROOT setting.
  2. Yes, the output for $ grep -c avx /proc/cpuinfo is 0 on the machine I see the illegal instruction on.

Thanks,
Tim.

cppyy sets EXTRA_CLING_ARGS in bindings/pyroot/cppyy/cppyy-backend/cling/python/cppyy_backend/loader.py:102

        if has_avx: CURRENT_ARGS += ' -mavx'

This can break binary compatibility of compiled and interpreted code. I would be in favor of removing this. But that needs @etejedor’s input and he’s only back in April.

I confirm that this also allows the submitted job to run OK. So this is a third obscure workaround :slight_smile:

Hi @Axel ,

Thanks for the news. Of course removing the setting altogether would fix my issue as well, so I’d be happy with that :slight_smile:

Thanks,
Tim.

It seems that cppyy only sets the -mavx after an occurrence of the “avx” string in /proc/cpuinfo. My guess was that it was contained as a substring somewhere else (not as part of the flags: line).
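To rule out a substring match, the check can be restricted to whole tokens on the flags: line. The following is a rough sketch of such a check, not the code cppyy uses; `cpu_flags` and `has_avx` are hypothetical names, and the text is assumed to follow the x86 Linux /proc/cpuinfo layout.

```python
def cpu_flags(cpuinfo_text):
    """Collect the tokens listed on every 'flags' line of /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only lines whose key is exactly "flags" contribute feature tokens.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_text):
    """True only if 'avx' appears as a whole token on a flags line."""
    return "avx" in cpu_flags(cpuinfo_text)
```

On a real machine this would be called as `has_avx(open("/proc/cpuinfo").read())`. Unlike a plain substring search, it ignores "avx" appearing inside another field or inside a longer token.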

In any case, we will have to wait for @etejedor’s input.

Thanks for reporting this problem!

Hi,

As @jalopezg said, the code in cppyy does not add the -mavx option if it can’t find avx in /proc/cpuinfo on the node where Python is running, so I don’t see how the option can be added if there is no avx in cpuinfo. Following the link below you can see the function that sets EXTRA_CLING_ARGS. Note how the fact of exporting this variable beforehand prevents cppyy from setting it (which explains why the advice given by @Wile_E_Coyote works):

Also, you said you tested this with ROOT 6.20/02 as well (this is old PyROOT, so no new cppyy) and you observed the same, is this correct? If so, this must be coming from somewhere else.

Note that serious problems have been reported in this thread when libraries built with and without “avx features” have been mixed:

@Axel So, the “cppyy backend” should be FORCED to use “avx features” if the ROOT itself was compiled with them (i.e., ROOT libraries). Otherwise, even if “avx” is present in the actual “/proc/cpuinfo”, they should NOT be used.

@Wile_E_Coyote: close, but the twist is not in the ROOT libs, assuming no AVX-specific implementation details are exposed (e.g. different memory layout for classes with or without AVX as is the case with Eigen as also exemplified in the other topic). The real issue is with the PCMs/PCH (incl. I/O dictionaries). Since for Clang part of the AVX implementation consists of header files, if -mavx is not used when building the PCMs/PCH, __AVX__ will be undefined and no amount of compiler flags allows enabling AVX from that point on. (The reverse is also true.)

Normal use of cppyy gets away with looking at CPU features b/c the PCH is (re)built after installation or on first use. In the case of ROOT, the first half of that feature is kept, but the other half is gone. Hence fun ensues.

Hi @etejedor,

My original problem was that the setting is exported to the environment, which causes problems if that environment is used on another machine without AVX. It would be better if this environment variable didn’t “leak” into subprocesses, but instead let them determine the setting for themselves.

My case was a rather unusual situation (though perhaps more serious because of how hit or miss, obscure, and hard to track down it is). @Wile_E_Coyote / @wlav’s issue sounds worse.

Sorry, I was mistaken about 6.20/02. ROOT 6.20/02 does crash on an old machine with -mavx in the environment. But PyROOT 6.20/02 does not set the EXTRA_CLING_ARGS environment variable, and that is the issue at hand.

Thanks,
Tim.