Dear PyROOT experts,
This is an almost beautifully subtle bug.
I am using Python to submit PyROOT jobs to my local batch system. This worked very well, except that some of the jobs were crashing on startup with:
*** Break *** illegal instruction
Jobs only crashed on certain older worker nodes. Identical jobs worked on other worker nodes. Jobs submitted from the bash command-line did not crash. Jobs only crashed if submitted from a Python script that also used ROOT (a prior import ROOT). Jobs only crashed if environment variables were sent with the job (like HTCondor’s getenv=True setting), rather than (less conveniently) setting up the environment for each job in a wrapper script.
Those clues allowed me to track down the cause. I was submitting from a modern CPU that supports AVX instructions. My local batch system still has some old CPUs which don’t support AVX. It seems that on an AVX CPU, PyROOT defines an environment variable:
EXTRA_CLING_ARGS=' -O2 -mavx'
If that environment variable is defined on a CPU that doesn’t support AVX, then PyROOT (or just plain ROOT) crashes with an illegal instruction, presumably when it tries to use one of the AVX instructions. The -mavx setting is transferred with the EXTRA_CLING_ARGS environment variable that goes to the job.
Once discovered, the workaround to this bug is quite easy: just remove EXTRA_CLING_ARGS from Python’s os.environ before submitting the job, or unset it in bash before starting the job. But both workarounds are correcting a hidden side-effect of PyROOT - it would be better (clearer, fewer random crashes in future) if PyROOT did not export the problematic environment variable in the first place. Alternatively, each ROOT job could decide for itself if the -mavx setting was required.
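For anyone hitting the same thing, here is a minimal sketch of the submit-side workaround. It assumes the job is submitted by running a command from Python, and that the submit tool copies the environment of the process that launches it (which is my understanding of HTCondor's getenv=True). The helper names clean_env and submit are made up for illustration, as is the condor_submit job.sub command in the usage comment.

```python
import os
import subprocess


def clean_env(environ=None, drop=("EXTRA_CLING_ARGS",)):
    """Return a copy of the environment without the variables named in `drop`."""
    source = os.environ if environ is None else environ
    return {key: value for key, value in source.items() if key not in drop}


def submit(command):
    """Run a submit command (hypothetical helper) with the cleaned environment."""
    # The child process never sees EXTRA_CLING_ARGS, so it cannot forward it.
    return subprocess.run(command, env=clean_env(), check=True)


# Hypothetical usage:
# submit(["condor_submit", "job.sub"])
```

Passing a cleaned copy rather than deleting the variable from os.environ leaves the submitting Python session itself untouched.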
I hope that this problem is rare, as many job submissions don’t send the environment with the job, and fewer farms keep old worker nodes around. But if present the problem may be insidiously causing “random” crashes that users discount if rerunning the job lands on a newer CPU. So it would be good to fix.
Thanks,
Tim.
PS. To reproduce in ROOT 6.22/06, on a modern CPU (grep -c ' avx ' /proc/cpuinfo is non-zero):
. /cvmfs/sft.cern.ch/lcg/views/setupViews.sh LCG_99 x86_64-centos7-gcc8-opt
python
>>> import os, ROOT
>>> print ("EXTRA_CLING_ARGS='"+os.environ["EXTRA_CLING_ARGS"]+"'")
EXTRA_CLING_ARGS=' -O2 -mavx'
(in other versions of ROOT, it may require using some ROOT objects before EXTRA_CLING_ARGS is set).
To see the crash, on an old CPU (grep -c ' avx ' /proc/cpuinfo = 0):
EXTRA_CLING_ARGS=' -O2 -mavx' root -b -q -e 'TH1D h("x","x",10,0,10);'
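A job could also protect itself against this mismatch before ROOT is imported. The following is only a sketch of that idea, not anything ROOT provides: it assumes a Linux x86 node where /proc/cpuinfo has a "flags" line, and it assumes that whitespace in EXTRA_CLING_ARGS is not significant. The function names cpu_flags and drop_unsupported_avx are hypothetical.

```python
import os


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def drop_unsupported_avx(environ, flags):
    """Remove -mavx from EXTRA_CLING_ARGS in `environ` if the CPU lacks AVX."""
    args = environ.get("EXTRA_CLING_ARGS")
    if args is None or "avx" in flags:
        return  # nothing to fix
    kept = [arg for arg in args.split() if arg != "-mavx"]
    if kept:
        environ["EXTRA_CLING_ARGS"] = " ".join(kept)
    else:
        del environ["EXTRA_CLING_ARGS"]


# Hypothetical usage at the top of a job script, before "import ROOT":
# with open("/proc/cpuinfo") as cpuinfo:
#     drop_unsupported_avx(os.environ, cpu_flags(cpuinfo.read()))
```

This only handles the -mavx flag seen here; other instruction-set flags (if PyROOT ever sets them) would need the same treatment.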
PPS. I’ve tested and see this problem with 6.20/02, 6.22/00, and 6.22/06 with Python 2.7, 3.7, and 3.8.
ROOT Version: 6.22/06
Platform: x86_64-centos7
Compiler: gcc8