Valgrind detects illegal opcode on import

aharel · May 6, 2013, 8:35am

Hi,

While trying to debug memory corruption I narrowed down one of the scarier complaints from Valgrind to an unrelated, and seemingly harmless issue. That is, the program runs fine. Which according to Valgrind is impossible.

[quote]$ valgrind --tool=memcheck --suppressions=/home/aharel/valgrind-root534.supp.txt --suppressions=/home/aharel/valgrind-python.supp.txt --trace-children=yes --track-origins=yes --gen-suppressions=yes python test0.py 2>&1 | tee -a text/valgrind_test0.txt
==10739== Memcheck, a memory error detector
…
==10748== Command: /lib/ld-linux.so.2 --verify /usr/bin/python2.7
==10748==
vex x86->IR: unhandled instruction bytes: 0xC5 0xF9 0x6E 0x40
==10748== valgrind: Unrecognised instruction at address 0x11d0e0.
==10748== at 0x11D0E0: _dl_sysdep_start (dl-sysdep.c:160)
==10748== by 0x10CC22: _dl_start (rtld.c:337)
==10748== by 0x1088F6: ??? (in /lib32/ld-2.15.so)
==10748== Your program just tried to execute an instruction that Valgrind
==10748== did not recognise. There are two possible reasons for this.
==10748== 1. Your program has a bug and erroneously jumped to a non-code
==10748== location. If you are running Memcheck and you just saw a
==10748== warning about a bad jump, it’s probably your program’s fault.
==10748== 2. The instruction is legitimate but Valgrind doesn’t handle it,
==10748== i.e. it’s Valgrind’s fault. If you think this is the case or
==10748== you are not sure, please let us know and we’ll try to fix it.
==10748== Either way, Valgrind will now raise a SIGILL signal which will
==10748== probably kill your program.
==10748==
==10748== Process terminating with default action of signal 4 (SIGILL)
==10748== Illegal opcode at address 0x11D0E0
==10748== at 0x11D0E0: _dl_sysdep_start (dl-sysdep.c:160)
==10748== by 0x10CC22: _dl_start (rtld.c:337)
==10748== by 0x1088F6: ??? (in /lib32/ld-2.15.so)
==10748==
==10748== HEAP SUMMARY:
==10748== in use at exit: 0 bytes in 0 blocks
==10748== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==10748==
==10748== All heap blocks were freed – no leaks are possible
==10748==
==10748== For counts of detected and suppressed errors, rerun with: -v
==10748== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
…
Importing R:
Importing SU:
Done
…
=[/quote]

This was not only reproducible, but also stable enough that I could narrow it down to a few lines of innocuous code. This was the function whose import triggered the complaint. Oddly, the main programs printouts appeared after the Valgrind complaints.

test0.py is:

#!/usr/bin/python2.7
print 'Importing R:'
import ROOT as R
print 'Importing SU:'
import test_su as SU
print 'Done'

test_su.py is:

import ROOT as R

def set_graph_style( graph, color = R.kRed+1, marker = R.kOpenCircle ):
    graph.SetMarkerStyle( marker )
    graph.SetMarkerColor( color )
    graph.SetLineColor( color )

puzzled,
Amnon

wlav · May 6, 2013, 6:56pm

Amnon,

no SIGILL on my side when running this.

To figure out, probably need a lot more info. Such as: version of valgrind, type of machine this is run on, whether you compiled ROOT yourself or downloaded it, and whether the python versions match up.

Cheers,
Wim

aharel · May 7, 2013, 8:21am

Hi Wim,

Thanks for looking into it. Here are the details.

Valgrind is valgrind-3.8.1
machine is (from uname -a):
Linux ahx 3.7.9-gentoo #10 SMP Thu Mar 21 14:17:20 IST 2013 x86_64 Intel® Core™ i7-2620M CPU @ 2.70GHz GenuineIntel GNU/Linux
I compiled ROOT myself. It is not the official Gentoo package (which also would’ve been compiled). I downloaded root_v5.34.02.source.tar.gz and installed it as a user (I also changed a couple of classes that are irrelevant here, TUnfold and RooKeysPdf).
Python version used is 2.7.3. ROOT was built against that version (looked through the config.log, which only cares that its 2.7).

When I run this without valgrind it works smoothly (I used this in a more complex, setting, of course), with no SIGILL. As the valgrind error says, it could be an error in valgrind. At the very least I wanted to start a paper trail, so the next person who tries to solve a memory corruption using valgrind and runs into this will have fair warning.

BTW: I solved the memory corruption by following more closely my pyroot golden rule “never let ROOT’s current directory go out of scope”.

cheers,
Amnon

wlav · May 7, 2013, 6:30pm

Amnon,

okay, so the chip is 64b (and hence presumably so is /usr/bin/python2.7), but the error originates from a lib located in /lib32? Was valgrind itself build from source, and if so, which build target was selected? Also, could you check whether any of the ROOT libs are perhaps linked against libs in /lib32?

Cheers,
Wim

aharel · May 7, 2013, 7:34pm

Hi Wim,

Yes, valgrind was build from source (Gentoo). I rebuilt it to check. Here are some key outputs:

[quote]./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --libdir=/usr/lib64 --disable-dependency-tracking --without-mpicc
…
Primary build target: AMD64_LINUX
Secondary build target: X86_LINUX
[/quote]

The problem seems to be with the use of /lib/ld-linux.so.2:
~ $ ls -l /lib/ld-linux.so.2
lrwxrwxrwx 1 root root 22 May 5 21:42 /lib/ld-linux.so.2 → …/lib32/ld-linux.so.2

Otherwise, things look to be pure 64bit:
~ ls -l /usr/lib lrwxrwxrwx 1 root root 5 Nov 16 2011 /usr/lib -> lib64 ~ ldd which root
linux-vdso.so.1 (0x00007fff97193000)
libXpm.so.4 => /usr/lib64/libXpm.so.4 (0x00007f3c2c717000)
libX11.so.6 => /usr/lib64/libX11.so.6 (0x00007f3c2c3d8000)
libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libstdc++.so.6 (0x00007f3c2c0d3000)
libc.so.6 => /lib64/libc.so.6 (0x00007f3c2bd28000)
libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x00007f3c2bb07000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f3c2b903000)
libm.so.6 => /lib64/libm.so.6 (0x00007f3c2b613000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3c2c929000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.3/libgcc_s.so.1 (0x00007f3c2b3fd000)
libXau.so.6 => /usr/lib64/libXau.so.6 (0x00007f3c2b1f9000)
libXdmcp.so.6 => /usr/lib64/libXdmcp.so.6 (0x00007f3c2aff3000)
~ ldd `which python` linux-vdso.so.1 (0x00007fffd19ff000) libc.so.6 => /lib64/libc.so.6 (0x00007fdbbdf39000) /lib64/ld-linux-x86-64.so.2 (0x00007fdbbe2e4000) ~ ldd which valgrind
linux-vdso.so.1 (0x00007fffd1531000)
libc.so.6 => /lib64/libc.so.6 (0x00007f9f729f4000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9f72d9f000)

So I guess the question is: why did valgrind decide to use /lib/ld-linux.so.2?
Looking at valgrind logs, it also uses /usr/bin/ldd or /lib64/ld-linux-x86-64.so.2
And what is it about the content of test_su that triggers this?

Just to double check, I created another similar test:

~/d0/afb $ diff test0.py test1.py
5c5
< import test_su as SU
---
> import test_su1 as SU
~/d0/afb $ less test_su1.py
import ROOT as R

def save_canvas( name = 'tmp', fig_out_dir = 'fig', formats = [ 'eps', 'png' ]  ):
    can = R.gPad.GetCanvas()
    can.cd()
    for form in formats:
        can.SaveAs( '{}/{}.{}'.format( fig_out_dir, name, form ) )

This starts the same way, but then valgrind does not issue a:

but proceeds directly to run the python program.
valgrind_test0_new.txt (13.1 KB)
valgrind_test1.txt (8.17 KB)

wlav · May 7, 2013, 10:03pm

Amnon,

something (not valgrind) seems to run both the 32b and the 64b versions. If I run the 32b version, I get a return value of 1; if I run the 64b version, I get a return value of 0. So, it looks to me like sub-programs are started by trying one ld-linux after another until one succeeds, and valgrind is simply tracing these children as asked.

In any case, what wasn’t clear from your original message but is from these logs: the SIGILL does not originate from running the Python/PyROOT program itself, but from the running of a sub-process.

If I run the python program under gdb, and follow children on fork, I also see /lib/ld-linux run (and gdb complaining about it being 32b). If I catch fork, I see:#0 0x00007ffff752b9c6 in fork () from /lib64/libc.so.6 #1 0x00007ffff74eabd3 in _IO_proc_open@@GLIBC_2.2.5 () from /lib64/libc.so.6 #2 0x00007ffff74eaf41 in popen@@GLIBC_2.2.5 () from /lib64/libc.so.6 #3 0x00007ffff53fa2c5 in TUnixSystem::GetLinkedLibraries() () from /home/wlav/rootdev/root/lib/libCore.so #4 0x00007ffff5349d53 in TSystem::GetLibraries(char const*, char const*, bool) () from /home/wlav/rootdev/root/lib/libCore.so #5 0x00007ffff534d5e8 in TSystem::Load(char const*, char const*, bool) () from /home/wlav/rootdev/root/lib/libCore.so #6 0x00007ffff5575fbf in G__G__Base2_282_0_122(G__value*, char const*, G__param*, int) () from /home/wlav/rootdev/root/lib/libCore.so #7 0x00007ffff492f2ff in Cint::G__CallFunc::Execute(void*) () from /home/wlav/rootdev/root/lib/libCint.so #8 0x00007ffff1eb74b2 in stlLoader() () from /home/wlav/rootdev/v5-34-00-patches/cint/cint/stl/vector.dll
And there’s your culprit and the explanation …

Note that what the code does is perfectly fine, just that apparently valgrind (nor gdb unless told to re-exec itself) can switch dynamically from 64b to 32b tracing.

Cheers,
Wim

aharel · May 7, 2013, 10:19pm

Hi Wim,

Thanks for looking into this.
I’m not sure I understood who the culprit is
(python is launching a subprogram for parsing and doing it badly, based on the content of the file?)
But this “trying all loaders” approach should be harmless, and I guess that’s the sufficient explanation.

So thanks again.
Amnon

wlav · May 8, 2013, 5:38pm

Amnon,

python is not launching a program, TUnixSystem::GetLinkedLibraries() is. But that’s perfectly okay: a 64b process may launch a 32b one, as after the fork, exec will replace the whole image.

The problem appears to be that valgrind when tracing a 64b program, can then not trace a 32b child. Note that here, the image is not completely replaced as valgrind traps the fork (it has to, as valgrind would otherwise be removed from the child process and then it can’t trace it). With gdb, you’d run into the same problem, except that gdb has the option to start a new gdb process on the sub-process, but I’m not aware of a similar option with valgrind.

Cheers,
Wim