This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug c/66275] New: __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc generates incorrect code
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Sun, 24 May 2015 15:27:30 +0000
- Subject: [Bug c/66275] New: __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc generates incorrect code
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275
Bug ID: 66275
Summary: __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc
generates incorrect code
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Created attachment 35615
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35615&action=edit
main.c, asm-test.h, and less-stripped-down intrin-pinsrw.c
gcc (Ubuntu 4.9.2-10ubuntu13) 4.9.2
w64 mingw gcc is mis-compiling functions declared with
__attribute__((sysv_abi)). Registers holding args are getting clobbered with
temp values, but then used as if they still held function args.
This code is mis-compiled by:
x86_64-w64-mingw32-gcc -g -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c ... main.c
cat > intrin-pinsrw.c <<"EOF"
#include <emmintrin.h>
#include <stdint.h>

#define SYSV_ABI __attribute__((sysv_abi))

union wordbytes { uint16_t w; uint8_t b[2]; };

void SYSV_ABI rs_process_pinsrw_intrin(void* dstvoid, const void* srcvoid,
                                       size_t size, const uint32_t* LH)
{
        const uint64_t *src = srcvoid;
        __m128i d, l, h;
        __m128i *dst = dstvoid;

        for (size_t i = 0; i < size/sizeof(d); i += 1) {
                uint64_t s0 = src[i*2 + 0];
#define LO(x) ((uint8_t)x)
#define HI(x) (( (union wordbytes) ((uint16_t)x) ).b[1])
                l = _mm_cvtsi32_si128( LH[      LO(s0)] );
                h = _mm_cvtsi32_si128( LH[256 + HI(s0)] );
                s0 >>= 16;
                // commented-out code that wasn't needed to trigger the bug
                d = _mm_xor_si128(l, h);
                d = _mm_xor_si128(d, dst[i]);
                _mm_store_si128(&dst[i], d);
        }
}
EOF
x86_64-w64-mingw32-gcc -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c -S
search for "si" in the output, and notice how %rsi is used as a pointer to load
src data, but also used as a tmp.
.L5:
        movq    (%rsi,%r10), %rdx
        movzbl  %dl, %esi                  ## clobbered here
        movzbl  %dh, %eax
        movd    (%rcx,%rsi,4), %xmm8
        movd    1024(%rcx,%rax,4), %xmm9
        pxor    %xmm8, %xmm9
        pxor    (%rdi,%r10), %xmm9
        movaps  %xmm9, (%rdi,%r10)
        movq    16(%rsi,%r10), %rdx        ## not restored after use as a tmp
        ...
%rdi is also clobbered and then used as a pointer again, later in the loop.
In the AMD64 sysv ABI, %rdi holds the 1st arg, %rsi holds the 2nd, %rdx holds
the 3rd, and %rcx holds the 4th.
(and see attached for a less-stripped-down and a main.c to call it)
-funroll-loops is needed to trigger the problem on this reduced test-case. It
might not be needed with a bigger function that would run out of registers
without unrolling. Even the less-stripped-down version in the attachment runs
fine under WINE when compiled without -funroll-loops.
If testing by actually running the code with WINE, note that wine and wine64
seem to have incompatible requirements for the contents of ~/.wine. I
eventually made a wrapper like:
#!/bin/sh
WINEPREFIX=~/.wine64 /usr/bin/wine64 "$@"
For anyone curious why I had this code in the first place:
This code is from par2 (the reed-solomon inner-loop that multiplies a buffer of
GF16 values by a constant factor, using a pre-computed lookup table). See
https://github.com/pcordes/par2-asm-experiments
I have some functions written directly in asm. They're written for the Linux
SysV ABI. I could make an alternate entry point for win ABI, with extra mov
instructions, and push/pop the extra regs that are callee-save in MS-ABI but
not SYSV. Or I could just declare their prototypes with
__attribute__((sysv_abi)), and gcc will call them properly.
I tried making a version of the function using intrinsics, so of course my test
harness needs to call it, too. Since I call via function pointer, I had to
declare it the same way.
It was working fine until a recent change to the code of the function. Old
version that didn't tickle this bug (partial diff):
-    uint64_t s0, s1;
-    for (size_t i = 0; i < size ; i+=16) {
-        s0 = *((uint64_t*) ((char*)srcvoid + i));    // byte address math, 64bit load
-        s1 = *((uint64_t*) ((char*)srcvoid+8 + i));  // byte address math, 64bit load
-        d = _mm_xor_si128(d, *(dst+i/16));
-        _mm_store_si128(dst+i/16, d);
I wrote it in that ugly way initially because I was basically porting my ASM
code to intrinsics. BTW, the results were terrible. gcc generates
ridiculously bad code for getting the src bytes into zero-extended 64bit regs,
for use as scaled-offsets in an address, compared to
        movzx  %dl, %eax
        movzx  %dh, %ebx
        shr    $16, %rdx
        use rax/rbx
        movzx  %dl, %eax
        movzx  %dh, %ebx
        shr    $16, %rdx
        use rax/rbx
...
gcc never just shifts the reg holding src data. Instead it copies, and shifts
the copy by $16, $32, or $48.
gcc's code is about 35% slower than the hand-written version, even letting it
use avx so it doesn't emit useless movdqa instructions when it doesn't realize
that an old value is no longer needed. Just un-comment the mostly-commented
loop body in the testcase (attachment version).
Anyway, slow code is off-topic, this bug is about wrong code!