This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[Bug c/66275] New: __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc generates incorrect code

            Bug ID: 66275
           Summary: __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc
                    generates incorrect code
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Created attachment 35615
main.c, asm-test.h, and less-stripped-down intrin-pinsrw.c

gcc (Ubuntu 4.9.2-10ubuntu13) 4.9.2

w64 mingw gcc is mis-compiling functions declared with
__attribute__((sysv_abi)).  Registers holding args are getting clobbered with
temp values, but then used as if they still held function args.

This code is mis-compiled by
x86_64-w64-mingw32-gcc -g -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c
... main.c

cat > intrin-pinsrw.c <<"EOF"
#include <emmintrin.h>
#include <stdint.h>

#define SYSV_ABI __attribute__((sysv_abi))

union wordbytes { uint16_t w; uint8_t b[2]; };

void SYSV_ABI rs_process_pinsrw_intrin(void* dstvoid, const void* srcvoid,
size_t size, const uint32_t* LH)
        const uint64_t *src = srcvoid;
        __m128i d, l, h;
        __m128i *dst = dstvoid;

        for (size_t i = 0; i < size/sizeof(d) ; i+=1) {
                uint64_t s0 = src[i*2 + 0];

#define LO(x) ((uint8_t)x)
#define HI(x) (( (union wordbytes) ((uint16_t)x) ).b[1])

                l = _mm_cvtsi32_si128( LH[      LO(s0)] );
                h = _mm_cvtsi32_si128( LH[256 + HI(s0)] );
                s0 >>= 16;
 // commented-out code that wasn't needed to trigger the bug
                d = _mm_xor_si128(l, h);
                d = _mm_xor_si128(d, dst[i]);
                _mm_store_si128(&dst[i], d);
x86_64-w64-mingw32-gcc -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c -S

search for "si" in the output, and notice how %rsi is used as a pointer to load
src data, but also used as a tmp.

        movq    (%rsi,%r10), %rdx
        movzbl  %dl, %esi              ## clobbered here
        movzbl  %dh, %eax
        movd    (%rcx,%rsi,4), %xmm8
        movd    1024(%rcx,%rax,4), %xmm9
        pxor    %xmm8, %xmm9
        pxor    (%rdi,%r10), %xmm9
        movaps  %xmm9, (%rdi,%r10)
        movq    16(%rsi,%r10), %rdx    ## not restored after use as a tmp

%rdi is also clobbered and then used as a pointer again, later in the loop.

In the AMD64 sysv ABI, %rdi holds the 1st arg, %rsi holds the 2nd, %rdx holds
the 3rd, and %rcx holds the 4th.  

(and see attached for a less-stripped-down and a main.c to call it)

-funroll-loops is needed to trigger the problem on this reduced test-case.  It
might not be needed with a bigger function that would run out of registers
without unrolling.  Even the less-stripped-down version in the attachment runs
fine under WINE when compiled without -funroll-loops.

If testing by actually running the code with WINE, note that wine and wine64
seem to have incompatible requirements for the contents of ~/.wine.  I
eventually made a wrapper like:
 WINEPREFIX=~/.wine64 /usr/bin/wine64 "$@"

For anyone curious why I had this code in the first place:
This code is from par2 (the reed-solomon inner-loop that multiplies a buffer of
GF16 values by a constant factor, using a pre-computed lookup table).  See

 I have some functions written directly in asm.  They're written for the Linux
SysV ABI.  I could make an alternate entry point for win ABI, with extra mov
instructions, and push/pop the extra regs that are callee-save in MS-ABI but
not SYSV.  Or I could just declare their prototypes with
__attribute__((sysv_abi)), and gcc will call them properly.

I tried making a version of the function using intrinsics, so of course my test
harness needs to call it, too.  Since I call via function pointer, I had to
declare it the same way.

It was working fine until a recent change to the code of the function.  Old
version that didn't tickle this bug (partial diff):

-       uint64_t s0, s1;
-       for (size_t i = 0; i < size ; i+=16) {
-               s0 = *((uint64_t*) ((char*)srcvoid + i));       // byte address
math, 64bit load
-               s1 = *((uint64_t*) ((char*)srcvoid+8 + i));     // byte address
math, 64bit load

-               d = _mm_xor_si128(d, *(dst+i/16));
-               _mm_store_si128(dst+i/16, d);

I wrote it in that ugly way initially because I was basically porting my ASM
code to intrinsics.  BTW, the results were terrible.  gcc generates
ridiculously bad code for getting the src bytes into zero-extended 64bit regs,
for use as scaled-offsets in an address, compared to

movzx %dl, %eax
movxz %dh, %ebx
shr   $16, %rdx
use rax/rbx
movzx %dl, %eax
movxz %dh, %ebx
shr   $16, %rdx
use rax/rbx

 gcc never just shifts the reg holding src data.  Instead if copies, and shifts
the copy by $16, $32, or $48.

 gcc's code is about 35% slower than the hand-written version, even letting it
use avx so it doesn't emit useless movdqa instructions when it doesn't realize
that an old value is no longer needed.  Just un-comment the mostly-commented
loop body in the testcase (attachment version).

 Anyway, slow code is off-topic, this bug is about wrong code!

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]