Bug 66275 - __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc generates incorrect code
Summary: __attribute__((sysv_abi)) with x86_64-w64-mingw32-gcc generates incorrect code
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.9.2
: P3 normal
Target Milestone: 4.8.5
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2015-05-24 15:27 UTC by Peter Cordes
Modified: 2015-06-08 21:01 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-w64-mingw32-gcc
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-05-25 00:00:00


Attachments
main.c, asm-test.h, and less-stripped-down intrin-pinsrw.c (3.18 KB, application/zip)
2015-05-24 15:27 UTC, Peter Cordes
Details
Proposed patch (719 bytes, patch)
2015-06-01 19:55 UTC, Uroš Bizjak
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Cordes 2015-05-24 15:27:30 UTC
Created attachment 35615 [details]
main.c, asm-test.h, and less-stripped-down intrin-pinsrw.c

gcc (Ubuntu 4.9.2-10ubuntu13) 4.9.2

w64 mingw gcc is mis-compiling functions declared with __attribute__((sysv_abi)).  Registers holding args are getting clobbered with temp values, but then used as if they still held function args.

This code is mis-compiled by
x86_64-w64-mingw32-gcc -g -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c ... main.c


cat > intrin-pinsrw.c <<"EOF"
#include <emmintrin.h>
#include <stdint.h>

#define SYSV_ABI __attribute__((sysv_abi))


union wordbytes { uint16_t w; uint8_t b[2]; };

void SYSV_ABI rs_process_pinsrw_intrin(void* dstvoid, const void* srcvoid, size_t size, const uint32_t* LH)
{
	const uint64_t *src = srcvoid;
	__m128i d, l, h;
	__m128i *dst = dstvoid;

	for (size_t i = 0; i < size/sizeof(d) ; i+=1) {
		uint64_t s0 = src[i*2 + 0];

#define LO(x) ((uint8_t)x)
#define HI(x) (( (union wordbytes) ((uint16_t)x) ).b[1])

		l = _mm_cvtsi32_si128( LH[      LO(s0)] );
		h = _mm_cvtsi32_si128( LH[256 + HI(s0)] );
		s0 >>= 16;
 // commented-out code that wasn't needed to trigger the bug
		d = _mm_xor_si128(l, h);
		d = _mm_xor_si128(d, dst[i]);
		_mm_store_si128(&dst[i], d);
	}
}
EOF
x86_64-w64-mingw32-gcc -Wall -funroll-loops -O3 -std=gnu11 intrin-pinsrw.c -S

search for "si" in the output, and notice how %rsi is used as a pointer to load src data, but also used as a tmp.

.L5:
        movq    (%rsi,%r10), %rdx
        movzbl  %dl, %esi              ## clobbered here
        movzbl  %dh, %eax
        movd    (%rcx,%rsi,4), %xmm8
        movd    1024(%rcx,%rax,4), %xmm9
        pxor    %xmm8, %xmm9
        pxor    (%rdi,%r10), %xmm9
        movaps  %xmm9, (%rdi,%r10)
        movq    16(%rsi,%r10), %rdx    ## not restored after use as a tmp
        ...

%rdi is also clobbered and then used as a pointer again, later in the loop.

In the AMD64 sysv ABI, %rdi holds the 1st arg, %rsi holds the 2nd, %rdx holds the 3rd, and %rcx holds the 4th.  

(and see attached for a less-stripped-down and a main.c to call it)

-funroll-loops is needed to trigger the problem on this reduced test-case.  It might not be needed with a bigger function that would run out of registers without unrolling.  Even the less-stripped-down version in the attachment runs fine under WINE when compiled without -funroll-loops.


If testing by actually running the code with WINE, note that wine and wine64
seem to have incompatible requirements for the contents of ~/.wine.  I eventually made a wrapper like:
 #!/bin/sh
 WINEPREFIX=~/.wine64 /usr/bin/wine64 "$@"


For anyone curious why I had this code in the first place:
This code is from par2 (the reed-solomon inner-loop that multiplies a buffer of GF16 values by a constant factor, using a pre-computed lookup table).  See https://github.com/pcordes/par2-asm-experiments

 I have some functions written directly in asm.  They're written for the Linux SysV ABI.  I could make an alternate entry point for win ABI, with extra mov instructions, and push/pop the extra regs that are callee-save in MS-ABI but not SYSV.  Or I could just declare their prototypes with __attribute__((sysv_abi)), and gcc will call them properly.

I tried making a version of the function using intrinsics, so of course my test harness needs to call it, too.  Since I call via function pointer, I had to declare it the same way.

It was working fine until a recent change to the code of the function.  Old version that didn't tickle this bug (partial diff):

-	uint64_t s0, s1;
-	for (size_t i = 0; i < size ; i+=16) {
-		s0 = *((uint64_t*) ((char*)srcvoid + i));	// byte address math, 64bit load
-		s1 = *((uint64_t*) ((char*)srcvoid+8 + i));	// byte address math, 64bit load

- 		d = _mm_xor_si128(d, *(dst+i/16));
-		_mm_store_si128(dst+i/16, d);

I wrote it in that ugly way initially because I was basically porting my ASM code to intrinsics.  BTW, the results were terrible.  gcc generates ridiculously bad code for getting the src bytes into zero-extended 64bit regs, for use as scaled-offsets in an address, compared to

movzx %dl, %eax
movxz %dh, %ebx
shr   $16, %rdx
use rax/rbx
movzx %dl, %eax
movxz %dh, %ebx
shr   $16, %rdx
use rax/rbx
 ...

 gcc never just shifts the reg holding src data.  Instead if copies, and shifts the copy by $16, $32, or $48.

 gcc's code is about 35% slower than the hand-written version, even letting it use avx so it doesn't emit useless movdqa instructions when it doesn't realize that an old value is no longer needed.  Just un-comment the mostly-commented loop body in the testcase (attachment version).

 Anyway, slow code is off-topic, this bug is about wrong code!
Comment 1 Uroš Bizjak 2015-05-25 09:58:41 UTC
This problem can also be seen on x86_64-linux with gcc-6.0 and

gcc -O3 -funroll-loops -mabi=ms

In _.ce3 dump, we have following sequence:

  ...
   73: L73:
   72: NOTE_INSN_BASIC_BLOCK 5
   44: dx:DI=[si:DI+ax:DI]
      REG_EQUIV [si:DI+ax:DI]
   45: r9:DI=zero_extend(dx:QI)
   46: NOTE_INSN_DELETED
   47: xmm0:V4SI=vec_merge(vec_duplicate([r9:DI*0x4+cx:DI]),const_vector,0x1)
      REG_DEAD r9:DI
   50: NOTE_INSN_DELETED
   51: NOTE_INSN_DELETED
   52: dx:DI=zero_extract(dx:DI,0x8,0x8)
   53: NOTE_INSN_DELETED
   54: NOTE_INSN_DELETED
   55: xmm1:V4SI=vec_merge(vec_duplicate([dx:DI*0x4+cx:DI+0x400]),const_vector,0x1)
      REG_DEAD dx:DI
   56: xmm0:V2DI=xmm0:V2DI^xmm1:V2DI
      REG_DEAD xmm1:V2DI
   57: xmm0:V2DI=xmm0:V2DI^[di:DI+ax:DI]
      REG_EQUIV [di:DI+ax:DI]
   58: [di:DI+ax:DI]=xmm0:V2DI
      REG_DEAD xmm0:V2DI
  183: dx:DI=[si:DI+ax:DI+0x10]
      REG_EQUIV [si:DI+ax:DI+0x10]
  ...

Please note (insn 183), a consumer of SI register.

In _.rnreg dump, the above sequence is transformed to:

  ...
   73: L73:
   72: NOTE_INSN_BASIC_BLOCK 5
   44: dx:DI=[si:DI+r10:DI]
      REG_EQUIV [si:DI+ax:DI]
   45: si:DI=zero_extend(dx:QI)
   46: NOTE_INSN_DELETED
   47: xmm8:V4SI=vec_merge(vec_duplicate([si:DI*0x4+cx:DI]),const_vector,0x1)
      REG_DEAD r9:DI
   50: NOTE_INSN_DELETED
   51: NOTE_INSN_DELETED
   52: ax:DI=zero_extract(dx:DI,0x8,0x8)
   53: NOTE_INSN_DELETED
   54: NOTE_INSN_DELETED
   55: xmm9:V4SI=vec_merge(vec_duplicate([ax:DI*0x4+cx:DI+0x400]),const_vector,0x1)
      REG_DEAD dx:DI
   56: xmm8:V2DI=xmm8:V2DI^xmm9:V2DI
      REG_DEAD xmm1:V2DI
   57: xmm8:V2DI=xmm8:V2DI^[di:DI+r10:DI]
      REG_EQUIV [di:DI+ax:DI]
   58: [di:DI+r10:DI]=xmm8:V2DI
      REG_DEAD xmm0:V2DI
  183: dx:DI=[si:DI+r10:DI+0x10]
      REG_EQUIV [si:DI+ax:DI+0x10]
  ...

So, (insn 45) now indeed clobbers SI register that is needed by (insn 183).

Confirmed as RTL optimization problem, specifically, a rnreg pass problem.
Comment 2 Uroš Bizjak 2015-05-26 11:09:40 UTC
This is RTL infrastructure problem, not target specific.
Comment 3 Uroš Bizjak 2015-06-01 19:52:00 UTC
At the end of the day, this is target problem.

From _.dfinit dump:

;;  entry block defs     1 [dx] 2 [cx] 6 [bp] 7 [sp] 16 [argp] 20 [frame] 21 [xmm0] 22 [xmm1] 23 [xmm2] 24 [xmm3] 25 [xmm4] 26 [xmm5] 27 [xmm6] 28 [xmm7] 37 [r8] 38 [r9]

Registers DI and SI should also be listed here.

I have a patch.
Comment 4 Uroš Bizjak 2015-06-01 19:55:52 UTC
Created attachment 35670 [details]
Proposed patch

Patch that fixes ix86_function_arg_regno_p and ix86_function_value_regno_p to follow the ABI attribute.
Comment 5 Uroš Bizjak 2015-06-01 20:02:32 UTC
Kai, can you please test the patch on x86_64-w64-mingw32 ?

The patch has been bootstrapped and regression tested on x86_64-linux-gnu, but there is the comment which I'm not sure still applies:

   /* TODO: The function should depend on current function ABI but
      builtins.c would need updating then. Therefore we use the
      default ABI.  */

The patch fixes ix86_function_arg_regno_p and ix86_function_value_regno_p to follow the ABI attribute. So, if there is ABI attribute specified, it won't just change mid-compilation. If cfun is non-NULL, we can look into cfun->machine->call_abi, to determine current ABI.

With the proposed patch, we get:

;;  entry block defs     0 [ax] 1 [dx] 2 [cx] 4 [si] 5 [di] 6 [bp] 7 [sp] 16 [argp] 20 [frame] 21 [xmm0] 22 [xmm1] 23 [xmm2] 24 [xmm3] 25 [xmm4] 26 [xmm5] 27 [xmm6] 28 [xmm7] 37 [r8] 38 [r9]

And AFAICS the assembly looks correct.
Comment 6 uros 2015-06-03 15:47:14 UTC
Author: uros
Date: Wed Jun  3 15:46:41 2015
New Revision: 224094

URL: https://gcc.gnu.org/viewcvs?rev=224094&root=gcc&view=rev
Log:
	PR target/66275
	* config/i386/i386.c (ix86_function_arg_regno): Use ix86_cfun_abi
	to determine current function ABI.
	(ix86_function_value_regno_p): Ditto.

testsuite/ChangeLog:

	PR target/66275
	* gcc.target/i386/pr66275.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr66275.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 7 Uroš Bizjak 2015-06-03 18:05:48 UTC
Fixed in mainline, will be backported to gcc-5 branch.
Comment 8 Uroš Bizjak 2015-06-03 18:08:01 UTC
(In reply to Peter Cordes from comment #0)

> I wrote it in that ugly way initially because I was basically porting my ASM
> code to intrinsics.  BTW, the results were terrible.  gcc generates
> ridiculously bad code for getting the src bytes into zero-extended 64bit
> regs, for use as scaled-offsets in an address, compared to
> 
> movzx %dl, %eax
> movxz %dh, %ebx
> shr   $16, %rdx
> use rax/rbx
> movzx %dl, %eax
> movxz %dh, %ebx
> shr   $16, %rdx
> use rax/rbx
>  ...
> 
>  gcc never just shifts the reg holding src data.  Instead if copies, and
> shifts the copy by $16, $32, or $48.
> 
>  gcc's code is about 35% slower than the hand-written version, even letting
> it use avx so it doesn't emit useless movdqa instructions when it doesn't
> realize that an old value is no longer needed.  Just un-comment the
> mostly-commented loop body in the testcase (attachment version).
> 
>  Anyway, slow code is off-topic, this bug is about wrong code!

Please open a new PR for this bug.
Comment 9 uros 2015-06-08 18:41:55 UTC
Author: uros
Date: Mon Jun  8 18:41:24 2015
New Revision: 224244

URL: https://gcc.gnu.org/viewcvs?rev=224244&root=gcc&view=rev
Log:
	Backport from mainline:
	2015-06-03  Uros Bizjak  <ubizjak@gmail.com>

	PR target/66275
	* config/i386/i386.c (ix86_function_arg_regno): Use ix86_cfun_abi
	to determine current function ABI.
	(ix86_function_value_regno_p): Ditto.

testsuite/ChangeLog:

	Backport from mainline:
	2015-06-03  Uros Bizjak  <ubizjak@gmail.com>

	PR target/66275
	* gcc.target/i386/pr66275.c: New test.


Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.target/i386/pr66275.c
Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/i386/i386.c
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
Comment 10 uros 2015-06-08 20:07:40 UTC
Author: uros
Date: Mon Jun  8 20:07:08 2015
New Revision: 224247

URL: https://gcc.gnu.org/viewcvs?rev=224247&root=gcc&view=rev
Log:
	Backport from mainline:
	2015-06-03  Uros Bizjak  <ubizjak@gmail.com>

	PR target/66275
	* config/i386/i386.c (ix86_function_arg_regno): Use ix86_cfun_abi
	to determine current function ABI.
	(ix86_function_value_regno_p): Ditto.

testsuite/ChangeLog:

	Backport from mainline:
	2015-06-03  Uros Bizjak  <ubizjak@gmail.com>

	PR target/66275
	* gcc.target/i386/pr66275.c: New test.


Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.target/i386/pr66275.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/config/i386/i386.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
Comment 11 uros 2015-06-08 20:59:39 UTC
Author: uros
Date: Mon Jun  8 20:59:07 2015
New Revision: 224249

URL: https://gcc.gnu.org/viewcvs?rev=224249&root=gcc&view=rev
Log:
	Backport from mainline:
	2015-06-03  Uros Bizjak  <ubizjak@gmail.com>

	PR target/66275
	* config/i386/i386.c (ix86_function_arg_regno): Use ix86_cfun_abi
	to determine current function ABI.
	(ix86_function_value_regno_p): Ditto.

testsuite/ChangeLog:

	Backport from mainline:
	2015-06-03  Uros Bizjak  <ubizjak@gmail.com>

	PR target/66275
	* gcc.target/i386/pr66275.c: New test.


Added:
    branches/gcc-4_8-branch/gcc/testsuite/gcc.target/i386/pr66275.c
Modified:
    branches/gcc-4_8-branch/gcc/ChangeLog
    branches/gcc-4_8-branch/gcc/config/i386/i386.c
    branches/gcc-4_8-branch/gcc/testsuite/ChangeLog
Comment 12 Uroš Bizjak 2015-06-08 21:01:51 UTC
Fixed everywhere.