The following example generates wrong code when compiled with -O2 on amd64. It doesn't seem to happen when compiling in 32bit mode though. #include <mmintrin.h> #include <assert.h> typedef unsigned int CARD32; static void mmxCombineAddU (CARD32 *dest, const CARD32 *src, int width) { const __m64 mmx_0 = _mm_setzero_si64(); const CARD32 *end = dest + width; while (dest < end) { __m64 s, d; s = _mm_unpacklo_pi8 (_mm_cvtsi32_si64(*src), mmx_0); d = _mm_unpacklo_pi8 (_mm_cvtsi32_si64(*dest), mmx_0); s = _mm_add_pi16(s, d); *dest = (CARD32)_mm_cvtsi64_si32(_mm_packs_pu16(s, mmx_0)); ++dest; ++src; } _mm_empty(); } int main() { CARD32 a = 0xffffffff; CARD32 b = 0x10101010; mmxCombineAddU(&a, &b, 1); assert(a == 0xffffffff); return 0; } The compiled assembler for the mmx instructions looks like this: 400555: 0f 6e 00 movd (%rax),%mm0 400558: 0f 60 ca punpcklbw %mm2,%mm1 40055b: 0f 60 c2 punpcklbw %mm2,%mm0 40055e: 0f fc c1 paddb %mm1,%mm0 400561: 0f 67 c3 packuswb %mm3,%mm0 400564: 0f 7e 00 movd %mm0,(%rax) It should use paddw instead of paddb here. best regards, Lars
Confirmed.
The code produced by crosscompiling from i686 for x86_86, I got functionally equal asm for 32bit and 64bit mode: gcc -O2 -m64: .LFB128: subq $24, %rsp .LCFI0: leaq 20(%rsp), %rax leaq 24(%rsp), %rdx cmpq %rax, %rdx jbe .L7 movq .LC5(%rip), %mm0 pxor %mm1, %mm1 packuswb %mm1, %mm0 movq %mm0, (%rsp) movq (%rsp), %rax incl %eax jne .L4 emms ... gcc -m32 -mmmx: pushl %ebp movl %esp, %ebp subl $40, %esp leal -4(%ebp), %eax andl $-16, %esp subl $16, %esp cmpl %eax, %ebp jbe .L7 movq .LC6, %mm0 pxor %mm1, %mm1 packuswb %mm1, %mm0 movd %mm0, -20(%ebp) movl -20(%ebp), %eax incl %eax jne .L4 emms ... Tested with gcc version 4.1.0 20050716 (experimental)
A regression from 3.4.x and 3.3.x.
With my patched gcc-4.1.0-20051019 the testcase works. Applied patches: PR7776, PR20297, PR22429, PR22533, PR23948, PR19505, PR20606, PR24069, PR24419, PR24172, PR24295, PR20928 [builder2@estel BUILD]$ gcc -Wall -O2 pr22432.c -S -march=x86-64 [builder2@estel BUILD]$ cat pr22432.s .file "pr22432.c" .section .rodata.str1.1,"aMS",@progbits,1 .LC0: .string "pr22432.c" .LC1: .string "a == 0xffffffff" .section .rodata.cst8,"aM",@progbits,8 .align 8 .LC2: .long 269488144 .long 0 .align 8 .LC3: .long -1 .long 0 .text .p2align 4,,15 .globl main .type main, @function main: pushl %ebp movl %esp, %ebp subl $56, %esp leal -4(%ebp), %eax andl $-16, %esp subl $16, %esp cmpl %ebp, %eax jae .L9 movl $0, -24(%ebp) movl $0, -20(%ebp) movq -24(%ebp), %mm2 movq .LC2, %mm0 movq .LC3, %mm1 punpcklbw %mm2, %mm0 punpcklbw %mm2, %mm1 paddw %mm1, %mm0 packuswb %mm2, %mm0 movd %mm0, -28(%ebp) emms cmpl $-1, -28(%ebp) jne .L10 leave xorl %eax, %eax ret .L9: emms leave xorl %eax, %eax ret .L10: movl $__PRETTY_FUNCTION__.2544, 12(%esp) movl $30, 8(%esp) movl $.LC0, 4(%esp) movl $.LC1, (%esp) call __assert_fail .size main, .-main .section .rodata .type __PRETTY_FUNCTION__.2544, @object .size __PRETTY_FUNCTION__.2544, 5 __PRETTY_FUNCTION__.2544: .string "main" .ident "GCC: (GNU) 4.1.0 20051019 (experimental)" .section .note.GNU-stack,"",@progbits [builder2@estel BUILD]$ gcc -Wall -O2 pr22432.c -save-temps -march=x86-64 [builder2@estel BUILD]$ ./a.out ; echo $? 0
This is a showstopper; wrong code on a primary platform.
In the .life1 dump we have: (insn 41 38 42 2 (set (reg:V2SI 79 [ D.2609 ]) (subreg:V2SI (reg:V8QI 76) 0)) 998 {*movv2si_internal_rex64 (insn_list:REG_DEP_TRUE 32 (nil)) (expr_list:REG_DEAD (reg:V8QI 76) (nil))) (insn 42 41 43 2 (set (reg:V2SI 80 [ D.2619 ]) (subreg:V2SI (reg:V8QI 78) 0)) 998 {*movv2si_internal_rex64} (insn_list:REG_DEP_TRUE 38 (nil)) (expr_list:REG_DEAD (reg:V8QI 78) (nil))) (insn 43 42 46 2 (set (reg:V4HI 81) (plus:V4HI (subreg:V4HI (reg:V2SI 79 [ D.2609 ]) 0) (subreg:V4HI (reg:V2SI 80 [ D.2619 ]) 0))) 1031 {mmx_addv4hi3} (insn_list:REG_DEP_TRUE 41 (insn_list:REG_DEP_TRUE 42 (nil))) (expr_list:REG_DEAD (reg:V2SI 79 [ D.2609 ]) (expr_list:REG_DEAD (reg:V2SI 80 [ D.2619 ]) (nil)))) After combine we have: (note 41 38 42 2 NOTE_INSN_DELETED) (note 42 41 43 2 NOTE_INSN_DELETED) (insn 43 42 46 2 (set (reg:V8QI 81) (plus:V8QI (reg:V8QI 76) (reg:V8QI 78))) 1030 {mmx_addv8qi3} (insn_list:REG_DEP_TRUE 38 (insn_list:REG_DEP_TRUE 32 (nil))) (expr_list:REG_DEAD (reg:V8QI 78) (expr_list:REG_DEAD (reg:V8QI 76) (nil)))) Is that really the same???
According to rth on IRC this is indeed absolutely wrong. So it looks like we may have a combine bug here.
No, doing the add in v4hi mode is not the same as doing the add in v8qi mode. The carry bits will be handled differently. It's also rather odd that register 81 changed from V4HImode to V8QImode. Normally a pseudo-register always has the same mode, and accessing it in a different mode requires a subreg.
Created attachment 10151 [details] Proposed patch This patch fixes the problem. I'm running tests now.
Subject: Bug 22432 Author: ian Date: Sun Nov 6 05:34:38 2005 New Revision: 106555 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=106555 Log: ./: PR target/22432 * combine.c (apply_distributive_law): Don't distribute across a vector mode subreg. testsuite/: PR target/22432 * gcc.target/i386/pr22432.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr22432.c Modified: trunk/gcc/ChangeLog trunk/gcc/combine.c trunk/gcc/testsuite/ChangeLog
Subject: Bug 22432 Author: ian Date: Sun Nov 6 06:38:14 2005 New Revision: 106556 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=106556 Log: ./: PR target/22432 * combine.c (apply_distributive_law): Don't distribute across a vector mode subreg. testsuite/: 2005-11-05 Ian Lance Taylor <ian@airs.com> PR target/22432 * gcc.target/i386/pr22432.c: New test. Added: branches/gcc-4_0-branch/gcc/testsuite/gcc.target/i386/pr22432.c Modified: branches/gcc-4_0-branch/gcc/ChangeLog branches/gcc-4_0-branch/gcc/combine.c branches/gcc-4_0-branch/gcc/testsuite/ChangeLog
Fixed for 4.0.3 and 4.1.