Gcc 4.4 generates an extra load in a loop:

[hjl@gnu-6 gcc]$ cat /tmp/b.c
#include <tmmintrin.h>

extern __m128i src[10];
extern __m128i resdst[10];

void
foo (void)
{
  int i;
  for (i = 0; i < 10; i++)
    resdst[i] = _mm_abs_epi16 (src[i]);
}
[hjl@gnu-6 gcc]$ gcc -O2 -S /tmp/b.c -o old.s -mssse3 -fno-asynchronous-unwind-tables
[hjl@gnu-6 gcc]$ gcc --version
gcc (GCC) 4.3.0 20080428 (Red Hat 4.3.0-8)
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[hjl@gnu-6 gcc]$ cat old.s
        .file   "b.c"
        .text
        .p2align 4,,15
.globl foo
        .type   foo, @function
foo:
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        pabsw   src(%rax), %xmm0
        movdqa  %xmm0, resdst(%rax)
        addq    $16, %rax
        cmpq    $160, %rax
        jne     .L2
        rep
        ret
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.3.0 20080428 (Red Hat 4.3.0-8)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-6 gcc]$ ./xgcc -B./ -O2 -mssse3 -S /tmp/b.c -fno-asynchronous-unwind-tables
[hjl@gnu-6 gcc]$ cat b.s
        .file   "b.c"
        .text
        .p2align 4,,15
.globl foo
        .type   foo, @function
foo:
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movdqu  src(%rax), %xmm0
        pabsw   %xmm0, %xmm0
        movdqu  %xmm0, resdst(%rax)
        addq    $16, %rax
        cmpq    $160, %rax
        jne     .L2
        rep
        ret
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.4.0 20081006 (experimental) [trunk revision 140917]"

There are 2 problems:

1. Alignment info is lost and unaligned load is generated.
2. The load isn't needed at all.
How is the load not needed?
Really, only the alignment information is lost:

(mem/s:V16QI (plus:SI (reg/f:SI 68)
        (reg:SI 63 [ ivtmp.68 ])) [4 resdst S16 A8])

The A8 in the MEM attributes means the access is known to be aligned only to 8 bits, although __m128i requires 16-byte alignment. I think this is fixed by http://gcc.gnu.org/ml/gcc-patches/2008-10/msg00325.html .

The load is needed. If we use a pointer instead of an array we get:

L2:
        pabsw   (%ecx,%eax), %xmm0
        movdqa  %xmm0, (%edx,%eax)
        addl    $16, %eax
        cmpl    $160, %eax
        jne     L2

Note that since __m128i has the may_alias attribute, you have to load the global pointer before the loop.
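For reference, the pointer-based variant mentioned in this comment could look roughly like the sketch below. It is an illustration, not the exact source that produced the assembly quoted here. The name foo_ptr is hypothetical, the globals are defined rather than declared extern so the snippet is self-contained, and the target attribute is an addition so that it builds without -mssse3 on an x86 GCC or Clang. Copying the global pointers into locals before the loop is the step the note about may_alias refers to: stores through a may_alias type could otherwise be assumed to modify the pointers themselves.

```c
#include <tmmintrin.h>

/* Assumption: pointer globals standing in for the arrays of the
   original test case.  Defined here (not extern) to stay self-contained. */
__m128i *src;
__m128i *resdst;

/* Hypothetical pointer-based variant of foo.  The target attribute is
   only there so the SSSE3 intrinsic can be used without -mssse3. */
__attribute__ ((target ("ssse3")))
void
foo_ptr (void)
{
  /* Load the global pointers once, before the loop.  __m128i is a
     may_alias type, so a store through d[i] may alias src and resdst. */
  __m128i *s = src;
  __m128i *d = resdst;
  int i;

  for (i = 0; i < 10; i++)
    d[i] = _mm_abs_epi16 (s[i]);
}
```

A caller would point src and resdst at 16-byte-aligned buffers of at least 10 elements before calling foo_ptr.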
Newer patch http://gcc.gnu.org/ml/gcc-patches/2008-10/msg00350.html
(In reply to comment #3)
> Newer patch http://gcc.gnu.org/ml/gcc-patches/2008-10/msg00350.html
>

With this patch, I got

.globl foo
        .type   foo, @function
foo:
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        pabsw   src(%rax), %xmm0
        movdqa  %xmm0, resdst(%rax)
        addq    $16, %rax
        cmpq    $160, %rax
        jne     .L2
        rep
        ret

The load is combined into pabsw. The extra load insn and unaligned move are gone.
Subject: Bug 37774

Author: jakub
Date: Thu Oct 9 08:17:08 2008
New Revision: 141003

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=141003
Log:
        PR middle-end/37774
        * tree.h (get_object_alignment): Declare.
        * emit-rtl.c (set_mem_attributes_minus_bitpos): Call
        get_object_alignment if needed.
        * builtins.c (get_pointer_alignment): Move ADDR_EXPR operand handling
        to ...
        (get_object_alignment): ... here.  New function.  Try harder to
        determine alignment from get_inner_reference returned offset.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/builtins.c
    trunk/gcc/emit-rtl.c
    trunk/gcc/tree.h
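The idea of deriving alignment from an offset can be illustrated with a little arithmetic. If a base object is aligned to a power-of-two number of bytes and an access sits at a known byte offset from it, the guaranteed alignment of the access is the smaller of the base alignment and the largest power of two dividing the offset. The sketch below is not GCC's get_object_alignment: the helper name known_alignment is hypothetical, it works in bytes (GCC tracks alignment in bits), and it assumes a constant offset and a power-of-two base alignment.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from GCC.  Returns the alignment in
   bytes that is guaranteed for an access at BASE + OFFSET, given that
   BASE is aligned to BASE_ALIGN bytes.  Assumes BASE_ALIGN is a power
   of two.  */
static size_t
known_alignment (size_t base_align, size_t offset)
{
  size_t offset_align;

  /* A zero offset keeps the full alignment of the base.  */
  if (offset == 0)
    return base_align;

  /* Isolate the lowest set bit: the largest power of two dividing
     OFFSET.  */
  offset_align = offset & (~offset + 1);

  return offset_align < base_align ? offset_align : base_align;
}
```

For the loop in this report, every offset into the 16-byte-aligned arrays is a multiple of 16, so known_alignment (16, 16 * i) stays at 16 and the aligned movdqa form remains valid.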
Fixed.