The following testcase:

#include <immintrin.h>

static inline __m256 add(const __m256 &a, const __m256 &b)
{
  return _mm256_add_ps(a, b);
}

void foo(__m256 &a, const __m256 b)
{
  a = add(a, b);
}

static inline __m128 add(const __m128 &a, const __m128 &b)
{
  return _mm_add_ps(a, b);
}

void foo(__m128 &a, const __m128 b)
{
  a = add(a, b);
}

compiled with "-O2 -mavx" leads to:

        vmovups     (%rdi), %xmm1
        vinsertf128 $0x1, 16(%rdi), %ymm1, %ymm1
        vaddps      %ymm0, %ymm1, %ymm0
        vmovaps     %ymm0, (%rdi)

for the __m256 case and:

        vmovups (%rdi), %xmm1
        vaddps  %xmm0, %xmm1, %xmm0
        vmovaps %xmm0, (%rdi)

for the __m128 case. It should rather be:

        vaddps  (%rdi), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)

and:

        vaddps  (%rdi), %xmm0, %xmm0
        vmovaps %xmm0, (%rdi)

The latter result can be obtained if the const-ref arguments to add are changed to pass by value.
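For reference, the by-value variant just mentioned looks like this (a minimal sketch derived from the testcase above; per the report, with -O2 -mavx it yields the desired two-instruction sequence):

#include <immintrin.h>

static inline __m256 add(__m256 a, __m256 b)
{
  return _mm256_add_ps(a, b);
}

void foo(__m256 &a, const __m256 b)
{
  a = add(a, b);  /* expected: vaddps (%rdi), %ymm0, %ymm0 ; vmovaps %ymm0, (%rdi) */
}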
The low alignment originates from eipa_sra: foo is not in SSA form at that point, and ipa_modify_call_arguments computes the align and misalign of the base, which is a PARM_DECL (of REFERENCE_TYPE, referencing __m256 with 256-bit alignment). But get_pointer_alignment_1 for PARM_DECLs always returns 8-bit alignment.
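To illustrate what the pessimistic answer costs (this is an illustration and a possible source-level workaround, not part of the original report): a reference parameter is just a pointer whose alignment the compiler must derive, and GCC's __builtin_assume_aligned can supply the 256-bit guarantee that get_pointer_alignment_1 fails to find, assuming the caller really does pass 32-byte-aligned objects:

#include <immintrin.h>

static inline __m256 add_hinted(const __m256 &a, const __m256 &b)
{
  /* The hint promises 32-byte alignment; behaviour is undefined if the
     caller breaks that promise.  */
  const __m256 *pa = static_cast<const __m256 *>(__builtin_assume_aligned(&a, 32));
  const __m256 *pb = static_cast<const __m256 *>(__builtin_assume_aligned(&b, 32));
  return _mm256_add_ps(*pa, *pb);
}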
With -O2 -mavx -fno-ipa-sra the whole testcase:

#include <x86intrin.h>

static inline __m256 add1 (const __m256 & a, const __m256 & b)
{
  return _mm256_add_ps (a, b);
}

void f1 (__m256 & a, const __m256 b)
{
  a = add1 (a, b);
}

static inline __m128 add2 (const __m128 & a, const __m128 & b)
{
  return _mm_add_ps (a, b);
}

void f2 (__m128 & a, const __m128 b)
{
  a = add2 (a, b);
}

static inline __m256 add3 (const __m256 *a, const __m256 *b)
{
  return _mm256_add_ps (*a, *b);
}

void f3 (__m256 *a, const __m256 b)
{
  *a = add3 (a, &b);
}

static inline __m128 add4 (const __m128 *a, const __m128 *b)
{
  return _mm_add_ps (*a, *b);
}

void f4 (__m128 *a, const __m128 b)
{
  *a = add4 (a, &b);
}

compiles into optimal code. Beyond the eipa_sra issue, for AVX/AVX2 we should generally attempt to combine unaligned loads with the operations that use them (unless it is a plain move), but there is UNSPEC_LOADU involved (and for 256-bit values also a vec_concat with another MEM load), so it is not clear which pass would best handle that: some hack in the combiner, peephole2 (but we'd need many of them), or something else.
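To make the desired combination concrete (an illustration, not from the original report): VEX-encoded arithmetic instructions such as vaddps tolerate unaligned memory operands, so an unaligned load feeding one of them can legally be folded away:

#include <immintrin.h>

__m256 sum (const float *p, __m256 b)
{
  /* The loadu intrinsic expands to the UNSPEC_LOADU pattern mentioned
     above; ideally it would be folded into the addition as
     vaddps (%rdi), %ymm0, %ymm0 instead of a separate vmovups.  */
  return _mm256_add_ps (_mm256_loadu_ps (p), b);
}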
BTW, the problem is just as well visible with plain SSE: the __m128 case then compiles to movlps and movhps instead of using a memory operand.
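Roughly (my reading of the report, not output quoted from it), the load is split like this:

        movlps  (%rdi), %xmm1
        movhps  8(%rdi), %xmm1
        addps   %xmm1, %xmm0

where, with the 128-bit alignment known, a single memory operand would do:

        addps   (%rdi), %xmm0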
I have proposed a patch on the mailing list: http://gcc.gnu.org/ml/gcc-patches/2012-11/msg02265.html It still needs a testcase from this bug, but it addresses the problem reported here.
Author: jamborm
Date: Fri Nov 30 16:11:33 2012
New Revision: 193998

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193998
Log:
2012-11-30  Martin Jambor  <mjambor@suse.cz>

	PR middle-end/52890
	PR tree-optimization/55415
	PR tree-optimization/54386
	PR target/55448
	* ipa-prop.c (ipa_modify_call_arguments): Be optimistic when
	get_pointer_alignment_1 returns false and the base was not a
	dereference.
	* tree-sra.c (access_precludes_ipa_sra_p): New parameter req_align,
	added check for required alignment.  Update the user.

	* testsuite/gcc.dg/ipa/ipa-sra-7.c: New test.
	* testsuite/gcc.dg/ipa/ipa-sra-8.c: Likewise.
	* testsuite/gcc.dg/ipa/ipa-sra-9.c: Likewise.
	* testsuite/gcc.target/i386/pr55448.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-7.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-8.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-9.c
    trunk/gcc/testsuite/gcc.target/i386/pr55448.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-prop.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c
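The committed gcc.target/i386/pr55448.c is not reproduced here; a test along these lines (a hypothetical sketch of the DejaGnu scaffolding, not the actual committed file) could verify the fix by checking that the split 128-bit load no longer appears:

/* { dg-do compile } */
/* { dg-options "-O2 -mavx" } */

#include <immintrin.h>

static inline __m256 add (const __m256 *a, const __m256 *b)
{
  return _mm256_add_ps (*a, *b);
}

void foo (__m256 *a, const __m256 b)
{
  *a = add (a, &b);
}

/* { dg-final { scan-assembler-not "vinsertf128" } } */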
Duplicate. Now a fixed one, fortunately.

*** This bug has been marked as a duplicate of bug 54386 ***