Bug 55448

Summary: using const-reference SSE or AVX types leads to unnecessary unaligned loads
Product: gcc
Component: target
Status: RESOLVED DUPLICATE
Severity: normal
Priority: P3
Version: 4.7.2
Target Milestone: 4.8.0
Host:
Target:
Build:
Reporter: Matthias Kretz (Vir) <mkretz>
Assignee: Not yet assigned to anyone <unassigned>
CC: jakub, jamborm, uros
Known to work:
Known to fail:
Last reconfirmed: 2012-11-23 00:00:00

Description Matthias Kretz (Vir) 2012-11-23 15:13:50 UTC
The following testcase:

#include <immintrin.h>
static inline __m256 add(const __m256 &a, const __m256 &b) { return _mm256_add_ps(a, b); }
void foo(__m256 &a, const __m256 b) { a = add(a, b); }

static inline __m128 add(const __m128 &a, const __m128 &b) { return _mm_add_ps(a, b); }
void foo(__m128 &a, const __m128 b) { a = add(a, b); }

compiled with "-O2 -mavx" leads to
        vmovups (%rdi), %xmm1
        vinsertf128     $0x1, 16(%rdi), %ymm1, %ymm1
        vaddps  %ymm0, %ymm1, %ymm0
        vmovaps %ymm0, (%rdi)

for the __m256 case and

        vmovups (%rdi), %xmm1
        vaddps  %xmm0, %xmm1, %xmm0
        vmovaps %xmm0, (%rdi)

for the __m128 case.

It should instead be:
        vaddps  (%rdi), %ymm0, %ymm0
        vmovaps %ymm0, (%rdi)
and:
        vaddps  (%rdi), %xmm0, %xmm0
        vmovaps %xmm0, (%rdi)

The desired code can be obtained if the const-reference arguments to add are changed to pass-by-value.
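For reference, a pass-by-value variant of the testcase (a sketch of what the report describes, compiled with the same "-O2 -mavx" flags) which, per the report, produces the memory-operand form shown above:

#include <immintrin.h>
static inline __m256 add(__m256 a, __m256 b) { return _mm256_add_ps(a, b); }
void foo(__m256 &a, const __m256 b) { a = add(a, b); }

static inline __m128 add(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
void foo(__m128 &a, const __m128 b) { a = add(a, b); }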
Comment 1 Jakub Jelinek 2012-11-23 17:28:01 UTC
The low alignment originates from eipa_sra: foo is still not in SSA form at that point, and
ipa_modify_call_arguments computes the align and misalign of the base, which is a PARM_DECL (of REFERENCE_TYPE, referencing __m256 with 256-bit alignment).
But get_pointer_alignment_1 always returns 8-bit alignment for PARM_DECLs.
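Roughly, at the source level (illustrative only; IPA-SRA actually works on GIMPLE and the names below are hypothetical), the transformation turns the const __m256 & parameters into by-value parameters and moves the dereferences into the caller, where the alignment of the load is derived from the PARM_DECL base and hence assumed to be only 8 bits:

#include <immintrin.h>

// Hypothetical shape of add() after IPA-SRA: the references became values.
static __m256 add_sra(__m256 a_val, __m256 b_val)
{
  return _mm256_add_ps(a_val, b_val);
}

void foo_sra(__m256 &a, const __m256 b)
{
  // The load of 'a' now happens here; with only 8-bit alignment assumed,
  // it is emitted as vmovups + vinsertf128 instead of an aligned access.
  a = add_sra(a, b);
}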
Comment 2 Jakub Jelinek 2012-11-23 17:50:56 UTC
With -O2 -mavx -fno-ipa-sra the whole
#include <x86intrin.h>

static inline __m256
add1 (const __m256 & a, const __m256 & b)
{
  return _mm256_add_ps (a, b);
}

void
f1 (__m256 & a, const __m256 b)
{
  a = add1 (a, b);
}

static inline __m128
add2 (const __m128 & a, const __m128 & b)
{
  return _mm_add_ps (a, b);
}

void
f2 (__m128 & a, const __m128 b)
{
  a = add2 (a, b);
}

static inline __m256
add3 (const __m256 *a, const __m256 *b)
{
  return _mm256_add_ps (*a, *b);
}

void
f3 (__m256 *a, const __m256 b)
{
  *a = add3 (a, &b);
}

static inline __m128
add4 (const __m128 *a, const __m128 *b)
{
  return _mm_add_ps (*a, *b);
}

void
f4 (__m128 *a, const __m128 b)
{
  *a = add4 (a, &b);
}

testcase compiles into optimal code.  Beyond the eipa_sra issue, for AVX/AVX2 we generally should attempt to combine unaligned loads with the operations that use them (unless the load is just a plain move), but UNSPEC_LOADU is involved (and, for 256-bit values, also a vec_concat with another MEM load), so it is not clear which pass would be best to handle that: some hack in the combiner, peephole2 (but we'd need many of them), or something else.
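To illustrate what such combining would mean (a sketch, not part of the bug report; the function name is made up): VEX-encoded arithmetic instructions do not require aligned memory operands, so an unaligned load with a single use could be folded into the instruction that consumes it:

#include <immintrin.h>

__m256 add_loadu(const float *p, __m256 b)
{
  __m256 a = _mm256_loadu_ps(p);  // unaligned load, represented with UNSPEC_LOADU in RTL
  return _mm256_add_ps(a, b);     // ideally folded into: vaddps (%rdi), %ymm0, %ymm0
}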
Comment 3 Matthias Kretz (Vir) 2012-11-24 21:38:21 UTC
BTW, the problem is just as visible with SSE only: the __m128 case then compiles to movlps and movhps instead of using a memory operand.
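A sketch of the SSE-only reproducer (the __m128 part of the original testcase, compiled with e.g. plain "-O2" on x86-64, without -mavx):

#include <xmmintrin.h>

static inline __m128 add(const __m128 &a, const __m128 &b) { return _mm_add_ps(a, b); }

void foo(__m128 &a, const __m128 b)
{
  // Observed: the load of 'a' is split into movlps + movhps;
  // desired: a single 16-byte access that addps can use as a memory operand.
  a = add(a, b);
}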
Comment 4 Martin Jambor 2012-11-27 20:45:50 UTC
I have proposed a patch on the mailing list:

http://gcc.gnu.org/ml/gcc-patches/2012-11/msg02265.html

It still needs a testcase from this bug, but it addresses this problem.
Comment 5 Martin Jambor 2012-11-30 16:11:44 UTC
Author: jamborm
Date: Fri Nov 30 16:11:33 2012
New Revision: 193998

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193998
Log:
2012-11-30  Martin Jambor  <mjambor@suse.cz>

	PR middle-end/52890
	PR tree-optimization/55415
	PR tree-optimization/54386
	PR target/55448
	* ipa-prop.c (ipa_modify_call_arguments): Be optimistic when
	get_pointer_alignment_1 returns false and the base was not a
	dereference.
	* tree-sra.c (access_precludes_ipa_sra_p): New parameter req_align,
	added check for required alignment.  Update the user.

	* testsuite/gcc.dg/ipa/ipa-sra-7.c: New test.
	* testsuite/gcc.dg/ipa/ipa-sra-8.c: Likewise.
	* testsuite/gcc.dg/ipa/ipa-sra-9.c: Likewise.
	* testsuite/gcc.target/i386/pr55448.c: Likewise.


Added:
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-7.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-8.c
    trunk/gcc/testsuite/gcc.dg/ipa/ipa-sra-9.c
    trunk/gcc/testsuite/gcc.target/i386/pr55448.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-prop.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c
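The contents of the new testsuite files are not reproduced in this report; as a rough illustration of how such a scan-assembler test is structured (a sketch only — the code and dg- directives below are assumptions, not the actual committed gcc.target/i386/pr55448.c):

/* { dg-do compile } */
/* { dg-options "-O2 -mavx" } */

#include <immintrin.h>

static __m256 add (const __m256 *a, const __m256 *b)
{
  return _mm256_add_ps (*a, *b);
}

void foo (__m256 *a, const __m256 *b)
{
  *a = add (a, b);
}

/* Verify that no unaligned 256-bit load is emitted.  */
/* { dg-final { scan-assembler-not "vmovups" } } */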
Comment 6 Martin Jambor 2012-12-03 13:28:48 UTC
Duplicate. Now a fixed one, fortunately.

*** This bug has been marked as a duplicate of bug 54386 ***