Bug 39840 - Non-optimal (or wrong) implementation of SSE intrinsics
Summary: Non-optimal (or wrong) implementation of SSE intrinsics
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.4.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on: 37565
Blocks:
Reported: 2009-04-21 19:05 UTC by Ulrich Drepper
Modified: 2009-04-22 13:58 UTC (History)
3 users

See Also:
Host:
Target: i?86-* x86_64-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
An example (3.72 KB, application/octet-stream)
2009-04-21 20:34 UTC, H.J. Lu
Details
An example (1.54 KB, application/octet-stream)
2009-04-21 21:56 UTC, H.J. Lu
Details

Description Ulrich Drepper 2009-04-21 19:05:32 UTC
The implementations of the SSE intrinsics for x86 and x86-64 in gcc are tied to the use of an appropriate -m option, such as -mssse3 or -mavx.  This differs from what icc does, and it prevents code from being written in the most natural form.  This is nothing new in gcc 4.4; it has been the behavior of gcc forever, as far as I can see.  But the introduction of AVX in particular brings this problem to the foreground.

As an example, assume I want to write a vector class with the usual operations.  I can write code like this:

#ifdef __AVX__
vec<float,N> operator+(vec<float,N> &a, vec<float,N> &b) {
  ... use AVX intrinsics ...
}
#elif defined __SSE4__
vec<float,N> operator+(vec<float,N> &a, vec<float,N> &b) {
  ... use SSE4 intrinsics ...
}
#elif defined __SSE2__
vec<float,N> operator+(vec<float,N> &a, vec<float,N> &b) {
  ... use SSE2 intrinsics ...
}
#else
vec<float,N> operator+(vec<float,N> &a, vec<float,N> &b) {
  ... generic implementation ...
}
#endif

But this means, of course, that the binary has to be compiled separately for every single target and the correct one has to be chosen.  This is neither attractive nor practical.  Chances are that only a generic implementation will be available.

It would be better to have a self-optimizing implementation:

vec<float,N> operator+(vec<float,N> &a, vec<float,N> &b) {
  if (AVX is available)
    ... use AVX intrinsics ...
  else if (SSE4 is available)
    ... use SSE4 intrinsics ...
  else if (SSE2 is available)
    ... use SSE2 intrinsics ...
  else
    ... generic implementation ...
}

This is possible with icc.  It is not possible with gcc at the moment.  With gcc I would have to split the implementations of all the variants into individual files and then call those implementations from the template function shown above.  Even if, as in this case, that might be doable (though terribly inconvenient), there are situations where it is really impractical or impossible.


The problem is that to be able to use the AVX intrinsics the compiler has to be passed -mavx (all other extensions are implied by -mavx).  But this flag has another consequence: the compiler will now take advantage of the new AVX instructions in unrelated code not associated with intrinsics (e.g., an inlined memset implementation).  The result is that such a binary will fail to run on anything but an AVX-enabled machine.


In icc the -mavx flag exclusively controls code generation (i.e., whether AVX is used in inlined memset etc.).  The SSE intrinsics and all the associated data types are _always_ defined as soon as <immintrin.h> is included.


This means the example code above could be compiled with an -m parameter for the minimum ISA to support, and the AVX, SSE4, ... intrinsics would still be available.


gcc should follow icc's way of handling the intrinsics.  Since all this intrinsic business comes from icc, I consider this a bug in gcc's implementation rather than an enhancement request.
Comment 1 H.J. Lu 2009-04-21 19:07:53 UTC
Please provide some sample code which can be compiled.
Comment 2 Ulrich Drepper 2009-04-21 19:37:48 UTC
[I couldn't attach the code as an attachment, bugzilla has a bug.]

The program below has to be compiled with -mavx to allow the AVX intrinsics to be used.  But this also triggers the use of the vmovss instruction to load the parameter for the sinf() call from memory.

(Forget the reference to memset in the original report, it's as simple as passing floating point parameters that triggers the problem.)

#include <math.h>
#include <stdio.h>
#include <immintrin.h>


static unsigned int eax, ebx, ecx, edx;


static int
has_avx (void)
{
  if ((ecx & (1 << 27)) == 0)
    /* No OSXSAVE.  */
    return 0;

  unsigned int feat_eax, feat_edx;
  asm ("xgetbv" : "=a" (feat_eax), "=d" (feat_edx) : "c" (0));
  if ((feat_eax & 6) != 6)
    return 0;

  return (ecx & (1 << 28)) != 0;
}


template <typename T, int N>
struct vec {
  union {
    T n[N];
    __v4sf f[N / (sizeof (__v4sf) / sizeof (T))];
    __v8sf fa[N / (sizeof (__v8sf) / sizeof (T))];
  };
};


template <typename T, int N>
T
optscalar(const vec<T,N> &src1, const vec<T,N> &src2)
{
  T r = 0;
  for (int i = 0; i < N; ++i)
    r += src1.n[i] * src2.n[i];
  return r;
}


template <int N>
float
optscalar(const vec<float,N> &src1, const vec<float,N> &src2)
{
  if (has_avx ())
    {
      __m256 tmp = _mm256_setzero_ps ();
      for (int i = 0; i < N / 8; ++i)
        tmp = _mm256_add_ps (tmp, _mm256_mul_ps (src1.fa[i], src2.fa[i]));
      tmp = _mm256_hadd_ps (tmp, tmp);
      tmp = _mm256_hadd_ps (tmp, tmp);
      tmp = _mm256_hadd_ps (tmp, tmp);
      union
      {
        __m256 v;
        float a[8];
      } cvt = { tmp };
      return cvt.a[0];
    }
  else
    {
      __m128 tmp = _mm_setzero_ps ();
      for (int i = 0; i < N / 4; ++i)
        tmp = _mm_add_ps (tmp, _mm_mul_ps (src1.f[i], src2.f[i]));
      tmp = _mm_hadd_ps (tmp, tmp);
      tmp = _mm_hadd_ps (tmp, tmp);
      return __builtin_ia32_vec_ext_v4sf (tmp, 0);
    }
}


#define N 100000
#define DEF(type) vec<type,N> v##type##1, v##type##2; type type##res, type##cmp
DEF(float);

float g;

int
main ()
{
  float f = sinf  (g);
  printf ("%g\n", f);

  asm volatile ("cpuid"
		: "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
		: "0" (1));

  float floatres = optscalar (vfloat1, vfloat2);
  printf ("%g\n", floatres);

  return 0;
}
Comment 3 pinskia@gmail.com 2009-04-21 19:41:54 UTC
Subject: Re:  Non-optimal (or wrong) implementation of SSE intrinsics

Gcc 4.4 and above supports different target options on the function
level but not on the basic block level.  So you can create an internal
version for AVX.

Comment 4 Ulrich Drepper 2009-04-21 19:51:38 UTC
(In reply to comment #3)
> Gcc 4.4 and above supports different target options on the function
> level but not on the basic block level.  So you can create an internal
> version for AVX.

This doesn't work either, quite apart from also being impractical.

First, you'd have to switch to AVX mode, in this case, just to include <immintrin.h>.  How do you switch back to what was in effect before?  How would you even determine it?

Even if you can, try it, and you'll see that gcc is horribly broken when it comes to the target("...") attributes.  In the current Fedora 11 compiler (4.4) all target options are apparently turned off and none of the intrinsics work at all.

Even if the necessary support were added and the bugs fixed, it would still differ from icc (where all this comes from), and not in a nice way.  On the contrary, it's much, much more complicated.
Comment 5 H.J. Lu 2009-04-21 20:34:10 UTC
Created attachment 17667 [details]
An example

I am enclosing a modified example which can be compiled with both
icc and gcc. I also included the assembly code generated by "icc -O2"
and "gcc -mavx -O2". Icc generates:

  54:	c5 ff 7c c8          	vhaddps %ymm0,%ymm0,%ymm1
  58:	c5 f7 7c d1          	vhaddps %ymm1,%ymm1,%ymm2
  5c:	c5 ef 7c da          	vhaddps %ymm2,%ymm2,%ymm3
  60:	c5 fc 29 5c 24 e0    	vmovaps %ymm3,-0x20(%rsp)
  66:	f3 0f 10 44 24 e0    	movss  -0x20(%rsp),%xmm0

for

if (has_avx ())
 {
   ...
 }

There is

f3 0f 10 44 24 e0    	movss  -0x20(%rsp),%xmm0

although this code will only run on AVX targets. Since we don't
support basic-block-level optimization, I don't see how we can avoid
SSE instructions in the AVX code path. The best option I can think
of is function-level optimization. But as we all know, function-level
optimization isn't usable, at least in this context. I think we
should go back and take another look at function-level optimization.
We should do it right this time. I have some ideas in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37565
Comment 6 H.J. Lu 2009-04-21 21:56:51 UTC
Created attachment 17668 [details]
An example

Here is an example for gcc 4.4. If function-level optimization worked,
we wouldn't need separate files for AVX and SSE3.
Comment 7 Richard Biener 2009-04-22 09:36:43 UTC
The problem with different instruction sets in different BBs is also how to
avoid code motion across them.  IMNSHO this is a bad idea.
Comment 8 H.J. Lu 2009-04-22 13:58:19 UTC
(In reply to comment #7)
> The problem with different instruction sets in different BBs is also how to
> avoid code motion across them.  IMNSHO this is a bad idea.
> 

I agree. There are too many issues with it. I'd like to see
function level optimization work properly for all cases.