Summary: | Non-optimal (or wrong) implementation of SSE intrinsics | ||
---|---|---|---|
Product: | gcc | Reporter: | Ulrich Drepper <drepper.fsp> |
Component: | middle-end | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | RESOLVED FIXED | ||
Severity: | enhancement | CC: | gcc-bugs, hjl.tools, jakub |
Priority: | P3 | ||
Version: | 4.4.0 | ||
Target Milestone: | 5.0 | ||
Host: | Target: | i?86-* x86_64-* | |
Build: | Known to work: | ||
Known to fail: | Last reconfirmed: | ||
Bug Depends on: | 37565 | ||
Bug Blocks: | |||
Attachments: |
An example
An eample |
Description
Ulrich Drepper
2009-04-21 19:05:32 UTC
Please provide some sample code which can be compiled. [I couldn't attach the code as an attachment, bugzilla has a bug.] The program below has to be compiled with -mavx to allow the AVX intrinsics being used. But this also triggers using the use of the vmovss instruction to load the parameter for the sin() call from memory. (Forget the reference to memset in the original report, it's as simple as passing floating point parameters that triggers the problem.) #include <math.h> #include <stdio.h> #include <immintrin.h> static unsigned int eax, ebx, ecx, edx; static int has_avx (void) { if ((ecx & (1 << 27)) == 0) /* No OSXSAVE. */ return 0; unsigned int feat_eax, feat_edx; asm ("xgetbv" : "=a" (feat_eax), "=d" (feat_edx) : "c" (0)); if ((feat_eax & 6) != 6) return 0; return (ecx & (1 << 28)) != 0; } template <typename T, int N> struct vec { union { T n[N]; __v4sf f[N / (sizeof (__v4sf) / sizeof (T))]; __v8sf fa[N / (sizeof (__v8sf) / sizeof (T))]; }; }; template <typename T, int N> T optscalar(const vec<T,N> &src1, const vec<T,N> &src2) { T r = 0; for (int i = 0; i < N; ++i) r += src1[i] * src2[i]; return r; } template <int N> float optscalar(const vec<float,N> &src1, const vec<float,N> &src2) { if (has_avx ()) { __m256 tmp = _mm256_setzero_ps (); for (int i = 0; i < N / 8; ++i) tmp = _mm256_add_ps (tmp, _mm256_mul_ps (src1.fa[i], src2.fa[i])); tmp = _mm256_hadd_ps (tmp, tmp); tmp = _mm256_hadd_ps (tmp, tmp); tmp = _mm256_hadd_ps (tmp, tmp); union { __m256 v; float a[8]; } cvt = { tmp }; return cvt.a[0]; } else { __m128 tmp = _mm_setzero_ps (); for (int i = 0; i < N / 4; ++i) tmp = _mm_add_ps (tmp, _mm_mul_ps (src1.f[i], src2.f[i])); tmp = _mm_hadd_ps (tmp, tmp); tmp = _mm_hadd_ps (tmp, tmp); return __builtin_ia32_vec_ext_v4sf (tmp, 0); } } #define N 100000 #define DEF(type) vec<type,N> v##type##1, v##type##2; type type##res, type##cmp DEF(float); float g; int main () { float f = sinf (g); printf ("%g\n", f); asm volatile ("cpuid" : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "0" (1)); float floatres = optscalar (vfloat1, vfloat2); printf ("%g\n", floatres); return 0; } Subject: Re: Non-optimal (or wrong) implementation of SSE intrinsics Gcc 4.4 and above supports different target options on the function level but not on a basic block level. So you can create an interneral version for AVX. Sent from my iPhone On Apr 21, 2009, at 12:37 PM, "drepper at redhat dot com" <gcc-bugzilla@gcc.gnu.org > wrote: > > > ------- Comment #2 from drepper at redhat dot com 2009-04-21 19:37 > ------- > [I couldn't attach the code as an attachment, bugzilla has a bug.] > > The program below has to be compiled with -mavx to allow the AVX > intrinsics > being used. But this also triggers using the use of the vmovss > instruction to > load the parameter for the sin() call from memory. > > (Forget the reference to memset in the original report, it's as > simple as > passing floating point parameters that triggers the problem.) > > #include <math.h> > #include <stdio.h> > #include <immintrin.h> > > > static unsigned int eax, ebx, ecx, edx; > > > static int > has_avx (void) > { > if ((ecx & (1 << 27)) == 0) > /* No OSXSAVE. */ > return 0; > > unsigned int feat_eax, feat_edx; > asm ("xgetbv" : "=a" (feat_eax), "=d" (feat_edx) : "c" (0)); > if ((feat_eax & 6) != 6) > return 0; > > return (ecx & (1 << 28)) != 0; > } > > > template <typename T, int N> > struct vec { > union { > T n[N]; > __v4sf f[N / (sizeof (__v4sf) / sizeof (T))]; > __v8sf fa[N / (sizeof (__v8sf) / sizeof (T))]; > }; > }; > > > template <typename T, int N> > T > optscalar(const vec<T,N> &src1, const vec<T,N> &src2) > { > T r = 0; > for (int i = 0; i < N; ++i) > r += src1[i] * src2[i]; > return r; > } > > > template <int N> > float > optscalar(const vec<float,N> &src1, const vec<float,N> &src2) > { > if (has_avx ()) > { > __m256 tmp = _mm256_setzero_ps (); > for (int i = 0; i < N / 8; ++i) > tmp = _mm256_add_ps (tmp, _mm256_mul_ps (src1.fa[i], > src2.fa[i])); > tmp = _mm256_hadd_ps (tmp, tmp); > tmp = _mm256_hadd_ps (tmp, tmp); > tmp = _mm256_hadd_ps (tmp, tmp); > union > { > __m256 v; > float a[8]; > } cvt = { tmp }; > return cvt.a[0]; > } > else > { > __m128 tmp = _mm_setzero_ps (); > for (int i = 0; i < N / 4; ++i) > tmp = _mm_add_ps (tmp, _mm_mul_ps (src1.f[i], src2.f[i])); > tmp = _mm_hadd_ps (tmp, tmp); > tmp = _mm_hadd_ps (tmp, tmp); > return __builtin_ia32_vec_ext_v4sf (tmp, 0); > } > } > > > #define N 100000 > #define DEF(type) vec<type,N> v##type##1, v##type##2; type > type##res, type##cmp > DEF(float); > > float g; > > int > main () > { > float f = sinf (g); > printf ("%g\n", f); > > asm volatile ("cpuid" > : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) > : "0" (1)); > > float floatres = optscalar (vfloat1, vfloat2); > printf ("%g\n", floatres); > > return 0; > } > > > -- > > drepper at redhat dot com changed: > > What |Removed |Added > --- > --- > ---------------------------------------------------------------------- > Status|WAITING |UNCONFIRMED > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39840 > (In reply to comment #3) > Gcc 4.4 and above supports different target options on the function > level but not on a basic block level. So you can create an interneral > version for AVX. This doesn't work either. Aside from being also impractical. First, you'd have to switch to AVX mode, in this case, to include <immintrin.h>. How do you switch back to what was used before? How to even determine it? Even if you can, try it, and you'll see that gcc is horribly broken when it comes to the target("...") attributes. In the current Fedora 11 compiler (4.4) all target options are apparently turned off and none of the intrinsics work at all. Even if the necessary support would be added and the bugs fixed it still differs from icc (where all this comes from) and not in a nice way. To the contrary, it's much, much more complicated. Created attachment 17667 [details] An example I am enclosing a modified example which can be compiled with both icc and gcc. I also included assembly codes generated by "icc -O2" and "gcc -avx -O2". Icc generates: 54: c5 ff 7c c8 vhaddps %ymm0,%ymm0,%ymm1 58: c5 f7 7c d1 vhaddps %ymm1,%ymm1,%ymm2 5c: c5 ef 7c da vhaddps %ymm2,%ymm2,%ymm3 60: c5 fc 29 5c 24 e0 vmovaps %ymm3,-0x20(%rsp) 66: f3 0f 10 44 24 e0 movss -0x20(%rsp),%xmm0 for if (has_avx ()) { ... } There is f3 0f 10 44 24 e0 movss -0x20(%rsp),%xmm0 although this code will only run on AVX targets. Since we don't support basic block optimization, I don't see how we can avoid SSE instructions in AVX code path. The best option I can think of is function level optimization. But as we all know, function level optimization isn't usable, as least in this context. I think we should go back and another look at function level optimization. We should do it right this time. I have some ideas in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37565 Created attachment 17668 [details]
An eample
Here is an example for gcc 4.4. If function level optimization works,
we don't need separate files for AVX and SSE3.
The problem with different instruction sets in different BBs is also how to avoid code motion across them. IMNSHO this is a bad idea. (In reply to comment #7) > The problem with different instruction sets in different BBs is also how to > avoid code motion across them. IMNSHO this is a bad idea. > I agree. There are too many issues with it. I'd like to see function level optimization work properly for all cases. GCC has support turning on/off target specific extensions since at least GCC 5, maybe earlier. So closing as fixed. (In reply to Andrew Pinski from comment #9) > GCC has support turning on/off target specific extensions since at least GCC > 5, maybe earlier. So closing as fixed. I Mean on specific on a per function level (via either the #pragma or the target attribute). |