Take the following testcase:

#include <immintrin.h>

__m256 foo (__m256 a, __m256 b, __m256 c)
{
  return _mm256_add_ps (_mm256_mul_ps (a, b), c);
}

__m128 foo (__m128 a, __m128 b, __m128 c)
{
  return _mm_add_ps (_mm_mul_ps (a, b), c);
}

float foo (float a, float b, float c)
{
  return a * b + c;
}

Compiled with 'g++ -O3 -mfma -ffp-contract=fast -fabi-version=0 -c', only the third function uses FMAs (same for -mfma4). The SSE and AVX variants should get the same contraction as is implemented for scalar operations.
Confirmed. That's because we have

__m256 foo(__m256, __m256, __m256) (__m256 a, __m256 b, __m256 c)
{
  __m256 D.6689;
  __m256 D.6686;
  __m256 _5;
  __m256 _6;

  <bb 2>:
  _5 = __builtin_ia32_mulps256 (a_1(D), b_2(D));
  _6 = __builtin_ia32_addps256 (_5, c_3(D));
  return _6;

instead of

  _5 = a_1(D) * b_2(D);
  _6 = _5 + c_3(D);

Not sure why we use builtins for these basic operations... _mm256_add_ps could for example simply be

extern __inline __m256 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_add_ps (__m256 __A, __m256 __B)
{
  return (__m256) ((__v8sf)__A + (__v8sf)__B);
}

with the caveat of using a GNU extension.
For _mm256_fmadd_pd and friends the only possibility is to fold the target builtins to FMA_EXPR via targetm.fold_builtin, which is of course also possible for the simple adds/muls (and getting rid of the builtins is a good idea anyway - they are quite heavy in compiler startup time).
(In reply to comment #1)
> not sure why we use builtins for these basic operations...

Because they have to be emitted also for non-SSE math.

From config/i386/sse.md:

;; The standard names for fma is only available with SSE math enabled.
(define_expand "fma<mode>4"
  [(set (match_operand:FMAMODE 0 "register_operand")
        (fma:FMAMODE
          (match_operand:FMAMODE 1 "nonimmediate_operand")
          (match_operand:FMAMODE 2 "nonimmediate_operand")
          (match_operand:FMAMODE 3 "nonimmediate_operand")))]
  "(TARGET_FMA || TARGET_FMA4) && TARGET_SSE_MATH")

...

;; The builtin for intrinsics is not constrained by SSE math enabled.
(define_expand "fma4i_fmadd_<mode>"
  [(set (match_operand:FMAMODE 0 "register_operand")
        (fma:FMAMODE
          (match_operand:FMAMODE 1 "nonimmediate_operand")
          (match_operand:FMAMODE 2 "nonimmediate_operand")
          (match_operand:FMAMODE 3 "nonimmediate_operand")))]
  "TARGET_FMA || TARGET_FMA4")
(In reply to comment #3)
> Because they have to be emitted also for non-SSE math.
> [...]

Ah, of course ... That leaves the option of folding in targetm.fold_builtin (when the standard names are available, of course).
(In reply to comment #3)
> Because they have to be emitted also for non-SSE math.
> [...]

Hmm, I wonder how the vectorizer then accesses add/sub patterns without SSE math. It just queries optabs ...

We cannot handle the FMA case with standard operations anyway. But if SSE modes are used, why should convert_mult_to_fma have to back off (it also just looks at standard optabs)?

That said - shouldn't the above TARGET_SSE_MATH restriction apply only to scalar modes?
(In reply to comment #5)
> That said - should the above TARGET_SSE_MATH restriction not only
> apply to scalar modes?

Ugh, you are right. TARGET_SSE_MATH should apply to scalar modes only. A patch is in the works.
So to recap: for the vector intrinsics we can use the GCC vector extensions, while for the scalar intrinsics we would have to use target builtin folding. Not sure if the asymmetry is worth it (in that case, do everything with target builtin folding instead). Testcase that still needs to "work" (emit addsd) with -m32 -msse2 and without -mfpmath=sse:

#include <emmintrin.h>

double foo (double x)
{
  return _mm_cvtsd_f64 (_mm_add_sd (_mm_set_sd (x), _mm_set_sd (1.0)));
}

Exposing the intrinsics' internals to the compiler would also mean we no longer literally emit the code the user may have asked for (though RTL optimization might already do that - but RTL, for example, never re-associates FP, even with -ffast-math).
The same problem applies to other kinds of optimizations, such as algebraic reductions and constant propagation.

The method of using operators such as * and + is not portable to other compilers, and it doesn't work with integer vectors for integer sizes other than 64 bits. (I know that there is no integer FMA on Intel CPUs, but I am also talking about other optimizations.)

Here are some other examples of optimizations I would like gcc to do:

#include "x86intrin.h"

void dummy2 (__m128 a, __m128 b);
void dummyi2 (__m128i a, __m128i b);

void commutative (__m128 a, __m128 b)
{
  // expect reduce a+b = b+a. This is the only reduction that actually works!
  dummy2 (_mm_add_ps (a, b), _mm_add_ps (b, a));
}

void associative (__m128i a, __m128i b, __m128i c)
{
  // expect reduce (a+b)+c = a+(b+c)
  dummyi2 (_mm_add_epi32 (_mm_add_epi32 (a, b), c),
           _mm_add_epi32 (a, _mm_add_epi32 (b, c)));
}

void distributive (__m128i a, __m128i b, __m128i c)
{
  // expect reduce a*b+a*c = a*(b+c)
  dummyi2 (_mm_add_epi32 (_mm_mul_epi32 (a, b), _mm_mul_epi32 (a, c)),
           _mm_mul_epi32 (a, _mm_add_epi32 (b, c)));
}

void constant_propagation ()
{
  // expect store c and d as precalculated constants
  __m128i a = _mm_setr_epi32 (1, 2, 3, 4);
  __m128i b = _mm_set1_epi32 (5);
  __m128i c = _mm_add_epi32 (a, b);
  __m128i d = _mm_mul_epi32 (a, b);
  dummyi2 (c, d);
}

Of course, the same applies to 256-bit and 512-bit vectors.
Many programmers are using a vector class library rather than writing intrinsic functions directly. Such libraries have overloaded operators which are inlined to produce intrinsic functions. Therefore, we cannot expect programmers to make optimizations like FMA contraction, algebraic reduction, constant propagation, etc. manually. I don't know whether this more general discussion of optimizations on code with intrinsics fits into this bug or needs to be discussed elsewhere?
Two random links into the latest conversation: https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html That version of the patch doesn't handle integer vectors IIRC, but it is the principle that matters, if the first step is accepted I'll extend it. I suppose I should ping it again soon...
Thanks for the links Marc. You are right, the discussion in the gcc-patches mailing list ignores integer vectors. You need a solution that also allows optimizations on integer intrinsic functions (perhaps cast the vector type?). I am not on any internal mailing list, so please post it there for me. The proposed solution of using vector extensions will not work on masked vector intrinsics in AVX512, so it wouldn't enable e.g. constant propagation through a masked intrinsic, but that is probably too much to ask for :) I will add a new bug report for contraction of broadcast with AVX512.
(In reply to Agner Fog from comment #11)
> Thanks for the links Marc.
> You are right, the discussion in the gcc-patches mailing list ignores
> integer vectors. You need a solution that also allows optimizations on
> integer intrinsic functions (perhaps cast the vector type?).

If you follow the links, you can find:
https://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html
where I handled some integer vector intrinsics as well (there were some bugs in that patch, but the idea should be fine).

> The proposed solution of using vector extensions will not work on masked
> vector intrinsics in AVX512, so it wouldn't enable e.g. constant propagation
> through a masked intrinsic, but that is probably too much to ask for :)

I expect we can get most of the benefits from using vector extensions for very little effort, while handling the esoteric intrinsics would be a lot more work, so it gets lower priority.
Thank you. I agree that integer overflow should be well-defined when using intrinsics. Is it possible to do the same optimization with boolean vector intrinsics, such as _mm_and_epi32 and _mm_or_ps to enable optimizations such as algebraic reduction and constant propagation?
(In reply to Agner Fog from comment #13)
> Is it possible to do the same optimization with boolean vector intrinsics,
> such as _mm_and_epi32 and _mm_or_ps to enable optimizations such as
> algebraic reduction and constant propagation?

Anything we already do with vector extensions should be easy, and that includes constant propagation in & and |. The slightly harder part is transformations that are only valid if v is a "bool vector" (i.e. each component is either 0 or -1), like replacing v != 0 with just v. We can test constants, we know the result of comparisons is boolean, and we know &, | and ^ preserve that property, but it isn't a purely local property, so it requires a bit more work.
Oops, sorry, or_ps may be harder, because representing it with vector extensions requires casts to integer vectors, which makes it much harder to actually generate or_ps in the backend (there is at least one PR about that), so we probably won't do it soon.
This is resolved since 5.1: https://godbolt.org/z/_tpStf