Take the following testcase:

#include <immintrin.h>

__m256 foo (__m256 a, __m256 b, __m256 c)
{
  return _mm256_add_ps (_mm256_mul_ps (a, b), c);
}

__m128 foo (__m128 a, __m128 b, __m128 c)
{
  return _mm_add_ps (_mm_mul_ps (a, b), c);
}

float foo (float a, float b, float c)
{
  return a * b + c;
}

Compiled with 'g++ -O3 -mfma -ffp-contract=fast -fabi-version=0 -c', only the third function uses FMAs (same for -mfma4). The SSE and AVX variants should get the same contraction as is implemented for scalar operations.
Confirmed. That's because we have

__m256 foo(__m256, __m256, __m256) (__m256 a, __m256 b, __m256 c)
{
  __m256 D.6689;
  __m256 D.6686;
  __m256 _5;
  __m256 _6;

  <bb 2>:
  _5 = __builtin_ia32_mulps256 (a_1(D), b_2(D));
  _6 = __builtin_ia32_addps256 (_5, c_3(D));
  return _6;

instead of

  _5 = a_1(D) * b_2(D);
  _6 = _5 + c_3(D);

Not sure why we use builtins for these basic operations... _mm256_add_ps could for example simply be

extern __inline __m256 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_add_ps (__m256 __A, __m256 __B)
{
  return (__m256) ((__v8sf)__A + (__v8sf)__B);
}

with the caveat of using a GNU extension.
For _mm256_fmadd_pd and friends the only possibility is to fold the target builtins to FMA_EXPR via targetm.fold_builtin, which is of course also possible for the simple adds/muls (and getting rid of the builtins is a good idea anyway - they are quite heavy in compiler startup time).
(In reply to comment #1)
> not sure why we use builtins for these basic operations...

Because they have to be emitted also for non-SSE math.

From config/i386/sse.md:

;; The standard names for fma is only available with SSE math enabled.
(define_expand "fma<mode>4"
  [(set (match_operand:FMAMODE 0 "register_operand")
        (fma:FMAMODE
          (match_operand:FMAMODE 1 "nonimmediate_operand")
          (match_operand:FMAMODE 2 "nonimmediate_operand")
          (match_operand:FMAMODE 3 "nonimmediate_operand")))]
  "(TARGET_FMA || TARGET_FMA4) && TARGET_SSE_MATH")

...

;; The builtin for intrinsics is not constrained by SSE math enabled.
(define_expand "fma4i_fmadd_<mode>"
  [(set (match_operand:FMAMODE 0 "register_operand")
        (fma:FMAMODE
          (match_operand:FMAMODE 1 "nonimmediate_operand")
          (match_operand:FMAMODE 2 "nonimmediate_operand")
          (match_operand:FMAMODE 3 "nonimmediate_operand")))]
  "TARGET_FMA || TARGET_FMA4")
(In reply to comment #3)
> Because they have to be emitted also for non-SSE math.
> [...]

Ah, of course ... That leaves the option of folding in targetm.fold_builtin (when the standard names are available, of course).
(In reply to comment #3)
> Because they have to be emitted also for non-SSE math.
> [...]

Hmm, I wonder how the vectorizer then accesses add/sub patterns without SSE math. It just queries optabs ...

We cannot handle the FMA case with standard operations anyway. But if SSE modes are used, why should convert_mult_to_fma have to back off (it also just looks at standard optabs)?

That said - shouldn't the above TARGET_SSE_MATH restriction apply only to scalar modes?
(In reply to comment #5)
> That said - should the above TARGET_SSE_MATH restriction not only
> apply to scalar modes?

Ugh, you are right. TARGET_SSE_MATH should apply to scalar modes only. A patch is in the works.
So to recap: for the vector intrinsics we can use the GCC vector extensions, while for the scalar intrinsics we would have to use target builtin folding. Not sure if the asymmetry is worth it (in that case, do everything with target builtin folding instead). Testcase that still needs to "work" (emit addsd) with -m32 -msse2 and without -mfpmath=sse:

#include <emmintrin.h>

double foo (double x)
{
  return _mm_cvtsd_f64 (_mm_add_sd (_mm_set_sd (x), _mm_set_sd (1.0)));
}

Exposing the intrinsics' internals to the compiler would also mean we no longer literally emit the code the user may have asked for (though RTL optimization might already do that - but RTL, for example, never re-associates FP, even with -ffast-math).
The same problem applies to other kinds of optimizations, such as algebraic reductions and constant propagation.

The method of using operators such as * and + is not portable to other compilers, and it doesn't work with integer vectors for integer sizes other than 64 bits. (I know that there is no integer FMA on Intel CPUs, but I am also talking about other optimizations.)

Here are some other examples of optimizations I would like gcc to do:

#include "x86intrin.h"

void dummy2 (__m128 a, __m128 b);
void dummyi2 (__m128i a, __m128i b);

void commutative (__m128 a, __m128 b)
{
  // expect reduce a+b = b+a. This is the only reduction that actually works!
  dummy2 (_mm_add_ps (a, b), _mm_add_ps (b, a));
}

void associative (__m128i a, __m128i b, __m128i c)
{
  // expect reduce (a+b)+c = a+(b+c)
  dummyi2 (_mm_add_epi32 (_mm_add_epi32 (a, b), c),
           _mm_add_epi32 (a, _mm_add_epi32 (b, c)));
}

void distributive (__m128i a, __m128i b, __m128i c)
{
  // expect reduce a*b+a*c = a*(b+c)
  dummyi2 (_mm_add_epi32 (_mm_mul_epi32 (a, b), _mm_mul_epi32 (a, c)),
           _mm_mul_epi32 (a, _mm_add_epi32 (b, c)));
}

void constant_propagation ()
{
  // expect store c and d as precalculated constants
  __m128i a = _mm_setr_epi32 (1, 2, 3, 4);
  __m128i b = _mm_set1_epi32 (5);
  __m128i c = _mm_add_epi32 (a, b);
  __m128i d = _mm_mul_epi32 (a, b);
  dummyi2 (c, d);
}

Of course, the same applies to 256-bit and 512-bit vectors.
Many programmers are using a vector class library rather than writing intrinsic functions directly. Such libraries have overloaded operators which are inlined to produce intrinsic functions. Therefore, we cannot expect programmers to make optimizations like FMA contraction, algebraic reduction, constant propagation, etc. manually. I don't know whether this more general discussion of optimizations on code with intrinsics fits into this bug or needs to be discussed elsewhere?
Two random links into the latest conversation: https://gcc.gnu.org/ml/gcc-patches/2014-07/msg01812.html https://gcc.gnu.org/ml/gcc-patches/2014-06/msg02288.html That version of the patch doesn't handle integer vectors IIRC, but it is the principle that matters, if the first step is accepted I'll extend it. I suppose I should ping it again soon...
Thanks for the links Marc. You are right, the discussion in the gcc-patches mailing list ignores integer vectors. You need a solution that also allows optimizations on integer intrinsic functions (perhaps cast the vector type?). I am not on any internal mailing list, so please post it there for me. The proposed solution of using vector extensions will not work on masked vector intrinsics in AVX512, so it wouldn't enable e.g. constant propagation through a masked intrinsic, but that is probably too much to ask for :) I will add a new bug report for contraction of broadcast with AVX512.
(In reply to Agner Fog from comment #11)
> Thanks for the links Marc.
> You are right, the discussion in the gcc-patches mailing list ignores
> integer vectors. You need a solution that also allows optimizations on
> integer intrinsic functions (perhaps cast the vector type?).

If you follow the links, you can find:
https://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html
where I handled some integer vector intrinsics as well (there were some bugs in that patch, but the idea should be fine).

> The proposed solution of using vector extensions will not work on masked
> vector intrinsics in AVX512, so it wouldn't enable e.g. constant propagation
> through a masked intrinsic, but that is probably too much to ask for :)

I expect we can get most of the benefits from using vector extensions for very little effort, while handling the esoteric intrinsics would be a lot more work, so it gets lower priority.
Thank you. I agree that integer overflow should be well-defined when using intrinsics. Is it possible to do the same optimization with boolean vector intrinsics, such as _mm_and_epi32 and _mm_or_ps to enable optimizations such as algebraic reduction and constant propagation?
(In reply to Agner Fog from comment #13)
> Is it possible to do the same optimization with boolean vector intrinsics,
> such as _mm_and_epi32 and _mm_or_ps to enable optimizations such as
> algebraic reduction and constant propagation?

Anything we already do with vector extensions should be easy, and that includes constant propagation in & and |. The slightly harder part is transformations that are only valid if v is a "bool vector" (i.e. each component is either 0 or -1), like replacing v != 0 with just v. We can test constants, we know the result of comparisons is boolean, and we know &, | and ^ preserve that property, but it isn't a purely local property, so it requires a bit more work.
Oops, sorry, or_ps may be harder, because representing it with vector extensions requires casts to integer vectors, which makes it much harder to actually generate or_ps in the backend (there is at least one PR about that), so we probably won't do it soon.
This is resolved since 5.1: https://godbolt.org/z/_tpStf