Bug 81904 - FMA and addsub instructions
Summary: FMA and addsub instructions
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Duplicates: 84361
Depends on: 87555
Blocks: 84361
 
Reported: 2017-08-20 11:28 UTC by Marc Glisse
Modified: 2023-08-02 06:50 UTC

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-08-21 00:00:00


Description Marc Glisse 2017-08-20 11:28:34 UTC
(asked in https://stackoverflow.com/questions/45298855/how-to-write-portable-simd-code-for-complex-multiplicative-reduction/45401182#comment77780455_45401182 )

Intel has instructions like vfmaddsubps. GCC manages, under certain circumstances, to merge a multiplication with an addition or a subtraction into an FMA, but not a multiplication with this mixed addsub operation.

#include <x86intrin.h>
__m128d f(__m128d x, __m128d y, __m128d z){
  return _mm_addsub_pd(_mm_mul_pd(x,y),z);
}
__m128d g(__m128d x, __m128d y, __m128d z){
  return _mm_fmaddsub_pd(x,y,z);
}

(the order of the arguments is probably not right)
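
For reference (not part of the report), a minimal scalar sketch of the per-lane behaviour documented for addsub/fmaddsub (lane 0 is subtracted, lane 1 is added), just to spell out what f and g above are expected to compute:

#include <stdio.h>

static void addsub_ref (double r[2], const double a[2], const double b[2])
{
  r[0] = a[0] - b[0];           /* even lane: subtract */
  r[1] = a[1] + b[1];           /* odd lane: add */
}

static void fmaddsub_ref (double r[2], const double x[2],
                          const double y[2], const double z[2])
{
  r[0] = x[0] * y[0] - z[0];    /* even lane: multiply-subtract */
  r[1] = x[1] * y[1] + z[1];    /* odd lane: multiply-add */
}

int main (void)
{
  double x[2] = { 1, 2 }, y[2] = { 3, 4 }, z[2] = { 5, 6 };
  double m[2] = { x[0] * y[0], x[1] * y[1] }, r1[2], r2[2];
  addsub_ref (r1, m, z);        /* mirrors f(): mul, then addsub */
  fmaddsub_ref (r2, x, y, z);   /* mirrors g(): fused form */
  printf ("%g %g / %g %g\n", r1[0], r1[1], r2[0], r2[1]);
  return 0;
}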

My first guess as to how this could be implemented without too much trouble is in ix86_gimple_fold_builtin: for IX86_BUILTIN_ADDSUBPD and others, check that we are late enough in the optimization pipeline (roughly where "widening_mul" is), that contractions are enabled, and that the first (?) argument is a single-use MULT_EXPR.

I didn't check what the situation is with the vectorizer (which IIRC can now generate code that ends up as addsub).
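
Purely as a sketch of that direction (not an actual patch): the "late enough in the pipeline" test is left as a placeholder, the helper name is made up, and folding to the IFN_VEC_FMADDSUB internal function is an assumption about how the fused form would be represented in the IL. Something along these lines in the ix86_gimple_fold_builtin machinery:

/* Hypothetical helper: fold __builtin_ia32_addsubpd (a, b), where a is
   a single-use multiplication, into a fused multiply-addsub.  */
static bool
fold_addsub_to_fmaddsub (gimple_stmt_iterator *gsi)
{
  gimple *stmt = gsi_stmt (*gsi);
  tree arg0 = gimple_call_arg (stmt, 0);   /* the multiplication result */
  tree arg1 = gimple_call_arg (stmt, 1);   /* the addend/subtrahend */

  /* Contractions must be enabled for a fused form to be valid.  */
  if (flag_fp_contract_mode != FP_CONTRACT_FAST)
    return false;

  /* TODO: also check that we are late enough in the pipeline
     (roughly where the widening_mul pass runs).  */

  /* The first argument must be a single-use multiplication.  */
  if (TREE_CODE (arg0) != SSA_NAME || !has_single_use (arg0))
    return false;
  gimple *def = SSA_NAME_DEF_STMT (arg0);
  if (!is_gimple_assign (def) || gimple_assign_rhs_code (def) != MULT_EXPR)
    return false;

  /* Replace the builtin call with .VEC_FMADDSUB (x, y, z).  */
  gcall *call = gimple_build_call_internal (IFN_VEC_FMADDSUB, 3,
                                            gimple_assign_rhs1 (def),
                                            gimple_assign_rhs2 (def),
                                            arg1);
  gimple_call_set_lhs (call, gimple_call_lhs (stmt));
  gsi_replace (gsi, call, true);
  return true;
}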
Comment 1 Richard Biener 2017-08-21 09:31:25 UTC
Hmm, I think the issue is we see

f (__m128d x, __m128d y, __m128d z)
{
  vector(2) double _4;
  vector(2) double _6;

  <bb 2> [100.00%]:
  _4 = x_2(D) * y_3(D);
  _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
  return _6;

the vectorizer will implement addsub as

  _6 = _4 + z_5(D);
  _7 = _4 - z_5(D);
  _8 = __builtin_shuffle (_6, _7, {0, 1});
  return _8;

which would then end up as (assuming the non-single use of the multiplication does not block FMA formation)

  _6 = FMA <x_2, y_3, z_5(D)>
  _9 = -z_5(D);
  _7 = FMA <x_2, y_3, _9>
  _8 = __builtin_shuffle (_6, _7, {0, 1});
  return _8;

a bit interesting for combine to figure out but theoretically possible?
(I think we expand both FMAs properly).

Look at the addsub patterns.

That is, handling this requires open-coding _mm_addsub_pd with add, sub
and shuffle ...
Comment 2 Richard Biener 2017-08-21 09:35:08 UTC
__m128d h(__m128d x, __m128d y, __m128d z){
    __m128d tem = _mm_mul_pd (x,y);
    __m128d tem2 = tem + z;
    __m128d tem3 = tem - z;
    return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
}

doesn't quite work (the combiner pattern for fmaddsub is missing).  Tried {0, 2} as well.

h:
.LFB5021:
        .cfi_startproc
        vmovapd %xmm0, %xmm3
        vfmsub132pd     %xmm1, %xmm2, %xmm0
        vfmadd132pd     %xmm1, %xmm2, %xmm3
        vshufpd $2, %xmm0, %xmm3, %xmm0
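
Side note, a standalone illustration of __builtin_shuffle's two-operand indexing for 2-element vectors (indices 0 and 1 select from the first operand, 2 and 3 from the second), which is what the {0, 3} mask above relies on; the values are arbitrary:

#include <stdio.h>
#include <x86intrin.h>

int main (void)
{
  __m128d a = { 10.0, 11.0 };
  __m128d b = { 20.0, 21.0 };
  /* r[0] = a[0] (index 0), r[1] = b[1] (index 3).  */
  __m128d r = __builtin_shuffle (a, b, (__m128i) { 0, 3 });
  printf ("%g %g\n", r[0], r[1]);   /* prints: 10 21 */
  return 0;
}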
Comment 3 Richard Biener 2023-07-21 12:31:42 UTC
*** Bug 84361 has been marked as a duplicate of this bug. ***
Comment 4 Hongtao.liu 2023-07-31 05:26:03 UTC
(In reply to Richard Biener from comment #2)
> __m128d h(__m128d x, __m128d y, __m128d z){
>     __m128d tem = _mm_mul_pd (x,y);
>     __m128d tem2 = tem + z;
>     __m128d tem3 = tem - z;
>     return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
> }
> 
> doesn't quite work (the combiner pattern for fmaddsub is missing).  Tried
> {0, 2} as well.
> 
> h:
> .LFB5021:
>         .cfi_startproc
>         vmovapd %xmm0, %xmm3
>         vfmsub132pd     %xmm1, %xmm2, %xmm0
>         vfmadd132pd     %xmm1, %xmm2, %xmm3
>         vshufpd $2, %xmm0, %xmm3, %xmm0

  tem2_6 = .FMA (x_2(D), y_3(D), z_5(D));
  # DEBUG tem2 => tem2_6
  # DEBUG BEGIN_STMT
  tem3_7 = .FMS (x_2(D), y_3(D), z_5(D));
  # DEBUG tem3 => NULL
  # DEBUG BEGIN_STMT
  _8 = VEC_PERM_EXPR <tem2_6, tem3_7, { 0, 3 }>;

Can it be handled in match.pd? Rewriting the fmaddsub pattern into a vec_merge of fma/fms with <addsub_cst> looks too complex.

Similarly for VEC_ADDSUB + MUL -> VEC_FMADDSUB.
Comment 5 Hongtao.liu 2023-07-31 05:27:38 UTC
(In reply to Richard Biener from comment #1)
> Hmm, I think the issue is we see
> 
> f (__m128d x, __m128d y, __m128d z)
> {
>   vector(2) double _4;
>   vector(2) double _6;
> 
>   <bb 2> [100.00%]:
>   _4 = x_2(D) * y_3(D);
>   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB -> VEC_FMADDSUB in match.pd?
Comment 6 Richard Biener 2023-07-31 07:32:05 UTC
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #1)
> > Hmm, I think the issue is we see
> > 
> > f (__m128d x, __m128d y, __m128d z)
> > {
> >   vector(2) double _4;
> >   vector(2) double _6;
> > 
> >   <bb 2> [100.00%]:
> >   _4 = x_2(D) * y_3(D);
> >   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
> We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
> VEC_FMADDSUB in match.pd?

I think MUL + .VEC_ADDSUB can be handled in the FMA pass.  For my example
above, early (before FMA recognition), we get

  _4 = x_2(D) * y_3(D);
  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

we could recognize that as .VEC_ADDSUB.  I think we want to avoid doing
this too early; I am not sure doing it within the FMA pass itself will
work, since we key FMAs on the mult but would need to key the addsub
on the VEC_PERM (we walk stmts from BB start to end).  Looking
at the code, it seems changing the walking order should work.

Note matching

  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly
creates .VEC_ADDSUB when possible).
Comment 7 Hongtao.liu 2023-07-31 07:58:43 UTC
> 
> to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly
> creates .VEC_ADDSUB when possible).
Let's put it under -fno-trapping-math.
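
To make the exception concern concrete, a small standalone demo of my own (compile without optimization, and link with -lm if needed, so the operations are not folded away): the open-coded sequence also evaluates the lanes that the permute discards, and those evaluations can raise flags that the single per-lane operations of .VEC_ADDSUB would not.

#include <fenv.h>
#include <float.h>
#include <stdio.h>

int main (void)
{
  feclearexcept (FE_ALL_EXCEPT);
  double m = DBL_MAX, z = DBL_MAX;
  volatile double add = m + z;   /* discarded lane: overflows to inf */
  volatile double sub = m - z;   /* kept lane: exact, no exception   */
  (void) add;
  printf ("kept result %g, FE_OVERFLOW raised: %d\n",
          sub, fetestexcept (FE_OVERFLOW) != 0);
  return 0;
}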
Comment 8 GCC Commits 2023-08-02 06:50:18 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:f0b7a61d83534fc8f7aa593b1f0f0357a371a800

commit r14-2919-gf0b7a61d83534fc8f7aa593b1f0f0357a371a800
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jul 31 16:03:45 2023 +0800

    Support vec_fmaddsub/vec_fmsubadd for vector HFmode.
    
    AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph.
    Also remove scalar mode from fmaddsub/fmsubadd pattern since there's
    no scalar instruction for that.
    
    gcc/ChangeLog:
    
            PR target/81904
            * config/i386/sse.md (vec_fmaddsub<mode>4): Extend to vector
            HFmode, use mode iterator VFH instead.
            (vec_fmsubadd<mode>4): Ditto.
            (<sd_mask_codefor>fma_fmaddsub_<mode><sd_maskz_name><round_name>):
            Remove scalar mode from iterator, use VFH_AVX512VL instead.
            (<sd_mask_codefor>fma_fmsubadd_<mode><sd_maskz_name><round_name>):
            Ditto.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/i386/pr81904.c: New test.