(asked in https://stackoverflow.com/questions/45298855/how-to-write-portable-simd-code-for-complex-multiplicative-reduction/45401182#comment77780455_45401182 )

Intel has instructions like vfmaddsubps. GCC manages, under certain circumstances, to merge a multiplication and an addition, or a multiplication and a subtraction, into an FMA, but not a multiplication and this strange addsub mix.

#include <x86intrin.h>
__m128d f(__m128d x, __m128d y, __m128d z){
  return _mm_addsub_pd(_mm_mul_pd(x,y),z);
}
__m128d g(__m128d x, __m128d y, __m128d z){
  return _mm_fmaddsub_pd(x,y,z);
}

(the order of the arguments is probably not right)

My first guess as to how this could be implemented without too much trouble is in ix86_gimple_fold_builtin: for IX86_BUILTIN_ADDSUBPD and others, check that we are late enough in the optimization pipeline (roughly where "widening_mul" runs), that contractions are enabled, and that the first (?) argument is a single-use MULT_EXPR. I didn't check what the situation is with the vectorizer (which IIRC can now generate code that ends up as addsub).
Hmm, I think the issue is we see

f (__m128d x, __m128d y, __m128d z)
{
  vector(2) double _4;
  vector(2) double _6;

  <bb 2> [100.00%]:
  _4 = x_2(D) * y_3(D);
  _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
  return _6;

the vectorizer will implement addsub as

  _6 = _4 + z_5(D);
  _7 = _4 - z_5(D);
  _8 = __builtin_shuffle (_6, _7, {0, 1});
  return _8;

which would then end up as (if the non-single use allows)

  _6 = FMA <x_2, y_3, z_5(D)>;
  _9 = -z_5(D);
  _7 = FMA <x_2, y_3, _9>;
  _8 = __builtin_shuffle (_6, _7, {0, 1});
  return _8;

a bit interesting for combine to figure out, but theoretically possible? (I think we expand both FMAs properly.) Look at the addsub patterns. That is, handling this requires open-coding _mm_addsub_pd with add, sub and shuffle ...
__m128d h(__m128d x, __m128d y, __m128d z){
  __m128d tem = _mm_mul_pd (x,y);
  __m128d tem2 = tem + z;
  __m128d tem3 = tem - z;
  return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
}

doesn't quite work (the combiner pattern for fmaddsub is missing). Tried {0, 2} as well.

.LFB5021:
        .cfi_startproc
        vmovapd %xmm0, %xmm3
        vfmsub132pd     %xmm1, %xmm2, %xmm0
        vfmadd132pd     %xmm1, %xmm2, %xmm3
        vshufpd $2, %xmm0, %xmm3, %xmm0
*** Bug 84361 has been marked as a duplicate of this bug. ***
(In reply to Richard Biener from comment #2)
> __m128d h(__m128d x, __m128d y, __m128d z){
>   __m128d tem = _mm_mul_pd (x,y);
>   __m128d tem2 = tem + z;
>   __m128d tem3 = tem - z;
>   return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
> }
>
> doesn't quite work (the combiner pattern for fmaddsub is missing). Tried
> {0, 2} as well.
>
> .LFB5021:
>         .cfi_startproc
>         vmovapd %xmm0, %xmm3
>         vfmsub132pd     %xmm1, %xmm2, %xmm0
>         vfmadd132pd     %xmm1, %xmm2, %xmm3
>         vshufpd $2, %xmm0, %xmm3, %xmm0

This is recognized as

  tem2_6 = .FMA (x_2(D), y_3(D), z_5(D));
  # DEBUG tem2 => tem2_6
  # DEBUG BEGIN_STMT
  tem3_7 = .FMS (x_2(D), y_3(D), z_5(D));
  # DEBUG tem3 => NULL
  # DEBUG BEGIN_STMT
  _8 = VEC_PERM_EXPR <tem2_6, tem3_7, { 0, 3 }>;

Can it be handled in match.pd? Rewriting the fmaddsub pattern into vec_merge fma fms <addsub_cst> looks too complex. Similar for VEC_ADDSUB + MUL -> VEC_FMADDSUB.
(In reply to Richard Biener from comment #1)
> Hmm, I think the issue is we see
>
> f (__m128d x, __m128d y, __m128d z)
> {
>   vector(2) double _4;
>   vector(2) double _6;
>
>   <bb 2> [100.00%]:
>   _4 = x_2(D) * y_3(D);
>   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]

We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB -> VEC_FMADDSUB in match.pd?
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #1)
> > Hmm, I think the issue is we see
> >
> > f (__m128d x, __m128d y, __m128d z)
> > {
> >   vector(2) double _4;
> >   vector(2) double _6;
> >
> >   <bb 2> [100.00%]:
> >   _4 = x_2(D) * y_3(D);
> >   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
> We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
> VEC_FMADDSUB in match.pd?

I think MUL + .VEC_ADDSUB can be handled in the FMA pass. For my example above we get, early (before FMA recognition),

  _4 = x_2(D) * y_3(D);
  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

which we could recognize as .VEC_ADDSUB. I think we want to avoid doing this too early. I am not sure whether doing it within the FMA pass itself will work, since we key FMAs on the mult but would need to key the addsub on the VEC_PERM (we are walking stmts from BB start to end). Looking at the code, it seems changing the walking order should work.

Note that matching

  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly creates .VEC_ADDSUB when possible).
> to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly
> creates .VEC_ADDSUB when possible).

Let's put it under -fno-trapping-math.
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:f0b7a61d83534fc8f7aa593b1f0f0357a371a800

commit r14-2919-gf0b7a61d83534fc8f7aa593b1f0f0357a371a800
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jul 31 16:03:45 2023 +0800

    Support vec_fmaddsub/vec_fmsubadd for vector HFmode.

    AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph. Also remove
    scalar mode from fmaddsub/fmsubadd pattern since there's no scalar
    instruction for that.

    gcc/ChangeLog:

            PR target/81904
            * config/i386/sse.md (vec_fmaddsub<mode>4): Extend to vector
            HFmode, use mode iterator VFH instead.
            (vec_fmsubadd<mode>4): Ditto.
            (<sd_mask_codefor>fma_fmaddsub_<mode><sd_maskz_name><round_name>):
            Remove scalar mode from iterator, use VFH_AVX512VL instead.
            (<sd_mask_codefor>fma_fmsubadd_<mode><sd_maskz_name><round_name>):
            Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr81904.c: New test.