I stumped upon an example where GCC generates FMA instruction even when FMAs are disabled using -ffp-contract=off (extracted from https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/cscal.c) $ cat cscal.c void cscal(int n, float da_r, float *x) { for (int i = 0; i < n; i += 4) { float temp0 = da_r * x[i] - x[i+1]; float temp1 = da_r * x[i+2] - x[i+3]; x[i+1] = da_r * x[i+1] + x[i]; x[i+3] = da_r * x[i+3] + x[i+2]; x[i] = temp0; x[i+2] = temp1; } } $ gcc -S -march=haswell -O2 -ffp-contract=off cscal.c $ grep fma cscal.s vfmaddsub231ps %xmm0, %xmm2, %xmm1 I would expect there to be no FMA instructions in there.
According to godbolt it's still producing FMAs on trunk: https://godbolt.org/z/aWh6d1E4E
Confirmed. It is SLP that is doing it. At -O3 even GCC 11 is working ok.
Nice catch, thanks for the report. This is due to g:7d810646d421 The documentation should clarify that patterns correspond to basic fma instructions (without intermediate rounding), and SLP pattern matching should check flag_fp_contract_mode != FP_CONTRACT_OFF.
(In reply to Alexander Monakov from comment #3) > Nice catch, thanks for the report. This is due to g:7d810646d421 > > The documentation should clarify that patterns correspond to basic fma > instructions (without intermediate rounding), and SLP pattern matching > should check flag_fp_contract_mode != FP_CONTRACT_OFF. I don't think they should. Because __builtin_fma (and fma and fmaf) should be able to be used from an user program and still get FMA instructions. And even the SLP instruction patterns can be generating using those.
(In reply to Andrew Pinski from comment #4) > (In reply to Alexander Monakov from comment #3) > > Nice catch, thanks for the report. This is due to g:7d810646d421 > > > > The documentation should clarify that patterns correspond to basic fma > > instructions (without intermediate rounding), and SLP pattern matching > > should check flag_fp_contract_mode != FP_CONTRACT_OFF. > > I don't think they should. Because __builtin_fma (and fma and fmaf) should > be able to be used from an user program and still get FMA instructions. And > even the SLP instruction patterns can be generating using those. That is: void f(float *a, float *b, float *c) { float t0 = __builtin_fmaf(a[0], b[0], c[0]); float t1 = __builtin_fmaf(a[1], b[1], c[1]); a[0] = t0; a[1] = t1; } Should produce: vmovq xmm0, QWORD PTR [rdi] vmovq xmm2, QWORD PTR [rsi] vmovq xmm1, QWORD PTR [rdx] vfmadd132ps xmm0, xmm1, xmm2 vmovlps QWORD PTR [rdi], xmm0 ret Even with -ffp-contract=off -march=haswell -O3
Sure, but I was talking specifically about the pattern matching introduced by that commit.
(In reply to Alexander Monakov from comment #6) > Sure, but I was talking specifically about the pattern matching introduced > by that commit. The general rule for pattern matching is if you don't have a FMA (or FMA-like) don't try to generate a FMA when -ffp-contract=off is supplied. The target backend should NOT know about flag_fp_contract_mode for the FMA like patterns. It is up to the middle-end optimizers to produce the correct thing.
I think the SLP pattern detection doesn't even work with a FMADD, FMSUB combo. We don't generate FMSUB early, not sure if we could somehow trick us into folding one with __builtin_fma, maybe __builtin_fma (x, y, -z) would do, I'd have to check. So yes, the SLP pattern detection introduces a FMA where none was before (and then vectorizes it). I'll fix it.
Ah, we only fold those to internal functions _after_ vectorization. SLP will see double x[2]; void foo (double a, double b, double * __restrict c) { x[0] = __builtin_fma (a, b, c[0]); x[1] = __builtin_fma (a, b, -c[1]); } as two calls to FMA and thus fail to optimially vectorize it.
Tamar, are the IFN_COMPLEX_FMA and IFN_COMPLEX_FMA_CONJ FP contracting operations as well?
(In reply to Richard Biener from comment #10) > Tamar, are the IFN_COMPLEX_FMA and IFN_COMPLEX_FMA_CONJ FP contracting > operations as well? Yes, they have no intermediate rounding.
Note that the same IFN is used for integer MLA as well. We didn't split them apart.
Created attachment 53917 [details] patch I am testing OK, I'm testing the following then - can you see if that works for the complex fmas and if the aarch64 testsuite needs any adjustments? (-ffp-contract=fast is default)
(In reply to Richard Biener from comment #13) > Created attachment 53917 [details] > patch I am testing > > OK, I'm testing the following then - can you see if that works for the > complex fmas and if the aarch64 testsuite needs any adjustments? > (-ffp-contract=fast is default) Will do, this needs a check on the type no? SVE2 adds complex integer MLAs which should be fine (we didn't split the IFNs).
I'm confused about the first hunk in the attached patch: --- a/gcc/tree-vect-slp-patterns.cc +++ b/gcc/tree-vect-slp-patterns.cc @@ -1035,8 +1035,10 @@ complex_mul_pattern::matches (complex_operation_t op, auto_vec<slp_tree> left_op, right_op; slp_tree add0 = NULL; - /* Check if we may be a multiply add. */ + /* Check if we may be a multiply add. It's only valid to form FMAs + with -ffp-contract=fast. */ if (!mul0 + && flag_fp_contract_mode != FP_CONTRACT_FAST && vect_match_expression_p (l0node[0], PLUS_EXPR)) { auto vals = SLP_TREE_CHILDREN (l0node[0]); Shouldn't it be ' == FP_CONTRACT_FAST' rather than '!='? It seems we are checking that a match is found and contracting across statement boundaries is allowed.
(In reply to Alexander Monakov from comment #15) > I'm confused about the first hunk in the attached patch: > > --- a/gcc/tree-vect-slp-patterns.cc > +++ b/gcc/tree-vect-slp-patterns.cc > @@ -1035,8 +1035,10 @@ complex_mul_pattern::matches (complex_operation_t op, > auto_vec<slp_tree> left_op, right_op; > slp_tree add0 = NULL; > > - /* Check if we may be a multiply add. */ > + /* Check if we may be a multiply add. It's only valid to form FMAs > + with -ffp-contract=fast. */ > if (!mul0 > + && flag_fp_contract_mode != FP_CONTRACT_FAST > && vect_match_expression_p (l0node[0], PLUS_EXPR)) > { > auto vals = SLP_TREE_CHILDREN (l0node[0]); > > > Shouldn't it be ' == FP_CONTRACT_FAST' rather than '!='? It seems we are > checking that a match is found and contracting across statement boundaries > is allowed. whoops yes, I'll fix and add a check for the type.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:c5df8392c5848c0462558f41cdf6eab31db301cf commit r13-4137-gc5df8392c5848c0462558f41cdf6eab31db301cf Author: Richard Biener <rguenther@suse.de> Date: Thu Nov 17 09:43:31 2022 +0100 tree-optimization/107647 - avoid FMA from SLP with -ffp-contract=off Only with -ffp-contract=fast we can synthesize FMA operations like vfmaddsub231ps, so properly guard the transform in SLP pattern detection. PR tree-optimization/107647 * tree-vect-slp-patterns.cc (addsub_pattern::recognize): Only allow FMA generation with -ffp-contract=fast for FP types. (complex_mul_pattern::matches): Likewise. * gcc.target/i386/pr107647.c: New testcase.
Fixed on trunk sofar (whoops, pushed before giving you a chance to test on aarch64 - hopefully it goes OK).
FWIW, the testsuite on AArch64 was clean after the patch.
The releases/gcc-12 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:a9fafa2f533e25c57528c0294e19a154197848dd commit r12-8974-ga9fafa2f533e25c57528c0294e19a154197848dd Author: Richard Biener <rguenther@suse.de> Date: Thu Nov 17 09:43:31 2022 +0100 tree-optimization/107647 - avoid FMA from SLP with -ffp-contract=off Only with -ffp-contract=fast we can synthesize FMA operations like vfmaddsub231ps, so properly guard the transform in SLP pattern detection. PR tree-optimization/107647 * tree-vect-slp-patterns.cc (addsub_pattern::recognize): Only allow FMA generation with -ffp-contract=fast for FP types. (complex_mul_pattern::matches): Likewise. * gcc.target/i386/pr107647.c: New testcase. (cherry picked from commit c5df8392c5848c0462558f41cdf6eab31db301cf)
Fixed.