On GCC 15 branch, I got

FAIL: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfmadd132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
FAIL: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfnmadd132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
FAIL: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfmsub132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
FAIL: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfnmsub132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
https://gcc.gnu.org/pipermail/gcc-patches/2024-September/662523.html

d34cda720988674bcf8a24267c9e1ec61335d6de is the first bad commit
commit d34cda720988674bcf8a24267c9e1ec61335d6de
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Sep 29 12:54:17 2023 +0200

    Handle non-grouped stores as single-lane SLP
*** Bug 117073 has been marked as a duplicate of this bug. ***
See https://gcc.gnu.org/pipermail/gcc-patches/2024-September/662257.html which mentions this failure explicitly.
Compared to gcc14 I have for example for cond_op_fma__Float16-1.c

foo1_fnms:
.LFB7:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L24:
        vmovdqa b(%rax), %ymm1
        vmovdqa d(%rax), %ymm0
        addq    $32, %rax
        vcmpph  $1, c-32(%rax), %ymm1, %k1
        vmovdqa e-32(%rax), %ymm1
        vfnmsub213ph    a-32(%rax), %ymm0, %ymm1
        vmovdqu16       %ymm1, %ymm0{%k1}
        vmovdqa %ymm0, a-32(%rax)
        cmpq    $1600, %rax
        jne     .L24
        vzeroupper
        ret

instead of the expected

foo1_fnms:
.LFB7:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L24:
        vmovdqa b(%rax), %ymm1
        vmovdqa a(%rax), %ymm2
        addq    $32, %rax
        vmovdqa d-32(%rax), %ymm0
        vcmpph  $1, c-32(%rax), %ymm1, %k1
        vfnmsub132ph    e-32(%rax), %ymm2, %ymm0{%k1}
        vmovdqa %ymm0, a-32(%rax)
        cmpq    $1600, %rax
        jne     .L24
        vzeroupper
        ret

.combine shows in gcc14:

Trying 15 -> 16:
   15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
   16: r99:V16HF=vec_merge(r113:V16HF,r102:V16HF,r110:HI)
      REG_DEAD r113:V16HF
      REG_DEAD r110:HI
      REG_DEAD r102:V16HF
Successfully matched this instruction:
(set (reg:V16HF 99 [ _37 ])
    (vec_merge:V16HF (fma:V16HF (neg:V16HF (reg:V16HF 102 [ vect_pretmp_14.315 ]))
            (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
                    (symbol_ref:DI ("e") [flags 0x2] <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1]+0 S32 A256])
            (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
                        (symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.333_9 * 1]+0 S32 A256])))
        (reg:V16HF 102 [ vect_pretmp_14.315 ])
        (reg:HI 110 [ mask__11.325_55 ])))

but

Trying 15 -> 16:
   15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}
   16: r99:V16HF=vec_merge(r113:V16HF,r104:V16HF,r110:HI)
      REG_DEAD r113:V16HF
      REG_DEAD r110:HI
      REG_DEAD r104:V16HF
Failed to match this instruction:
(set (reg:V16HF 99 [ _37 ])
    (vec_merge:V16HF (fma:V16HF (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
                        (symbol_ref:DI ("e") [flags 0x2] <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1]+0 S32 A256]))
            (reg:V16HF 104 [ vect_pretmp_14.315 ])
            (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
                        (symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.329_9 * 1]+0 S32 A256])))
        (reg:V16HF 104 [ vect_pretmp_14.315 ])
        (reg:HI 110 [ mask__11.309_43 ])))

see how the commutative multiply part of insn 15 differs and causes the matching to fail:

good:   15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
bad:    15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}

this ordering is already present on GIMPLE:

  vect_pretmp_14.315_45 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.333_9 * 1];
  vect__5.322_52 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1];
  _37 = .COND_FNMS (mask__11.325_55, vect_pretmp_14.315_45, vect__5.322_52, vect__3.318_48, vect_pretmp_14.315_45);

vs.

  vect_pretmp_14.315_49 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.329_9 * 1];
  vect__5.312_46 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1];
  _37 = .COND_FNMS (mask__11.309_43, vect__5.312_46, vect_pretmp_14.315_49, vect__3.319_53, vect_pretmp_14.315_49);

both are canonicalized correctly (after SSA name version). This is a spurious difference, if we rely on these combines for the now missed micro-optimization we need to beef up the patterns to allow both orders. (avx512vl_fnmsub_v16hf_mask)

A target issue IMO? Alternatively make sure RTL canonicalizes (fma (neg non-reg) (reg) ...) to (fma (neg reg) (non-reg) ...) or stop matching that as pattern and thus force RTL expansion + combine to arrive at the correct variant?
Btw, simplify-rtx does

      /* Canonicalize the two multiplication operands. */
      /* a * -b + c => -b * a + c. */
      if (swap_commutative_operands_p (op0, op1))
        std::swap (op0, op1), any_change = true;

but it doesn't try to swap_commutative_operands_p on the negate argument and the non-negated operand, aka -a * b -> -b * a.

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index e8e60404ef6..0c86c204529 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -6835,6 +6835,16 @@ simplify_context::simplify_ternary_operation (rtx_code code, machine_mode mode,
       if (swap_commutative_operands_p (op0, op1))
        std::swap (op0, op1), any_change = true;
 
+      /* Canonicalize -a * b + c to -b * a + c if a is not a register
+        but b is. */
+      if (GET_CODE (op0) == NEG && REG_P (op1) && !REG_P (XEXP (op0, 0)))
+       {
+         op0 = XEXP (op0, 0);
+         op1 = simplify_gen_unary (NEG, mode, op1, mode);
+         std::swap (op0, op1);
+         any_change = true;
+       }
+
       if (any_change)
        return gen_rtx_FMA (mode, op0, op1, op2);
       return NULL_RTX;

fixes part of the observed regressions,

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index e8e60404ef6..13cb2cc0f5c 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -6832,9 +6832,20 @@ simplify_context::simplify_ternary_operation (rtx_code code, machine_mode mode,
       /* Canonicalize the two multiplication operands. */
       /* a * -b + c => -b * a + c. */
-      if (swap_commutative_operands_p (op0, op1))
+      if (swap_commutative_operands_p (op0, op1)
+         || (REG_P (op1) && GET_CODE (op0) != NEG && !REG_P (op0)))
        std::swap (op0, op1), any_change = true;

fixes the rest. I'm going to propose this.
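For readers who want to experiment with the two rules in the diffs above outside of GCC, here is a minimal sketch on a toy operand model. Everything in it (`Kind`, `Operand`, `FmaMul`, `canonicalize_fma_mul`) is hypothetical and invented for illustration; it does not use GCC's rtx API and does not model swap_commutative_operands_p, only the extra register/non-register conditions the diffs add.

```cpp
#include <cassert>
#include <utility>

// Hypothetical toy model, NOT GCC's rtx: a multiplication operand of an FMA
// is a register or a memory reference, optionally wrapped in a negation.
enum class Kind { Reg, Mem };

struct Operand {
  Kind kind;
  bool negated;
  int id;  // illustrative register number or memory tag
};

bool operator==(const Operand &a, const Operand &b) {
  return a.kind == b.kind && a.negated == b.negated && a.id == b.id;
}

struct FmaMul {
  Operand op0, op1;  // the two multiplication operands; the addend is omitted
};

// Applies toy versions of the two proposed rules.  Returns true on change.
bool canonicalize_fma_mul(FmaMul &f) {
  const bool op1_is_plain_reg = f.op1.kind == Kind::Reg && !f.op1.negated;
  // Both rules require op1 to be a plain register and op0 (looking through
  // a negation, if any) to be a non-register.
  if (!op1_is_plain_reg || f.op0.kind == Kind::Reg)
    return false;
  if (f.op0.negated) {
    // First diff: -a * b => -b * a when a is not a register but b is.
    f.op0.negated = false;
    f.op1.negated = true;
  }
  // Second diff (and the final step of the first): put the register first.
  std::swap(f.op0, f.op1);
  return true;
}
```

Feeding it the failing shape from the combine dump, a negated memory operand times r104, yields a negated r104 times the memory operand, which is the operand order the gcc14 dump matched.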
OTOH I'll note that no other simplify_* treats canonicalization as simplification and the existing swap_commutative_operands_p transform for FMA is highly uncommon. So why do we recognize (fma (neg (mem...)) ...) and not only (neg (register_operand))?
(In reply to Richard Biener from comment #7)
> OTOH I'll note that no other simplify_* treats canonicalization as
> simplification and the existing swap_commutative_operands_p transform for FMA
> is highly uncommon.
>
> So why do we recognize (fma (neg (mem...)) ...) and not only (neg
> (register_operand))?

I think we can relax register_operand to nonimmediate_operand and rely on RA to reload it into a reg, just like we did in <sd_mask_codefor>fma_fnmadd_<mode><sd_maskz_name><round_name>. So a backend fix should be better?
On Fri, 11 Oct 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117072
>
> --- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #7)
> > OTOH I'll note that no other simplify_* treats canonicalization as
> > simplification and the existing swap_commutative_operands_p transform for FMA
> > is highly uncommon.
> >
> > So why do we recognize (fma (neg (mem...)) ...) and not only (neg
> > (register_operand))?
>
> I think we can relax register_operand to nonimmediate_operand and rely on RA to
> reload it into a reg just like we did in
> <sd_mask_codefor>fma_fnmadd_<mode><sd_maskz_name><round_name>. So a backend fix
> should be better?

I think currently the backend isn't consistent with itself and sure, a backend fix would be better (if it doesn't mean bloating the .md with many more patterns).
(In reply to rguenther@suse.de from comment #9)
> On Fri, 11 Oct 2024, liuhongt at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117072
> >
> > --- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #7)
> > > OTOH I'll note that no other simplify_* treats canonicalization as
> > > simplification and the existing swap_commutative_operands_p transform for FMA
> > > is highly uncommon.
> > >
> > > So why do we recognize (fma (neg (mem...)) ...) and not only (neg
> > > (register_operand))?
> >
> > I think we can relax register_operand to nonimmediate_operand and rely on RA to
> > reload it into a reg just like we did in
> > <sd_mask_codefor>fma_fnmadd_<mode><sd_maskz_name><round_name>. So a backend fix
> > should be better?
>
> I think currently the backend isn't consistent with itself and sure,
> a backend fix would be better (if it doesn't mean bloating the .md
> with many more patterns).

No, just adjusting the existing patterns should be OK.
(In reply to Hongtao Liu from comment #10)
> (In reply to rguenther@suse.de from comment #9)
> > On Fri, 11 Oct 2024, liuhongt at gcc dot gnu.org wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117072
> > >
> > > --- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > > (In reply to Richard Biener from comment #7)
> > > > OTOH I'll note that no other simplify_* treats canonicalization as
> > > > simplification and the existing swap_commutative_operands_p transform for FMA
> > > > is highly uncommon.
> > > >
> > > > So why do we recognize (fma (neg (mem...)) ...) and not only (neg
> > > > (register_operand))?
> > >
> > > I think we can relax register_operand to nonimmediate_operand and rely on RA to
> > > reload it into a reg just like we did in
> > > <sd_mask_codefor>fma_fnmadd_<mode><sd_maskz_name><round_name>. So a backend fix
> > > should be better?
> >
> > I think currently the backend isn't consistent with itself and sure,
> > a backend fix would be better (if it doesn't mean bloating the .md
> > with many more patterns).
>
> No, just adjusting the existing patterns should be OK.

Relaxing the predicate doesn't help, since the mask pattern additionally checks (match_dup 1) and the operands would need to be swapped. We once tried to replace it with (match_operand:VFH_AVX512VL 5 "nonimmediate_operand" "0,0")), but that triggered an ICE in reload (reload can handle at most one operand with a "0" constraint).

(define_insn "<avx512>_fnmsub_<mode>_mask<round_name>"
  [(set (match_operand:VFH_AVX512VL 0 "register_operand" "=v,v")
        (vec_merge:VFH_AVX512VL
          (fma:VFH_AVX512VL
            (neg:VFH_AVX512VL
              (match_operand:VFH_AVX512VL 1 "nonimmediate_operand" "0,0"))
            (match_operand:VFH_AVX512VL 2 "<round_nimm_predicate>" "<round_constraint>,v")
            (neg:VFH_AVX512VL
              (match_operand:VFH_AVX512VL 3 "<round_nimm_predicate>" "v,<round_constraint>")))
          (match_dup 1)
          (match_operand:<avx512fmaskmode> 4 "register_operand" "Yk,Yk")))]
  "TARGET_AVX512F && <round_mode_condition>"

So a backend fix would need to add at least 8 patterns to handle that; in that case, maybe the middle-end canonicalization would be better.
>
> So the backend fix should at least add 8 patterns to handle that, in that
> case, maybe the middle-end canonicalization would be better.

And I will still submit a patch to make the FMA predicates more consistent.
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:330782a1b6cfe881ad884617ffab441aeb1c2b5c

commit r15-4398-g330782a1b6cfe881ad884617ffab441aeb1c2b5c
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Oct 14 17:16:13 2024 +0800

    Canonicalize (vec_merge (fma op2 op1 op3) op1 mask) to (vec_merge (fma op1 op2 op3) op1 mask).

    For x86 masked fma, there're 2 rtl representations
    1) (vec_merge (fma op2 op1 op3) op1 mask)
    2) (vec_merge (fma op1 op2 op3) op1 mask).

    (define_insn "<avx512>_fmadd_<mode>_mask<round_name>"
      [(set (match_operand:VFH_AVX512VL 0 "register_operand" "=v,v")
            (vec_merge:VFH_AVX512VL
              (fma:VFH_AVX512VL
                (match_operand:VFH_AVX512VL 1 "nonimmediate_operand" "0,0")
                (match_operand:VFH_AVX512VL 2 "<round_nimm_predicate>" "<round_constraint>,v")
                (match_operand:VFH_AVX512VL 3 "<round_nimm_predicate>" "v,<round_constraint>"))
              (match_dup 1)
              (match_operand:<avx512fmaskmode> 4 "register_operand" "Yk,Yk")))]
      "TARGET_AVX512F && <round_mode_condition>"
      "@
       vfmadd132<ssemodesuffix>\t{<round_op5>%2, %3, %0%{%4%}|%0%{%4%}, %3, %2<round_op5>}
       vfmadd213<ssemodesuffix>\t{<round_op5>%3, %2, %0%{%4%}|%0%{%4%}, %2, %3<round_op5>}"
      [(set_attr "type" "ssemuladd")
       (set_attr "prefix" "evex")
       (set_attr "mode" "<MODE>")])

    Here op1 has constraint "0", and the second op1 is (match_dup 1).
    We once tried to replace it with (match_operand:M 5
    "nonimmediate_operand" "0")) to enable more flexibility for pattern
    match and recog, but it triggered an ICE in reload (reload can handle
    at most one operand with "0" constraint).

    So we need to either add 2 patterns in the backend or just do the
    canonicalization in the middle-end.

    gcc/ChangeLog:

            PR middle-end/117072
            * combine.cc (maybe_swap_commutative_operands):
            Canonicalize (vec_merge (fma op2 op1 op3) op1 mask)
            to (vec_merge (fma op1 op2 op3) op1 mask).
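The canonicalization described in this commit message can be illustrated on a toy model. The sketch below is not the combine.cc change itself: `MaskedFma` and `canonicalize_masked_fma` are hypothetical names, operands are reduced to plain integer ids, and negation wrappers are ignored. It only shows the idea that when the merge source appears as the second multiplication operand, the two commutative operands are swapped so it comes first.

```cpp
#include <cassert>
#include <utility>

// Hypothetical toy model, NOT GCC's rtx: operands are plain integer ids.
// Represents (vec_merge (fma mul0 mul1 addend) merge_src mask).
struct MaskedFma {
  int mul0, mul1, addend, merge_src, mask;
};

// If the merge source appears as the second multiplication operand, swap the
// two (commutative) multiplication operands so that it comes first, i.e.
// (vec_merge (fma op2 op1 op3) op1 mask) becomes
// (vec_merge (fma op1 op2 op3) op1 mask).  Returns true on change.
bool canonicalize_masked_fma(MaskedFma &x) {
  if (x.mul1 == x.merge_src && x.mul0 != x.merge_src) {
    std::swap(x.mul0, x.mul1);
    return true;
  }
  return false;
}
```

With a single canonical order, a pattern that ties operand 1 to the merge source through (match_dup 1) only has to describe one shape, which is the motivation the commit message gives for not adding more backend patterns.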
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:edf4db8355dead3413bad64f6a89bae82dabd0ad

commit r15-4399-gedf4db8355dead3413bad64f6a89bae82dabd0ad
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Oct 14 13:09:59 2024 +0800

    Canonicalize (vec_merge (fma: op2 op1 op3) (match_dup 1)) mask) to (vec_merge (fma: op1 op2 op3) (match_dup 1)) mask)

    For masked FMA, there're 2 forms of RTL representation
    1) (vec_merge (fma: op2 op1 op3) op1) mask)
    2) (vec_merge (fma: op1 op2 op3) op1) mask)
    It's because op1 and op2 are commutative in RTL (the second op1 is
    written as (match_dup 1)).

    We once tried to replace (match_dup 1) with (match_operand:VFH_AVX512VL 5
    "nonimmediate_operand" "0,0")), but that triggered an ICE in reload
    (reload can handle at most one operand with "0" constraint).

    So the patch does the canonicalization for the backend part.

    gcc/ChangeLog:

            PR target/117072
            * config/i386/sse.md (<avx512>_fmadd_<mode>_mask<round_name>):
            Relax predicates of fma operands from register_operand to
            nonimmediate_operand.
            (<avx512>_fmadd_<mode>_mask3<round_name>): Ditto.
            (<avx512>_fmsub_<mode>_mask<round_name>): Ditto.
            (<avx512>_fmsub_<mode>_mask3<round_name>): Ditto.
            (<avx512>_fnmadd_<mode>_mask<round_name>): Ditto.
            (<avx512>_fnmadd_<mode>_mask3<round_name>): Ditto.
            (<avx512>_fnmsub_<mode>_mask<round_name>): Ditto.
            (<avx512>_fnmsub_<mode>_mask3<round_name>): Ditto.
            (<avx512>_fmaddsub_<mode>_mask3<round_name>): Ditto.
            (<avx512>_fmsubadd_<mode>_mask<round_name>): Ditto.
            (<avx512>_fmsubadd_<mode>_mask3<round_name>): Ditto.
            (avx512f_vmfmadd_<mode>_mask<round_name>): Ditto.
            (avx512f_vmfmadd_<mode>_mask3<round_name>): Ditto.
            (avx512f_vmfmadd_<mode>_maskz_1<round_name>): Ditto.
            (*avx512f_vmfmsub_<mode>_mask<round_name>): Ditto.
            (avx512f_vmfmsub_<mode>_mask3<round_name>): Ditto.
            (*avx512f_vmfmsub_<mode>_maskz_1<round_name>): Ditto.
            (avx512f_vmfnmadd_<mode>_mask<round_name>): Ditto.
            (avx512f_vmfnmadd_<mode>_mask3<round_name>): Ditto.
            (avx512f_vmfnmadd_<mode>_maskz_1<round_name>): Ditto.
            (*avx512f_vmfnmsub_<mode>_mask<round_name>): Ditto.
            (*avx512f_vmfnmsub_<mode>_mask3<round_name>): Ditto.
            (*avx512f_vmfnmsub_<mode>_maskz_1<round_name>): Ditto.
            (avx10_2_fmaddnepbf16_<mode>_mask3): Ditto.
            (avx10_2_fnmaddnepbf16_<mode>_mask3): Ditto.
            (avx10_2_fmsubnepbf16_<mode>_mask3): Ditto.
            (avx10_2_fnmsubnepbf16_<mode>_mask3): Ditto.
            (fmai_vmfmadd_<mode><round_name>): Swap operands[1] and operands[2].
            (fmai_vmfmsub_<mode><round_name>): Ditto.
            (fmai_vmfnmadd_<mode><round_name>): Ditto.
            (fmai_vmfnmsub_<mode><round_name>): Ditto.
            (*fmai_fmadd_<mode>): Swap operands[1] and operands[2], adjust
            operands[1] predicates from register_operand to
            nonimmediate_operand.
            (*fmai_fmsub_<mode>): Ditto.
            (*fmai_fnmadd_<mode><round_name>): Ditto.
            (*fmai_fnmsub_<mode><round_name>): Ditto.
Tests that now work, but didn't before (24 tests):

gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfmadd132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfmsub132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfnmadd132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfnmsub132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfmadd132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfmsub132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfnmadd132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfnmsub132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfmadd132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfmsub132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfnmadd132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfnmsub132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfmadd132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfmsub132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfnmadd132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma__Float16-1.c scan-assembler-times vfnmsub132ph[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfmadd132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfmsub132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfnmadd132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_double-1.c scan-assembler-times vfnmsub132pd[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfmadd132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfmsub132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfnmadd132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1
unix/-m32: gcc: gcc.target/i386/cond_op_fma_float-1.c scan-assembler-times vfnmsub132ps[ \\t]+[^{\n]*%ymm[0-9]+{%k[1-7]}(?:\n|[ \\t]+#) 1