[PATCH v2] tree-optimization/110279- Check for nested FMA chains in reassoc

Wed Jul 19 09:05:22 GMT 2023

On Tue, Jul 11, 2023 at 5:00 AM Di Zhao OS via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Attached is an updated version of the patch.
>
> Based on Philipp's review, some changes:
>
> 1. Defined new enum fma_state to describe the state of FMA candidates
>    for a list of operands. (Since the tests seems simple after the
>    change, I didn't add predicates on it.)
> 2. Changed return type of convert_mult_to_fma_1 and convert_mult_to_fma
>    to tree, to remove the in/out parameter.
> 3. Added description of return value values of rank_ops_for_fma.

I'll note that rank_ops_for_fma works on the single-use addition chain only so
there might be FMA ops "elsewhere" in the dependence chain that are not
in 'ops'.

You are using defer_p = false for the fma_deferring_state so I wonder why
you need this machinery at all?  Wouldn't the restriction only apply when
we'd actually deferred a FMA generation?  I'm CCing Martin who did this work.

But what I'm curious about is that if any of the new conditions trigger then
you leave the ops chain alone - but it _could_ already be in "bad" shape,
is there anything we could do to improve ordering?

Also

   if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
     {
+      if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)),
+                   param_avoid_fma_max_bits))
+       /* Avoid re-arrange to produce less FMA chains that can be slow.  */
+       return FMA_STATE_MULTIPLE;
+
       /* Put no-mult ops and mult ops alternately at the end of the
         queue, which is conducive to generating more FMA and reducing the
         loss of FMA when breaking the chain.  */
@@ -6829,9 +6909,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
          if (opindex > 0)
            opindex--;
        }
-      return true;
+      return FMA_STATE_MULTIPLE;

so we end up parallel rewriting in any case and just avoid
"optimizing" the ops list here.
>From the PR it looks like without the rewriting we are lucky that the
FMA generation
heuristic to avoid cross backedge FMA dependences doesn't trigger
without associating
(but parallel rewrite is still good)?

For the NESTED case we avoid parallel rewriting completely, independent on
param_avoid_fma_max_bits - it seems from the PR we want more FMAs here
and the parallel rewriting makes it worse?  But I don't see exactly how.

I think these are all a bit fragile heuristics also give that reassoc
works on single-use
chains only.  The more we interwind FMA creation in widen_mult and associating
for FMA in reassoc the more I think that reassoc itself should form the FMAs?
The passes are unfortunately quite a bit separated.

Can you produce testcases for the two problematical cases in the PR?

That said, the added heuristics are not looking universally good to me without
better motivation.

> ---
> gcc/ChangeLog:
>

Missing

           PR tree-optimization/110279

so bugzilla picks up the commit.

>         * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new parameter
>         check_only_p. Changed return type to tree.
>         (struct fma_transformation_info): Moved to header.
>         (class fma_deferring_state): Moved to header.
>         (convert_mult_to_fma): Added new parameter check_only_p. Changed
>         return type to tree.
>         * tree-ssa-math-opts.h (struct fma_transformation_info): Moved from .cc.
>         (class fma_deferring_state): Moved from .cc.
>         (convert_mult_to_fma): Add function decl.
>         * tree-ssa-reassoc.cc (enum fma_state): Defined new enum to describe
>         the state of FMA candidates for a list of operands.
>         (rewrite_expr_tree_parallel): Changed boolean parameter to enum type.
>         (rank_ops_for_fma): Return enum fma_state.
>         (reassociate_bb): Avoid rewriting to parallel if nested FMAs are found.
>
> Thanks,
> Di Zhao
>
>