[PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass

Richard Biener richard.guenther@gmail.com
Mon May 22 13:15:04 GMT 2023


On Wed, May 17, 2023 at 3:05 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> > I think to make a difference you need to hit the number of parallel fadd/fmul
> > the pipeline can perform.  I don't think issue width is ever a problem for
> > chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> >
>
> Yes, for x86 backend, fadd , fmul and fma have the same TP meaning they should have the same width.
> The current implementation is reasonable  /* reassoc int, fp, vec_int, vec_fp.  */.
>
> > That said, I think iff then we should try to improve
> > rewrite_expr_tree_parallel rather than adding a new function.  For example
> > for the case with equal rank operands we can try to sort adds first.  I can't
> > convince myself that rewrite_expr_tree_parallel honors ranks properly
> > quickly.
> >
>
> I rewrite this patch, there are mainly two changes:
> 1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it instead of rewrite_expr_tree_parallel. The following example shows that the sequence generated by the this patch is better.
> 2. Put no-mult ops and mult ops alternately at the end of the queue, which is conducive to generating more fma and reducing the loss of FMA when breaking the chain.
>
> With these two changes, GCC can break the chain with width = 2 and generates 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350  without any params.
>
> ------------------------------------------------------------------------------------------------------------------
> Source code: g + h + j + s + m + n+a+b +e  (https://godbolt.org/z/G8sb86n84)
> Compile options: -Ofast -mfpmath=sse -mfma
> Width = 3 was chosen for reassociation
> -----------------------------------------------------------------------------------------------------------------
> Old rewrite_expr_tree_parallel generates:
>   _6 = g_8(D) + h_9(D);       ------> parallel 0
>   _3 = s_11(D) + m_12(D);  ------> parallel 1
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 2
>   _1 = b_15(D) + e_16(D);  -----> Parallel 3, This is not necessary, and it is not friendly to FMA.
>   _4 = _1 + _2;
>   _7 = _4 + _5;
>   _17 = _6 + _7;
>   return _17;
>
> When the width = 3,  we need 5 cycles here.
> ---------------------------------------------first end---------------------------------------------------------
> Rewrite the old rewrite_expr_tree_parallel (3 sets in parallel) generates:
>
>   _3 = s_11(D) + m_12(D);  ------> parallel 0
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 1
>   _1 = b_15(D) + e_16(D);   ------> parallel 2
>   _4 = _1 + _2;
>   _6 = _4 + _5;
>   _7 = _6 + h_9(D);
>   _17 = _7 + g_8(D);
>   return _17;
>
> When the width = 3, we need 5 cycles here.
> ---------------------------------------------second end-------------------------------------------------------
> Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel generates:
>
>   _3 = s_11(D) + m_12(D);
>   _6 = _3 + g_8(D);
>   _2 = n_13(D) + a_14(D);
>   _5 = _2 + h_9(D);
>   _1 = b_15(D) + e_16(D);
>   _4 = _1 + j_10(D);
>   _7 = _4 + _5;
>   _17 = _7 + _6;
>   return _17;
>
> When the width = 3, we need 4 cycles here.
> --------------------------------------------third end-----------------------------------------------------------

Yes, so what I was saying is that I doubt rewrite_expr_tree_parallel
is optimal - you show
that for the specific example rewrite_expr_tree_parallel_for_fma is
better.  I was arguing
we want a single function, whether we single out leaves with
multiplications or not.

And we want documentation that shows the strategy will result in optimal latency
(I think we should not sacrifice latency just for the sake of forming
more FMAs).

Richard.

>
> Thanks,
> Lili.
>


More information about the Gcc-patches mailing list