This is the mail archive of the
mailing list for the GCC project.
Re: [PATCH] Generate fused widening multiply-and-accumulate operations only when the widening multiply has single use
- From: Yufeng Zhang <Yufeng dot Zhang at arm dot com>
- To: Richard Henderson <rth at redhat dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, "ams at codesourcery dot com" <ams at codesourcery dot com>
- Date: Thu, 24 Oct 2013 15:52:20 +0100
- Subject: Re: [PATCH] Generate fused widening multiply-and-accumulate operations only when the widening multiply has single use
- Authentication-results: sourceware.org; auth=none
- References: <5265A43E dot 7060507 at arm dot com> <526869D0 dot 40905 at redhat dot com>
On 10/24/13 01:29, Richard Henderson wrote:
On 10/21/2013 03:01 PM, Yufeng Zhang wrote:
This patch changes the widening_mul pass to fuse the widening multiply with
accumulate only when the multiply has single use. The widening_mul pass
currently does the conversion regardless of the number of the uses, which can
cause poor code-gen in cases like the following:
typedef int ArrT ;
foo (ArrT Arr, int Idx)
Arr[Idx][Idx] = 1;
Arr[Idx + 10][Idx] = 2;
On AArch64, after widening_mul, the IR is like
_2 = (long unsigned int) Idx_1(D);
_3 = Idx_1(D) w* 40;<----
_5 = Arr_4(D) + _3;
*_5[Idx_1(D)] = 1;
_8 = WIDEN_MULT_PLUS_EXPR<Idx_1(D), 40, 400>;<----
_9 = Arr_4(D) + _8;
*_9[Idx_1(D)] = 2;
Where the arrows point, there are redundant widening multiplies.
So they're redundant. Why does this imply poor code-gen?
If a target has more than one FMA unit, then the target might
be able to issue the computation for _3 and _8 in parallel.
Even if the target only has one FMA unit, but the unit is
pipelined, the computations could overlap.
Thanks for the review.
I think it is a fair point that redundancy doesn't always indicate poor
code-gen, but there are a few reasons that I think this patch makes sense.
Firstly, the generated WIDEN_MULT_PLUS_EXPR can prevents other
optimization passes from analyzing the IR sequence effectively. Like in
the above example, the widening multiply can be part of a larger common
sub-expression (Arr_4(D) + Idx_1(D) w* 40 + Idx_1(D) * 4); by blindly
merging the multiply with accumulate, it makes the recognition of the
common sub-expression rather difficult.
Secondly, it is generally more expensive (in terms of both latency and
energy) to multiply than accumulate. Even though there are multiple MAC
units* or well-working pipeline, it is not always the case that multiple
widening multiply-and-accumulate instructions can be scheduled
(statically or dynamically) together. Merged multiply-and-accumulate
can add to the register pressure as well. So maybe it is better to let
the backend do the conversion (when the multiply has more uses).
Also, isn't it in general that new common sub-expression (widening
multiply in this case) shall not be created in the gimple IR when there
is no obvious benefit? I can sense that it may be a difference case for
the floating-point multiply-and-accumulate, as on one hand the
arithmetic is usually for pure data-processing instead of other purposes
like address calculation (as what its integer peers may do), and on the
other hand, on micro-architectures where there are more FMA units than
FADD units, it probably makes more sense to generate more FMA
instructions in order to take advantage of the throughput capacity.
The area where this patch tries to tackle is only about the integer
widening multiply-and-accumulate, and it doesn't seem beneficial to me
to merge the widening multiply with accumulate so aggressively; you
could argue that other optimization passes shall be extended to be able
to handle WIDEN_MULT_PLUS_EXPR and its friends; while it is an option
I'm considering, it is more likely to be a longer-term solution.
*) I think I had abused the word 'fused' in my previous emails. It
seems like 'fused' is more often used to refer to the floating-point
multiply-and-accumulate with a single rounding.