This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PATCH] Generate fused widening multiply-and-accumulate operations only when the widening multiply has single use

From: Yufeng Zhang <Yufeng dot Zhang at arm dot com>
To: Richard Henderson <rth at redhat dot com>
Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, "ams at codesourcery dot com" <ams at codesourcery dot com>
Date: Thu, 24 Oct 2013 15:52:20 +0100
Subject: Re: [PATCH] Generate fused widening multiply-and-accumulate operations only when the widening multiply has single use
Authentication-results: sourceware.org; auth=none
References: <5265A43E dot 7060507 at arm dot com> <526869D0 dot 40905 at redhat dot com>

On 10/24/13 01:29, Richard Henderson wrote:
> On 10/21/2013 03:01 PM, Yufeng Zhang wrote:
>> This patch changes the widening_mul pass to fuse the widening multiply with
>> accumulate only when the multiply has single use.  The widening_mul pass
>> currently does the conversion regardless of the number of the uses, which can
>> cause poor code-gen in cases like the following:
>>
>> typedef int ArrT [10][10];
>>
>> void
>> foo (ArrT Arr, int Idx)
>> {
>>    Arr[Idx][Idx] = 1;
>>    Arr[Idx + 10][Idx] = 2;
>> }
>>
>> On AArch64, after widening_mul, the IR is like
>>
>>    _2 = (long unsigned int) Idx_1(D);
>>    _3 = Idx_1(D) w* 40;                             <----
>>    _5 = Arr_4(D) + _3;
>>    *_5[Idx_1(D)] = 1;
>>    _8 = WIDEN_MULT_PLUS_EXPR<Idx_1(D), 40, 400>;    <----
>>    _9 = Arr_4(D) + _8;
>>    *_9[Idx_1(D)] = 2;
>>
>> Where the arrows point, there are redundant widening multiplies.
>
> So they're redundant.  Why does this imply poor code-gen?
>
> If a target has more than one FMA unit, then the target might
> be able to issue the computation for _3 and _8 in parallel.
>
> Even if the target only has one FMA unit, but the unit is
> pipelined, the computations could overlap.

Thanks for the review.

I think it is a fair point that redundancy doesn't always indicate poor code-gen, but there are a few reasons why I think this patch makes sense.

Firstly, the generated WIDEN_MULT_PLUS_EXPR can prevent other optimization passes from analyzing the IR sequence effectively. As in the above example, the widening multiply can be part of a larger common sub-expression (Arr_4(D) + Idx_1(D) w* 40 + Idx_1(D) * 4); blindly merging the multiply with the accumulate makes the recognition of the common sub-expression rather difficult.
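
To make the common sub-expression concrete, here is a minimal sketch in C of the byte offsets behind the two stores in foo. It is only an illustration, not part of the patch: the function names common_offset, second_offset and check_offsets are made up for this example, and the constants 40 and 400 from the IR correspond to a 4-byte int.

```c
#include <assert.h>
#include <stddef.h>

enum { COLS = 10 };
/* Size of one row of ArrT in bytes: 40 with a 4-byte int. */
#define ROW_BYTES ((ptrdiff_t)(COLS * sizeof(int)))

/* Offset of Arr[Idx][Idx]: Idx * 40 + Idx * 4 in the IR's terms.
   This is the part that both addresses share. */
ptrdiff_t common_offset(int idx)
{
    return (ptrdiff_t)idx * ROW_BYTES + (ptrdiff_t)idx * (ptrdiff_t)sizeof(int);
}

/* Offset of Arr[Idx + 10][Idx]: the shared part plus a constant
   10 rows (400 bytes with a 4-byte int). */
ptrdiff_t second_offset(int idx)
{
    return common_offset(idx) + 10 * ROW_BYTES;
}

/* Compare the hand-written offsets with the compiler's own address
   arithmetic.  arr must have at least idx + 11 rows, 0 <= idx < COLS. */
void check_offsets(int (*arr)[COLS], int idx)
{
    const char *base = (const char *)arr;
    assert((const char *)&arr[idx][idx] - base == common_offset(idx));
    assert((const char *)&arr[idx + 10][idx] - base == second_offset(idx));
}
```

Written this way, the second address is the first one plus a constant, so only one multiply by the row size is needed. Once that multiply is folded into a WIDEN_MULT_PLUS_EXPR for the second store, the shared part is no longer visible as a separate value.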

Secondly, it is generally more expensive (in terms of both latency and energy) to multiply than to accumulate. Even when there are multiple MAC units* or a well-working pipeline, it is not always the case that multiple widening multiply-and-accumulate instructions can be scheduled (statically or dynamically) together. A merged multiply-and-accumulate can add to the register pressure as well. So maybe it is better to let the backend do the conversion (when the multiply has more uses).

Also, isn't it generally the case that a new common sub-expression (the widening multiply in this case) should not be created in the gimple IR when there is no obvious benefit? I can sense that it may be a different case for the floating-point multiply-and-accumulate: on one hand, the arithmetic is usually for pure data processing rather than for other purposes like address calculation (as its integer peers may be), and on the other hand, on micro-architectures where there are more FMA units than FADD units, it probably makes more sense to generate more FMA instructions in order to take advantage of the throughput capacity.

The area this patch tries to tackle is only the integer widening multiply-and-accumulate, and it doesn't seem beneficial to me to merge the widening multiply with the accumulate so aggressively. You could argue that other optimization passes should be extended to handle WIDEN_MULT_PLUS_EXPR and its friends; while that is an option I'm considering, it is more likely to be a longer-term solution.
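
As a rough model of the policy, the decision can be sketched in C as below. This is a toy under stated assumptions, not the code in the patch and not GCC's data structures: struct toy_value and may_merge_into_macc are hypothetical names, and the real pass works on GIMPLE SSA names rather than a plain use counter.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for an SSA value. */
struct toy_value {
    bool is_widening_mult;  /* value is defined by a widening multiply */
    unsigned num_uses;      /* number of statements that read the value */
};

/* Return true if an addition that consumes `mult` may be merged with it
   into a widening multiply-and-accumulate under the single-use policy. */
bool may_merge_into_macc(const struct toy_value *mult)
{
    return mult->is_widening_mult && mult->num_uses == 1;
}
```

In the foo example, the result of Idx w* 40 feeds both address computations, so its use count is 2 and the multiply would be left unmerged, which keeps it available to the other passes.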


Regards,
Yufeng

*) I think I had abused the word 'fused' in my previous emails. It seems that 'fused' is more often used to refer to the floating-point multiply-and-accumulate with a single rounding.
