This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Superscalar tweaks to synth_mult

From: Roger Sayle <roger at eyesopen dot com>
To: Richard Henderson <rth at redhat dot com>
Cc: gcc-patches at gcc dot gnu dot org
Date: Mon, 30 Aug 2004 14:09:21 -0600 (MDT)
Subject: Re: [PATCH] Superscalar tweaks to synth_mult

On Fri, 27 Aug 2004, Richard Henderson wrote:
> On Fri, Aug 20, 2004 at 11:15:19AM -0600, Roger Sayle wrote:
> > The following patch is an improvement to GCC's synth_mult to improve
> > performance on superscalar and pipelined processors.  In the current
> > implementation, expmed.c's synth_mult evaluates/scores sequences of
> > shifts and adds based purely on their total rtx_cost.  When two or
> > more alternate sequences have exactly the same total rtx_cost, the
> > current implementation arbitrarily chooses the first.
>
> While at a high level this seems a fantastic idea, it's not clear to
> me how you're computing latency.

My apologies for the delay in replying, our internet connectivity has
been down for a few days.

The theory behind the definition of "latency" used in my patch is straight
forward, though I suspect that things may perhaps have been obfuscated by
my poor choice of terminology and the assumptions made by the existing
synth_mult code.

The approach is to think of these computations as their underlying trees
or DAGs.  Using this representation, the multiplication by 252 example
given in my post can be graphically depicted by the following forms:

	       <<		      -
	      /  \		     / \
	     -    2		    /   \
	    / \			  <<     <<
	  <<   x		 /  \   /  \
	 /  \			x    8 x    4
	x    6

In these terms, the minimum possible latency (on a hypothetical
infinitely parallel processor) is given by the tree height [the
length of the longest path from the root to a leaf], also known
as the critical path.  In GCC-land, the "height" or cost of each
operator is given by it's rtx_cost.  Notice, that in both of the
above expressions, the total rtx_cost is the same, but the critical
path lengths are different.

These definitions apply to arbitrary expression trees, and this
formalism was introduced in the late 60s and early 70s:

J.L. Baer and D.P Bovet, "Compilation of Arithmetic Expressions for
Parallel Computation", 1968.

R.P. Brent, D.J. Kuck and G.M. Maruyama, "The Parallel Evaluation of
Arithmetic Expressions without Division.", IEEE Trans. Comput. C-22,
pp. 532-534, May 1973.

R.P. Brent, "The Parallel Evaluation of General Arithmetic
Expressions", Journal of the ACM, Vol. 21, No. 2, p. 201-206,
April 1974.

Other useful metrics from these expression trees include the Sethi-Ullman
numbering of an expression tree [which gives an approximation of the
number of registers required to evaluate the expression], or the tree
width [which is an indication of number of parallel operations].

Skip forward thirty years, and consider the algorithm used by expmed.c's
synth_mult.  All synthetic multiplications are constrained to be a
"struct algorithm" composed of primitive steps.

>> These are the operations:
>> alg_zero             total := 0;
>> alg_m                total := multiplicand;
>> alg_shift            total := total * coeff
>> alg_add_t_m2         total := total + multiplicand * coeff;
>> alg_sub_t_m2         total := total - multiplicand * coeff;
>> alg_add_factor       total := total * coeff + total;
>> alg_sub_factor       total := total * coeff - total;
>> alg_add_t2_m         total := total * coeff + multiplicand;
>> alg_sub_t2_m         total := total * coeff - multiplicand;
>>
>> The first operand must be either alg_zero or alg_m.  */

Notice that to avoid an NP-complete search, synth_mult only produces
near linear trees where results are never reused.  The fact that
synth_mult expressions are trees and not DAGs simplifies things.
Also notice, that this linearization means that the computation
of the sub-"total" can be assumed to be the critical path.

So for almost all algorithm steps, the critical path length
increases by the same amount as the total rtx_cost.  The
two exceptions, however, are the steps "alg_add_t_m2" and
"alg_sub_t_m2", in which the evaluation of the multiplication
can theoretically be overlapped with the evaluation of the
sub total.

The fly-in-the-ointment, so to speak, are the processors
such as ARM and x86 that have specialized shift-and-add
instructions, such as the x86's "lea".  I decided to be
conservative and treat shift-and-add/shift-and-sub as a
single atomic instruction as far as scheduling goes [this
is probably conservative for the Pentium which decomposes
these insns into uops].

But this should help explain the following logic:

> > ! 	  op_cost = add_cost[mode] + shift_cost[mode][m];
> > ! 	  if (shiftadd_cost[mode][m] < op_cost)
> > ! 	    {
> > ! 	      op_cost = shiftadd_cost[mode][m];
> > ! 	      op_latency = op_cost;
> > ! 	    }
> > ! 	  else
> > ! 	    op_latency = add_cost[mode];
>
> This, for instance, seems ... off ... somehow.
> And most of the rest of the places seem to use the same value
> for cost and latency.

Exactly.  The array shiftadd_cost holds the cost on a unified
shift-and-add instruction, which is used if its cheaper than
an add_cost plus a shift_cost.  If we use a unified instruction,
we assume op_latency == op_cost.  Otherwise, the total rtx_cost
increases by add_cost + shift_cost, but the critical path length
or minimum latency increases by just add_cost.

There's a tiny bit of code after the hunk quoted above that
correctly determines the tree height near the leaves of the
expression tree.

For the step, t := a + b*c, where b and c are known to be leaves,
the minimum possible latency is given by the expression:

latency(t) = rtx_cost(+) + MAX(latency(a), rtx_cost(*))

Hopefully, this explains/rationalizes things.  I'm happy to add
some more comments to the code, if you think that'll help clarify
further.  Having both this and my previous post searchable in the
gcc-patches archive should certainly help future generations.

Roger
--

Follow-Ups:
- Re: [PATCH] Superscalar tweaks to synth_mult
  - From: Richard Henderson

References:
- Re: [PATCH] Superscalar tweaks to synth_mult
  - From: Richard Henderson

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]