[PATCH] Reduce cost of Athlon multiplication sequences

Thu Jan 5 17:51:00 GMT 2006

Hi Jan,

Forgive me Cc'ing gcc-patches on my reply, but the following details
may help RTH and others understand why this patch is beneficial.

On Thu, 5 Jan 2006, Jan Hubicka wrote:
> I don't see how preventing the split can improve score given
> that lea and shift+mov sequence have precisely the same latency.

Alas, they don't, once you consider register renaming and register
allocation.  Have a look at the multiplication sequences in
the AMD optimization guides for 7, 17 and 31.  The Athlon has three
integer units, so the shift and the move can be executed in parallel.
i.e. the move is considered free in all of AMD's cycle counts.

For example, their recommendation for multiply be seven is:

	mov reg2, reg1
	shl reg1, 3
	sub reg1, reg2

which they consider has a latency of ony two cycles!  Whilst it
is true that a straight move followed by a shift that immediately
uses the result of that shift possibly has a latency of two, it's
possible to rename registers (even in hardware), such that shift
and move are executed concurrently, and cheaper than Athlon's lea.
These sequences are certainly never more expensive latency
wise than a two-cycle lea.

For example, in the sequence we now generate for multiplication
by 11, the results of a move are never used by the immediately
following instruction.

> Are you sure the new generated seuqences do have lower latency than AMD
> listed ones?  I will try to check them on my little benchmark tonight.

The AMD recommended sequence for 15 is:

	lea reg1, [reg1+reg1*2]
	lea reg1, [reg1+reg1*4]

which everyone agrees is latency 4.

The GCC generated sequence is:

	mov reg2, reg1
	shl reg1, 4
	sub reg1, reg2

which depending upon how you count moves is either latency 2 or
latency 3.  By analogy to the multiply by 7 cycle count from the
AMD manual above, AMD would consider this sequence two cycles!
Either way, everyone agrees its faster than the 4 cycle sequence
they recommend.

The discrepancy is that their "FINDMUL" program is attempting to
minimize register usage.  To quote their documentation "Sequences
requiring no temporary register have been favored over ones
requiring a temporary register even if the latency is higher".

Roger
--