This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Reduce cost of "a*11" take 2

From: Jan Hubicka <jh at suse dot cz>
To: gcc-patches at gcc dot gnu dot org, rth at redhat dot com, roger at eyesopen dot com
Date: Thu, 5 Jan 2006 14:25:14 +0100
Subject: Reduce cost of "a*11" take 2

Hi,
as discussed earlier, this improves the multiplication-by-11 sequence
for Athlon CPUs by artificially reducing the cost of the lea instruction.
The solution is still not optimal (the AMD SWOG lists a better sequence,
but that one requires preventing the lea from being constructed, so that
the shift can execute in parallel with another instruction, which we are
not quite capable of).

The patch saves a few KB on SPEC binaries overall and causes a
reproducible 0.5% speedup on SPECint, and a small speedup on SPECfp too.
I promised a speedup on sixtrack; unfortunately that is not happening, as
one needs to both fix this problem *and* reduce the branch cost to 3 to
see the speedup (doing only one of these unfortunately doesn't bring
much).  Given that, I no longer think it is that important for 4.1, but I
would still love to see it in 4.2 at least.  It might still be 4.1
material, as it is pretty nonintrusive and fixes the obvious *11 regression.

Bootstrapped/regtested i686-linux, OK?
Honza

/* { dg-do compile { target i?86-*-* } } */
/* { dg-options "-O2 -march=athlon" } */
/* Multiplication by 11 should expand as a sequence of two leas.
   Note that this is not the optimal solution: it is possible to multiply
   by 11 with 3 cycles of latency instead of 4, as discussed in the AMD
   manual, but one needs to prevent the combiner from synthesizing the lea
   instruction, which we don't do at the moment.  */
/* { dg-final { scan-assembler "lea" } } */
/* { dg-final { scan-assembler-not "sub" } } */
/* { dg-final { scan-assembler-not "add" } } */
/* { dg-final { scan-assembler-not "neg" } } */
int a(int b)
{
	return b*11;
}
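
To make the "two leas" expectation in the testcase concrete, here is a minimal sketch in plain C. It assumes the usual x86 lea form, base + index * scale with scale in {1, 2, 4, 8}. The helper names (lea, mul11_two_leas, mul11_with_sub) are hypothetical and not part of the patch, and the subtraction variant is only one plausible alternative, not necessarily the exact sequence GCC emitted before the change.

```c
/* Hypothetical model of the x86 lea address computation:
   base + index * scale, where scale is 1, 2, 4 or 8.  */
static int lea(int base, int index, int scale)
{
    return base + index * scale;
}

/* b * 11 as two dependent leas, the shape the testcase scans for.  */
int mul11_two_leas(int b)
{
    int t = lea(b, b, 4);   /* t = b + 4*b = 5*b */
    return lea(b, t, 2);    /* b + 2*(5*b) = 11*b */
}

/* One possible alternative that needs a subtraction: 11*b = 12*b - b.
   This is the kind of sequence the scan-assembler-not "sub" check
   is meant to reject.  */
int mul11_with_sub(int b)
{
    int t = lea(b, b, 2);   /* t = b + 2*b = 3*b */
    t = t * 4;              /* stands in for a left shift by 2: 12*b */
    return t - b;           /* 12*b - b = 11*b */
}
```

Both functions return the same value for any b that does not overflow; they differ only in which instructions a compiler would need to realize them.
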

2006-01-05  Zdenek Dvorak  <dvorakz@suse.cz>
	    Jan Hubicka  <jh@suse.cz>
	    H.J. Lu  <hongjiu.lu@intel.com>
	    Roger Sayle  <roger@eyesopen.com>

	* config/i386/i386.c (athlon_cost, k8_cost): Reduce cost of lea
	instruction to speed up "a*11".
Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c	(revision 109378)
--- config/i386/i386.c	(working copy)
*************** struct processor_costs k6_cost = {
*** 379,385 ****
  static const
  struct processor_costs athlon_cost = {
    COSTS_N_INSNS (1),			/* cost of an add instruction */
!   COSTS_N_INSNS (2),			/* cost of a lea instruction */
    COSTS_N_INSNS (1),			/* variable shift costs */
    COSTS_N_INSNS (1),			/* constant shift costs */
    {COSTS_N_INSNS (5),			/* cost of starting multiply for QI */
--- 379,392 ----
  static const
  struct processor_costs athlon_cost = {
    COSTS_N_INSNS (1),			/* cost of an add instruction */
!   /* The latency of the lea instruction really is 2, however setting the cost
!      to 2 makes the multiplication by constant expansion code use minus, which
!      prevents combining the sequence into lea, expands code size and results
!      in longer latency too, since minus is more difficult to register allocate.
!      Setting the cost slightly higher than 1 cycle results in the best overall
!      latency in multiplication by numbers 0..256.
!      See for instance the sequence produced by "a*11".  */
!   COSTS_N_INSNS (1) + 1,		/* cost of a lea instruction */
    COSTS_N_INSNS (1),			/* variable shift costs */
    COSTS_N_INSNS (1),			/* constant shift costs */
    {COSTS_N_INSNS (5),			/* cost of starting multiply for QI */
*************** struct processor_costs athlon_cost = {
*** 431,437 ****
  static const
  struct processor_costs k8_cost = {
    COSTS_N_INSNS (1),			/* cost of an add instruction */
!   COSTS_N_INSNS (2),			/* cost of a lea instruction */
    COSTS_N_INSNS (1),			/* variable shift costs */
    COSTS_N_INSNS (1),			/* constant shift costs */
    {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
--- 438,445 ----
  static const
  struct processor_costs k8_cost = {
    COSTS_N_INSNS (1),			/* cost of an add instruction */
!   /* lea cost is artificially lowered for same reason as for athlon_cost.  */
!   COSTS_N_INSNS (1) + 1,		/* cost of a lea instruction */
    COSTS_N_INSNS (1),			/* variable shift costs */
    COSTS_N_INSNS (1),			/* constant shift costs */
    {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
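
As a rough illustration of what "COSTS_N_INSNS (1) + 1" means numerically, the sketch below assumes COSTS_N_INSNS (N) expands to N * 4, as in GCC's rtl.h, so that one simple instruction costs 4 units. The two cost functions are a hypothetical toy model for comparing candidate sequences for "a*11". They are not GCC's actual multiplication-synthesis code, and costing a subtract like an add is an assumption.

```c
/* Assumption: mirrors the rtl.h definition, one simple insn = 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

enum {
    ADD_COST     = COSTS_N_INSNS (1),      /* 4 */
    SHIFT_COST   = COSTS_N_INSNS (1),      /* 4 */
    OLD_LEA_COST = COSTS_N_INSNS (2),      /* 8, the value before the patch */
    NEW_LEA_COST = COSTS_N_INSNS (1) + 1   /* 5, the value after the patch */
};

/* Hypothetical total for a sequence made of two leas.  */
int cost_two_leas(int lea_cost)
{
    return 2 * lea_cost;
}

/* Hypothetical total for one lea, one constant shift and one subtract
   (the subtract is costed like an add here).  */
int cost_lea_shift_sub(int lea_cost)
{
    return lea_cost + SHIFT_COST + ADD_COST;
}
```

In this toy model the old lea cost gives 16 units for both sequences, so neither is preferred, while the new cost gives 10 for the two-lea sequence against 13 for the lea/shift/subtract one. That is consistent with the intent described in the patch comment, though the real decision depends on how the expansion code enumerates and costs its candidates.
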

Follow-Ups:
- Re: Reduce cost of "a*11" take 2
  - From: Andrew Pinski
- [PATCH] Reduce cost of Athlon multiplication sequences
  - From: Roger Sayle
