This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[PATCH] Tweak cost of lea on Pentium4


The following patch fixes a problem with the parameterization of the x86's
lea instruction on the Pentium4.  The issue is that the i386 backend's
target macro TARGET_DECOMPOSE_LEA results in GCC over-estimating the cost
of "lea" on the Pentium4.  This macro was introduced by Jan back in 2001,
http://gcc.gnu.org/ml/gcc/2001-11/msg00308.html, in an attempt to generate
better code for Pentium class machines.  As mentioned in that post, the
"lea" instruction is slower on Pentium4 than on earlier cores, where it
had typically cost only a single cycle.

The solution proposed by Jan was to introduce TARGET_DECOMPOSE_LEA, so
that the ix86_rtx_costs function would ignore the existence of the "lea"
instruction patterns on affected cores.  Unfortunately, this approach is
far too Draconian, resulting in "(plus (mult (reg) (const_int 4)) (reg))"
costing the equivalent of an addition *and* a full multiplication.  With
a multiplication on pentium4 taking 15 cycles [or COSTS_N_INSNS (15) in
i386.c], the cost of using an "lea" is currently a huge 16 cycles!
Of course, 16 cycles is far too high.  At worst, Intel's microcode would
decompose an "lea" into a shift and an add, costing at most 5 cycles.
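
To make the arithmetic above concrete, here is a small sketch.  It is
not code from i386.c: "struct toy_costs" and the three helper functions
are hypothetical names, and the struct only mirrors the handful of
processor_costs fields discussed here, with costs expressed in cycles
rather than COSTS_N_INSNS units.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the fields of processor_costs
   that matter here.  Values are in cycles.  */
struct toy_costs
{
  int add;          /* cost of an add instruction */
  int lea;          /* cost of a lea instruction */
  int shift_const;  /* constant shift cost */
  int mult_init;    /* cost of starting a multiply */
};

/* Cost of (plus (mult (reg) (const_int 4)) (reg)) when the lea
   patterns are ignored: a full multiply plus an add.  */
static int
cost_ignoring_lea (const struct toy_costs *c)
{
  return c->mult_init + c->add;
}

/* Worst-case estimate if the lea is decomposed into a constant shift
   followed by an add.  */
static int
cost_shift_and_add (const struct toy_costs *c)
{
  return c->shift_const + c->add;
}

/* Cost when the lea field itself is used, as proposed.  */
static int
cost_using_lea_field (const struct toy_costs *c)
{
  assert (c->lea > 0);
  return c->lea;
}
```

With the pentium4 numbers from the patch (add 1, constant shift 4,
multiply 15), the first helper gives the 16 cycles complained about
above and the second gives the 5 cycle worst case.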

Note that TARGET_DECOMPOSE_LEA doesn't affect any of the i386's code
generation routines, so GCC can still emit these instructions, it just
heavily penalizes their use.  As a result synth_mult almost never uses
a sequence containing an "lea" when tuning for the pentium4.
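
As an illustration of the kind of sequence that gets penalized, here is
a sketch of a multiplication by 45 written as two lea-style steps
(45 = 9 * 5).  The helper "lea" below is a hypothetical C model of the
address arithmetic base + index*scale, not a GCC function, and the
choice of 45 is mine rather than taken from any benchmark.

```c
#include <assert.h>

/* Hypothetical helper modelling the arithmetic performed by
   "lea (base,index,scale)".  The scale must be 1, 2, 4 or 8.  */
static unsigned int
lea (unsigned int base, unsigned int index, unsigned int scale)
{
  assert (scale == 1 || scale == 2 || scale == 4 || scale == 8);
  return base + index * scale;
}

/* x * 45 as two lea-style steps, using 45 = 9 * 5.  */
static unsigned int
mul45_via_lea (unsigned int x)
{
  unsigned int t = lea (x, x, 8);  /* t = x + x*8 = 9*x */
  return lea (t, t, 4);            /* t + t*4 = 5*t = 45*x */
}
```

If each lea-shaped step is costed at 16 cycles, the two-step sequence
totals 32 and loses to a 15 cycle multiply.  At 3 cycles each it totals
6 and wins.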


My proposed fix below is to use the "lea" field of the i386 backend's
processor_costs struct for the purpose for which it was intended.  By
increasing this parameter to a realistic value, there's no reason to
"tweak" ix86_rtx_costs specially for the pentium4.  The patch below uses
the value 3, as clearly X*4+Y can be performed by three additions.
This also agrees with experimentation, where a value of three was shown
to produce results similar to Intel's compilers in a benchmark that
counted the number of "lea" instructions used to multiply an integer by
all coefficients between 1 and 10,000.  I believe values between two and
five are reasonable should anyone want to provide a better approximation.
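
For anyone wanting to check the "three additions" claim, a minimal
sketch follows.  The function names are hypothetical and the code only
demonstrates the arithmetic identity.  It says nothing about how the
Pentium4 actually executes an lea.

```c
#include <assert.h>

/* Compute x*4 + y using exactly three additions.  Unsigned arithmetic
   is used so that overflow wraps and stays well defined.  */
static unsigned int
scaled4_plus (unsigned int x, unsigned int y)
{
  unsigned int t = x + x;  /* addition 1: 2*x */
  t = t + t;               /* addition 2: 4*x */
  return t + y;            /* addition 3: 4*x + y */
}

/* Compare the three-addition form against the direct expression.  */
static void
check_scaled4_plus (unsigned int x, unsigned int y)
{
  assert (scaled4_plus (x, y) == x * 4 + y);
}
```

With the pentium4 add cost of 1, three additions give the value of 3
used for the lea field.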

The second advantage of this approach is that it allows us to return the
"size" of an lea when optimizing for size, even when tuning for the P4.


The following code has been tested on i686-pc-linux-gnu, with a full
"make bootstrap", all default languages, and regression tested with a
top-level "make -k check" with no new failures.

OK for mainline?


2004-06-19  Roger Sayle  <roger@eyesopen.com>

	* config/i386/i386.c (pentium4_cost): Increase "lea" cost from 1 to 3.
	(ix86_rtx_costs) <ASHIFT, PLUS>: Consider ix86_cost->lea even when
	TARGET_DECOMPOSE_LEA.


Index: config/i386/i386.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/i386/i386.c,v
retrieving revision 1.675
diff -c -3 -p -r1.675 i386.c
*** config/i386/i386.c	11 Jun 2004 18:41:42 -0000	1.675
--- config/i386/i386.c	19 Jun 2004 17:11:17 -0000
*************** struct processor_costs k8_cost = {
*** 417,423 ****
  static const
  struct processor_costs pentium4_cost = {
    1,					/* cost of an add instruction */
!   1,					/* cost of a lea instruction */
    4,					/* variable shift costs */
    4,					/* constant shift costs */
    {15, 15, 15, 15, 15},			/* cost of starting a multiply */
--- 417,423 ----
  static const
  struct processor_costs pentium4_cost = {
    1,					/* cost of an add instruction */
!   3,					/* cost of a lea instruction */
    4,					/* variable shift costs */
    4,					/* constant shift costs */
    {15, 15, 15, 15, 15},			/* cost of starting a multiply */
*************** ix86_rtx_costs (rtx x, int code, int out
*** 14904,14910 ****
  	      return false;
  	    }
  	  if ((value == 2 || value == 3)
- 	      && !TARGET_DECOMPOSE_LEA
  	      && ix86_cost->lea <= ix86_cost->shift_const)
  	    {
  	      *total = COSTS_N_INSNS (ix86_cost->lea);
--- 14904,14909 ----
*************** ix86_rtx_costs (rtx x, int code, int out
*** 15007,15014 ****
      case PLUS:
        if (FLOAT_MODE_P (mode))
  	*total = COSTS_N_INSNS (ix86_cost->fadd);
!       else if (!TARGET_DECOMPOSE_LEA
! 	       && GET_MODE_CLASS (mode) == MODE_INT
  	       && GET_MODE_BITSIZE (mode) <= GET_MODE_BITSIZE (Pmode))
  	{
  	  if (GET_CODE (XEXP (x, 0)) == PLUS
--- 15006,15012 ----
      case PLUS:
        if (FLOAT_MODE_P (mode))
  	*total = COSTS_N_INSNS (ix86_cost->fadd);
!       else if (GET_MODE_CLASS (mode) == MODE_INT
  	       && GET_MODE_BITSIZE (mode) <= GET_MODE_BITSIZE (Pmode))
  	{
  	  if (GET_CODE (XEXP (x, 0)) == PLUS


Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833

