This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: optimizing % CONST_INT


   From: Richard Henderson <rth@redhat.com>
   Date: Wed, 17 Apr 2002 23:59:41 -0700

[ gcc-patches, please reference the thread beginning at:

   http://gcc.gnu.org/ml/gcc/2002-04/msg00835.html

  and also this thread:

   http://gcc.gnu.org/ml/gcc-2002-04/msg00904.html

  to understand the origin of said changes.  ]

   On Wed, Apr 17, 2002 at 06:28:53PM -0700, Dan Nicolaescu wrote:
   >   > or sparcv7 sets the cost of a branch so
   >   > low that we don't think the transformation is helpful.
   > 
   > Maybe fold should take care of this? (I don't know). 
   
   It does.  The answer is in fact C.  Sparc doesn't define
   BRANCH_COST at all, which makes it default to 1, which 
   means that basically nothing will get converted unless it
   is a one-for-one swap.
   
   To fix this, you need to find out what the branch taken
   and miss penalty is for at least a couple of sparc
   implementations and define the macro accordingly.

I've checked in the patch below; it seems to solve all of the
issues.  The original testcase:

int foo (int v)
{
	return v % 16;
}

now gives the following assembly with -m32 -mcpu=supersparc:

foo:
	sra	%o0, 31, %g2
	srl	%g2, 28, %g2
	add	%o0, %g2, %g2
	and	%g2, -16, %g2
	retl
	sub	%o0, %g2, %o0

Incidentally, in this case -m32 -mtune=ultrasparc gives identical
code; there are no scheduling improvements possible here.
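
For anyone not fluent in SPARC assembly, here is a rough C rendering
of the -m32 sequence above.  This is only a sketch: the function name
mod16_shift is made up for illustration, and it assumes a 32-bit int
and that >> on a negative signed value is an arithmetic shift (which
is implementation-defined in C, though it is what sra does).

```c
#include <stdint.h>

/* Hypothetical name; mirrors the five-instruction sequence step by step. */
int mod16_shift(int v)
{
    /* sra %o0, 31: all ones if v is negative, zero otherwise
       (assumes arithmetic right shift of signed values). */
    int32_t sign = v >> 31;

    /* srl %g2, 28: logical shift turns the mask into 15 or 0. */
    uint32_t bias = (uint32_t)sign >> 28;

    /* add, then and with -16: v rounded toward zero to a multiple of 16. */
    int32_t rounded = (v + (int32_t)bias) & -16;

    /* sub: what is left over is the C remainder, with the sign of v. */
    return v - rounded;
}
```

The bias of 15 for negative inputs is what makes the rounding go
toward zero instead of toward minus infinity, so the result agrees
with C's % for negative v without needing a branch.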

For -m64 we eat some extra overhead because we have to sign
extend the result to 64-bits:

foo:
	sra	%o0, 31, %g2
	srl	%g2, 28, %g2
	add	%o0, %g2, %g2
	and	%g2, -16, %g2
	sub	%o0, %g2, %o0
	retl
	sra	%o0, 0, %o0

This is still not quite as good as what SunPRO generates, because
the second shift is hard to schedule on the UltraSPARC, which can only
do shifts in one of its integer units (UltraSPARC-III lacks this
limitation, btw).  But this is better than what we were outputting
before.  For reference, the SunPRO sequence for this is:

	sra	%o0, 31, %o1
	and	%o1, 15, %o1
	add	%o0, %o1, %o1
	andn	%o1, 15, %o1
	retl
	 sub	%o0, %o1, %o0
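
In C terms, the SunPRO sequence looks roughly like the sketch below.
Again the name mod16_mask is hypothetical, and the same assumptions
apply (32-bit int, arithmetic right shift of negative values).  The
only differences from GCC's output are that the bias comes from an
and instead of a second shift, and that andn with 15 replaces the and
with -16.

```c
#include <stdint.h>

/* Hypothetical name; follows the SunPRO instruction order. */
int mod16_mask(int v)
{
    /* sra %o0, 31: all ones if v is negative, zero otherwise
       (assumes arithmetic right shift of signed values). */
    int32_t sign = v >> 31;

    /* and %o1, 15: same 15-or-0 bias, but without using the shifter. */
    int32_t bias = sign & 15;

    /* add, then andn with 15: clear the low four bits. */
    int32_t rounded = (v + bias) & ~15;

    /* sub: the remainder, with the sign of v. */
    return v - rounded;
}
```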

So to complete the loop, the next thing to do is to teach
GCC that:

	sra	%o0, 31, %g2
	srl	%g2, 28, %g2

is less optimal than:

	sra	%o0, 31, %g2
	and	%g2, 15, %o1

at least when sparc_cpu == PROCESSOR_ULTRASPARC.  Is there an easy
way to do that, perhaps with RTX_COST?

In this specific test, the cycle count is identical (because there is
a register dependency in the second instruction) but for more involved
cases this would not be true.

Anyways, here is the patch I installed.

2002-04-18  David S. Miller  <davem@redhat.com>

	* config/sparc/sparc.h (BRANCH_COST): Define.

	* fold-const.c (BRANCH_COST): Don't provide default here, expr.h
	does it.

--- fold-const.c.~1~	Fri Feb 22 03:50:47 2002
+++ fold-const.c	Thu Apr 18 15:19:41 2002
@@ -109,10 +109,6 @@
 static tree fold_binary_op_with_conditional_arg 
   PARAMS ((enum tree_code, tree, tree, tree, int));
 							 
-#ifndef BRANCH_COST
-#define BRANCH_COST 1
-#endif
-
 #if defined(HOST_EBCDIC)
 /* bit 8 is significant in EBCDIC */
 #define CHARMASK 0xff
--- config/sparc/sparc.h.~1~	Thu Apr 18 01:54:32 2002
+++ config/sparc/sparc.h	Thu Apr 18 15:44:50 2002
@@ -2619,6 +2619,23 @@
     || (CLASS1) == FPCC_REGS || (CLASS2) == FPCC_REGS)		\
    ? (sparc_cpu == PROCESSOR_ULTRASPARC ? 12 : 6) : 2)
 
+/* Provide the cost of a branch.  For pre-v9 processors we use
+   a value of 3 to take into account the potential annulling of
+   the delay slot (which ends up being a bubble in the pipeline slot)
+   plus a cycle to take into consideration the instruction cache
+   effects.
+
+   On v9 and later, which have branch prediction facilities, we set
+   it to the depth of the pipeline as that is the cost of a
+   mispredicted branch.
+
+   ??? Set to 9 when PROCESSOR_ULTRASPARC3 is added  */
+
+#define BRANCH_COST \
+	((sparc_cpu == PROCESSOR_V9 \
+	  || sparc_cpu == PROCESSOR_ULTRASPARC) \
+	 ? 7 : 3)
+
 /* Provide the costs of a rtl expression.  This is in the body of a
    switch on CODE.  The purpose for the cost of MULT is to encourage
    `synth_mult' to find a synthetic multiply when reasonable.

