This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Another look at the ARM division routine

From: Ian Lance Taylor <ian at wasabisystems dot com>
To: gcc-patches at gcc dot gnu dot org
Cc: nico at cam dot org
Date: 11 Nov 2003 14:03:21 -0500
Subject: Another look at the ARM division routine

A litle while ago Steve Woodford at Wasabi wrote a new version of the
ARM division routine for the ARMV5, which Ben Elliston submitted to
gcc-patches:
    http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00271.html

Nicolas Pitre independently submitted a different patch improving the
division routines for all the ARM architectures:
    http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00650.html

Ben Elliston withdrew the Wasabi patch:
    http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00688.html

Since then, we've done some careful timing tests, and with some
more-or-less real code which happens to do a lot of divisions we
measured about a 2.5% improvement using Steve Woodford's code
(actually Nicolas's code with the main loop patched to look like
Steve's).  We haven't found any real code where his division routine
is slower.

Nicolas suggests that Steve's code is slower on average, because it
doesn't handle early termination when the dividend reaches zero.  This
is a very tricky argument to judge in practice: is it better to test
and terminate early, or is it better to save the time required to do
the testing?

Nicolas's code tests every four bits for a zero dividend, and then
loops.  The test adds one instruction, and the loop adds three
instructions.  Is it better to add four instructions for each four
bits, with the chance of leaving the loop, or is it better to simply
unroll the loop completely as Steve's code does?  Another way to ask
the question is: how frequently does the divisor end with four or more
zero bits?

It's also worth noting that Steve's code does exactly the required
number of loop iterations, while Nicolas's rounds up the required
number to a multiple of four.  This suggests the question: how often
is the difference between the highest bits in the divisor and the
dividend a multiple of four?

Our testing indicates that Steve's code performs slightly better in
practice.

Of course, it would also be possible to combine the approaches, by
unrolling Nicolas's loop, or adding early termination tests to Steve's
loop.  I haven't pursued this.

Here is a patch which basically add's Steve's loop to the current code
written by Nicolas.  This is the patch which did better in our timing
tests.

I don't know if this is acceptable in stage 3, although obviously it
won't affect anything other than the ARM.

Ian

Index: lib1funcs.asm
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/arm/lib1funcs.asm,v
retrieving revision 1.26
diff -u -r1.26 lib1funcs.asm
--- lib1funcs.asm	30 Sep 2003 10:30:32 -0000	1.26
+++ lib1funcs.asm	11 Nov 2003 18:28:36 -0000
@@ -232,13 +232,21 @@
 
 #if __ARM_ARCH__ >= 5
 
-	clz	\curbit, \divisor
-	clz	\result, \dividend
-	sub	\result, \curbit, \result
-	mov	\curbit, #1
-	mov	\divisor, \divisor, lsl \result
-	mov	\curbit, \curbit, lsl \result
+	clz	\curbit, \dividend
+	clz	\result, \divisor
+	sub	\curbit, \result, \curbit
+	rsbs	\curbit, \curbit, #31
+	addne	\curbit, \curbit, \curbit, lsl #1
 	mov	\result, #0
+	addne	pc, pc, \curbit, lsl #2
+	nop
+	.set	shift, 32
+	.rept	32
+	.set	shift, shift - 1
+	cmp	\dividend, \divisor, lsl #shift
+	adc	\result, \result, \result
+	subcs	\dividend, \dividend, \divisor, lsl #shift
+	.endr
 	
 #else
 
@@ -271,8 +279,6 @@
 
 	mov	\result, #0
 
-#endif
-
 	@ Division loop
 1:	cmp	\dividend, \divisor
 	subhs	\dividend, \dividend, \divisor
@@ -291,6 +297,8 @@
 	movne	\divisor,  \divisor, lsr #4
 	bne	1b
 
+#endif
+
 .endm
 /* ------------------------------------------------------------------------ */	
 .macro ARM_DIV2_ORDER divisor, order
@@ -330,8 +338,16 @@
 	clz	\order, \divisor
 	clz	\spare, \dividend
 	sub	\order, \order, \spare
-	mov	\divisor, \divisor, lsl \order
-	
+	rsbs	\order, \order, #31
+	addne	pc, pc, \order, lsl #3
+	nop
+	.set	shift, 32
+	.rept	32
+	.set	shift, shift - 1
+	cmp	\dividend, \divisor, lsl #shift
+	subcs	\dividend, \dividend, \divisor, lsl #shift
+	.endr
+
 #else
 
 	mov	\order, #0
@@ -354,8 +370,6 @@
 	addlo	\order, \order, #1
 	blo	1b
 
-#endif
-
 	@ Perform all needed substractions to keep only the reminder.
 	@ Do comparisons in batch of 4 first.
 	subs	\order, \order, #3		@ yes, 3 is intended here
@@ -391,6 +405,9 @@
 4:	cmp	\dividend, \divisor
 	subhs	\dividend, \dividend, \divisor
 5:
+
+#endif
+
 .endm
 /* ------------------------------------------------------------------------ */
 .macro THUMB_DIV_MOD_BODY modulo

Follow-Ups:
- Re: Another look at the ARM division routine
  - From: Nicolas Pitre

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]