This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Another look at the ARM division routine
- From: Ian Lance Taylor <ian at wasabisystems dot com>
- To: gcc-patches at gcc dot gnu dot org
- Cc: nico at cam dot org
- Date: 11 Nov 2003 14:03:21 -0500
- Subject: Another look at the ARM division routine
A litle while ago Steve Woodford at Wasabi wrote a new version of the
ARM division routine for the ARMV5, which Ben Elliston submitted to
gcc-patches:
http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00271.html
Nicolas Pitre independently submitted a different patch improving the
division routines for all the ARM architectures:
http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00650.html
Ben Elliston withdrew the Wasabi patch:
http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00688.html
Since then, we've done some careful timing tests, and with some
more-or-less real code which happens to do a lot of divisions we
measured about a 2.5% improvement using Steve Woodford's code
(actually Nicolas's code with the main loop patched to look like
Steve's). We haven't found any real code where his division routine
is slower.
Nicolas suggests that Steve's code is slower on average, because it
doesn't handle early termination when the dividend reaches zero. This
is a very tricky argument to judge in practice: is it better to test
and terminate early, or is it better to save the time required to do
the testing?
Nicolas's code tests every four bits for a zero dividend, and then
loops. The test adds one instruction, and the loop adds three
instructions. Is it better to add four instructions for each four
bits, with the chance of leaving the loop, or is it better to simply
unroll the loop completely as Steve's code does? Another way to ask
the question is: how frequently does the divisor end with four or more
zero bits?
It's also worth noting that Steve's code does exactly the required
number of loop iterations, while Nicolas's rounds up the required
number to a multiple of four. This suggests the question: how often
is the difference between the highest bits in the divisor and the
dividend a multiple of four?
Our testing indicates that Steve's code performs slightly better in
practice.
Of course, it would also be possible to combine the approaches, by
unrolling Nicolas's loop, or adding early termination tests to Steve's
loop. I haven't pursued this.
Here is a patch which basically add's Steve's loop to the current code
written by Nicolas. This is the patch which did better in our timing
tests.
I don't know if this is acceptable in stage 3, although obviously it
won't affect anything other than the ARM.
Ian
Index: lib1funcs.asm
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/arm/lib1funcs.asm,v
retrieving revision 1.26
diff -u -r1.26 lib1funcs.asm
--- lib1funcs.asm 30 Sep 2003 10:30:32 -0000 1.26
+++ lib1funcs.asm 11 Nov 2003 18:28:36 -0000
@@ -232,13 +232,21 @@
#if __ARM_ARCH__ >= 5
- clz \curbit, \divisor
- clz \result, \dividend
- sub \result, \curbit, \result
- mov \curbit, #1
- mov \divisor, \divisor, lsl \result
- mov \curbit, \curbit, lsl \result
+ clz \curbit, \dividend
+ clz \result, \divisor
+ sub \curbit, \result, \curbit
+ rsbs \curbit, \curbit, #31
+ addne \curbit, \curbit, \curbit, lsl #1
mov \result, #0
+ addne pc, pc, \curbit, lsl #2
+ nop
+ .set shift, 32
+ .rept 32
+ .set shift, shift - 1
+ cmp \dividend, \divisor, lsl #shift
+ adc \result, \result, \result
+ subcs \dividend, \dividend, \divisor, lsl #shift
+ .endr
#else
@@ -271,8 +279,6 @@
mov \result, #0
-#endif
-
@ Division loop
1: cmp \dividend, \divisor
subhs \dividend, \dividend, \divisor
@@ -291,6 +297,8 @@
movne \divisor, \divisor, lsr #4
bne 1b
+#endif
+
.endm
/* ------------------------------------------------------------------------ */
.macro ARM_DIV2_ORDER divisor, order
@@ -330,8 +338,16 @@
clz \order, \divisor
clz \spare, \dividend
sub \order, \order, \spare
- mov \divisor, \divisor, lsl \order
-
+ rsbs \order, \order, #31
+ addne pc, pc, \order, lsl #3
+ nop
+ .set shift, 32
+ .rept 32
+ .set shift, shift - 1
+ cmp \dividend, \divisor, lsl #shift
+ subcs \dividend, \dividend, \divisor, lsl #shift
+ .endr
+
#else
mov \order, #0
@@ -354,8 +370,6 @@
addlo \order, \order, #1
blo 1b
-#endif
-
@ Perform all needed substractions to keep only the reminder.
@ Do comparisons in batch of 4 first.
subs \order, \order, #3 @ yes, 3 is intended here
@@ -391,6 +405,9 @@
4: cmp \dividend, \divisor
subhs \dividend, \dividend, \divisor
5:
+
+#endif
+
.endm
/* ------------------------------------------------------------------------ */
.macro THUMB_DIV_MOD_BODY modulo