This is the mail archive of the
mailing list for the GCC project.
Re: Another look at the ARM division routine
- From: Nicolas Pitre <nico at cam dot org>
- To: Ian Lance Taylor <ian at wasabisystems dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Tue, 11 Nov 2003 16:09:46 -0500 (EST)
- Subject: Re: Another look at the ARM division routine
On 11 Nov 2003, Ian Lance Taylor wrote:
> Nicolas's code tests every four bits for a zero dividend, and then
> loops. The test adds one instruction, and the loop adds three
> instructions. Is it better to add four instructions for each four
> bits, with the chance of leaving the loop, or is it better to simply
> unroll the loop completely as Steve's code does?
Actually I just reused the same loop that was there before. I mainly
optimized the code surounding that loop which is now pretty optimal, but the
loop itself isn't that impressive.
> Another way to ask
> the question is: how frequently does the divisor end with four or more
> zero bits?
Right. And that might not be as frequent as I thought.
> It's also worth noting that Steve's code does exactly the required
> number of loop iterations, while Nicolas's rounds up the required
> number to a multiple of four. This suggests the question: how often
> is the difference between the highest bits in the divisor and the
> dividend a multiple of four?
Well that's a compromize to lower the overhead of testing for early
termination and actually branching back. For example, this could be rounded
up to 8 bits for a maximum of 3 tests and branches, at which point the test
and branch doesn't become that significant but more unneeded comparisons are
> Our testing indicates that Steve's code performs slightly better in
Then I'd agree with your patch.
> Of course, it would also be possible to combine the approaches, by
> unrolling Nicolas's loop, or adding early termination tests to Steve's
> loop. I haven't pursued this.
Well, to use the early termination the proposed patch would need to be
reworked with a cursor bit and a test for every comparison, like:
mov \curbit, #1
cmp \dividend, \divisor, lsl #shift
orrcs \result, \result, \curbit, lsl #shift
subcss \dividend, \dividend, \divisor, lsl #shift
thereby increasing code size and cycle count. Maybe you can benchmark this
so to convince us it's a bad idea?
> Here is a patch which basically add's Steve's loop to the current code
> written by Nicolas. This is the patch which did better in our timing
> I don't know if this is acceptable in stage 3, although obviously it
> won't affect anything other than the ARM.
Just like my single bit DImode shift patch... ;-)