This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Another look at the ARM division routine

On 11 Nov 2003, Ian Lance Taylor wrote:

> Nicolas's code tests every four bits for a zero dividend, and then
> loops.  The test adds one instruction, and the loop adds three
> instructions.  Is it better to add four instructions for each four
> bits, with the chance of leaving the loop, or is it better to simply
> unroll the loop completely as Steve's code does?

Actually I just reused the same loop that was there before.  I mainly
optimized the code surounding that loop which is now pretty optimal, but the 
loop itself isn't that impressive.

> Another way to ask
> the question is: how frequently does the divisor end with four or more
> zero bits?

Right.  And that might not be as frequent as I thought.

> It's also worth noting that Steve's code does exactly the required
> number of loop iterations, while Nicolas's rounds up the required
> number to a multiple of four.  This suggests the question: how often
> is the difference between the highest bits in the divisor and the
> dividend a multiple of four?

Well that's a compromize to lower the overhead of testing for early 
termination and actually branching back.  For example, this could be rounded 
up to 8 bits for a maximum of 3 tests and branches, at which point the test 
and branch doesn't become that significant but more unneeded comparisons are 

> Our testing indicates that Steve's code performs slightly better in
> practice.

Then I'd agree with your patch.

> Of course, it would also be possible to combine the approaches, by
> unrolling Nicolas's loop, or adding early termination tests to Steve's
> loop.  I haven't pursued this.

Well, to use the early termination the proposed patch would need to be
reworked with a cursor bit and a test for every comparison, like:

	mov	\curbit, #1
	cmp	\dividend, \divisor, lsl #shift
	orrcs	\result, \result, \curbit, lsl #shift
	subcss	\dividend, \dividend, \divisor, lsl #shift
	beq	out

thereby increasing code size and cycle count.  Maybe you can benchmark this 
so to convince us it's a bad idea?

> Here is a patch which basically add's Steve's loop to the current code
> written by Nicolas.  This is the patch which did better in our timing
> tests.
> I don't know if this is acceptable in stage 3, although obviously it
> won't affect anything other than the ARM.

Just like my single bit DImode shift patch...  ;-)


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]