This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)

From: Georg-Johann Lay <avr at gjlay dot de>
To: "Weddington, Eric" <Eric dot Weddington at atmel dot com>
Cc: gcc-patches at gcc dot gnu dot org, Anatoly Sokolov <aesok at post dot ru>, Denis Chertykov <chertykov at gmail dot com>, Richard Henderson <rth at redhat dot com>
Date: Mon, 25 Jul 2011 18:29:03 +0200
Subject: Re: [Patch,AVR]: PR49687 (better widening 32-bit mul)
References: <4E2D3821.3090007@gjlay.de> <8D64F155F1C88743BFDC71288E8E2DA8032C1E5F@csomb01.corp.atmel.com>

Weddington, Eric wrote:
> 
>> Eric, can you review the assembler routines and say if such reuse is ok or if you'd prefer a
>> speed-optimized version of __mulsi3 like in the current libgcc?
> 
> Hi Johann,
> 
> Typically a penalty on speed is preferred over a penalty on code size. Do you already have
> information on how it compares on code size with the old routines?
> 
> Eric

The old sizes are

62 __mulsi3
26 __mulhisi3
22 __umulhisi3
10 __xmulhisi3

where __[u]mulhisi3 drags in __xmulhisi3, and the insns don't combine
with constants.

The new implementation has more fragments. The indented modules are the
ones dragged in, i.e. used, by the respective function:

12 __mulhisi3
         __umulhisi3
         __usmulhisi3_tail

30 __umulhisi3

02 __usmulhisi3
10 __usmulhisi3_tail

20 __muluhisi3
         __umulhisi3

08 __mulohisi3
04 __mulshisi3
         __muluhisi3

30 __mulsi3
         __muluhisi3

This means that a pure __mulsi3 will take 30+30+20 = 80 bytes (+18
compared to the old 62).

If all functions are used, they occupy 116 bytes (-4 compared to the
old 120), so they actually save a little space when all of them are
used, with the benefit that they can also one-extend, handle
32 = 16*32 as well as 32 = 16*16, and work for small (17-bit signed)
constants.
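To make the size arithmetic above easy to re-check, here is a small C sketch that tallies the byte counts. The numbers are copied from the two listings; the struct and function names are made up for illustration and are not part of libgcc.

```c
#include <stddef.h>

/* Hypothetical bookkeeping type; sizes in bytes are taken from the mail. */
struct module {
    const char *name;
    unsigned bytes;
};

static const struct module old_modules[] = {
    {"__mulsi3", 62}, {"__mulhisi3", 26},
    {"__umulhisi3", 22}, {"__xmulhisi3", 10},
};

static const struct module new_modules[] = {
    {"__mulhisi3", 12}, {"__umulhisi3", 30},
    {"__usmulhisi3", 2}, {"__usmulhisi3_tail", 10},
    {"__muluhisi3", 20}, {"__mulohisi3", 8},
    {"__mulshisi3", 4}, {"__mulsi3", 30},
};

/* Sum the sizes of n modules. */
static unsigned total_bytes(const struct module *m, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += m[i].bytes;
    return sum;
}

/* All old routines linked together. */
unsigned old_total(void)
{
    return total_bytes(old_modules, sizeof old_modules / sizeof old_modules[0]);
}

/* All new routines linked together. */
unsigned new_total(void)
{
    return total_bytes(new_modules, sizeof new_modules / sizeof new_modules[0]);
}

/* New __mulsi3 plus what it drags in: __muluhisi3 and __umulhisi3. */
unsigned new_mulsi3_footprint(void)
{
    return 30 + 20 + 30;
}
```

With these tables, old_total() is 120, new_total() is 116, and new_mulsi3_footprint() is 80, i.e. 18 bytes more than the old 62-byte __mulsi3.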

__umulhisi3 reads:

DEFUN __umulhisi3
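    mul     A0, B0              ; r1:r0 = A0 * B0
    movw    C0, r0              ; C1:C0 = low partial product
    mul     A1, B1              ; r1:r0 = A1 * B1
    movw    C2, r0              ; C3:C2 = high partial product
    mul     A0, B1              ; first cross product, weight 2^8
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__        ; mul clobbered r1; clr leaves carry intact
    adc     C3, __zero_reg__    ; propagate carry into the top byte
    mul     A1, B0              ; second cross product, weight 2^8
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__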
    ret
ENDF __umulhisi3
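For readers not fluent in AVR assembler, the routine above can be modelled in C roughly as follows. This is only a sketch of the arithmetic (four 8x8 -> 16 partial products summed with the right weights), not the libgcc code; the function name umulhisi3_model is invented for this illustration.

```c
#include <stdint.h>

/* Hypothetical C model of the 16x16 -> 32 unsigned widening multiply. */
uint32_t umulhisi3_model(uint16_t a, uint16_t b)
{
    uint8_t a0 = (uint8_t)(a & 0xff), a1 = (uint8_t)(a >> 8);
    uint8_t b0 = (uint8_t)(b & 0xff), b1 = (uint8_t)(b >> 8);

    /* mul A0,B0 / movw C0,r0: low product goes to bytes 0..1 */
    uint32_t c = (uint32_t)(a0 * b0);

    /* mul A1,B1 / movw C2,r0: high product goes to bytes 2..3 */
    c |= (uint32_t)(a1 * b1) << 16;

    /* mul A0,B1 and mul A1,B0: cross products are added at byte 1,
       with the carry rippling up into byte 3 */
    c += (uint32_t)(a0 * b1) << 8;
    c += (uint32_t)(a1 * b0) << 8;

    return c;
}
```

For any inputs the result should equal (uint32_t)a * b, for example 0xFFFF * 0xFFFF = 0xFFFE0001.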

It could be compressed to the following sequence, i.e.
24 bytes instead of 30, but I think that goes too far in
squeezing the last byte out of the code:

DEFUN __umulhisi3
    mul     A0, B0
    movw    C0, r0
    mul     A1, B1
    movw    C2, r0
    mul     A0, B1
    rcall   1f                  ; add A0*B1 via the shared tail, ret comes back here
    mul     A1, B0              ; then fall through into the tail for A1*B0
1:  add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    ret
ENDF __umulhisi3

In the absence of real-world code that uses 32-bit arithmetic, I trust
my intuition that code size will decrease in general ;-)

Tiny examples are sometimes misleading because of additional moves from
unpleasant register allocation, but that's a different story...

Johann

