[Patch,AVR]: PR49687 (better widening 32-bit mul)

Mon Jul 25 17:06:00 GMT 2011

Weddington, Eric wrote:
> 
>> Eric, can you review the assembler routines and say if such reuse is ok or if you'd prefer a
>> speed-optimized version of __mulsi3 like in the current libgcc?
> 
> Hi Johann,
> 
> Typically a penalty on speed is preferred over a penalty on code size. Do you already have
> information on how it compares on code size with the old routines?
> 
> Eric

The old sizes are

62 __mulsi3
26 __mulhisi3
22 __umulhisi3
10 __xmulhisi3

where __[u]mulhisi3 drags in __xmulhisi3, and the insns don't combine
with constants.

The new implementation has more fragments. The indented modules are the
ones dragged in, i.e. used, by the respective function:

12 __mulhisi3
         __umulhisi3
         __usmulhisi3_tail

30 __umulhisi3

02 __usmulhisi3
10 __usmulhisi3_tail

20 __muluhisi3
         __umulhisi3

08 __mulohisi3
04 __mulshisi3
         __muluhisi3

30 __mulsi3
         __muluhisi3

This means that code using only __mulsi3 will need 30+30+20 = 80 bytes
(+18 compared to the old 62).

If all functions are used they occupy 116 bytes (-4 compared to the old
120), so they actually save a little space when all of them are used, with
the benefit that they can also one-extend, extend 32 = 16*32 as well as
32 = 16*16, and work for small (17-bit signed) constants.
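
The size arithmetic above can be re-checked with a small C sketch. The
struct, the table and the function names are hypothetical helpers for this
illustration, not part of libgcc. Only the dependency edges that the
listing states unambiguously are modelled; __mulohisi3 and __mulshisi3 are
entered with their sizes only, because the listing groups them.

```c
#include <stddef.h>
#include <string.h>

#define N_MODULES 8
#define MAX_DEPS  2

/* One libgcc fragment: its size in bytes and the modules it drags in. */
struct module {
    const char *name;
    int size;
    const char *deps[MAX_DEPS];
};

/* Sizes and dependencies as given in the listing of the new routines.
   Assumption: no edges are recorded for __mulohisi3 and __mulshisi3. */
static const struct module new_modules[N_MODULES] = {
    { "__mulhisi3",        12, { "__umulhisi3", "__usmulhisi3_tail" } },
    { "__umulhisi3",       30, { NULL, NULL } },
    { "__usmulhisi3",       2, { NULL, NULL } },
    { "__usmulhisi3_tail", 10, { NULL, NULL } },
    { "__muluhisi3",       20, { "__umulhisi3", NULL } },
    { "__mulohisi3",        8, { NULL, NULL } },
    { "__mulshisi3",        4, { NULL, NULL } },
    { "__mulsi3",          30, { "__muluhisi3", NULL } },
};

/* Return the table index of a module, or -1 if it is unknown. */
static int find_module(const char *name)
{
    for (int i = 0; i < N_MODULES; i++)
        if (strcmp(new_modules[i].name, name) == 0)
            return i;
    return -1;
}

/* Sum the size of module i and of everything it drags in, once each. */
static int visit(int i, int *seen)
{
    if (i < 0 || seen[i])
        return 0;
    seen[i] = 1;
    int total = new_modules[i].size;
    for (int d = 0; d < MAX_DEPS; d++)
        if (new_modules[i].deps[d] != NULL)
            total += visit(find_module(new_modules[i].deps[d]), seen);
    return total;
}

/* Bytes linked in when only the named function is used. */
int footprint(const char *name)
{
    int seen[N_MODULES] = { 0 };
    return visit(find_module(name), seen);
}

/* Bytes linked in when every fragment is used. */
int total_size(void)
{
    int total = 0;
    for (int i = 0; i < N_MODULES; i++)
        total += new_modules[i].size;
    return total;
}
```

With this table, footprint("__mulsi3") gives the 80 bytes quoted above and
total_size() gives 116, against 62+26+22+10 = 120 for the old routines.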

__umulhisi3 reads:

DEFUN __umulhisi3
    mul     A0, B0
    movw    C0, r0
    mul     A1, B1
    movw    C2, r0
    mul     A0, B1
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    mul     A1, B0
    add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    ret
ENDF __umulhisi3
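
For readers who do not speak AVR assembler, here is a rough C model of what
the routine computes: a 16x16 -> 32 bit unsigned product built from four
8x8 -> 16 bit products, taken in the same order as the mul instructions.
The function name umulhisi3_model is made up for this sketch, and the
mapping of A, B and C onto machine registers is not modelled. The clr of
__zero_reg__ in the listing is needed because mul writes its result to
r1:r0 and r1 is the zero register in avr-gcc; the C model needs no
equivalent.

```c
#include <stdint.h>

/* Model of __umulhisi3: C = A * B with A, B 16 bits wide and C 32 bits. */
uint32_t umulhisi3_model(uint16_t a, uint16_t b)
{
    uint8_t a0 = (uint8_t)(a & 0xff), a1 = (uint8_t)(a >> 8);
    uint8_t b0 = (uint8_t)(b & 0xff), b1 = (uint8_t)(b >> 8);

    /* mul A0,B0 ; movw C0,r0   -> low  word of C
       mul A1,B1 ; movw C2,r0   -> high word of C */
    uint32_t c = (uint32_t)(a0 * b0) | ((uint32_t)(a1 * b1) << 16);

    /* mul A0,B1 ; add C1,r0 ; adc C2,r1 ; adc C3,zero
       The 16-bit product is added at byte offset 1, carry runs into C3. */
    c += (uint32_t)(a0 * b1) << 8;

    /* mul A1,B0 ; same add/adc/adc tail for the second cross product. */
    c += (uint32_t)(a1 * b0) << 8;

    return c;
}
```

The two cross products are the only terms that need carry propagation,
which is why the add/adc/clr/adc group appears twice in the assembler.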

It could be compressed to the following sequence, i.e.
24 bytes instead of 30, but I think that's going too far in
squeezing the last byte out of the code:

DEFUN __umulhisi3
    mul     A0, B0
    movw    C0, r0
    mul     A1, B1
    movw    C2, r0
    mul     A0, B1
    rcall   1f
    mul     A1, B0
1:  add     C1, r0
    adc     C2, r1
    clr     __zero_reg__
    adc     C3, __zero_reg__
    ret
ENDF __umulhisi3
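
The trick in the compressed variant is that the code at label 1: runs
twice: once as a subroutine reached by rcall, whose ret returns to the
second mul, and once by falling through, where the same ret returns to the
caller. A rough C model, with the shared tail written as a helper; both
function names are invented for this sketch:

```c
#include <stdint.h>

/* Shared tail (label 1: in the listing): add a 16-bit partial product
   to C at byte offset 1, with the carry running into the top byte. */
static void add_cross_product(uint32_t *c, uint16_t product)
{
    *c += (uint32_t)product << 8;
}

/* Model of the compressed __umulhisi3. */
uint32_t umulhisi3_compact_model(uint16_t a, uint16_t b)
{
    uint8_t a0 = (uint8_t)(a & 0xff), a1 = (uint8_t)(a >> 8);
    uint8_t b0 = (uint8_t)(b & 0xff), b1 = (uint8_t)(b >> 8);

    /* mul A0,B0 ; movw C0,r0 ; mul A1,B1 ; movw C2,r0 */
    uint32_t c = (uint32_t)(a0 * b0) | ((uint32_t)(a1 * b1) << 16);

    /* mul A0,B1 ; rcall 1f   -> tail runs, ret comes back here */
    add_cross_product(&c, (uint16_t)(a0 * b1));

    /* mul A1,B0 ; fall through into 1:   -> ret returns to the caller */
    add_cross_product(&c, (uint16_t)(a1 * b0));

    return c;
}
```

The saving of 6 bytes comes from dropping one copy of the four 2-byte tail
instructions and adding one 2-byte rcall; the price is an extra call and
return on every multiplication.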

In the absence of real-world code that uses 32-bit arithmetic, I trust
my intuition that code size will decrease in general ;-)

Tiny examples are sometimes misleading because of additional moves from
unpleasant register allocation, but that's a different story...

Johann