[PATCH] Overflow-trapping integer arithmetic routines7code

Wed Nov 25 13:18:09 GMT 2020

Jeff Law <law@redhat.com> wrote:

> On 11/10/20 10:21 AM, Stefan Kanthak wrote:
>
>>> So with all that in mind, I installed everything except the bits which
>>> have the LIBGCC2_BAD_CODE ifdefs after testing on the various crosses.
>>> If you could remove the ifdefs on the abs/mult changes and resubmit them
>>> it'd be appreciated.
>> Done.
> THanks. I'm doing some testing on the abs changes right now. They look
> pretty reasonable, though they will tend to generate worse code on
> targets that don't handle overflow arithmetic and testing all that well.

OTOH the changes yield better code on targets which have a proper
overflow handling, and may benefit from eventual improvements in the
compiler/code generator itself on all targets.

> H8 would be an example that generates more efficient code with the
> original implementation. But I don't think the H8 is terribly
> representative of embedded ports these days, though it's likely to get
> better at precisely these kinds of scenarios in the near future as
> certain modernization patches continue landing.
>
> Something like visium is much better for evaluation as it's got a modern
> structure which includes reasonable exposure of carry/overflow. Not
> surprisingly the visium port generates better code with your improved
> abs routines.
>
>
>
>
>
>> libgcc2.patch
>>
>> --- -/libgcc/libgcc2.c
>> +++ +/libgcc/libgcc2.c
>>
>>  #ifdef L_mulvdi3
>>  DWtype
>> -__mulvDI3 (DWtype u, DWtype v)
>> +__mulvDI3 (DWtype a, DWtype b)
> [ ... ]
> You've essentially collapsed the function into:
>
> DWtype
> __mulvDI3 (DWtype a, DWtype b)
> {
>  DWtype t;
>  const DWunion u = {.ll = a};
>  const DWunion v = {.ll = b};
>  DWunion w = {.ll = __umulsidi3 (u.s.low, v.s.low)};
>
>  if (__builtin_mul_overflow (u.s.low, v.s.high, &t)
>   || __builtin_add_overflow (t, w.s.high, &w.s.high)
>   || __builtin_mul_overflow (u.s.high, v.s.low, &t)
>   || __builtin_add_overflow (t, w.s.high, &w.s.high))
>    abort ();
>
>  return w.ll
> }
>
> The first thing to note is that I believe that using "DWtype" for "t" is
> going to inhibit the overflow check for the __builtin_mul_overflow calls
> that store their result into "t". Your inputs are both 32bits and the
> output is 64 bits. Multiplying 2 32bit numbers will not overflow a
> 64bit result. As a result those overflow checks are actually removed in
> the code we generate for mulvDI3, defeating the entire purpose of using
> mulvdi3 instead of muldi3.

ARGH, my fault(s).
I wanted to show an alternative, but less efficient implementation which
I guarded with the #ifdef, but garbled the main implementation.

> Second while the result of multiplying u.s.high with v.s.high can never
> change the result in w.ll, it does affect the overall overflow status
> that we need to compute.
>
> I think (but have not yet verified) that to fix these problems we need
> to change the type of "t" to SItype and add a check like:
>
> if (u.s.high && v.s.high)
> abort ();

Correct in both cases!
See <https://skanthak.homepage.t-online.de/gcc.html#case17> for my
original code.
I rewrote it to minimise the changes to the current implementation and
introduced 2 silly errors on the way.

> Also note that your approach always does 3 multiplies, which can be very
> expensive on some architectures. The existing version in libgcc2.c will
> often just do one or two multiplies. So while your implementation looks
> a lot simpler, I suspect its often much slower. And on targets without
> 32bit multiplication support, it's probably horribly bad.

All (current) processors I know have super-scalar architecture and a
hardware multiplier, they'll execute the 3 multiplies in parallel.
In my tests on i386 and AMD64 (Core2, Skylake, Ryzen/EPYC), the code
generated for <https://skanthak.homepage.t-online.de/gcc.html#case17>
as well as the (almost) branch-free code shown below and in
<https://skanthak.homepage.t-online.de/gcc.html#case13> runs 10% to 25%
faster than the __mulvDI3 routine from libgcc: the many conditional
branches of your current implementation impair performance more than 3
multiplies!

> My inclination is to leave the overflow checking double-word multiplier
> as-is.

See but <https://gcc.gnu.org/pipermail/gcc/2020-October/234048.html> ff.

> Though I guess you could keep the same structure as the existing
> implementation which tries to avoid unnecessary multiplies and still use
> the __builtin_{add,mul}_overflow to simplify the code a bit less
> aggressively.

Tertium datur: take a look at the __udivmodDI4 routine.
It has separate code paths for targets without hardware divider, and
also for targets where the hardware divider needs a normalized dividend.
I therefore propose to add separate code paths for targets with and
without hardware multiplier for the __mulvDI3 routine too, guarded by a
preprocessor macro which tells whether a target has a hardware multiplier.

regards
Stefan

--- umulvdi3.s ---
# Copyright © 2004-2020, Stefan Kanthak <stefan.kanthak@nexgo.de>

        .arch   generic32
        .code32
        .intel_syntax noprefix
        .text
                                # [esp+16] = high dword of multiplier
                                # [esp+12] = low dword of multiplier
                                # [esp+8] = high dword of multiplicand
                                # [esp+4] = low dword of multiplicand
__umulvdi3:
        push    ebx
        xor     ebx, ebx        # ebx = 0

        mov     ecx, [esp+12]   # ecx = high dword of multiplicand
        cmp     ebx, ecx
        sbb     edx, edx        # edx = (high dword of multiplicand == 0)  ? 0 : -1
                                #     = (multiplicand < 2**32) ? 0 : -1
        mov     eax, [esp+20]   # eax = high dword of multiplier
        cmp     ebx, eax
        sbb     ebx, ebx        # ebx = (high dword of multiplier == 0) ? 0 : -1
                                #     = (multiplier < 2**32) ? 0 : -1
        and     ebx, edx        # ebx = (multiplicand < 2**32)
                                #     & (multiplier < 2**32) ? 0 : -1
                                #     = (product < 2**64) ? 0 : -1

        mov     edx, [esp+8]    # edx = low dword of multiplicand
        mul     edx             # edx:eax = high dword of multiplier
                                #         * low dword of multiplicand
        adc     ebx, ebx        # ebx = (product < 2**64) ? 0 : *

        xchg    ecx, eax        # ecx = high dword of multiplier
                                #     * low dword of multiplicand,
                                # eax = high dword of multiplicand
        mov     edx, [esp+16]   # edx = low dword of multiplier
        mul     edx             # edx:eax = high dword of multiplicand
                                #         * low dword of multiplier
        adc     ebx, ebx        # ebx = (product < 2**64) ? 0 : *

        add     ecx, eax        # ecx = high dword of multiplier
                                #     * low dword of multiplicand
                                #     + high dword of multiplicand
                                #     * low dword of multiplier
#       adc     ebx, ebx        # ebx = (product < 2**64) ? 0 : *

        mov     eax, [esp+8]    # eax = low dword of multiplicand
        mov     edx, [esp+16]   # edx = low dword of multiplier
        mul     edx             # edx:eax = low dword of multiplicand
                                #         * low dword of multiplier
        add     edx, ecx        # edx:eax = product % 2**64
        adc     ebx, ebx        # ebx = (product < 2**64) ? 0 : *
        jnz     .Loverflow      # product >= 2**64?

        pop     ebx
        ret

.Loverflow:
        ud2

        .size   __umulvdi3, .-__umulvdi3
        .type   __umulvdi3, @function
        .global __umulvdi3
        .end
--- EOF ---

--- mulvdi3.s ---
# Copyright © 2004-2020, Stefan Kanthak <stefan.kanthak@nexgo.de>

        .arch   generic32
        .code32
        .intel_syntax noprefix
        .text
                                # [esp+16] = high dword of multiplier
                                # [esp+12] = low dword of multiplier
                                # [esp+8] = high dword of multiplicand
                                # [esp+4] = low dword of multiplicand
__mulvdi3:
        push    ebx
        mov     eax, [esp+20]   # eax = high dword of multiplier
        mov     ecx, [esp+16]   # ecx = low dword of multiplier
        cdq                     # edx = (multiplier < 0) ? -1 : 0
        mov     ebx, edx        # ebx = (multiplier < 0) ? -1 : 0
        xor     ecx, edx
        xor     eax, edx        # eax:ecx = (multiplier < 0) ? ~multiplier : multiplier
        sub     ecx, edx
        sbb     eax, edx        # eax:ecx = (multiplier < 0) ? -multiplier : multiplier
                                #         = |multiplier|
        mov     [esp+16], ecx
        mov     [esp+20], eax   # multiplier = |multiplier|

        mov     eax, [esp+12]   # eax = high dword of multiplicand
        mov     ecx, [esp+8]    # ecx = low dword of multiplicand
        cdq                     # edx = (multiplicand < 0) ? -1 : 0
        xor     ebx, edx        # ebx = (multiplier < 0)
                                #     ^ (multiplicand < 0) ? -1 : 0
                                #     = (product < 0) ? -1 : 0
        xor     ecx, edx
        xor     eax, edx        # eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand
        sub     ecx, edx
        sbb     eax, edx        # eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand
                                #         = |multiplicand|
        mov     [esp+8], ecx
        mov     [esp+12], eax   # multiplicand = |multiplicand|

        push    ebx             # [esp] = (product < 0) ? -1 : 0

        xor     edx, edx        # edx = 0
#       mov     eax, [esp+16]   # eax = high dword of |multiplicand|
        cmp     edx, eax
        sbb     ebx, ebx        # ebx = (high dword of |multiplicand| == 0) ? 0 : -1
                                #     = (|multiplicand| < 2**32) ? 0 : -1
        mov     ecx, [esp+24]   # ecx = high dword of |multiplier|
        cmp     edx, ecx
        sbb     edx, edx        # edx = (high dword of |multiplier| == 0) ? 0 : -1
                                #     = (|multiplier| < 2**32) ? 0 : -1
        and     ebx, edx        # ebx = (|multiplicand| < 2**32)
                                #     & (|multiplier| < 2**32) ? 0 : -1
                                #     = (|product| < 2**64) ? 0 : -1
.ifdef INTO
        shr     ebx, 1
        into
        mov     ebx, [esp+20]   # ebx = low dword of |multiplier|
        mul     ebx             # edx:eax = high dword of |multiplicand|
                                #         * low dword of |multiplier|
        into
        xchg    ecx, eax        # ecx = high dword of |multiplicand|
                                #     * low dword of |multiplier|,
                                # eax = high dword of |multiplier|
        mov     edx, [esp+12]   # edx = low dword of |multiplicand|
        mul     edx             # edx:eax = high dword of |multiplier|
                                #         * low dword of |multiplicand|
        into
        add     ecx, eax        # ecx = high dword of |multiplicand|
                                #     * low dword of |multiplier|
                                #     + high dword of |multiplier|
                                #     * low dword of |multiplicand|
#       sbb     eax, eax        # eax = (|product| < 2**64) ? 0 : -1
#       shr     eax, 1
#       into
        mov     eax, [esp+12]   # eax = low dword of |multiplicand|
        mul     ebx             # edx:eax = low dword of |multiplicand|
                                #         * low dword of |multiplier|
        add     edx, ecx        # edx:eax = |product % 2**64|
                                #         = |product| % 2**64
        sbb     ecx, ecx        # ecx = (|product| < 2**64) ? 0 : -1
        shr     ecx, 1
        into
        pop     ebx             # ebx = (product < 0) ? -1 : 0
        add     eax, ebx
        adc     edx, ebx        # edx:eax = (product < 0) ? ~product % 2**64 : product % 2**64
        mov     ecx, edx
        shr     ecx, 1
        into
        xor     eax, ebx
        xor     edx, ebx        # edx:eax = product % 2**64
        pop     ebx
        ret
.else
        mov     edx, [esp+20]   # edx = low dword of |multiplier|
        mul     edx             # edx:eax = high dword of |multiplicand|
                                #         * low dword of |multiplier|
        adc     ebx, ebx        # ebx = (|product| < 2**64) ? 0 : *

        xchg    ecx, eax        # ecx = high dword of |multiplicand|
                                #     * low dword of |multiplier|,
                                # eax = high dword of |multiplier|
        mov     edx, [esp+12]   # edx = low dword of |multiplicand|
        mul     edx             # edx:eax = high dword of |multiplier|
                                #         * low dword of |multiplicand|
        adc     ebx, ebx        # ebx = (|product| < 2**64) ? 0 : *

        add     ecx, eax        # ecx = high dword of |multiplicand|
                                #     * low dword of |multiplier|
                                #     + high dword of |multiplier|
                                #     * low dword of |multiplicand|
#       adc     ebx, ebx        # ebx = (|product| < 2**64) ? 0 : *

        mov     eax, [esp+12]   # eax = low dword of |multiplicand|
        mov     edx, [esp+20]   # edx = low dword of |multiplier|
        mul     edx             # edx:eax = low dword of |multiplicand|
                                #         * low dword of |multiplier|
        add     edx, ecx        # edx:eax = |product % 2**64|
                                #         = |product| % 2**64
        adc     ebx, ebx        # ebx = (|product| < 2**64) ? 0 : *

        pop     ecx             # ecx = (product < 0) ? -1 : 0
        xor     eax, ecx
        xor     edx, ecx        # edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64
        sub     eax, ecx
        sbb     edx, ecx        # edx:eax = product % 2**64

        xor     ecx, edx        # ecx = (multiplicand < 0)
                                #     ^ (multiplier < 0)
                                #     ^ (product < 0) ? negative : positive
        add     ecx, ecx
        adc     ebx, ebx        # ebx = (-2**63 <= product < 2**63) ? 0 : *
        jnz     .Loverflow      # product < -2**63?
                                # product >= 2**63?
        pop     ebx
        ret

.Loverflow:
        ud2
.endif # INTO
        .size   __mulvdi3, .-__mulvdi3
        .type   __mulvdi3, @function
        .global __mulvdi3
        .end
--- EOF ---