[PATCH] Overflow-trapping integer arithmetic routines7code
Stefan Kanthak
stefan.kanthak@nexgo.de
Wed Nov 25 13:18:09 GMT 2020
Jeff Law <law@redhat.com> wrote:
> On 11/10/20 10:21 AM, Stefan Kanthak wrote:
>
>>> So with all that in mind, I installed everything except the bits which
>>> have the LIBGCC2_BAD_CODE ifdefs after testing on the various crosses.
>>> If you could remove the ifdefs on the abs/mult changes and resubmit them
>>> it'd be appreciated.
>> Done.
> THanks. I'm doing some testing on the abs changes right now. They look
> pretty reasonable, though they will tend to generate worse code on
> targets that don't handle overflow arithmetic and testing all that well.
OTOH the changes yield better code on targets which have a proper
overflow handling, and may benefit from eventual improvements in the
compiler/code generator itself on all targets.
> H8 would be an example that generates more efficient code with the
> original implementation. But I don't think the H8 is terribly
> representative of embedded ports these days, though it's likely to get
> better at precisely these kinds of scenarios in the near future as
> certain modernization patches continue landing.
>
> Something like visium is much better for evaluation as it's got a modern
> structure which includes reasonable exposure of carry/overflow. Not
> surprisingly the visium port generates better code with your improved
> abs routines.
>
>
>
>
>
>> libgcc2.patch
>>
>> --- -/libgcc/libgcc2.c
>> +++ +/libgcc/libgcc2.c
>>
>> #ifdef L_mulvdi3
>> DWtype
>> -__mulvDI3 (DWtype u, DWtype v)
>> +__mulvDI3 (DWtype a, DWtype b)
> [ ... ]
> You've essentially collapsed the function into:
>
> DWtype
> __mulvDI3 (DWtype a, DWtype b)
> {
> DWtype t;
> const DWunion u = {.ll = a};
> const DWunion v = {.ll = b};
> DWunion w = {.ll = __umulsidi3 (u.s.low, v.s.low)};
>
> if (__builtin_mul_overflow (u.s.low, v.s.high, &t)
> || __builtin_add_overflow (t, w.s.high, &w.s.high)
> || __builtin_mul_overflow (u.s.high, v.s.low, &t)
> || __builtin_add_overflow (t, w.s.high, &w.s.high))
> abort ();
>
> return w.ll
> }
>
> The first thing to note is that I believe that using "DWtype" for "t" is
> going to inhibit the overflow check for the __builtin_mul_overflow calls
> that store their result into "t". Your inputs are both 32bits and the
> output is 64 bits. Multiplying 2 32bit numbers will not overflow a
> 64bit result. As a result those overflow checks are actually removed in
> the code we generate for mulvDI3, defeating the entire purpose of using
> mulvdi3 instead of muldi3.
ARGH, my fault(s).
I wanted to show an alternative, but less efficient implementation which
I guarded with the #ifdef, but garbled the main implementation.
> Second while the result of multiplying u.s.high with v.s.high can never
> change the result in w.ll, it does affect the overall overflow status
> that we need to compute.
>
> I think (but have not yet verified) that to fix these problems we need
> to change the type of "t" to SItype and add a check like:
>
> if (u.s.high && v.s.high)
> abort ();
Correct in both cases!
See <https://skanthak.homepage.t-online.de/gcc.html#case17> for my
original code.
I rewrote it to minimise the changes to the current implementation and
introduced 2 silly errors on the way.
> Also note that your approach always does 3 multiplies, which can be very
> expensive on some architectures. The existing version in libgcc2.c will
> often just do one or two multiplies. So while your implementation looks
> a lot simpler, I suspect its often much slower. And on targets without
> 32bit multiplication support, it's probably horribly bad.
All (current) processors I know have super-scalar architecture and a
hardware multiplier, they'll execute the 3 multiplies in parallel.
In my tests on i386 and AMD64 (Core2, Skylake, Ryzen/EPYC), the code
generated for <https://skanthak.homepage.t-online.de/gcc.html#case17>
as well as the (almost) branch-free code shown below and in
<https://skanthak.homepage.t-online.de/gcc.html#case13> runs 10% to 25%
faster than the __mulvDI3 routine from libgcc: the many conditional
branches of your current implementation impair performance more than 3
multiplies!
> My inclination is to leave the overflow checking double-word multiplier
> as-is.
See but <https://gcc.gnu.org/pipermail/gcc/2020-October/234048.html> ff.
> Though I guess you could keep the same structure as the existing
> implementation which tries to avoid unnecessary multiplies and still use
> the __builtin_{add,mul}_overflow to simplify the code a bit less
> aggressively.
Tertium datur: take a look at the __udivmodDI4 routine.
It has separate code paths for targets without hardware divider, and
also for targets where the hardware divider needs a normalized dividend.
I therefore propose to add separate code paths for targets with and
without hardware multiplier for the __mulvDI3 routine too, guarded by a
preprocessor macro which tells whether a target has a hardware multiplier.
regards
Stefan
--- umulvdi3.s ---
# Copyright © 2004-2020, Stefan Kanthak <stefan.kanthak@nexgo.de>
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of multiplier
# [esp+12] = low dword of multiplier
# [esp+8] = high dword of multiplicand
# [esp+4] = low dword of multiplicand
__umulvdi3:
push ebx
xor ebx, ebx # ebx = 0
mov ecx, [esp+12] # ecx = high dword of multiplicand
cmp ebx, ecx
sbb edx, edx # edx = (high dword of multiplicand == 0) ? 0 : -1
# = (multiplicand < 2**32) ? 0 : -1
mov eax, [esp+20] # eax = high dword of multiplier
cmp ebx, eax
sbb ebx, ebx # ebx = (high dword of multiplier == 0) ? 0 : -1
# = (multiplier < 2**32) ? 0 : -1
and ebx, edx # ebx = (multiplicand < 2**32)
# & (multiplier < 2**32) ? 0 : -1
# = (product < 2**64) ? 0 : -1
mov edx, [esp+8] # edx = low dword of multiplicand
mul edx # edx:eax = high dword of multiplier
# * low dword of multiplicand
adc ebx, ebx # ebx = (product < 2**64) ? 0 : *
xchg ecx, eax # ecx = high dword of multiplier
# * low dword of multiplicand,
# eax = high dword of multiplicand
mov edx, [esp+16] # edx = low dword of multiplier
mul edx # edx:eax = high dword of multiplicand
# * low dword of multiplier
adc ebx, ebx # ebx = (product < 2**64) ? 0 : *
add ecx, eax # ecx = high dword of multiplier
# * low dword of multiplicand
# + high dword of multiplicand
# * low dword of multiplier
# adc ebx, ebx # ebx = (product < 2**64) ? 0 : *
mov eax, [esp+8] # eax = low dword of multiplicand
mov edx, [esp+16] # edx = low dword of multiplier
mul edx # edx:eax = low dword of multiplicand
# * low dword of multiplier
add edx, ecx # edx:eax = product % 2**64
adc ebx, ebx # ebx = (product < 2**64) ? 0 : *
jnz .Loverflow # product >= 2**64?
pop ebx
ret
.Loverflow:
ud2
.size __umulvdi3, .-__umulvdi3
.type __umulvdi3, @function
.global __umulvdi3
.end
--- EOF ---
--- mulvdi3.s ---
# Copyright © 2004-2020, Stefan Kanthak <stefan.kanthak@nexgo.de>
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of multiplier
# [esp+12] = low dword of multiplier
# [esp+8] = high dword of multiplicand
# [esp+4] = low dword of multiplicand
__mulvdi3:
push ebx
mov eax, [esp+20] # eax = high dword of multiplier
mov ecx, [esp+16] # ecx = low dword of multiplier
cdq # edx = (multiplier < 0) ? -1 : 0
mov ebx, edx # ebx = (multiplier < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (multiplier < 0) ? ~multiplier : multiplier
sub ecx, edx
sbb eax, edx # eax:ecx = (multiplier < 0) ? -multiplier : multiplier
# = |multiplier|
mov [esp+16], ecx
mov [esp+20], eax # multiplier = |multiplier|
mov eax, [esp+12] # eax = high dword of multiplicand
mov ecx, [esp+8] # ecx = low dword of multiplicand
cdq # edx = (multiplicand < 0) ? -1 : 0
xor ebx, edx # ebx = (multiplier < 0)
# ^ (multiplicand < 0) ? -1 : 0
# = (product < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand
sub ecx, edx
sbb eax, edx # eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand
# = |multiplicand|
mov [esp+8], ecx
mov [esp+12], eax # multiplicand = |multiplicand|
push ebx # [esp] = (product < 0) ? -1 : 0
xor edx, edx # edx = 0
# mov eax, [esp+16] # eax = high dword of |multiplicand|
cmp edx, eax
sbb ebx, ebx # ebx = (high dword of |multiplicand| == 0) ? 0 : -1
# = (|multiplicand| < 2**32) ? 0 : -1
mov ecx, [esp+24] # ecx = high dword of |multiplier|
cmp edx, ecx
sbb edx, edx # edx = (high dword of |multiplier| == 0) ? 0 : -1
# = (|multiplier| < 2**32) ? 0 : -1
and ebx, edx # ebx = (|multiplicand| < 2**32)
# & (|multiplier| < 2**32) ? 0 : -1
# = (|product| < 2**64) ? 0 : -1
.ifdef INTO
shr ebx, 1
into
mov ebx, [esp+20] # ebx = low dword of |multiplier|
mul ebx # edx:eax = high dword of |multiplicand|
# * low dword of |multiplier|
into
xchg ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|,
# eax = high dword of |multiplier|
mov edx, [esp+12] # edx = low dword of |multiplicand|
mul edx # edx:eax = high dword of |multiplier|
# * low dword of |multiplicand|
into
add ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
# + high dword of |multiplier|
# * low dword of |multiplicand|
# sbb eax, eax # eax = (|product| < 2**64) ? 0 : -1
# shr eax, 1
# into
mov eax, [esp+12] # eax = low dword of |multiplicand|
mul ebx # edx:eax = low dword of |multiplicand|
# * low dword of |multiplier|
add edx, ecx # edx:eax = |product % 2**64|
# = |product| % 2**64
sbb ecx, ecx # ecx = (|product| < 2**64) ? 0 : -1
shr ecx, 1
into
pop ebx # ebx = (product < 0) ? -1 : 0
add eax, ebx
adc edx, ebx # edx:eax = (product < 0) ? ~product % 2**64 : product % 2**64
mov ecx, edx
shr ecx, 1
into
xor eax, ebx
xor edx, ebx # edx:eax = product % 2**64
pop ebx
ret
.else
mov edx, [esp+20] # edx = low dword of |multiplier|
mul edx # edx:eax = high dword of |multiplicand|
# * low dword of |multiplier|
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
xchg ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|,
# eax = high dword of |multiplier|
mov edx, [esp+12] # edx = low dword of |multiplicand|
mul edx # edx:eax = high dword of |multiplier|
# * low dword of |multiplicand|
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
add ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
# + high dword of |multiplier|
# * low dword of |multiplicand|
# adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
mov eax, [esp+12] # eax = low dword of |multiplicand|
mov edx, [esp+20] # edx = low dword of |multiplier|
mul edx # edx:eax = low dword of |multiplicand|
# * low dword of |multiplier|
add edx, ecx # edx:eax = |product % 2**64|
# = |product| % 2**64
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
pop ecx # ecx = (product < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64
sub eax, ecx
sbb edx, ecx # edx:eax = product % 2**64
xor ecx, edx # ecx = (multiplicand < 0)
# ^ (multiplier < 0)
# ^ (product < 0) ? negative : positive
add ecx, ecx
adc ebx, ebx # ebx = (-2**63 <= product < 2**63) ? 0 : *
jnz .Loverflow # product < -2**63?
# product >= 2**63?
pop ebx
ret
.Loverflow:
ud2
.endif # INTO
.size __mulvdi3, .-__mulvdi3
.type __mulvdi3, @function
.global __mulvdi3
.end
--- EOF ---
More information about the Gcc-patches
mailing list