14224 – GCC generates pessimizes code for integer division

Bug 14224 - GCC generates pessimizes code for integer division

Summary: GCC generates pessimizes code for integer division

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	3.3.3

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Duplicates (1):	14255 (view as bug list)
Depends on:
Blocks:	14255
	Show dependency tree / graph

Reported:	2004-02-20 11:38 UTC by Wesley W. Terpstra
Modified:	2021-06-08 08:59 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:	i686-pc-linux-gnu
Build:
Known to work:
Known to fail:	4.0.0
Last reconfirmed:	2005-12-09 04:37:52

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Wesley W. Terpstra 2004-02-20 11:38:42 UTC

When dividing a 64 bit integer by a 32 bit integer (normal division or
remainder), gcc uses __umoddi3 even though the intel instruction 'div' can do
this natively.

On my machine, this bad assembly has a speed impact of at least 1.5* by simply
replacing the call with 'div' and no further optimizations applied.

These types of operations (multiplication and then modulus division) are very
common in cryptography and coding theory. It is quite possible that fixing this
bug would greatly improve gcc's performance in these areas.

I have confirmed that this bug applies to versions 3.5, 3.4, and 2.95.4 in
addition to the 3.3.3 against which it is filed.

The example program is:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
 
int main(int argc, char** argv)
{
        int i = 0;
        uint32_t x = atol(argv[1]), y = atol(argv[2]), z = 0;
         
        for (i = 0; i < 1000000000; ++i)
        {
                z += ((uint64_t)x*(uint64_t)y) % 3221225473UL;
                ++x; ++y;
        }
         
        printf("%d\n", z);
         
        return 0;
}

Compile with gcc -O2 foo.c -o foo (version 3.3.3).
Then generate the assembly gcc -O2 foo.c -o foo.s and apply this patch:

47,54c47,64
<       movl    %eax, (%esp)
<       xorl    %eax, %eax
<       movl    %edx, 4(%esp)
<       movl    $-1073741823, %edx
<       movl    %edx, 8(%esp)
<       movl    %eax, 12(%esp)
<       call    __umoddi3
<       addl    %eax, -16(%ebp)
---
>
> // stuff from gcc:
> //    movl    %eax, (%esp)
> //    xorl    %eax, %eax
> //    movl    %edx, 4(%esp)
> //    movl    $-1073741823, %edx
> //    movl    %edx, 8(%esp)
> //    movl    %eax, 12(%esp)
> //    call    __umoddi3
> //    addl    %eax, -16(%ebp)
>
> // my custom assembly:
>       movl    $-1073741823, %ecx
>       div     %ecx
>       addl    %edx, -16(%ebp)
>
> // --- end changes
>

Now compile the assembly (gcc foo.s -o foo2) and compare the results and the
speed difference. It is quite likely that gcc could do an even better job
through good pipelining and better register assignment. Furthermore, in this
particular example code, the divisor is a constant which might allow gcc to even
optimize away the div altogether. However, I have no idea what the performance
impact of this would be.

Comment 1 Andrew Pinski 2004-02-20 16:23:16 UTC

Confirmed.

Comment 2 Uroš Bizjak 2004-04-05 13:31:28 UTC

Problem here is, that maximum quotient should always be 2^32-1, as it should fit
in 32bit EAX register. Consider the case, when dividend is 0x0000 0001 0000 0000
and divisor is 0x0000 0001. The result won't fit in 32 bits, and division will
produce #DE exception.

For your case, divisor is 0xC000 0001. And with your assembly, if x*y is equal
or more than 0xC000 0001 0000 0000 (= 13835058059577131008), #DE exception will
be generated.

I suggest to mark this bug as invalid, because it is not possible for gcc to
know maximum value of (x*y).

Comment 3 Wesley W. Terpstra 2004-04-05 14:57:46 UTC

(In reply to comment #2)
> Problem here is, that maximum quotient should always be 2^32-1, as it should fit
> in 32bit EAX register. Consider the case, when dividend is 0x0000 0001 0000 0000
> and divisor is 0x0000 0001. The result won't fit in 32 bits, and division will
> produce #DE exception.
> 
> For your case, divisor is 0xC000 0001. And with your assembly, if x*y is equal
> or more than 0xC000 0001 0000 0000 (= 13835058059577131008), #DE exception will
> be generated.

You're quite right; I hadn't considered this.

> I suggest to mark this bug as invalid, because it is not possible for gcc to
> know maximum value of (x*y).

It's really a shame there's no way to inform gcc about the size.
For modulus arithmetic, x and y are known to be less than m.
Therefore, x*y%m and x*y/m must both fit in the registers.
                                                                                
However, I have a couple points to make still.
                                                                                
Firstly, the modulus will always fit inside EDX, regardless of the size of
the product. I'm not an assembler expert, but isn't it possible to turn off
the arithmetic exceptions prior to running DIV? As long as the code is only
interested in the modulus, this would be much faster.
                                                                                
At least in the case of finite mathematics (cryptography, coding theory,
etc) only the modulus really matters and computing it is often also the
critical section of the code.
                                                                                
Even if the code is interested in both the quotient and the modulus, it
seems that a quick comparison of the high 32bit part of EDX:EAX with the
modulus could determine whether or not a DIV would suffice. If it doesn't
then umoddi could be a fallback.

Comment 4 Andrew Pinski 2004-06-03 03:56:10 UTC

*** Bug 14255 has been marked as a duplicate of this bug. ***

Comment 5 Andrew Pinski 2021-06-08 08:41:13 UTC

(In reply to Uroš Bizjak from comment #2)
> I suggest to mark this bug as invalid, because it is not possible for gcc to
> know maximum value of (x*y).

Invalid then.

Comment 6 Jakub Jelinek 2021-06-08 08:59:01 UTC

Well, during expansion VRP can be queried (SSA_NAME_RANGE_INFO or ranger if it is extended to be queryable during expansion) and in some cases (not in the #c0 testcase though) we can know the range.