When dividing a 64 bit integer by a 32 bit integer (normal division or remainder), gcc uses __umoddi3 even though the intel instruction 'div' can do this natively. On my machine, this bad assembly has a speed impact of at least 1.5* by simply replacing the call with 'div' and no further optimizations applied. These types of operations (multiplication and then modulus division) are very common in cryptography and coding theory. It is quite possible that fixing this bug would greatly improve gcc's performance in these areas. I have confirmed that this bug applies to versions 3.5, 3.4, and 2.95.4 in addition to the 3.3.3 against which it is filed. The example program is: #include <stdint.h> #include <stdlib.h> #include <stdio.h> int main(int argc, char** argv) { int i = 0; uint32_t x = atol(argv[1]), y = atol(argv[2]), z = 0; for (i = 0; i < 1000000000; ++i) { z += ((uint64_t)x*(uint64_t)y) % 3221225473UL; ++x; ++y; } printf("%d\n", z); return 0; } Compile with gcc -O2 foo.c -o foo (version 3.3.3). Then generate the assembly gcc -O2 foo.c -o foo.s and apply this patch: 47,54c47,64 < movl %eax, (%esp) < xorl %eax, %eax < movl %edx, 4(%esp) < movl $-1073741823, %edx < movl %edx, 8(%esp) < movl %eax, 12(%esp) < call __umoddi3 < addl %eax, -16(%ebp) --- > > // stuff from gcc: > // movl %eax, (%esp) > // xorl %eax, %eax > // movl %edx, 4(%esp) > // movl $-1073741823, %edx > // movl %edx, 8(%esp) > // movl %eax, 12(%esp) > // call __umoddi3 > // addl %eax, -16(%ebp) > > // my custom assembly: > movl $-1073741823, %ecx > div %ecx > addl %edx, -16(%ebp) > > // --- end changes > Now compile the assembly (gcc foo.s -o foo2) and compare the results and the speed difference. It is quite likely that gcc could do an even better job through good pipelining and better register assignment. Furthermore, in this particular example code, the divisor is a constant which might allow gcc to even optimize away the div altogether. However, I have no idea what the performance impact of this would be.
Confirmed.
Problem here is, that maximum quotient should always be 2^32-1, as it should fit in 32bit EAX register. Consider the case, when dividend is 0x0000 0001 0000 0000 and divisor is 0x0000 0001. The result won't fit in 32 bits, and division will produce #DE exception. For your case, divisor is 0xC000 0001. And with your assembly, if x*y is equal or more than 0xC000 0001 0000 0000 (= 13835058059577131008), #DE exception will be generated. I suggest to mark this bug as invalid, because it is not possible for gcc to know maximum value of (x*y).
(In reply to comment #2) > Problem here is, that maximum quotient should always be 2^32-1, as it should fit > in 32bit EAX register. Consider the case, when dividend is 0x0000 0001 0000 0000 > and divisor is 0x0000 0001. The result won't fit in 32 bits, and division will > produce #DE exception. > > For your case, divisor is 0xC000 0001. And with your assembly, if x*y is equal > or more than 0xC000 0001 0000 0000 (= 13835058059577131008), #DE exception will > be generated. You're quite right; I hadn't considered this. > I suggest to mark this bug as invalid, because it is not possible for gcc to > know maximum value of (x*y). It's really a shame there's no way to inform gcc about the size. For modulus arithmetic, x and y are known to be less than m. Therefore, x*y%m and x*y/m must both fit in the registers. However, I have a couple points to make still. Firstly, the modulus will always fit inside EDX, regardless of the size of the product. I'm not an assembler expert, but isn't it possible to turn off the arithmetic exceptions prior to running DIV? As long as the code is only interested in the modulus, this would be much faster. At least in the case of finite mathematics (cryptography, coding theory, etc) only the modulus really matters and computing it is often also the critical section of the code. Even if the code is interested in both the quotient and the modulus, it seems that a quick comparison of the high 32bit part of EDX:EAX with the modulus could determine whether or not a DIV would suffice. If it doesn't then umoddi could be a fallback.
*** Bug 14255 has been marked as a duplicate of this bug. ***
(In reply to Uroš Bizjak from comment #2) > I suggest to mark this bug as invalid, because it is not possible for gcc to > know maximum value of (x*y). Invalid then.
Well, during expansion VRP can be queried (SSA_NAME_RANGE_INFO or ranger if it is extended to be queryable during expansion) and in some cases (not in the #c0 testcase though) we can know the range.