[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Fri May 13 16:43:03 GMT 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596
Bug ID: 105596
Summary: Loop counter widened to 128-bit unnecessarily
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
For total *= i with a u128 total and a u32 loop counter, GCC pessimizes by
widening i to 128 bits, doing a full 128x128 => 128-bit multiply, and then
also having to do a 128-bit increment and compare.
Making i a uint64_t so it's already a full register width doesn't help.
unsigned __int128 fact(unsigned n){
  unsigned __int128 total = n;
  for (unsigned i=2 ; i < n ; i++)
    total *= i;
  return total;
}
// 0! = 0 isn't mathematically correct, but that's not the point
https://godbolt.org/z/W4MW9b6T3 (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)
# gcc -O3
fact:
movl %edi, %r9d
xorl %r11d, %r11d
movq %r9, %r10 # total = n zext into R11:R10
cmpl $2, %edi
jbe .L7 # if n<=2 return r11:r10
movl $2, %esi # i = 2 in RDI:RSI
xorl %edi, %edi
.L9: # do{
movq %r11, %rcx
movq %rdi, %rdx
movq %r10, %rax
movq %r9, %r8 # copy original n to destroy later
imulq %r10, %rdx # 128x128 multiply with 2x imul, 1x widening mul
imulq %rsi, %rcx
addq %rdx, %rcx
mulq %rsi
movq %rdx, %r11 # update total in r11:r10
movq %rax, %r10
addq %rcx, %r11 # last partial product
addq $1, %rsi # i++ as a 128-bit integer
adcq $0, %rdi
xorq %rsi, %r8 # r8 = n^i
movq %rdi, %rcx # useless copy, we're already destroying r8
orq %r8, %rcx # hi(i^n) | lo(i^n)
jne .L9 # }while(i != n);
.L7:
movq %r10, %rax
movq %r11, %rdx
ret
So as well as creating extra work, GCC isn't even doing that work very
efficiently: there are multiple unnecessary mov instructions.
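For reference, the loop body above is computing a 128x128 => 128-bit product
from 64-bit halves. A minimal C sketch of that recipe (the helper name
mul128 is mine, not from the compiler output) is:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* Hypothetical helper mirroring what GCC's loop body does: two
 * non-widening imul for the cross products, one widening mul for
 * lo*lo.  The hi*hi product and carries out of bit 127 vanish
 * mod 2^128, so they're simply not computed. */
static u128 mul128(u128 a, u128 b) {
    uint64_t a_lo = (uint64_t)a, a_hi = (uint64_t)(a >> 64);
    uint64_t b_lo = (uint64_t)b, b_hi = (uint64_t)(b >> 64);
    u128 lo = (u128)a_lo * b_lo;                  /* widening mulq */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;   /* two imulq, wraps mod 2^64 */
    return lo + ((u128)cross << 64);
}
```

The point of the bug is that with a 32-bit (or 64-bit) i, b_hi is known
zero, so one imul and the corresponding add should disappear, as they do in
the hand-holding version below.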
This doesn't seem to be x86-64 specific. It also compiles similarly for
AArch64 and MIPS64. For some ISAs, I'm not sure if potentially-infinite loops
are making a difference, e.g. PowerPC is hard for me to read. RV64 has three
multiply instructions in both versions.
I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.
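An untested sketch of that 32-bit equivalent (my naming) would be the
direct analogue, for checking whether a 32-bit target widens i to a
64-bit pair the same way:

```c
#include <stdint.h>

/* 32-bit analogue of fact(): uint64_t total, uint32_t counter.
 * On a 32-bit target, the question is whether i gets widened to a
 * 64-bit register pair for the multiply, increment, and compare. */
uint64_t fact32(uint32_t n) {
    uint64_t total = n;
    for (uint32_t i = 2; i < n; i++)
        total *= i;
    return total;
}
```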
This anti-optimization goes back to GCC4.6. With GCC4.5 and earlier, the above
C compiles to a tight loop with the expected mul reg + imul reg,reg and 1
register loop counter: https://godbolt.org/z/6KheaqTx4 (using __uint128_t,
since unsigned __int128 wasn't supported on GCC4.4 or 4.1)
GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register. It's still just incrementing and comparing a 32-bit loop
counter, but widening it for a 128x128-bit multiply recipe. GCC4.4 optimizes
away the parts that are useless because the high 64 bits of (u128)i are zero.
-----
A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.
unsigned __int128 fact_downcount(unsigned n){
  unsigned __int128 total = n;
  for (unsigned i=n-1 ; i > 1 ; i--)
    total *= i;
  return total; // 0! = 0 isn't mathematically correct
}
-----
When the loop condition is possibly always-true, GCC can't prove the loop is
non-infinite, and as usual can't widen the loop counter. In this case, that's
a good thing:
unsigned __int128 fact_gcc_handhold(unsigned n){
  unsigned __int128 total = 1;        // loop does do final n
  for (unsigned i=2 ; i <= n ; i++)   // potentially infinite loop defeats this pessimization
    total *= i;
  return total; // fun fact: 0! = 1 is mathematically correct
}
fact_gcc_handhold:
cmpl $1, %edi
jbe .L4
movl $2, %ecx # i = 2 in ECX
movl $1, %eax # total = 1 in RDX:RAX
xorl %edx, %edx
.L3: #do{
movl %ecx, %esi # copy i instead of just incrementing it later :/
movq %rdx, %r8 # save high half of total
addl $1, %ecx # i++
imulq %rsi, %r8 # lo x hi cross product
mulq %rsi # lo x lo widening
addq %r8, %rdx # 128x64-bit multiply
cmpl %ecx, %edi
jnb .L3 # }while(i <= n)
ret
Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.
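The 128x64-bit recipe in that loop, written out in C (helper name
mul128x64 is mine, chosen for illustration), is just one widening mul plus
one truncating imul folded into the high half:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* Sketch of the loop body's multiply: mulq computes the widening
 * lo(total)*i into RDX:RAX, then imulq+addq fold hi(total)*i into
 * the high half.  Carries out of bit 127 are discarded mod 2^128. */
static u128 mul128x64(u128 total, uint64_t i) {
    uint64_t lo = (uint64_t)total, hi = (uint64_t)(total >> 64);
    u128 prod = (u128)lo * i;        /* mulq: RDX:RAX = lo * i */
    prod += (u128)(hi * i) << 64;    /* imulq + addq into RDX  */
    return prod;
}
```

This is the efficient shape the other two versions should also get: the
counter never needs to exist as a 128-bit value at all.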