[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily

peter at cordes dot ca gcc-bugzilla@gcc.gnu.org
Fri May 13 16:43:03 GMT 2022


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

            Bug ID: 105596
           Summary: Loop counter widened to 128-bit unnecessarily
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

For  total *= i  with a u128 total and a u32 loop counter, GCC pessimizes by
widening i to 128-bit, doing a full 128x128 => 128-bit multiply, and also doing
a 128-bit increment and compare.

Using uint64_t i to make it a full register width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++)
        total *= i;
    return total;
}
// 0! = 0  isn't mathematically correct, but that's not the point

https://godbolt.org/z/W4MW9b6T3  (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10           # total = n  zext into  R11:R10
        cmpl    $2, %edi
        jbe     .L7                 # if n<=2 return r11:r10
        movl    $2, %esi            # i = 2  in  RDI:RSI
        xorl    %edi, %edi
.L9:                              # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8              # copy original n to destroy later
        imulq   %r10, %rdx          # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11          # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11          # last partial product

        addq    $1, %rsi            # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8           #  r8 = n^i
        movq    %rdi, %rcx           # useless copy, we're already destroying r8
        orq     %r8, %rcx            # hi(i^n) | lo(i^n)
        jne     .L9               # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work to do, GCC isn't even doing that work very
efficiently: the loop has multiple unnecessary mov instructions.
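
Here's a hand-narrowed sketch (fact_manual is just an illustration; untested)
that keeps the counter 32-bit and spells out the 128x64-bit recipe, which only
needs one widening mul (lo*i) plus one imul (the low 64 bits of hi*i):

unsigned __int128 fact_manual(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++){
        unsigned long long lo = (unsigned long long)total;
        unsigned long long hi = (unsigned long long)(total >> 64);
        // lo*i widens to 128 bits; hi*i only needs its low 64 bits
        total = (unsigned __int128)lo * i
              + ((unsigned __int128)(hi * i) << 64);
    }
    return total;
}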

This doesn't seem to be x86-64 specific; it compiles similarly for AArch64 and
MIPS64.  For some ISAs I can't tell whether the potentially-infinite version
makes a difference, e.g. PowerPC asm is hard for me to read.  RV64 has three
multiply instructions in both versions.

I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.
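
That untested 32-bit analogue would look something like this (fact32 is a
hypothetical name for illustration):

#include <stdint.h>

uint64_t fact32(uint32_t n){
    uint64_t total = n;
    for (uint32_t i=2 ; i < n ; i++)
        total *= i;
    return total;
}

The question there would be whether 32-bit targets widen i into a 64-bit
register pair the same way.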


This anti-optimization goes back to GCC 4.6.  With GCC 4.5 and earlier, the
above C compiles to a tight loop with the expected mul reg + imul reg,reg and a
single-register loop counter: https://godbolt.org/z/6KheaqTx4  (using
__uint128_t, since unsigned __int128 wasn't supported in GCC 4.4 or 4.1)

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register.  It's still just incrementing and comparing a 32-bit loop
counter, but widening it for a 128x128-bit multiply recipe.  GCC 4.4 optimizes
away the parts that are useless given that the high 64 bits of (u128)i are
zero.


-----

A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=n-1 ; i > 1 ; i--)
        total *= i;
    return total;  // 0! = 0 isn't mathematically correct
}


-----

When the loop condition is possibly always-true, GCC can't prove the loop is
finite, and as usual can't widen the loop counter.  In this case, that's a
good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;   // loop does include the final i == n
    for (unsigned i=2 ; i <= n ; i++)  // potentially infinite loop defeats this pessimization
        total *= i;
    return total;  // fun fact:  0! = 1  is mathematically correct
}


fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx           # i = 2   in    ECX
        movl    $1, %eax           # total = 1  in RDX:RAX
        xorl    %edx, %edx
.L3:                             #do{
        movl    %ecx, %esi            # copy i instead of just incrementing it later :/

        movq    %rdx, %r8           # save high half of total
        addl    $1, %ecx              # i++
        imulq   %rsi, %r8           # lo x hi cross product
        mulq    %rsi                # lo x lo widening
        addq    %r8, %rdx           # 128x64-bit multiply

        cmpl    %ecx, %edi
        jnb     .L3               # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.
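
For completeness, here's a minimal sanity-check harness (mine, not part of the
report; it assumes the three functions above are in the same translation unit):

#include <stdio.h>

int main(void){
    // All three variants compute n! for n >= 1; they agree mod 2^128 even
    // past 34!, the largest factorial that fits in 128 bits.
    // Skip n == 0: fact() and fact_downcount() return 0 there, while
    // fact_gcc_handhold() returns 1.
    for (unsigned n = 1; n < 40; n++){
        unsigned __int128 a = fact(n);
        unsigned __int128 b = fact_downcount(n);
        unsigned __int128 c = fact_gcc_handhold(n);
        if (a != b || a != c)
            printf("mismatch at n=%u\n", n);
    }
    return 0;
}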

