[Bug tree-optimization/50693] Loop optimization restricted by GOTOs

Tue Oct 11 01:12:00 GMT 2011

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693

David Edelsohn <dje at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-11
     Ever Confirmed|0                           |1

--- Comment #8 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 01:11:47 UTC ---
Both loop1 and loop2 produce the same code on LLVM, presumably from its memset
pattern:

        movq    %rax, 8(%r15)
        movq    %rbx, (%r15)
        testq   %rbx, %rbx
        je      .LBB1_3
# BB#1:
        movq    %rbx, %rcx
        movq    %rax, %rdx
        .align  16, 0x90
.LBB1_2:                                # %.lr.ph
                                        # =>This Inner Loop Header: Depth=1
        movb    %r14b, (%rdx)
        incq    %rdx
        decq    %rcx
        jne     .LBB1_2
.LBB1_3:                                # %._crit_edge
        movb    $0, (%rax,%rbx)

Direct pointer arithmetic might not be recommended, but Intel makes do.

For loop1, GCC produces:

        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        je      .L3
        xorl    %edx, %edx
        .p2align 4,,10
        .p2align 3
.L5:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        movq    8(%rbp), %rax
        cmpq    %rbx, %rdx
        jne     .L5
.L3:
        movb    $0, (%rax,%rbx)

For loop2, GCC produces:

        xorl    %edx, %edx
        testq   %rbx, %rbx
        movq    %rax, 8(%rbp)
        movq    %rbx, 0(%rbp)
        jne     .L13
        jmp     .L9
        .p2align 4,,10
        .p2align 3
.L11:
        movq    8(%rbp), %rax
.L8:
.L13:
.L10:
        movb    %r12b, (%rax,%rdx)
        addq    $1, %rdx
        cmpq    %rbx, %rdx
        jne     .L11
        movq    8(%rbp), %rax
.L9:
        movb    $0, (%rax,%rbx)

In both cases GCC unnecessarily re-reads v->chars.

Is loop2 slower because jne .L13 jump into the middle of the loop confuses the
Intel loop branch predictor logic?  Or the loop2 instructions order cracks into
uops badly?  The cause of the performance difference is not obvious.