[Bug tree-optimization/50693] Loop optimization restricted by GOTOs
dje at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Oct 11 01:12:00 GMT 2011
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50693
David Edelsohn <dje at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-11
     Ever Confirmed|0                           |1
--- Comment #8 from David Edelsohn <dje at gcc dot gnu.org> 2011-10-11 01:11:47 UTC ---
Both loop1 and loop2 produce the same code on LLVM, presumably from its memset
pattern:
movq %rax, 8(%r15)
movq %rbx, (%r15)
testq %rbx, %rbx
je .LBB1_3
# BB#1:
movq %rbx, %rcx
movq %rax, %rdx
.align 16, 0x90
.LBB1_2: # %.lr.ph
# =>This Inner Loop Header: Depth=1
movb %r14b, (%rdx)
incq %rdx
decq %rcx
jne .LBB1_2
.LBB1_3: # %._crit_edge
movb $0, (%rax,%rbx)
LLVM's loop increments the pointer directly instead of using indexed addressing.
That might not be the recommended form, but Intel processors make do with it.
For loop1, GCC produces:
testq %rbx, %rbx
movq %rax, 8(%rbp)
movq %rbx, 0(%rbp)
je .L3
xorl %edx, %edx
.p2align 4,,10
.p2align 3
.L5:
movb %r12b, (%rax,%rdx)
addq $1, %rdx
movq 8(%rbp), %rax
cmpq %rbx, %rdx
jne .L5
.L3:
movb $0, (%rax,%rbx)
For loop2, GCC produces:
xorl %edx, %edx
testq %rbx, %rbx
movq %rax, 8(%rbp)
movq %rbx, 0(%rbp)
jne .L13
jmp .L9
.p2align 4,,10
.p2align 3
.L11:
movq 8(%rbp), %rax
.L8:
.L13:
.L10:
movb %r12b, (%rax,%rdx)
addq $1, %rdx
cmpq %rbx, %rdx
jne .L11
movq 8(%rbp), %rax
.L9:
movb $0, (%rax,%rbx)
In both cases GCC unnecessarily re-reads v->chars.
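One plausible explanation for the reload, assuming the loop stores through
v->chars with a char type: in C a char store may alias any object, including
the v->chars field itself, so the compiler cannot easily prove the pointer is
unchanged after each store. The sketch below is a hypothetical reconstruction,
not the test case attached to this report. The names struct vec, fill_reload
and fill_cached are invented for illustration, and the field layout is only
inferred from the 0(%rbp) and 8(%rbp) stores above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout (assumption): length at offset 0, data pointer at
   offset 8, matching the two stores in the listings above. */
struct vec {
    size_t len;
    char *chars;
};

/* Stores through v->chars on every iteration. Because a char store may
   alias v->chars itself, the compiler may reload the pointer each time. */
void fill_reload(struct vec *v, size_t n, char c)
{
    assert(v != NULL && v->chars != NULL);
    for (size_t i = 0; i < n; i++)
        v->chars[i] = c;
    v->chars[n] = '\0';
}

/* Copies the pointer into a local first. The local's address is never
   taken, so no store can modify it and it can stay in a register. */
void fill_cached(struct vec *v, size_t n, char c)
{
    assert(v != NULL && v->chars != NULL);
    char *p = v->chars;
    for (size_t i = 0; i < n; i++)
        p[i] = c;
    p[n] = '\0';
}
```

Both functions write the same bytes. Comparing their output at -O2 would show
whether the movq 8(%rbp), %rax reload disappears in the cached form. That is
an experiment to run, not a result claimed here.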
Is loop2 slower because the jne .L13 jump into the middle of the loop confuses the
Intel loop branch predictor logic? Or does the loop2 instruction order crack into
uops badly? The cause of the performance difference is not obvious.
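For readers without the test case at hand, here is a hypothetical pair of
loops of the kind the bug title describes: one written as a structured for
loop, one written with gotos. This is only a sketch of the two shapes. The
names set_for and set_goto are invented, and the actual loop1 and loop2 in the
report may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Structured form: a plain counted loop followed by the terminator store. */
void set_for(char *p, size_t n, char c)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        p[i] = c;
    p[n] = '\0';
}

/* Goto form: the same stores, with control flow spelled out by labels.
   The zero-length check jumps past the body, as the je/jmp to the exit
   label does in the listings above. */
void set_goto(char *p, size_t n, char c)
{
    assert(p != NULL);
    size_t i = 0;
    if (n == 0)
        goto done;
body:
    p[i] = c;
    i++;
    if (i != n)
        goto body;
done:
    p[n] = '\0';
}
```

The two functions are semantically equivalent, so any difference in generated
code or run time would come from how the optimizer treats the control flow
rather than from the work being done.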