This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug tree-optimization/37194] Autovectorization of small constant iteration loop degrades performance
- From: "rguenth at gcc dot gnu dot org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 22 Aug 2008 09:53:11 -0000
- Subject: [Bug tree-optimization/37194] Autovectorization of small constant iteration loop degrades performance
- References: <bug-37194-14936@http.gcc.gnu.org/bugzilla/>
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
------- Comment #2 from rguenth at gcc dot gnu dot org 2008-08-22 09:53 -------
The generated x86_64 code looks like this:
ggSpectrum_Set:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        xorl    %ecx, %ecx
        movq    %rdi, %rdx
        andl    $15, %eax
        shrq    $2, %rax
        negl    %eax
        andl    $3, %eax
        je      .L15
        movl    $8, %r8d
        .p2align 4,,10
        .p2align 3
.L10:
        addl    $1, %ecx
        movl    %r8d, %esi
        movss   %xmm0, (%rdx)
        subl    %ecx, %esi
        addq    $4, %rdx
        cmpl    %ecx, %eax
        ja      .L10
.L3:
        movl    $8, %r10d
        subl    %eax, %r10d
        movl    %r10d, %r8d
        shrl    $2, %r8d
        leal    0(,%r8,4), %r9d
        testl   %r9d, %r9d
        je      .L5
        movaps  %xmm0, %xmm2
        sall    $2, %eax
        mov     %eax, %eax
        xorl    %edx, %edx
        shufps  $0, %xmm2, %xmm2
        leaq    (%rdi,%rax), %rax
        movaps  %xmm2, %xmm1
        .p2align 4,,10
        .p2align 3
.L6:
        addl    $1, %edx
        movaps  %xmm1, (%rax)
        addq    $16, %rax
        cmpl    %r8d, %edx
        jb      .L6
        addl    %r9d, %ecx
        subl    %r9d, %esi
        cmpl    %r9d, %r10d
        je      .L9
.L5:
        movslq  %ecx,%rax
        leaq    (%rdi,%rax,4), %rax
        .p2align 4,,10
        .p2align 3
.L8:
        movss   %xmm0, (%rax)
        addq    $4, %rax
        subl    $1, %esi
        jne     .L8
.L9:
        rep
        ret
.L15:
        movl    $8, %esi
        movl    %eax, %ecx
        jmp     .L3
        .cfi_endproc
I wonder why we do not use movups (unaligned stores) instead of peeling for alignment.
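For illustration only, here is a minimal sketch of what the unaligned-store variant could amount to, written with SSE intrinsics. It is not taken from the bug's test case: the function name `set8_unaligned` is hypothetical, and the assumption that the loop stores one `float` into 8 consecutive elements is inferred from the assembly above (the `movss %xmm0` stores and the constant 8). `_mm_storeu_ps` is the intrinsic that normally maps to `movups`; the scalar fallback only keeps the sketch buildable on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical stand-in for the loop in the bug report:
   store v into 8 consecutive floats starting at p. */
static void set8_unaligned(float *p, float v)
{
#if defined(__SSE__)
    __m128 vv = _mm_set1_ps(v);   /* broadcast v to all four lanes */
    _mm_storeu_ps(p, vv);         /* unaligned store, elements 0..3 */
    _mm_storeu_ps(p + 4, vv);     /* unaligned store, elements 4..7 */
#else
    /* Portable fallback: plain scalar loop. */
    for (size_t i = 0; i < 8; i++)
        p[i] = v;
#endif
}
```

With unaligned stores there is no alignment prologue and no scalar epilogue, only two vector stores whatever the alignment of `p`. Whether that is actually faster depends on how expensive unaligned stores are on the target CPU.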
t.i:3: note: Alignment of access forced using peeling.
t.i:3: note: Peeling for alignment will be applied.
t.i:3: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector outside of loop cost: 13
  Scalar iteration cost: 1
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
t.i:3: note: === vect_do_peeling_for_alignment ===
t.i:3: note: created vect_p.29_13
t.i:3: note: niters for prolog loop: (unsigned int) (4 - (((long unsigned int) vect_p.29_13 & 15) >> 2)) & 3
t.i:3: note: Vectorization may not be profitable.
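To make the numbers in the dump easier to follow, here is a small C sketch of the two calculations involved. `prolog_iters` evaluates the prologue expression exactly as printed in the dump, assuming a 4-byte-aligned `float` pointer. `min_profitable_iters` is a reconstruction of the break-even estimate, not code quoted from GCC: the formula, the function names, and the vectorization factor of 4 (four floats per 16-byte vector) are assumptions, chosen because they reproduce the value 7 reported above.

```c
#include <stdint.h>

/* Scalar iterations peeled before the pointer reaches 16-byte alignment,
   following the dump: (4 - ((addr & 15) >> 2)) & 3.
   Assumes addr is a multiple of 4 (a float pointer). */
static unsigned prolog_iters(uintptr_t addr)
{
    return (unsigned)(4 - ((addr & 15) >> 2)) & 3;
}

/* Assumed break-even estimate: smallest iteration count at which the
   vector version stops being more expensive than the scalar one.
   vf is the vectorization factor (elements per vector). */
static int min_profitable_iters(int vec_inside, int vec_outside,
                                int scalar_iter, int scalar_outside,
                                int prologue, int epilogue, int vf)
{
    int outside_delta = (vec_outside - scalar_outside) * vf;
    int denom = scalar_iter * vf - vec_inside;
    if (denom <= 0)
        return -1;  /* vector body is never cheaper per element */

    int n = (outside_delta
             - vec_inside * prologue
             - vec_inside * epilogue) / denom;

    /* Round up if the vector version is still not ahead at n. */
    if (scalar_iter * vf * n <= vec_inside * n + outside_delta)
        n++;
    return n;
}
```

Called as `min_profitable_iters(1, 13, 1, 7, 2, 2, 4)` with the costs from the dump, this sketch yields 7. The loop runs only 8 iterations, so the margin over the break-even point is a single iteration. The assembly computes the prologue count in an equivalent way: it negates `(addr & 15) >> 2` and masks with 3, which gives the same result as `(4 - x) & 3`.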
--
rguenth at gcc dot gnu dot org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37194