This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 25 Jan 2017 11:26:50 +0000
- Subject: [Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
- Auto-submitted: auto-generated
- References: <bug-25621-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk, only the cost model now prevents vectorization of the s32 loop
(with generic tuning/arch). With core-avx2 I get the following for both
innermost loops:
.L6:
        addl    $1, %r10d
        vmovapd (%rbx,%r8), %ymm3
        vfmadd231pd     (%rax,%r8), %ymm3, %ymm0
        addq    $32, %r8
        cmpl    %r12d, %r10d
        jb      .L6
...
.L26:
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %ymm4
        vfmadd231pd     (%rsi,%rax), %ymm4, %ymm0
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L26
...
with only the reduction after each loop differing. When forcing avx128, the
s32 loop isn't vectorized (cost model again):
t.f90:22:0: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 12
  Scalar iteration cost: 8
  Scalar outside cost: 6
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1