This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/25621] Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk, only the cost model now prevents vectorization of the s32 loop (with
generic tuning/arch).  With core-avx2 I get the following for both innermost loops:

.L6:
        addl    $1, %r10d
        vmovapd (%rbx,%r8), %ymm3
        vfmadd231pd     (%rax,%r8), %ymm3, %ymm0
        addq    $32, %r8
        cmpl    %r12d, %r10d
        jb      .L6
...

.L26:
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %ymm4
        vfmadd231pd     (%rsi,%rax), %ymm4, %ymm0
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L26
...

Only the reduction after each loop differs.  When forcing avx128, the s32 loop
isn't vectorized (cost model again):

t.f90:22:0: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 12
  Scalar iteration cost: 8
  Scalar outside cost: 6
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1
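
The numbers in the dump above can be checked by hand.  The vector outside
cost is the prologue plus the epilogue (8 + 12 = 20).  Assuming a
vectorization factor of 2 (two doubles per 128-bit vector; the dump does not
state it), the vector body costs 16 per 2 elements, i.e. 8 per element, which
is no better than the scalar iteration cost of 8, so the larger outside cost
(20 vs. 6) can never be recovered.  The following C sketch encodes that
reasoning.  It is a deliberately simplified model and not GCC's actual
profitability computation; the struct, the function names and the VF
parameter are hypothetical.

```c
/* Simplified vectorizer cost comparison (illustrative only, not GCC code).
   The cost values are copied from the dump; everything else is assumed. */

struct vect_costs {
    long vec_inside;      /* Vector inside of loop cost */
    long vec_prologue;    /* Vector prologue cost */
    long vec_epilogue;    /* Vector epilogue cost */
    long scalar_iter;     /* Scalar iteration cost */
    long scalar_outside;  /* Scalar outside cost */
};

/* Costs reported for the s32 loop with 128-bit vectors. */
const struct vect_costs s32_avx128 = { 16, 8, 12, 8, 6 };

/* Total cost of running n iterations as a scalar loop. */
long scalar_total(const struct vect_costs *c, long n)
{
    return c->scalar_outside + c->scalar_iter * n;
}

/* Total cost of running n iterations vectorized with factor vf.
   Leftover iterations are assumed to be covered by the epilogue cost. */
long vector_total(const struct vect_costs *c, long n, long vf)
{
    long outside = c->vec_prologue + c->vec_epilogue;
    return outside + c->vec_inside * (n / vf);
}

/* Nonzero if the vector version is strictly cheaper for n iterations. */
int vector_is_profitable(const struct vect_costs *c, long n, long vf)
{
    return vector_total(c, n, vf) < scalar_total(c, n);
}
```

Under these assumptions vector_is_profitable(&s32_avx128, n, 2) is false for
every trip count n, which is consistent with the loop being rejected.  With a
larger assumed vectorization factor the same inside cost would be spread over
more elements and the comparison could flip.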
