This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/37194] Autovectorization of small constant iteration loop degrades performance



------- Comment #2 from rguenth at gcc dot gnu dot org  2008-08-22 09:53 -------
The generated x86_64 code looks like this:

ggSpectrum_Set:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        xorl    %ecx, %ecx
        movq    %rdi, %rdx
        andl    $15, %eax
        shrq    $2, %rax
        negl    %eax
        andl    $3, %eax
        je      .L15
        movl    $8, %r8d
        .p2align 4,,10
        .p2align 3
.L10:
        addl    $1, %ecx
        movl    %r8d, %esi
        movss   %xmm0, (%rdx)
        subl    %ecx, %esi
        addq    $4, %rdx
        cmpl    %ecx, %eax
        ja      .L10
.L3:
        movl    $8, %r10d
        subl    %eax, %r10d
        movl    %r10d, %r8d
        shrl    $2, %r8d
        leal    0(,%r8,4), %r9d
        testl   %r9d, %r9d
        je      .L5
        movaps  %xmm0, %xmm2
        sall    $2, %eax
        mov     %eax, %eax
        xorl    %edx, %edx
        shufps  $0, %xmm2, %xmm2
        leaq    (%rdi,%rax), %rax
        movaps  %xmm2, %xmm1
        .p2align 4,,10
        .p2align 3
.L6:
        addl    $1, %edx
        movaps  %xmm1, (%rax)
        addq    $16, %rax
        cmpl    %r8d, %edx
        jb      .L6
        addl    %r9d, %ecx
        subl    %r9d, %esi
        cmpl    %r9d, %r10d
        je      .L9
.L5:
        movslq  %ecx,%rax
        leaq    (%rdi,%rax,4), %rax
        .p2align 4,,10
        .p2align 3
.L8:
        movss   %xmm0, (%rax)
        addq    $4, %rax
        subl    $1, %esi
        jne     .L8
.L9:
        rep
        ret
.L15:
        movl    $8, %esi
        movl    %eax, %ecx
        jmp     .L3
        .cfi_endproc

I wonder why we do not use movups (unaligned stores) instead of peeling for
alignment.
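
For comparison, here is a minimal C sketch of what a movups-based version would amount to. The source of ggSpectrum_Set is not shown in this comment, so the loop shape below is inferred from the assembly above (pointer in %rdi, value in %xmm0, eight 4-byte stores) and should be read as an assumption. The names set8_scalar and set8_unaligned are hypothetical. With unaligned stores, neither the alignment prologue nor the scalar epilogue is needed.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Assumed shape of the original loop: store d into 8 consecutive floats. */
void set8_scalar(float *data, float d)
{
    for (int i = 0; i < 8; i++)
        data[i] = d;
}

/* Hypothetical movups variant: two unaligned 16-byte stores, no peeling. */
void set8_unaligned(float *data, float d)
{
#if defined(__SSE__)
    __m128 v = _mm_set1_ps(d);   /* broadcast d to all four lanes */
    _mm_storeu_ps(data, v);      /* unaligned store: elements 0..3 */
    _mm_storeu_ps(data + 4, v);  /* unaligned store: elements 4..7 */
#else
    set8_scalar(data, d);        /* fallback on targets without SSE */
#endif
}
```

Whether this is actually faster than the peeled code depends on the cost of unaligned stores on the target, which this sketch does not measure.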

t.i:3: note: Alignment of access forced using peeling.
t.i:3: note: Peeling for alignment will be applied.

t.i:3: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector outside of loop cost: 13
  Scalar iteration cost: 1
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
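
The threshold of 7 can be reproduced from the other numbers in the dump. The sketch below is a reconstruction of how the vectorizer's cost model appears to combine them, not a copy of the GCC source. The function and parameter names are hypothetical, and the vectorization factor of 4 (four floats per 16-byte vector) is an assumption, since the dump does not print it.

```c
/* Sketch of the profitability threshold; all names are hypothetical.
   Assumes scalar_iter * vf > vec_inside (otherwise vectorizing never pays). */
int min_profitable_iters(int vec_inside, int vec_outside,
                         int scalar_iter, int scalar_outside,
                         int prologue_iters, int epilogue_iters, int vf)
{
    /* Extra fixed cost of the vector version, scaled by vf. */
    int overhead = (vec_outside - scalar_outside) * vf;

    int min_iters = (overhead
                     - vec_inside * prologue_iters
                     - vec_inside * epilogue_iters)
                    / (scalar_iter * vf - vec_inside);

    /* Bump by one if the scalar loop is still no more expensive here. */
    if (scalar_iter * vf * min_iters <= vec_inside * min_iters + overhead)
        min_iters++;

    return min_iters;
}
```

With the values above, min_profitable_iters(1, 13, 1, 7, 2, 2, 4) evaluates to (24 - 2 - 2) / 3 = 6, which is then bumped to 7. The loop in question runs 8 iterations, so it sits just above the threshold.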

t.i:3: note: === vect_do_peeling_for_alignment ===
t.i:3: note: created vect_p.29_13
t.i:3: note: niters for prolog loop: (unsigned int) (4 - (((long unsigned int) vect_p.29_13 & 15) >> 2)) & 3
t.i:3: note: Vectorization may not be profitable.
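
The prolog iteration count in the dump is simply the number of 4-byte elements needed to reach the next 16-byte boundary. A small C sketch of that expression, together with how a fixed trip count then splits into prologue, vector, and epilogue iterations, is given below. The helper names and the struct are hypothetical, and the vector width of 4 floats is assumed.

```c
#include <stdint.h>

/* Scalar iterations needed before the pointer is 16-byte aligned,
   for 4-byte elements; same expression as in the dump. */
unsigned prolog_niters(const void *p)
{
    return (unsigned)(4 - (((uintptr_t)p & 15) >> 2)) & 3;
}

/* Hypothetical helper: split a trip count into the three loops. */
struct split { unsigned prolog, vector, epilog; };

struct split split_iters(const void *p, unsigned total)
{
    struct split s;
    s.prolog = prolog_niters(p);
    if (s.prolog > total)
        s.prolog = total;
    s.vector = (total - s.prolog) / 4;  /* full 4-float vector stores */
    s.epilog = (total - s.prolog) % 4;  /* leftover scalar stores */
    return s;
}
```

For a pointer 8 bytes past a 16-byte boundary and 8 iterations, this gives 2 prologue iterations, 1 vector iteration, and 2 epilogue iterations, so only half of the stores are actually done as vector stores.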


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37194

