This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/37194] New: Autovectorization of constant iteration loop degrades performance


I am seeing a performance degradation in the SPEC CPU2000 benchmark 252.eon
that is caused by autovectorization of a simple loop in the function
ggSpectrum::Set(float).

Here's a simple C version.

void ggSpectrum_Set(float * data, float d) {
   int i;
   for (i = 0; i < 8; i++)
      data[i] = d;
}


When compiled with -O3 -mcpu=970, the following code is generated:

ggSpectrum_Set:
        mfvrsave 0
        stwu 1,-48(1)
        stw 0,44(1)
        oris 0,0,0x8000
        mtvrsave 0
        li 10,0
        rlwinm 0,3,30,30,31
        subfic 0,0,4
        andi. 9,0,3
        beq- 0,.L16
        mtctr 9
        .p2align 4,,15
.L10:
        slwi 0,10,2
        addi 10,10,1
        stfsx 1,3,0
        subfic 8,10,8
        bdnz .L10
.L3:
        subfic 6,9,8
        srwi 0,6,2
        slwi. 7,0,2
        beq- 0,.L5
        mtctr 0
        stfs 1,16(1)
        cmpwi 7,0,0
        li 0,16
        slwi 9,9,2
        li 11,0
        add 9,3,9
        lvewx 0,1,0
        vspltw 0,0,0
        beq- 7,.L17
        .p2align 4,,15
.L6:
        slwi 0,11,4
        addi 11,11,1
        stvx 0,9,0
        bdnz .L6
        cmpw 7,6,7
        subf 8,7,8
        add 10,10,7
        beq- 7,.L9
.L5:
        mtctr 8
        slwi 0,10,2
        add 3,3,0
        .p2align 4,,15
.L8:
        stfs 1,0(3)
        addi 3,3,4
        bdnz .L8
.L9:
        lwz 12,44(1)
        mtvrsave 12
        addi 1,1,48
        blr
.L16:
        mr 10,9
        li 8,8
        b .L3
.L17:
        li 0,1
        mtctr 0
        b .L6
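
Read back into C, the AltiVec sequence above has roughly the following
three-phase shape: a scalar prologue that peels elements until the address is
16-byte aligned (.L10), a vector body (.L6), and a scalar epilogue (.L8). This
is a sketch reconstructed by hand from the listing, not compiler output. The
name set8_peeled is hypothetical, each stvx is modeled as four scalar stores,
and the pointer is assumed to be at least 4-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the control flow in the vectorized listing. */
void set8_peeled(float *data, float d) {
    uintptr_t addr = (uintptr_t)data;
    assert((addr & 3u) == 0);   /* assumption: natural float alignment */

    /* Same arithmetic as rlwinm / subfic / andi.:
       number of floats to store before reaching a 16-byte boundary. */
    unsigned peel = (4u - (unsigned)((addr >> 2) & 3u)) & 3u;

    unsigned i = 0;
    for (; i < peel; i++)        /* scalar prologue, label .L10 */
        data[i] = d;

    unsigned remaining = 8u - peel;
    unsigned vec_iters = remaining >> 2;   /* srwi 0,6,2 */
    for (unsigned v = 0; v < vec_iters; v++) {
        /* vector body, label .L6: one stvx covers four floats */
        data[i + 0] = d;
        data[i + 1] = d;
        data[i + 2] = d;
        data[i + 3] = d;
        i += 4;
    }

    for (; i < 8u; i++)          /* scalar epilogue, label .L8 */
        data[i] = d;
}
```

For only eight stores, the setup and the three separate loops account for most
of the instructions executed, which fits the slowdown reported here.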


Adding -mno-altivec results in this simpler sequence, and a significant boost
in performance (~40% speedup for the benchmark):

ggSpectrum_Set:
        stfs 1,28(3)
        stfs 1,0(3)
        stfs 1,4(3)
        stfs 1,8(3)
        stfs 1,12(3)
        stfs 1,16(3)
        stfs 1,20(3)
        stfs 1,24(3)
        blr


Another thing that stood out from the benchmark run was that the code took a
sizable hit on a couple of the statically predicted branches (apparently the
address was already 16-byte aligned much of the time). It therefore seems best
to remove the static prediction hints and let the hardware branch predictor
take over.
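
To make the alignment effect concrete, the sketch below counts how many of the
hinted branches would actually be taken for a given starting address. In
PowerPC assembler syntax the "-" suffix on beq- is the static "not taken"
hint. The model covers two of the hinted branches from the AltiVec listing:
the one that skips the scalar prologue (beq- 0,.L16) and the one that skips the
scalar epilogue (beq- 7,.L9). The name hinted_branches_taken is hypothetical,
and the address is assumed to be at least 4-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: how many "predict not taken" branches are taken
   for an 8-element store starting at addr. */
int hinted_branches_taken(uintptr_t addr) {
    assert((addr & 3u) == 0);   /* assumption: natural float alignment */

    /* Prologue count, as computed by rlwinm / subfic / andi. */
    unsigned peel = (4u - (unsigned)((addr >> 2) & 3u)) & 3u;
    unsigned remaining = 8u - peel;
    unsigned vec_elems = (remaining >> 2) << 2;   /* srwi then slwi. */

    int taken = 0;
    if (peel == 0)
        taken++;                /* beq- 0,.L16: no prologue needed */
    if (remaining == vec_elems)
        taken++;                /* beq- 7,.L9: no epilogue needed */
    return taken;
}
```

Under this model a 16-byte aligned address takes both hinted branches, while
the three misaligned cases take neither, so the static hints are wrong exactly
when the data is already aligned.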


-- 
           Summary: Autovectorization of constant iteration loop degrades
                    performance
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: pthaugen at gcc dot gnu dot org
 GCC build triplet: powerpc64-linux
  GCC host triplet: powerpc64-linux
GCC target triplet: powerpc64-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37194

