[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark

rguenth at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Jun 12 10:12:00 GMT 2012


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:11:51 UTC ---
Btw, when I run the benchmark with the addition of -march=native (for me,
that's -march=corei7), then GCC 4.7 performs better than 4.6:

4.6:

./t 100000 

test               description   absolute   operations   ratio with
number                           time       per second   test0

 0 "int32_t for loop unroll 1"   0.41 sec   1951.22 M     1.00
 1 "int32_t for loop unroll 2"   0.51 sec   1568.63 M     1.24
 2 "int32_t for loop unroll 3"   0.47 sec   1702.13 M     1.15
 3 "int32_t for loop unroll 4"   0.48 sec   1666.67 M     1.17
 4 "int32_t for loop unroll 5"   0.47 sec   1702.13 M     1.15
 5 "int32_t for loop unroll 6"   0.51 sec   1568.63 M     1.24
 6 "int32_t for loop unroll 7"   0.47 sec   1702.13 M     1.15
 7 "int32_t for loop unroll 8"   0.47 sec   1702.13 M     1.15

Total absolute time for int32_t for loop unrolling: 3.79 sec

4.7:

./t 100000 

test               description   absolute   operations   ratio with
number                           time       per second   test0

 0 "int32_t for loop unroll 1"   0.39 sec   2051.28 M     1.00
 1 "int32_t for loop unroll 2"   0.40 sec   2000.00 M     1.03
 2 "int32_t for loop unroll 3"   0.39 sec   2051.28 M     1.00
 3 "int32_t for loop unroll 4"   0.39 sec   2051.28 M     1.00
 4 "int32_t for loop unroll 5"   0.38 sec   2105.26 M     0.97
 5 "int32_t for loop unroll 6"   0.41 sec   1951.22 M     1.05
 6 "int32_t for loop unroll 7"   0.37 sec   2162.16 M     0.95
 7 "int32_t for loop unroll 8"   0.36 sec   2222.22 M     0.92

Total absolute time for int32_t for loop unrolling: 3.09 sec

The loop then looks like (the expected)

.L53:
        movdqa  (%rax), %xmm4
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm6
        movdqa  16(%rax), %xmm4
        addq    $32, %rax
        cmpq    $data32+32000, %rax
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm5
        jne     .L53
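
Reading the assembly back, each vector lane appears to compute a chained
add/multiply sequence before being accumulated.  A scalar C sketch of what
one iteration does (the constant names k3/m/k1/k2 are hypothetical
stand-ins for whatever the benchmark keeps in xmm3/xmm0/xmm1/xmm2; the
array name and size follow the $data32+32000 bound in the loop):

```c
#include <stdint.h>

#define SIZE 8000            /* 32000 bytes of int32_t, per the asm bound */
int32_t data32[SIZE];

/* Per-element computation matching the paddd/pmulld chain above:
 * v = (((v + k3) * m + k1) * m + k1) * m + k2, summed into an accumulator. */
int32_t check_sum(int32_t k3, int32_t m, int32_t k1, int32_t k2)
{
    int32_t sum = 0;
    for (int i = 0; i < SIZE; ++i) {
        int32_t v = data32[i];
        v = (((v + k3) * m + k1) * m + k1) * m + k2;
        sum += v;
    }
    return sum;
}
```

The vectorizer turns this into the movdqa/paddd/pmulld loop above, two
vectors (eight int32_t elements) per iteration.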

Looks like pmulld is only available with SSE 4.1; otherwise we fall back
to the define_insn_and_split "*sse2_mulv4si3".  But that extra complexity is
not reflected in the vectorizer cost model (which needs improvement ...).


