[Bug rtl-optimization/53533] [4.7/4.8 regression] vectorization causes loop unrolling test slowdown as measured by Adobe's C++Benchmark
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Jun 12 10:12:00 GMT 2012
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53533
--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-06-12 10:11:51 UTC ---
Btw, when I run the benchmark with -march=native added (for me that's
-march=corei7), GCC 4.7 performs better than 4.6:
4.6:
./t 100000
  test                       description   absolute   operations   ratio with
number                                     time       per second   test0

 0 "int32_t for loop unroll 1"   0.41 sec   1951.22 M    1.00
 1 "int32_t for loop unroll 2"   0.51 sec   1568.63 M    1.24
 2 "int32_t for loop unroll 3"   0.47 sec   1702.13 M    1.15
 3 "int32_t for loop unroll 4"   0.48 sec   1666.67 M    1.17
 4 "int32_t for loop unroll 5"   0.47 sec   1702.13 M    1.15
 5 "int32_t for loop unroll 6"   0.51 sec   1568.63 M    1.24
 6 "int32_t for loop unroll 7"   0.47 sec   1702.13 M    1.15
 7 "int32_t for loop unroll 8"   0.47 sec   1702.13 M    1.15
Total absolute time for int32_t for loop unrolling: 3.79 sec
4.7:
./t 100000
  test                       description   absolute   operations   ratio with
number                                     time       per second   test0

 0 "int32_t for loop unroll 1"   0.39 sec   2051.28 M    1.00
 1 "int32_t for loop unroll 2"   0.40 sec   2000.00 M    1.03
 2 "int32_t for loop unroll 3"   0.39 sec   2051.28 M    1.00
 3 "int32_t for loop unroll 4"   0.39 sec   2051.28 M    1.00
 4 "int32_t for loop unroll 5"   0.38 sec   2105.26 M    0.97
 5 "int32_t for loop unroll 6"   0.41 sec   1951.22 M    1.05
 6 "int32_t for loop unroll 7"   0.37 sec   2162.16 M    0.95
 7 "int32_t for loop unroll 8"   0.36 sec   2222.22 M    0.92
Total absolute time for int32_t for loop unrolling: 3.09 sec
The inner loop then looks like the expected:
.L53:
        movdqa  (%rax), %xmm4
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm6
        movdqa  16(%rax), %xmm4
        addq    $32, %rax
        cmpq    $data32+32000, %rax
        paddd   %xmm3, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm1, %xmm4
        pmulld  %xmm0, %xmm4
        paddd   %xmm2, %xmm4
        paddd   %xmm4, %xmm5
        jne     .L53
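For reference, each vector lane in the loop above computes a chain of the form
add, multiply, add, multiply, add, multiply, add, then accumulates. A scalar
sketch of that kernel follows; note this is a hedged reconstruction from the
paddd/pmulld sequence, with invented names (kernel, A, B, C, K) standing in for
the constants held in xmm0-xmm3, not the actual Adobe benchmark source:

```c
#include <stdint.h>
#include <stddef.h>

/* One iteration per element: ((((x + A) * K + B) * K + B) * K) + C,
   summed into a running accumulator, mirroring the instruction chain
   paddd / pmulld / paddd / pmulld / paddd / pmulld / paddd above.  */
int32_t kernel(const int32_t *data, size_t n,
               int32_t A, int32_t B, int32_t C, int32_t K)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t t = data[i] + A;  /* paddd %xmm3 */
        t = t * K + B;            /* pmulld %xmm0, paddd %xmm1 */
        t = t * K + B;            /* pmulld %xmm0, paddd %xmm1 */
        t = t * K;                /* pmulld %xmm0 */
        t += C;                   /* paddd %xmm2 */
        sum += t;                 /* paddd into the accumulator */
    }
    return sum;
}
```

With -O3 and an SSE 4.1-capable -march, GCC vectorizes this shape using
pmulld for each of the three multiplies, as in the dump above.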
It looks like pmulld is only available with SSE 4.1; otherwise we fall back
to the define_insn_and_split "*sse2_mulv4si3". But that extra complexity is
not reflected in the vectorizer cost model (which needs improvement ...).