[Bug tree-optimization/93055] New: accumulation loops in stepanov_vector benchmark use more instruction level parallelism
hubicka at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Dec 23 19:28:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055
Bug ID: 93055
Summary: accumulation loops in stepanov_vector benchmark use
more instruction level parallelism
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---
stepanov_vector benchmark from
https://gitlab.com/chriscox/CppPerformanceBenchmarks gets poor codegen on
TestOneType<double>
Built with -march=bdver1 -O3 (but the regression happens on core too)
Clang compiles accumulation loops for testOneType<int> as follows:
│ vpxor %xmm0,%xmm0,%xmm0
│ vpxor %xmm1,%xmm1,%xmm1
│ vpxor %xmm2,%xmm2,%xmm2
0.05 │ vpxor %xmm3,%xmm3,%xmm3
│ data16 nopw %cs:0x0(%rax,%rax,1)
6.95 │ 300:┌─→vpaddd 0x5f0(%rsp,%rcx,4),%xmm0,%xmm0
0.05 │ │ vpaddd 0x600(%rsp,%rcx,4),%xmm1,%xmm1
7.13 │ │ vpaddd 0x610(%rsp,%rcx,4),%xmm2,%xmm2
0.16 │ │ vpaddd 0x620(%rsp,%rcx,4),%xmm3,%xmm3
│ │ add $0x10,%rcx
│ │ cmp $0x7dc,%rcx
7.04 │ └──jne 300
0.07 │ vpaddd %xmm0,%xmm1,%xmm0
1.61 │ vpaddd %xmm0,%xmm2,%xmm0
│ vpaddd %xmm0,%xmm3,%xmm0
│ vpshuf $0x4e,%xmm0,%xmm1
0.07 │ vpaddd %xmm1,%xmm0,%xmm0
0.02 │ vpshuf $0xe5,%xmm0,%xmm1
while GCC10 does:
│ 1c0: vxorps %xmm0,%xmm0,%xmm0
│ mov %rbx,%rax
│ nop
2.25 │ 1d0:┌─→vpaddd (%rax),%xmm0,%xmm0
0.01 │ │ lea 0x2100(%rsp),%rdi
0.95 │ │ add $0x10,%rax
1.04 │ │ cmp %rax,%rdi
2.24 │ └──jne 1d0
Which runs slower:
test                                                          absolute   operations   ratio with
number  description                                           time       per second   test0
0       "int32_t accumulate pointer verify2"                  1.06 sec   12440.17 M   1.00
1       "int32_t accumulate vector iterator"                  1.06 sec   12458.15 M   1.00
2       "int32_t accumulate pointer reverse reverse"          1.06 sec   12440.34 M   1.00
3       "int32_t accumulate vector reverse_iterator reverse"  1.05 sec   12602.74 M   0.99
4       "int32_t accumulate vector iterator reverse reverse"  1.04 sec   12749.27 M   0.98
5       "int32_t accumulate array Riterator reverse reverse"  1.06 sec   12486.26 M   1.00
Total absolute time for int32_t Vector Accumulate: 6.32 sec
int32_t Vector Accumulate Penalty: 0.99
compared to:
test                                                          absolute   operations   ratio with
number  description                                           time       per second   test0
0       "int32_t accumulate pointer verify2"                  2.29 sec   5773.60 M    1.00
1       "int32_t accumulate vector iterator"                  2.27 sec   5806.96 M    0.99
2       "int32_t accumulate pointer reverse reverse"          2.26 sec   5830.72 M    0.99
3       "int32_t accumulate vector reverse_iterator reverse"  2.27 sec   5827.45 M    0.99
4       "int32_t accumulate vector iterator reverse reverse"  2.27 sec   5821.29 M    0.99
5       "int32_t accumulate array Riterator reverse reverse"  2.27 sec   5826.58 M    0.99
Total absolute time for int32_t Vector Accumulate: 13.62 sec
int32_t Vector Accumulate Penalty: 0.99
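
For illustration only, the difference between the two listings can be sketched at source level. The Clang loop keeps four independent partial sums (%xmm0..%xmm3) and combines them after the loop, while the GCC loop feeds every add into the same register. The function names sum_single and sum_four below are hypothetical and are not taken from the benchmark; this is a scalar sketch of the idea, not the code either compiler actually sees, and it makes no claim about what any compiler will emit for it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: each add depends on the result of the previous add,
// so the loop is bound by the latency of a single dependency chain.
// Unsigned arithmetic is used so that wraparound is well defined.
std::int32_t sum_single(const std::vector<std::int32_t>& v) {
    std::uint32_t acc = 0;
    for (std::int32_t x : v)
        acc += static_cast<std::uint32_t>(x);
    return static_cast<std::int32_t>(acc);
}

// Four accumulators: the four adds in each iteration are independent of
// one another, which gives the CPU four separate dependency chains.
// The partial sums are combined once, after the loop.
std::int32_t sum_four(const std::vector<std::int32_t>& v) {
    std::uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += static_cast<std::uint32_t>(v[i]);
        a1 += static_cast<std::uint32_t>(v[i + 1]);
        a2 += static_cast<std::uint32_t>(v[i + 2]);
        a3 += static_cast<std::uint32_t>(v[i + 3]);
    }
    // Leftover elements when the size is not a multiple of four.
    for (; i < n; ++i)
        a0 += static_cast<std::uint32_t>(v[i]);
    return static_cast<std::int32_t>((a0 + a1) + (a2 + a3));
}
```

Because modular integer addition is associative, both functions return the same value for any input; only the shape of the dependency chain differs.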