[Bug target/87455] New: sse_packed_single_insn_optimal is suboptimal on Zen
fanael4 at gmail dot com
gcc-bugzilla@gcc.gnu.org
Thu Sep 27 15:59:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87455
Bug ID: 87455
Summary: sse_packed_single_insn_optimal is suboptimal on Zen
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: fanael4 at gmail dot com
Target Milestone: ---
GCC enables -mtune-ctrl=sse_packed_single_insn_optimal by default with
-mtune=znver1, even though that microarchitecture dislikes it for the same
reason Intel's microarchitectures do: domain-crossing operations incur
additional latency. For example, using xorps on integer data costs one cycle
more than using pxor.
Example code:
#include <immintrin.h>
int main() {
    auto x = _mm_setr_epi32(1, 2, 3, 4);
    auto y = _mm_setr_epi32(5, 6, 7, 8);
    auto z = _mm_setr_epi32(9, 10, 11, 12);
    for(int i = 0; i < 1000000000; ++i) {
        x = _mm_add_epi32(x, y);
        y = _mm_xor_si128(y, z);
        z = _mm_add_epi32(z, x);
        x = _mm_xor_si128(x, y);
        y = _mm_add_epi32(y, z);
        z = _mm_xor_si128(z, x);
    }
    asm volatile("" :: "m"(x), "m"(y), "m"(z));
}
Compiled with GCC 8.2 using -O3 -mtune=znver1, the program yields the
following perf counters:
$ perf stat -e task-clock,cycles,instructions ./a.out
Performance counter stats for './a.out':
1 193,69 msec task-clock:u # 0,989 CPUs utilized
4 040 330 384 cycles:u # 3386697,723 GHz
10 002 005 027 instructions:u # 2,48 insn per cycle
1,206801245 seconds time elapsed
1,190625000 seconds user
0,003995000 seconds sys
However, the code compiled with -O3 -mtune=znver1
-mtune-ctrl=^sse_packed_single_insn_optimal is significantly faster:
$ perf stat -e task-clock,cycles,instructions ./a.out
Performance counter stats for './a.out':
894,08 msec task-clock:u # 0,998 CPUs utilized
3 012 492 242 cycles:u # 3369678,123 GHz
10 002 004 492 instructions:u # 3,32 insn per cycle
0,895728255 seconds time elapsed
0,894688000 seconds user
0,000000000 seconds sys
This is on a Ryzen 5 2500U.
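To put the two runs side by side, the counters can be normalized by the loop's
trip count. The sketch below only redoes arithmetic on the numbers quoted
above; the constant and function names are made up for illustration and are
not part of GCC or perf. It gives about 4.04 versus 3.01 cycles per iteration,
a gap of roughly one cycle per iteration. That is consistent with an extra
domain-crossing delay on the loop's dependency chain, although the counters
alone do not prove it.

```cpp
#include <cstdint>
#include <cstdio>

// Counter values copied from the two perf runs quoted in this report.
// All names below are illustrative only.
constexpr std::uint64_t kIterations     = 1000000000ULL;   // loop trip count in the example
constexpr std::uint64_t kCyclesDefault  = 4040330384ULL;   // -O3 -mtune=znver1
constexpr std::uint64_t kCyclesDisabled = 3012492242ULL;   // same, with the tuning flag turned off
constexpr std::uint64_t kInsnsDefault   = 10002005027ULL;
constexpr std::uint64_t kInsnsDisabled  = 10002004492ULL;

// Average count per loop iteration.
constexpr double per_iteration(std::uint64_t count) {
    return static_cast<double>(count) / static_cast<double>(kIterations);
}

// Instructions retired per cycle.
constexpr double ipc(std::uint64_t insns, std::uint64_t cycles) {
    return static_cast<double>(insns) / static_cast<double>(cycles);
}

// How many times faster the second run is, measured in cycles.
constexpr double speedup() {
    return static_cast<double>(kCyclesDefault) / static_cast<double>(kCyclesDisabled);
}

// Prints the normalized figures for both runs.
inline void print_summary() {
    std::printf("default : %.3f cycles/iter, IPC %.2f\n",
                per_iteration(kCyclesDefault), ipc(kInsnsDefault, kCyclesDefault));
    std::printf("disabled: %.3f cycles/iter, IPC %.2f\n",
                per_iteration(kCyclesDisabled), ipc(kInsnsDisabled, kCyclesDisabled));
    std::printf("delta   : %.3f cycles/iter, speedup %.2fx\n",
                per_iteration(kCyclesDefault) - per_iteration(kCyclesDisabled),
                speedup());
}
```

Calling print_summary() from a main function prints the per-iteration figures.
The instruction count is essentially the same in both runs (about ten per
iteration), so the difference comes from how long the instructions take, not
from how many are executed.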