[Bug target/84986] New: Performance regression: loop no longer vectorized (x86-64)
gergo.barany at inria dot fr
gcc-bugzilla@gcc.gnu.org
Tue Mar 20 08:16:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986
Bug ID: 84986
Summary: Performance regression: loop no longer vectorized (x86-64)
Product: gcc
Version: 8.0.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gergo.barany at inria dot fr
Target Milestone: ---
Created attachment 43713
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43713&action=edit
input function showing performance regression
For context: I throw randomly generated code at compilers and look at
differences in how they optimize; see
https://github.com/gergo-/missed-optimizations for details if interested. The
test case below is entirely artificial; I do *not* have any real-world
application that depends on this.
The attached test.c file contains a function with a simple loop:
int N;

long fn1(void) {
    short i;
    long a;
    i = a = 0;
    while (i < N)
        a -= i++;
    return a;
}
Until recently, this loop was vectorized on x86-64, with the core loop
(if I understand the code correctly) looking something like this, as generated
by GCC trunk from 20180206 (with -O3):
40: 66 0f 6f ce movdqa %xmm6,%xmm1
44: 66 0f 6f e3 movdqa %xmm3,%xmm4
48: 66 0f 6f d3 movdqa %xmm3,%xmm2
4c: 83 c0 01 add $0x1,%eax
4f: 66 0f 65 cb pcmpgtw %xmm3,%xmm1
53: 66 0f fd df paddw %xmm7,%xmm3
57: 66 0f 69 e1 punpckhwd %xmm1,%xmm4
5b: 66 0f 61 d1 punpcklwd %xmm1,%xmm2
5f: 66 0f 6f cc movdqa %xmm4,%xmm1
63: 66 0f 6f e5 movdqa %xmm5,%xmm4
67: 66 44 0f 6f c2 movdqa %xmm2,%xmm8
6c: 66 0f 66 e2 pcmpgtd %xmm2,%xmm4
70: 66 44 0f 62 c4 punpckldq %xmm4,%xmm8
75: 66 0f 6a d4 punpckhdq %xmm4,%xmm2
79: 66 0f 6f e1 movdqa %xmm1,%xmm4
7d: 66 41 0f fb c0 psubq %xmm8,%xmm0
82: 66 0f fb c2 psubq %xmm2,%xmm0
86: 66 0f 6f d5 movdqa %xmm5,%xmm2
8a: 66 0f 66 d1 pcmpgtd %xmm1,%xmm2
8e: 66 0f 62 e2 punpckldq %xmm2,%xmm4
92: 66 0f 6a ca punpckhdq %xmm2,%xmm1
96: 66 0f fb c4 psubq %xmm4,%xmm0
9a: 66 0f fb c1 psubq %xmm1,%xmm0
9e: 39 c1 cmp %eax,%ecx
a0: 77 9e ja 40 <fn1+0x40>
(I'm sorry this comes from objdump; I didn't keep that GCC version around to
generate a nicer assembly listing.)
With a version from 20180319 (r258665), this is no longer the case:
.L3:
movswq %dx, %rcx
addl $1, %edx
subq %rcx, %rax
movswl %dx, %ecx
cmpl %esi, %ecx
jl .L3
Linking the two versions against a driver program, which simply calls this
function many times after setting N to SHRT_MAX, shows a slowdown of about
1.8x:
$ time ./test.20180206 ; time ./test.20180319
32767 elements in 0.000009 sec on average, result = -536821761000000
real 0m8.875s
user 0m8.844s
sys 0m0.028s
32767 elements in 0.000016 sec on average, result = -536821761000000
real 0m15.691s
user 0m15.688s
sys 0m0.000s
Target: x86_64-pc-linux-gnu
Configured with: ../../src/gcc/configure
--prefix=/home/gergo/optcheck/compilers/install --enable-languages=c
--with-newlib --without-headers --disable-bootstrap --disable-nls
--disable-shared --disable-multilib --disable-decimal-float --disable-threads
--disable-libatomic --disable-libgomp --disable-libmpx --disable-libquadmath
--disable-libssp --disable-libvtv --disable-libstdcxx
--program-prefix=optcheck-x86- --target=x86_64-pc-linux-gnu
Thread model: single
This is under Linux on a machine whose CPU identifies itself as Intel(R)
Core(TM) i7-4712HQ CPU @ 2.30GHz.
For whatever it's worth, Clang goes the opposite way, vectorizes very
aggressively, and ends up slower:
$ time ./test.clang
32767 elements in 0.000019 sec on average, result = -536821761000000
real 0m18.930s
user 0m18.928s
sys 0m0.000s
With the previous version, GCC was about 2.1x faster than Clang; this seems to
have regressed to being "only" 1.2x faster.