[Bug target/84986] New: Performance regression: loop no longer vectorized (x86-64)

gergo.barany at inria dot fr gcc-bugzilla@gcc.gnu.org
Tue Mar 20 08:16:00 GMT 2018


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986

            Bug ID: 84986
           Summary: Performance regression: loop no longer vectorized
                    (x86-64)
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 43713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43713&action=edit
input function showing performance regression

For context: I throw randomly generated code at compilers and look at
differences in how they optimize; see
https://github.com/gergo-/missed-optimizations for details if you're interested.
The test case below is entirely artificial; I do *not* have any real-world
application that depends on it.

The attached test.c file contains a function with a simple loop:

int N;
long fn1(void) {
  short i;
  long a;
  i = a = 0;
  while (i < N)
    a -= i++;
  return a;
}

Until recently, this loop was vectorized on x86-64, with the core loop
(if I understand the code correctly) looking something like this, as generated
by GCC trunk from 20180206 (with -O3):

  40:   66 0f 6f ce             movdqa %xmm6,%xmm1
  44:   66 0f 6f e3             movdqa %xmm3,%xmm4
  48:   66 0f 6f d3             movdqa %xmm3,%xmm2
  4c:   83 c0 01                add    $0x1,%eax
  4f:   66 0f 65 cb             pcmpgtw %xmm3,%xmm1
  53:   66 0f fd df             paddw  %xmm7,%xmm3
  57:   66 0f 69 e1             punpckhwd %xmm1,%xmm4
  5b:   66 0f 61 d1             punpcklwd %xmm1,%xmm2
  5f:   66 0f 6f cc             movdqa %xmm4,%xmm1
  63:   66 0f 6f e5             movdqa %xmm5,%xmm4
  67:   66 44 0f 6f c2          movdqa %xmm2,%xmm8
  6c:   66 0f 66 e2             pcmpgtd %xmm2,%xmm4
  70:   66 44 0f 62 c4          punpckldq %xmm4,%xmm8
  75:   66 0f 6a d4             punpckhdq %xmm4,%xmm2
  79:   66 0f 6f e1             movdqa %xmm1,%xmm4
  7d:   66 41 0f fb c0          psubq  %xmm8,%xmm0
  82:   66 0f fb c2             psubq  %xmm2,%xmm0
  86:   66 0f 6f d5             movdqa %xmm5,%xmm2
  8a:   66 0f 66 d1             pcmpgtd %xmm1,%xmm2
  8e:   66 0f 62 e2             punpckldq %xmm2,%xmm4
  92:   66 0f 6a ca             punpckhdq %xmm2,%xmm1
  96:   66 0f fb c4             psubq  %xmm4,%xmm0
  9a:   66 0f fb c1             psubq  %xmm1,%xmm0
  9e:   39 c1                   cmp    %eax,%ecx
  a0:   77 9e                   ja     40 <fn1+0x40>

(I'm sorry this comes from objdump; I didn't keep that GCC version around to
generate a nicer assembly listing.)

With a version from 20180319 (r258665), this is no longer the case:

.L3:
        movswq  %dx, %rcx
        addl    $1, %edx
        subq    %rcx, %rax
        movswl  %dx, %ecx
        cmpl    %esi, %ecx
        jl      .L3

Linking the two versions against a driver program, which simply calls this
function many times after setting N to SHRT_MAX, shows a slowdown of about
1.8x:

$ time ./test.20180206 ; time ./test.20180319 
32767 elements in 0.000009 sec on average, result = -536821761000000

real    0m8.875s
user    0m8.844s
sys     0m0.028s
32767 elements in 0.000016 sec on average, result = -536821761000000

real    0m15.691s
user    0m15.688s
sys     0m0.000s
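
For reference, the driver itself is not attached; the sketch below shows
roughly what it does. The iteration count, the accumulation into result, and
the timing code are my assumptions rather than the actual harness:

#include <limits.h>
#include <stdio.h>
#include <time.h>

extern int N;           /* defined in test.c */
extern long fn1(void);  /* the function under test */

int main(void) {
    const long iterations = 1000000;  /* assumed repetition count */
    long result = 0;

    N = SHRT_MAX;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        result += fn1();              /* accumulate so the calls aren't dead */
    double total = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("%d elements in %f sec on average, result = %ld\n",
           N, total / iterations, result);
    return 0;
}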

Target: x86_64-pc-linux-gnu
Configured with: ../../src/gcc/configure
--prefix=/home/gergo/optcheck/compilers/install --enable-languages=c
--with-newlib --without-headers --disable-bootstrap --disable-nls
--disable-shared --disable-multilib --disable-decimal-float --disable-threads
--disable-libatomic --disable-libgomp --disable-libmpx --disable-libquadmath
--disable-libssp --disable-libvtv --disable-libstdcxx
--program-prefix=optcheck-x86- --target=x86_64-pc-linux-gnu
Thread model: single

This is under Linux on a machine whose CPU identifies itself as Intel(R)
Core(TM) i7-4712HQ CPU @ 2.30GHz.


For whatever it's worth, Clang goes the opposite way, vectorizes very
aggressively, and ends up slower than either GCC version:

$ time ./test.clang 
32767 elements in 0.000019 sec on average, result = -536821761000000

real    0m18.930s
user    0m18.928s
sys     0m0.000s

With the previous version, GCC was about 2.1x faster than Clang; this seems to
have regressed to "only" 1.2x faster.

