[Bug target/84986] New: Performance regression: loop no longer vectorized (x86-64)

gergo.barany at inria dot fr gcc-bugzilla@gcc.gnu.org
Tue Mar 20 08:16:00 GMT 2018


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986

            Bug ID: 84986
           Summary: Performance regression: loop no longer vectorized
                    (x86-64)
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gergo.barany at inria dot fr
  Target Milestone: ---

Created attachment 43713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43713&action=edit
input function showing performance regression

For context: I throw randomly generated code at compilers and look at
differences in how they optimize; see
https://github.com/gergo-/missed-optimizations for details if you're interested.
The test case below is entirely artificial; I do *not* have any real-world
application that depends on it.

The attached test.c file contains a function with a simple loop:

int N;
long fn1(void) {
  short i;
  long a;
  i = a = 0;
  while (i < N)
    a -= i++;
  return a;
}

Until recently, this loop was vectorized on x86-64, with the core loop
(if I understand the code correctly) looking something like this, as generated
by GCC trunk from 20180206 (with -O3):

  40:   66 0f 6f ce             movdqa %xmm6,%xmm1
  44:   66 0f 6f e3             movdqa %xmm3,%xmm4
  48:   66 0f 6f d3             movdqa %xmm3,%xmm2
  4c:   83 c0 01                add    $0x1,%eax
  4f:   66 0f 65 cb             pcmpgtw %xmm3,%xmm1
  53:   66 0f fd df             paddw  %xmm7,%xmm3
  57:   66 0f 69 e1             punpckhwd %xmm1,%xmm4
  5b:   66 0f 61 d1             punpcklwd %xmm1,%xmm2
  5f:   66 0f 6f cc             movdqa %xmm4,%xmm1
  63:   66 0f 6f e5             movdqa %xmm5,%xmm4
  67:   66 44 0f 6f c2          movdqa %xmm2,%xmm8
  6c:   66 0f 66 e2             pcmpgtd %xmm2,%xmm4
  70:   66 44 0f 62 c4          punpckldq %xmm4,%xmm8
  75:   66 0f 6a d4             punpckhdq %xmm4,%xmm2
  79:   66 0f 6f e1             movdqa %xmm1,%xmm4
  7d:   66 41 0f fb c0          psubq  %xmm8,%xmm0
  82:   66 0f fb c2             psubq  %xmm2,%xmm0
  86:   66 0f 6f d5             movdqa %xmm5,%xmm2
  8a:   66 0f 66 d1             pcmpgtd %xmm1,%xmm2
  8e:   66 0f 62 e2             punpckldq %xmm2,%xmm4
  92:   66 0f 6a ca             punpckhdq %xmm2,%xmm1
  96:   66 0f fb c4             psubq  %xmm4,%xmm0
  9a:   66 0f fb c1             psubq  %xmm1,%xmm0
  9e:   39 c1                   cmp    %eax,%ecx
  a0:   77 9e                   ja     40 <fn1+0x40>

(I'm sorry this comes from objdump; I didn't keep that GCC version around to
generate a nicer assembly listing.)

With a version from 20180319 (r258665), this is no longer the case:

.L3:
        movswq  %dx, %rcx
        addl    $1, %edx
        subq    %rcx, %rax
        movswl  %dx, %ecx
        cmpl    %esi, %ecx
        jl      .L3

Linking the two versions against a driver program, which simply calls this
function many times after setting N to SHRT_MAX, shows a slowdown of about
1.8x:

$ time ./test.20180206 ; time ./test.20180319 
32767 elements in 0.000009 sec on average, result = -536821761000000

real    0m8.875s
user    0m8.844s
sys     0m0.028s
32767 elements in 0.000016 sec on average, result = -536821761000000

real    0m15.691s
user    0m15.688s
sys     0m0.000s
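
For reference, the driver itself is not attached; the sketch below shows
roughly what it does. The iteration count, the accumulation into result, and
the timing code are my assumptions rather than the actual harness:

#include <limits.h>
#include <stdio.h>
#include <time.h>

extern int N;           /* defined in test.c */
extern long fn1(void);  /* the function under test */

int main(void) {
    const long iterations = 1000000;  /* assumed repetition count */
    long result = 0;

    N = SHRT_MAX;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        result += fn1();              /* accumulate so the calls aren't dead */
    double total = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("%d elements in %f sec on average, result = %ld\n",
           N, total / iterations, result);
    return 0;
}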

Target: x86_64-pc-linux-gnu
Configured with: ../../src/gcc/configure
--prefix=/home/gergo/optcheck/compilers/install --enable-languages=c
--with-newlib --without-headers --disable-bootstrap --disable-nls
--disable-shared --disable-multilib --disable-decimal-float --disable-threads
--disable-libatomic --disable-libgomp --disable-libmpx --disable-libquadmath
--disable-libssp --disable-libvtv --disable-libstdcxx
--program-prefix=optcheck-x86- --target=x86_64-pc-linux-gnu
Thread model: single

This is under Linux on a machine whose CPU identifies itself as Intel(R)
Core(TM) i7-4712HQ CPU @ 2.30GHz.


For whatever it's worth, Clang goes the opposite way, vectorizes very
aggressively, and ends up slower than either GCC version:

$ time ./test.clang 
32767 elements in 0.000019 sec on average, result = -536821761000000

real    0m18.930s
user    0m18.928s
sys     0m0.000s

With the previous version, GCC was about 2.1x faster than Clang; this seems to
have regressed to "only" 1.2x faster.

