This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

From: "aldyh at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Wed, 28 Feb 2018 09:54:01 +0000
Subject: [Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower
Auto-submitted: auto-generated
References: <bug-57534-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|i?86-*-*                    |i?86-*-*, x86-64

--- Comment #20 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
For the record, an even smaller test that I believe shows the problem even on
x86-64:

int ind;
int cond(void);

double hand_benchmark_cache_ronly( double *x) {
    double sum=0.0;
    while (cond())
        sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
    return sum;
}

With -O2 we get an extra lea in the loop:

        movslq  ind(%rip), %rdx
        leaq    0(,%rdx,8), %rax        <-- BOO!
        movsd   8(%rbx,%rax), %xmm0
        addsd   (%rbx,%rdx,8), %xmm0
        addsd   16(%rbx,%rax), %xmm0
        addsd   24(%rbx,%rax), %xmm0
        addsd   8(%rsp), %xmm0
        movsd   %xmm0, 8(%rsp)

whereas with -O2 -fno-tree-slsr we get:

        movslq  ind(%rip), %rax
        movsd   8(%rbx,%rax,8), %xmm0
        addsd   (%rbx,%rax,8), %xmm0
        addsd   16(%rbx,%rax,8), %xmm0
        addsd   24(%rbx,%rax,8), %xmm0
        addsd   8(%rsp), %xmm0
        movsd   %xmm0, 8(%rsp)
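
For anyone who wants to run the test rather than just read the assembly, here is
a minimal driver sketch. The cond() stand-in, the "remaining" counter, and
run_once() are hypothetical names that are not part of the original test case.
Note that defining cond() in the same file lets GCC see its body, so the loop it
generates may differ from the listings above; to compare assembly, keep the
original test in its own file and build it with "gcc -O2 -S" and
"gcc -O2 -fno-tree-slsr -S" (adding -fdump-tree-optimized for the gimple dump).

```c
/* Hypothetical driver: only hand_benchmark_cache_ronly() and ind come from
   the test case; everything else is an illustrative assumption. */
int ind;                /* global index read inside the loop */
static int remaining;   /* hypothetical iteration budget for cond() */

/* Stand-in for the opaque cond(): returns true `remaining` more times. */
int cond(void)
{
    return remaining-- > 0;
}

/* The test function, unchanged. */
double hand_benchmark_cache_ronly(double *x)
{
    double sum = 0.0;
    while (cond())
        sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
    return sum;
}

/* Run the loop `iterations` times starting at index `start`. */
double run_once(int start, int iterations)
{
    static double data[8] = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };

    ind = start;
    remaining = iterations;
    return hand_benchmark_cache_ronly(data);
}
```

Calling run_once(2, 3) sums data[2] through data[5] three times.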

The .optimized dump for -O2 shows ind*8 being CSE'd into a single temporary
(_3), and each address being calculated as "x + (ind*8 + CST)":

  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;          ;; common expression: ind*8
  _4 = x_26(D) + _3;
  _5 = *_4;
  _7 = _3 + 8;          ;; ind*8 + 8
  _8 = x_26(D) + _7;
  _9 = *_8;
...

Whereas with -O2 -fno-tree-slsr, the address is calculated as "(ind+CST) * 8 +
x":

  ind.0_1 = ind;
  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;
  _4 = x_26(D) + _3;
  _5 = *_4;
  _6 = _2 + 1;
  _7 = _6 * 8;          ;; (ind+1) * 8
  _8 = x_26(D) + _7;    ;; (ind+1) * 8 + x
  _9 = *_8;

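Ironically, the -O2 gimple looks more efficient, but it gets crappy addressing
on x86: with ind*8 held in its own register, the loads can no longer use the
scaled-index form disp(%rbx,%rax,8), hence the extra lea.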
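
To make the two shapes concrete, here is a sketch of both offset computations
written out in C. The function names are hypothetical and only mirror the
gimple above; this is not how GCC represents either form internally. Both
functions return the same byte offset. They differ only in whether the
multiply by 8 is shared across accesses or left next to the index, where it
can fold into an x86 scaled-index address.

```c
#include <stddef.h>

/* Shape seen with -O2: scale once, then add a byte constant.
   Mirrors "_3 = _2 * 8; _7 = _3 + 8;" for k == 1. */
size_t offset_shared_scale(size_t ind, size_t k)
{
    size_t scaled = ind * 8;    /* common expression: ind*8 */
    return scaled + 8 * k;      /* ind*8 + CST */
}

/* Shape seen with -O2 -fno-tree-slsr: add to the index, then scale.
   Mirrors "_6 = _2 + 1; _7 = _6 * 8;" for k == 1. */
size_t offset_scaled_index(size_t ind, size_t k)
{
    return (ind + k) * 8;       /* (ind+CST) * 8 */
}

/* Load x[ind + k] through an explicit byte offset, as the gimple does. */
double load_at(const double *x, size_t byte_offset)
{
    return *(const double *)((const char *)x + byte_offset);
}
```

For ind == 5 and k == 1, both functions yield a byte offset of 48.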
