This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower
- From: "aldyh at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 28 Feb 2018 09:54:01 +0000
- Auto-submitted: auto-generated
- References: <bug-57534-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|i?86-*-*                    |i?86-*-*, x86-64
--- Comment #20 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
For the record, an even smaller test that I believe shows the problem even on
x86-64:
int ind;
int cond(void);

double hand_benchmark_cache_ronly(double *x) {
    double sum = 0.0;
    while (cond())
        sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
    return sum;
}
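To actually run the test case, something has to supply cond(). The following is only a sketch of one way to do that: the `remaining` counter, the cond() body and `run_demo` are hypothetical and not part of the bug report.

```c
#include <assert.h>

/* Test case from the comment above (only the whitespace is changed). */
int ind;
int cond(void);

double hand_benchmark_cache_ronly(double *x)
{
    double sum = 0.0;
    while (cond())
        sum += x[ind] + x[ind + 1] + x[ind + 2] + x[ind + 3];
    return sum;
}

/* Everything below is a hypothetical driver, not part of the bug report. */
static int remaining;   /* iterations left before cond() returns 0 */

int cond(void)
{
    if (remaining <= 0)
        return 0;
    remaining--;
    return 1;
}

/* Run the loop `iters` times starting at index `start` of an 8-element array. */
double run_demo(int start, int iters)
{
    static double data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

    assert(start >= 0 && start + 4 <= 8);   /* x[ind+3] must stay in bounds */
    ind = start;
    remaining = iters;
    return hand_benchmark_cache_ronly(data);
}
```

With this driver, run_demo(0, 2) adds 1+2+3+4 twice. When comparing assembly, it is probably better to keep cond() in a separate file: if it is visible in the same translation unit, GCC may inline it and the loop will no longer match the listings in this comment.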
With -O2 we get an extra lea in the loop:
movslq ind(%rip), %rdx
leaq 0(,%rdx,8), %rax <-- BOO!
movsd 8(%rbx,%rax), %xmm0
addsd (%rbx,%rdx,8), %xmm0
addsd 16(%rbx,%rax), %xmm0
addsd 24(%rbx,%rax), %xmm0
addsd 8(%rsp), %xmm0
movsd %xmm0, 8(%rsp)
whereas with -O2 -fno-tree-slsr we get:
movslq ind(%rip), %rax
movsd 8(%rbx,%rax,8), %xmm0
addsd (%rbx,%rax,8), %xmm0
addsd 16(%rbx,%rax,8), %xmm0
addsd 24(%rbx,%rax,8), %xmm0
addsd 8(%rsp), %xmm0
movsd %xmm0, 8(%rsp)
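The two sequences above differ only in how they use the x86 addressing mode. As a rough model, assuming the usual effective-address rule disp(base,index,scale) = base + index*scale + disp, both compute the same addresses; the first just spends an extra instruction on the multiply. All function names here are made up for illustration.

```c
#include <stdint.h>

/* Model of the x86 effective address for disp(base,index,scale). */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  unsigned scale, int64_t disp)
{
    return base + index * scale + (uint64_t)disp;
}

/* -O2 sequence: the lea materializes ind*8, the loads then use scale 1. */
uint64_t addr_with_lea(uint64_t rbx, uint64_t ind, int64_t disp)
{
    uint64_t rax = effective_address(0, ind, 8, 0);   /* leaq 0(,%rdx,8), %rax */
    return effective_address(rbx, rax, 1, disp);      /* disp(%rbx,%rax)       */
}

/* -O2 -fno-tree-slsr sequence: the multiply is folded into the scale field. */
uint64_t addr_scaled(uint64_t rbx, uint64_t ind, int64_t disp)
{
    return effective_address(rbx, ind, 8, disp);      /* disp(%rbx,%rax,8)     */
}

/* Returns 1 if both sequences agree for a few indices and displacements. */
int sequences_agree(void)
{
    const uint64_t rbx = 0x1000;   /* arbitrary base address for the model */

    for (uint64_t ind = 0; ind < 4; ind++)
        for (int64_t disp = 0; disp <= 24; disp += 8)
            if (addr_with_lea(rbx, ind, disp) != addr_scaled(rbx, ind, disp))
                return 0;
    return 1;
}
```

This is arithmetic only and says nothing about timing; it just shows that the lea is redundant work as far as the resulting addresses are concerned.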
The .optimized dump for -O2 shows ind*8 being CSE'd, with each address
calculated as "x + (ind*8 + CST)":
_2 = (long unsigned int) ind.0_1;
_3 = _2 * 8; ;; common expression: ind*8
_4 = x_26(D) + _3;
_5 = *_4;
_7 = _3 + 8; ;; ind*8 + 8
_8 = x_26(D) + _7;
_9 = *_8;
...
Whereas with -O2 -fno-tree-slsr, the address is calculated as
"(ind+CST) * 8 + x":
ind.0_1 = ind;
_2 = (long unsigned int) ind.0_1;
_3 = _2 * 8;
_4 = x_26(D) + _3;
_5 = *_4;
_6 = _2 + 1;
_7 = _6 * 8; ;; (ind+1) * 8
_8 = x_26(D) + _7; ;; (ind+1) * 8 + x
_9 = *_8;
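Ironically, the -O2 gimple looks more efficient, but it gets crappy addressing
on x86: the shared ind*8 ends up in its own register via the extra lea instead
of being folded into the scaled-index addressing mode.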
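For reference, the two gimple shapes can be written out in plain C. This is a sketch under the assumption that sizeof(double) is 8, as in the dumps; the function names are hypothetical, and the comments map each line to the SSA statement it is meant to mirror.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

static_assert(sizeof(double) == 8, "the dumps scale the index by 8");

/* Shape of the -O2 dump: ind*8 is computed once and a constant is added. */
double load_shared_offset(const double *x, size_t ind, size_t k)
{
    size_t scaled = ind * 8;          /* _3 = _2 * 8      */
    size_t offset = scaled + 8 * k;   /* _7 = _3 + 8      */
    return *(const double *)((const char *)x + offset);   /* _8 = x + _7 */
}

/* Shape of the -fno-tree-slsr dump: the constant is added before scaling. */
double load_scaled_index(const double *x, size_t ind, size_t k)
{
    size_t index = ind + k;           /* _6 = _2 + 1      */
    size_t offset = index * 8;        /* _7 = _6 * 8      */
    return *(const double *)((const char *)x + offset);   /* _8 = x + _7 */
}

/* Returns 1 if both shapes load the expected element for k = 0..3. */
int forms_agree(void)
{
    static const double data[6] = { 10, 11, 12, 13, 14, 15 };
    size_t ind = 1;

    for (size_t k = 0; k < 4; k++) {
        if (load_shared_offset(data, ind, k) != data[ind + k])
            return 0;
        if (load_scaled_index(data, ind, k) != data[ind + k])
            return 0;
    }
    return 1;
}
```

Both functions load the same element. The second shape keeps the index unscaled, which is what lets the x86 backend express each load as disp(base,index,8).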