This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug tree-optimization/82666] New: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Sun, 22 Oct 2017 22:56:48 +0000
- Subject: [Bug tree-optimization/82666] New: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666
Bug ID: 82666
Summary: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov
on the critical path (at -O2)
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
long long sumarray(const int *data)
{
    data = (const int*)__builtin_assume_aligned(data, 64);
    long long sum = 0;
    for (int c=0 ; c<32768 ; c++)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}
The loop body is written to encourage gcc to make the loop-carried dep chain
just an ADD, with independent branchless zeroing of each input. Unfortunately,
gcc7 and gcc8 -O2 de-optimize it back to what we get with older gcc -O3 from

    if (data[c] >= 128)   // doesn't auto-vectorize with gcc4, unlike the above
        sum += data[c];
See also
https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2.
https://godbolt.org/g/GgVp7E
gcc 8.0.0 20171022 -O2 -mtune=haswell (slow)
        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
.L3:
        movslq  (%rdi), %rdx
        movq    %rdx, %rcx
        addq    %rax, %rdx       # mov+add could have been LEA
        cmpl    $127, %ecx
        cmovg   %rdx, %rax       # sum = (x>=128 ? sum+x : sum)
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne     .L3
        ret
This version has a 3-cycle-latency loop-carried dep chain (addq %rax, %rdx
and cmovg). It's also 8 fused-domain uops (1 more than older gcc), but using
LEA would fix that.
gcc6.3 -O2 -mtune=haswell (last good version of gcc on Godbolt, for this test)
        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
        xorl    %ecx, %ecx       # extra zero constant for a cmov source
.L3:
        movslq  (%rdi), %rdx
        cmpl    $127, %edx
        cmovle  %rcx, %rdx       # rdx = 0 when rdx < 128
        addq    $4, %rdi
        addq    %rdx, %rax       # sum += ...  critical path: 1c latency
        cmpq    %rsi, %rdi
        jne     .L3
        ret
7 fused-domain uops in the loop (cmov is 2 with 2c latency before Broadwell).
Should run at 1.75 cycles per iter on Haswell (or slightly slower due to an odd
number of uops in the loop buffer), bottlenecked on the front-end. The latency
bottleneck is only 1 cycle (which Ryzen might come closer to).
Anyway, on Haswell (with -mtune=haswell), the function should be more than 1.5x
slower with gcc7/8 than with gcc6 and earlier.
Moreover, gcc should try to optimize something like this:

    if (data[c] >= 128)
        sum += data[c];

into conditionally zeroing a register instead of using a loop-carried cmov dep
chain.