This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/82667] New: SSE2 redundant pcmpgtd for sign-extension of values known to be >= 0
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Mon, 23 Oct 2017 01:29:57 +0000
- Subject: [Bug target/82667] New: SSE2 redundant pcmpgtd for sign-extension of values known to be >= 0
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82667
Bug ID: 82667
Summary: SSE2 redundant pcmpgtd for sign-extension of values
known to be >= 0
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
long long sumarray(const int *data)
{
data = (const int*)__builtin_assume_aligned(data, 64);
long long sum = 0;
for (int c=0 ; c<32768 ; c++)
sum += (data[c] >= 128 ? data[c] : 0);
return sum;
}
// Same function as pr 82666, see that for scalar cmov choices.
Same result with
if (data[c] >= 128)
sum += data[c];
https://godbolt.org/g/NwcPmh
gcc 8.0.0 20171022 -O3
movdqa .LC0(%rip), %xmm5 # set1(127)
leaq 131072(%rdi), %rax
pxor %xmm2, %xmm2 # accumulator
pxor %xmm4, %xmm4 # for Intel CPUs we should re-materialize
with pxor inside the loop instead instead of movdqa. But not AMD
.L2:
movdqa (%rdi), %xmm0
addq $16, %rdi
movdqa %xmm0, %xmm1
pcmpgtd %xmm5, %xmm1
pand %xmm1, %xmm0
# so far so good: we have conditionally zeroed xmm0
movdqa %xmm4, %xmm1
pcmpgtd %xmm0, %xmm1 # 0 > x to generate high-half for
sign-extension
movdqa %xmm0, %xmm3
punpckldq %xmm1, %xmm3 # unpack with compare result
punpckhdq %xmm1, %xmm0 # (instead of just zero)
paddq %xmm3, %xmm2
paddq %xmm0, %xmm2
cmpq %rdi, %rax
jne .L2
movdqa %xmm2, %xmm0
psrldq $8, %xmm0 # requires a wasted movdqa vs. pshufd or
movhlps
paddq %xmm2, %xmm0
movq %xmm0, %rax
ret
There are multiple inefficiencies that I pointed out in comments, but this bug
report is about doing sign extension when we can prove that simple zero
extension is sufficient. Negative numbers are impossible from (x>=128 ? x :
0).
Changing the source to do zero-extension but still a signed compare stops it
from auto-vectorizing.
int to_add = (data[c] >= 128 ? data[c] : 0);
unsigned tmp = to_add;
sum += (unsigned long long)tmp; // zero-extension
Making everything unsigned does zero-extension as expected, but if the
comparison is signed, it either fails to auto-vectorize or it still uses
sign-extension.
e.g. this auto-vectorizes with sign-extension, but if you change the constant
to -128, it won't auto-vectorize at all (because then sign and zero extension
are no longer equivalent).
int to_add = (data[c] >= 128 ? data[c] : 0);
unsigned tmp = to_add;
unsigned long long tmp_ull = tmp; // zero-extension
long long tmp_ll = tmp_ull;
sum += tmp_ll;