This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/82139] New: unnecessary movapd with _mm_castsi128_pd to use BLENDPD on __m128i results
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 08 Sep 2017 05:47:52 +0000
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82139
Bug ID: 82139
Summary: unnecessary movapd with _mm_castsi128_pd to use
BLENDPD on __m128i results
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include <immintrin.h>
#include <stdint.h>
// stripped down from a real function that did something more useful
void foo(uint64_t blocks[]) {
    for (int i = 0 ; i<10240 ; i+=2) {
        __m128i v = _mm_loadu_si128((__m128i*)&blocks[i]);
        __m128i t1 = _mm_add_epi32(v, _mm_set1_epi32(1));
        __m128i t2 = _mm_add_epi32(v, _mm_set1_epi32(-1));
        __m128d blend = _mm_blend_pd(_mm_castsi128_pd(t1),
                                     _mm_castsi128_pd(t2), 2);
        // is this even aliasing-safe?  Could cast back to __m128i
        _mm_storeu_pd((double*)(__m128d*)&blocks[i], blend);
    }
}
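On the aliasing question in the comment above: one way to sidestep it is to cast the blend result back to __m128i and store through an integer-vector pointer. The following is only a sketch of that idea. The name foo_si128_store, the length parameter n, and the target attribute are assumptions added for illustration and are not part of the reported test case.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of foo(): same arithmetic, but the blend result is
   cast back to __m128i so the store uses _mm_storeu_si128 instead of
   _mm_storeu_pd.  The target attribute only lets this compile without
   -msse4.1 on GCC/Clang; drop it when building with -msse4. */
__attribute__((target("sse4.1")))
void foo_si128_store(uint64_t blocks[], size_t n) {
    assert(n % 2 == 0);   /* two 64-bit elements are processed per iteration */
    for (size_t i = 0; i < n; i += 2) {
        __m128i v  = _mm_loadu_si128((const __m128i *)&blocks[i]);
        __m128i t1 = _mm_add_epi32(v, _mm_set1_epi32(1));
        __m128i t2 = _mm_add_epi32(v, _mm_set1_epi32(-1));
        /* imm8 = 2: low 64 bits from t1, high 64 bits from t2 */
        __m128d blend = _mm_blend_pd(_mm_castsi128_pd(t1),
                                     _mm_castsi128_pd(t2), 2);
        _mm_storeu_si128((__m128i *)&blocks[i], _mm_castpd_si128(blend));
    }
}
```

The casts are free at the instruction level, so this should not change which blend instruction is used; whether it changes the extra MOVAPD has not been checked here.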
https://godbolt.org/g/im1kcc for source and gcc-trunk asm output (and the
slightly larger version of this function that I simplified).
blendpd/blendps have better throughput than pblendw on Intel CPUs, so I played
with that in this function I was looking at.
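For comparison, here is a rough sketch of the same 64-bit-granularity blend written both ways: with PBLENDW via _mm_blend_epi16, and with BLENDPD via the cast intrinsics. The helper names and the target attribute are assumptions for illustration only. The immediates are chosen so both take the low 64 bits from the first operand and the high 64 bits from the second.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: integer-domain blend (PBLENDW).
   Mask 0xF0 takes 16-bit words 4..7 (the high 64 bits) from b. */
__attribute__((target("sse4.1")))
__m128i blend_hi64_pblendw(__m128i a, __m128i b) {
    return _mm_blend_epi16(a, b, 0xF0);
}

/* Hypothetical helper: the same selection expressed with BLENDPD.
   Mask 2 takes 64-bit element 1 (the high 64 bits) from b. */
__attribute__((target("sse4.1")))
__m128i blend_hi64_blendpd(__m128i a, __m128i b) {
    __m128d r = _mm_blend_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 2);
    return _mm_castpd_si128(r);
}
```

Both helpers should return the same bits for any inputs; the difference is only which instruction the compiler is asked to emit.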
gcc4.8 and later waste a MOVAPD for no reason instead of clobbering one of the
PADDD results with the blend. (The larger version of this function,
pairs_u64_sse2 in the godbolt link, avoids the extra MOVAPD with gcc4.9.4 and
earlier, but not in foo(). So maybe it's just by chance, or maybe 4.8 changed
something.) Anyway, it's still present in 7.2 and 8.0-trunk, with -O2 or -O3.
(GCC-Explorer-Build) 8.0.0 20170907 -xc -std=gnu99 -O3 -Wall -msse4 -mno-avx
foo:
        pcmpeqd %xmm2, %xmm2
        leaq    81920(%rdi), %rax
        movdqa  .LC0(%rip), %xmm3
.L6:
        movdqa  %xmm3, %xmm1
        addq    $16, %rdi
        movdqu  -16(%rdi), %xmm0
        paddd   %xmm0, %xmm1
        movapd  %xmm1, %xmm4
        paddd   %xmm2, %xmm0
        blendpd $2, %xmm0, %xmm4
        movups  %xmm4, -16(%rdi)
        cmpq    %rdi, %rax
        jne     .L6
        rep ret
Notice that BLENDPD's operands aren't the two output registers from the PADDD
instructions. Different versions/options (like -mtune=skylake) put the extra
MOVAPD between the PADDD instructions, or right before BLENDPD, so don't let it
fool you. :P
With an even simpler function (like only one _mm_add_epi32), blending between
the original load result and the add result didn't appear to produce an extra
MOVAPD.
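The simpler variant is not shown in the report, so the following is only a guess at its shape: one _mm_add_epi32, with the blend taking the low 64 bits from the loaded vector and the high 64 bits from the add result. The name foo_one_add, the length parameter n, the operand order of the blend, and the target attribute are all assumptions.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the "only one add" case.
   The target attribute only lets this compile without -msse4.1. */
__attribute__((target("sse4.1")))
void foo_one_add(uint64_t blocks[], size_t n) {
    assert(n % 2 == 0);   /* two 64-bit elements are processed per iteration */
    for (size_t i = 0; i < n; i += 2) {
        __m128i v  = _mm_loadu_si128((const __m128i *)&blocks[i]);
        __m128i t1 = _mm_add_epi32(v, _mm_set1_epi32(1));
        /* imm8 = 2: low 64 bits from v (unchanged), high 64 bits from t1 */
        __m128d blend = _mm_blend_pd(_mm_castsi128_pd(v),
                                     _mm_castsi128_pd(t1), 2);
        _mm_storeu_pd((double *)&blocks[i], blend);
    }
}
```

Comparing the generated asm for this against foo() with the same options is one way to see whether the extra register copy depends on having two PADDD results live at the blend.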