[Bug target/82139] New: unnecessary movapd with _mm_castsi128_pd to use BLENDPD on __m128i results

Fri Sep 8 05:48:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82139

            Bug ID: 82139
           Summary: unnecessary movapd with _mm_castsi128_pd to use
                    BLENDPD on __m128i results
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
#include <stdint.h>
// stripped down from a real function that did something more useful
void foo(uint64_t blocks[]) {
    for (int i = 0 ; i<10240 ; i+=2) {
        __m128i v = _mm_loadu_si128((__m128i*)&blocks[i]);
        __m128i t1 = _mm_add_epi32(v, _mm_set1_epi32(1));
        __m128i t2 = _mm_add_epi32(v, _mm_set1_epi32(-1));
        __m128d blend = _mm_blend_pd(_mm_castsi128_pd(t1),
                             _mm_castsi128_pd(t2), 2);
          // is this even aliasing-safe?  Could cast back to __m128i
        _mm_storeu_pd((double*)(__m128d*)&blocks[i], blend);
    }
}

https://godbolt.org/g/im1kcc for source and gcc-trunk asm output (and the
slightly larger version of this function that I simplified).
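
For reference, a scalar sketch of what foo() computes. With immediate 2,
_mm_blend_pd takes the low 64-bit element from its first operand (t1) and the
high element from its second (t2). The names add_epi32_lane and foo_scalar,
and the explicit length parameter n, are made up here for illustration and are
not part of the original test case.

```c
#include <stdint.h>

/* Add delta to each 32-bit half of a 64-bit value independently,
   like one 64-bit lane of PADDD (no carry between the halves). */
static uint64_t add_epi32_lane(uint64_t x, uint32_t delta) {
    uint32_t lo = (uint32_t)x + delta;
    uint32_t hi = (uint32_t)(x >> 32) + delta;
    return ((uint64_t)hi << 32) | lo;
}

/* Scalar model of foo(); n is assumed to be even. */
void foo_scalar(uint64_t blocks[], int n) {
    for (int i = 0; i < n; i += 2) {
        /* low element of the blend comes from t1 = v + 1 */
        blocks[i] = add_epi32_lane(blocks[i], 1u);
        /* high element of the blend comes from t2 = v - 1 */
        blocks[i + 1] = add_epi32_lane(blocks[i + 1], UINT32_MAX);
    }
}
```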

blendpd/blendps have better throughput than pblendw on Intel CPUs, so I tried
using blendpd for the integer blend in the function I was looking at.
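
Using BLENDPD on integer data needs the _mm_castsi128_pd casts, and the comment
in foo() mentions casting back to __m128i for the store. A minimal sketch of
that variant follows. The name foo_cast_back and the length parameter n are
hypothetical, and whether this changes the extra MOVAPD is not checked here.

```c
#include <immintrin.h>
#include <stdint.h>

/* target("sse4.1") is a GCC/Clang extension, used here only so the sketch
   builds without -msse4 on the command line. */
__attribute__((target("sse4.1")))
void foo_cast_back(uint64_t blocks[], int n) {   /* n is assumed to be even */
    for (int i = 0; i < n; i += 2) {
        __m128i v  = _mm_loadu_si128((const __m128i*)&blocks[i]);
        __m128i t1 = _mm_add_epi32(v, _mm_set1_epi32(1));
        __m128i t2 = _mm_add_epi32(v, _mm_set1_epi32(-1));
        __m128d blend = _mm_blend_pd(_mm_castsi128_pd(t1),
                                     _mm_castsi128_pd(t2), 2);
        /* cast back and use the integer store instead of _mm_storeu_pd */
        _mm_storeu_si128((__m128i*)&blocks[i], _mm_castpd_si128(blend));
    }
}
```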

gcc4.8 and later waste a MOVAPD for no reason instead of clobbering one of the
PADDD results with the blend.  (The larger version of this function,
pairs_u64_sse2 in the godbolt link, avoids the extra MOVAPD with gcc4.9.4 and
earlier, but foo() does not.  So maybe it's just by chance, or maybe 4.8 changed
something.)  Anyway, it's still present in 7.2 and 8.0-trunk, with -O2 or -O3.

(GCC-Explorer-Build) 8.0.0 20170907 -xc -std=gnu99 -O3 -Wall -msse4 -mno-avx

foo:
        pcmpeqd %xmm2, %xmm2
        leaq    81920(%rdi), %rax
        movdqa  .LC0(%rip), %xmm3
.L6:
        movdqa  %xmm3, %xmm1
        addq    $16, %rdi
        movdqu  -16(%rdi), %xmm0
        paddd   %xmm0, %xmm1
        movapd  %xmm1, %xmm4
        paddd   %xmm2, %xmm0
        blendpd $2, %xmm0, %xmm4
        movups  %xmm4, -16(%rdi)
        cmpq    %rdi, %rax
        jne     .L6
        rep ret

Notice that BLENDPD's operands aren't the two output registers from the PADDD
instructions.  Different versions/options (like -mtune=skylake) put the extra
MOVAPD between the PADDD instructions, or right before BLENDPD, so don't let it
fool you. :P

With an even simpler function (e.g. only one _mm_add_epi32), blending between
the original load result and the add result didn't appear to produce an extra
MOVAPD.