[Bug target/81496] AVX load from adjacent memory location followed by concatenation

jakub at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Thu Jul 20 17:09:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81496

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Maybe even better would be to emit vmovq %r1, %xmm0; vpinsrq $1, %r2, %xmm0;
vpinsrq $2, %r3, %ymm0; vpinsrq $3, %r4, %ymm0; but I'm not sure how to
achieve that.

For another testcase:
typedef long long W __attribute__((vector_size (32)));

W f1 (long long x, long long y, long long z, long long w) { return (W) { x, y, z, w }; }
W f2 (long long x, long long y, long long z, long long w) { return (W) { w, z, y, x }; }

we emit with -O3 -mavx2 -mtune=intel:
        vmovq   %rsi, %xmm2
        vmovq   %rcx, %xmm3
        vpinsrq $1, %rdi, %xmm2, %xmm1
        vpinsrq $1, %rdx, %xmm3, %xmm0
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
and here again I wonder whether vmovq + 3x vpinsrq wouldn't be better.
If so, this could be handled in i386.c ix86_expand_vector_init or its
helpers.  It should probably be benchmarked on various CPUs.
