[Bug target/81496] AVX load from adjacent memory location followed by concatenation
jakub at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Thu Jul 20 17:09:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81496
--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Maybe even better would be to emit vmovq %r1, %xmm0; vpinsrq $1, %r2, %xmm0;
vpinsrq $2, %r3, %ymm0; vpinsrq $3, %r4, %ymm0; but not sure how to achieve
that.
For another testcase:
typedef long long W __attribute__((vector_size (32)));
W f1 (long long x, long long y, long long z, long long w) { return (W) { x, y, z, w }; }
W f2 (long long x, long long y, long long z, long long w) { return (W) { w, z, y, x }; }
we emit with -O3 -mavx2 -mtune=intel:
vmovq %rsi, %xmm2
vmovq %rcx, %xmm3
vpinsrq $1, %rdi, %xmm2, %xmm1
vpinsrq $1, %rdx, %xmm3, %xmm0
vinserti128 $0x1, %xmm1, %ymm0, %ymm0
and here again I wonder whether vmovq + 3x vpinsrq wouldn't be better.
In that case, handling this in i386.c's ix86_expand_vector_init or its
helpers would be possible.  I guess it should be benchmarked on various CPUs.