[Bug target/81496] New: AVX load from adjacent memory location followed by concatenation
jakub at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Thu Jul 20 17:01:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81496
Bug ID: 81496
Summary: AVX load from adjacent memory location followed by
concatenation
Product: gcc
Version: 7.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jakub at gcc dot gnu.org
Target Milestone: ---
With -O2 -mavx{,2,512f}, we get on the following testcase:
typedef __int128 V __attribute__((vector_size (32)));
typedef long long W __attribute__((vector_size (32)));
typedef int X __attribute__((vector_size (16)));
typedef __int128 Y __attribute__((vector_size (64)));
typedef long long Z __attribute__((vector_size (64)));
W f1 (__int128 x, __int128 y) { return (W) ((V) { x, y }); }
W f2 (__int128 x, __int128 y) { return (W) ((V) { y, x }); }
movq %rdi, -16(%rsp)
movq %rsi, -8(%rsp)
movq %rdx, -32(%rsp)
movq %rcx, -24(%rsp)
vmovdqa -32(%rsp), %xmm0
vmovdqa -16(%rsp), %xmm1
vinserti128 $0x1, %xmm0, %ymm1, %ymm0
for f1, which I'm afraid is hard to do anything about, because the register
allocator didn't see any benefit in spilling in the other order, but for f2:
movq %rdx, -32(%rsp)
movq %rcx, -24(%rsp)
vmovdqa -32(%rsp), %xmm0
movq %rdi, -16(%rsp)
movq %rsi, -8(%rsp)
vinserti128 $0x1, -16(%rsp), %ymm0, %ymm0
Before scheduling, the vmovdqa is right next to the vinserti128 that loads from
the adjacent memory; in that case it might be a win to use a single
vmovdqa -32(%rsp), %ymm0 instead. Though, the MEM has just A128 (128-bit
alignment) in the RTL dump, so maybe we need to use vmovdqu instead, unless we
can prove it is 256-bit aligned (it is in this case, but not generally).
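For f2 the merged sequence might look something like this (just a sketch of the
suggested transformation, assuming the two 16-byte spill slots remain adjacent
and using vmovdqu because only 128-bit alignment is known from the MEM):

```asm
	movq	%rdi, -16(%rsp)          # x, high 128-bit half of the result
	movq	%rsi, -8(%rsp)
	movq	%rdx, -32(%rsp)          # y, low 128-bit half of the result
	movq	%rcx, -24(%rsp)
	vmovdqu	-32(%rsp), %ymm0         # one 256-bit load covering both halves
```

This replaces the vmovdqa/vinserti128 pair with one unaligned 256-bit load; if
the stack slot could be proven 256-bit aligned, vmovdqa would work as well.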