[hjl@gnu-efi-2 prpr87317]$ cat x.c
#define MAX 4

long long int dst[MAX];
int src[MAX];

void
foo (void)
{
  int i;
  for (i = 0; i < MAX; i++)
    dst[i] = src[i];
}
[hjl@gnu-efi-2 prpr87317]$ gcc -S -O3 -march=haswell x.c
[hjl@gnu-efi-2 prpr87317]$ cat x.s
        .file   "x.c"
        .text
        .p2align 4,,15
        .globl  foo
        .type   foo, @function
foo:
.LFB0:
        .cfi_startproc
        movslq  src(%rip), %rax
        movslq  src+8(%rip), %rcx
        movslq  src+12(%rip), %rdx
        vmovq   %rax, %xmm0
        movslq  src+4(%rip), %rax
        vmovq   %rcx, %xmm1
        vpinsrq $1, %rdx, %xmm1, %xmm1
        vpinsrq $1, %rax, %xmm0, %xmm0
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, dst(%rip)
        vzeroupper
        ret
        .cfi_endproc
.LFE0:
        .size   foo, .-foo
        .comm   src,16,16
        .comm   dst,32,32
        .ident  "GCC: (GNU) 8.2.1 20181011 (Red Hat 8.2.1-4)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-efi-2 prpr87317]$
Works for me on aarch64:
        ldr     q0, [x1]
        sshll   v1.2d, v0.2s, 0
        sshll2  v0.2d, v0.4s, 0
        str     q1, [x0]
        str     q0, [x0, 16]

So it has to be a target issue.
[hjl@gnu-efi-2 pr87317]$ cat y.c
#define MAX 4

long long int dst[MAX];
short src[MAX];

void
foo (void)
{
  int i;
  for (i = 0; i < MAX; i++)
    dst[i] = src[i];
}
[hjl@gnu-efi-2 pr87317]$ /export/ssd/build/tools-build/glibc-many/install/compilers/aarch64-linux-gnu/bin/aarch64-glibc-linux-gnu-gcc -S -O3 y.c
[hjl@gnu-efi-2 pr87317]$ cat y.s
        .arch armv8-a
        .file   "y.c"
        .text
        .align  2
        .p2align 3,,7
        .global foo
        .type   foo, %function
foo:
.LFB0:
        .cfi_startproc
        adrp    x3, src
        add     x1, x3, :lo12:src
        adrp    x2, dst
        add     x0, x2, :lo12:dst
        ldrsh   x5, [x3, #:lo12:src]
        ldrsh   x4, [x1, 2]
        ldrsh   x3, [x1, 4]
        ldrsh   x1, [x1, 6]
        str     x5, [x2, #:lo12:dst]
        stp     x4, x3, [x0, 8]
        str     x1, [x0, 24]
        ret
        .cfi_endproc
.LFE0:
        .size   foo, .-foo
        .comm   src,8,8
        .comm   dst,32,8
        .ident  "GCC: (GNU) 8.2.1 20180922"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-efi-2 pr87317]$ gcc -march=haswell -S -O3 y.c
[hjl@gnu-efi-2 pr87317]$ cat y.s
        .file   "y.c"
        .text
        .p2align 4,,15
        .globl  foo
        .type   foo, @function
foo:
.LFB0:
        .cfi_startproc
        movswq  src(%rip), %rax
        movswq  src+4(%rip), %rcx
        movswq  src+6(%rip), %rdx
        vmovq   %rax, %xmm0
        movswq  src+2(%rip), %rax
        vmovq   %rcx, %xmm1
        vpinsrq $1, %rdx, %xmm1, %xmm1
        vpinsrq $1, %rax, %xmm0, %xmm0
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, dst(%rip)
        vzeroupper
        ret
        .cfi_endproc
.LFE0:
        .size   foo, .-foo
        .comm   src,8,8
        .comm   dst,32,32
        .ident  "GCC: (GNU) 8.2.1 20181011 (Red Hat 8.2.1-4)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-efi-2 pr87317]$

I don't see much difference between x86-64 and arm64.
Hmm, it was working in GCC 7.3.x.
Even for x86_64:
        vmovdqa src(%rip), %xmm0
        vpmovsxdq       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        vpmovsxdq       %xmm0, %xmm0
        vmovaps %xmm1, dst(%rip)
        vmovaps %xmm0, 16+dst(%rip)
        ret
(In reply to Andrew Pinski from comment #4)
> Even for x86_64:
>         vmovdqa src(%rip), %xmm0
>         vpmovsxdq       %xmm0, %xmm1
>         vpsrldq $8, %xmm0, %xmm0
>         vpmovsxdq       %xmm0, %xmm0
>         vmovaps %xmm1, dst(%rip)
>         vmovaps %xmm0, 16+dst(%rip)
>         ret

Only when AVX2 is disabled.
(In reply to H.J. Lu from comment #5)
> (In reply to Andrew Pinski from comment #4)
> > Even for x86_64:
> >         vmovdqa src(%rip), %xmm0
> >         vpmovsxdq       %xmm0, %xmm1
> >         vpsrldq $8, %xmm0, %xmm0
> >         vpmovsxdq       %xmm0, %xmm0
> >         vmovaps %xmm1, dst(%rip)
> >         vmovaps %xmm0, 16+dst(%rip)
> >         ret
>
> Only when AVX2 is disabled.

I mean when YMM is disabled.
Confirmed. It's a cost-model issue. With GCC 7 the vectorization with AVX256 was not profitable so AVX128 was chosen:

t.c:12:1: note: Final SLP tree for instance:
t.c:12:1: note: node
t.c:12:1: note:         stmt 0 dst[0] = _11;
t.c:12:1: note:         stmt 1 dst[1] = _17;
t.c:12:1: note:         stmt 2 dst[2] = _23;
t.c:12:1: note:         stmt 3 dst[3] = _29;
t.c:12:1: note: node (external)
t.c:12:1: note:         stmt 0 _11 = (long long int) _10;
t.c:12:1: note:         stmt 1 _17 = (long long int) _16;
t.c:12:1: note:         stmt 2 _23 = (long long int) _22;
t.c:12:1: note:         stmt 3 _29 = (long long int) _28;
t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 2
  Vector prologue cost: 3
  Vector epilogue cost: 0
  Scalar cost of basic block: 4
t.c:12:1: note: not vectorized: vectorization is not profitable.
t.c:12:1: note: ***** Re-trying analysis with vector size 16

but with GCC 8 we now say

t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 20
  Vector prologue cost: 28
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.c:12:1: note: Basic block will be vectorized using SLP
t.c:12:1: note: SLPing BB part

costs on trunk are the same (the above is for generic, for haswell the vector cost is even lower, 12). So we end up with

  <bb 2> [local count: 214748369]:
  _10 = src[0];
  _11 = (long long int) _10;
  _16 = src[1];
  _17 = (long long int) _16;
  _22 = src[2];
  _23 = (long long int) _22;
  _28 = src[3];
  _29 = (long long int) _28;
  _13 = {_11, _17, _23, _29};
  vect_cst__19 = _13;
  MEM[(long long int *)&dst] = vect_cst__19;

note this just costs the vector construction + vector store against the four scalar stores.

Note with my patches to consider both vector sizes this wouldn't be handled either since I didn't update them to work for BB vectorization (and they are not on trunk yet anyways).

It would be an apples to oranges comparison anyways since the scalar cost differs (the SLP tree is different for AVX128).
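To make the two AVX256 decisions above concrete, here is a minimal sketch of the profitability comparison. It is not GCC's code: the struct and function names are made up for illustration, and the rule (reject only when the summed vector cost exceeds the scalar cost, so a tie is still vectorized) is an assumption chosen because it is consistent with both dumps (2 + 3 > 4 is rejected, 20 + 28 == 48 is accepted).

```c
#include <stdbool.h>

/* The four numbers printed under "Cost model analysis" in the dump.
   Hypothetical type, for illustration only. */
struct bb_cost {
    int vec_inside;    /* Vector inside of basic block cost */
    int vec_prologue;  /* Vector prologue cost */
    int vec_epilogue;  /* Vector epilogue cost */
    int scalar;        /* Scalar cost of basic block */
};

/* Assumed rule: vectorize unless the total vector cost is strictly
   greater than the scalar cost (equal cost still vectorizes). */
static bool slp_profitable(struct bb_cost c)
{
    int vec_total = c.vec_inside + c.vec_prologue + c.vec_epilogue;
    return vec_total <= c.scalar;
}

/* Numbers copied from the two AVX256 dumps quoted above. */
static const struct bb_cost gcc7_avx256 = { 2, 3, 0, 4 };
static const struct bb_cost gcc8_avx256 = { 20, 28, 0, 48 };
```

Under that assumption, slp_profitable(gcc7_avx256) is false, which is why GCC 7 falls through to the vector size 16 retry, while slp_profitable(gcc8_avx256) is true, so GCC 8 never gets to try AVX128.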
Anyways, costing for AVX128 is

t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 44
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 96

(haswell). So if you scale the vector cost by 0.5 because the scalar cost is doubled you end up at 22 which would compare favorably to 12 + 28 == 40.
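The scaling argument can be written out as a small sketch. The helper name and the enum constants are hypothetical. The AVX256 scalar cost of 48 for haswell is an assumption inferred from the remark that the AVX128 scalar cost (96) is doubled; the other numbers are the ones quoted in the thread.

```c
/* Haswell costs quoted in the thread.  AVX256_SCALAR is inferred
   (96 / 2), not printed in a haswell dump here. */
enum {
    AVX256_VEC_INSIDE   = 12,
    AVX256_VEC_PROLOGUE = 28,
    AVX256_SCALAR       = 48,
    AVX128_VEC_TOTAL    = 44,   /* 44 inside + 0 prologue + 0 epilogue */
    AVX128_SCALAR       = 96
};

/* Re-express a vector cost against a different scalar baseline so that
   costs from two analyses with different scalar costs become comparable. */
static int rescale_cost(int vec_cost, int own_scalar, int ref_scalar)
{
    return vec_cost * ref_scalar / own_scalar;
}

/* AVX128 vector cost expressed on the AVX256 scalar baseline. */
static int avx128_cost_on_avx256_scale(void)
{
    return rescale_cost(AVX128_VEC_TOTAL, AVX128_SCALAR, AVX256_SCALAR);
}

/* Total AVX256 vector cost: inside plus prologue. */
static int avx256_cost(void)
{
    return AVX256_VEC_INSIDE + AVX256_VEC_PROLOGUE;
}
```

With these inputs the rescaled AVX128 cost is 44 * 48 / 96 = 22, against 12 + 28 = 40 for AVX256, so a cost model that compared both vector sizes on a common scale would prefer AVX128 for this block.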
This has been fixed in GCC 12. Sunil, please submit a GCC patch to add a testcase.