Bug 87743

Summary: Vectorizer doesn't support conversion of different sizes
Product: gcc
Component: target
Version: 9.0
Status: NEW
Severity: normal
Priority: P3
Target Milestone: ---
Reporter: H.J. Lu <hjl.tools>
Assignee: Not yet assigned to anyone <unassigned>
CC: crazylht, rguenth, skpgkp2
Keywords: missed-optimization
Host:
Target: x86_64-*-*
Build:
Known to work: 7.3.0
Known to fail:
Last reconfirmed: 2018-10-25 00:00:00
Bug Depends on:
Bug Blocks: 53947

Description H.J. Lu 2018-10-25 05:51:04 UTC
[hjl@gnu-efi-2 prpr87317]$ cat x.c 
#define MAX 4

long long int dst[MAX];
int src[MAX];

void
foo (void)
{
  int i;
  for (i = 0; i < MAX; i++)
    dst[i] = src[i];
}
[hjl@gnu-efi-2 prpr87317]$ gcc -S  -O3 -march=haswell x.c
[hjl@gnu-efi-2 prpr87317]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	movslq	src(%rip), %rax
	movslq	src+8(%rip), %rcx
	movslq	src+12(%rip), %rdx
	vmovq	%rax, %xmm0
	movslq	src+4(%rip), %rax
	vmovq	%rcx, %xmm1
	vpinsrq	$1, %rdx, %xmm1, %xmm1
	vpinsrq	$1, %rax, %xmm0, %xmm0
	vinserti128	$0x1, %xmm1, %ymm0, %ymm0
	vmovdqu	%ymm0, dst(%rip)
	vzeroupper
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.comm	src,16,16
	.comm	dst,32,32
	.ident	"GCC: (GNU) 8.2.1 20181011 (Red Hat 8.2.1-4)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-efi-2 prpr87317]$
Comment 1 Andrew Pinski 2018-10-25 06:00:24 UTC
Works for me on aarch64:
        ldr     q0, [x1]
        sshll   v1.2d, v0.2s, 0
        sshll2  v0.2d, v0.4s, 0
        str     q1, [x0]
        str     q0, [x0, 16]

So it has to be a target issue.
Comment 2 H.J. Lu 2018-10-25 06:09:38 UTC
[hjl@gnu-efi-2 pr87317]$ cat y.c
#define MAX 4

long long int dst[MAX];
short src[MAX];

void
foo (void)
{
  int i;
  for (i = 0; i < MAX; i++)
    dst[i] = src[i];
}
[hjl@gnu-efi-2 pr87317]$ /export/ssd/build/tools-build/glibc-many/install/compilers/aarch64-linux-gnu/bin/aarch64-glibc-linux-gnu-gcc -S -O3 y.c 
[hjl@gnu-efi-2 pr87317]$ cat y.s
	.arch armv8-a
	.file	"y.c"
	.text
	.align	2
	.p2align 3,,7
	.global	foo
	.type	foo, %function
foo:
.LFB0:
	.cfi_startproc
	adrp	x3, src
	add	x1, x3, :lo12:src
	adrp	x2, dst
	add	x0, x2, :lo12:dst
	ldrsh	x5, [x3, #:lo12:src]
	ldrsh	x4, [x1, 2]
	ldrsh	x3, [x1, 4]
	ldrsh	x1, [x1, 6]
	str	x5, [x2, #:lo12:dst]
	stp	x4, x3, [x0, 8]
	str	x1, [x0, 24]
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.comm	src,8,8
	.comm	dst,32,8
	.ident	"GCC: (GNU) 8.2.1 20180922"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-efi-2 pr87317]$ gcc -march=haswell -S -O3 y.c 
[hjl@gnu-efi-2 pr87317]$ cat y.s
	.file	"y.c"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	movswq	src(%rip), %rax
	movswq	src+4(%rip), %rcx
	movswq	src+6(%rip), %rdx
	vmovq	%rax, %xmm0
	movswq	src+2(%rip), %rax
	vmovq	%rcx, %xmm1
	vpinsrq	$1, %rdx, %xmm1, %xmm1
	vpinsrq	$1, %rax, %xmm0, %xmm0
	vinserti128	$0x1, %xmm1, %ymm0, %ymm0
	vmovdqu	%ymm0, dst(%rip)
	vzeroupper
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.comm	src,8,8
	.comm	dst,32,32
	.ident	"GCC: (GNU) 8.2.1 20181011 (Red Hat 8.2.1-4)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-efi-2 pr87317]$ 

I don't see much difference between x86-64 and arm64.
Comment 3 Andrew Pinski 2018-10-25 06:11:32 UTC
Hmm, it was working in GCC 7.3.x.
Comment 4 Andrew Pinski 2018-10-25 06:12:55 UTC
Even for x86_64:
        vmovdqa src(%rip), %xmm0
        vpmovsxdq       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        vpmovsxdq       %xmm0, %xmm0
        vmovaps %xmm1, dst(%rip)
        vmovaps %xmm0, 16+dst(%rip)
        ret
Comment 5 H.J. Lu 2018-10-25 06:14:37 UTC
(In reply to Andrew Pinski from comment #4)
> Even for x86_64:
>         vmovdqa src(%rip), %xmm0
>         vpmovsxdq       %xmm0, %xmm1
>         vpsrldq $8, %xmm0, %xmm0
>         vpmovsxdq       %xmm0, %xmm0
>         vmovaps %xmm1, dst(%rip)
>         vmovaps %xmm0, 16+dst(%rip)
>         ret

Only when AVX2 is disabled.
Comment 6 H.J. Lu 2018-10-25 06:15:43 UTC
(In reply to H.J. Lu from comment #5)
> (In reply to Andrew Pinski from comment #4)
> > Even for x86_64:
> >         vmovdqa src(%rip), %xmm0
> >         vpmovsxdq       %xmm0, %xmm1
> >         vpsrldq $8, %xmm0, %xmm0
> >         vpmovsxdq       %xmm0, %xmm0
> >         vmovaps %xmm1, dst(%rip)
> >         vmovaps %xmm0, 16+dst(%rip)
> >         ret
> 
> Only when AVX2 is disabled.

I mean when YMM is disabled.
Comment 7 Richard Biener 2018-10-25 08:56:10 UTC
Confirmed.  It's a cost-model issue.  With GCC 7 the vectorization with AVX256 was not profitable so AVX128 was chosen:

t.c:12:1: note: Final SLP tree for instance:
t.c:12:1: note: node
t.c:12:1: note:         stmt 0 dst[0] = _11;
t.c:12:1: note:         stmt 1 dst[1] = _17;
t.c:12:1: note:         stmt 2 dst[2] = _23;
t.c:12:1: note:         stmt 3 dst[3] = _29;
t.c:12:1: note: node (external)
t.c:12:1: note:         stmt 0 _11 = (long long int) _10;
t.c:12:1: note:         stmt 1 _17 = (long long int) _16;
t.c:12:1: note:         stmt 2 _23 = (long long int) _22;
t.c:12:1: note:         stmt 3 _29 = (long long int) _28;
t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 2
  Vector prologue cost: 3
  Vector epilogue cost: 0
  Scalar cost of basic block: 4
t.c:12:1: note: not vectorized: vectorization is not profitable.
t.c:12:1: note: ***** Re-trying analysis with vector size 16

but with GCC 8 we now say

t.c:12:1: note: Cost model analysis:
  Vector inside of basic block cost: 20
  Vector prologue cost: 28
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.c:12:1: note: Basic block will be vectorized using SLP
t.c:12:1: note: SLPing BB part

Costs on trunk are the same (the above is for generic tuning; for haswell
the vector inside cost is even lower, 12).
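
To make the two dumps easier to compare, here is a minimal sketch of the profitability check as it can be read off the numbers above. The struct and function names are hypothetical, and the rule that a tie goes to the vector side is an assumption inferred from the GCC 8 dump (20 + 28 == 48 still vectorizes), not a quote of the vectorizer source.

```c
/* Costs as printed in a "Cost model analysis" dump.
   Hypothetical type, for illustration only.  */
struct bb_cost
{
  int inside;    /* Vector inside of basic block cost */
  int prologue;  /* Vector prologue cost */
  int epilogue;  /* Vector epilogue cost */
  int scalar;    /* Scalar cost of basic block */
};

/* Simplified model: vectorize unless the total vector cost exceeds the
   scalar cost.  Treating equal cost as profitable is an assumption
   inferred from the GCC 8 dump.  */
int
bb_vectorize_p (struct bb_cost c)
{
  return c.inside + c.prologue + c.epilogue <= c.scalar;
}

/* Numbers copied from the two dumps above.  */
const struct bb_cost gcc7_avx256 = { 2, 3, 0, 4 };
const struct bb_cost gcc8_avx256 = { 20, 28, 0, 48 };
```

With these inputs bb_vectorize_p rejects the GCC 7 costs (5 against 4) and accepts the GCC 8 costs (48 against 48), which matches the "not profitable" and "will be vectorized" lines in the dumps.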

So we end up with

  <bb 2> [local count: 214748369]:
  _10 = src[0];
  _11 = (long long int) _10;
  _16 = src[1];
  _17 = (long long int) _16;
  _22 = src[2];
  _23 = (long long int) _22;
  _28 = src[3];
  _29 = (long long int) _28;
  _13 = {_11, _17, _23, _29};
  vect_cst__19 = _13;
  MEM[(long long int *)&dst] = vect_cst__19;

Note that this only costs the vector construction plus the vector store
against the four scalar stores.
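
In C terms, the GIMPLE above has roughly the following shape. This is only an illustrative sketch using the GCC generic vector extension; the function name foo_slp_shape and the v4di typedef are made up for this example, and it is not what the vectorizer literally emits.

```c
#include <string.h>

#define MAX 4

long long int dst[MAX];
int src[MAX];

/* GCC/Clang generic vector extension: four 64-bit lanes, 32 bytes.  */
typedef long long v4di __attribute__ ((vector_size (32)));

/* Hypothetical function mirroring the shape of the GIMPLE:
   scalar conversions, a vector constructor, one vector store.  */
void
foo_slp_shape (void)
{
  /* Four scalar loads, each sign-extended on its own
     (the "external" node of the SLP tree).  */
  long long e0 = src[0];
  long long e1 = src[1];
  long long e2 = src[2];
  long long e3 = src[3];

  /* Vector built from the four scalars, like _13 = {_11, _17, _23, _29}.  */
  v4di v = { e0, e1, e2, e3 };

  /* Single 32-byte store to dst; memcpy avoids assuming any alignment.  */
  memcpy (dst, &v, sizeof v);
}
```

Only the last step is a vector operation here, which is why the conversions themselves contribute nothing to the vector side of the cost comparison.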

Note that my patches to consider both vector sizes wouldn't handle this
either, since I didn't update them to work for BB vectorization (and they
are not on trunk yet anyway).  It would be an apples-to-oranges comparison
in any case, since the scalar cost differs (the SLP tree is different for
AVX128).  For reference, the costing for AVX128 is

t.c:12:1: note:  Cost model analysis:
  Vector inside of basic block cost: 44
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 96

(haswell).  So if you scale the vector cost by 0.5 because the scalar
cost is doubled, you end up at 22, which would compare favorably to
12 + 28 == 40.
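
The scaling argument can be written out as a small sketch. The function names are hypothetical, and taking 48 as the haswell scalar cost for the AVX256 attempt is an assumption based on the "scalar cost is doubled" remark and the generic dump, since that number is not shown separately for haswell.

```c
/* Rescale a vector cost so that it is relative to a reference scalar cost.
   Hypothetical helper; integer arithmetic is exact for the numbers used.  */
int
scaled_vector_cost (int vector_cost, int scalar_cost, int ref_scalar_cost)
{
  return vector_cost * ref_scalar_cost / scalar_cost;
}

/* Total AVX256 vector cost on haswell: inside cost plus prologue cost.  */
int
avx256_total_cost (void)
{
  return 12 + 28;
}

/* Nonzero if the rescaled AVX128 cost beats the AVX256 total.
   The reference scalar cost of 48 is an assumption (see above).  */
int
avx128_preferred_p (void)
{
  return scaled_vector_cost (44, 96, 48) < avx256_total_cost ();
}
```

Under those assumptions scaled_vector_cost (44, 96, 48) gives 22 against an AVX256 total of 40, so a comparison that accounted for the differing scalar costs would pick AVX128.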
Comment 8 H.J. Lu 2021-07-21 14:59:05 UTC
This has been fixed in GCC 12.  Sunil, please submit a GCC patch to
add a testcase.