Bug 89028 - 8-byte loop isn't vectorized
Summary: 8-byte loop isn't vectorized
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 9.0
Importance: P3 normal
Target Milestone: 10.0
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on: 89021
Blocks: vectorizer
Reported: 2019-01-24 02:58 UTC by H.J. Lu
Modified: 2021-08-03 02:59 UTC

See Also:
Target: x86_64-*-* i?86-*-*
Known to work:
Known to fail:
Last reconfirmed: 2019-01-24 00:00:00


Description H.J. Lu 2019-01-24 02:58:47 UTC
[hjl@gnu-skx-1 v64-2]$ cat y.i
void
rsqrt(char* restrict r, char* restrict a){
    for (int i = 0; i < 8; i++){
        r[i] += a[i];
    }
}
[hjl@gnu-skx-1 v64-2]$ gcc -S -O2 y.i
[hjl@gnu-skx-1 v64-2]$ cat y.s
	.file	"y.i"
	.p2align 4,,15
	.globl	rsqrt
	.type	rsqrt, @function
rsqrt:
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L2:
	movzbl	(%rsi,%rax), %edx
	addb	%dl, (%rdi,%rax)
	addq	$1, %rax
	cmpq	$8, %rax
	jne	.L2
	ret
	.size	rsqrt, .-rsqrt
	.ident	"GCC: (GNU) 8.2.1 20190109 (Red Hat 8.2.1-7)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 v64-2]$
Comment 1 Richard Biener 2019-01-24 09:28:38 UTC
Of course we do not vectorize at -O2.  At -O3, the issue is that the target doesn't advertise word_mode as a vector size to use, and the vectorizer doesn't support
vectorization using half of a vector.

If you do

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 268010)
+++ gcc/config/i386/i386.c      (working copy)
@@ -50153,6 +50153,11 @@ ix86_autovectorize_vector_sizes (vector_
       sizes->safe_push (32);
       sizes->safe_push (16);
     }
+  else
+    {
+      sizes->safe_push (16);
+      sizes->safe_push (8);
+    }
 }
 
 /* Implemenation of targetm.vectorize.get_mask_mode.  */

you get vectorization using DImode regs:

        movabsq $9187201950435737471, %rdx
        movq    (%rdi), %rax
        movq    (%rsi), %rsi
        movq    %rdx, %rcx
        andq    %rax, %rcx
        andq    %rsi, %rdx
        xorq    %rsi, %rax
        addq    %rcx, %rdx
        movabsq $-9187201950435737472, %rcx
        andq    %rcx, %rax
        xorq    %rdx, %rax
        movq    %rax, (%rdi)

not exactly what you wanted I guess ;)  Anything else would require
vectorizer adjustments.
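
The DImode sequence above is a SWAR (SIMD within a register) byte-wise add: the two movabsq constants are 0x7f7f7f7f7f7f7f7f and 0x8080808080808080. The following is a minimal C sketch of the same arithmetic, written by hand for illustration and not taken from GCC. The names add8_scalar, add8_swar and check_add8 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: the byte-wise loop from the bug description. */
static void add8_scalar(char *restrict r, const char *restrict a)
{
    for (int i = 0; i < 8; i++)
        r[i] += a[i];
}

/* Eight byte adds in one 64-bit register, mirroring the and/xor/add
   sequence in the assembly above (illustrative name). */
static void add8_swar(char *restrict r, const char *restrict a)
{
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL; /* 9187201950435737471 */
    const uint64_t high = 0x8080808080808080ULL; /* -9187201950435737472 as signed */
    uint64_t x, y;

    memcpy(&x, r, 8);                 /* movq (%rdi), %rax */
    memcpy(&y, a, 8);                 /* movq (%rsi), %rsi */

    /* Add the low 7 bits of every byte; with bit 7 masked off, a carry
       can reach bit 7 of its own byte but never the next byte. */
    uint64_t sum = (x & low7) + (y & low7);

    /* Bit 7 of each result byte is x7 ^ y7 ^ carry-in; the carry-in is
       already sitting in bit 7 of sum. */
    sum ^= (x ^ y) & high;

    memcpy(r, &sum, 8);               /* movq %rax, (%rdi) */
}

/* Run both versions on the same input; aborts if they disagree. */
static void check_add8(const char *r0, const char *a)
{
    char r1[8], r2[8];

    memcpy(r1, r0, 8);
    memcpy(r2, r0, 8);
    add8_scalar(r1, a);
    add8_swar(r2, a);
    assert(memcmp(r1, r2, 8) == 0);
}
```

Calling check_add8 with inputs that overflow individual bytes (for example 127 + 1) exercises the carry masking, which is the part that makes the sequence longer than a plain 64-bit add.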
Comment 2 H.J. Lu 2019-01-25 12:38:47 UTC
I am working on a patch to generate:

[hjl@gnu-hsw-1 pr89028]$ cat x.i
void
foo (char* restrict r, char* restrict a){
    for (int i = 0; i < 8; i++){
        r[i] += a[i];
    }
}
[hjl@gnu-hsw-1 pr89028]$ make x.s
/export/build/gnu/tools-build/gcc-mmx/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-mmx/build-x86_64-linux/gcc/ -O3  -S x.i
[hjl@gnu-hsw-1 pr89028]$ cat x.s
	.file	"x.i"
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
	movq	(%rdi), %xmm0
	movq	(%rsi), %xmm1
	paddb	%xmm1, %xmm0
	movq	%xmm0, (%rdi)
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 9.0.1 20190124 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-hsw-1 pr89028]$
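
For comparison, the movq/paddb/movq sequence above corresponds roughly to the following hand-written SSE2 intrinsics. This is only a sketch, assuming an x86 target with SSE2 available. It is not the patch itself, and the names add8_scalar and add8_sse2 are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Reference: the byte-wise loop from the test case. */
static void add8_scalar(char *restrict r, const char *restrict a)
{
    for (int i = 0; i < 8; i++)
        r[i] += a[i];
}

/* Load 8 bytes into the low half of an XMM register, add byte-wise,
   and store the low 8 bytes back (illustrative name). */
static void add8_sse2(char *restrict r, const char *restrict a)
{
    __m128i x = _mm_loadl_epi64((const __m128i *)r); /* movq (%rdi), %xmm0 */
    __m128i y = _mm_loadl_epi64((const __m128i *)a); /* movq (%rsi), %xmm1 */

    x = _mm_add_epi8(x, y);                          /* paddb %xmm1, %xmm0 */
    _mm_storel_epi64((__m128i *)r, x);               /* movq %xmm0, (%rdi) */
}

/* Run both versions on the same input; aborts if they disagree. */
static void check_add8_sse2(const char *r0, const char *a)
{
    char r1[8], r2[8];

    memcpy(r1, r0, 8);
    memcpy(r2, r0, 8);
    add8_scalar(r1, a);
    add8_sse2(r2, a);
    assert(memcmp(r1, r2, 8) == 0);
}
```

Only the low 64 bits of each register carry data here, which is the "half of a vector" case that Comment 1 says the vectorizer did not support at the time.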
Comment 3 Andrew Pinski 2021-08-03 02:59:50 UTC
Fixed in GCC 10 by r10-1361.