[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Wed Jul 26 14:09:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So after Jakub's update the vectorizer patch yields:
sumint:
.LFB0:
.cfi_startproc
vpxor %xmm0, %xmm0, %xmm0
leaq 4096(%rdi), %rax
.p2align 4,,10
.p2align 3
.L2:
vpaddd (%rdi), %ymm0, %ymm0
addq $32, %rdi
cmpq %rdi, %rax
jne .L2
vextracti128 $1, %ymm0, %xmm1
vpaddd %xmm0, %xmm1, %xmm0
vpsrldq $8, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vmovd %xmm0, %eax
vzeroupper
ret
That's not using the unpacking strategy (summing adjacent elements) but still the
vector-shift approach (adding the upper and lower halves). That's something that
can be changed independently.
Waiting for the final vec_extract/init2 optab interface to settle.