This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 26 Jul 2017 14:09:31 +0000
- Subject: [Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
- Auto-submitted: auto-generated
- References: <bug-80846-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So after Jakub's update, the vectorizer patch yields
sumint:
.LFB0:
	.cfi_startproc
	vpxor	%xmm0, %xmm0, %xmm0
	leaq	4096(%rdi), %rax
	.p2align 4,,10
	.p2align 3
.L2:
	vpaddd	(%rdi), %ymm0, %ymm0
	addq	$32, %rdi
	cmpq	%rdi, %rax
	jne	.L2
	vextracti128	$1, %ymm0, %xmm1
	vpaddd	%xmm0, %xmm1, %xmm0
	vpsrldq	$8, %xmm0, %xmm1
	vpaddd	%xmm1, %xmm0, %xmm0
	vpsrldq	$4, %xmm0, %xmm1
	vpaddd	%xmm1, %xmm0, %xmm0
	vmovd	%xmm0, %eax
	vzeroupper
	ret
That's not using the unpacking strategy (sum adjacent elements) but still the
vector shift approach (add upper/lower halves). That's something that can be
changed independently.
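
For readers following along, here is a rough scalar model in C of the two
reduction orders mentioned above. It is only a sketch: the function names are
made up for illustration, v[0..7] stands for the eight 32-bit lanes of %ymm0
(lane 0 lowest), and the "adjacent elements" variant is one reading of the
unpacking strategy, not the instruction sequence GCC would actually emit.

```c
#include <stdint.h>

/* Hypothetical model of the epilogue in the listing above
   (vector shift approach: add upper/lower halves). */
int32_t hsum_shift_model(const int32_t v[8])
{
    uint32_t x[4], t[4];
    int i;

    /* vextracti128 $1 + vpaddd: add the upper 128-bit half to the lower. */
    for (i = 0; i < 4; i++)
        x[i] = (uint32_t)v[i] + (uint32_t)v[i + 4];

    /* vpsrldq $8 + vpaddd: shift right by two lanes (zeros shifted in). */
    t[0] = x[2]; t[1] = x[3]; t[2] = 0; t[3] = 0;
    for (i = 0; i < 4; i++)
        x[i] += t[i];

    /* vpsrldq $4 + vpaddd: shift right by one lane (zero shifted in). */
    t[0] = x[1]; t[1] = x[2]; t[2] = x[3]; t[3] = 0;
    for (i = 0; i < 4; i++)
        x[i] += t[i];

    /* vmovd: only lane 0 holds the full sum. Unsigned arithmetic models
       the wraparound of vpaddd. */
    return (int32_t)x[0];
}

/* Hypothetical model of summing adjacent elements: each step adds
   neighbouring pairs and halves the number of live lanes. */
int32_t hsum_adjacent_model(const int32_t v[8])
{
    uint32_t x[8];
    int i, n;

    for (i = 0; i < 8; i++)
        x[i] = (uint32_t)v[i];

    for (n = 8; n > 1; n /= 2)
        for (i = 0; i < n / 2; i++)
            x[i] = x[2 * i] + x[2 * i + 1];

    return (int32_t)x[0];
}
```

Both orders give the same result for integer addition (for lanes 1..8 either
function returns 36); they differ only in which shuffles the target needs.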
Waiting for the final vec_extract/init2 optabs to settle.