This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Wed, 26 Jul 2017 14:09:31 +0000
Subject: [Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
Auto-submitted: auto-generated
References: <bug-80846-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So after Jakubs update the vectorizer patch yields

sumint:
.LFB0:
        .cfi_startproc
        vpxor   %xmm0, %xmm0, %xmm0
        leaq    4096(%rdi), %rax
        .p2align 4,,10
        .p2align 3
.L2:
        vpaddd  (%rdi), %ymm0, %ymm0
        addq    $32, %rdi
        cmpq    %rdi, %rax
        jne     .L2
        vextracti128    $1, %ymm0, %xmm1
        vpaddd  %xmm0, %xmm1, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        vzeroupper
        ret

that's not using the unpacking strategy (sum adjacent elements) but still the
vector shift approach (add upper/lower halves).  That's sth that can be
changed independently.

Waiting for final vec_extract/init2 optab settling.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]