[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Wed Jul 26 14:09:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So after Jakub's update the vectorizer patch yields:
sumint:
.LFB0:
.cfi_startproc
vpxor %xmm0, %xmm0, %xmm0
leaq 4096(%rdi), %rax
.p2align 4,,10
.p2align 3
.L2:
vpaddd (%rdi), %ymm0, %ymm0
addq $32, %rdi
cmpq %rdi, %rax
jne .L2
vextracti128 $1, %ymm0, %xmm1
vpaddd %xmm0, %xmm1, %xmm0
vpsrldq $8, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vmovd %xmm0, %eax
vzeroupper
ret
That's not using the unpacking strategy (summing adjacent elements) but still the
vector-shift approach (adding the upper and lower halves). That's something that
can be changed independently.
Waiting for the final vec_extract/init2 optab interface to settle.