[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

Thu Nov 23 17:35:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

Andrew Senkevich <andrew.n.senkevich at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew.n.senkevich at gmail dot com

--- Comment #2 from Andrew Senkevich <andrew.n.senkevich at gmail dot com> ---
Currently -mprefer-avx256 is the default for SKX and the vzeroupper addition
has been fixed. The generated code is now:

.L3:
        vpsrlw  $8, (%rsi,%rax,2), %ymm0
        vpsrlw  $8, 32(%rsi,%rax,2), %ymm1
        vpand   %ymm0, %ymm2, %ymm0
        vpand   %ymm1, %ymm2, %ymm1
        vpackuswb       %ymm1, %ymm0, %ymm0
        vpermq  $216, %ymm0, %ymm0
        vmovdqu8        %ymm0, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rax, %rdx
        jne     .L3

The vmovdqu8 store remains, but I cannot confirm that it is slower.
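
For reference, a scalar C sketch of what this loop appears to compute: each
16-bit element is shifted right by 8 (vpsrlw), masked (vpand), and narrowed to
a byte (vpackuswb, with vpermq $216 putting the quadwords back in order after
the in-lane pack). The function name pack_high8, its signature, and the
assumption that %ymm2 holds 0x00ff in every word are mine and are not taken
from the listing; this is not necessarily the original test case.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar equivalent: keep the high byte of each 16-bit element. */
void pack_high8(uint8_t *restrict dst, const uint16_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t w = (uint16_t)(src[i] >> 8); /* vpsrlw $8 */
        w &= 0x00ff;          /* vpand, assuming %ymm2 is a 0x00ff word mask */
        dst[i] = (uint8_t)w;  /* vpackuswb: w is already 0..255, no saturation */
    }
}
```

Under that assumption the mask does not change the result, because the logical
shift already leaves every word in the range 0..255.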