This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug target/63791] use 32-byte version of vpbroadcastb (and register to populate) on AVX/AVX2 platforms


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

Marcus Kool <marcus.kool at urlfilterdb dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|use 32-byte version of      |use 32-byte version of
                   |vpbroadcastb on AVX2        |vpbroadcastb (and register
                   |platform                    |to populate) on AVX/AVX2
                   |                            |platforms
      Known to fail|                            |4.8.4, 4.9.2, 5.1.0
           Severity|minor                       |normal

--- Comment #2 from Marcus Kool <marcus.kool at urlfilterdb dot com> ---
Following Jakub's comment I waited for the release of gcc 5.1.0, but the
performance of programs that use *_set1_epi8() got 6% worse: gcc 5.1.0 now
uses vpbroadcastb in the intended way, but it loads the value to broadcast
from slow memory (a stack slot) instead of from a register.

This is what 5.1.0 generates:

movl            %edi, -20(%rbp)
vpbroadcastb    -20(%rbp), %ymm0

while this is optimal:

vmovd           %edi, %xmm0
vpbroadcastb    %xmm0, %ymm0
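
For anyone who wants to look at the generated code themselves, here is a
minimal sketch of a test case. It is not the attachment from this bug; the
function names are hypothetical, and the target attribute plus the runtime
CPU check are my own additions so that the file builds without -mavx2 and
does not fault on a CPU without AVX2. Compiling it with "gcc -O2 -S" and
reading the body of broadcast_byte_avx2 should show how the ymm register
gets populated by a given gcc version.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: broadcast one byte to all 32 lanes of a ymm
 * register with _mm256_set1_epi8() and store the result to out[]. */
__attribute__((target("avx2")))
static void broadcast_byte_avx2(uint8_t out[32], char c)
{
    __m256i v = _mm256_set1_epi8(c);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Returns 1 if all 32 bytes equal c, 0 if any byte differs,
 * and -1 if the CPU running the test has no AVX2 support. */
static int set1_epi8_broadcast_ok(char c)
{
    uint8_t out[32];
    int i;

    if (!__builtin_cpu_supports("avx2"))
        return -1;

    broadcast_byte_avx2(out, c);
    for (i = 0; i < 32; i++)
        if (out[i] != (uint8_t)c)
            return 0;
    return 1;
}
```

The check function only verifies correctness of the broadcast; the
performance difference discussed here is visible only in the assembly.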

For the AVX platform (see attachment avx.c), gcc 5.1.0 likewise goes through
memory and uses many instructions to populate an xmm register:
        movl    %edi, -12(%rsp)
        vpxor   %xmm1, %xmm1, %xmm1
        vmovd   -12(%rsp), %xmm0
        xorl    %eax, %eax
        vpshufb %xmm1, %xmm0, %xmm0
where
        vmovd   %edi, %xmm0
        vpbroadcastb %xmm0, %xmm0
is optimal.
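
The vpxor/vpshufb pair in the gcc 5.1.0 output above is a byte broadcast done
as a shuffle with an all-zero mask: every destination byte selects source
byte 0. A minimal sketch of that idiom written with intrinsics follows. The
function names are hypothetical, and the target attribute and runtime check
are assumptions added so that the file builds without extra -m flags. Note
that the value is moved into the xmm register with _mm_cvtsi32_si128(), which
is a register-to-register movd and not a round trip through the stack.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: broadcast the low byte of c to all 16 lanes of
 * an xmm register using a shuffle with an all-zero control mask. */
__attribute__((target("ssse3")))
static void broadcast_byte_pshufb(uint8_t out[16], int c)
{
    __m128i v    = _mm_cvtsi32_si128(c);   /* movd: c into the low 32 bits */
    __m128i zero = _mm_setzero_si128();    /* pxor: all-zero shuffle mask  */

    v = _mm_shuffle_epi8(v, zero);         /* pshufb: every lane = byte 0  */
    _mm_storeu_si128((__m128i *)out, v);
}

/* Returns 1 if all 16 bytes equal the low byte of c, 0 if any byte
 * differs, and -1 if the CPU running the test has no SSSE3 support. */
static int pshufb_broadcast_ok(int c)
{
    uint8_t out[16];
    int i;

    if (!__builtin_cpu_supports("ssse3"))
        return -1;

    broadcast_byte_pshufb(out, c);
    for (i = 0; i < 16; i++)
        if (out[i] != (uint8_t)c)
            return 0;
    return 1;
}
```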

To summarize, gcc 4.8.4 and gcc 4.9.2 produce code that can be optimised
further, and gcc 5.1.0 produces even slower code, which means that the
implementation of *_set1_epi8() is slower (in some cases much slower) than it
needs to be.

