[Bug target/63791] use 32-byte version of vpbroadcastb (and register to populate) on AVX/AVX2 platforms
marcus.kool at urlfilterdb dot com
gcc-bugzilla@gcc.gnu.org
Fri May 1 13:51:00 GMT 2015
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791
Marcus Kool <marcus.kool at urlfilterdb dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Summary|use 32-byte version of |use 32-byte version of
|vpbroadcastb on AVX2 |vpbroadcastb (and register
|platform |to populate) on AVX/AVX2
| |platforms
Known to fail| |4.8.4, 4.9.2, 5.1.0
Severity|minor |normal
--- Comment #2 from Marcus Kool <marcus.kool at urlfilterdb dot com> ---
Following Jakub's comment I waited for the release of gcc 5.1.0, but the
performance of programs that use *_set1_epi8() got 6% worse: gcc 5.1.0 now uses
vpbroadcastb in the intended way, but it populates the ymm register from a
slow memory operand (a stack slot) instead of from a register.
This is what 5.1.0 generates:
movl %edi, -20(%rbp)
vpbroadcastb -20(%rbp), %ymm0
while this is optimal:
vmovd %edi, %xmm0
vpbroadcastb %xmm0, %ymm0
On the AVX platform (see attachment avx.c) gcc 5.1.0 likewise goes through
memory and uses many instructions to populate an xmm register:
movl %edi, -12(%rsp)
vpxor %xmm1, %xmm1, %xmm1
vmovd -12(%rsp), %xmm0
xorl %eax, %eax
vpshufb %xmm1, %xmm0, %xmm0
where
vmovd %edi, %xmm0
vpbroadcastb %xmm0, %xmm0
is optimal.
To summarize, gcc 4.8.4 and gcc 4.9.2 produce code that can be optimised
further, and gcc 5.1.0 produces even slower code, which means that the
implementation of *_set1_epi8() is slower, or much slower, than it could be.
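For reference, a reduced test case for the 32-byte broadcast might look like
the sketch below. It is not the attachment from this report: the names fill32,
fill32_avx2 and fill32_scalar are hypothetical, and it assumes an x86 target
with a compiler that supports the target attribute and __builtin_cpu_supports
(gcc 4.9 or later, or clang). Compiling with -O2 -S and looking at fill32_avx2
should show which instruction sequence is emitted for _mm256_set1_epi8().

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: broadcast one byte into a 32-byte vector and store it.
 * The target attribute enables AVX2 for this function only, so the file
 * does not need to be compiled with -mavx2. */
__attribute__((target("avx2")))
static void fill32_avx2(uint8_t *out, char c)
{
    __m256i v = _mm256_set1_epi8(c);          /* the intrinsic under discussion */
    _mm256_storeu_si256((__m256i *)out, v);   /* unaligned 32-byte store */
}

/* Portable fallback for CPUs without AVX2. */
static void fill32_scalar(uint8_t *out, char c)
{
    memset(out, (unsigned char)c, 32);
}

/* Hypothetical dispatcher: use the AVX2 path only when the CPU supports it. */
void fill32(uint8_t *out, char c)
{
    if (__builtin_cpu_supports("avx2"))
        fill32_avx2(out, c);
    else
        fill32_scalar(out, c);
}
```

The runtime check is only there so the sketch runs on any x86 machine; for
inspecting the generated code, fill32_avx2 alone is the interesting part.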