[Bug target/81602] Unnecessary zero-extension after 16 bit popcnt

Sun Jul 30 20:48:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602

--- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Christoph Diegelmann from comment #0)
> GCC misses an optimization on this:
> 
>  #include <cstdint>
>  #include "immintrin.h"
> 
>  void test(std::uint16_t* mask, std::uint16_t* data) {
>  for (int i = 0; i < 1024; ++i) {
>  *data = 0;
>  unsigned tmp = *mask++;
>  unsigned step = _mm_popcnt_u32(tmp);
>  data += step;
>  }
>  }
> 
> g++ -O3 -Wall -std=c++14 -march=skylake generates:
> 
>  test(unsigned short*, unsigned short*):
>  leaq 2048(%rdi), %rdx
>  .L2:
>  xorl %eax, %eax
>  addq $2, %rdi
>  movw %ax, (%rsi)
>  popcntw -2(%rdi), %ax
>  movzwl %ax, %eax
>  leaq (%rsi,%rax,2), %rsi
>  cmpq %rdx, %rdi
>  jne .L2
>  ret
> 
> The rax register is known to be zero at the time of `popcntw -2(%rdi), %ax`.
> Anyway gcc still clears the upper bits using `movzwl %ax, %eax` afterwards.

The "xorl %eax, %eax; movw %ax, (%rsi)" pair is just an optimized way to
implement "movw $0, (%rsi)". It just happens that the peephole pass found %eax
unused and picked it as a scratch register when splitting the direct move of
an immediate to memory. The zero in %eax is therefore incidental.

> While clang uses 32 bit popcnt and `movzwl (%rdi,%rax,2), %ecx` it correctly
> recognises that there's no need to clear the upper bits.
> 
> clang -O3 -Wall -std=c++14 -march=skylake -fno-unroll-loops generates:
> 
>  test(unsigned short*, unsigned short*): 
>  xorl %eax, %eax
>  .LBB0_1: 
>  movw $0, (%rsi)
>  movzwl (%rdi,%rax,2), %ecx
>  popcntl %ecx, %ecx
>  leaq (%rsi,%rcx,2), %rsi
>  addq $1, %rax
>  cmpl $1024, %eax # imm = 0x400
>  jne .LBB0_1
>  retq

popcntl has a false dependency on its output register in certain situations,
whereas popcntw doesn't have this limitation. So gcc chose this approach for a
reason.
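
For reference, both code sequences compute the same value. A minimal sketch of
why, assuming GCC or Clang (for the __builtin_popcount builtin): the count of a
zero-extended 16-bit value can never exceed 16, so it makes no difference at
the source level whether the count is taken on 16 or 32 bits. The helper names
below are hypothetical and are not part of the test case.

```cpp
#include <cassert>
#include <cstdint>

// Reference 16-bit population count, one bit at a time (hypothetical helper).
inline unsigned popcount16_reference(std::uint16_t v) {
    unsigned count = 0;
    for (; v != 0; v >>= 1)
        count += v & 1u;   // add the lowest bit, then shift it out
    return count;
}

// Count via the 32-bit builtin on the zero-extended value, mirroring
// `unsigned tmp = *mask++; _mm_popcnt_u32(tmp);` from the test case.
inline unsigned popcount16_widened(std::uint16_t v) {
    unsigned tmp = v;  // zero-extension to 32 bits
    unsigned n = static_cast<unsigned>(__builtin_popcount(tmp));
    assert(n <= 16u);  // only 16 bits can be set, so the count fits in 16 bits
    return n;
}

// Returns true if both variants agree for every possible 16-bit input.
inline bool widened_popcount_matches() {
    for (unsigned x = 0; x <= 0xFFFFu; ++x) {
        auto v = static_cast<std::uint16_t>(x);
        if (popcount16_widened(v) != popcount16_reference(v))
            return false;
    }
    return true;
}
```

This only covers the value being computed. Which instruction width is faster
is a separate, microarchitecture-specific question that the sketch does not
measure.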