Bug 82855 - AVX512: replace OP+movemask with OP_mask+ktest
Summary: AVX512: replace OP+movemask with OP_mask+ktest
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 7.2.0
: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
: 85833 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-11-05 22:29 UTC by Daniel Fruzynski
Modified: 2018-05-21 10:25 UTC (History)
0 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-11-06 00:00:00


Attachments
gcc8-pr82855-1.patch (2.56 KB, patch)
2017-11-06 12:55 UTC, Jakub Jelinek
Details | Diff
gcc8-pr82855-2.patch (441 bytes, patch)
2017-11-06 12:56 UTC, Jakub Jelinek
Details | Diff
gcc8-pr82855-1.patch (915 bytes, patch)
2017-11-06 13:00 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Fruzynski 2017-11-05 22:29:32 UTC
Before AVX512, AVX/SSE code was written as in test1 function below: some operation(s) created mask in vector register, it was then converted to int with movemask instruction, and resulting int value was used in another expression - e.g. compared with some constant. AVX512 added new k1..k7 registers and set of instructions with _mask suffix which writes to them instead of creating mask in vector register. So test2 function is simple attempt to rewrite test1 with new instructions:

#include "immintrin.h"

bool test1(void* ptr)
{
  __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
  v = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
  return 0 == _mm256_movemask_epi8(v);
}

bool test2(void* ptr)
{
  __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
  __mmask8 m = _mm256_cmpeq_epi32_mask(v, _mm256_setzero_si256());
  return 0 == m;
}

I have tried to compile this using Compiler Explorer at https://godbolt.org/ with with following options:
-O3 -mavx -ftree-vectorize -mbmi -mpopcnt -mbmi2 -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq

gcc 7.2 and gcc trunk created following code:
test1(void*):
  vmovdqu8 xmm0, XMMWORD PTR [rdi]
  vinserti128 ymm1, ymm0, XMMWORD PTR [rdi+16], 0x1
  vpxord xmm0, xmm0, xmm0
  vpcmpeqd ymm0, ymm0, ymm1
  vpmovmskb eax, ymm0
  test eax, eax
  sete al
  ret
test2(void*):
  vmovdqu8 xmm0, XMMWORD PTR [rdi]
  vpxord xmm1, xmm1, xmm1
  vinserti128 ymm0, ymm0, XMMWORD PTR [rdi+16], 0x1
  vpcmpeqd k1, ymm0, ymm1
  kmovb eax, k1
  test al, al
  sete al
  ret

clang 5.0.0 created this:

test1(void*): # @test1(void*)
  vpxor ymm0, ymm0, ymm0
  vpcmpeqd k0, ymm0, ymmword ptr [rdi]
  vpmovm2d ymm0, k0
  vpmovmskb eax, ymm0
  test eax, eax
  sete al
  vzeroupper
  ret
test2(void*): # @test2(void*)
  vpxor ymm0, ymm0, ymm0
  vpcmpeqd k0, ymm0, ymmword ptr [rdi]
  ktestb k0, k0
  sete al
  vzeroupper
  ret

gcc output does not look very optimal. clang output for test2 is better, it uses ktestb instead of kmovb+test. gcc should be able to do this too.

There is also one more possible optimization which can be applied for test1: automatically replace OP and movemask instruction pair with OP_mask instruction. Something like this is already performed for FMA3, gcc is able to replace mul/add instruction pair with one FMA instruction. I do not have access to any machine with AVX512 so I cannot perform any benchmarks. However this kind of optimization looks promising, so it is worth exploring.
Comment 1 Jakub Jelinek 2017-11-06 12:55:40 UTC
Created attachment 42548 [details]
gcc8-pr82855-1.patch

Untested patch, part 1.
Comment 2 Jakub Jelinek 2017-11-06 12:56:45 UTC
Created attachment 42549 [details]
gcc8-pr82855-2.patch

Untested patch, part 2.
Comment 3 Jakub Jelinek 2017-11-06 13:00:41 UTC
Created attachment 42550 [details]
gcc8-pr82855-1.patch

Untested patch, part 3.  With these 3, I get for the second function
with -O2 -mavx512{vl,dq} -mtune=intel:
        vpxord  %xmm0, %xmm0, %xmm0
        xorl    %eax, %eax
        vpcmpeqd        (%rdi), %ymm0, %k1
        ktestb  %k1, %k1
        sete    %al
With the generic tuning, the load is performed separately from the comparison.
Comment 4 Jakub Jelinek 2017-11-07 20:49:07 UTC
Author: jakub
Date: Tue Nov  7 20:48:35 2017
New Revision: 254509

URL: https://gcc.gnu.org/viewcvs?rev=254509&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/i386.c (ix86_swap_binary_operands_p): Treat
	RTX_COMM_COMPARE as commutative as well.
	(ix86_binary_operator_ok): Formatting fix.
	* config/i386/sse.md (*mul<mode>3<mask_name><round_name>,
	*<code><mode>3<mask_name><round_saeonly_name>,
	*<code><mode>3<mask_name>, *<code>tf3, *mul<mode>3<mask_name>,
	*<s>mul<mode>3_highpart<mask_name>,
	*vec_widen_umult_even_v16si<mask_name>,
	*vec_widen_umult_even_v8si<mask_name>,
	*vec_widen_umult_even_v4si<mask_name>,
	*vec_widen_smult_even_v16si<mask_name>,
	*vec_widen_smult_even_v8si<mask_name>, *sse4_1_mulv2siv2di3<mask_name>,
	*avx2_pmaddwd, *sse2_pmaddwd, *<sse4_1_avx2>_mul<mode>3<mask_name>,
	*avx2_<code><mode>3, *avx512f_<code><mode>3<mask_name>,
	*sse4_1_<code><mode>3<mask_name>, *<code>v8hi3,
	*sse4_1_<code><mode>3<mask_name>, *<code>v16qi3, *avx2_eq<mode>3,
	<avx512>_eq<mode>3<mask_scalar_merge_name>_1, *sse4_1_eqv2di3,
	*sse2_eq<mode>3, <mask_codefor><code><mode>3<mask_name>,
	*<code><mode>3, *<sse2_avx2>_uavg<mode>3<mask_name>,
	*<ssse3_avx2>_pmulhrsw<mode>3<mask_name>, *ssse3_pmulhrswv4hi3): Use
	!(MEM_P (operands[1]) && MEM_P (operands[2])) condition instead of
	ix86_binary_operator_ok.  Formatting fixes.
	(*<plusminus_insn><mode>3<mask_name><round_name>,
	*<plusminus_insn><mode>3, *<plusminus_insn><mode>3_m): Formatting
	fixes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sse.md
Comment 5 Jakub Jelinek 2017-11-07 20:50:02 UTC
Author: jakub
Date: Tue Nov  7 20:49:30 2017
New Revision: 254510

URL: https://gcc.gnu.org/viewcvs?rev=254510&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/i386.md (SWI1248_AVX512BWDQ2_64): New mode iterator.
	(*cmp<mode>_ccz_1): New insn with $k alternative.

	* gcc.target/i386/avx512dq-pr82855.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512dq-pr82855.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
Comment 6 Jakub Jelinek 2017-11-08 20:16:14 UTC
Author: jakub
Date: Wed Nov  8 20:15:42 2017
New Revision: 254552

URL: https://gcc.gnu.org/viewcvs?rev=254552&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/sse.md (<avx512>_eq<mode>3<mask_scalar_merge_name>,
	<avx512>_eq<mode>3<mask_scalar_merge_name>_1): Use
	nonimmediate_operand predicate for operand 1 instead of
	register_operand.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
Comment 7 Jakub Jelinek 2017-12-15 09:27:38 UTC
Fixed.
Comment 8 Jakub Jelinek 2018-05-21 10:25:33 UTC
*** Bug 85833 has been marked as a duplicate of this bug. ***