82855 – AVX512: replace OP+movemask with OP_mask+ktest

Bug 82855 - AVX512: replace OP+movemask with OP_mask+ktest

Summary: AVX512: replace OP+movemask with OP_mask+ktest

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	7.2.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Jakub Jelinek

URL:
Keywords:	missed-optimization

Duplicates (1):	85833 (view as bug list)
Depends on:
Blocks:

Reported:	2017-11-05 22:29 UTC by Daniel Fruzynski
Modified:	2018-05-21 10:25 UTC (History)
CC List:	0 users

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2017-11-06 00:00:00

Attachments
gcc8-pr82855-1.patch (2.56 KB, patch) 2017-11-06 12:55 UTC, Jakub Jelinek	Details \| Diff
gcc8-pr82855-2.patch (441 bytes, patch) 2017-11-06 12:56 UTC, Jakub Jelinek	Details \| Diff
gcc8-pr82855-1.patch (915 bytes, patch) 2017-11-06 13:00 UTC, Jakub Jelinek	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Daniel Fruzynski 2017-11-05 22:29:32 UTC

Before AVX512, AVX/SSE code was written as in test1 function below: some operation(s) created mask in vector register, it was then converted to int with movemask instruction, and resulting int value was used in another expression - e.g. compared with some constant. AVX512 added new k1..k7 registers and set of instructions with _mask suffix which writes to them instead of creating mask in vector register. So test2 function is simple attempt to rewrite test1 with new instructions:

#include "immintrin.h"

bool test1(void* ptr)
{
  __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
  v = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
  return 0 == _mm256_movemask_epi8(v);
}

bool test2(void* ptr)
{
  __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
  __mmask8 m = _mm256_cmpeq_epi32_mask(v, _mm256_setzero_si256());
  return 0 == m;
}

I have tried to compile this using Compiler Explorer at https://godbolt.org/ with with following options:
-O3 -mavx -ftree-vectorize -mbmi -mpopcnt -mbmi2 -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq

gcc 7.2 and gcc trunk created following code:
test1(void*):
  vmovdqu8 xmm0, XMMWORD PTR [rdi]
  vinserti128 ymm1, ymm0, XMMWORD PTR [rdi+16], 0x1
  vpxord xmm0, xmm0, xmm0
  vpcmpeqd ymm0, ymm0, ymm1
  vpmovmskb eax, ymm0
  test eax, eax
  sete al
  ret
test2(void*):
  vmovdqu8 xmm0, XMMWORD PTR [rdi]
  vpxord xmm1, xmm1, xmm1
  vinserti128 ymm0, ymm0, XMMWORD PTR [rdi+16], 0x1
  vpcmpeqd k1, ymm0, ymm1
  kmovb eax, k1
  test al, al
  sete al
  ret

clang 5.0.0 created this:

test1(void*): # @test1(void*)
  vpxor ymm0, ymm0, ymm0
  vpcmpeqd k0, ymm0, ymmword ptr [rdi]
  vpmovm2d ymm0, k0
  vpmovmskb eax, ymm0
  test eax, eax
  sete al
  vzeroupper
  ret
test2(void*): # @test2(void*)
  vpxor ymm0, ymm0, ymm0
  vpcmpeqd k0, ymm0, ymmword ptr [rdi]
  ktestb k0, k0
  sete al
  vzeroupper
  ret

gcc output does not look very optimal. clang output for test2 is better, it uses ktestb instead of kmovb+test. gcc should be able to do this too.

There is also one more possible optimization which can be applied for test1: automatically replace OP and movemask instruction pair with OP_mask instruction. Something like this is already performed for FMA3, gcc is able to replace mul/add instruction pair with one FMA instruction. I do not have access to any machine with AVX512 so I cannot perform any benchmarks. However this kind of optimization looks promising, so it is worth exploring.

Comment 1 Jakub Jelinek 2017-11-06 12:55:40 UTC

Created attachment 42548 [details]
gcc8-pr82855-1.patch

Untested patch, part 1.

Comment 2 Jakub Jelinek 2017-11-06 12:56:45 UTC

Created attachment 42549 [details]
gcc8-pr82855-2.patch

Untested patch, part 2.

Comment 3 Jakub Jelinek 2017-11-06 13:00:41 UTC

Created attachment 42550 [details]
gcc8-pr82855-1.patch

Untested patch, part 3.  With these 3, I get for the second function
with -O2 -mavx512{vl,dq} -mtune=intel:
        vpxord  %xmm0, %xmm0, %xmm0
        xorl    %eax, %eax
        vpcmpeqd        (%rdi), %ymm0, %k1
        ktestb  %k1, %k1
        sete    %al
With the generic tuning, the load is performed separately from the comparison.

Comment 4 Jakub Jelinek 2017-11-07 20:49:07 UTC

Author: jakub
Date: Tue Nov  7 20:48:35 2017
New Revision: 254509

URL: https://gcc.gnu.org/viewcvs?rev=254509&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/i386.c (ix86_swap_binary_operands_p): Treat
	RTX_COMM_COMPARE as commutative as well.
	(ix86_binary_operator_ok): Formatting fix.
	* config/i386/sse.md (*mul<mode>3<mask_name><round_name>,
	*<code><mode>3<mask_name><round_saeonly_name>,
	*<code><mode>3<mask_name>, *<code>tf3, *mul<mode>3<mask_name>,
	*<s>mul<mode>3_highpart<mask_name>,
	*vec_widen_umult_even_v16si<mask_name>,
	*vec_widen_umult_even_v8si<mask_name>,
	*vec_widen_umult_even_v4si<mask_name>,
	*vec_widen_smult_even_v16si<mask_name>,
	*vec_widen_smult_even_v8si<mask_name>, *sse4_1_mulv2siv2di3<mask_name>,
	*avx2_pmaddwd, *sse2_pmaddwd, *<sse4_1_avx2>_mul<mode>3<mask_name>,
	*avx2_<code><mode>3, *avx512f_<code><mode>3<mask_name>,
	*sse4_1_<code><mode>3<mask_name>, *<code>v8hi3,
	*sse4_1_<code><mode>3<mask_name>, *<code>v16qi3, *avx2_eq<mode>3,
	<avx512>_eq<mode>3<mask_scalar_merge_name>_1, *sse4_1_eqv2di3,
	*sse2_eq<mode>3, <mask_codefor><code><mode>3<mask_name>,
	*<code><mode>3, *<sse2_avx2>_uavg<mode>3<mask_name>,
	*<ssse3_avx2>_pmulhrsw<mode>3<mask_name>, *ssse3_pmulhrswv4hi3): Use
	!(MEM_P (operands[1]) && MEM_P (operands[2])) condition instead of
	ix86_binary_operator_ok.  Formatting fixes.
	(*<plusminus_insn><mode>3<mask_name><round_name>,
	*<plusminus_insn><mode>3, *<plusminus_insn><mode>3_m): Formatting
	fixes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sse.md

Comment 5 Jakub Jelinek 2017-11-07 20:50:02 UTC

Author: jakub
Date: Tue Nov  7 20:49:30 2017
New Revision: 254510

URL: https://gcc.gnu.org/viewcvs?rev=254510&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/i386.md (SWI1248_AVX512BWDQ2_64): New mode iterator.
	(*cmp<mode>_ccz_1): New insn with $k alternative.

	* gcc.target/i386/avx512dq-pr82855.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512dq-pr82855.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog

Comment 6 Jakub Jelinek 2017-11-08 20:16:14 UTC

Author: jakub
Date: Wed Nov  8 20:15:42 2017
New Revision: 254552

URL: https://gcc.gnu.org/viewcvs?rev=254552&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/sse.md (<avx512>_eq<mode>3<mask_scalar_merge_name>,
	<avx512>_eq<mode>3<mask_scalar_merge_name>_1): Use
	nonimmediate_operand predicate for operand 1 instead of
	register_operand.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md

Comment 7 Jakub Jelinek 2017-12-15 09:27:38 UTC

Fixed.

Comment 8 Jakub Jelinek 2018-05-21 10:25:33 UTC

*** Bug 85833 has been marked as a duplicate of this bug. ***