Before AVX512, AVX/SSE code was written as in the test1 function below: some operation creates a mask in a vector register, the mask is converted to an int with a movemask instruction, and the resulting int value is then used in another expression, e.g. compared with some constant. AVX512 added the new k0..k7 mask registers and a set of intrinsics with a _mask suffix which write to them instead of creating the mask in a vector register. So the test2 function is a simple attempt to rewrite test1 with the new intrinsics:

#include "immintrin.h"

bool test1(void* ptr) {
    __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
    v = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
    return 0 == _mm256_movemask_epi8(v);
}

bool test2(void* ptr) {
    __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
    __mmask8 m = _mm256_cmpeq_epi32_mask(v, _mm256_setzero_si256());
    return 0 == m;
}

I compiled this with Compiler Explorer at https://godbolt.org/ using the following options:

-O3 -mavx -ftree-vectorize -mbmi -mpopcnt -mbmi2 -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq

gcc 7.2 and gcc trunk generate the following code:

test1(void*):
        vmovdqu8        xmm0, XMMWORD PTR [rdi]
        vinserti128     ymm1, ymm0, XMMWORD PTR [rdi+16], 0x1
        vpxord          xmm0, xmm0, xmm0
        vpcmpeqd        ymm0, ymm0, ymm1
        vpmovmskb       eax, ymm0
        test            eax, eax
        sete            al
        ret
test2(void*):
        vmovdqu8        xmm0, XMMWORD PTR [rdi]
        vpxord          xmm1, xmm1, xmm1
        vinserti128     ymm0, ymm0, XMMWORD PTR [rdi+16], 0x1
        vpcmpeqd        k1, ymm0, ymm1
        kmovb           eax, k1
        test            al, al
        sete            al
        ret

clang 5.0.0 generates this:

test1(void*):                          # @test1(void*)
        vpxor           ymm0, ymm0, ymm0
        vpcmpeqd        k0, ymm0, ymmword ptr [rdi]
        vpmovm2d        ymm0, k0
        vpmovmskb       eax, ymm0
        test            eax, eax
        sete            al
        vzeroupper
        ret
test2(void*):                          # @test2(void*)
        vpxor           ymm0, ymm0, ymm0
        vpcmpeqd        k0, ymm0, ymmword ptr [rdi]
        ktestb          k0, k0
        sete            al
        vzeroupper
        ret

The gcc output does not look optimal. The clang output for test2 is better: it uses ktestb instead of kmovb+test. gcc should be able to do this too.
There is also one more possible optimization which can be applied to test1: automatically replacing an OP + movemask instruction pair with the corresponding OP_mask instruction. Something like this is already done for FMA3: gcc is able to replace a mul/add instruction pair with a single FMA instruction. I do not have access to any machine with AVX512, so I cannot run any benchmarks. However, this kind of optimization looks promising, so it is worth exploring.
Created attachment 42548 [details]
gcc8-pr82855-1.patch

Untested patch, part 1.
Created attachment 42549 [details]
gcc8-pr82855-2.patch

Untested patch, part 2.
Created attachment 42550 [details]
gcc8-pr82855-1.patch

Untested patch, part 3.

With these 3 patches, I get for the second function with -O2 -mavx512{vl,dq} -mtune=intel:

        vpxord          %xmm0, %xmm0, %xmm0
        xorl            %eax, %eax
        vpcmpeqd        (%rdi), %ymm0, %k1
        ktestb          %k1, %k1
        sete            %al

With the generic tuning, the load is performed separately from the comparison.
Author: jakub
Date: Tue Nov  7 20:48:35 2017
New Revision: 254509

URL: https://gcc.gnu.org/viewcvs?rev=254509&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/i386.c (ix86_swap_binary_operands_p): Treat
	RTX_COMM_COMPARE as commutative as well.
	(ix86_binary_operator_ok): Formatting fix.
	* config/i386/sse.md (*mul<mode>3<mask_name><round_name>,
	*<code><mode>3<mask_name><round_saeonly_name>,
	*<code><mode>3<mask_name>, *<code>tf3, *mul<mode>3<mask_name>,
	*<s>mul<mode>3_highpart<mask_name>,
	*vec_widen_umult_even_v16si<mask_name>,
	*vec_widen_umult_even_v8si<mask_name>,
	*vec_widen_umult_even_v4si<mask_name>,
	*vec_widen_smult_even_v16si<mask_name>,
	*vec_widen_smult_even_v8si<mask_name>,
	*sse4_1_mulv2siv2di3<mask_name>, *avx2_pmaddwd, *sse2_pmaddwd,
	*<sse4_1_avx2>_mul<mode>3<mask_name>, *avx2_<code><mode>3,
	*avx512f_<code><mode>3<mask_name>, *sse4_1_<code><mode>3<mask_name>,
	*<code>v8hi3, *sse4_1_<code><mode>3<mask_name>, *<code>v16qi3,
	*avx2_eq<mode>3, <avx512>_eq<mode>3<mask_scalar_merge_name>_1,
	*sse4_1_eqv2di3, *sse2_eq<mode>3,
	<mask_codefor><code><mode>3<mask_name>, *<code><mode>3,
	*<sse2_avx2>_uavg<mode>3<mask_name>,
	*<ssse3_avx2>_pmulhrsw<mode>3<mask_name>, *ssse3_pmulhrswv4hi3): Use
	!(MEM_P (operands[1]) && MEM_P (operands[2])) condition instead of
	ix86_binary_operator_ok.  Formatting fixes.
	(*<plusminus_insn><mode>3<mask_name><round_name>,
	*<plusminus_insn><mode>3, *<plusminus_insn><mode>3_m): Formatting
	fixes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sse.md
Author: jakub
Date: Tue Nov  7 20:49:30 2017
New Revision: 254510

URL: https://gcc.gnu.org/viewcvs?rev=254510&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/i386.md (SWI1248_AVX512BWDQ2_64): New mode iterator.
	(*cmp<mode>_ccz_1): New insn with $k alternative.

	* gcc.target/i386/avx512dq-pr82855.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512dq-pr82855.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
Author: jakub
Date: Wed Nov  8 20:15:42 2017
New Revision: 254552

URL: https://gcc.gnu.org/viewcvs?rev=254552&root=gcc&view=rev
Log:
	PR target/82855
	* config/i386/sse.md (<avx512>_eq<mode>3<mask_scalar_merge_name>,
	<avx512>_eq<mode>3<mask_scalar_merge_name>_1): Use
	nonimmediate_operand predicate for operand 1 instead of
	register_operand.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
Fixed.
*** Bug 85833 has been marked as a duplicate of this bug. ***