Bug 88464 - AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.2.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2018-12-12 13:46 UTC by Moritz Kreutzer
Modified: 2018-12-19 10:16 UTC (History)
5 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail: 9.0
Last reconfirmed: 2018-12-12 00:00:00


Attachments
gcc9-pr88464.patch (1.24 KB, patch)
2018-12-12 19:13 UTC, Jakub Jelinek
gcc9-pr88464-2.patch (4.15 KB, patch)
2018-12-14 17:28 UTC, Jakub Jelinek
gcc9-pr88464-3.patch (3.55 KB, patch)
2018-12-18 11:10 UTC, Jakub Jelinek

Description Moritz Kreutzer 2018-12-12 13:46:16 UTC
Hi,

I have the following simple loop which I want to compile for Skylake (AVX-512):

================================
#pragma GCC ivdep
for (int i = 0; i < n; ++i)
{
    if (b[off1[i]] < b[off2[i]])
        a[off1[i]] = b[off1[i]];
    else
        a[off2[i]] = b[off2[i]];
}
================================

Given AVX-512 masked scatter instructions and the absence of data conflicts ("ivdep"), vectorization should be possible along the lines of:
1. gather b[off1[i]] into zmm1
2. gather b[off2[i]] into zmm2
3. compare zmm1 and zmm2 with "<" and store result in mask1
4. compare zmm1 and zmm2 with ">=" and store result in mask2
5. scatter zmm1 to a[off1[i]] with mask1
6. scatter zmm2 to a[off2[i]] with mask2
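
The six steps above can be sketched as a scalar reference for one 8-lane block of doubles. This is only an illustration of the intended semantics (the function name and 8-lane assumption are mine, not GCC output):

```c
#include <assert.h>

/* Scalar model of steps 1-6 for one 8-lane block of doubles.
   Illustrative only; names are hypothetical. */
static void masked_scatter_block(double *a, const double *b,
                                 const int *off1, const int *off2)
{
    double zmm1[8], zmm2[8];
    unsigned mask1 = 0, mask2 = 0;

    for (int l = 0; l < 8; ++l) {          /* steps 1-2: gathers */
        zmm1[l] = b[off1[l]];
        zmm2[l] = b[off2[l]];
    }
    for (int l = 0; l < 8; ++l) {          /* steps 3-4: compares -> masks */
        if (zmm1[l] < zmm2[l]) mask1 |= 1u << l;
        else                   mask2 |= 1u << l;
    }
    for (int l = 0; l < 8; ++l) {          /* steps 5-6: masked scatters */
        if (mask1 & (1u << l)) a[off1[l]] = zmm1[l];
        if (mask2 & (1u << l)) a[off2[l]] = zmm2[l];
    }
}
```

Every lane stores through exactly one of the two scatters, which is why the two masks are complementary.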

However, GCC is not able to vectorize this loop (failing with "not vectorized: not suitable for scatter store"). I have tested this with the latest GCC trunk but the issue also occurs with all previous versions. If you want to have a look, here's a Godbolt example: https://godbolt.org/z/Is7Zml

I understand that this loop is not a trivial case for vectorization and AVX-512 hasn't been around for very long, so it's likely that it isn't fully supported yet. But still, I'm wondering:
1. Am I missing some flags or hints to GCC in order to vectorize this loop? (I can imagine something related to the cost model, etc.)
2. Or is GCC currently just not capable of vectorizing it?

If the answer is "2.":
3. Can we estimate the amount of work needed to support this?
4. Is there any plan on when this kind of pattern will be supported? 
5. If it's realistic for a non-GCC developer to look into this, is there anything I can do to help?


Many thanks in advance,
Moritz
Comment 1 Richard Biener 2018-12-12 14:01:28 UTC
Confirmed.  The issue here is the target doesn't advertise scatter:

3901              if (targetm.vectorize.builtin_scatter)
(gdb) n
3902                decl = targetm.vectorize.builtin_scatter (vectype, offtype, scale);
(gdb) 
3905          if (!decl)
(gdb) p decl
$5 = (tree) 0x0
(gdb) p debug_generic_expr (vectype)
vector(4) double
$6 = void
(gdb) p debug_generic_expr (offtype)
int
$7 = void
(gdb) p scale
$8 = 8

probably because of the "mixup" of a 4-byte index and an 8-byte store element type.

I'm not sure whether Jakub implemented widening of the index at all.

OTOH when using long for idx it doesn't help.  Using -march=knl we get
a little bit further but ultimately fail the same way.
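
The index/element-size mixup above can be sketched in scalar C: with 8-byte elements and 4-byte indices, each index has to be sign-extended to 64 bits before scaling, which is what a widened-index scatter would do per lane. A minimal illustration (the helper name is mine, not GCC code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: address computation for one scatter lane with a
   32-bit index and 8-byte (double) elements.  The index must be
   sign-extended to 64 bits first, then scaled by the element size. */
static double *scatter_addr(double *base, int32_t idx)
{
    return base + (int64_t)idx;   /* pointer arithmetic scales by 8 */
}
```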
Comment 2 Jakub Jelinek 2018-12-12 14:06:35 UTC
I haven't been involved with the scatters, only gathers.  So I guess the first step would be to try something similar with gathers only, see whether we vectorize that, and then see what is different for scatters.
Comment 3 Richard Biener 2018-12-12 14:10:05 UTC
Looks like there are no scatter patterns with DFmode, only SFmode ones?  But even

void loop (float * __restrict__ a, float const * __restrict__ b, int const * __restrict__ off1, int const * __restrict__ off2, int n)
{
#if defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#elif defined(__GNUC__)
#pragma GCC ivdep
#endif
    for (int i = 0; i < n; ++i)
    {
        if (b[i] < b[i])
            a[off1[i]] = b[i];
        else
            a[off2[i]] = b[i];
    }
}

doesn't work.
Comment 4 Jakub Jelinek 2018-12-12 17:40:25 UTC
So, let's use a more complete testcase:
void
f1 (double * __restrict__ a, double const * __restrict__ b,
    int const * __restrict__ off1, int const * __restrict__ off2, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    {
      if (b[off1[i]] < b[off2[i]])
	a[off1[i]] = b[off1[i]];
      else
	a[off2[i]] = b[off2[i]];
    }
}

void
f2 (double * __restrict__ a, double const * __restrict__ b,
    long const * __restrict__ off1, long const * __restrict__ off2, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    {
      if (b[off1[i]] < b[off2[i]])
	a[off1[i]] = b[off1[i]];
      else
	a[off2[i]] = b[off2[i]];
    }
}

void
f3 (float * __restrict__ a, float const * __restrict__ b,
    int const * __restrict__ off1, int const * __restrict__ off2, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    {
      if (b[off1[i]] < b[off2[i]])
	a[off1[i]] = b[off1[i]];
      else
	a[off2[i]] = b[off2[i]];
    }
}

void
f4 (double * __restrict__ a, const double * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[i] = b[off[i]];
}

void
f5 (double * __restrict__ a, const double * __restrict__ b, long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[i] = b[off[i]];
}

void
f6 (float * __restrict__ a, const float * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[i] = b[off[i]];
}

void
f7 (double * __restrict__ a, const double * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[off[i]] = b[i];
}

void
f8 (double * __restrict__ a, const double * __restrict__ b, long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[off[i]] = b[i];
}

void
f9 (float * __restrict__ a, const float * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[off[i]] = b[i];
}

void
f10 (double * __restrict__ a, const double * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (a[i] > 10)
      a[i] = b[off[i]];
}

void
f11 (double * __restrict__ a, const double * __restrict__ b, long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (a[i] > 10)
      a[i] = b[off[i]];
}

void
f12 (float * __restrict__ a, const float * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (a[i] > 10)
      a[i] = b[off[i]];
}

void
f13 (double * __restrict__ a, const double * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (b[i] > 10)
      a[off[i]] = b[i];
}

void
f14 (double * __restrict__ a, const double * __restrict__ b, long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (b[i] > 10)
      a[off[i]] = b[i];
}

void
f15 (float * __restrict__ a, const float * __restrict__ b, int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (b[i] > 10)
      a[off[i]] = b[i];
}

With -O3 -mavx512{f,bw,dq,vl} -mtune=skylake-avx512
the f4/f5/f6 loops are vectorized for all of -mprefer-vector-width={128,256,512}
(though e.g. for -mtune=generic not) and f7/f8/f9 are vectorized only with -mprefer-vector-width=512 using vscatter*.
f10/f11/f12 are vectorized using vgather*, but strangely even in -mprefer-vector-width=512 mode they use the AVX2 gathers. That suggests that while unconditional scatters and gathers are properly supported, for conditional ones
(both MASK_LOAD and MASK_STORE) the support for the cases when using a mask register rather than a vector register with mask either hasn't been done or doesn't work properly.
Comment 5 Jakub Jelinek 2018-12-12 18:05:03 UTC
We have:
      else if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
        {
          tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gs_info.decl));
          tree masktype
            = TREE_VALUE (TREE_CHAIN (TREE_CHAIN (TREE_CHAIN (arglist))));
          if (TREE_CODE (masktype) == INTEGER_TYPE)
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "masked gather with integer mask not"
                                 " supported.");
              return false;
            }
        }     
and no support for masked scatter stores when not using the aarch64-style ifns.
Comment 6 Jakub Jelinek 2018-12-12 18:10:23 UTC
See PR59617 for the masked gathers.
Comment 7 Jakub Jelinek 2018-12-12 19:13:08 UTC
Created attachment 45220 [details]
gcc9-pr88464.patch

Completely untested patch for the AVX512F conditional gather vectorization support.
Comment 8 Moritz Kreutzer 2018-12-13 14:20:36 UTC
Thanks for the input and for confirming that "for conditional ones (both MASK_LOAD and MASK_STORE) the support for the cases when using a mask register rather than a vector register with mask either hasn't been done or doesn't work properly." 

Any idea about a possible way forward to fix or implement those? As stated above, even though I have no experience with GCC development, I'd be happy to assist to the best of my ability if at all possible.
Comment 9 Jakub Jelinek 2018-12-13 17:02:22 UTC
Author: jakub
Date: Thu Dec 13 17:01:50 2018
New Revision: 267097

URL: https://gcc.gnu.org/viewcvs?rev=267097&root=gcc&view=rev
Log:
	PR tree-optimization/88464
	* tree-vect-stmts.c (vect_build_gather_load_calls): Handle INTEGER_TYPE
	masktype if mask is non-NULL.
	(vectorizable_load): Don't reject masked gather loads if masktype
	in the decl is INTEGER_TYPE.

	* gcc.target/i386/avx512f-pr88462-1.c: New test.
	* gcc.target/i386/avx512f-pr88462-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
Comment 10 Jakub Jelinek 2018-12-13 20:02:46 UTC
For masked gathers it is now fixed on the trunk, for masked stores I'll have a look tomorrow.
Comment 11 Moritz Kreutzer 2018-12-14 11:44:28 UTC
Jakub, I can confirm it's working for masked gathers (we have a similar pattern elsewhere in our code) with the latest trunk. Thanks for looking at the scatters as well!
Comment 12 Jakub Jelinek 2018-12-14 17:28:06 UTC
Created attachment 45239 [details]
gcc9-pr88464-2.patch

Incremental untested patch.
First of all, I'd missed an important case for gathers in the testsuite coverage (float with a long index type on -m64), which actually ICEd due to multiple issues.
The rest of the patch implements masked scatters, though at least for now only for 512-bit vectors.  The problem for 128-bit and 256-bit is that the vectorizer computes the comparisons in an integral vector rather than a bool vector (i.e. int); perhaps we could just emit an instruction that sets masks from the MSB bits of such a vector.
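
The suggested fixup can be sketched in scalar C: a comparison result held as a vector of int lanes (all-ones or all-zeros per lane) can be collapsed into a mask-register-style bitmask by collecting each lane's most significant bit, in the spirit of vmovmskps. A sketch with an illustrative helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Collect each int lane's sign bit into a bitmask, one bit per lane.
   Scalar model of a "set mask from MSBs" instruction; illustrative only. */
static unsigned msb_to_mask(const int32_t *lanes, int n)
{
    unsigned mask = 0;
    for (int l = 0; l < n; ++l)
        mask |= ((uint32_t)lanes[l] >> 31) << l;   /* take the sign bit */
    return mask;
}
```

A true comparison lane is all-ones, so its sign bit is 1; a false lane is all-zeros.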
Comment 13 Jakub Jelinek 2018-12-15 11:03:00 UTC
Author: jakub
Date: Sat Dec 15 11:02:28 2018
New Revision: 267169

URL: https://gcc.gnu.org/viewcvs?rev=267169&root=gcc&view=rev
Log:
	PR tree-optimization/88464
	PR target/88498
	* tree-vect-stmts.c (vect_build_gather_load_calls): For NARROWING
	and mask with integral masktype, don't try to permute mask vectors,
	instead emit VEC_UNPACK_{LO,HI}_EXPR.  Fix up NOP_EXPR operand.
	(vectorizable_store): Handle masked scatters with decl and integral
	mask type.
	(permute_vec_elements): Allow scalar_dest to be NULL.
	* config/i386/i386.c (ix86_get_builtin)
	<case IX86_BUILTIN_GATHER3ALTDIV16SF>: Use lowpart_subreg for masks.
	<case IX86_BUILTIN_GATHER3ALTDIV8SF>: Don't assume mask and src have
	to be the same.

	* gcc.target/i386/avx512f-pr88462-1.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-1.c: ... this.  Fix up PR number.
	Expect 4 vectorized loops instead of 3.
	(f4): New function.
	* gcc.target/i386/avx512f-pr88462-2.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-2.c: ... this.  Fix up PR number
	and #include.
	(avx512f_test): Prepare arguments for f4 and check the results.
	* gcc.target/i386/avx512f-pr88464-3.c: New test.
	* gcc.target/i386/avx512f-pr88464-4.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-1.c
      - copied, changed from r267168, trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-2.c
      - copied, changed from r267168, trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-2.c
Removed:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
Comment 14 Jakub Jelinek 2018-12-15 11:05:14 UTC
Author: jakub
Date: Sat Dec 15 11:04:41 2018
New Revision: 267170

URL: https://gcc.gnu.org/viewcvs?rev=267170&root=gcc&view=rev
Log:
	PR tree-optimization/88464
	PR target/88498
	* tree-vect-stmts.c (vect_build_gather_load_calls): For NARROWING
	and mask with integral masktype, don't try to permute mask vectors,
	instead emit VEC_UNPACK_{LO,HI}_EXPR.  Fix up NOP_EXPR operand.
	(vectorizable_store): Handle masked scatters with decl and integral
	mask type.
	(permute_vec_elements): Allow scalar_dest to be NULL.
	* config/i386/i386.c (ix86_get_builtin)
	<case IX86_BUILTIN_GATHER3ALTDIV16SF>: Use lowpart_subreg for masks.
	<case IX86_BUILTIN_GATHER3ALTDIV8SF>: Don't assume mask and src have
	to be the same.

	* gcc.target/i386/avx512f-pr88462-1.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-1.c: ... this.  Fix up PR number.
	Expect 4 vectorized loops instead of 3.
	(f4): New function.
	* gcc.target/i386/avx512f-pr88462-2.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-2.c: ... this.  Fix up PR number
	and #include.
	(avx512f_test): Prepare arguments for f4 and check the results.
	* gcc.target/i386/avx512f-pr88464-3.c: New test.
	* gcc.target/i386/avx512f-pr88464-4.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-3.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-4.c
Comment 15 Jakub Jelinek 2018-12-15 11:23:31 UTC
Should be fixed now.
Comment 16 Moritz Kreutzer 2018-12-17 09:52:43 UTC
I can confirm the fix from my side.

Thanks again!
Comment 17 Jakub Jelinek 2018-12-18 11:10:18 UTC
Created attachment 45251 [details]
gcc9-pr88464-3.patch

Incremental patch over the PR88514 fix to add scatter vectorization with AVX512VL even for -mprefer-vector-width={128,256}.
Comment 18 Jakub Jelinek 2018-12-18 18:41:58 UTC
Author: jakub
Date: Tue Dec 18 18:41:26 2018
New Revision: 267239

URL: https://gcc.gnu.org/viewcvs?rev=267239&root=gcc&view=rev
Log:
	PR target/88464
	* config/i386/i386-builtin-types.def
	(VOID_FTYPE_PDOUBLE_QI_V8SI_V4DF_INT,
	VOID_FTYPE_PFLOAT_QI_V4DI_V8SF_INT,
	VOID_FTYPE_PLONGLONG_QI_V8SI_V4DI_INT,
	VOID_FTYPE_PINT_QI_V4DI_V8SI_INT,
	VOID_FTYPE_PDOUBLE_QI_V4SI_V2DF_INT,
	VOID_FTYPE_PFLOAT_QI_V2DI_V4SF_INT,
	VOID_FTYPE_PLONGLONG_QI_V4SI_V2DI_INT,
	VOID_FTYPE_PINT_QI_V2DI_V4SI_INT): New builtin types.
	* config/i386/i386.c (enum ix86_builtins): Add
	IX86_BUILTIN_SCATTERALTSIV4DF, IX86_BUILTIN_SCATTERALTDIV8SF,
	IX86_BUILTIN_SCATTERALTSIV4DI, IX86_BUILTIN_SCATTERALTDIV8SI,
	IX86_BUILTIN_SCATTERALTSIV2DF, IX86_BUILTIN_SCATTERALTDIV4SF,
	IX86_BUILTIN_SCATTERALTSIV2DI and IX86_BUILTIN_SCATTERALTDIV4SI.
	(ix86_init_mmx_sse_builtins): Fix up names of IX86_BUILTIN_GATHERALT*,
	IX86_BUILTIN_GATHER3ALT* and IX86_BUILTIN_SCATTERALT* builtins to
	match the IX86_BUILTIN codes.  Build IX86_BUILTIN_SCATTERALTSIV4DF,
	IX86_BUILTIN_SCATTERALTDIV8SF, IX86_BUILTIN_SCATTERALTSIV4DI,
	IX86_BUILTIN_SCATTERALTDIV8SI, IX86_BUILTIN_SCATTERALTSIV2DF,
	IX86_BUILTIN_SCATTERALTDIV4SF, IX86_BUILTIN_SCATTERALTSIV2DI and
	IX86_BUILTIN_SCATTERALTDIV4SI decls.
	(ix86_vectorize_builtin_scatter): Expand those new builtins.

	* gcc.target/i386/avx512f-pr88464-5.c: New test.
	* gcc.target/i386/avx512f-pr88464-6.c: New test.
	* gcc.target/i386/avx512f-pr88464-7.c: New test.
	* gcc.target/i386/avx512f-pr88464-8.c: New test.
	* gcc.target/i386/avx512vl-pr88464-5.c: New test.
	* gcc.target/i386/avx512vl-pr88464-6.c: New test.
	* gcc.target/i386/avx512vl-pr88464-7.c: New test.
	* gcc.target/i386/avx512vl-pr88464-8.c: New test.
	* gcc.target/i386/avx512vl-pr88464-9.c: New test.
	* gcc.target/i386/avx512vl-pr88464-10.c: New test.
	* gcc.target/i386/avx512vl-pr88464-11.c: New test.
	* gcc.target/i386/avx512vl-pr88464-12.c: New test.
	* gcc.target/i386/avx512vl-pr88464-13.c: New test.
	* gcc.target/i386/avx512vl-pr88464-14.c: New test.
	* gcc.target/i386/avx512vl-pr88464-15.c: New test.
	* gcc.target/i386/avx512vl-pr88464-16.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-5.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-6.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-7.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-8.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-10.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-11.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-12.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-13.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-14.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-15.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-16.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-5.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-6.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-7.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-8.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-builtin-types.def
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 19 Uroš Bizjak 2018-12-19 08:39:49 UTC
FYI, there are quite some sequences like:

        kmovw   %k1, %r11d
        testb   %r11b, %r11b
        jne     .L63

(e.g. when compiling avx512f-pr88464-1.c).

Perhaps ktest insn can be utilized here?
Comment 20 Jakub Jelinek 2018-12-19 08:49:59 UTC
(In reply to Uroš Bizjak from comment #19)
> FYI, there are quite some sequences like:
> 
>         kmovw   %k1, %r11d
>         testb   %r11b, %r11b
>         jne     .L63
> 
> (e.g. when compiling avx512f-pr88464-1.c).
> 
> Perhaps ktest insn can be utilized here?

It can if -mavx512dq, because we need ktestb rather than ktestw.  Or we could do kandw first, but that would be longer (movl $255, %r11d; kmovw %r11d, %k2; kandw %k1, %k2; ktestw %k2, %k2).  Have you tried with -mavx512dq?  I haven't added it to dg-options of the tests, because I wanted to test the behavior without it too and duplicating all the tests also for -mavx512dq would be too much.
Comment 21 Uroš Bizjak 2018-12-19 08:52:55 UTC
(In reply to Jakub Jelinek from comment #20)
> (In reply to Uroš Bizjak from comment #19)
> > FYI, there are quite some sequences like:
> > 
> >         kmovw   %k1, %r11d
> >         testb   %r11b, %r11b
> >         jne     .L63
> > 
> > (e.g. when compiling avx512f-pr88464-1.c).
> > 
> > Perhaps ktest insn can be utilized here?
> 
> It can if -mavx512dq, because we need ktestb rather than ktestw.  Or we
> could do kandw first, but that would be longer (movl $255, %r11d; kmovw
> $r11d, %k2; kandw %k1, %k2; ktestw %k2, %k2).  Have you tried with
> -mavx512dq?  I haven't added it to dg-options of the tests, because I wanted
> to test the behavior without it too and duplicating all the tests also for
> -mavx512dq would be too much.

Indeed, -mavx512dq produces expected ktestb. Sorry for the false alarm.
Comment 22 Uroš Bizjak 2018-12-19 08:58:56 UTC
(In reply to Uroš Bizjak from comment #21)
> Indeed, -mavx512dq produces expected ktestb. Sorry for the false alarm.

f3 and f4 functions however can use ktestw even without -mavx512dq.
Comment 23 Uroš Bizjak 2018-12-19 09:17:17 UTC
(In reply to Uroš Bizjak from comment #22)
> (In reply to Uroš Bizjak from comment #21)
> > Indeed, -mavx512dq produces expected ktestb. Sorry for the false alarm.
> 
> f3 and f4 functions however can use ktestw even without -mavx512dq.

Eh, not my day. We can use kortestw to check if operand is zero with -mavx512f.
Comment 24 Uroš Bizjak 2018-12-19 09:42:57 UTC
(In reply to Uroš Bizjak from comment #23)
> (In reply to Uroš Bizjak from comment #22)
> > (In reply to Uroš Bizjak from comment #21)
> > > Indeed, -mavx512dq produces expected ktestb. Sorry for the false alarm.
> > 
> > f3 and f4 functions however can use ktestw even without -mavx512dq.
> 
> Eh, not my day. We can use kortestw to check if operand is zero with
> -mavx512f.

So, something like:

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 6e29427e30c..c529406cccc 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1244,20 +1244,20 @@
        (compare:CC (match_operand:SWI48 0 "nonimmediate_operand")
                    (match_operand:SWI48 1 "<general_operand>")))])
 
-(define_mode_iterator SWI1248_AVX512BWDQ2_64
-  [(QI "TARGET_AVX512DQ") (HI "TARGET_AVX512DQ")
+(define_mode_iterator SWI1248_AVX512BWDQ_64
+  [(QI "TARGET_AVX512DQ") HI
    (SI "TARGET_AVX512BW") (DI "TARGET_AVX512BW && TARGET_64BIT")])
 
 (define_insn "*cmp<mode>_ccz_1"
   [(set (reg FLAGS_REG)
-       (compare (match_operand:SWI1248_AVX512BWDQ2_64 0
+       (compare (match_operand:SWI1248_AVX512BWDQ_64 0
                        "nonimmediate_operand" "<r>,?m<r>,$k")
-                (match_operand:SWI1248_AVX512BWDQ2_64 1 "const0_operand")))]
-  "ix86_match_ccmode (insn, CCZmode)"
+                (match_operand:SWI1248_AVX512BWDQ_64 1 "const0_operand")))]
+  "TARGET_AVX512F && ix86_match_ccmode (insn, CCZmode)"
   "@
    test{<imodesuffix>}\t%0, %0
    cmp{<imodesuffix>}\t{%1, %0|%0, %1}
-   ktest<mskmodesuffix>\t%0, %0"
+   kortest<mskmodesuffix>\t%0, %0"
   [(set_attr "type" "test,icmp,msklog")
    (set_attr "length_immediate" "0,1,*")
    (set_attr "prefix" "*,*,vex")
--cut here--
Comment 25 Jakub Jelinek 2018-12-19 09:43:42 UTC
Isn't ktestw and kortestw the same thing when both operands are the same mask register?
Comment 26 Jakub Jelinek 2018-12-19 09:45:28 UTC
And the TARGET_AVX512F &&  looks incorrect, then we wouldn't be able to test or cmp without -mavx512f.
Comment 27 Uroš Bizjak 2018-12-19 10:00:02 UTC
(In reply to Jakub Jelinek from comment #25)
> Isn't ktestw and kortestw the same thing when both operands are the same
> mask register?
True, but kortestw is available with AVX512F, where ktestw is not.

(In reply to Jakub Jelinek from comment #26)
> And the TARGET_AVX512F &&  looks incorrect, then we wouldn't be able to test
> or cmp without -mavx512f.
No, we fall to *cmp<mode>_ccno_1, which is compatible with CCZmode.
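
The ktest/kortest equivalence discussed above can be modeled in C: ktest sets ZF when the AND of the operands is zero, kortest when the OR is zero, so with both operands being the same mask register either one tests that register for zero. A simplified flag model (only ZF, and only 16-bit masks; not the full instruction semantics):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified ZF models of ktestw and kortestw.  With a == b,
   a & a == a | a == a, so both reduce to "is the mask zero". */
static int ktestw_zf(uint16_t a, uint16_t b)   { return (uint16_t)(a & b) == 0; }
static int kortestw_zf(uint16_t a, uint16_t b) { return (uint16_t)(a | b) == 0; }
```

The instructions only differ when the operands differ, which is why kortestw (available in plain AVX512F) can stand in for ktestw here.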
Comment 28 Jakub Jelinek 2018-12-19 10:16:56 UTC
(In reply to Uroš Bizjak from comment #27)
> (In reply to Jakub Jelinek from comment #25)
> > Isn't ktestw and kortestw the same thing when both operands are the same
> > mask register?
> True, but kortestw is available with AVX512F, where ktestw is not.
> 
> (In reply to Jakub Jelinek from comment #26)
> > And the TARGET_AVX512F &&  looks incorrect, then we wouldn't be able to test
> > or cmp without -mavx512f.
> No, we fall to *cmp<mode>_ccno_1, which is compatible with CCZmode.

You're right, sorry for the noise.  Your patch looks good to me.

There is another issue though (I guess not correctness, but efficiency), e.g. on avx512vl-pr88464-{1,3}.c.
E.g. in avx512vl-pr88464-3.c we have:
  if (mask__40.16_82 == { 0, 0 })
    goto <bb 7>; [100.00%]
  else
    goto <bb 6>; [20.00%]
in *.optimized, an attempt to jump around masked stores if the mask is all zeros.

We emit:
;; if (mask__40.16_82 == { 0, 0 })
    
(insn 44 43 45 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 131 [ mask__40.16 ])
            (const_int 0 [0]))) -1
     (nil))
  
(jump_insn 45 44 0 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 0)
            (pc))) -1
     (int_list:REG_BR_PROB 1073741831 (nil)))
for this, i.e.
        kmovw   %k2, %r10d
        testb   %r10b, %r10b
        je      .L4
(without -mavx512dq; with -mavx512dq I guess ktestb or kortestb), but perhaps we should emit kmovw %k2, %r10d; testb $3, %r10b; je .L4 instead?
If the setter is a compare that clears the higher bits, then it makes no difference, but if we are e.g. looking at the low 2 or 4 bits of a 4- or 8-bit mask, then it will do a masked store even if the 2 or 4 bits we care about are clear and just some upper bits are not.
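
The proposed branch condition can be sketched in C: test only the lanes the masked store actually uses, so that stray upper bits in the mask register don't defeat the skip. The helper name and lane constant are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the suggested skip test for a narrow masked store:
   "testb $3, %r10b" checks only the low two lanes, unlike
   "testb %r10b, %r10b", which checks all eight bits. */
static int should_skip_store(uint8_t kreg, uint8_t lanes_in_use)
{
    return (kreg & lanes_in_use) == 0;
}
```

With kreg = 0xF0 and two lanes in use, the plain testb would see a nonzero register and perform the masked store anyway, even though no relevant lane is set.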