Bug 98348 - [10 Regression] GCC 10.2 AVX512 Mask regression from GCC 9
Summary: [10 Regression] GCC 10.2 AVX512 Mask regression from GCC 9
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 10.2.0
: P2 normal
Target Milestone: 11.3
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 96891 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-12-17 14:43 UTC by Daniel Han-Chen
Modified: 2022-01-12 06:15 UTC (History)
5 users (show)

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
This patch can fix this issue (2.12 KB, patch)
2020-12-19 02:50 UTC, Hongtao.liu
Details | Diff
gcc11-pr98348.patch (1.20 KB, patch)
2020-12-19 11:22 UTC, Jakub Jelinek
Details | Diff
Incremental to gcc11-pr99348.patch (3.57 KB, patch)
2020-12-21 11:25 UTC, Hongtao.liu
Details | Diff
gcc11-pr98348_v3.patch (4.44 KB, patch)
2020-12-22 09:14 UTC, Hongtao.liu
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Han-Chen 2020-12-17 14:43:30 UTC
In GCC 9, vector comparisons on 128 and 256bit vectors on a AVX512 machine used vpcmpeqd without any masks.

In GCC 10, for 128bit and 256bit vectors, AVX512 mask instructions are used.
https://gcc.godbolt.org/z/1sPzM5

GCC 10 should follow GCC 9 for vector comparisons when a mask is not needed.

The reason why is https://uops.info/table.html shows that using mask registers makes 128/256/512 operations have a throughput of 1 and a latency of 3.

However, using a vector comparison directly has a throughput of 2 and a latency of 1.
Comment 1 Daniel Han-Chen 2020-12-17 14:45:51 UTC
I also just noticed that in GCC 10, an extra movdqa is done, which is also not necessary.
Comment 2 H.J. Lu 2020-12-17 16:26:59 UTC
Hongtao, can you take a look?
Comment 3 Jakub Jelinek 2020-12-17 19:34:46 UTC
Started with r10-5250-g8b905e9b0c09530c0f660563540257f3d181c2ac
Perhaps peephole2s or something similar to catch that?
Comment 4 Hongtao.liu 2020-12-18 01:37:23 UTC
(In reply to Jakub Jelinek from comment #3)
> Started with r10-5250-g8b905e9b0c09530c0f660563540257f3d181c2ac
> Perhaps peephole2s or something similar to catch that?

A patch(with peephole2) is posted at https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559301.html.

Maybe i'll add a combine spliter to handle this case.
Comment 5 Hongtao.liu 2020-12-19 02:50:39 UTC
Created attachment 49804 [details]
This patch can fix this issue

I'm testing the patch.
Comment 6 Jakub Jelinek 2020-12-19 10:22:18 UTC
In the light of the recent discussions I've been wondering about doing it as
combine splitters only, like roughly:
--- sse.md.jj	2020-12-03 10:04:35.862093285 +0100
+++ sse.md	2020-12-19 11:00:14.272140859 +0100
@@ -2965,6 +2965,40 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "<MODE>")])
 
+(define_split
+  [(set (match_operand 0 "register_operand")
+	(vec_merge
+	  (match_operand 1 "vector_all_ones_operand")
+	  (match_operand 2 "const0_operand")
+	  (unspec
+	    [(match_operand 3 "register_operand")
+	     (match_operand 4 "nonimmediate_operand")
+	     (match_operand:SI 5 "const_0_to_31_operand")]
+	     UNSPEC_PCMP)))]
+  "TARGET_AVX512VL
+   && GET_MODE_CLASS (GET_MODE (operands[0])) == MODE_VECTOR_INT
+   && (GET_MODE_SIZE (GET_MODE (operands[1])) == 16
+       || GET_MODE_SIZE (GET_MODE (operands[1])) == 32)
+   && GET_MODE (operands[1]) == GET_MODE (operands[0])
+   && GET_MODE (operands[2]) == GET_MODE (operands[0])
+   && GET_MODE_CLASS (GET_MODE (operands[3])) == MODE_VECTOR_FLOAT
+   && (GET_MODE_SIZE (GET_MODE (operands[3]))
+       == GET_MODE_SIZE (GET_MODE (operands[0])))
+   && (GET_MODE_UNIT_SIZE (GET_MODE (operands[3]))
+       == GET_MODE_UNIT_SIZE (GET_MODE (operands[0])))
+   && GET_MODE (operands[4]) == GET_MODE (operands[3])"
+  [(set (match_dup 6) (match_dup 7))
+   (set (match_dup 0) (match_dup 8))]
+{
+  operands[6] = gen_reg_rtx (GET_MODE (operands[3]));
+  operands[7]
+    = gen_rtx_UNSPEC (GET_MODE (operands[3]),
+		      gen_rtvec (3, operands[3], operands[4], operands[5]),
+		      UNSPEC_PCMP);
+  operands[8] = lowpart_subreg (GET_MODE (operands[0]), operands[6],
+				GET_MODE (operands[3]));
+})
+
 (define_insn "avx_vmcmp<mode>3"
   [(set (match_operand:VF_128 0 "register_operand" "=x")
 	(vec_merge:VF_128

The advantage is that one pattern can then handle in theory all (or half) of the floating point comparison cases.
One problem is that combiner still doesn't even try the splitting if only combining two insns.
Also, but I think that is in your patch too, vector_all_ones_operand will match only integral all ones vectors, I think we want another predicate that will match even MEMs with the floating point version thereof (a NaN kind with all bits set).  And, we should have splitters for not just the -1 0 order in VEC_MERGE, but also the 0 -1 order by inverting the comparison carefully.
Comment 7 Jakub Jelinek 2020-12-19 10:47:36 UTC
Seems the adjustment for last UNSPEC_PCMP operand between all_ones, const0 and
const0, all_ones is GEN_INT (INTVAL (operands[3]) ^ 4).
Comment 8 Jakub Jelinek 2020-12-19 11:22:49 UTC
Created attachment 49806 [details]
gcc11-pr98348.patch

So, if we go for GCC11 the way of pre-reload define_insn_and_split, this is some incremental untested progress on your patch (just the sse.md part of it).
Changes:
1) it is undesirable to put SUBREGs on the SET_DEST side, as it prevents other optimizations later on
2) I don't see the point on the TARGET_AVX512BW ||, the insn in the end is
plain AVX or AVX2 or SSE4* etc. one
3) handles also the const0 vector_all_ones order
4) for commutative cases allows any operand order, for others ensures the right
one of the operands is register
5) handles also the LE case by swapping the comparison operands

The patch doesn't handle the cases where based on the comparison one sets up floating vectors, as can be seen e.g. in:
typedef float V128 __attribute__ ((vector_size(16)));
typedef float V256 __attribute__ ((vector_size(32)));
typedef float V512 __attribute__ ((vector_size(64)));

V128
foo (V128 x)
{
  const union U { unsigned u; float f; } u = { -1U };
  return x > 0.0f ? u.f : 0.0f;
}

V256
bar (V256 x)
{
  const union U { unsigned u; float f; } u = { -1U };
  return x > 0.0f ? u.f : 0.0f;
}
Comment 9 Hongtao.liu 2020-12-21 11:23:26 UTC
(In reply to Jakub Jelinek from comment #8)
> Created attachment 49806 [details]
> gcc11-pr98348.patch
> 
> So, if we go for GCC11 the way of pre-reload define_insn_and_split, this is
> some incremental untested progress on your patch (just the sse.md part of
> it).
> Changes:
> 1) it is undesirable to put SUBREGs on the SET_DEST side, as it prevents
> other optimizations later on
> 2) I don't see the point on the TARGET_AVX512BW ||, the insn in the end is
> plain AVX or AVX2 or SSE4* etc. one
> 3) handles also the const0 vector_all_ones order
> 4) for commutative cases allows any operand order, for others ensures the
> right

The bellow pattern should be equivilent to const0 vector_all_ones order, but the generic part didn't simplify it, so i made a bit adjustment to those patterns, also some changes in ix86_expand_sse_movcc to generate common NOT operator instead of gen_knot<mode> which has UNSPEC_MASKOP inside, so the combine can do the right thing.

Successfully matched this instruction:
(set (reg:V16QI 82 [ <retval> ])
    (vec_merge:V16QI (const_vector:V16QI [
                (const_int -1 [0xffffffffffffffff]) repeated x16
            ])
        (const_vector:V16QI [
                (const_int 0 [0]) repeated x16
            ])
        (not:HI (unspec:HI [
                    (reg:V16QI 89)
                    (reg:V16QI 90)
                    (const_int 4 [0x4])
                ] UNSPEC_PCMP))))

> one of the operands is register
> 5) handles also the LE case by swapping the comparison operands
> 
> The patch doesn't handle the cases where based on the comparison one sets up
> floating vectors, as can be seen e.g. in:
> typedef float V128 __attribute__ ((vector_size(16)));
> typedef float V256 __attribute__ ((vector_size(32)));
> typedef float V512 __attribute__ ((vector_size(64)));
> 
> V128
> foo (V128 x)
> {
>   const union U { unsigned u; float f; } u = { -1U };
>   return x > 0.0f ? u.f : 0.0f;
> }
> 
> V256
> bar (V256 x)
> {
>   const union U { unsigned u; float f; } u = { -1U };
>   return x > 0.0f ? u.f : 0.0f;
> }

I'm adding a new predicate named float_vector_all_ones_operand and corresponding define_insn_and_split to handle it.

Successfully matched this instruction:
(set (reg:V4SF 82 [ <retval> ])
    (vec_merge:V4SF (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16 A128])
        (const_vector:V4SF [
                (const_double:SF 0.0 [0x0.0p+0]) repeated x4
            ])
        (unspec:QI [
                (reg:V4SF 84)
                (reg:V4SF 88)
                (const_int 1 [0x1])
            ] UNSPEC_PCMP)))
Comment 10 Hongtao.liu 2020-12-21 11:25:26 UTC
Created attachment 49819 [details]
Incremental to gcc11-pr99348.patch

Update patch.
Comment 11 Jakub Jelinek 2020-12-21 11:32:27 UTC
I'm not sure about the knot changes, isn't that too risky at least at this point?
I mean, can't we instead just match what knot emits?

As for the new predicate, I think we should check CONST_DOUBLE_AS_FLOAT_P (op)
before trying to do the lowpart_subreg, perhaps even check if the CONST_DOUBLE is NaN if it can be done cheaply before doing the more expensive lowpart_subreg.
Comment 12 Hongtao.liu 2020-12-21 12:12:10 UTC
(In reply to Jakub Jelinek from comment #11)
> I'm not sure about the knot changes, isn't that too risky at least at this
> point?
> I mean, can't we instead just match what knot emits?
> 
As indicated in https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552205.html, we support bitwise operator for avx512 masks, so i believe the change is fine for GCC11, but not suitable for GCC10. 

For backport convenience, I'll make a version that matches the knot.

> As for the new predicate, I think we should check CONST_DOUBLE_AS_FLOAT_P
> (op)
> before trying to do the lowpart_subreg, perhaps even check if the
> CONST_DOUBLE is NaN if it can be done cheaply before doing the more
> expensive lowpart_subreg.

Yes.
Comment 13 Hongtao.liu 2020-12-22 09:14:41 UTC
Created attachment 49832 [details]
gcc11-pr98348_v3.patch

1. Use REAL_VALUE_TO_TARGET_SINGLE/DOUBLE in the "float_vector_all_ones_operands" predicate, it should be cheaper than lower_subreg?

2. Combine failed to match (vec_merge (vector_all_ones_operand) (const0_operand) (knot mask)) even the corresponding pattern exists, peephole2 could be ok for this but abviously not the best solution.
 IMHO the changes about knot should be functionality ok, and to avoid some potential performance issue, I extraly add some new define_insn_and_splits to negate the comparison code when there's not in the result(Dig the hole deeper and deeper:(). The patch survive regtest and bootstrap on both trunk and rlease/gcc-10 branch.
Comment 14 Hongtao.liu 2021-01-04 06:51:44 UTC
*** Bug 96891 has been marked as a duplicate of this bug. ***
Comment 15 Jakub Jelinek 2021-01-21 11:52:38 UTC
I haven't seen the patch posted to gcc-patches, have I missed it?
Patch review should happen there.
Comment 16 Hongtao.liu 2021-01-21 13:38:06 UTC
(In reply to Jakub Jelinek from comment #15)
> I haven't seen the patch posted to gcc-patches, have I missed it?
> Patch review should happen there.

https://gcc.gnu.org/pipermail/gcc-patches/2021-January/562903.html
Comment 17 CVS Commits 2021-01-22 04:37:13 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:ee78c20e74d30284fee36e22a64e86e45e676029

commit r11-6849-gee78c20e74d30284fee36e22a64e86e45e676029
Author: liuhongt <hongtao.liu@intel.com>
Date:   Fri Dec 18 15:56:06 2020 +0800

    Lower AVX512 vector comparison to AVX version when dest is vector.
    
    gcc/ChangeLog:
    
            PR target/96891
            PR target/98348
            * config/i386/sse.md (VI_128_256): New mode iterator.
            (*avx_cmp<mode>3_1, *avx_cmp<mode>3_2, *avx_cmp<mode>3_3,
             *avx_cmp<mode>3_4, *avx2_eq<mode>3, *avx2_pcmp<mode>3_1,
             *avx2_pcmp<mode>3_2, *avx2_gt<mode>3): New
            define_insn_and_split to lower avx512 vector comparison to avx
            version when dest is vector.
            (*<avx512>_cmp<mode>3,*<avx512>_cmp<mode>3,*<avx512>_ucmp<mode>3):
            define_insn_and_split for negating the comparison result.
            * config/i386/predicates.md (float_vector_all_ones_operand):
            New predicate.
            * config/i386/i386-expand.c (ix86_expand_sse_movcc): Use
            general NOT operator without UNSPEC_MASKOP.
    
    gcc/testsuite/ChangeLog:
    
            PR target/96891
            PR target/98348
            * gcc.target/i386/avx512bw-pr96891-1.c: New test.
            * gcc.target/i386/avx512f-pr96891-1.c: New test.
            * gcc.target/i386/avx512f-pr96891-2.c: New test.
            * gcc.target/i386/avx512f-pr96891-3.c: New test.
            * g++.target/i386/avx512f-pr96891-1.C: New test.
            * gcc.target/i386/bitwise_mask_op-3.c: Adjust testcase.
Comment 18 Hongtao.liu 2021-01-22 04:40:31 UTC
Fix on trunk sofar, as jakub suggests, not back port to gcc10.
Comment 19 Richard Biener 2021-04-08 12:02:22 UTC
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
Comment 20 Dávid Bolvanský 2021-04-11 13:17:46 UTC
Some small regression (missed opportunity to use vptestnmd):

Current trunk

compare(unsigned int __vector(16)):
  vpxor xmm1, xmm1, xmm1
  vpcmpd k0, zmm0, zmm1, 0
  vpmovm2d zmm0, k0
  ret

GCC 9.2

compare(unsigned int __vector(16)):
  vptestnmd k0, zmm0, zmm0
  vpmovm2d zmm0, k0
  ret


https://gcc.godbolt.org/z/5vK68jM3r
Comment 21 Hongtao.liu 2021-04-16 07:43:06 UTC
(In reply to Dávid Bolvanský from comment #20)
> Some small regression (missed opportunity to use vptestnmd):
> 
> Current trunk
> 
> compare(unsigned int __vector(16)):
>   vpxor xmm1, xmm1, xmm1
>   vpcmpd k0, zmm0, zmm1, 0
>   vpmovm2d zmm0, k0
>   ret
> 
> GCC 9.2
> 
> compare(unsigned int __vector(16)):
>   vptestnmd k0, zmm0, zmm0
>   vpmovm2d zmm0, k0
>   ret
> 
> 
> https://gcc.godbolt.org/z/5vK68jM3r

I'm testing 

+(define_insn_and_split "*<avx512>_eq<mode>3"
+  [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
+	(unspec:<avx512fmaskmode>
+	  [(match_operand:VI_AVX512BW 1 "nonimm_or_0_operand")
+	   (match_operand:VI_AVX512BW 2 "nonimm_or_0_operand")
+	   (const_int 0)]
+	  UNSPEC_PCMP_ITER))]
+  "TARGET_AVX512F && ix86_pre_reload_split ()
+  && (operands[1] == CONST0_RTX (<MODE>mode)
+      || operands[2] == CONST0_RTX (<MODE>mode))"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+  	(unspec:<avx512fmaskmode>
+	  [(match_dup 1) (match_dup 2)]
+	  UNSPEC_MASKED_EQ))])
+
Comment 22 Hongtao.liu 2021-05-07 02:25:06 UTC
(In reply to Dávid Bolvanský from comment #20)
> Some small regression (missed opportunity to use vptestnmd):
> 
> Current trunk
> 
> compare(unsigned int __vector(16)):
>   vpxor xmm1, xmm1, xmm1
>   vpcmpd k0, zmm0, zmm1, 0
>   vpmovm2d zmm0, k0
>   ret
> 
> GCC 9.2
> 
> compare(unsigned int __vector(16)):
>   vptestnmd k0, zmm0, zmm0
>   vpmovm2d zmm0, k0
>   ret
> 
> 
> https://gcc.godbolt.org/z/5vK68jM3r

A patch is posed at https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568210.html
Comment 23 Hongtao.liu 2022-01-12 01:14:42 UTC
(In reply to Hongtao.liu from comment #22)
> (In reply to Dávid Bolvanský from comment #20)
> > Some small regression (missed opportunity to use vptestnmd):
> > 
> > Current trunk
> > 
> > compare(unsigned int __vector(16)):
> >   vpxor xmm1, xmm1, xmm1
> >   vpcmpd k0, zmm0, zmm1, 0
> >   vpmovm2d zmm0, k0
> >   ret
> > 
> > GCC 9.2
> > 
> > compare(unsigned int __vector(16)):
> >   vptestnmd k0, zmm0, zmm0
> >   vpmovm2d zmm0, k0
> >   ret
> > 
> > 
> > https://gcc.godbolt.org/z/5vK68jM3r
> 
> A patch is posed at
> https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568210.html

Fixed by r12-3239
Comment 24 CVS Commits 2022-01-12 05:58:01 UTC
The releases/gcc-11 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:d5e7b9fd184c22464d8724ff748e1f7ce0cd6bce

commit r11-9455-gd5e7b9fd184c22464d8724ff748e1f7ce0cd6bce
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Aug 30 15:05:14 2021 +0800

    Unify UNSPEC_MASKED_EQ/GT to the form of UNSPEC_PCMP.
    
    Currently for evex vpcmpeqb instruction, we have two forms of rtl
    template representation, one is (unspec [op1 op2] UNSPEC_MASK_EQ), the
    other is (unspec [op1, op2, const_int 0] UNSPEC_PCMP), which increases
    the maintenance burden, such as optimization (not: vpcmpeqb)
    to (vpcmpneqb) requires two define_insn_and_split to match the two
    forms respectively, this patch removes UNSPEC_MASK_EQ/GT, unifying
    them into the form of UNSPEC_PCMP.
    
    gcc/ChangeLog:
    
            PR target/98348
            * config/i386/sse.md (*<avx512>_ucmp<mode>3_1): Change from
            define_split to define_insn_and_split.
            (*avx2_eq<mode>3): Removed.
            (<avx512>_eq<mode>3<mask_scalar_merge_name>): Adjust pattern
            (<avx512>_eq<mode>3<mask_scalar_merge_name>_1): Rename to ..
            (*<avx512>_eq<mode>3<mask_scalar_merge_name>_1): .. this, and
            adjust pattern.
            (*avx2_gt<mode>3): Removed.
            (<avx512>_gt<mode>3<mask_scalar_merge_name>): Change from
            define_insn to define_expand, and adjust pattern.
            (UNSPEC_MASKED_EQ, UNSPEC_MASKED_GT): Removed.
    
    gcc/testsuite/ChangeLog:
    
            PR target/98348
            * gcc.target/i386/avx512bw-vpcmpeqb-1.c: Adjust testcase.
            * gcc.target/i386/avx512bw-vpcmpeqw-1.c: Ditto.
            * gcc.target/i386/avx512bw-vpcmpgtb-1.c: Ditto.
            * gcc.target/i386/avx512bw-vpcmpgtw-1.c: Ditto.
            * gcc.target/i386/avx512f-vpcmpeqd-1.c: Ditto.
            * gcc.target/i386/avx512f-vpcmpeqq-1.c: Ditto.
            * gcc.target/i386/avx512f-vpcmpgtd-1.c: Ditto.
            * gcc.target/i386/avx512f-vpcmpgtq-1.c: Ditto.
            * gcc.target/i386/avx512vl-vpcmpeqd-1.c: Ditto.
            * gcc.target/i386/avx512vl-vpcmpeqq-1.c: Ditto.
            * gcc.target/i386/avx512vl-vpcmpgtd-1.c: Ditto.
            * gcc.target/i386/avx512vl-vpcmpgtq-1.c: Ditto.
            * gcc.target/i386/bitwise_mask_op-1.c: Ditto.
            * gcc.target/i386/bitwise_mask_op-2.c: Ditto.
Comment 25 Hongtao.liu 2022-01-12 05:59:25 UTC
backport to GCC11, no suitable for GCC10, so fixed in GCC11/12.
Comment 26 Andrew Pinski 2022-01-12 06:15:50 UTC
Fixed.
Comment 27 Andrew Pinski 2022-01-12 06:15:58 UTC
.