Bug 103771 - [12/13/14 Regression] Missed vectorization under -mavx512f -mavx512vl after r12-5489
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 12.0
Importance: P2 normal
Target Milestone: 14.0
Assignee: Andrew Pinski
URL: https://gcc.gnu.org/pipermail/gcc-pat...
Keywords: missed-optimization, patch
Depends on:
Blocks: spec vectorizer
 
Reported: 2021-12-20 08:59 UTC by Hongyu Wang
Modified: 2023-05-08 07:39 UTC
CC List: 7 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-01-13 00:00:00


Description Hongyu Wang 2021-12-20 08:59:14 UTC
cat vect.c

typedef unsigned char uint8_t;

/* Clamp x to the range [0, 255].  */
static uint8_t x264_clip_uint8( int x )
{
    return x&(~255) ? (-x)>>31 : x;
}

void mc_weight( uint8_t * __restrict dst, uint8_t * __restrict src, int i_width, int i_scale)
{
  for( int x = 0; x < i_width; x++ )
    dst[x] = x264_clip_uint8(src[x] * i_scale);
}

It cannot be vectorized with -mavx512f -mavx512vl, but it can be vectorized with -mavx2; see https://godbolt.org/z/M1jx161f6

The commit https://gcc.gnu.org/cgi-bin/gcc-gitref.cgi?r=r12-5489 converts (x & (~255)) == 0 to x <= 255, which may expose a missing pattern with -mavx512vl.

Also, a 1.5% regression was found with -march=cascadelake due to the missing 128-bit epilogue for this loop.
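For illustration, here is a minimal source-level sketch of that canonicalization (a sketch only, not GCC code; both helpers clamp x to [0, 255], and the second form is what the vectorizer sees after r12-5489):

#include <stdint.h>

/* Sketch: r12-5489 canonicalizes the bit test in clip_bittest into
   the unsigned range check in clip_rangecheck; the two are equivalent
   since x & ~255 is nonzero exactly when (unsigned) x > 255.  */
static uint8_t clip_bittest(int x)
{
    return x & ~255 ? (-x) >> 31 : x;
}

static uint8_t clip_rangecheck(int x)
{
    return (unsigned) x > 255 ? (-x) >> 31 : x;
}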
Comment 1 Tamar Christina 2022-01-04 15:31:26 UTC
Looks like the change causes the simpler conditional to be detected by the vectorizer as a masked operation, which in principle makes sense:

note:   vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
note:   mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 : iftmp.0_20;
note:   extra pattern stmt: patt_40 = x.1_14 > 255;
note:   extra pattern stmt: patt_42 = (<signed-boolean:8>) patt_40;

However, I'm not quite sure how the masking works on x86.  The additional statement generated for patt_42 causes it to fail during vectorization:

note:   ==> examining pattern def statement: patt_42 = (<signed-boolean:8>) patt_40;
note:   ==> examining statement: patt_42 = (<signed-boolean:8>) patt_40;
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) <signed-boolean:1>
missed:   conversion not supported by target.
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) <signed-boolean:1>
note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
note:   vect_is_simple_use: vectype vector(8) <signed-boolean:1>
missed:   not vectorized: relevant stmt not supported: patt_42 = (<signed-boolean:8>) patt_40;
missed:  bad operation or unsupported loop bound.
note:  ***** Analysis  failed with vector mode V32QI

as there are no conversion patterns for `VEC_UNPACK_LO_EXPR` between bool and a mask.

That explains why it works for AVX2 and AVX512BW. AVX512F doesn't seem to allow any QImode mask conversions [1], so it fails.

I'm not sure why it does the replacement without checking that the target is able to vectorize the statements it generates later. Specifically, it doesn't check whether what's returned by build_mask_conversion is supported.

My guess is because vectorizable_condition will fail anyway without the type of the conditional being a vector boolean.

With -mavx512vl, V32QI seems to generate pattern mask conversions between vector(8) <signed-boolean:1>, and without it vector(32) <signed-boolean:8>. I think some x86 person needs to give a hint here :)

[1] https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq
Comment 2 Hongtao.liu 2022-01-05 09:00:28 UTC
(In reply to Tamar Christina from comment #1)
> Looks like the change causes the simpler conditional to be detected by the
> vectorizer as a masked operation, which in principle makes sense:
> 
> note:   vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 >
> 255 ? iftmp.0_19 : iftmp.0_20;
> note:   mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 :
> iftmp.0_20;
> note:   extra pattern stmt: patt_40 = x.1_14 > 255;
> note:   extra pattern stmt: patt_42 = (<signed-boolean:8>) patt_40;
> 
> However not quite sure how the masking works on x86.  The additional
> statement generated for patt_42 causes it to fail during vectorization:
> 
> note:   ==> examining pattern def statement: patt_42 = (<signed-boolean:8>)
> patt_40;
> note:   ==> examining statement: patt_42 = (<signed-boolean:8>) patt_40;
> note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
> note:   vect_is_simple_use: vectype vector(8) <signed-boolean:1>
> missed:   conversion not supported by target.
> note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
> note:   vect_is_simple_use: vectype vector(8) <signed-boolean:1>
> note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
> note:   vect_is_simple_use: vectype vector(8) <signed-boolean:1>
> missed:   not vectorized: relevant stmt not supported: patt_42 =
> (<signed-boolean:8>) patt_40;
> missed:  bad operation or unsupported loop bound.
> note:  ***** Analysis  failed with vector mode V32QI
> 
> as there's no conversion patterns for `VEC_UNPACK_LO_EXPR` between bool and
> a mask.
W/ AVX512 we're using a scalar mode for the mask; can we use VEC_UNPACKS_SBOOL_LO_ here?
Since we have vec_unpack_sbool_lo/hi_qi, which should be used for the conversion from vector(8) <signed-boolean:1> to vector(4) <signed-boolean:1>.
> 
> which explains why it works for AVX2 and AVX512BW. AVX512F doesn't seem to
> allow any QI mode conversions [1] so it fails..
> 
> Not sure why it's doing the replacement without checking to see that the
> target is able to vectorize the statements it generates later. Specifically
> it doesn't check if what's returned by build_mask_conversion is supported or
> not.
> 
> My guess is because vectorizable_condition will fail anyway without the type
> of the conditional being a vector boolean.
> 
> With -mavx512vl V32QI seems to generate in the pattern mask conversions
> between vector (8) <signed-boolean:1> and without it vector(32)
> <signed-boolean:8>. I think some x86 person needs to give a hint here :)
> 
> [1] https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq
Comment 3 Hongtao.liu 2022-01-13 08:43:38 UTC
for patt_42 = (<signed-boolean:1>) patt_40;

vectype_in (QImode:nunits 4)

 <vector_type 0x7fffea18de70
    type <boolean_type 0x7fffea18ddc8 public QI
        size <integer_cst 0x7fffea2e2e40 constant 8>
        unit-size <integer_cst 0x7fffea2e2e58 constant 1>
        align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18ddc8 precision:1 min <integer_cst 0x7fffea196ab0 -1> max <integer_cst 0x7fffea196b70 0>>
    QI size <integer_cst 0x7fffea2e2e40 8> unit-size <integer_cst 0x7fffea2e2e58 1>
    align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18de70 nunits:4>

vectype_out(HImode)

 <vector_type 0x7fffea18df18
    type <boolean_type 0x7fffea18ddc8 public QI
        size <integer_cst 0x7fffea2e2e40 constant 8>
        unit-size <integer_cst 0x7fffea2e2e58 constant 1>
        align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18ddc8 precision:1 min <integer_cst 0x7fffea196ab0 -1> max <integer_cst 0x7fffea196b70 0>>
    HI
    size <integer_cst 0x7fffea2e2f00 type <integer_type 0x7fffea2fd0a8 bitsizetype> constant 16>
    unit-size <integer_cst 0x7fffea2e2f18 type <integer_type 0x7fffea2fd000 sizetype> constant 2>
    align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffea18df18 nunits:16>

And ‘vec_pack_sbool_trunc_m’ only handles the situation when input and output have the same mode.
Comment 4 Hongtao.liu 2022-01-13 09:02:37 UTC
(In reply to Hongtao.liu from comment #3)
> for patt_42 = (<signed-boolean:1>) patt_40;
> 
> vectype_in (QImode:nunits 4)
> 
>  <vector_type 0x7fffea18de70
>     type <boolean_type 0x7fffea18ddc8 public QI
>         size <integer_cst 0x7fffea2e2e40 constant 8>
>         unit-size <integer_cst 0x7fffea2e2e58 constant 1>
>         align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18ddc8 precision:1 min <integer_cst 0x7fffea196ab0 -1> max
> <integer_cst 0x7fffea196b70 0>>
>     QI size <integer_cst 0x7fffea2e2e40 8> unit-size <integer_cst
> 0x7fffea2e2e58 1>
>     align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18de70 nunits:4>
> 
> vectype_out(HImode)
> 
>  <vector_type 0x7fffea18df18
>     type <boolean_type 0x7fffea18ddc8 public QI
>         size <integer_cst 0x7fffea2e2e40 constant 8>
>         unit-size <integer_cst 0x7fffea2e2e58 constant 1>
>         align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18ddc8 precision:1 min <integer_cst 0x7fffea196ab0 -1> max
> <integer_cst 0x7fffea196b70 0>>
>     HI
>     size <integer_cst 0x7fffea2e2f00 type <integer_type 0x7fffea2fd0a8
> bitsizetype> constant 16>
>     unit-size <integer_cst 0x7fffea2e2f18 type <integer_type 0x7fffea2fd000
> sizetype> constant 2>
>     align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7fffea18df18 nunits:16>
> 
> And ‘vec_pack_sbool_trunc_m’ only handle situation when input and output
> have same mode.

And the GCC vectorizer only handles 2X elements, not 4X, 8X, ...

      /* For scalar masks we may have different boolean
	 vector types having the same QImode.  Thus we
	 add additional check for elements number.  */
     if (known_eq (TYPE_VECTOR_SUBPARTS (vectype) * 2,		    
          TYPE_VECTOR_SUBPARTS (narrow_vectype)))
Comment 5 Hongtao.liu 2022-01-13 09:04:33 UTC
> And GCC vectorizer only handle 2X elements, but not 4X,8X,...
> 
>       /* For scalar masks we may have different boolean
> 	 vector types having the same QImode.  Thus we
> 	 add additional check for elements number.  */
>      if (known_eq (TYPE_VECTOR_SUBPARTS (vectype) * 2,		    
>           TYPE_VECTOR_SUBPARTS (narrow_vectype)))


We can have vec_pack_trunc_qi + vec_pack_sbool_trunc_qi by extending vec_pack_sbool_trunc_qi to handle 8/4, 4/2, 2/1 vectype_out/vectype_in elements.
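A scalar sketch of that multi-step pack (illustrative only; the helper names and the bit layout are assumptions, not GCC's implementation), where bit i of each value is mask element i:

#include <stdint.h>

/* vec_pack_sbool_trunc_qi-style step: QImode in and out, only the
   number of meaningful mask bits doubles from 4 to 8.  */
static uint8_t pack_sbool_qi4(uint8_t lo, uint8_t hi)
{
    return (uint8_t) ((lo & 0x0f) | ((hi & 0x0f) << 4));
}

/* vec_pack_trunc-style step: two QImode masks become one HImode mask.  */
static uint16_t pack_qi8_to_hi16(uint8_t lo, uint8_t hi)
{
    return (uint16_t) (lo | ((uint16_t) hi << 8));
}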
Comment 6 Richard Biener 2022-01-13 09:28:18 UTC
(In reply to Tamar Christina from comment #1)
> Looks like the change causes the simpler conditional to be detected by the
> vectorizer as a masked operation, which in principle makes sense:
> 
> note:   vect_recog_mask_conversion_pattern: detected: iftmp.0_21 = x.1_14 >
> 255 ? iftmp.0_19 : iftmp.0_20;
> note:   mask_conversion pattern recognized: patt_43 = patt_42 ? iftmp.0_19 :
> iftmp.0_20;
> note:   extra pattern stmt: patt_40 = x.1_14 > 255;
> note:   extra pattern stmt: patt_42 = (<signed-boolean:8>) patt_40;

if we look at the ifcvt result we see

  iftmp.0_19 = (unsigned char) _18;
  iftmp.0_20 = (unsigned char) _5;
  iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
  *_6 = iftmp.0_21;

I suspect we intended to carry the fact that x.1_14 > 255 will produce
a mask from a SImode vector element compare but we need a mask suitable
for a QImode vector element select (iftmp.0_19 and iftmp.0_20).  That's
not something we can express in scalar code and thus a pattern I think.

So the pattern as generated doesn't make very much sense to me.  I suppose
it might try to convert an AVX512 mask to an AVX2-style mask, but that needs
to be done with

  patt_42 = patt_40 ? (<signed-boolean:8>)-1 : 0;

not with a conversion.  But it's also somewhat pointless since it will
simply cause the same issue again - the COND_EXPR vectorization will need
to cobble up 4 AVX512 masks to produce the desired result.
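In scalar terms the difference between the two mask styles looks like this (a sketch under the assumption of a 1-bit-per-element kmask; not GCC code):

#include <stdint.h>

/* An AVX512-style mask stores one bit per element; an AVX2-style mask
   stores an all-ones/all-zeros element.  Producing the latter from the
   former is a select (bit ? -1 : 0), not a value conversion.  */
static void kmask_to_bytemask(uint32_t k, int8_t *bytemask, int n)
{
    for (int i = 0; i < n; i++)   /* n <= 32 assumed */
        bytemask[i] = ((k >> i) & 1) ? -1 : 0;
}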
Comment 7 Richard Biener 2022-01-13 09:50:32 UTC
Forcing the pattern to not trigger produces the expected

t.c:8:6: missed:   not vectorized: relevant stmt not supported: iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;

since condition vectorization itself doesn't know how to handle this, we end
up at

  if (vectype1 && !useless_type_conversion_p (vectype, vectype1))
    return false;

with vectype V32QI and vectype1 V8SI.

Splitting out the compare from the COND_EXPR in the pattern but leaving out
the attempt to "widen" it reveals the same fact that vectorizable_condition
doesn't support packing of multiple vector defs for the mask operand.

I think that is what we need to add.  We also don't have a good representation
for "packing" of masks.

diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 3ea905538e1..729a1d32612 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
                                  rhs1_type);
        }
 
-      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
-                   TYPE_VECTOR_SUBPARTS (vectype2)))
+      /* AVX512 style masks cannot be packed/unpacked.  */
+      if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
+         && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
+                      TYPE_VECTOR_SUBPARTS (vectype2)))
        tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
       else
        tmp = rhs1;
Comment 8 Hongtao.liu 2022-01-13 10:04:52 UTC
(In reply to Richard Biener from comment #7)
> Forcing the pattern to not trigger produces the expected
> 
> t.c:8:6: missed:   not vectorized: relevant stmt not supported: iftmp.0_21 =
> x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
> 
> since condition vectorization itself doesn't know how to handle this, we end
> up at
> 
>   if (vectype1 && !useless_type_conversion_p (vectype, vectype1))
>     return false;
> 
> with vectype V32QI and vectype1 V8SI.
> 
> Splitting out the compare from the COND_EXPR in the pattern but leaving out
> the attempt to "widen" it reveals the same fact that vectorizable_condition
> doesn't support packing of multiple vector defs for the mask operand.
> 
> I think that is what we need to add.  We also don't have a good
> representation
> for "packing" of masks.
> 
Shouldn't multi_step_cvt handle this? It's just a little ambiguous between ‘vec_pack_sbool_trunc_m’ and ‘vec_pack_trunc_m’; we need to make the vectorizer use vec_pack_sbool_trunc_qi + vec_pack_trunc_qi to get a HI mask, just like how we pack 4 QImode masks into 1 SImode mask via vec_pack_trunc_qi + vec_pack_trunc_hi (in the main loop with -mprefer-vector-width=256).

 mask_patt_40.26_118 = vect_x.18_95 > { 255, 255, 255, 255, 255, 255, 255, 255 };
  mask_patt_40.26_119 = vect_x.18_96 > { 255, 255, 255, 255, 255, 255, 255, 255 };
  mask_patt_40.26_120 = vect_x.18_97 > { 255, 255, 255, 255, 255, 255, 255, 255 };
  mask_patt_40.26_121 = vect_x.18_98 > { 255, 255, 255, 255, 255, 255, 255, 255 };
  mask_patt_42.28_122 = VEC_PACK_TRUNC_EXPR <mask_patt_40.26_118, mask_patt_40.26_119>;
  mask_patt_42.28_123 = VEC_PACK_TRUNC_EXPR <mask_patt_40.26_120, mask_patt_40.26_121>;
  mask_patt_42.27_124 = VEC_PACK_TRUNC_EXPR <mask_patt_42.28_122, mask_patt_42.28_123>;
  vect_patt_43.29_125 = VEC_COND_EXPR <mask_patt_42.27_124, vect_iftmp.22_113, vect_iftmp.24_116>;
  iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
Comment 9 Richard Sandiford 2022-01-13 10:11:09 UTC
(In reply to Richard Biener from comment #7)
> I think that is what we need to add.  We also don't have a good
> representation
> for "packing" of masks.
> 
> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> index 3ea905538e1..729a1d32612 100644
> --- a/gcc/tree-vect-patterns.c
> +++ b/gcc/tree-vect-patterns.c
> @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
>                                   rhs1_type);
>         }
>  
> -      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> -                   TYPE_VECTOR_SUBPARTS (vectype2)))
> +      /* AVX512 style masks cannot be packed/unpacked.  */
> +      if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
> +         && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> +                      TYPE_VECTOR_SUBPARTS (vectype2)))
>         tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
>        else
>         tmp = rhs1;
Haven't had time to look at it properly yet, but my first impression
is that that's likely to regress SVE.  Packing and unpacking are
natural operations on boolean vector modes.
Comment 10 Hongtao.liu 2022-01-13 10:22:51 UTC
with
@@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
       c1 = VEC_PACK_TRUNC_EXPR;
       if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
 	  && VECTOR_BOOLEAN_TYPE_P (vectype)
-	  && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
+	  && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
+	      || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
 	  && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
 	optab1 = vec_pack_sbool_trunc_optab;
       else
@@ -12213,6 +12214,7 @@ supportable_narrowing_operation (enum tree_code code,
       if (VECTOR_BOOLEAN_TYPE_P (intermediate_type)
 	  && VECTOR_BOOLEAN_TYPE_P (prev_type)
 	  && intermediate_mode == prev_mode
+	  && known_lt (TYPE_VECTOR_SUBPARTS (intermediate_type), BITS_PER_UNIT)
 	  && SCALAR_INT_MODE_P (prev_mode))
 	interm_optab = vec_pack_sbool_trunc_optab;
       else

With -march=icelake-server -O3 -mprefer-vector-width=128, the loop now gets vectorized:


	vmovdqu8	(%rsi,%rax), %xmm0
	vpmovzxbw	%xmm0, %xmm2
	vpmovzxwd	%xmm2, %xmm1
	vpsrldq	$8, %xmm0, %xmm0
	vpsrldq	$8, %xmm2, %xmm2
	vpmovzxbw	%xmm0, %xmm0
	vpmovzxwd	%xmm2, %xmm2
	vpmulld	%xmm9, %xmm1, %xmm1
	vpmulld	%xmm9, %xmm2, %xmm2
	vpmovzxwd	%xmm0, %xmm4
	vpsrldq	$8, %xmm0, %xmm0
	vpmovzxwd	%xmm0, %xmm0
	vpmulld	%xmm9, %xmm4, %xmm4
	vpmulld	%xmm9, %xmm0, %xmm0
	vpcmpud	$6, %xmm6, %xmm1, %k0
	vpsubd	%xmm1, %xmm7, %xmm3
	vpcmpud	$6, %xmm6, %xmm2, %k1
	vpsubd	%xmm2, %xmm7, %xmm5
	vpsrad	$31, %xmm5, %xmm5
	vpsrad	$31, %xmm3, %xmm3
	vpermt2w	%xmm5, %xmm8, %xmm3
	vpsubd	%xmm0, %xmm7, %xmm10
	vpsubd	%xmm4, %xmm7, %xmm5
	kshiftlb	$4, %k1, %k1
	vpcmpud	$6, %xmm6, %xmm0, %k2
	vpsrad	$31, %xmm5, %xmm5
	vpsrad	$31, %xmm10, %xmm10
	kandb	%k3, %k0, %k0
	korb	%k1, %k0, %k0
	vpcmpud	$6, %xmm6, %xmm4, %k1
	vpermt2w	%xmm10, %xmm8, %xmm5
	vpermt2w	%xmm2, %xmm8, %xmm1
	vpermt2w	%xmm0, %xmm8, %xmm4
	vpermt2b	%xmm5, %xmm11, %xmm3
	vpermt2b	%xmm4, %xmm11, %xmm1
	kandb	%k3, %k1, %k1
	kshiftlb	$4, %k2, %k2
	korb	%k2, %k1, %k1
	kunpckbw	%k0, %k1, %k1
	vmovdqu8	%xmm3, %xmm1{%k1}
	vmovdqu8	%xmm1, (%rdi,%rax)
	addq	$16, %rax
	cmpq	%rax, %r8
	jne	.L4
Comment 11 Hongtao.liu 2022-01-13 10:43:00 UTC
(In reply to Hongtao.liu from comment #10)
> with
> @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
>        c1 = VEC_PACK_TRUNC_EXPR;
>        if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
>  	  && VECTOR_BOOLEAN_TYPE_P (vectype)
> -	  && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> +	  && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> +	      || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
>  	  && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
>  	optab1 = vec_pack_sbool_trunc_optab;
>        else
> @@ -12213,6 +12214,7 @@ supportable_narrowing_operation (enum tree_code code,
>        if (VECTOR_BOOLEAN_TYPE_P (intermediate_type)
>  	  && VECTOR_BOOLEAN_TYPE_P (prev_type)
>  	  && intermediate_mode == prev_mode
> +	  && known_lt (TYPE_VECTOR_SUBPARTS (intermediate_type), BITS_PER_UNIT)
>  	  && SCALAR_INT_MODE_P (prev_mode))
>  	interm_optab = vec_pack_sbool_trunc_optab;
>        else
> 
> -march=icelake-server -O3 -mprefer-vector-width=128 now can get vectorized
> loop.
> 
> 
> 	vmovdqu8	(%rsi,%rax), %xmm0
> 	vpmovzxbw	%xmm0, %xmm2
> 	vpmovzxwd	%xmm2, %xmm1
> 	vpsrldq	$8, %xmm0, %xmm0
> 	vpsrldq	$8, %xmm2, %xmm2
> 	vpmovzxbw	%xmm0, %xmm0
> 	vpmovzxwd	%xmm2, %xmm2
> 	vpmulld	%xmm9, %xmm1, %xmm1
> 	vpmulld	%xmm9, %xmm2, %xmm2
> 	vpmovzxwd	%xmm0, %xmm4
> 	vpsrldq	$8, %xmm0, %xmm0
> 	vpmovzxwd	%xmm0, %xmm0
> 	vpmulld	%xmm9, %xmm4, %xmm4
> 	vpmulld	%xmm9, %xmm0, %xmm0
> 	vpcmpud	$6, %xmm6, %xmm1, %k0
> 	vpsubd	%xmm1, %xmm7, %xmm3
> 	vpcmpud	$6, %xmm6, %xmm2, %k1
> 	vpsubd	%xmm2, %xmm7, %xmm5
> 	vpsrad	$31, %xmm5, %xmm5
> 	vpsrad	$31, %xmm3, %xmm3
> 	vpermt2w	%xmm5, %xmm8, %xmm3
> 	vpsubd	%xmm0, %xmm7, %xmm10
> 	vpsubd	%xmm4, %xmm7, %xmm5
> 	kshiftlb	$4, %k1, %k1
> 	vpcmpud	$6, %xmm6, %xmm0, %k2
> 	vpsrad	$31, %xmm5, %xmm5
> 	vpsrad	$31, %xmm10, %xmm10
> 	kandb	%k3, %k0, %k0
> 	korb	%k1, %k0, %k0
> 	vpcmpud	$6, %xmm6, %xmm4, %k1
> 	vpermt2w	%xmm10, %xmm8, %xmm5
> 	vpermt2w	%xmm2, %xmm8, %xmm1
> 	vpermt2w	%xmm0, %xmm8, %xmm4
> 	vpermt2b	%xmm5, %xmm11, %xmm3
> 	vpermt2b	%xmm4, %xmm11, %xmm1
> 	kandb	%k3, %k1, %k1
> 	kshiftlb	$4, %k2, %k2
> 	korb	%k2, %k1, %k1
> 	kunpckbw	%k0, %k1, %k1
> 	vmovdqu8	%xmm3, %xmm1{%k1}
> 	vmovdqu8	%xmm1, (%rdi,%rax)
> 	addq	$16, %rax
> 	cmpq	%rax, %r8
> 	jne	.L4

But it's still not as good as before: in the original version we only needed to pack the data produced by the vec_cond_expr, but now we additionally need to pack the mask.

before

  # x_24 = PHI <x_16(9), 0(21)>
  # vectp_src.11_73 = PHI <vectp_src.11_74(9), src_11(D)(21)>
  # vectp_dst.23_112 = PHI <vectp_dst.23_113(9), dst_13(D)(21)>
  # ivtmp_115 = PHI <ivtmp_116(9), 0(21)>
  # DEBUG x => NULL
  # DEBUG BEGIN_STMT
  _1 = (sizetype) x_24;
  _2 = src_11(D) + _1;
  vect__3.13_75 = MEM <vector(16) unsigned char> [(uint8_t *)vectp_src.11_73];
  _3 = *_2;
  vect__4.15_76 = [vec_unpack_lo_expr] vect__3.13_75;
  vect__4.15_77 = [vec_unpack_hi_expr] vect__3.13_75;
  vect__4.14_78 = [vec_unpack_lo_expr] vect__4.15_76;
  vect__4.14_79 = [vec_unpack_hi_expr] vect__4.15_76;
  vect__4.14_80 = [vec_unpack_lo_expr] vect__4.15_77;
  vect__4.14_81 = [vec_unpack_hi_expr] vect__4.15_77;
  _4 = (int) _3;
  vect__5.16_83 = vect__4.14_78 * vect_cst__82;
  vect__5.16_84 = vect__4.14_79 * vect_cst__82;
  vect__5.16_85 = vect__4.14_80 * vect_cst__82;
  vect__5.16_86 = vect__4.14_81 * vect_cst__82;
  _5 = _4 * i_scale_12(D);
  _6 = dst_13(D) + _1;
  # DEBUG x => NULL
  # DEBUG INLINE_ENTRY x264_clip_uint8
  # DEBUG BEGIN_STMT
  vect__14.17_88 = vect__5.16_83 & vect_cst__87;
  vect__14.17_89 = vect__5.16_84 & vect_cst__87;
  vect__14.17_90 = vect__5.16_85 & vect_cst__87;
  vect__14.17_91 = vect__5.16_86 & vect_cst__87;
  _14 = _5 & -256;
  vect__17.18_92 = -vect__5.16_83;
  vect__17.18_93 = -vect__5.16_84;
  vect__17.18_94 = -vect__5.16_85;
  vect__17.18_95 = -vect__5.16_86;
  _17 = -_5;
  vect__18.19_96 = vect__17.18_92 >> 31;
  vect__18.19_97 = vect__17.18_93 >> 31;
  vect__18.19_98 = vect__17.18_94 >> 31;
  vect__18.19_99 = vect__17.18_95 >> 31;
  _18 = _17 >> 31;
  iftmp.0_19 = (unsigned char) _18;
  iftmp.0_20 = (unsigned char) _5;
  _101 = vect__14.17_88 != vect_cst__100;
  vect_patt_40.20_102 = VEC_COND_EXPR <_101, vect__18.19_96, vect__5.16_83>;
  _103 = vect__14.17_89 != vect_cst__100;
  vect_patt_40.20_104 = VEC_COND_EXPR <_103, vect__18.19_97, vect__5.16_84>;
  _105 = vect__14.17_90 != vect_cst__100;
  vect_patt_40.20_106 = VEC_COND_EXPR <_105, vect__18.19_98, vect__5.16_85>;
  _107 = vect__14.17_91 != vect_cst__100;
  vect_patt_40.20_108 = VEC_COND_EXPR <_107, vect__18.19_99, vect__5.16_86>;
  vect_patt_41.22_109 = VEC_PACK_TRUNC_EXPR <vect_patt_40.20_102, vect_patt_40.20_104>;
  vect_patt_41.22_110 = VEC_PACK_TRUNC_EXPR <vect_patt_40.20_106, vect_patt_40.20_108>;
  vect_patt_41.21_111 = VEC_PACK_TRUNC_EXPR <vect_patt_41.22_109, vect_patt_41.22_110>;
  iftmp.0_21 = _14 != 0 ? iftmp.0_19 : iftmp.0_20;
  # DEBUG x => NULL
  MEM <vector(16) unsigned char> [(uint8_t *)vectp_dst.23_112] = vect_patt_41.21_111;
  # DEBUG BEGIN_STMT
  x_16 = x_24 + 1;
  # DEBUG x => x_16
  # DEBUG BEGIN_STMT
  vectp_src.11_74 = vectp_src.11_73 + 16;
  vectp_dst.23_113 = vectp_dst.23_112 + 16;
  ivtmp_116 = ivtmp_115 + 1;

after

  # x_24 = PHI <x_16(9), 0(21)>
  # vectp_src.12_78 = PHI <vectp_src.12_79(9), src_11(D)(21)>
  # vectp_dst.30_123 = PHI <vectp_dst.30_124(9), dst_13(D)(21)>
  # ivtmp_126 = PHI <ivtmp_127(9), 0(21)>
  _1 = (sizetype) x_24;
  _2 = src_11(D) + _1;
  vect__3.14_80 = MEM <vector(16) unsigned char> [(uint8_t *)vectp_src.12_78];
  _3 = *_2;
  vect__4.16_81 = [vec_unpack_lo_expr] vect__3.14_80;
  vect__4.16_82 = [vec_unpack_hi_expr] vect__3.14_80;
  vect__4.15_83 = [vec_unpack_lo_expr] vect__4.16_81;
  vect__4.15_84 = [vec_unpack_hi_expr] vect__4.16_81;
  vect__4.15_85 = [vec_unpack_lo_expr] vect__4.16_82;
  vect__4.15_86 = [vec_unpack_hi_expr] vect__4.16_82;
  _4 = (int) _3;
  vect__5.17_88 = vect__4.15_83 * vect_cst__87;
  vect__5.17_89 = vect__4.15_84 * vect_cst__87;
  vect__5.17_90 = vect__4.15_85 * vect_cst__87;
  vect__5.17_91 = vect__4.15_86 * vect_cst__87;
  _5 = _4 * i_scale_12(D);
  _6 = dst_13(D) + _1;
  vect_x.18_92 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__5.17_88);
  vect_x.18_93 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__5.17_89);
  vect_x.18_94 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__5.17_90);
  vect_x.18_95 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__5.17_91);
  x.1_14 = (unsigned int) _5;
  vect__41.19_96 = -vect_x.18_92;
  vect__41.19_97 = -vect_x.18_93;
  vect__41.19_98 = -vect_x.18_94;
  vect__41.19_99 = -vect_x.18_95;
  _41 = -x.1_14;
  vect__17.20_100 = VIEW_CONVERT_EXPR<vector(4) int>(vect__41.19_96);
  vect__17.20_101 = VIEW_CONVERT_EXPR<vector(4) int>(vect__41.19_97);
  vect__17.20_102 = VIEW_CONVERT_EXPR<vector(4) int>(vect__41.19_98);
  vect__17.20_103 = VIEW_CONVERT_EXPR<vector(4) int>(vect__41.19_99);
  _17 = (int) _41;
  vect__18.21_104 = vect__17.20_100 >> 31;
  vect__18.21_105 = vect__17.20_101 >> 31;
  vect__18.21_106 = vect__17.20_102 >> 31;
  vect__18.21_107 = vect__17.20_103 >> 31;
  _18 = _17 >> 31;
  vect_iftmp.23_108 = VEC_PACK_TRUNC_EXPR <vect__18.21_104, vect__18.21_105>;
  vect_iftmp.23_109 = VEC_PACK_TRUNC_EXPR <vect__18.21_106, vect__18.21_107>;
  vect_iftmp.22_110 = VEC_PACK_TRUNC_EXPR <vect_iftmp.23_108, vect_iftmp.23_109>;
  iftmp.0_19 = (unsigned char) _18;
  vect_iftmp.25_111 = VEC_PACK_TRUNC_EXPR <vect__5.17_88, vect__5.17_89>;
  vect_iftmp.25_112 = VEC_PACK_TRUNC_EXPR <vect__5.17_90, vect__5.17_91>;
  vect_iftmp.24_113 = VEC_PACK_TRUNC_EXPR <vect_iftmp.25_111, vect_iftmp.25_112>;
  iftmp.0_20 = (unsigned char) _5;
  mask_patt_40.26_115 = vect_x.18_92 > { 255, 255, 255, 255 };
  mask_patt_40.26_116 = vect_x.18_93 > { 255, 255, 255, 255 };
  mask_patt_40.26_117 = vect_x.18_94 > { 255, 255, 255, 255 };
  mask_patt_40.26_118 = vect_x.18_95 > { 255, 255, 255, 255 };
  mask_patt_42.28_119 = VEC_PACK_TRUNC_EXPR <mask_patt_40.26_115, mask_patt_40.26_116>;
  mask_patt_42.28_120 = VEC_PACK_TRUNC_EXPR <mask_patt_40.26_117, mask_patt_40.26_118>;
  mask_patt_42.27_121 = VEC_PACK_TRUNC_EXPR <mask_patt_42.28_119, mask_patt_42.28_120>;
  vect_patt_43.29_122 = VEC_COND_EXPR <mask_patt_42.27_121, vect_iftmp.22_110, vect_iftmp.24_113>;
  iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20;
  MEM <vector(16) unsigned char> [(uint8_t *)vectp_dst.30_123] = vect_patt_43.29_122;
  x_16 = x_24 + 1;
  vectp_src.12_79 = vectp_src.12_78 + 16;
  vectp_dst.30_124 = vectp_dst.30_123 + 16;
  ivtmp_127 = ivtmp_126 + 1;
Comment 12 rguenther@suse.de 2022-01-13 10:44:12 UTC
On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
> with
> @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
>        c1 = VEC_PACK_TRUNC_EXPR;
>        if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
>           && VECTOR_BOOLEAN_TYPE_P (vectype)
> -         && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> +         && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> +             || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
>           && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))

I think we instead simply want

         if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
             && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1
             && VECTOR_BOOLEAN_TYPE_P (vectype)
             && TYPE_PRECISION (TREE_TYPE (vectype)) == 1)

note the docs of vec_pack_sbool_trunc say

This instruction pattern is used when all the vector input and output
operands have the same scalar mode @var{m} and thus using
@code{vec_pack_trunc_@var{m}} would be ambiguous.

It also says "_Narrow_ and merge the elements of two vectors."  I think
"narrow" is misleading here, as is _trunc in the optab name.  So
with the above it suggests we could have used vec_pack_trunc_hi here?

To avoid breaking things for the VnBImode-using targets we probably
want to retain the SCALAR_INT_MODE_P (prev_mode) check.  And we
probably want to adjust the documentation a bit.

Is this all with my pasted pattern patch, or is this with the weird
inserted conversion still?
Comment 13 rguenther@suse.de 2022-01-13 10:49:55 UTC
On Thu, 13 Jan 2022, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #9 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #7)
> > I think that is what we need to add.  We also don't have a good
> > representation
> > for "packing" of masks.
> > 
> > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> > index 3ea905538e1..729a1d32612 100644
> > --- a/gcc/tree-vect-patterns.c
> > +++ b/gcc/tree-vect-patterns.c
> > @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
> >                                   rhs1_type);
> >         }
> >  
> > -      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > -                   TYPE_VECTOR_SUBPARTS (vectype2)))
> > +      /* AVX512 style masks cannot be packed/unpacked.  */
> > +      if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
> > +         && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > +                      TYPE_VECTOR_SUBPARTS (vectype2)))
> >         tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
> >        else
> >         tmp = rhs1;
> Haven't had time to look at it properly yet, but my first impression
> is that that's likely to regress SVE.  Packing and unpacking are
> natural operations on boolean vector modes.

Sure, but we can't produce scalar code mimicking this for
1-bit element vectors.  It "works" for the others because, based
on the width of the elements, we choose a vector with a different
number of elements.  But here the pattern produces an 8-bit element,
which isn't what is desired here.
Comment 14 Hongtao.liu 2022-01-13 10:50:35 UTC
> 
> But still not as good as before, since original version we only need to pack
> data which is produced by vec_cond_expr, but now need to extraly pack mask.
> 
>

Also, it looks like a regression for non-AVX512, non-SVE targets.

https://godbolt.org/z/4aavMjjG4
Comment 15 Hongtao.liu 2022-01-13 12:40:18 UTC
(In reply to rguenther@suse.de from comment #12)
> On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> > 
> > --- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
> > with
> > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
> >        c1 = VEC_PACK_TRUNC_EXPR;
> >        if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
> >           && VECTOR_BOOLEAN_TYPE_P (vectype)
> > -         && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > +         && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > +             || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
> >           && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
> 
> I think we instead simply want
> 
>          if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
>              && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1
>              && VECTOR_BOOLEAN_TYPE_P (vectype)
>              && TYPE_PRECISION (TREE_TYPE (vectype)) == 1)
> 
> note the docs of vec_pack_sbool_trunc say
> 
> This instruction pattern is used when all the vector input and output
> operands have the same scalar mode @var{m} and thus using
> @code{vec_pack_trunc_@var{m}} would be ambiguous.
> 
> It also says "_Narrow_ and merge the elements of two vectors.", I think
> "narrow" is misleading here, _trunc in the optab name as well.  So
> with the above it suggests we could have used vect_pack_trunc_hi here?
> 
> To avoid breaking things for the VnBImode using targets we probably
> want to retain the SCALAR_INT_MODE_P (prev_mode) check.  And we
> probably want to adjust the documentation a bit.
> 
> This all is with my pasted pattern patch or is this with the weird
> inserted conversion still?

It's w/o your patch. I'm trying to handle the weird conversion with multiple steps (first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi, then pack QI:8 -> HI:16 through vec_pack_sbool_trunc_hi). But on the other side, the weird inserted conversion shouldn't exist.
Comment 16 rguenther@suse.de 2022-01-13 13:42:32 UTC
On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #15 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to rguenther@suse.de from comment #12)
> > On Thu, 13 Jan 2022, crazylht at gmail dot com wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> > > 
> > > --- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
> > > with
> > > @@ -12120,7 +12120,8 @@ supportable_narrowing_operation (enum tree_code code,
> > >        c1 = VEC_PACK_TRUNC_EXPR;
> > >        if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
> > >           && VECTOR_BOOLEAN_TYPE_P (vectype)
> > > -         && TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > > +         && (TYPE_MODE (narrow_vectype) == TYPE_MODE (vectype)
> > > +             || known_lt (TYPE_VECTOR_SUBPARTS (vectype), BITS_PER_UNIT))
> > >           && SCALAR_INT_MODE_P (TYPE_MODE (vectype)))
> > 
> > I think we instead simply want
> > 
> >          if (VECTOR_BOOLEAN_TYPE_P (narrow_vectype)
> >              && TYPE_PRECISION (TREE_TYPE (narrow_vectype)) == 1
> >              && VECTOR_BOOLEAN_TYPE_P (vectype)
> >              && TYPE_PRECISION (TREE_TYPE (vectype)) == 1)
> > 
> > note the docs of vec_pack_sbool_trunc say
> > 
> > This instruction pattern is used when all the vector input and output
> > operands have the same scalar mode @var{m} and thus using
> > @code{vec_pack_trunc_@var{m}} would be ambiguous.
> > 
> > It also says "_Narrow_ and merge the elements of two vectors.", I think
> > "narrow" is misleading here, _trunc in the optab name as well.  So
> > with the above it suggests we could have used vect_pack_trunc_hi here?
> > 
> > To avoid breaking things for the VnBImode using targets we probably
> > want to retain the SCALAR_INT_MODE_P (prev_mode) check.  And we
> > probably want to adjust the documentation a bit.
> > 
> > This all is with my pasted pattern patch or is this with the weird
> > inserted conversion still?
> 
> It's w/o your patch, I'm try to handle the weird conversion with multi
> steps(first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi, then pack QI:8
> -> HI:16 through vec_pack_sbool_trunc_hi). But on the othersize the weird
> inserted conversion shouldn't be existed.

But the weird conversion suggests packing { 0, 1, 0, 1 } as
{ 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ... }
thus expanding each bit to 8 bits.  So it's rather an unpacking :/
As said, the scalar conversion does not make any sense...

But maybe I'm missing something very obvious?
Comment 17 Hongtao.liu 2022-01-13 14:42:52 UTC
> As said, the scalar conversion does not make any sense...
Agree.
Comment 18 Richard Sandiford 2022-01-14 09:40:23 UTC
Again haven't really looked at this in detail, so could be wrong, but:

(In reply to rguenther@suse.de from comment #13)
> On Thu, 13 Jan 2022, rsandifo at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> > 
> > --- Comment #9 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #7)
> > > I think that is what we need to add.  We also don't have a good
> > > representation
> > > for "packing" of masks.
> > > 
> > > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> > > index 3ea905538e1..729a1d32612 100644
> > > --- a/gcc/tree-vect-patterns.c
> > > +++ b/gcc/tree-vect-patterns.c
> > > @@ -4679,8 +4679,10 @@ vect_recog_mask_conversion_pattern (vec_info *vinfo,
> > >                                   rhs1_type);
> > >         }
> > >  
> > > -      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > > -                   TYPE_VECTOR_SUBPARTS (vectype2)))
> > > +      /* AVX512 style masks cannot be packed/unpacked.  */
> > > +      if (TYPE_PRECISION (TREE_TYPE (vectype2)) != 1
> > > +         && maybe_ne (TYPE_VECTOR_SUBPARTS (vectype1),
> > > +                      TYPE_VECTOR_SUBPARTS (vectype2)))
> > >         tmp = build_mask_conversion (vinfo, rhs1, vectype1, stmt_vinfo);
> > >        else
> > >         tmp = rhs1;
> > Haven't had time to look at it properly yet, but my first impression
> > is that that's likely to regress SVE.  Packing and unpacking are
> > natural operations on boolean vector modes.
> 
> Sure, but we can't produce scalar code mimicking this for
> 1 bit element vectors.
Yeah, but at least in the SVE case, we're not aiming to do that.
The vector boolean type that we want to use is recorded in the
STMT_VINFO_VECTYPE of the conversion instead, and doesn't get
recomputed later.

Like you said in the earlier comment, the fact that we can't
do that on scalar code is why this needs to be a vector pattern.

(As discussed elsewhere, it would be good if we didn't commit
vector types early like this.  But in this case I don't think
we can rely on scalar code to model the conversion either.)
Comment 19 Richard Biener 2022-01-14 10:18:57 UTC
Ah, so the issue is the missing -mavx512bw, which means we end up with an
AVX2-style mask for V32QImode.  With -mavx512bw the code vectorizes fine.

I guess ix86_get_mask_mode is too restrictive here?

And indeed, to build an AVX2-style mask vector from an AVX512-style mask
vector we'd need to use a ?:, not a simple conversion.  Unfortunately we
don't support vectorizing that:

t.c:10:21: note:   ==> examining statement: patt_42 = patt_40 ? -1 : 0;
t.c:10:21: note:   vect_is_simple_use: operand x.1_14 > 255, type of def: internal
t.c:10:21: note:   vect_is_simple_use: vectype vector(8) <signed-boolean:1>
t.c:10:21: note:   vect_is_simple_use: operand -1, type of def: constant
t.c:10:21: note:   vect_is_simple_use: operand 0, type of def: constant
t.c:8:6: missed:   not vectorized: relevant stmt not supported: patt_42 = patt_40 ? -1 : 0;

Or maybe we need to tweak the conversion vectorization for better support
of this case, as noted in comment #10 ... but the code generated from that,
AFAIU, produces a HImode mask, not a V32QImode one, which puts the fault
on ix86_get_mask_mode again.
Comment 20 Hongtao.liu 2022-01-17 08:22:00 UTC
(In reply to Richard Biener from comment #19)
> Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> style
> mask for V32QImode.  With -mavx512bw the code vectorizes fine.

The vectorized code is worse than before; now we need to pack the vectorized mask, which takes an extra 3 instructions.
Comment 21 Hongtao.liu 2022-01-18 08:31:34 UTC
(In reply to Hongtao.liu from comment #20)
> (In reply to Richard Biener from comment #19)
> > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> > style
> > mask for V32QImode.  With -mavx512bw the code vectorizes fine.
> 
> Vectorization code is worse than before, now we need to pack vectorized mask
> which takes extra 3 instructions.

Currently ifcvt converts

---------dump of .ch_vect-------
  if (x.1_14 > 255)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 477815112]:
  _17 = -_5;
  _18 = _17 >> 31;
  iftmp.0_19 = (unsigned char) _18;
  goto <bb 6>; [100.00%]

  <bb 5> [local count: 477815112]:
  iftmp.0_20 = (unsigned char) _5;

  <bb 6> [local count: 955630225]:
  # iftmp.0_21 = PHI <iftmp.0_19(4), iftmp.0_20(5)>
-------dump end---------


to 
---- dump of .ifcvt---------
  _41 = -x.1_14;
  _17 = (int) _41;
  _18 = _17 >> 31;
  iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc
  iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc
  iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc
  *_6 = iftmp.0_21;
  x_16 = x_24 + 1;
-----dump end----------


If ifcvt output something like
------------optimal .ifcvt------
  _41 = -x.1_14;
  _17 = (int) _41;
  _18 = _17 >> 31;
  iftmp.0_21 = x.1_14 > 255 ? _18 : _5;
  iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc
  *_6 = iftmp.0_22;
  x_16 = x_24 + 1;
------------end------------

we can save operations for packing the mask (3 vec_pack_trunc vs 1 vec_pack_trunc?).
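In source terms the two shapes correspond to this sketch (illustrative only): the first form narrows each arm separately and selects on narrow data, so the wide comparison mask has to be packed to match; the second selects in the wide type and narrows the result once.

#include <stdint.h>

/* Current shape: the select is on already-narrowed arms, so the wide
   mask must be packed down to the narrow element size.  */
static uint8_t narrow_arms_then_select(unsigned x, int t, int f)
{
    return x > 255 ? (uint8_t) t : (uint8_t) f;
}

/* Preferred shape: select in the wide type, narrow once at the end.  */
static uint8_t select_then_narrow_once(unsigned x, int t, int f)
{
    return (uint8_t) (x > 255 ? t : f);
}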
Comment 22 Hongtao.liu 2022-01-18 08:36:11 UTC
(In reply to Hongtao.liu from comment #21)
> (In reply to Hongtao.liu from comment #20)
> > (In reply to Richard Biener from comment #19)
> > > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> > > style
> > > mask for V32QImode.  With -mavx512bw the code vectorizes fine.
> > 
> > Vectorization code is worse than before, now we need to pack vectorized mask
> > which takes extra 3 instructions.
> 
> Current ifcvt convert
> 
> ---------dump of .ch_vect-------
>   if (x.1_14 > 255)
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 5>; [50.00%]
> 
>   <bb 4> [local count: 477815112]:
>   _17 = -_5;
>   _18 = _17 >> 31;
>   iftmp.0_19 = (unsigned char) _18;
>   goto <bb 6>; [100.00%]
> 
>   <bb 5> [local count: 477815112]:
>   iftmp.0_20 = (unsigned char) _5;
> 
>   <bb 6> [local count: 955630225]:
>   # iftmp.0_21 = PHI <iftmp.0_19(4), iftmp.0_20(5)>
> -------dump end---------
> 
> 
> to 
> ---- dump of .ifcvt---------
>   _41 = -x.1_14;
>   _17 = (int) _41;
>   _18 = _17 >> 31;
>   iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc
>   iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc
>   iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc
>   *_6 = iftmp.0_21;
>   x_16 = x_24 + 1;
> -----dump end----------
> 
> 
> if ifcvt output things like
> ------------optimal .ifcvt------
>   _41 = -x.1_14;
>   _17 = (int) _41;
>   _18 = _17 >> 31;
>   iftmp.0_21 = x.1_14 > 255 ? _18 : _5;
>   iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc
>   *_6 = iftmp.0_22;
>   x_16 = x_24 + 1;
> ------------end------------
> 
> we can save operations for packing mask(3 vec_pack_trunc vs 1
> vec_pack_trunc?).

Or maybe a gimple simplification for it?
Comment 23 rguenther@suse.de 2022-01-18 10:44:45 UTC
On Tue, 18 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Hongtao.liu from comment #21)
> > (In reply to Hongtao.liu from comment #20)
> > > (In reply to Richard Biener from comment #19)
> > > > Ah, so the issue is missing -mavx512bw which means we end up with a AVX2
> > > > style
> > > > mask for V32QImode.  With -mavx512bw the code vectorizes fine.
> > > 
> > > Vectorization code is worse than before, now we need to pack vectorized mask
> > > which takes extra 3 instructions.
> > 
> > Current ifcvt convert
> > 
> > ---------dump of .ch_vect-------
> >   if (x.1_14 > 255)
> >     goto <bb 4>; [50.00%]
> >   else
> >     goto <bb 5>; [50.00%]
> > 
> >   <bb 4> [local count: 477815112]:
> >   _17 = -_5;
> >   _18 = _17 >> 31;
> >   iftmp.0_19 = (unsigned char) _18;
> >   goto <bb 6>; [100.00%]
> > 
> >   <bb 5> [local count: 477815112]:
> >   iftmp.0_20 = (unsigned char) _5;
> > 
> >   <bb 6> [local count: 955630225]:
> >   # iftmp.0_21 = PHI <iftmp.0_19(4), iftmp.0_20(5)>
> > -------dump end---------
> > 
> > 
> > to 
> > ---- dump of .ifcvt---------
> >   _41 = -x.1_14;
> >   _17 = (int) _41;
> >   _18 = _17 >> 31;
> >   iftmp.0_19 = (unsigned char) _18; -- vec_pack_trunc
> >   iftmp.0_20 = (unsigned char) _5; -- vec_pack_trunc
> >   iftmp.0_21 = x.1_14 > 255 ? iftmp.0_19 : iftmp.0_20; -- vec_pack_trunc
> >   *_6 = iftmp.0_21;
> >   x_16 = x_24 + 1;
> > -----dump end----------
> > 
> > 
> > if ifcvt output things like
> > ------------optimal .ifcvt------
> >   _41 = -x.1_14;
> >   _17 = (int) _41;
> >   _18 = _17 >> 31;
> >   iftmp.0_21 = x.1_14 > 255 ? _18 : _5;
> >   iftmp.0_22 = (unsigned char) iftmp.0_21; --- vec_pack_trunc
> >   *_6 = iftmp.0_22;
> >   x_16 = x_24 + 1;
> > ------------end------------
> > 
> > we can save operations for packing mask(3 vec_pack_trunc vs 1
> > vec_pack_trunc?).
> 
> Or maybe a gimple simplification for it?

Yes, I think that's a candidate for a match.pd simplification.
Fortunately if-conversion already folds the built stmts.
Comment 24 Andrew Pinski 2022-01-18 10:49:34 UTC
Note phiopt has code which is supposed to factor out the cast, but for various reasons I have noticed it almost never does it (I still don't understand the cases when it does and when it doesn't yet). This was something I was going to fix for GCC 13.
Comment 25 Hongtao.liu 2022-01-19 06:03:26 UTC
in fold_unary_loc
---cut from fold-const.cc-----
 9276      else if (TREE_CODE (arg0) == COND_EXPR)
 9277        {
 9278          tree arg01 = TREE_OPERAND (arg0, 1);
 9279          tree arg02 = TREE_OPERAND (arg0, 2);
 9280          if (! VOID_TYPE_P (TREE_TYPE (arg01)))
 9281            arg01 = fold_build1_loc (loc, code, type,
 9282                                 fold_convert_loc (loc,
 9283                                                   TREE_TYPE (op0), arg01));
 9284          if (! VOID_TYPE_P (TREE_TYPE (arg02)))
 9285            arg02 = fold_build1_loc (loc, code, type,
 9286                                 fold_convert_loc (loc,
 9287                                                   TREE_TYPE (op0), arg02));
 9288=>        tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND (arg0, 0),
 9289                             arg01, arg02);

-----------end---------------

gcc always tries to simplify (convert (cond (cmp a b) c d)) -> (cond (cmp a b) (convert c) (convert d)), exactly the opposite of what this case wants.
Comment 26 Andrew Pinski 2022-01-19 06:10:56 UTC
(In reply to Hongtao.liu from comment #25)
> 
> gcc always tries to simplify (convert (cond (cmp a b) c d) ---- > (cond (cmp
> a b) (convert c) (convert d)), exactly the opposite of what this case wants.

Right, that is for GENERIC. For the GIMPLE level, the function factor_out_conditional_conversion in phiopt is supposed to undo this, but as I mentioned it does not always.
Comment 27 Hongtao.liu 2022-01-19 06:25:29 UTC
(In reply to Andrew Pinski from comment #26)
> (In reply to Hongtao.liu from comment #25)
> > 
> > gcc always tries to simplify (convert (cond (cmp a b) c d) ---- > (cond (cmp
> > a b) (convert c) (convert d)), exactly the opposite of what this case wants.
> 
> Right that is for generic. For the gimple level the function
> factor_out_conditional_conversion in phiopt is supposed to undo this but as
Oh, thanks for the info. I'll leave it to you in GCC 13.
Comment 28 Richard Biener 2022-01-19 07:44:20 UTC
(In reply to Hongtao.liu from comment #25)
> in fold_unary_loc
> ---cut from fold-const.cc-----
>  9276      else if (TREE_CODE (arg0) == COND_EXPR)
>  9277        {
>  9278          tree arg01 = TREE_OPERAND (arg0, 1);
>  9279          tree arg02 = TREE_OPERAND (arg0, 2);
>  9280          if (! VOID_TYPE_P (TREE_TYPE (arg01)))
>  9281            arg01 = fold_build1_loc (loc, code, type,
>  9282                                 fold_convert_loc (loc,
>  9283                                                   TREE_TYPE (op0),
> arg01));
>  9284          if (! VOID_TYPE_P (TREE_TYPE (arg02)))
>  9285            arg02 = fold_build1_loc (loc, code, type,
>  9286                                 fold_convert_loc (loc,
>  9287                                                   TREE_TYPE (op0),
> arg02));
>  9288=>        tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND
> (arg0, 0),
>  9289                             arg01, arg02);
> 
> -----------end---------------
> 
> gcc always tries to simplify (convert (cond (cmp a b) c d) ---- > (cond (cmp
> a b) (convert c) (convert d)), exactly the opposite of what this case wants.

It also then undoes this if the result didn't simplify, and plays tricks
to avoid recursion.

I think this particular transform ought to be specialized, maybe to
(T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only.

The "cheap" way of avoiding a conflict is to wrap the match.pd pattern
with opposite logic in

#if GIMPLE
#endif

(with a comment explaining this)

Note that we can move a conversion out only if the sources of the
conversions have compatible types, but we can always move a conversion in.

Alternatively this transform can also be done in a vectorizer pattern based
on vector compatibility of the ?: predicate with the data.
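A small example of the compatible-types requirement (an illustrative sketch with assumed values, not from the patch): with mixed signedness the two arms extend differently, so the conversion cannot be moved out.

#include <stdint.h>

/* The true arm zero-extends and the false arm sign-extends.  With
   u = 0x8000 and s = -1 the arms yield 32768 and -1; no single 16-bit
   conditional followed by one extension reproduces both results.  */
static int cannot_hoist_conversion(int c, uint16_t u, int16_t s)
{
    return c ? (int) u : (int) s;
}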
Comment 29 Hongtao.liu 2022-01-19 08:33:17 UTC
(In reply to Richard Biener from comment #28)
> (In reply to Hongtao.liu from comment #25)
> > in fold_unary_loc
> > ---cut from fold-const.cc-----
> >  9276      else if (TREE_CODE (arg0) == COND_EXPR)
> >  9277        {
> >  9278          tree arg01 = TREE_OPERAND (arg0, 1);
> >  9279          tree arg02 = TREE_OPERAND (arg0, 2);
> >  9280          if (! VOID_TYPE_P (TREE_TYPE (arg01)))
> >  9281            arg01 = fold_build1_loc (loc, code, type,
> >  9282                                 fold_convert_loc (loc,
> >  9283                                                   TREE_TYPE (op0),
> > arg01));
> >  9284          if (! VOID_TYPE_P (TREE_TYPE (arg02)))
> >  9285            arg02 = fold_build1_loc (loc, code, type,
> >  9286                                 fold_convert_loc (loc,
> >  9287                                                   TREE_TYPE (op0),
> > arg02));
> >  9288=>        tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND
> > (arg0, 0),
> >  9289                             arg01, arg02);
> > 
> > -----------end---------------
> > 
> > gcc always tries to simplify (convert (cond (cmp a b) c d) ---- > (cond (cmp
> > a b) (convert c) (convert d)), exactly the opposite of what this case wants.
> 
> It also then undos this if the result didn't simplify and plays trick to
> avoid
> recursions.
> 
> I think this particular transform ought to be specialized, maybe to
> (T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only.
> 
> The "cheap" way of avoiding a conflict is to wrap the match.pd pattern
> with opposite logic in
> 
> #if GIMPLE
> #endif
> 
It doesn't work.
> (with a comment explaining this)
> 
> Note that we can move a conversion out only if the sources of the conversions
> have compatible types but we always can move a conversion in.
> 
> Alternatively this transform can also be done in a vectorizer pattern based
> on vector compatibility of the ?: predicate with the data.
Yes, I'm thinking of doing this in fold_build_cond_expr, which is only used by pass_ifcvt to generate cond_exprs.
Comment 30 rguenther@suse.de 2022-01-19 08:48:44 UTC
On Wed, 19 Jan 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103771
> 
> --- Comment #29 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #28)
> > (In reply to Hongtao.liu from comment #25)
> > > in fold_unary_loc
> > > ---cut from fold-const.cc-----
> > >  9276      else if (TREE_CODE (arg0) == COND_EXPR)
> > >  9277        {
> > >  9278          tree arg01 = TREE_OPERAND (arg0, 1);
> > >  9279          tree arg02 = TREE_OPERAND (arg0, 2);
> > >  9280          if (! VOID_TYPE_P (TREE_TYPE (arg01)))
> > >  9281            arg01 = fold_build1_loc (loc, code, type,
> > >  9282                                 fold_convert_loc (loc,
> > >  9283                                                   TREE_TYPE (op0),
> > > arg01));
> > >  9284          if (! VOID_TYPE_P (TREE_TYPE (arg02)))
> > >  9285            arg02 = fold_build1_loc (loc, code, type,
> > >  9286                                 fold_convert_loc (loc,
> > >  9287                                                   TREE_TYPE (op0),
> > > arg02));
> > >  9288=>        tem = fold_build3_loc (loc, COND_EXPR, type, TREE_OPERAND
> > > (arg0, 0),
> > >  9289                             arg01, arg02);
> > > 
> > > -----------end---------------
> > > 
> > > gcc always tries to simplify (convert (cond (cmp a b) c d) ---- > (cond (cmp
> > > a b) (convert c) (convert d)), exactly the opposite of what this case wants.
> > 
> > It also then undos this if the result didn't simplify and plays trick to
> > avoid
> > recursions.
> > 
> > I think this particular transform ought to be specialized, maybe to
> > (T)p?(T')a:(T')b or maybe done during gimplification or RTL expansion only.
> > 
> > The "cheap" way of avoiding a conflict is to wrap the match.pd pattern
> > with opposite logic in
> > 
> > #if GIMPLE
> > #endif
> > 
> It doesn't work, 
> > (with a comment explaining this)
> > 
> > Note that we can move a conversion out only if the sources of the conversions
> > have compatible types but we always can move a conversion in.
> > 
> > Alternatively this transform can also be done in a vectorizer pattern based
> > on vector compatibility of the ?: predicate with the data.
> yes, I'm thinking of doing this in fold_build_cond_expr which is only used by
> pass_ifcvt to generate cond_expr.

That's also a reasonable place, but the vectorizer pattern recog phase
might have more contextual information to determine the best types
to use.  fold_build_cond_expr is probably easiest to adjust, though.
Comment 31 GCC Commits 2022-01-20 08:52:03 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:8bc700f4c3fbe405413db02281ef2918bfa831fc

commit r12-6756-g8bc700f4c3fbe405413db02281ef2918bfa831fc
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jan 17 10:47:46 2022 +0800

    Enhance vec_pack_trunc for integral mode mask.
    
    For the testcase in the PR, the patch supports the QI:4 -> HI:16 pack with
    multiple steps (first pack QI:4 -> QI:8 through vec_pack_sbool_trunc_qi,
    then pack QI:8 -> HI:16 through vec_pack_trunc_hi).
    Similar for QI:2 -> HI:16, which is test4 in mask-pack-prefer-128.c.
    
    gcc/ChangeLog:
    
            PR target/103771
            * tree-vect-stmts.cc (supportable_narrowing_operation): Enhance
            integral mode mask pack by multi steps which takes
            vec_pack_sbool_trunc_optab as start when elements number is
            less than BITS_PER_UNITS.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/i386/mask-pack-prefer128.c: New test.
            * gcc.target/i386/mask-pack-prefer256.c: New test.
            * gcc.target/i386/pr103771.c: New test.
Comment 32 Hongtao.liu 2022-01-27 06:03:11 UTC
A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589168.html
Comment 33 Andrew Pinski 2022-02-01 05:21:52 UTC
Oh yes, I remember why factor_out_conditional_conversion is not doing it: it is because we have two bbs, but the current code does not handle that case.
That is, we have:
  if (x.1_1 > 255)
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  _2 = -x_5(D);
  _3 = _2 >> 31;
  iftmp.0_7 = (uint8_t) _3;
  goto <bb 5>; [INV]

  <bb 4> :
  iftmp.0_6 = (uint8_t) x_5(D);

  <bb 5> :
  # iftmp.0_4 = PHI <iftmp.0_7(3), iftmp.0_6(4)>

Whereas the current code only handles the case where one of the bbs (bb 4 here) does not exist. Yes, I am working on this for GCC 13.
Comment 34 Jeffrey A. Law 2022-02-01 16:13:22 UTC
I've always wanted to see us be able to push something like those matched conversions down through the PHI.

That would make the code look like:

  if (x.1_1 > 255)
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  _2 = -x_5(D);
  _3 = _2 >> 31;
  goto <bb 5>; [INV]

  <bb 4> :

  <bb 5> :
  # tmp = PHI <_3, x_5(4)>
  iftmp.0_4 = (uint8_t) tmp

And presumably we'd clean up the empty bb4 which would in turn unblock other optimizations.

Is that what you're working on?
Comment 35 Andrew Pinski 2022-02-01 16:18:30 UTC
(In reply to Jeffrey A. Law from comment #34)
> I've always wanted to see us be able to push something like those matched
> conversions down through the PHI.
> 
....

> Is that what you're working on?

Yes.
Comment 36 GCC Commits 2022-02-13 09:58:32 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:7e204bd2f189850cb940677c99d8d93eb7dd40cd

commit r12-7216-g7e204bd2f189850cb940677c99d8d93eb7dd40cd
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jan 24 11:05:47 2022 +0800

    Add vect_recog_cond_expr_convert_pattern.
    
    The pattern converts (cond (cmp a b) (convert c) (convert d))
    to (convert (cond (cmp a b) c d)) when
    1) types_match (c, d)
    2) single_use for (convert c) and (convert d)
    3) TYPE_PRECISION (TREE_TYPE (c)) == TYPE_PRECISION (TREE_TYPE (a))
    4) INTEGRAL_TYPE_P (TREE_TYPE (c))
    
    The pattern can save packing of mask and data (partial for data, 2 vs
    1).
    
    gcc/ChangeLog:
    
            PR target/103771
            * match.pd (cond_expr_convert_p): New match.
            * tree-vect-patterns.cc (gimple_cond_expr_convert_p): Declare.
            (vect_recog_cond_expr_convert_pattern): New.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/i386/pr103771-2.c: New test.
            * gcc.target/i386/pr103771-3.c: New test.
Comment 37 Richard Biener 2022-02-14 07:36:51 UTC
Is this now "good enough" for GCC 12?
Comment 38 Hongtao.liu 2022-02-14 07:45:23 UTC
(In reply to Richard Biener from comment #37)
> Is this now "good enough" for GCC 12?

Yes.
Comment 39 Richard Biener 2022-02-14 07:53:18 UTC
Fixed for the target regression part; Andrew is working on a more generic fix for GCC 13.
Comment 40 GCC Commits 2022-02-17 11:01:54 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:754dce903ca28c4c2f2bc8614a8de5e631655f2e

commit r12-7276-g754dce903ca28c4c2f2bc8614a8de5e631655f2e
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Feb 16 12:15:18 2022 +0800

    Restrict the two sources of vect_recog_cond_expr_convert_pattern to be of the same type when convert is extension.
    
    It's not an equality-preserving transform to rewrite

     (cond (cmp @1 @2) (convert@3 @4) (convert@5 @6))

     to

     (convert (cond (cmp @1 @2) @4 @6))

    when (convert@3 @4) is an extension, because of zero_extend vs sign_extend.
    
    gcc/ChangeLog:
    
            PR tree-optimization/104551
            PR tree-optimization/103771
            * match.pd (cond_expr_convert_p): Add types_match check when
            convert is extension.
            * tree-vect-patterns.cc
            (gimple_cond_expr_convert_p): Adjust comments.
            (vect_recog_cond_expr_convert_pattern): Ditto.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/i386/pr104551.c: New test.
Comment 41 Richard Biener 2023-04-26 06:55:37 UTC
GCC 13.1 is being released, retargeting bugs to GCC 13.2.
Comment 42 Andrew Pinski 2023-05-06 23:21:41 UTC
The patch is simple after recent cleanups:
diff --git a/gcc/tree-ssa-phiopt.cc b/gcc/tree-ssa-phiopt.cc
index f14b7e8b7e6..41fea78dc8d 100644
--- a/gcc/tree-ssa-phiopt.cc
+++ b/gcc/tree-ssa-phiopt.cc
@@ -4087,7 +4087,7 @@ pass_phiopt::execute (function *)

       gphi *newphi;
       if (single_pred_p (bb1)
-         && !diamond_p
+         && EDGE_COUNT (merge->preds) == 2
          && (newphi = factor_out_conditional_conversion (e1, e2, phi,
                                                          arg0, arg1,
                                                          cond_stmt)))


I am putting together testcases for this right now.
Comment 43 GCC Commits 2023-05-08 07:38:12 UTC
The trunk branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>:

https://gcc.gnu.org/g:69f1a8af45d8a42003c21489019ddfb01d88d30d

commit r14-573-g69f1a8af45d8a42003c21489019ddfb01d88d30d
Author: Andrew Pinski <apinski@marvell.com>
Date:   Wed Apr 26 14:55:46 2023 -0700

    PHIOPT: Add diamond bb form to factor_out_conditional_conversion
    
    So the function factor_out_conditional_conversion already supports
    diamond-shaped bb forms; it just needs to be called for such a thing.
    
    harden-cond-comp.c needed to be changed, as we would now optimize out
    the conversion, and that means the compare hardening no longer needs to
    split the block it was testing. So change it such that there
    would be no chance of optimization.
    
    Also add two testcases that showed the improvement. PR 103771 is
    solved in ifconvert for the vectorizer as well, but now it is solved
    in a general sense.
    
    OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
    
            PR tree-optimization/49959
            PR tree-optimization/103771
    
    gcc/ChangeLog:
    
            * tree-ssa-phiopt.cc (pass_phiopt::execute): Support
            Diamond shapped bb form for factor_out_conditional_conversion.
    
    gcc/testsuite/ChangeLog:
    
            * c-c++-common/torture/harden-cond-comp.c: Change testcase
            slightly to avoid the new phiopt optimization.
            * gcc.dg/tree-ssa/abs-2.c: New test.
            * gcc.dg/tree-ssa/pr103771.c: New test.
Comment 44 Andrew Pinski 2023-05-08 07:39:16 UTC
Fixed.