[PATCH] middle-end: Convert negate + right shift into compare greater.

Tamar Christina <Tamar.Christina@arm.com>
Fri Oct 15 11:55:22 GMT 2021



> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Friday, October 15, 2021 10:07 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Earnshaw <Richard.Earnshaw@foss.arm.com>; gcc-
> patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: RE: [PATCH]middle-end convert negate + right shift into compare
> greater.
> 
> On Fri, 15 Oct 2021, Tamar Christina wrote:
> 
> > > >
> > > > +/* Fold (-x >> C) into x > 0 where C = precision(type) - 1.  */
> > > > +(for cst (INTEGER_CST VECTOR_CST)
> > > > + (simplify
> > > > +  (rshift (negate:s @0) cst@1)
> > > > +   (if (!flag_wrapv)
> > >
> > > Don't test flag_wrapv directly, instead use the appropriate
> > > TYPE_OVERFLOW_{UNDEFINED,WRAPS} predicates.  But I'm not sure
> what
> > > we are protecting against?  Right-shift of signed integers is
> > > implementation-defined and GCC treats it as you'd expect,
> > > sign-extending the result.
> > >
> >
> > It's protecting against the overflow of the negate on INT_MIN. When
> > wrapping overflows are enabled the results would be wrong.
> 
> But -INT_MIN == INT_MIN in twos-complement so I fail to see the wrong
> result?  That is, both -INT_MIN >> 31 and INT_MIN >> 31 are -1.
> 
> > > > +    (with { tree ctype = TREE_TYPE (@0);
> > > > +	    tree stype = TREE_TYPE (@1);
> > > > +	    tree bt = truth_type_for (ctype); }
> > > > +     (switch
> > > > +      /* Handle scalar case.  */
> > > > +      (if (INTEGRAL_TYPE_P (ctype)
> > > > +	   && !VECTOR_TYPE_P (ctype)
> > > > +	   && !TYPE_UNSIGNED (ctype)
> > > > +	   && canonicalize_math_after_vectorization_p ()
> > > > +	   && wi::eq_p (wi::to_wide (@1), TYPE_PRECISION (stype) - 1))
> > > > +       (convert:bt (gt:bt @0 { build_zero_cst (stype); })))
> > >
> > > I'm not sure why the result is of type 'bt' rather than the original
> > > type of the expression?
> >
> > That was to satisfy some RTL check that expected the results of
> > comparisons to always be a Boolean.  For scalars that is logically
> > always the case; I just added it for consistency.
> >
> > >
> > > In that regard for non-vectors we'd have to add the sign extension
> > > from unsigned bool, in the vector case we'd hope the type of the
> > > comparison is correct.  I think in both cases it might be convenient
> > > to use
> > >
> > >   (cond (gt:bt @0 { build_zero_cst (ctype); })
> > >    { build_all_ones_cst (ctype); } { build_zero_cst (ctype); })
> > >
> > > to compute the correct result and rely on (cond ..) simplifications
> > > to simplify that if possible.
> > >
> > > Btw, 'stype' should be irrelevant - you need to look at the
> > > precision of 'ctype', no?
> >
> > I was working under the assumption that both input types must have the
> > same precision, but it turns out that assumption doesn't need to hold.
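
A concrete C case where the two types differ (illustrative only: the
point is that the precision of the shift count's type says nothing
about the precision of the shifted value, so the pattern must look at
the precision of @0 rather than of @1):

  #include <stdint.h>

  int32_t f (int32_t x)
  {
    /* The shift count has a 64-bit type, but the precision that
       matters is that of x (32 bits): the result of a shift has the
       type of its left operand.  */
    return (-x) >> (int64_t) 31;
  }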
> >
> > New version attached.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > and no regressions.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* match.pd: New negate+shift pattern.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.dg/signbit-2.c: New test.
> > 	* gcc.dg/signbit-3.c: New test.
> > 	* gcc.target/aarch64/signbit-1.c: New test.
> >
> > --- inline copy of patch ---
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 7d2a24dbc5e9644a09968f877e12a824d8ba1caa..9532cae582e152cae6e22fcce95a9744a844e3c2 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -38,7 +38,8 @@ along with GCC; see the file COPYING3.  If not see
> >     uniform_integer_cst_p
> >     HONOR_NANS
> >     uniform_vector_p
> > -   bitmask_inv_cst_vector_p)
> > +   bitmask_inv_cst_vector_p
> > +   expand_vec_cmp_expr_p)
> >
> >  /* Operator lists.  */
> >  (define_operator_list tcc_comparison
> > @@ -826,6 +827,42 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >      { tree utype = unsigned_type_for (type); }
> >      (convert (rshift (lshift (convert:utype @0) @2) @3))))))
> >
> > +/* Fold (-x >> C) into x > 0 where C = precision(type) - 1.  */
> > +(for cst (INTEGER_CST VECTOR_CST)
> > + (simplify
> > +  (rshift (negate:s @0) cst@1)
> > +   (if (!TYPE_OVERFLOW_WRAPS (type))
> 
> as said, I don't think that's necessary but at least it's now written correctly ;)
> 
> > +    (with { tree ctype = TREE_TYPE (@0);
> 
> Instead of 'ctype' you can use 'type' since the type of the expression is the
> same as that of @0
> 
> > +	    tree stype = TREE_TYPE (@1);
> > +	    tree bt = truth_type_for (ctype);
> > +	    tree zeros = build_zero_cst (ctype); }
> > +     (switch
> > +      /* Handle scalar case.  */
> > +      (if (INTEGRAL_TYPE_P (ctype)
> > +	   && !VECTOR_TYPE_P (ctype)
> 
> INTEGRAL_TYPE_P does not include VECTOR_TYPE_P.
> 
> > +	   && !TYPE_UNSIGNED (ctype)
> > +	   && canonicalize_math_after_vectorization_p ()
> > +	   && wi::eq_p (wi::to_wide (@1), TYPE_PRECISION (ctype) - 1))
> > +       (cond (gt:bt @0 { zeros; }) { build_all_ones_cst (ctype); } { zeros; }))
> > +      /* Handle vector case with a scalar immediate.  */
> > +      (if (VECTOR_INTEGER_TYPE_P (ctype)
> > +	   && !VECTOR_TYPE_P (stype)
> > +	   && !TYPE_UNSIGNED (ctype)
> > +	   && expand_vec_cmp_expr_p (ctype, ctype, { GT_EXPR }))
> > +       (with { HOST_WIDE_INT bits = GET_MODE_UNIT_BITSIZE (TYPE_MODE (ctype)); }
> > +	(if (wi::eq_p (wi::to_wide (@1), bits - 1))
> 
> You can use element_precision (@0) - 1 in both the scalar and vector case.
> 
> > +	 (convert:bt (gt:bt @0 { zeros; })))))
> > +      /* Handle vector case with a vector immediate.   */
> > +      (if (VECTOR_INTEGER_TYPE_P (ctype)
> > +	   && VECTOR_TYPE_P (stype)
> > +	   && !TYPE_UNSIGNED (ctype)
> > +	   && uniform_vector_p (@1)
> > +	   && expand_vec_cmp_expr_p (ctype, ctype, { GT_EXPR }))
> > +       (with { tree cst = vector_cst_elt (@1, 0);
> > +	       HOST_WIDE_INT bits = GET_MODE_UNIT_BITSIZE (TYPE_MODE (ctype)); }
> > +        (if (wi::eq_p (wi::to_wide (cst), bits - 1))
> > +	 (convert:bt (gt:bt @0 { zeros; }))))))))))
> 
> The (convert:bt ...)s are not necessary.  Note that 'ctype' is not a valid mask
> type, but we eventually will be forgiving enough here.  You are computing
> truth_type_for, asking for gt:bt, but checking expand_vec_cmp_expr_p
> (ctype, ctype, ...); that's inconsistent and will lead to problems on targets
> with AVX512VL + AVX2, which will have an AVX2 compare of say V4SImode
> producing a V4SImode mask while truth_type_for will give you a QImode
> vector type.
> 

Right, what I wanted to check here was that the target uses the same mode for
the mask as for the result, which, looking at expand_vec_cmp_expr_p, I thought
it would give me.  The problem is that checking expand_vec_cmp_expr_p with bt
as the mask type crashes with SVE, because there the comparison becomes a
predicate which is applied to an operation.  So I guess there are three
different possibilities here.
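
As a sketch of the kind of check I mean (truth_type_for, TYPE_MODE and
expand_vec_cmp_expr_p are the existing interfaces; whether comparing
the modes like this is the right test is exactly the open question):

  /* Only fold when the mask the target produces for the compare lives
     in the same mode as the value type (an integer-style mask as on
     Advanced SIMD), not in a separate predicate mode as on SVE or
     AVX512.  */
  tree bt = truth_type_for (type);
  bool mask_ok = TYPE_MODE (bt) == TYPE_MODE (type)
		 && expand_vec_cmp_expr_p (type, bt, GT_EXPR);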

> As said the most general transform would be to
> 
>  (cond (gt:bt @0 { zeros; }) { build_zero_cst (ctype); } { build_all_ones_cst
> (ctype); })
> 

I tried this before but it doesn't optimize the cond away. So with this it produces:

  vect__6.10_7 = vect__4.8_9 > { 0, 0, 0, 0 } ? { 0, 0, 0, 0 } : { -1, -1, -1, -1 };
  MEM <vector(4) int> [(int32_t *)_2] = vect__6.10_7;

Which I suppose is technically correct, but it crashes at expand time because
this form doesn't seem to be expected:

0x1262ac1 prepare_cmp_insn
        /src/gcc/gcc/optabs.c:4532
0x1262f80 emit_cmp_and_jump_insns(rtx_def*, rtx_def*, rtx_code, rtx_def*, machine_mode, int, rtx_def*, profile_probability)
        /src/gcc/gcc/optabs.c:4677
0xdad85d do_compare_rtx_and_jump(rtx_def*, rtx_def*, rtx_code, int, machine_mode, rtx_def*, rtx_code_label*, rtx_code_label*, profile_probability)
        /src/gcc/gcc/dojump.c:1220
0xdadf0c do_compare_and_jump
        /src/gcc/gcc/dojump.c:1294
0xdab0d3 do_jump_1
        /src/gcc/gcc/dojump.c:253
0xdabaa2 do_jump
        /src/gcc/gcc/dojump.c:500
0xdacdd7 jumpifnot(tree_node*, rtx_code_label*, profile_probability)
        /src/gcc/gcc/dojump.c:942
0xeba6f1 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier)
        /src/gcc/gcc/expr.c:10219
0xebb99d expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
        /src/gcc/gcc/expr.c:10480
0xeb363a expand_expr_real(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
        /src/gcc/gcc/expr.c:8713
0xe9116d expand_expr
        /src/gcc/gcc/expr.h:301
0xea3ce5 expand_assignment(tree_node*, tree_node*, bool)
        /src/gcc/gcc/expr.c:5345
0xd01860 expand_gimple_stmt_1
        /src/gcc/gcc/cfgexpand.c:3942
0xd01c4f expand_gimple_stmt
        /src/gcc/gcc/cfgexpand.c:4040
0xd0a9b7 expand_gimple_basic_block
        /src/gcc/gcc/cfgexpand.c:6082
0xd0ce45 execute
        /src/gcc/gcc/cfgexpand.c:6808

> If you think that's not better than the negate + shift, but only when the
> compare will directly produce the desired result, then I think you have to
> either drop the truth_type_for and manually build the truth type, or check
> that the mode of the truth type and the value type match.  And you need to
> produce
> 

But how do I figure out what the target's mask format is?  Is there a hook or
property I can use for this?

Thanks,
Tamar

>   (view_convert:ctype (gt:bt @0 { zeros; }))
> 
> then.
> 
> > +
> >  /* Fold (C1/X)*C2 into (C1*C2)/X.  */
> >  (simplify
> >   (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
> > diff --git a/gcc/testsuite/gcc.dg/signbit-2.c b/gcc/testsuite/gcc.dg/signbit-2.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..fc0157cbc5c7996b481f2998bc30176c96a669bb
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/signbit-2.c
> > @@ -0,0 +1,19 @@
> > +/* { dg-do assemble } */
> > +/* { dg-options "-O3 --save-temps -fdump-tree-optimized" } */
> > +
> > +#include <stdint.h>
> > +
> > +void fun1(int32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (-x[i]) >> 31;
> > +}
> > +
> > +void fun2(int32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (-x[i]) >> 30;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times {\s+>\s+\{ 0, 0, 0, 0 \}} 1 optimized } } */
> > +/* { dg-final { scan-tree-dump-not {\s+>>\s+31} optimized } } */
> > diff --git a/gcc/testsuite/gcc.dg/signbit-3.c b/gcc/testsuite/gcc.dg/signbit-3.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..19e9c06c349b3287610f817628f00938ece60bf7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/signbit-3.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do assemble } */
> > +/* { dg-options "-O1 --save-temps -fdump-tree-optimized" } */
> > +
> > +#include <stdint.h>
> > +
> > +void fun1(int32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (-x[i]) >> 31;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times {\s+>\s+0;} 1 optimized } } */
> > +/* { dg-final { scan-tree-dump-not {\s+>>\s+31} optimized } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/signbit-1.c b/gcc/testsuite/gcc.target/aarch64/signbit-1.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..3ebfb0586f37de29cf58635b27fe48503714447e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/signbit-1.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do assemble } */
> > +/* { dg-options "-O3 --save-temps" } */
> > +
> > +#include <stdint.h>
> > +
> > +void fun1(int32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (-x[i]) >> 31;
> > +}
> > +
> > +void fun2(int32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (-x[i]) >> 30;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times {\tcmgt\t} 1 } } */
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

