Bug 113281 - [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590
Summary: [11 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 14.0
Importance: P2 normal
Target Milestone: 11.5
Assignee: Richard Sandiford
URL: https://gcc.gnu.org/pipermail/gcc-pat...
Keywords: wrong-code
Depends on:
Blocks: vectorizer 110838
 
Reported: 2024-01-08 19:17 UTC by Patrick O'Neill
Modified: 2024-10-31 03:45 UTC
CC: 10 users

See Also:
Host:
Target: riscv aarch64-linux-gnu
Build:
Known to work: 11.4.1, 12.3.1, 13.2.0, 13.3.1
Known to fail:
Last reconfirmed: 2024-01-09 00:00:00


Description Patrick O'Neill 2024-01-08 19:17:44 UTC
Testcase:
unsigned char a;

int main() {
  short b = a = 0;
  for (; a != 19; a++)
    if (a)
      b = 32872 >> a;

  if (b == 0)
    return 0;
  else
    return 1;
}

Commands:
rv64gcv_zvl256b:
> /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -march=rv64gcv_zvl256b -O3 red.c -o user-config.out
> QEMU_CPU=rv64,vlen=256,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/qemu-riscv64 user-config.out
> echo $?
1

rv64gc:
> /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -march=rv64gc -O3 red.c -o rv64gc.out
> QEMU_CPU=rv64,vlen=256,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/qemu-riscv64 rv64gc.out
> echo $?
0
Comment 1 Patrick O'Neill 2024-01-08 19:20:25 UTC
Godbolt: https://godbolt.org/z/efPhqzcr5
Comment 2 Robin Dapp 2024-01-08 21:49:09 UTC
Confirmed.  Funny, we shouldn't vectorize this at all but rather optimize it to "return 0".  The costing might be questionable, but we also haven't optimized away the loop when comparing costs.

Disregarding that, of course the vectorization should be correct.

The vect output doesn't really make sense to me but I haven't looked very closely yet:

  _177 = .SELECT_VL (2, POLY_INT_CST [16, 16]);
  vect_patt_82.18_166 = (vector([16,16]) unsigned short) { 17, 18, 19, ... };
  vect_patt_84.19_168 = MIN_EXPR <vect_patt_82.18_166, { 15, ... }>;
  vect_patt_85.20_170 = { 32872, ... } >> vect_patt_84.19_168;
  vect_patt_87.21_171 = VIEW_CONVERT_EXPR<vector([16,16]) short intD.12>(vect_patt_85.20_170);
  _173 = _177 + 18446744073709551615;
  # RANGE [irange] short int [0, 16436] MASK 0x7fff VALUE 0x0
  _174 = .VEC_EXTRACT (vect_patt_87.21_171, _173);

vect_patt_85.20_170 should be all zeros and then we'd just vec_extract a 0 and return that.  However, 32872 >> 15 == 1 so we return 1.
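
For concreteness, a minimal scalar sketch (editorial, not part of the original report) of that mismatch: in C the shift is performed in int after integer promotion, so for counts of 16 and up it yields 0, while the narrowed 16-bit variant with the count clamped to 15 yields 1.

  #include <stdio.h>

  int main(void) {
    unsigned char a = 17;                      /* any count in [16, 18] from the loop */
    int promoted = 32872 >> a;                 /* shift done in 32-bit int: 0 */
    int clamped  = 32872 >> (a < 15 ? a : 15); /* what the narrowed code computes: 32872 >> 15 == 1 */
    printf("%d %d\n", promoted, clamped);      /* prints "0 1" */
    return 0;
  }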
Comment 3 JuzheZhong 2024-01-09 01:32:17 UTC
I think there are two issues here:

1. We should adjust the cost model so the loop vectorizer eliminates this
   unprofitable vectorization. That should be done in the RISC-V backend.
2. We should fix the wrong-code failure seen with -fno-vect-cost-model.

I think the first issue is simpler than the second. And I strongly suspect the
second one is not a RISC-V bug but a middle-end bug.
Comment 4 JuzheZhong 2024-01-09 02:06:34 UTC
Confirmed with this reduced case:

#include <assert.h>
unsigned char a;

int main() {
  short b = a = 0;
  for (; a != 19; a++)
    if (a)
      b = 32872 >> a;
  
  assert (b == 0);
}

with -fno-vect-cost-model -march=rv64gcv -O3:

https://godbolt.org/z/joGb3e9Eb

It also fails at run time: assertion "b == 0" failed: file "bug.c", line 10, function: main

I suspect ARM SVE has the same failure.

Hi, Andrew. Could you test this case on ARM to see whether it has the same issue as RISC-V?
Comment 5 JuzheZhong 2024-01-09 02:08:03 UTC
(In reply to JuzheZhong from comment #4)
> I suspect ARM SVE has the same failure.

The vect dump tree is quite similar between ARM and RISC-V.
Comment 6 Andrew Pinski 2024-01-09 02:12:46 UTC
Confirmed on aarch64 too:

[apinski@xeond2 upstream-full-cross]$ ./install/bin/aarch64-linux-gnu-gcc -O3 -march=armv9-a+sve2 t.c -static -fno-vect-cost-model
[apinski@xeond2 upstream-full-cross]$ ./install-qemu/bin/qemu-aarch64 a.out ;echo $?
1
Comment 7 Andrew Pinski 2024-01-09 02:16:14 UTC
>  short b = a = 0;

s/short/int/ allows it to work ...
Comment 8 JuzheZhong 2024-01-09 02:17:27 UTC
(In reply to Andrew Pinski from comment #7)
> >  short b = a = 0;
> 
> s/short/int/ allows it to work ...

Thanks Andrew. I think we should change the title.

It should be a mismatch on both RISC-V and ARM SVE with -fno-vect-cost-model.
Comment 9 Andrew Pinski 2024-01-09 02:59:18 UTC
(In reply to JuzheZhong from comment #8)
> It should be mismatch on both RISC-V and ARM SVE with -fno-vect-cost-model.

Changed, and I checked that GCC 13 didn't vectorize the loop for aarch64 SVE either, so it is a regression.
Comment 10 Richard Biener 2024-01-09 08:32:04 UTC
  vect_patt_84.19_168 = MIN_EXPR <vect_patt_82.18_166, { 15, ... }>;

this one is new, but I think there's an existing open bug about the "widening"
pattern recognition failing to make sure the shift argument is kept in range.
Or maybe that was the same issue that spurred the MIN above (though IIRC that was
some "undefined behavior" thing, not wrong-code).
Comment 11 GCC Commits 2024-01-12 12:16:03 UTC
The master branch has been updated by Pan Li <panli@gcc.gnu.org>:

https://gcc.gnu.org/g:0acb63670bf1058fce00a75bd318c40be3bfa222

commit r14-7183-g0acb63670bf1058fce00a75bd318c40be3bfa222
Author: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Date:   Fri Jan 12 17:28:44 2024 +0800

    RISC-V: Adjust scalar_to_vec cost
    
    1. Introduce vector regmove new tune info.
    2. Adjust scalar_to_vec cost in add_stmt_cost.
    
    We will get optimal codegen after this patch with -march=rv64gcv_zvl256b:
    
            lui     a5,%hi(a)
            li      a4,19
            sb      a4,%lo(a)(a5)
            li      a0,0
            ret
    
    Tested on both RV32/RV64 no regression, Ok for trunk ?
    
            PR target/113281
    
    gcc/ChangeLog:
    
            * config/riscv/riscv-protos.h (struct regmove_vector_cost): New struct.
            (struct cpu_vector_cost): Add regmove struct.
            (get_vector_costs): Export as global.
            * config/riscv/riscv-vector-costs.cc (adjust_stmt_cost): Adjust scalar_to_vec cost.
            (costs::add_stmt_cost): Ditto.
            * config/riscv/riscv.cc (get_common_costs): Export global function.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-2.c: New test.
Comment 12 GCC Commits 2024-01-15 12:06:09 UTC
The trunk branch has been updated by Lehua Ding <lhtin@gcc.gnu.org>:

https://gcc.gnu.org/g:405096f908e1ceb0d6a1b5420ded20ad85ddae9e

commit r14-7244-g405096f908e1ceb0d6a1b5420ded20ad85ddae9e
Author: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Date:   Mon Jan 15 09:22:40 2024 +0800

    RISC-V: Adjust loop len by costing 1 when NITER < VF
    
    Rebase in v3: Rebase to the trunk and commit it as it's approved by Robin.
    Update in v2: Add dynamic lmul test.
    
    This patch fixes the regression between GCC 13.2.0 and trunk GCC (GCC-14)
    
    GCC 13.2.0:
    
            lui     a5,%hi(a)
            li      a4,19
            sb      a4,%lo(a)(a5)
            li      a0,0
            ret
    
    Trunk GCC:
    
            vsetvli a5,zero,e8,mf2,ta,ma
            li      a4,-32768
            vid.v   v1
            vsetvli zero,zero,e16,m1,ta,ma
            addiw   a4,a4,104
            vmv.v.i v3,15
            lui     a1,%hi(a)
            li      a0,19
            vsetvli zero,zero,e8,mf2,ta,ma
            vadd.vi v1,v1,1
            sb      a0,%lo(a)(a1)
            vsetvli zero,zero,e16,m1,ta,ma
            vzext.vf2       v2,v1
            vmv.v.x v1,a4
            vminu.vv        v2,v2,v3
            vsrl.vv v1,v1,v2
            vslidedown.vi   v1,v1,17
            vmv.x.s a0,v1
            snez    a0,a0
            ret
    
    The root cause is that we vectorize the code inefficiently since we don't cost the loop len when NITERS < VF.
    Leveraging the loop control of mask targets or of rs6000 fixes the regression.
    
    Tested no regression. Ok for trunk ?
    
            PR target/113281
    
    gcc/ChangeLog:
    
            * config/riscv/riscv-vector-costs.cc (costs::adjust_vect_cost_per_loop): New function.
            (costs::finish_cost): Adjust cost for LOOP LEN with NITERS < VF.
            * config/riscv/riscv-vector-costs.h: New function.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-3.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-4.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-5.c: New test.
Comment 13 Jakub Jelinek 2024-01-23 16:58:28 UTC
At least on aarch64 this is vectorized since r14-3027-gc5f673dbc252e35e6b66e9b8abd30a4027193e0b
Comment 14 Richard Biener 2024-01-24 07:48:42 UTC
So the issue is we are vectorizing

     b = 32872 >> a;

as a HImode shift with

   b = 32872 >> MIN (a, 15)

that's wrong since it never yields the desired zero.  A shift value of 16
invokes undefined behavior for HImode, but the expression was originally
promoted to int and is thus fine to shift by even 18; does this mean we cannot
vectorize this without introducing undefined behavior in GIMPLE?

This was introduced by the fix for PR110838 (and followup fixes).

I wonder if it would make the most sense to allow shifts by the width of
the promoted left operand, which would also help interoperability of
GIMPLE with Fortran here?  I'm not sure what the reasoning is for
disallowing that specific shift amount (is there any hardware that would
not compute the result as zero or -1?  Especially for logical right shifts
this seems like an odd restriction).  I'd hate to have another shift
operator with special defined behavior for this single case, but I see no
other way to fix this short of complicating the lowering to
a >= 16 ? 0 : 32872 >> MIN (a, 15) (the MIN is still required to
avoid requiring masking).
Comment 15 Andrew Pinski 2024-01-24 07:54:14 UTC
(In reply to Richard Biener from comment #14)
> a >= 16 ? 0 : 32872 >> MIN (a, 15) (the MIN is still required to
> avoid requiring masking).

Note: maybe instead of MIN here we could use `a & 0xf`, since that will likely be cheaper than a MIN.
Comment 16 rguenther@suse.de 2024-01-24 08:32:12 UTC
(In reply to Andrew Pinski from comment #15)
> Note: maybe instead of MIN here we could use `a & 0xf`, since that will
> likely be cheaper than a MIN.

But it's incorrect (that's what I did originally).
Comment 17 Andrew Pinski 2024-01-24 08:38:30 UTC
(In reply to rguenther@suse.de from comment #16)
> But it's incorrect (that's what I did originally).

But `(a >= 16) ? 0 : (32872 >> (a & 0xf))` is correct.

So is `(a >= 16 ? 0 : 32872) >> (a & 0xf)`.

Or is it that you want to avoid the conditional here?
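
For the record, a quick brute-force check (an editorial sketch, not from the thread) that the guarded forms match the original int-promoted semantics while the bare mask does not; e.g. for a == 16 the bare mask computes 32872 >> 0 == 32872 instead of the correct 0:

  #include <assert.h>

  int main(void) {
    for (unsigned a = 0; a < 19; a++) {
      unsigned short ref  = (unsigned short)(32872 >> a);          /* shift in int, truncate after */
      unsigned short bare = (unsigned short)(32872 >> (a & 0xf));  /* wrong for a >= 16 */
      unsigned short g1   = (unsigned short)(a >= 16 ? 0 : 32872 >> (a & 0xf));
      unsigned short g2   = (unsigned short)((a >= 16 ? 0 : 32872) >> (a & 0xf));
      assert(g1 == ref && g2 == ref);
      if (a >= 16)
        assert(bare != ref);
    }
    return 0;
  }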
Comment 18 Jakub Jelinek 2024-01-24 09:52:59 UTC
(In reply to Andrew Pinski from comment #17)
> But `(a >= 16) ? 0 : (32872 >> (a & 0xf))` is correct.
>
> So is `(a >= 16 ? 0 : 32872) >> (a & 0xf)`.
>
> Or is it that you want to avoid the conditional here?

I believe the thing is that Richi's PR110838 changes for RSHIFT_EXPR are correct for arithmetic right shifts, and we should keep doing what they do for those.
But they are not correct for logical right shifts, and they don't handle left shifts.
Both logical right shifts and left shifts need to yield 0 for shift counts bigger than new_precision - 1, which can be achieved through one of (or with << instead of >>):
(op1 >= min_precision ? 0 : op0) >> (op1 & (min_precision - 1))
(op0 & (op1 < min_precision ? -1 : 0)) >> (op1 & (min_precision - 1))
op1 >= min_precision ? 0 : (op0 >> (op1 & (min_precision - 1)))
(op0 >> (op1 & (min_precision - 1))) & (op1 < min_precision ? -1 : 0)
Which one of these is best?
For simplicity of vect_recog_over_widening_pattern, probably the first one; I'm unsure
about what generates the best code (and where).
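
As a sanity check, a brute-force cross-check of the four candidates (editorial sketch, assuming a 16-bit logical right shift whose operands were originally promoted to int, i.e. min_precision == 16; the -1 mask from the second and fourth forms is written as the 16-bit all-ones value 0xffffu):

  #include <assert.h>
  #include <stdint.h>

  int main(void) {
    const uint32_t min_precision = 16;
    for (uint32_t op0 = 0; op0 <= 0xffff; op0++)
      for (uint32_t op1 = 0; op1 < 32; op1++) {
        uint16_t ref = (uint16_t)(op0 >> op1);    /* reference: shift done in 32 bits */
        uint32_t m   = op1 & (min_precision - 1); /* masked shift count */
        uint16_t v1 = (uint16_t)((op1 >= min_precision ? 0 : op0) >> m);
        uint16_t v2 = (uint16_t)((op0 & (op1 < min_precision ? 0xffffu : 0)) >> m);
        uint16_t v3 = (uint16_t)(op1 >= min_precision ? 0 : op0 >> m);
        uint16_t v4 = (uint16_t)((op0 >> m) & (op1 < min_precision ? 0xffffu : 0));
        assert(v1 == ref && v2 == ref && v3 == ref && v4 == ref);
      }
    return 0;
  }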
Comment 19 rguenther@suse.de 2024-01-24 09:54:07 UTC
(In reply to Andrew Pinski from comment #17)
> Or is it that you want to avoid the conditional here?

Yeah - at some point not trying to optimize the widening/shortening
is going to be cheaper, no?  Esp. if it is "just" because of
GIMPLE being limited with no way to express the desired semantics
for "out-of-bound" shifts (and no way to query target behavior for
them of course).

The widening/shortening might or might not have an effect on the
VF (we don't know) and we don't (yet) compute if the target
supports it.  We could force a smaller vector size (if the target
supports it) to avoid increasing the VF.
Comment 20 Jakub Jelinek 2024-01-24 10:00:15 UTC
(In reply to Jakub Jelinek from comment #18)
> But they are not correct for logical right shifts and they don't handle left
> shifts.

Oh, and then there are rotates; I think we need to punt on those, they shouldn't be narrowed.  Unless we already punt on them...
Comment 21 Jakub Jelinek 2024-01-24 13:23:29 UTC
So, I think
--- gcc/tree-vect-patterns.cc.jj	2024-01-03 11:51:37.487648527 +0100
+++ gcc/tree-vect-patterns.cc	2024-01-24 14:05:55.911356481 +0100
@@ -3002,6 +3002,12 @@ vect_recog_over_widening_pattern (vec_in
   if (STMT_VINFO_DEF_TYPE (last_stmt_info) == vect_reduction_def)
     return NULL;
 
+  /* FIXME: Aren't actually most operations incompatible with the truncation?
+     Like division/modulo, ...  Shouldn't this punt on
+     !vect_truncatable_operation_p (code) or that plus some exceptions?  */
+  if (code == LROTATE_EXPR || code == RROTATE_EXPR)
+    return NULL;
+
   /* Keep the first operand of a COND_EXPR as-is: only the other two
      operands are interesting.  */
   unsigned int first_op = (code == COND_EXPR ? 2 : 1);
@@ -3169,22 +3175,35 @@ vect_recog_over_widening_pattern (vec_in
 		       op_type, &unprom[0], op_vectype);
 
   /* Limit shift operands.  */
-  if (code == RSHIFT_EXPR)
+  if (code == LSHIFT_EXPR || code == RSHIFT_EXPR)
     {
       wide_int min_value, max_value;
       if (TREE_CODE (ops[1]) == INTEGER_CST)
-	ops[1] = wide_int_to_tree (op_type,
-				   wi::umin (wi::to_wide (ops[1]),
-					     new_precision - 1));
+	{
+	  if (wi::ge_p (wi::to_wide (ops[1]), new_precision,
+			TYPE_SIGN (TREE_TYPE (ops[1]))))
+	    {
+	      if (code == LSHIFT_EXPR || unsigned_p)
+		return NULL;
+	      ops[1] = build_int_cst (TREE_TYPE (ops[1]), new_precision - 1);
+	    }
+	}
       else if (!vect_get_range_info (ops[1], &min_value, &max_value)
-	       || wi::ge_p (max_value, new_precision, TYPE_SIGN (op_type)))
+	       || wi::ge_p (max_value, new_precision,
+			    TYPE_SIGN (TREE_TYPE (ops[1]))))
 	{
+	  /* LSHIFT_EXPR or logical RSHIFT_EXPR can be handled as
+	     (op1 >= new_precision ? 0 : op0) << (op1 & (new_precision - 1))
+	     but perhaps it is too costly.  */
+	  if (code == LSHIFT_EXPR || unsigned_p)
+	    return NULL;
 	  /* ???  Note the following bad for SLP as that only supports
 	     same argument widened shifts and it un-CSEs same arguments.  */
-	  tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
+	  tree new_var = vect_recog_temp_ssa_var (TREE_TYPE (ops[1]), NULL);
 	  gimple *pattern_stmt
 	    = gimple_build_assign (new_var, MIN_EXPR, ops[1],
-				   build_int_cst (op_type, new_precision - 1));
+				   build_int_cst (TREE_TYPE (ops[1]),
+						  new_precision - 1));
 	  gimple_set_location (pattern_stmt, gimple_location (last_stmt));
 	  if (ops[1] == unprom[1].op && unprom[1].dt == vect_external_def)
 	    {
should fix it (given Richi's comments about the conditionals being too costly, it instead punts for truncated left shifts or logical right shifts unless VRP proves the shift count is in range).  I've also changed all the op_type uses on the shift-count operand to TREE_TYPE (ops[1]), because I believe the two types can be, and most often are, different.
But as the first comment suggests, I wonder how come this doesn't miscompile rotates, or say divisions/multiplications etc.
While vect_determine_precisions_from_range and vect_determine_precisions_from_users call vect_truncatable_operation_p, and the latter allows casts and LSHIFT_EXPR/RSHIFT_EXPR next to it, the former seems, for !vect_truncatable_operation_p, to just compute the minimum/maximum from ranges across all operands and the lhs; is that safe for arbitrary non-truncatable operations?

Richard S., can you please have a look? This was added in r9-1590-g370c2ebe8fa20e0812cd2d533d4ed38ee2d37c85, I believe.
Comment 22 Richard Sandiford 2024-01-24 20:49:22 UTC
Taking this, following discussion on IRC.
Comment 23 GCC Commits 2024-01-29 12:33:18 UTC
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:1a8261e047f7a2c2b0afb95716f7615cba718cd1

commit r14-8492-g1a8261e047f7a2c2b0afb95716f7615cba718cd1
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 29 12:33:08 2024 +0000

    vect: Tighten vect_determine_precisions_from_range [PR113281]
    
    This was another PR caused by the way that
    vect_determine_precisions_from_range handles shifts.  We tried to
    narrow 32768 >> x to a 16-bit shift based on range information for
    the inputs and outputs, with vect_recog_over_widening_pattern
    (after PR110838) adjusting the shift amount.  But this doesn't
    work for the case where x is in [16, 31], since then 32-bit
    32768 >> x is a well-defined zero, whereas no well-defined
    16-bit 32768 >> y will produce 0.
    
    We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
    but since vect_determine_precisions_from_range was never really
    supposed to rely on fix-ups, it seems better to fix that instead.
    
    The patch also makes the code more selective about which codes
    can be narrowed based on input and output ranges.  This showed
    that vect_truncatable_operation_p was missing cases for
    BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
    (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).
    
    pr113281-1.c is the original testcase.  pr113281-[23].c failed
    before the patch due to overly optimistic narrowing.  pr113281-[45].c
    previously passed and are meant to protect against accidental
    optimisation regressions.
    
    gcc/
            PR target/113281
            * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
            workaround for right shifts.
            (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
            (vect_determine_precisions_from_range): Be more selective about
            which codes can be narrowed based on their input and output ranges.
            For shifts, require at least one more bit of precision than the
            maximum shift amount.
    
    gcc/testsuite/
            PR target/113281
            * gcc.dg/vect/pr113281-1.c: New test.
            * gcc.dg/vect/pr113281-2.c: Likewise.
            * gcc.dg/vect/pr113281-3.c: Likewise.
            * gcc.dg/vect/pr113281-4.c: Likewise.
            * gcc.dg/vect/pr113281-5.c: Likewise.
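
As an editorial aside, the last rule above can be illustrated with a tiny sketch (the function name is illustrative, not a GCC internal):

  #include <assert.h>

  /* Narrowing a shift to `prec` bits needs at least one more bit of
     precision than the maximum possible shift amount, so that every
     reachable count stays in range for the narrowed type. */
  static int shift_narrowing_ok(unsigned prec, unsigned max_shift_amount)
  {
    return prec >= max_shift_amount + 1;
  }

  int main(void)
  {
    assert(!shift_narrowing_ok(16, 18)); /* this PR: a reaches 18, so no 16-bit shift */
    assert(shift_narrowing_ok(16, 15));  /* counts in [0, 15] are fine for 16 bits */
    return 0;
  }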
Comment 24 Richard Sandiford 2024-01-29 12:34:59 UTC
Fixed on trunk so far, but it's latent on branches.  I'll see what
the trunk fallout is like before asking about backports.
Comment 25 Edwin Lu 2024-03-13 20:17:07 UTC
(In reply to Richard Sandiford from comment #24)
> Fixed on trunk so far, but it's latent on branches.  I'll see what
> the trunk fallout is like before asking about backports.

It looks like we have a regression for RISC-V.

I was going through the scan dump failures on trunk and ended up revisiting https://github.com/patrick-rivos/gcc-postcommit-ci/issues/463 where gcc.dg/vect/costmodel/riscv/rvv/pr113281-[125].c are failing the scan-dump checks. I didn't realize at the time that the scan dumps were checking code correctness and ended up ignoring it. 

It's still persisting on trunk (at least for pr113281-1.c https://godbolt.org/z/M9EK44hKe)

A bisection on https://github.com/patrick-rivos/gcc-postcommit-ci/issues/463 commit range suggests https://gcc.gnu.org/g:1a8261e047f7a2c2b0afb95716f7615cba718cd1 introduced it.

# first bad commit: [1a8261e047f7a2c2b0afb95716f7615cba718cd1] vect: Tighten vect_determine_precisions_from_range [PR113281]

Configuration
../configure --prefix=$(pwd) --with-multilib-generator="rv64gcv-lp64d--"
make stamps/build-gcc-linux-stage1 -j 32

Testing
./build-gcc-linux-stage1/gcc/cc1   ../gcc/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c  -march=rv64gcv -mabi=lp64d -mtune=rocket -mcmodel=medlow   -fdiagnostics-plain-output  -march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -ffat-lto-objects -fno-ident   -o pr113281-1.s
Comment 26 Andrew Pinski 2024-03-13 20:22:52 UTC
(In reply to Edwin Lu from comment #25)
> It's still persisting on trunk (at least for pr113281-1.c
> https://godbolt.org/z/M9EK44hKe)

I looked into what the vectorizer produces:
  vect__22.13_31 = (vector(8) int) vect_vec_iv_.12_8;
  _22 = (int) a.4_25;
  vect__12.14_33 = { 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> vect__22.13_31;
  _12 = 32872 >> _22;
  vect_b_7.15_34 = (vector(8) short int) vect__12.14_33;

that is a valid thing to do: do the shift in `vector(8) int` and then do a truncation. The issue originally was about doing the shift in `vector(8) short`, which is not happening here.
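
A scalar equivalent of that valid form, as an editorial sketch:

  short f(unsigned char a) {
    int wide = 32872 >> a; /* shift performed in 32-bit int: well defined for a in [0, 18] */
    return (short)wide;    /* truncate only afterwards */
  }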
Comment 27 Patrick O'Neill 2024-03-13 20:29:57 UTC
(In reply to Andrew Pinski from comment #26)
> that is a valid thing to do: do the shift in `vector(8) int` and then do a
> truncation. The issue originally was about doing the shift in
> `vector(8) short`, which is not happening here.

The regressed testcase looks like it's testing whether RISC-V vectorizes the code at all (the first issue Juzhe noted in comment #3 and then fixed). So this is a performance regression for RISC-V, not a correctness one.
Comment 28 JuzheZhong 2024-03-13 23:18:49 UTC
The original cost model I did worked for all cases, but with some middle-end changes
the cost model failed.

I don't have time to figure out what's going on here.

Robin may be interested in it.
Comment 29 Robin Dapp 2024-05-31 08:12:44 UTC
Just to document again:  The test case should not be vectorized and at some point we will adjust the cost model so it is not going to be.  I'd prefer to base that decision on real uarchs rather than adjust the generic cost model right away though.
Comment 30 GCC Commits 2024-05-31 14:56:26 UTC
The releases/gcc-13 branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:2602b71103d5ef2ef86000cac832b31dad3dfe2b

commit r13-8813-g2602b71103d5ef2ef86000cac832b31dad3dfe2b
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri May 31 15:56:05 2024 +0100

    vect: Tighten vect_determine_precisions_from_range [PR113281]
    
    This was another PR caused by the way that
    vect_determine_precisions_from_range handles shifts.  We tried to
    narrow 32768 >> x to a 16-bit shift based on range information for
    the inputs and outputs, with vect_recog_over_widening_pattern
    (after PR110838) adjusting the shift amount.  But this doesn't
    work for the case where x is in [16, 31], since then 32-bit
    32768 >> x is a well-defined zero, whereas no well-defined
    16-bit 32768 >> y will produce 0.
    
    We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
    but since vect_determine_precisions_from_range was never really
    supposed to rely on fix-ups, it seems better to fix that instead.
    
    The patch also makes the code more selective about which codes
    can be narrowed based on input and output ranges.  This showed
    that vect_truncatable_operation_p was missing cases for
    BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
    (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).
    
    pr113281-1.c is the original testcase.  pr113281-[23].c failed
    before the patch due to overly optimistic narrowing.  pr113281-[45].c
    previously passed and are meant to protect against accidental
    optimisation regressions.
    
    gcc/
            PR target/113281
            * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
            workaround for right shifts.
            (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
            (vect_determine_precisions_from_range): Be more selective about
            which codes can be narrowed based on their input and output ranges.
            For shifts, require at least one more bit of precision than the
            maximum shift amount.
    
    gcc/testsuite/
            PR target/113281
            * gcc.dg/vect/pr113281-1.c: New test.
            * gcc.dg/vect/pr113281-2.c: Likewise.
            * gcc.dg/vect/pr113281-3.c: Likewise.
            * gcc.dg/vect/pr113281-4.c: Likewise.
            * gcc.dg/vect/pr113281-5.c: Likewise.
    
    (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)
Comment 31 GCC Commits 2024-06-04 07:48:04 UTC
The releases/gcc-12 branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:dfaa13455d67646805bc611aa4373728a460a37d

commit r12-10489-gdfaa13455d67646805bc611aa4373728a460a37d
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Tue Jun 4 08:47:48 2024 +0100

    vect: Tighten vect_determine_precisions_from_range [PR113281]
    
    This was another PR caused by the way that
    vect_determine_precisions_from_range handles shifts.  We tried to
    narrow 32768 >> x to a 16-bit shift based on range information for
    the inputs and outputs, with vect_recog_over_widening_pattern
    (after PR110828) adjusting the shift amount.  But this doesn't
    work for the case where x is in [16, 31], since then 32-bit
    32768 >> x is a well-defined zero, whereas no well-defined
    16-bit 32768 >> y will produce 0.
    
    We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
    but since vect_determine_precisions_from_range was never really
    supposed to rely on fix-ups, it seems better to fix that instead.
    
    The patch also makes the code more selective about which codes
    can be narrowed based on input and output ranges.  This showed
    that vect_truncatable_operation_p was missing cases for
    BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
    (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).
    
    pr113281-1.c is the original testcase.  pr113281-[23].c failed
    before the patch due to overly optimistic narrowing.  pr113281-[45].c
    previously passed and are meant to protect against accidental
    optimisation regressions.
    
    gcc/
            PR target/113281
            * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
            workaround for right shifts.
            (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
            (vect_determine_precisions_from_range): Be more selective about
            which codes can be narrowed based on their input and output ranges.
            For shifts, require at least one more bit of precision than the
            maximum shift amount.
    
    gcc/testsuite/
            PR target/113281
            * gcc.dg/vect/pr113281-1.c: New test.
            * gcc.dg/vect/pr113281-2.c: Likewise.
            * gcc.dg/vect/pr113281-3.c: Likewise.
            * gcc.dg/vect/pr113281-4.c: Likewise.
            * gcc.dg/vect/pr113281-5.c: Likewise.
    
    (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)
Comment 32 GCC Commits 2024-06-04 12:47:57 UTC
The releases/gcc-11 branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62

commit r11-11466-g95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Tue Jun 4 13:47:35 2024 +0100

    vect: Tighten vect_determine_precisions_from_range [PR113281]
    
    This was another PR caused by the way that
    vect_determine_precisions_from_range handles shifts.  We tried to
    narrow 32768 >> x to a 16-bit shift based on range information for
    the inputs and outputs, with vect_recog_over_widening_pattern
    (after PR110828) adjusting the shift amount.  But this doesn't
    work for the case where x is in [16, 31], since then 32-bit
    32768 >> x is a well-defined zero, whereas no well-defined
    16-bit 32768 >> y will produce 0.
    
    We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
    but since vect_determine_precisions_from_range was never really
    supposed to rely on fix-ups, it seems better to fix that instead.
    
    The patch also makes the code more selective about which codes
    can be narrowed based on input and output ranges.  This showed
    that vect_truncatable_operation_p was missing cases for
    BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
    (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).
    
    pr113281-1.c is the original testcase.  pr113281-[23].c failed
    before the patch due to overly optimistic narrowing.  pr113281-[45].c
    previously passed and are meant to protect against accidental
    optimisation regressions.
    
    gcc/
            PR target/113281
            * tree-vect-patterns.c (vect_recog_over_widening_pattern): Remove
            workaround for right shifts.
            (vect_truncatable_operation_p): Handle NEGATE_EXPR and BIT_NOT_EXPR.
            (vect_determine_precisions_from_range): Be more selective about
            which codes can be narrowed based on their input and output ranges.
            For shifts, require at least one more bit of precision than the
            maximum shift amount.
    
    gcc/testsuite/
            PR target/113281
            * gcc.dg/vect/pr113281-1.c: New test.
            * gcc.dg/vect/pr113281-2.c: Likewise.
            * gcc.dg/vect/pr113281-3.c: Likewise.
            * gcc.dg/vect/pr113281-4.c: Likewise.
            * gcc.dg/vect/pr113281-5.c: Likewise.
    
    (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)
Comment 33 Richard Sandiford 2024-06-04 12:49:26 UTC
Fixed.
Comment 34 GCC Commits 2024-06-26 05:23:36 UTC
The master branch has been updated by Alexandre Oliva <aoliva@gcc.gnu.org>:

https://gcc.gnu.org/g:54d2339c9f87f702e02e571a5460e11c19e1c02f

commit r15-1639-g54d2339c9f87f702e02e571a5460e11c19e1c02f
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]
    
    The test was too optimistic, alas.  We used to vectorize shifts by
    clamping the shift counts below the bit width of the types (e.g. at 15
    for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
    well defined (because of promotion to 32-bit int) and must yield 0,
    not 1 (as before the fix).
    
    Unfortunately, in the gimple model of vector units, such large shift
    counts wouldn't be well-defined, so we won't vectorize such shifts any
    more, unless we can tell they're in range or undefined.
    
    So the test that expected the vectorization we no longer performed
    needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
    suggested annotating the test with the expected ranges so as to enable
    the optimization, and Christophe Lyon suggested a further
    simplification.
    
    
    Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
    
    for  gcc/testsuite/ChangeLog
    
            PR tree-optimization/113281
            * gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
Comment 35 GCC Commits 2024-08-20 15:12:10 UTC
The releases/gcc-14 branch has been updated by Richard Ball <ricbal02@gcc.gnu.org>:

https://gcc.gnu.org/g:25812d8b789748911e800a972e5a3a030e3ac905

commit r14-10605-g25812d8b789748911e800a972e5a3a030e3ac905
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]
    
    The test was too optimistic, alas.  We used to vectorize shifts by
    clamping the shift counts below the bit width of the types (e.g. at 15
    for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
    well defined (because of promotion to 32-bit int) and must yield 0,
    not 1 (as before the fix).
    
    Unfortunately, in the gimple model of vector units, such large shift
    counts wouldn't be well-defined, so we won't vectorize such shifts any
    more, unless we can tell they're in range or undefined.
    
    So the test that expected the vectorization we no longer performed
    needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
    suggested annotating the test with the expected ranges so as to enable
    the optimization, and Christophe Lyon suggested a further
    simplification.
    
    Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
    
    for  gcc/testsuite/ChangeLog
    
            PR tree-optimization/113281
            * gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
    
    (cherry picked from commit 54d2339c9f87f702e02e571a5460e11c19e1c02f)
Comment 36 GCC Commits 2024-08-21 13:21:59 UTC
The releases/gcc-13 branch has been updated by Richard Ball <ricbal02@gcc.gnu.org>:

https://gcc.gnu.org/g:3e5cf9f060f39a958edf4b817f632ee93e96d55c

commit r13-8985-g3e5cf9f060f39a958edf4b817f632ee93e96d55c
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]
    
    The test was too optimistic, alas.  We used to vectorize shifts by
    clamping the shift counts below the bit width of the types (e.g. at 15
    for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
    well defined (because of promotion to 32-bit int) and must yield 0,
    not 1 (as before the fix).
    
    Unfortunately, in the gimple model of vector units, such large shift
    counts wouldn't be well-defined, so we won't vectorize such shifts any
    more, unless we can tell they're in range or undefined.
    
    So the test that expected the vectorization we no longer performed
    needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
    suggested annotating the test with the expected ranges so as to enable
    the optimization, and Christophe Lyon suggested a further
    simplification.
    
    Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
    
    for  gcc/testsuite/ChangeLog
    
            PR tree-optimization/113281
            * gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
    
    (cherry picked from commit 54d2339c9f87f702e02e571a5460e11c19e1c02f)
Comment 37 GCC Commits 2024-08-21 13:23:00 UTC
The releases/gcc-12 branch has been updated by Richard Ball <ricbal02@gcc.gnu.org>:

https://gcc.gnu.org/g:881b54f56e77dc470d27e4746b90dc7819c2be81

commit r12-10680-g881b54f56e77dc470d27e4746b90dc7819c2be81
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]
    
    The test was too optimistic, alas.  We used to vectorize shifts by
    clamping the shift counts below the bit width of the types (e.g. at 15
    for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
    well defined (because of promotion to 32-bit int) and must yield 0,
    not 1 (as before the fix).
    
    Unfortunately, in the gimple model of vector units, such large shift
    counts wouldn't be well-defined, so we won't vectorize such shifts any
    more, unless we can tell they're in range or undefined.
    
    So the test that expected the vectorization we no longer performed
    needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
    suggested annotating the test with the expected ranges so as to enable
    the optimization, and Christophe Lyon suggested a further
    simplification.
    
    Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>
    
    for  gcc/testsuite/ChangeLog
    
            PR tree-optimization/113281
            * gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
    
    (cherry picked from commit 54d2339c9f87f702e02e571a5460e11c19e1c02f)