Testcase:

unsigned char a;

int main() {
  short b = a = 0;
  for (; a != 19; a++)
    if (a)
      b = 32872 >> a;

  if (b == 0)
    return 0;
  else
    return 1;
}

Commands:

rv64gcv_zvl256b:
> /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -march=rv64gcv_zvl256b -O3 red.c -o user-config.out
> QEMU_CPU=rv64,vlen=256,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/qemu-riscv64 user-config.out
> echo $?
1

rv64gc:
> /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -march=rv64gc -O3 red.c -o rv64gc.out
> QEMU_CPU=rv64,vlen=256,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-8-trunk/build-rv64gcv/bin/qemu-riscv64 rv64gc.out
> echo $?
0
Godbolt: https://godbolt.org/z/efPhqzcr5
Confirmed.  Funny, we shouldn't vectorize this at all but really optimize it
to "return 0".  The costing might be questionable, but we also haven't
optimized away the loop when comparing costs.  Disregarding that, the
vectorization should of course be correct.  The vect output doesn't really
make sense to me, but I haven't looked very closely yet:

  _177 = .SELECT_VL (2, POLY_INT_CST [16, 16]);
  vect_patt_82.18_166 = (vector([16,16]) unsigned short) { 17, 18, 19, ... };
  vect_patt_84.19_168 = MIN_EXPR <vect_patt_82.18_166, { 15, ... }>;
  vect_patt_85.20_170 = { 32872, ... } >> vect_patt_84.19_168;
  vect_patt_87.21_171 = VIEW_CONVERT_EXPR<vector([16,16]) short intD.12>(vect_patt_85.20_170);
  _173 = _177 + 18446744073709551615;
  # RANGE [irange] short int [0, 16436] MASK 0x7fff VALUE 0x0
  _174 = .VEC_EXTRACT (vect_patt_87.21_171, _173);

vect_patt_85.20_170 should be all zeros, and then we'd just vec_extract a 0
and return that.  However, 32872 >> 15 == 1, so we return 1.
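To make the mismatch concrete, here is a minimal scalar sketch (an
illustration only, not part of the original report) comparing the
promoted-int shift the source requires with the MIN-clamped count the
vectorizer used; the loop's final store happens at a == 18, where the two
disagree:

  #include <stdio.h>

  int main(void)
  {
    for (unsigned a = 1; a != 19; a++)
      {
        int expected = 32872 >> a;                 /* source semantics: shift done in int */
        int clamped = 32872 >> (a <= 15 ? a : 15); /* what MIN_EXPR <a, 15> computes */
        if (expected != clamped)
          printf ("a = %2u: expected %d, clamped %d\n", a, expected, clamped);
      }
    return 0;
  }

This prints mismatches for a = 16, 17 and 18: the promoted shift yields 0,
while the clamped shift yields 32872 >> 15 == 1.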
I think there are 2 issues here:

1. We should adjust the cost model so that the loop vectorizer eliminates
   the unprofitable vectorization.  This should be done in the RISC-V
   backend.
2. We should fix the run-fail bug with -fno-vect-cost-model.

I think the first issue is simpler than the second one.  And I highly
suspect that the second one is not a RISC-V bug but a middle-end bug.
Confirmed reduced case:

#include <assert.h>
unsigned char a;

int main() {
  short b = a = 0;
  for (; a != 19; a++)
    if (a)
      b = 32872 >> a;

  assert (b == 0);
}

with -fno-vect-cost-model -march=rv64gcv -O3:

https://godbolt.org/z/joGb3e9Eb

It also fails at run time:

  assertion "b == 0" failed: file "bug.c", line 10, function: main

I suspect ARM SVE has the same failure.

Hi, Andrew.  Could you test this case on ARM to see whether ARM has the same
issue as RISC-V for me?
(In reply to JuzheZhong from comment #4)
> Hi, Andrew.  Could you test this case on ARM to see whether ARM has the
> same issue as RISC-V for me?

The vect dump tree is quite similar between ARM and RISC-V.
Confirmed on aarch64 too:

[apinski@xeond2 upstream-full-cross]$ ./install/bin/aarch64-linux-gnu-gcc -O3 -march=armv9-a+sve2 t.c -static -fno-vect-cost-model
[apinski@xeond2 upstream-full-cross]$ ./install-qemu/bin/qemu-aarch64 a.out ; echo $?
1
> short b = a = 0;

s/short/int/ allows it to work ...
(In reply to Andrew Pinski from comment #7)
> s/short/int/ allows it to work ...

Thanks Andrew.  I think we should change the title: it should be a wrong-code
mismatch on both RISC-V and ARM SVE with -fno-vect-cost-model.
(In reply to JuzheZhong from comment #8)
> It should be a wrong-code mismatch on both RISC-V and ARM SVE with
> -fno-vect-cost-model.

Changed.  I also checked that GCC 13 didn't vectorize the loop for aarch64
SVE either, so it is a regression.
  vect_patt_84.19_168 = MIN_EXPR <vect_patt_82.18_166, { 15, ... }>;

this one is new, but I think there's an existing open bug for the "widening"
pattern recognition failing to make sure the shift argument is kept in range.
Or maybe that was the same issue that spurred the MIN above (but IIRC that
was an "undefined behavior" thing, not wrong code).
The master branch has been updated by Pan Li <panli@gcc.gnu.org>:

https://gcc.gnu.org/g:0acb63670bf1058fce00a75bd318c40be3bfa222

commit r14-7183-g0acb63670bf1058fce00a75bd318c40be3bfa222
Author: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Date:   Fri Jan 12 17:28:44 2024 +0800

    RISC-V: Adjust scalar_to_vec cost

    1. Introduce vector regmove new tune info.
    2. Adjust scalar_to_vec cost in add_stmt_cost.

    We will get optimal codegen after this patch with -march=rv64gcv_zvl256b:

            lui     a5,%hi(a)
            li      a4,19
            sb      a4,%lo(a)(a5)
            li      a0,0
            ret

    Tested on both RV32/RV64, no regression.  Ok for trunk?

            PR target/113281

    gcc/ChangeLog:

            * config/riscv/riscv-protos.h (struct regmove_vector_cost): New struct.
            (struct cpu_vector_cost): Add regmove struct.
            (get_vector_costs): Export as global.
            * config/riscv/riscv-vector-costs.cc (adjust_stmt_cost): Adjust scalar_to_vec cost.
            (costs::add_stmt_cost): Ditto.
            * config/riscv/riscv.cc (get_common_costs): Export global function.

    gcc/testsuite/ChangeLog:

            * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-2.c: New test.
The trunk branch has been updated by Lehua Ding <lhtin@gcc.gnu.org>:

https://gcc.gnu.org/g:405096f908e1ceb0d6a1b5420ded20ad85ddae9e

commit r14-7244-g405096f908e1ceb0d6a1b5420ded20ad85ddae9e
Author: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Date:   Mon Jan 15 09:22:40 2024 +0800

    RISC-V: Adjust loop len by costing 1 when NITER < VF

    Rebase in v3: Rebase to the trunk and commit it as it's approved by Robin.
    Update in v2: Add dynamic lmul test.

    This patch fixes the regression between GCC 13.2.0 and trunk GCC (GCC-14).

    GCC 13.2.0:

            lui     a5,%hi(a)
            li      a4,19
            sb      a4,%lo(a)(a5)
            li      a0,0
            ret

    Trunk GCC:

            vsetvli a5,zero,e8,mf2,ta,ma
            li      a4,-32768
            vid.v   v1
            vsetvli zero,zero,e16,m1,ta,ma
            addiw   a4,a4,104
            vmv.v.i v3,15
            lui     a1,%hi(a)
            li      a0,19
            vsetvli zero,zero,e8,mf2,ta,ma
            vadd.vi v1,v1,1
            sb      a0,%lo(a)(a1)
            vsetvli zero,zero,e16,m1,ta,ma
            vzext.vf2       v2,v1
            vmv.v.x v1,a4
            vminu.vv        v2,v2,v3
            vsrl.vv v1,v1,v2
            vslidedown.vi   v1,v1,17
            vmv.x.s a0,v1
            snez    a0,a0
            ret

    The root cause is that we vectorize the code inefficiently because we
    don't cost the loop len when NITERS < VF.  Leveraging the loop control
    used by mask targets or rs6000 fixes the regression.

    Tested, no regression.  Ok for trunk?

            PR target/113281

    gcc/ChangeLog:

            * config/riscv/riscv-vector-costs.cc (costs::adjust_vect_cost_per_loop): New function.
            (costs::finish_cost): Adjust cost for LOOP LEN with NITERS < VF.
            * config/riscv/riscv-vector-costs.h: New function.

    gcc/testsuite/ChangeLog:

            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-3.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-4.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113281-5.c: New test.
At least on aarch64 this is vectorized since r14-3027-gc5f673dbc252e35e6b66e9b8abd30a4027193e0b
So the issue is that we are vectorizing

  b = 32872 >> a;

as a HImode shift

  b = 32872 >> MIN (a, 15)

and that's wrong since it never yields the desired zero.  A shift value of 16
invokes undefined behavior in HImode, but the expression was originally
promoted to int and is thus fine to be shifted by even 18.  Does this mean we
cannot vectorize this without introducing undefined behavior in GIMPLE?

This was introduced by the fix for PR110838 (and followup fixes).

I wonder if it would make most sense to allow shifts by the width of the
promoted left operand, which would also help interoperability of GIMPLE with
Fortran here.  I'm not sure what the reasoning is to disallow that specific
shift amount (is there any hardware that would not compute the result as zero
or -1?  Especially for logical right shifts this seems like an odd
restriction).

I'd hate to add another shift operator with special defined behavior for this
single case, but I see no other way to fix this short of complicating the
lowering to

  a >= 16 ? 0 : 32872 >> MIN (a, 15)

(the MIN is still required to avoid requiring masking).
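A scalar sanity check of that proposed lowering (an illustrative sketch,
assuming uint16_t models the narrowed HImode element; not from the comment):

  #include <assert.h>
  #include <stdint.h>

  int main(void)
  {
    for (unsigned a = 0; a < 32; a++)
      {
        uint16_t reference = (uint16_t)(32872 >> a);  /* shift in promoted int */
        /* The suggested fix-up: guard the narrowed, clamped shift.  In vector
           code both arms are computed, which is why the MIN is still needed
           there even though it is redundant under the guard here.  */
        uint16_t narrowed =
          a >= 16 ? 0 : (uint16_t)((uint16_t)32872 >> (a <= 15 ? a : 15));
        assert (reference == narrowed);
      }
    return 0;
  }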
(In reply to Richard Biener from comment #14)
> a >= 16 ? 0 : 32872 >> MIN (a, 15) (the MIN is still required to
> avoid requiring masking).

Note maybe instead of MIN here we could use `a & 0xf`, since that will more
likely be cheaper than a MIN.
(In reply to Andrew Pinski from comment #15)
> Note maybe instead of MIN here we could use `a & 0xf`, since that will more
> likely be cheaper than a MIN.

But it's incorrect (that's what I did originally).
(In reply to rguenther@suse.de from comment #16)
> But it's incorrect (that's what I did originally).

But `(a >= 16) ? 0 : (32872 >> (a & 0xf))` is correct.

So is `(a >= 16 ? 0 : 32872) >> (a & 0xf)`.

Or is it that you want to avoid the conditional here?
(In reply to Andrew Pinski from comment #17)
> But `(a >= 16) ? 0 : (32872 >> (a & 0xf))` is correct.
>
> So is `(a >= 16 ? 0 : 32872) >> (a & 0xf)`.
>
> Or is it that you want to avoid the conditional here?

I believe the thing is that Richi's PR110838 changes for RSHIFT_EXPR are
correct for arithmetic right shifts, and we should keep doing what they do
for those.  But they are not correct for logical right shifts, and they
don't handle left shifts.  Both logical right shifts and left shifts need to
yield 0 for shift counts bigger than new_precision - 1, which can be
achieved through one of (or << instead of >>):

  (op1 >= min_precision ? 0 : op0) >> (op1 & (min_precision - 1))
  (op0 & (op1 < min_precision ? -1 : 0)) >> (op1 & (min_precision - 1))
  op1 >= min_precision ? 0 : (op0 >> (op1 & (min_precision - 1)))
  (op0 >> (op1 & (min_precision - 1))) & (op1 < min_precision ? -1 : 0)

Which one of these is best?  For simplicity of
vect_recog_over_widening_pattern, probably the first one; I'm unsure which
generates the best code (and where).
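As a scalar sanity check (an illustrative sketch with min_precision = 16 and
uint16_t standing in for the narrowed element type; the -1 mask is written
as 0xffff), all four forms agree with the wide logical right shift:

  #include <assert.h>
  #include <stdint.h>

  int main(void)
  {
    const unsigned prec = 16;  /* min_precision in the formulas above */
    for (uint32_t op0 = 0; op0 <= 0xffff; op0 += 0x1111)
      for (unsigned op1 = 0; op1 < 32; op1++)
        {
          uint16_t ref = (uint16_t)(op0 >> op1);  /* shift at the wide precision */
          uint16_t x = (uint16_t)op0;
          unsigned m = op1 & (prec - 1);
          uint16_t f1 = (uint16_t)((op1 >= prec ? 0 : x) >> m);
          uint16_t f2 = (uint16_t)((x & (op1 < prec ? 0xffff : 0)) >> m);
          uint16_t f3 = op1 >= prec ? 0 : (uint16_t)(x >> m);
          uint16_t f4 = (uint16_t)((x >> m) & (op1 < prec ? 0xffff : 0));
          assert (ref == f1 && ref == f2 && ref == f3 && ref == f4);
        }
    return 0;
  }

Which of these maps best to vector code is target-dependent, as the comment
notes.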
(In reply to Andrew Pinski from comment #17)
> But `(a >= 16) ? 0 : (32872 >> (a & 0xf))` is correct.
>
> So is `(a >= 16 ? 0 : 32872) >> (a & 0xf)`.
>
> Or is it that you want to avoid the conditional here?

Yeah - at some point not trying to optimize the widening/shortening is going
to be cheaper, no?  Especially if it is "just" because GIMPLE is limited,
with no way to express the desired semantics for "out-of-bound" shifts (and
no way to query target behavior for them, of course).

The widening/shortening might or might not have an effect on the VF (we
don't know), and we don't (yet) compute whether the target supports it.  We
could force a smaller vector size (if the target supports it) to avoid
increasing the VF.
(In reply to Jakub Jelinek from comment #18)
> But they are not correct for logical right shifts, and they don't handle
> left shifts.

Oh, and then there are rotates; I think we need to punt on those, those
shouldn't be narrowed.  Unless we already punt on them...
So, I think

--- gcc/tree-vect-patterns.cc.jj	2024-01-03 11:51:37.487648527 +0100
+++ gcc/tree-vect-patterns.cc	2024-01-24 14:05:55.911356481 +0100
@@ -3002,6 +3002,12 @@ vect_recog_over_widening_pattern (vec_in
   if (STMT_VINFO_DEF_TYPE (last_stmt_info) == vect_reduction_def)
     return NULL;
 
+  /* FIXME: Aren't actually most operations incompatible with the truncation?
+     Like division/modulo, ...  Shouldn't this punt on
+     !vect_truncatable_operation_p (code) or that plus some exceptions?  */
+  if (code == LROTATE_EXPR || code == RROTATE_EXPR)
+    return NULL;
+
   /* Keep the first operand of a COND_EXPR as-is: only the other two
      operands are interesting.  */
   unsigned int first_op = (code == COND_EXPR ? 2 : 1);
@@ -3169,22 +3175,35 @@ vect_recog_over_widening_pattern (vec_in
 			   op_type, &unprom[0], op_vectype);
 
   /* Limit shift operands.  */
-  if (code == RSHIFT_EXPR)
+  if (code == LSHIFT_EXPR || code == RSHIFT_EXPR)
     {
       wide_int min_value, max_value;
       if (TREE_CODE (ops[1]) == INTEGER_CST)
-	ops[1] = wide_int_to_tree (op_type,
-				   wi::umin (wi::to_wide (ops[1]),
-					     new_precision - 1));
+	{
+	  if (wi::ge_p (wi::to_wide (ops[1]), new_precision,
+			TYPE_SIGN (TREE_TYPE (ops[1]))))
+	    {
+	      if (code == LSHIFT_EXPR || unsigned_p)
+		return NULL;
+	      ops[1] = build_int_cst (TREE_TYPE (ops[1]), new_precision - 1);
+	    }
+	}
       else if (!vect_get_range_info (ops[1], &min_value, &max_value)
-	       || wi::ge_p (max_value, new_precision, TYPE_SIGN (op_type)))
+	       || wi::ge_p (max_value, new_precision,
+			    TYPE_SIGN (TREE_TYPE (ops[1]))))
 	{
+	  /* LSHIFT_EXPR or logical RSHIFT_EXPR can be handled as
+	     (op1 >= new_precision ? 0 : op0) << (op1 & (new_precision - 1))
+	     but perhaps it is too costly.  */
+	  if (code == LSHIFT_EXPR || unsigned_p)
+	    return NULL;
 	  /* ??? Note the following bad for SLP as that only supports
 	     same argument widened shifts and it un-CSEs same arguments.  */
-	  tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
+	  tree new_var = vect_recog_temp_ssa_var (TREE_TYPE (ops[1]), NULL);
 	  gimple *pattern_stmt
 	    = gimple_build_assign (new_var, MIN_EXPR, ops[1],
-				   build_int_cst (op_type, new_precision - 1));
+				   build_int_cst (TREE_TYPE (ops[1]),
+						  new_precision - 1));
 	  gimple_set_location (pattern_stmt, gimple_location (last_stmt));
 	  if (ops[1] == unprom[1].op && unprom[1].dt == vect_external_def)
 	    {

should fix it.  Given Richi's comments about the conditionals being too
costly, it instead punts on truncated left shifts or logical right shifts
unless VRP proves the shift count is in range.  I've also changed all the
op_type uses on the shift count operand to TREE_TYPE (ops[1]), because I
believe the two types can be, and most often are, different.

But, as the FIXME comment suggests, I wonder how it is that this doesn't
miscompile rotates, or say divisions/multiplications etc.  While
vect_determine_precisions_from_range and vect_determine_precisions_from_users
call vect_truncatable_operation_p, and the latter allows casts and
LSHIFT_EXPR/RSHIFT_EXPR next to it, the former seems, for
!vect_truncatable_operation_p codes, to just compute the minimum/maximum from
the ranges across all operands and the lhs.  Is that safe for arbitrary
non-truncatable operations?

Richard S., can you please have a look?  This was added in
r9-1590-g370c2ebe8fa20e0812cd2d533d4ed38ee2d37c85 I believe.
Taking, following discussion on IRC.
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:1a8261e047f7a2c2b0afb95716f7615cba718cd1

commit r14-8492-g1a8261e047f7a2c2b0afb95716f7615cba718cd1
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 29 12:33:08 2024 +0000

    vect: Tighten vect_determine_precisions_from_range [PR113281]

    This was another PR caused by the way that
    vect_determine_precisions_from_range handles shifts.  We tried to
    narrow 32768 >> x to a 16-bit shift based on range information for
    the inputs and outputs, with vect_recog_over_widening_pattern
    (after PR110828) adjusting the shift amount.  But this doesn't
    work for the case where x is in [16, 31], since then 32-bit
    32768 >> x is a well-defined zero, whereas no well-defined
    16-bit 32768 >> y will produce 0.

    We could perhaps generate x < 16 ? 32768 >> x : 0 instead,
    but since vect_determine_precisions_from_range was never really
    supposed to rely on fix-ups, it seems better to fix that instead.

    The patch also makes the code more selective about which codes
    can be narrowed based on input and output ranges.  This showed
    that vect_truncatable_operation_p was missing cases for
    BIT_NOT_EXPR (equivalent to BIT_XOR_EXPR of -1) and NEGATE_EXPR
    (equivalent to BIT_NOT_EXPR followed by a PLUS_EXPR of 1).

    pr113281-1.c is the original testcase.  pr113281-[23].c failed
    before the patch due to overly optimistic narrowing.  pr113281-[45].c
    previously passed and are meant to protect against accidental
    optimisation regressions.

    gcc/
            PR target/113281
            * tree-vect-patterns.cc (vect_recog_over_widening_pattern): Remove
            workaround for right shifts.
            (vect_truncatable_operation_p): Handle NEGATE_EXPR and
            BIT_NOT_EXPR.
            (vect_determine_precisions_from_range): Be more selective about
            which codes can be narrowed based on their input and output
            ranges.  For shifts, require at least one more bit of precision
            than the maximum shift amount.

    gcc/testsuite/
            PR target/113281
            * gcc.dg/vect/pr113281-1.c: New test.
            * gcc.dg/vect/pr113281-2.c: Likewise.
            * gcc.dg/vect/pr113281-3.c: Likewise.
            * gcc.dg/vect/pr113281-4.c: Likewise.
            * gcc.dg/vect/pr113281-5.c: Likewise.
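The shift-specific part of the new rule can be restated as a one-line
predicate (a hedged sketch with hypothetical names, not GCC's actual code;
the real check lives in vect_determine_precisions_from_range):

  #include <stdbool.h>

  /* Illustrative restatement of the rule: a shift may only be narrowed to
     NEW_PRECISION bits if that is at least one more bit than the maximum
     possible shift amount.  */
  static bool
  shift_narrowing_ok_p (unsigned new_precision, unsigned max_shift_amount)
  {
    return new_precision > max_shift_amount;
  }

  int main(void)
  {
    /* For 32872 >> a with a in [1, 18]: 16-bit narrowing is now rejected
       (16 <= 18), while the 32-bit shift remains valid (32 > 18).  */
    return shift_narrowing_ok_p (16, 18) ? 1 : 0;
  }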
Fixed on trunk so far, but it's latent on branches. I'll see what the trunk fallout is like before asking about backports.
(In reply to Richard Sandiford from comment #24)
> Fixed on trunk so far, but it's latent on branches.  I'll see what
> the trunk fallout is like before asking about backports.

It looks like we have a regression for riscv.

I was going through the scan-dump failures on trunk and ended up revisiting
https://github.com/patrick-rivos/gcc-postcommit-ci/issues/463, where
gcc.dg/vect/costmodel/riscv/rvv/pr113281-[125].c are failing the scan-dump
checks.  I didn't realize at the time that the scan dumps were checking code
correctness and ended up ignoring it.

It's still persisting on trunk (at least for pr113281-1.c:
https://godbolt.org/z/M9EK44hKe).

A bisection over the commit range of
https://github.com/patrick-rivos/gcc-postcommit-ci/issues/463 suggests that
https://gcc.gnu.org/g:1a8261e047f7a2c2b0afb95716f7615cba718cd1 introduced it.

# first bad commit: [1a8261e047f7a2c2b0afb95716f7615cba718cd1] vect: Tighten vect_determine_precisions_from_range [PR113281]

Configuration:

../configure --prefix=$(pwd) --with-multilib-generator="rv64gcv-lp64d--"
make stamps/build-gcc-linux-stage1 -j 32

Testing:

./build-gcc-linux-stage1/gcc/cc1 ../gcc/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c -march=rv64gcv -mabi=lp64d -mtune=rocket -mcmodel=medlow -fdiagnostics-plain-output -march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -ffat-lto-objects -fno-ident -o pr113281-1.s
(In reply to Edwin Lu from comment #25)
> It's still persisting on trunk (at least for pr113281-1.c:
> https://godbolt.org/z/M9EK44hKe).

I looked into what the vectorizer produces:

  vect__22.13_31 = (vector(8) int) vect_vec_iv_.12_8;
  _22 = (int) a.4_25;
  vect__12.14_33 = { 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> vect__22.13_31;
  _12 = 32872 >> _22;
  vect_b_7.15_34 = (vector(8) short int) vect__12.14_33;

That is a valid thing to do: the shift is done in `vector(8) int` and then
truncated.  The issue originally was about doing the shift in `vector(8)
short`, which is not happening here.
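In scalar terms (a minimal sketch, not taken from the dump), the now-valid
sequence is:

  /* Widen, shift at int precision, then truncate - exactly the source
     semantics of b = 32872 >> a, with a promoted to int.  */
  short
  shift_then_truncate (unsigned char a)
  {
    return (short) (32872 >> (int) a);
  }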
(In reply to Andrew Pinski from comment #26)
> That is a valid thing to do: the shift is done in `vector(8) int` and then
> truncated.  The issue originally was about doing the shift in `vector(8)
> short`, which is not happening here.

The regressed testcase looks like it's testing whether riscv vectorizes the
code at all (the first issue Juzhe noted in comment #3, which was then
fixed).  So this is a performance regression for RISC-V, not a correctness
one.
The original cost model I did worked for all cases, but with some middle-end
changes the cost model failed.  I don't have time to figure out what's going
on here.  Robin may be interested in it.
Just to document again: the test case should not be vectorized, and at some
point we will adjust the cost model so that it isn't.  I'd prefer to base
that decision on real uarchs rather than adjusting the generic cost model
right away, though.
The releases/gcc-13 branch has been updated by Richard Sandiford
<rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:2602b71103d5ef2ef86000cac832b31dad3dfe2b

commit r13-8813-g2602b71103d5ef2ef86000cac832b31dad3dfe2b
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri May 31 15:56:05 2024 +0100

    vect: Tighten vect_determine_precisions_from_range [PR113281]

    (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)
The releases/gcc-12 branch has been updated by Richard Sandiford
<rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:dfaa13455d67646805bc611aa4373728a460a37d

commit r12-10489-gdfaa13455d67646805bc611aa4373728a460a37d
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Tue Jun 4 08:47:48 2024 +0100

    vect: Tighten vect_determine_precisions_from_range [PR113281]

    (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)
The releases/gcc-11 branch has been updated by Richard Sandiford
<rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62

commit r11-11466-g95e4252f53bc0e5b66a200c611fd2c9f6f7f2a62
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Tue Jun 4 13:47:35 2024 +0100

    vect: Tighten vect_determine_precisions_from_range [PR113281]

    (cherry picked from commit 1a8261e047f7a2c2b0afb95716f7615cba718cd1)
Fixed.
The master branch has been updated by Alexandre Oliva <aoliva@gcc.gnu.org>:

https://gcc.gnu.org/g:54d2339c9f87f702e02e571a5460e11c19e1c02f

commit r15-1639-g54d2339c9f87f702e02e571a5460e11c19e1c02f
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]

    The test was too optimistic, alas.  We used to vectorize shifts by
    clamping the shift counts below the bit width of the types (e.g. at 15
    for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
    well defined (because of promotion to 32-bit int) and must yield 0,
    not 1 (as before the fix).

    Unfortunately, in the gimple model of vector units, such large shift
    counts wouldn't be well-defined, so we won't vectorize such shifts any
    more, unless we can tell they're in range or undefined.

    So the test that expected the vectorization we no longer performed
    needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
    suggested annotating the test with the expected ranges so as to enable
    the optimization, and Christophe Lyon suggested a further
    simplification.

    Co-Authored-By: Richard Earnshaw <Richard.Earnshaw@arm.com>

    for gcc/testsuite/ChangeLog

            PR tree-optimization/113281
            * gcc.target/arm/simd/mve-vshr.c: Add expected ranges.
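For reference, a range annotation of this kind typically uses the
__builtin_unreachable idiom (a generic illustration only, not necessarily
the exact change made to mve-vshr.c):

  void
  f (unsigned short *restrict a, unsigned short *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        /* Promise the compiler the shift count is below the element width,
           so the narrowed vector shift stays well defined.  */
        if (b[i] >= 16)
          __builtin_unreachable ();
        a[i] = a[i] >> b[i];
      }
  }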
The releases/gcc-14 branch has been updated by Richard Ball
<ricbal02@gcc.gnu.org>:

https://gcc.gnu.org/g:25812d8b789748911e800a972e5a3a030e3ac905

commit r14-10605-g25812d8b789748911e800a972e5a3a030e3ac905
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]

    (cherry picked from commit 54d2339c9f87f702e02e571a5460e11c19e1c02f)
The releases/gcc-13 branch has been updated by Richard Ball
<ricbal02@gcc.gnu.org>:

https://gcc.gnu.org/g:3e5cf9f060f39a958edf4b817f632ee93e96d55c

commit r13-8985-g3e5cf9f060f39a958edf4b817f632ee93e96d55c
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]

    (cherry picked from commit 54d2339c9f87f702e02e571a5460e11c19e1c02f)
The releases/gcc-12 branch has been updated by Richard Ball
<ricbal02@gcc.gnu.org>:

https://gcc.gnu.org/g:881b54f56e77dc470d27e4746b90dc7819c2be81

commit r12-10680-g881b54f56e77dc470d27e4746b90dc7819c2be81
Author: Alexandre Oliva <oliva@adacore.com>
Date:   Wed Jun 26 02:08:18 2024 -0300

    [testsuite] [arm] [vect] adjust mve-vshr test [PR113281]

    (cherry picked from commit 54d2339c9f87f702e02e571a5460e11c19e1c02f)