In this case, LLVM succeeds at auto-vectorization but GCC fails: https://godbolt.org/z/1x68jo36d
Please read https://gcc.gnu.org/bugs/ on how to correctly report a bug.
I think we can do conditional reduction but maybe not for variable-length vectors? Otherwise I'm sure there's a duplicate bug about this.
(In reply to Richard Biener from comment #2)
> I think we can do conditional reduction but maybe not for variable-length
> vectors? Otherwise I'm sure there's a duplicate bug about this.

No, GCC still cannot vectorize it when I specify fixed-length vectors
(-msve-vector-bits=512). If it's a known issue, do you have any suggestions?
I am willing to fix it once I have finished the RVV auto-vectorization work.
Created attachment 54635 [details] testcase
GCC does vectorize:

  if (a[i] > b[i])
    result += a[i];

even for variable-length vectors. It's just that we don't vectorize when
there is an extra conditional operation. GCC will even vectorize this with
variable-length vectors:

#include <stdint.h>

uint64_t
single_loop_with_if_condition (uint64_t *restrict a, uint64_t *restrict b,
                               uint64_t *restrict c, int loop_size)
{
  uint64_t result = 0;
  for (int i = 0; i < loop_size; i++)
    {
      c[i] = a[i] + 1;
      if (a[i] > b[i])
        result += c[i];
    }
  return result;
}
After investigation: GCC fails to vectorize a reduction with multiple
conditional operations.

ifcvt dump:

  # result_20 = PHI <result_9(8), 0(18)>
  ...
  _11 = result_20 + 10;
  result_17 = _4 + _11;
  _23 = _4 > _7;
  result_9 = _23 ? result_17 : result_20;

It's odd that GCC fails to vectorize this, since these are not complicated
statements. LLVM vectorizes them into:

  vector_ssa_2 = <vector_ssa_result, 0>
  ...
  vector_ssa_1 = vector_ssa_2 + 10;
  vector_ssa_3 = vector_ssa_1 + 10;
  mask_ssa_1 = vector_ssa_4 > vector_ssa_5;
  vector_ssa_result = select <mask_ssa_1, vector_ssa_3, vector_ssa_2>

I think GCC should be able to vectorize it the same way:

  vector_ssa_2 = <vector_ssa_result, 0>
  ...
  vector_ssa_1 = vector_ssa_2 + 10;
  vector_ssa_3 = vector_ssa_1 + 10;
  mask_ssa_1 = vector_ssa_4 > vector_ssa_5;
  vector_ssa_result = VCOND_MASK <mask_ssa_1, vector_ssa_3, vector_ssa_2>

I saw this code disable the vectorization:

      else if (!bbs.is_empty ()
               && bb->loop_father->header == bb
               && bb->loop_father->dont_vectorize)
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "splitting region at dont-vectorize loop %d "
                             "entry at bb%d\n",
                             bb->loop_father->num, bb->index);
          split = true;
        }

I am not familiar with this code; any ideas? Thanks.
Update on the analysis: we fail to recognize this as a reduction because the
PHI result has 2 uses:

  # result_20 = PHI <result_9(8), 0(18)>
  ...
  _11 = result_20 + 10;                      --------> first use
  result_17 = _4 + _11;
  _23 = _4 > _7;
  result_9 = _23 ? result_17 : result_20;    --------> second use

It seems to be an if-conversion issue that makes the loop vectorizer fail to
vectorize it. I have checked the LLVM implementation; their "result" SSA name
always has a single use no matter how I modify the code (for example,
result += a[i] + b[i] + c[i]).
It's because of the order of the operations we perform. For code like:

  result += mask ? a[i] + x : 0;

GCC:

  result_ssa_1 = PHI <result_ssa_2, 0>
  ...
  STMT 1. tmp = a[i] + x;
  STMT 2. tmp2 = tmp + result_ssa_1;
  STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;

Here both STMT 2 and STMT 3 use 'result_ssa_1', so we end up with 2 uses of
the PHI result. Then we fail to vectorize.

Whereas LLVM:

  result_ssa_1 = PHI <result_ssa_2, 0>
  ...
  IR 1. tmp = a[i] + x;
  IR 2. tmp2 = mask ? tmp : 0;
  IR 3. result_ssa_2 = tmp2 + result_ssa_1;

LLVM has only 1 use.

Is it reasonable to swap the order in match.pd?
(In reply to JuzheZhong from comment #8)
> It's because of the order of the operations we perform. For code like:
>
>   result += mask ? a[i] + x : 0;
>
> GCC:
>   result_ssa_1 = PHI <result_ssa_2, 0>
>   ...
>   STMT 1. tmp = a[i] + x;
>   STMT 2. tmp2 = tmp + result_ssa_1;
>   STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
>
> Here both STMT 2 and STMT 3 use 'result_ssa_1', so we end up with 2 uses
> of the PHI result. Then we fail to vectorize.
>
> Whereas LLVM:
>
>   result_ssa_1 = PHI <result_ssa_2, 0>
>   ...
>   IR 1. tmp = a[i] + x;
>   IR 2. tmp2 = mask ? tmp : 0;
>   IR 3. result_ssa_2 = tmp2 + result_ssa_1;

For floating point these are not equivalent (adding zero isn't a no-op).

> LLVM has only 1 use.
>
> Is it reasonable to swap the order in match.pd?

if-conversion could be taught to swap this (it's if-conversion that creates
the IL for conditional reductions) when valid. IIRC Robin Dapp also has a
patch to make if-conversion emit .COND_ADD instead, which should make it
even better to vectorize.
(In reply to Richard Biener from comment #9)
> (In reply to JuzheZhong from comment #8)
> > [...]
>
> For floating point these are not equivalent (adding zero isn't a no-op).

Yes, I agree they are not equivalent for floating point. But they are
equivalent if we specify -ffast-math.

I have double-checked LLVM: it also fails to vectorize a conditional
floating-point reduction by default. However, with -ffast-math LLVM
generates the same if-conversion IR sequence as for integers, and
vectorization succeeds.

> > LLVM has only 1 use.
> >
> > Is it reasonable to swap the order in match.pd?
>
> if-conversion could be taught to swap this (it's if-conversion that
> creates the IL for conditional reductions) when valid. IIRC Robin Dapp
> also has a patch to make if-conversion emit .COND_ADD instead, which
> should make it even better to vectorize.

I knew about that patch; Robin is trying to fix the (in-order reduction)
issue that I posted. I have confirmed that the patch can't help here, since
it doesn't change the code for this case: we still end up with multiple uses
in the conditional reduction. The reduction recognition fails at:

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used ouside of the inner loop we cannot handle uses of the
     reduction value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "reduction used in loop.\n");
      return NULL;
    }

When nphi_def_loop_uses > 1, we fail to vectorize.

I have checked the LLVM code, and I think we can extend this function:

  strip_nop_cond_scalar_reduction

We should be able to strip all the statements until we reach the use of the
PHI result. LLVM is able to handle this case:

  for (...)
    if (cond)
      result += a[i] + b[i] + c[i] + ...;

no matter how many variables are added in the conditional reduction. It
handles that by iterating over all the statements until it reaches the
result:

  result_ssa_1 = PHI <>
  tmp1 = result_ssa_1 + a[i];
  tmp2 = tmp1 + b[i];
  tmp3 = tmp2 + c[i];
  ...

We keep iterating until we find result_ssa_1 holding the reduction variable.

Is this LLVM approach reasonable for GCC? If yes, I can translate the LLVM
code into GCC. Thanks.
I don't think strip_nop_cond_scalar_reduction is the place to adjust here;
maybe it's the caller. I don't have time to dig into the specific issue
right now, but if we require scalar code adjustments then we need to perform
those in if-conversion.

But to me it looks like allowing

> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;

in vect_is_simple_reduction might also be a reasonable approach. The use in
the COND_EXPR isn't really a use - it's a conditional update.
(In reply to Richard Biener from comment #11)
> I don't think strip_nop_cond_scalar_reduction is the place to adjust here,
> maybe it's the caller. I don't have time to dig into the specific issue
> right now but if we require scalar code adjustments then we need to
> perform those in if-conversion.
>
> But to me it looks like allowing
>
> > > STMT 1. tmp = a[i] + x;
> > > STMT 2. tmp2 = tmp + result_ssa_1;
> > > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
>
> in vect_is_simple_reduction might also be a reasonable approach. The
> use in the COND_EXPR isn't really a use - it's a conditional update.

Thanks, Richi. Enhancing vect_is_simple_reduction in the loop vectorizer is
also a good approach, but I think it's better to recognize the scalar
conditional reduction as early as possible. Obviously, the current
if-conversion fails to recognize it as a feasible conditional reduction.
Enhancing vect_is_simple_reduction would be the approach if it turns out we
cannot simplify the scalar code in if-conversion to fit the current loop
vectorizer. I believe we will eventually have to enhance both if-conversion
and the loop vectorizer. For now I prefer improving if-conversion and will
work on that. Will keep you posted. Thanks a lot!
Hi, Richi. This is my draft approach to enhance finding more potential
conditional reductions.

diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index a8c915913ae..c25d2038f16 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -1790,8 +1790,72 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc, tree arg_0, tree arg_1,
       std::swap (r_op1, r_op2);
       std::swap (r_nop1, r_nop2);
     }
-  else if (r_nop1 != PHI_RESULT (header_phi))
-    return false;
+  else if (r_nop1 == PHI_RESULT (header_phi))
+    ;
+  else
+    {
+      /* Analyze the statement chain of STMT so that we can generate a
+         better if-conversion code sequence.  We are trying to catch the
+         following situation:
+
+           loop-header:
+           reduc_1 = PHI <..., reduc_2>
+           ...
+           if (...)
+             tmp1 = reduc_1 + rhs1;
+             tmp2 = tmp1 + rhs2;
+             tmp3 = tmp2 + rhs3;
+             ...
+             reduc_3 = tmpN-1 + rhsN-1;
+
+           reduc_2 = PHI <reduc_1, reduc_3>
+
+         and convert it to
+
+           reduc_2 = PHI <0, reduc_1>
+           tmp1 = rhs1 + rhs2;
+           tmp2 = tmp1 + rhs3;
+           tmp3 = tmp2 + rhs4;
+           ...
+           tmpN-1 = tmpN-2 + rhsN;
+           ifcvt = cond_expr ? tmpN-1 : 0
+           reduc_1 = tmpN-1 +/- ifcvt;  */
+      if (num_imm_uses (PHI_RESULT (header_phi)) != 2)
+        return false;
+      FOR_EACH_IMM_USE_FAST (use_p, imm_iter, PHI_RESULT (header_phi))
+        {
+          gimple *use_stmt = USE_STMT (use_p);
+          if (is_gimple_assign (use_stmt))
+            {
+              if (gimple_assign_rhs_code (use_stmt) != reduction_op)
+                return false;
+              if (TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME)
+                return false;
+
+              bool visited_p = false;
+              while (!visited_p)
+                {
+                  use_operand_p use;
+                  if (!single_imm_use (gimple_assign_lhs (use_stmt), &use,
+                                       &use_stmt)
+                      || gimple_bb (use_stmt) != gimple_bb (stmt)
+                      || !is_gimple_assign (use_stmt)
+                      || TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME
+                      || gimple_assign_rhs_code (use_stmt) != reduction_op)
+                    return false;
+
+                  if (gimple_assign_lhs (use_stmt) == gimple_assign_lhs (stmt))
+                    {
+                      r_op2 = r_op1;
+                      r_op1 = PHI_RESULT (header_phi);
+                      visited_p = true;
+                    }
+                }
+            }
+          else if (use_stmt != phi)
+            return false;
+        }
+    }

My approach does the check as follows:

  tmp1 = reduc_1 + rhs1;
  tmp2 = tmp1 + rhs2;
  tmp3 = tmp2 + rhs3;
  ...
  reduc_3 = tmpN-1 + rhsN-1;

Start the iteration from "tmp1 = reduc_1 + rhs1;" and walk the chain until
"reduc_3 = tmpN-1 + rhsN-1;", making sure each statement is a PLUS_EXPR for
the reduction sum.

Does it look reasonable? It succeeds on vectorization.
Sorry for the delay - but this looks exactly like Robin's transform to
COND_ADD, no? But sure, the current code doesn't handle a reduction path
through multiple stmts; still, when if-conversion converts the final add to
a COND_ADD, it should be a matter of teaching
tree-vect-loop.cc:check_reduction_path about conditional operations?
Hi, Richard. I've confirmed that Robin's patch doesn't help with this issue.

The root cause is that we fail to recognize this as a vectorizable
conditional reduction, i.e. is_cond_scalar_reduction returns false.

The following patch, which bootstraps and passes regtest on x86 and also
passes on aarch64, enhances if-conversion's conditional reduction
recognition:

diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index a8c915913ae..2bdd3710a65 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -1784,14 +1784,119 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc, tree arg_0, tree arg_1,
     r_nop2 = strip_nop_cond_scalar_reduction (*has_nop, r_op2);

   /* Make R_OP1 to hold reduction variable.  */
+  gimple *nonphi_use_stmt = NULL;
   if (r_nop2 == PHI_RESULT (header_phi)
       && commutative_tree_code (reduction_op))
     {
       std::swap (r_op1, r_op2);
       std::swap (r_nop1, r_nop2);
     }
-  else if (r_nop1 != PHI_RESULT (header_phi))
-    return false;
+  else if (r_nop1 == PHI_RESULT (header_phi))
+    ;
+  else
+    {
+      /* Analyze the statement chain of STMT so that we can generate a
+         better if-conversion code sequence.  We are trying to catch the
+         following situation:
+
+           loop-header:
+           reduc_1 = PHI <..., reduc_2>
+           ...
+           if (...)
+             tmp1 = reduc_1 + rhs1;
+             tmp2 = tmp1 + rhs2;
+             tmp3 = tmp2 + rhs3;
+             ...
+             reduc_3 = tmpN-1 + rhsN-1;
+
+           reduc_2 = PHI <reduc_1, reduc_3>
+
+         and convert it to
+
+           reduc_2 = PHI <0, reduc_1>
+           tmp1 = rhs1;
+           tmp2 = tmp1 + rhs2;
+           tmp3 = tmp2 + rhs3;
+           ...
+           reduc_3 = tmpN-1 + rhsN-1;
+           ifcvt = cond_expr ? reduc_3 : 0;
+           reduc_1 = reduc_1 +/- ifcvt;  */
+      if (num_imm_uses (PHI_RESULT (header_phi)) != 2)
+        return false;
+      if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+          && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+               && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
+               && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
+               && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))
+        return false;
+      gimple *prev_use_stmt, *curr_use_stmt;
+      use_operand_p use;
+      FOR_EACH_IMM_USE_FAST (use_p, imm_iter, PHI_RESULT (header_phi))
+        {
+          prev_use_stmt = curr_use_stmt = USE_STMT (use_p);
+          if (is_gimple_assign (curr_use_stmt))
+            {
+              if (TREE_CODE (gimple_assign_lhs (curr_use_stmt)) != SSA_NAME)
+                return false;
+              if (*has_nop)
+                {
+                  if (!CONVERT_EXPR_CODE_P (
+                        gimple_assign_rhs_code (curr_use_stmt)))
+                    return false;
+                }
+              else
+                {
+                  if (gimple_assign_rhs_code (curr_use_stmt) != reduction_op)
+                    return false;
+                }
+
+              bool visited_p = false;
+              nonphi_use_stmt = curr_use_stmt;
+              while (!visited_p)
+                {
+                  if (!single_imm_use (gimple_assign_lhs (prev_use_stmt),
+                                       &use, &curr_use_stmt)
+                      || gimple_bb (curr_use_stmt) != gimple_bb (stmt)
+                      || !is_gimple_assign (curr_use_stmt)
+                      || TREE_CODE (gimple_assign_lhs (curr_use_stmt))
+                           != SSA_NAME
+                      || gimple_assign_rhs_code (curr_use_stmt)
+                           != reduction_op)
+                    return false;
+                  if (curr_use_stmt == stmt)
+                    {
+                      if (*has_nop)
+                        {
+                          if (!single_imm_use (gimple_assign_lhs (
+                                                 nonphi_use_stmt),
+                                               &use, &curr_use_stmt))
+                            return false;
+                          r_op1 = gimple_assign_lhs (nonphi_use_stmt);
+                          r_nop1 = PHI_RESULT (header_phi);
+                          nonphi_use_stmt = curr_use_stmt;
+                        }
+                      else
+                        r_op1 = PHI_RESULT (header_phi);
+
+                      if (*has_nop)
+                        {
+                          if (!single_imm_use (gimple_assign_lhs (stmt),
+                                               &use, &curr_use_stmt))
+                            return false;
+                          r_op2 = gimple_assign_lhs (stmt);
+                          r_nop2 = gimple_assign_lhs (curr_use_stmt);
+                        }
+                      else
+                        r_op2 = gimple_assign_lhs (stmt);
+                      visited_p = true;
+                    }
+                  else
+                    prev_use_stmt = curr_use_stmt;
+                }
+            }
+          else if (curr_use_stmt != phi)
+            return false;
+        }
+    }

   if (*has_nop)
     {
@@ -1816,12 +1921,41 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc, tree arg_0, tree arg_1,
        continue;
       if (use_stmt == stmt)
        continue;
+      if (use_stmt == nonphi_use_stmt)
+       continue;
       if (gimple_code (use_stmt) != GIMPLE_PHI)
        return false;
     }

   *op0 = r_op1;
   *op1 = r_op2;
   *reduc = stmt;
+
+  if (nonphi_use_stmt)
+    {
+      /* Transform:
+
+           if (...)
+             tmp1 = reduc_1 + rhs1;
+             tmp2 = tmp1 + rhs2;
+             tmp3 = tmp2 + rhs3;
+
+         into:
+
+           tmp1 = rhs1 + 0;  ---> We replace reduc_1 with '0'
+           tmp2 = tmp1 + rhs2;
+           tmp3 = tmp2 + rhs3;
+           ...
+           reduc_3 = tmpN-1 + rhsN-1;
+           ifcvt = cond_expr ? reduc_3 : 0;  */
+      gcc_assert (gimple_assign_rhs_code (nonphi_use_stmt) == reduction_op);
+      if (gimple_assign_rhs1 (nonphi_use_stmt) == r_op1)
+        gimple_assign_set_rhs1 (nonphi_use_stmt,
+                                build_zero_cst (TREE_TYPE (r_op1)));
+      else if (gimple_assign_rhs2 (nonphi_use_stmt) == r_op1)
+        gimple_assign_set_rhs2 (nonphi_use_stmt,
+                                build_zero_cst (TREE_TYPE (r_op1)));
+      update_stmt (nonphi_use_stmt);
+    }
   return true;
 }

@@ -1886,12 +2020,17 @@ convert_scalar_cond_reduction (gimple *reduc, gimple_stmt_iterator *gsi,
       gsi_remove (&stmt_it, true);
       release_defs (nop_reduc);
     }
+  gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);

   /* Delete original reduction stmt.  */
-  stmt_it = gsi_for_stmt (reduc);
-  gsi_remove (&stmt_it, true);
-  release_defs (reduc);
+  if (op1 != gimple_assign_lhs (reduc))
+    {
+      stmt_it = gsi_for_stmt (reduc);
+      gsi_remove (&stmt_it, true);
+      release_defs (reduc);
+    }
+
   return rhs;

I have fully tested it with different kinds of conditional reductions; all
of them can be vectorized. I am not sure whether it is a feasible solution.
It's not exactly clear what the transform you propose is. It looks like a
re-association, but the SSA names do not match up, and the transform only
replaces a single op with 0!?
Sorry for the confusion and the lack of information. I am trying to
transform:

  reduc_1 = PHI <..., reduc_2>
  ...
  if (...)
    tmp1 = reduc_1 + rhs1;
    tmp2 = tmp1 + rhs2;
    tmp3 = tmp2 + rhs3;
    ...
    reduc_3 = tmpN-1 + rhsN-1;

  reduc_2 = PHI <reduc_1, reduc_3>

First, transform the first statement:

  tmp1 = reduc_1 + rhs1;

into

  tmp1 = rhs1 + 0;

It then becomes a trivial copy, tmp1 = rhs1, which a later pass will
eliminate.

Second, transform the reduction PHI:

  reduc_1 = PHI <..., reduc_2>

into the if-converted statement:

  reduc_1 = PHI <_ifc__35(8), 0(18)>

Third, transform

  reduc_3 = tmpN-1 + rhsN-1;

  reduc_2 = PHI <reduc_1, reduc_3>

into:

  reduc_3 = tmpN-1 + rhsN-1;
  _ifc__35 = .COND_ADD (condition, reduc_1, reduc_3, reduc_1);

So finally:

  result_1 = PHI <_ifc__35(8), 0(18)>
  ...
  tmp1 = rhs1;
  tmp2 = tmp1 + rhs2;
  tmp3 = tmp2 + rhs3;
  ...
  reduc_3 = tmpN-1 + rhsN-1;
  _ifc__35 = .COND_ADD (condition, reduc_1, reduc_3, reduc_1);
Ah, so this is for

  int foo (int *a, int x)
  {
    int sum = 0;
    for (int i = 0; i < 1024; ++i)
      if (a[i] < 10)
        sum = sum + a[i] + x;
    return sum;
  }

transforming it to

  int foo (int *a, int x)
  {
    int sum = 0;
    for (int i = 0; i < 1024; ++i)
      {
        tem = a[i] + x;
        if (a[i] < 10)
          sum = sum + tem;
      }
    return sum;
  }

Note this is re-associating and thus not always valid. It also executes
stmts unconditionally (also not always valid; for FP, spurious exceptions
might be raised).
I have added this check for floating point:

+      if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+          && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+               && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
+               && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
+               && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))
+        return false;

I fail to see which situation would cause FP exceptions?
On Wed, 15 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
>
> --- Comment #19 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> I have added:
>
> +      if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))

For TYPE_OVERFLOW_UNDEFINED you have to convert the ops to unsigned to
avoid spurious undefined overflow.

> +          && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
> +               && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
> +               && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
> +               && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))

You should check flag_associative_math, which covers one half, and
!flag_trapping_math, which covers spurious FP exceptions.

> +        return false;
>
> for floating-point. I failed to see which situation will cause FP exceptions ?

Ops with NaN raise INVALID, but there's also INEXACT, which can be set
differently after re-association.
Thanks, Richi. Is re-associating (while avoiding the spurious exceptions) in
if-conversion a reasonable approach? If yes, I am going to revisit this PR
for GCC 15 and refine the code.
On Thu, 16 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
>
> --- Comment #21 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Thanks Richi.
>
> Does re-associating (with eliminating exceptions) in if-convert is a
> reasonable approach ?

Yeah, I don't think we have a much better place at the moment.