Bug 109088 - GCC does not always vectorize conditional reduction
Summary: GCC does not always vectorize conditional reduction
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2023-03-10 09:24 UTC by JuzheZhong
Modified: 2023-11-16 06:50 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-03-10 00:00:00


Attachments
testcase (187 bytes, text/plain)
2023-03-10 14:04 UTC, Andrew Pinski
Details

Note You need to log in before you can comment on or make changes to this bug.
Description JuzheZhong 2023-03-10 09:24:32 UTC
Such case, LLVM succeed in auto-vectorization but GCC fail.

https://godbolt.org/z/1x68jo36d
Comment 1 Uroš Bizjak 2023-03-10 10:39:15 UTC
Please read https://gcc.gnu.org/bugs/ on how to correctly report a bug.
Comment 2 Richard Biener 2023-03-10 12:39:03 UTC
I think we can do conditional reduction but maybe not for variable-length vectors?  Otherwise I'm sure there's a duplicate bug about this.
Comment 3 JuzheZhong 2023-03-10 13:16:55 UTC
(In reply to Richard Biener from comment #2)
> I think we can do conditional reduction but maybe not for variable-length
> vectors?  Otherwise I'm sure there's a duplicate bug about this.

No, GCC still can not vectorize when I specify the fixed-length vectors 
(-msve-vector-bits=512).

If it's a known issue, do you have any suggestion? I am willing to fix it when
I finished the RVV auto-vectorization.
Comment 4 Andrew Pinski 2023-03-10 14:04:45 UTC
Created attachment 54635 [details]
testcase
Comment 5 Andrew Pinski 2023-03-10 14:09:42 UTC
GCC does vectorize:
    if (a[i] > b[i]) {
      result += a[i];
    }

Even for variable-length vectors.
Just we don't vectorize when there is an extra conditional operation.

GCC will even vectorize with variable-length vectors:
#include <stdint.h>

uint64_t single_loop_with_if_condition(
        uint64_t * restrict a,
        uint64_t * restrict b,
        uint64_t * restrict c,
        int loop_size) {
  uint64_t result = 0;

  for (int i = 0; i < loop_size; i++) {
    c[i] = a[i] + 1;
    if (a[i] > b[i]) {
      result += c[i];
    }
  }
  return result;
}
Comment 6 JuzheZhong 2023-09-26 12:14:10 UTC
After investigations:

GCC failed to vectorize reduction with multiple conditional operations:

ifcvt dump:

# result_20 = PHI <result_9(8), 0(18)>
...
_11 = result_20 + 10;
result_17 = _4 + _11;
_23 = _4 > _7;
result_9 = _23 ? result_17 : result_20;

It's odd that GCC failed to vectorize it since they are not complicate statements.

In LLVM, it will vectorize them into:

vector_ssa_2 = <vector_ssa_result, 0>
...
vector_ssa_1 = vector_ssa_2 + 10;
vector_ssa_3 = vector_ssa_1 + 10;
mask_ssa_1 = vector_ssa_4 > vector_ssa_5;
vector_ssa_result = select <mask_ssa_1, vector_ssa_3, vector_ssa_2>

I think GCC should be able to vectorize it like LLVM:

vector_ssa_2 = <vector_ssa_result, 0>
...
vector_ssa_1 = vector_ssa_2 + 10;
vector_ssa_3 = vector_ssa_1 + 10;
mask_ssa_1 = vector_ssa_4 > vector_ssa_5;
vector_ssa_result = VCOND_MASK <mask_ssa_1, vector_ssa_3, vector_ssa_2>

I saw this code disable the vectorization:
      else if (!bbs.is_empty ()
	       && bb->loop_father->header == bb
	       && bb->loop_father->dont_vectorize)
	{
	  if (dump_enabled_p ())
	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
			     "splitting region at dont-vectorize loop %d "
			     "entry at bb%d\n",
			     bb->loop_father->num, bb->index);
	  split = true;
	}

I am not familiar with these codes, any ideas ? Thanks.
Comment 7 JuzheZhong 2023-09-27 02:45:12 UTC
Update the analysis:

We failed to recognize it as reduction because, the PHI result has 2 uses:

# result_20 = PHI <result_9(8), 0(18)>
...
_11 = result_20 + 10;   --------> first use
result_17 = _4 + _11;
_23 = _4 > _7;
result_9 = _23 ? result_17 : result_20;  -----> second use

It seems that it is the if-conversion issue which makes loop vectorizer failed
to vectorize it.

I have checked LLVM implementation, their "result" ssa always has single use no
matter how I modify the codes (for example, result += a[i] + b[i] + c[i]).
Comment 8 JuzheZhong 2023-09-27 02:58:51 UTC
It's because the order of the operations we are doing:

For code as follows:

result += mask ? a[i] + x : 0;

GCC:
result_ssa_1 = PHI <result_ssa_2, 0>
...
STMT 1. tmp = a[i] + x;
STMT 2. tmp2 = tmp + result_ssa_1;
STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;

Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
we end up with 2 uses of the PHI result. Then, we failed to vectorize.

Wheras LLVM:

result_ssa_1 = PHI <result_ssa_2, 0>
...
IR 1. tmp = a[i] + x;
IR 2. tmp2 = mask ? tmp : 0;
IR 3. result_ssa_2 = tmp2 + result_ssa_1.

LLVM only has 1 use.

Is it reasonable to swap the order in match.pd ?
Comment 9 Richard Biener 2023-09-27 07:15:42 UTC
(In reply to JuzheZhong from comment #8)
> It's because the order of the operations we are doing:
> 
> For code as follows:
> 
> result += mask ? a[i] + x : 0;
> 
> GCC:
> result_ssa_1 = PHI <result_ssa_2, 0>
> ...
> STMT 1. tmp = a[i] + x;
> STMT 2. tmp2 = tmp + result_ssa_1;
> STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
> 
> Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> we end up with 2 uses of the PHI result. Then, we failed to vectorize.
> 
> Wheras LLVM:
> 
> result_ssa_1 = PHI <result_ssa_2, 0>
> ...
> IR 1. tmp = a[i] + x;
> IR 2. tmp2 = mask ? tmp : 0;
> IR 3. result_ssa_2 = tmp2 + result_ssa_1.

For floating point these are not equivalent (adding zero isn't a no-op).

> LLVM only has 1 use.
> 
> Is it reasonable to swap the order in match.pd ?

if-conversion could be teached to swap this (it's if-conversion creating
the IL for conditional reductions) when valid.  IIRC Robin Dapp also has
a patch to make if-conversion emit .COND_ADD instead which should make
it even better to vectorize.
Comment 10 JuzheZhong 2023-09-27 07:34:41 UTC
(In reply to Richard Biener from comment #9)
> (In reply to JuzheZhong from comment #8)
> > It's because the order of the operations we are doing:
> > 
> > For code as follows:
> > 
> > result += mask ? a[i] + x : 0;
> > 
> > GCC:
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
> > 
> > Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> > we end up with 2 uses of the PHI result. Then, we failed to vectorize.
> > 
> > Wheras LLVM:
> > 
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > IR 1. tmp = a[i] + x;
> > IR 2. tmp2 = mask ? tmp : 0;
> > IR 3. result_ssa_2 = tmp2 + result_ssa_1.
> 
> For floating point these are not equivalent (adding zero isn't a no-op).


Yes, I agree these are not equivalent for floating-point.
But I they are equivalent if we specify -ffast-math.

I have double checked LLVM, they failed to vectorize conditionl
floating-point reduction too by default.

However, if we specify LLVM -ffast-math, it will generate the same 
if-conversion IR sequence as integer, then vectorization succeed.


> 
> > LLVM only has 1 use.
> > 
> > Is it reasonable to swap the order in match.pd ?
> 
> if-conversion could be teached to swap this (it's if-conversion creating
> the IL for conditional reductions) when valid.  IIRC Robin Dapp also has
> a patch to make if-conversion emit .COND_ADD instead which should make
> it even better to vectorize.

I knew that patch, Robin is trying fixing the issue (in-order reduction)that I posted.

I have confirm that patch can't help since it didn't modify the code for this case, we will end up with multiple use in conditional reduction.

The reduction failed since:

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used ouside of the inner loop we cannot handle uses of the reduction
     value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "reduction used in loop.\n");
      return NULL;
    }

when  nphi_def_loop_uses  > 1, we failed to vectorize.

I have checked LLVM codes, and I think we can extend this function:

strip_nop_cond_scalar_reduction

We should be able to strip all the statement until we can reach the
use of PHI result, like this:

LLVM is able to handle this case:

for ()
  if (cond)
    result += a[i] + b[i] + c[i] + .... 

No matter how many variables are added in the condition reduction.
They well handle that since they keep iterating all the statement until
reach the result:

result_ssa_1 = PHI <>
tmp1 = result_ssa_1 + a[i];
tmp2 = tmp1 + b[i];
tmp3 = tmp2 + c[i];
....

We keep iterating until find the result_ssa_1 to hold the reduction variable.

Is this LLVM's approach reasonable to GCC?

If yes, I can translate LLVM code into GCC.

Thanks.
Comment 11 Richard Biener 2023-09-27 09:06:03 UTC
I don't think strip_nop_cond_scalar_reduction is the place to adjust here, maybe it's the caller.  I don't have time to dig into the specific issue right now but if we require scalar code adjustments then we need to perform those in if-conversion.

But to me it looks like allowing

> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;

in vect_is_simple_reduction might also be a reasonable approach.  The
use in the COND_EXPR isn't really a use - it's a conditional update.
Comment 12 JuzheZhong 2023-09-27 09:27:20 UTC
(In reply to Richard Biener from comment #11)
> I don't think strip_nop_cond_scalar_reduction is the place to adjust here,
> maybe it's the caller.  I don't have time to dig into the specific issue
> right now but if we require scalar code adjustments then we need to perform
> those in if-conversion.
> 
> But to me it looks like allowing
> 
> > > STMT 1. tmp = a[i] + x;
> > > STMT 2. tmp2 = tmp + result_ssa_1;
> > > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
> 
> in vect_is_simple_reduction might also be a reasonable approach.  The
> use in the COND_EXPR isn't really a use - it's a conditional update.

Thanks Richi.

Enhancing vect_is_simple_reduction in loop vectorizer is also a good approach.
But I think it's better to recognize the scalar condition reduction (if-conversion) as early as possible. Obviously, current if-conversion failed to
recognize it as a feasible conditional reduction.

I think enhancing vect_is_simple_reduction is the approach that it's unlikely we
can simplify the scalar code in if-converison to fit current loop vectorizer.

I believe we will eventually have to enhance both if-converison and loop vectorizer in the future. And I prefer improving the if-conversion and working
on it. Will keep you posted.

Thanks a lot!
Comment 13 JuzheZhong 2023-09-27 14:11:09 UTC
Hi, Richi. This is my draft approach to enhance the finding more potential
condtional reduction.

diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index a8c915913ae..c25d2038f16 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -1790,8 +1790,72 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc, tree arg_0, tree arg_1,
       std::swap (r_op1, r_op2);
       std::swap (r_nop1, r_nop2);
     }
-  else if (r_nop1 != PHI_RESULT (header_phi))
-    return false;
+  else if (r_nop1 == PHI_RESULT (header_phi))
+    ;
+  else
+    {
+      /* Analyze the statement chain of STMT so that we could teach generate
+        better if-converison code sequence.  We are trying to catch this
+        following situation:
+
+          loop-header:
+          reduc_1 = PHI <..., reduc_2>
+          ...
+          if (...)
+          tmp1 = reduc_1 + rhs1;
+          tmp2 = tmp1 + rhs2;
+          tmp3 = tmp2 + rhs3;
+          ...
+          reduc_3 = tmpN-1 + rhsN-1;
+
+          reduc_2 = PHI <reduc_1, reduc_3>
+
+          and convert to
+
+          reduc_2 = PHI <0, reduc_1>
+          tmp1 = rhs1 + rhs2;
+          tmp2 = tmp1 + rhs3;
+          tmp3 = tmp2 + rhs4;
+          ...
+          tmpN-1 = tmpN-2 + rhsN;
+          ifcvt = cond_expr ? tmpN-1 : 0
+          reduc_1 = tmpN-1 +/- ifcvt;  */
+      if (num_imm_uses (PHI_RESULT (header_phi)) != 2)
+       return false;
+      FOR_EACH_IMM_USE_FAST (use_p, imm_iter, PHI_RESULT (header_phi))
+       {
+         gimple *use_stmt = USE_STMT (use_p);
+         if (is_gimple_assign (use_stmt))
+           {
+             if (gimple_assign_rhs_code (use_stmt) != reduction_op)
+               return false;
+             if (TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME)
+               return false;
+
+             bool visited_p = false;
+             while (!visited_p)
+               {
+                 use_operand_p use;
+                 if (!single_imm_use (gimple_assign_lhs (use_stmt), &use,
+                                      &use_stmt)
+                     || gimple_bb (use_stmt) != gimple_bb (stmt)
+                     || !is_gimple_assign (use_stmt)
+                     || TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME
+                     || gimple_assign_rhs_code (use_stmt) != reduction_op)
+                   return false;
+
+                 if (gimple_assign_lhs (use_stmt) == gimple_assign_lhs (stmt))
+                   {
+                     r_op2 = r_op1;
+                     r_op1 = PHI_RESULT (header_phi);
+                     visited_p = true;
+                   }
+               }
+           }
+         else if (use_stmt != phi)
+           return false;
+       }
+    }


My approach is doing the check as follows:

	   tmp1 = reduc_1 + rhs1;
	   tmp2 = tmp1 + rhs2;
	   tmp3 = tmp2 + rhs3;
	   ...
	   reduc_3 = tmpN-1 + rhsN-1;

Start the iteration check from "tmp1 = reduc_1 + rhs1;" until "reduc_3 = tmpN-1 + rhsN-1;"

Make sure each statement are PLUS_EXPR for reduction sum.
Does it look reasonable ?

It succeed on vectorization.
Comment 14 Richard Biener 2023-10-06 09:44:33 UTC
Sorry for the delay - but this looks exactly like Robins transform to COND_ADD, no?  But sure, the current code doesn't handle a reduction path through multiple stmts but when if-conversion would convert the final add to a COND_ADD then
it should be a matter of teaching tree-vect-loop.cc:check_reduction_path
about conditional operations?
Comment 15 JuzheZhong 2023-11-10 12:02:58 UTC
Hi,Richard.
Confirmed Robin's patch doesn't help with this issue.

The root cause of this issue is failed to recognize it as possible
vectorization of conditional reduction which means is_cond_scalar_reduction
is FALSE.

I have this following patch which bootstrap on X86 and regtest passed
also passed on aarch64. This following patch can enhance if-conv
conditional reduction recognition.

diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index a8c915913ae..2bdd3710a65 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -1784,14 +1784,119 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc, tree arg_0, tree arg_1,
   r_nop2 = strip_nop_cond_scalar_reduction (*has_nop, r_op2);
 
   /* Make R_OP1 to hold reduction variable.  */
+  gimple *nonphi_use_stmt = NULL;
   if (r_nop2 == PHI_RESULT (header_phi)
       && commutative_tree_code (reduction_op))
     {
       std::swap (r_op1, r_op2);
       std::swap (r_nop1, r_nop2);
     }
-  else if (r_nop1 != PHI_RESULT (header_phi))
-    return false;
+  else if (r_nop1 == PHI_RESULT (header_phi))
+    ;
+  else
+    {
+      /* Analyze the statement chain of STMT so that we could teach generate
+	 better if-converison code sequence.  We are trying to catch this
+	 following situation:
+
+	   loop-header:
+	   reduc_1 = PHI <..., reduc_2>
+	   ...
+	   if (...)
+	   tmp1 = reduc_1 + rhs1;
+	   tmp2 = tmp1 + rhs2;
+	   tmp3 = tmp2 + rhs3;
+	   ...
+	   reduc_3 = tmpN-1 + rhsN-1;
+
+	   reduc_2 = PHI <reduc_1, reduc_3>
+
+	   and convert to
+
+	   reduc_2 = PHI <0, reduc_1>
+	   tmp1 = rhs1;
+	   tmp2 = tmp1 + rhs2;
+	   tmp3 = tmp2 + rhs3;
+	   ...
+	   reduc_3 = tmpN-1 + rhsN-1;
+	   ifcvt = cond_expr ? reduc_3 : 0;
+	   reduc_1 = reduc_1 +/- ifcvt;  */
+      if (num_imm_uses (PHI_RESULT (header_phi)) != 2)
+	return false;
+      if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+	  && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+	       && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
+	       && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
+	       && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))
+	return false;
+      gimple *prev_use_stmt, *curr_use_stmt;
+      use_operand_p use;
+      FOR_EACH_IMM_USE_FAST (use_p, imm_iter, PHI_RESULT (header_phi))
+	{
+	  prev_use_stmt = curr_use_stmt = USE_STMT (use_p);
+	  if (is_gimple_assign (curr_use_stmt))
+	    {
+	      if (TREE_CODE (gimple_assign_lhs (curr_use_stmt)) != SSA_NAME)
+		return false;
+	      if (*has_nop)
+		{
+		  if (!CONVERT_EXPR_CODE_P (
+			gimple_assign_rhs_code (curr_use_stmt)))
+		    return false;
+		}
+	      else
+		{
+		  if (gimple_assign_rhs_code (curr_use_stmt) != reduction_op)
+		    return false;
+		}
+
+	      bool visited_p = false;
+	      nonphi_use_stmt = curr_use_stmt;
+	      while (!visited_p)
+		{
+		  if (!single_imm_use (gimple_assign_lhs (prev_use_stmt), &use,
+				       &curr_use_stmt)
+		      || gimple_bb (curr_use_stmt) != gimple_bb (stmt)
+		      || !is_gimple_assign (curr_use_stmt)
+		      || TREE_CODE (gimple_assign_lhs (curr_use_stmt))
+			   != SSA_NAME
+		      || gimple_assign_rhs_code (curr_use_stmt) != reduction_op)
+		    return false;
+		  if (curr_use_stmt == stmt)
+		    {
+		      if (*has_nop)
+			{
+			  if (!single_imm_use (gimple_assign_lhs (
+						 nonphi_use_stmt),
+					       &use, &curr_use_stmt))
+			    return false;
+			  r_op1 = gimple_assign_lhs (nonphi_use_stmt);
+			  r_nop1 = PHI_RESULT (header_phi);
+			  nonphi_use_stmt = curr_use_stmt;
+			}
+		      else
+			r_op1 = PHI_RESULT (header_phi);
+
+		      if (*has_nop)
+			{
+			  if (!single_imm_use (gimple_assign_lhs (stmt), &use,
+					       &curr_use_stmt))
+			    return false;
+			  r_op2 = gimple_assign_lhs (stmt);
+			  r_nop2 = gimple_assign_lhs (curr_use_stmt);
+			}
+		      else
+			r_op2 = gimple_assign_lhs (stmt);
+		      visited_p = true;
+		    }
+		  else
+		    prev_use_stmt = curr_use_stmt;
+		}
+	    }
+	  else if (curr_use_stmt != phi)
+	    return false;
+	}
+    }
 
   if (*has_nop)
     {
@@ -1816,12 +1921,41 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc, tree arg_0, tree arg_1,
 	continue;
       if (use_stmt == stmt)
 	continue;
+      if (use_stmt == nonphi_use_stmt)
+	continue;
       if (gimple_code (use_stmt) != GIMPLE_PHI)
 	return false;
     }
 
   *op0 = r_op1; *op1 = r_op2;
   *reduc = stmt;
+
+  if (nonphi_use_stmt)
+    {
+      /* Transform:
+
+	if (...)
+	   tmp1 = reduc_1 + rhs1;
+	   tmp2 = tmp1 + rhs2;
+	   tmp3 = tmp2 + rhs3;
+
+	into:
+
+	   tmp1 = rhs1 + 0;   ---> We replace reduc_1 into '0'
+	   tmp2 = tmp1 + rhs2;
+	   tmp3 = tmp2 + rhs3;
+	   ...
+	   reduc_3 = tmpN-1 + rhsN-1;
+	   ifcvt = cond_expr ? reduc_3 : 0;  */
+      gcc_assert (gimple_assign_rhs_code (nonphi_use_stmt) == reduction_op);
+      if (gimple_assign_rhs1 (nonphi_use_stmt) == r_op1)
+	gimple_assign_set_rhs1 (nonphi_use_stmt,
+				build_zero_cst (TREE_TYPE (r_op1)));
+      else if (gimple_assign_rhs2 (nonphi_use_stmt) == r_op1)
+	gimple_assign_set_rhs2 (nonphi_use_stmt,
+				build_zero_cst (TREE_TYPE (r_op1)));
+      update_stmt (nonphi_use_stmt);
+    }
   return true;
 }
 
@@ -1886,12 +2020,17 @@ convert_scalar_cond_reduction (gimple *reduc, gimple_stmt_iterator *gsi,
       gsi_remove (&stmt_it, true);
       release_defs (nop_reduc);
     }
+
   gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
 
   /* Delete original reduction stmt.  */
-  stmt_it = gsi_for_stmt (reduc);
-  gsi_remove (&stmt_it, true);
-  release_defs (reduc);
+  if (op1 != gimple_assign_lhs (reduc))
+    {
+      stmt_it = gsi_for_stmt (reduc);
+      gsi_remove (&stmt_it, true);
+      release_defs (reduc);
+    }
+
   return rhs;
 }

I have fully tested it with different kinds of condtional reduction.
All can be vectorized.

I am not sure whether it is a feasible solution.
Comment 16 Richard Biener 2023-11-10 13:10:15 UTC
it's not exactly clear what the transform is you propose.  it looks like a re-association but SSA names do not match up but the transform only replaces
a single op with 0!?
Comment 17 JuzheZhong 2023-11-10 13:42:21 UTC
Sorry for confusing and not enough information.

I am trying to transform:

+          reduc_1 = PHI <..., reduc_2>
+          ...
+          if (...)
+          tmp1 = reduc_1 + rhs1;
+          tmp2 = tmp1 + rhs2;
+          tmp3 = tmp2 + rhs3;
+          ...
+          reduc_3 = tmpN-1 + rhsN-1;
+
+          reduc_2 = PHI <reduc_1, reduc_3>

First transform the first statement:

tmp1 = reduc_1 + rhs1; into  tmp1 = rhs1 + 0;

Then it will become bogus data move assignment: tmp1 = rhs1.
The later PASS will eliminate it.

Then, transform the reduction PHI: 

reduc_1 = PHI <..., reduc_2>

into if-convert statement:

reduc_1 = PHI <_ifc__35(8), 0(18)>

Thid, transform 

+          reduc_3 = tmpN-1 + rhsN-1;
+
+          reduc_2 = PHI <reduc_1, reduc_3>

into :

reduc_3 = tmpN-1 + rhsN-1;
_ifc__35 = .COND_ADD (condition, reduc_1, reduc_3, reduc_1);


So finally:

result_1 = PHI <_ifc__35(8), 0(18)>
...
tmp1 = rhs1;
tmp2 = tmp1 + rhs2;
tmp3 = tmp2 + rhs3;
...
reduc_3 = tmpN-1 + rhsN-1;
_ifc__35 = .COND_ADD (condition, reduc_1, reduc_3, reduc_1);
Comment 18 Richard Biener 2023-11-15 14:09:42 UTC
Ah, so this is for

int foo (int *a, int x)
{
  int sum = 0;
  for (int i = 0; i < 1024; ++i)
    if (a[i] < 10)
      sum = sum + a[i] + x;
  return sum;
}

transforming it to

int foo (int *a, int x)
{
  int sum = 0;
  for (int i = 0; i < 1024; ++i)
    {
      tem = a[i] + x;
      if (a[i] < 10)
        sum = sum + tem;
    }
  return sum;
}

note this is re-associating and thus not always valid.  It also executes
stmts unconditionally (also not always valid, for FP spurious exceptions
might be raised).
Comment 19 JuzheZhong 2023-11-15 14:38:05 UTC
I have added:

+      if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+	  && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+	       && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
+	       && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
+	       && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))
+	return false;

for floating-point. I failed to see which situation will cause FP exceptions ?
Comment 20 rguenther@suse.de 2023-11-15 14:42:45 UTC
On Wed, 15 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
> 
> --- Comment #19 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> I have added:
> 
> +      if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))

For TYPE_OVERFLOW_UNDEFINED you have to convert the ops to unsigned
to avoid spurious undefined overflow.

> +         && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
> +              && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
> +              && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
> +              && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))

You should check flag_associative_math which covers one half
and !flag_trapping_math which covers spurious FP exceptions.

> +       return false;
> 
> for floating-point. I failed to see which situation will cause FP exceptions ?

Ops with NaN cause INVALID, but there's also INEXACT which can be
set differently after re-association.
Comment 21 JuzheZhong 2023-11-16 01:06:17 UTC
Thanks Richi.

Does re-associating (with eliminating exceptions) in if-convert is a reasonable approach ?

If yes, I am gonna to revisit this PR on GCC-15 and refine the codes.
Comment 22 rguenther@suse.de 2023-11-16 06:50:42 UTC
On Thu, 16 Nov 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
> 
> --- Comment #21 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Thanks Richi.
> 
> Does re-associating (with eliminating exceptions) in if-convert is a reasonable
> approach ?

Yeah, I don't think we have a much better place at the moment.