This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Optimise sqrt reciprocal multiplications


Hi Richard,

On 23/08/18 11:13, Richard Sandiford wrote:
Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com> writes:
Hi all,

This patch aims to optimise sequences involving uses of 1.0 / sqrt (a) under -freciprocal-math and -funsafe-math-optimizations.
In particular consider:

x = 1.0 / sqrt (a);
r1 = x * x;  // same as 1.0 / a
r2 = a * x; // same as sqrt (a)

If x, r1 and r2 are all used further on in the code, this can be transformed into:
tmp1 = 1.0 / a
tmp2 = sqrt (a)
tmp3 = tmp1 * tmp2
x = tmp3
r1 = tmp1
r2 = tmp2
Nice optimisation :-)  Someone who knows the pass better should review,
but:

Thanks for the review.

There seems to be an implicit assumption that this is a win even
when the r1 and r2 assignments are only conditionally executed.
That's probably true, but it might be worth saying explicitly.

I'll admit I had not considered that case.
I think it won't make a difference in practice, as the really expensive operations here
are the sqrt and the division and they are on the executed path in either case and them
becoming independent should be a benefit of its own.

+/* Return TRUE if USE_STMT is a multiplication of DEF by A.  */
+
+static inline bool
+is_mult_by (gimple *use_stmt, tree def, tree a)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
+    {
+      tree op0 = gimple_assign_rhs1 (use_stmt);
+      tree op1 = gimple_assign_rhs2 (use_stmt);
+
+      return (op0 == def && op1 == a)
+	      || (op0 == a && op1 == def);
+    }
+  return 0;
+}
Seems like is_square_of could now be a light-weight wrapper around this.

Indeed, I've done the wrapping now.

@@ -652,6 +669,180 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
    occ_head = NULL;
  }
+/* Transform sequences like
+   x = 1.0 / sqrt (a);
+   r1 = x * x;
+   r2 = a * x;
+   into:
+   tmp1 = 1.0 / a;
+   tmp2 = sqrt (a);
+   tmp3 = tmp1 * tmp2;
+   x = tmp3;
+   r1 = tmp1;
+   r2 = tmp2;
+   depending on the uses of x, r1, r2.  This removes one multiplication and
+   allows the sqrt and division operations to execute in parallel.
+   DEF_GSI is the gsi of the initial division by sqrt that defines
+   DEF (x in the example abovs).  */
+
+static void
+optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def)
+{
+  use_operand_p use_p;
+  imm_use_iterator use_iter;
+  gimple *stmt = gsi_stmt (*def_gsi);
+  tree x = def;
+  tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt);
+  tree div_rhs1 = gimple_assign_rhs1 (stmt);
+
+  if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME
+      || TREE_CODE (div_rhs1) != REAL_CST
+      || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1))
+    return;
+
+  gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name);
+  if (!is_gimple_call (sqrt_stmt)
+      || !gimple_call_lhs (sqrt_stmt))
+    return;
+
+  gcall *call = as_a <gcall *> (sqrt_stmt);
Very minor, but:

   gcall *sqrt_stmt
     = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name));
   if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt))
     return;

would avoid the need for the separate as_a<>, and would mean that
we only call gimple_call_* on gcalls.

Ok.

+  if (has_other_use)
+    {
+      /* Using the two temporaries tmp1, tmp2 from above
+	 the original x is now:
+	 x = tmp1 * tmp2.  */
+      gcc_assert (mult_ssa_name);
+      gcc_assert (sqr_ssa_name);
+      gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt);
+
+      tree new_ssa_name
+	= make_temp_ssa_name (TREE_TYPE (a), NULL, "recip_sqrt_transformed");
+      gimple *new_stmt
+	= gimple_build_assign (new_ssa_name, MULT_EXPR,
+			       mult_ssa_name, sqr_ssa_name);
+      gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT);
+      gcc_assert (gsi_stmt (gsi2) == stmt);
+      gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name);
+      fold_stmt (&gsi2);
+      update_stmt (stmt);
In this case we're replacing the statement in its original position,
so there's no real need to use a temporary.  It seems better to
change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same
lhs as before.

Yes, that's cleaner.

@@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun)
        if (optimize_bb_for_size_p (bb))
          continue;
Seems unnecessary to skip the new optimisation when optimising for size.
Like you say, it saves a multiplication overall.  Also:

Indeed.

+      if (flag_unsafe_math_optimizations)
+	{
+	  for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
+	       !gsi_end_p (gsi);
+	       gsi_next (&gsi))
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+
+	      if (gimple_has_lhs (stmt)
+		  && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != NULL
+		  && FLOAT_TYPE_P (TREE_TYPE (def))
+		  && TREE_CODE (def) == SSA_NAME
+		  && is_gimple_assign (stmt)
+		  && gimple_assign_rhs_code (stmt) == RDIV_EXPR)
+		optimize_recip_sqrt (&gsi, def);
+	    }
+	}
It looks like this could safely be done in one of the existing walks
(e.g. the execute_cse_reciprocals_1 one, if we do this when optimising
for size).

You're right. I've moved this into one of the walks above this.

Bootstrapped and tested on aarch64-none-linux-gnu and x86_64-unknown-linux-gnu.
CC'ing richi as he's reviewed these kinds of patches in the past.

Is this ok for trunk?

Thanks,
Kyrill

2018-08-23  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

    * tree-ssa-math-opts.c (is_mult_by): New function.
    (is_square_of): Use the above.
    (optimize_recip_sqrt): New function.
    (pass_cse_reciprocals::execute): Use the above.

2018-08-23  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

    * gcc.dg/recip_sqrt_mult_1.c: New test.
    * gcc.dg/recip_sqrt_mult_2.c: Likewise.
    * gcc.dg/recip_sqrt_mult_3.c: Likewise.

Thanks,
Richard

diff --git a/gcc/testsuite/gcc.dg/recip_sqrt_mult_1.c b/gcc/testsuite/gcc.dg/recip_sqrt_mult_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..86429e4183ba913638c2b931f7e65a9f6e159c6d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/recip_sqrt_mult_1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-recip" } */
+
+double res, res2, tmp;
+void
+foo (double a, double b)
+{
+  tmp = 1.0 / __builtin_sqrt (a);
+  res = tmp * tmp;
+  res2 = a * tmp;
+}
+
+/* { dg-final { scan-tree-dump "Optimizing reciprocal sqrt multiplications" "recip" } } */
+/* { dg-final { scan-tree-dump "Replacing squaring multiplication" "recip" } } */
+/* { dg-final { scan-tree-dump "Replacing multiplication" "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/recip_sqrt_mult_2.c b/gcc/testsuite/gcc.dg/recip_sqrt_mult_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..c5fc3de7b657b1769e76254b4bc874e0595e43ef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/recip_sqrt_mult_2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+foo (float a)
+{
+  float tmp = 1.0f / __builtin_sqrtf (a);
+  return a * tmp;
+}
+
+/* { dg-final { scan-tree-dump-not " / " "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/recip_sqrt_mult_3.c b/gcc/testsuite/gcc.dg/recip_sqrt_mult_3.c
new file mode 100644
index 0000000000000000000000000000000000000000..e7d185ba7e22cfc7cca72296d5ccc544f24fdb14
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/recip_sqrt_mult_3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+double
+foo (double a)
+{
+  double tmp = 1.0f / __builtin_sqrt (a);
+  return tmp * tmp;
+}
+
+/* { dg-final { scan-tree-dump-not "__builtin_sqrt" "optimized" } } */
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index a90d9d28e4ef6ce66fa380f314a9749fa952be0a..56571e77dae8d504bb984c2088c774bee5b9dce0 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -337,9 +337,9 @@ is_division_by (gimple *use_stmt, tree def)
 	 && gimple_assign_rhs1 (use_stmt) != def;
 }
 
-/* Return whether USE_STMT is DEF * DEF.  */
+/* Return TRUE if USE_STMT is a multiplication of DEF by A.  */
 static inline bool
-is_square_of (gimple *use_stmt, tree def)
+is_mult_by (gimple *use_stmt, tree def, tree a)
 {
   if (gimple_code (use_stmt) == GIMPLE_ASSIGN
       && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
@@ -347,11 +347,19 @@ is_square_of (gimple *use_stmt, tree def)
       tree op0 = gimple_assign_rhs1 (use_stmt);
       tree op1 = gimple_assign_rhs2 (use_stmt);
 
-      return op0 == op1 && op0 == def;
+      return (op0 == def && op1 == a)
+	      || (op0 == a && op1 == def);
     }
   return 0;
 }
 
+/* Return whether USE_STMT is DEF * DEF.  */
+static inline bool
+is_square_of (gimple *use_stmt, tree def)
+{
+  return is_mult_by (use_stmt, def, def);
+}
+
 /* Return whether USE_STMT is a floating-point division by
    DEF * DEF.  */
 static inline bool
@@ -526,6 +534,173 @@ free_bb (struct occurrence *occ)
     }
 }
 
+/* Transform sequences like
+   x = 1.0 / sqrt (a);
+   r1 = x * x;
+   r2 = a * x;
+   into:
+   tmp1 = 1.0 / a;
+   tmp2 = sqrt (a);
+   tmp3 = tmp1 * tmp2;
+   x = tmp3;
+   r1 = tmp1;
+   r2 = tmp2;
+   depending on the uses of x, r1, r2.  This removes one multiplication and
+   allows the sqrt and division operations to execute in parallel.
+   DEF_GSI is the gsi of the initial division by sqrt that defines
+   DEF (x in the example abovs).  */
+
+static void
+optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def)
+{
+  use_operand_p use_p;
+  imm_use_iterator use_iter;
+  gimple *stmt = gsi_stmt (*def_gsi);
+  tree x = def;
+  tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt);
+  tree div_rhs1 = gimple_assign_rhs1 (stmt);
+
+  if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME
+      || TREE_CODE (div_rhs1) != REAL_CST
+      || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1))
+    return;
+
+  gcall *sqrt_stmt
+    = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name));
+
+  if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt))
+    return;
+
+  switch (gimple_call_combined_fn (sqrt_stmt))
+    {
+    CASE_CFN_SQRT:
+    CASE_CFN_SQRT_FN:
+      break;
+
+    default:
+      return;
+    }
+  tree a = gimple_call_arg (sqrt_stmt, 0);
+
+  /* We have 'a' and 'x'.  Now analyze the uses of 'x'.  */
+
+  /* Statements that use x in x * x.  */
+  auto_vec<gimple *> sqr_stmts;
+  /* Statements that use x in a * x.  */
+  auto_vec<gimple *> mult_stmts;
+  bool has_other_use = false;
+
+  FOR_EACH_IMM_USE_FAST (use_p, use_iter, x)
+    {
+      gimple *stmt2 = USE_STMT (use_p);
+      if (is_gimple_debug (stmt2))
+	continue;
+      if (is_square_of (stmt2, x))
+	{
+	  if (!sqr_stmts.contains (stmt2))
+	    sqr_stmts.safe_push (stmt2);
+	}
+      else if (is_mult_by (stmt2, x, a))
+	mult_stmts.safe_push (stmt2);
+      else
+	has_other_use = true;
+    }
+  if (sqr_stmts.is_empty () && mult_stmts.is_empty ())
+    return;
+
+  /* If x = 1.0 / sqrt (a) has uses other than those optimized here we want
+     to be able to compose it from the sqr and mult cases.  */
+  if (has_other_use && (sqr_stmts.is_empty () || mult_stmts.is_empty ()))
+    return;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Optimizing reciprocal sqrt multiplications of\n");
+      print_gimple_stmt (dump_file, sqrt_stmt, 0, TDF_NONE);
+      print_gimple_stmt (dump_file, stmt, 0, TDF_NONE);
+      fprintf (dump_file, "\n");
+    }
+
+  tree mult_ssa_name = NULL_TREE;
+  tree sqr_ssa_name = NULL_TREE;
+  if (!sqr_stmts.is_empty ())
+    {
+      /* r1 = x * x.  Transform this into:
+	 tmp1 = 1.0 / a
+	 r1 = tmp1.  */
+
+      sqr_ssa_name
+	= make_temp_ssa_name (TREE_TYPE (a), NULL, "recip_sqrt_sqr");
+      gimple *new_stmt
+	= gimple_build_assign (sqr_ssa_name, RDIV_EXPR,
+				build_real (TREE_TYPE (a), dconst1), a);
+      gsi_insert_before (def_gsi, new_stmt, GSI_NEW_STMT);
+
+      unsigned int i;
+      gimple *sqr_stmt;
+      FOR_EACH_VEC_ELT (sqr_stmts, i, sqr_stmt)
+	{
+	  gimple_stmt_iterator gsi2 = gsi_for_stmt (sqr_stmt);
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Replacing squaring multiplication\n");
+	      print_gimple_stmt (dump_file, sqr_stmt, 0, TDF_NONE);
+	      fprintf (dump_file, "with\n");
+	      print_gimple_stmt (dump_file, new_stmt, 0, TDF_NONE);
+	      fprintf (dump_file, "\n");
+	    }
+	  gimple_assign_set_rhs_from_tree (&gsi2, sqr_ssa_name);
+	  fold_stmt_inplace (&gsi2);
+	  update_stmt (sqr_stmt);
+      }
+    }
+  if (!mult_stmts.is_empty ())
+    {
+      /* r2 = a * x.  Transform this into:
+	 tmp2 = sqrt (a)
+	 r2 = tmp2.  */
+      mult_ssa_name
+	= make_temp_ssa_name (TREE_TYPE (a), NULL, "recip_sqrt_mult");
+      gimple *new_stmt
+	= gimple_build_assign (mult_ssa_name, orig_sqrt_ssa_name);
+      gsi_insert_before (def_gsi, new_stmt, GSI_NEW_STMT);
+
+      unsigned int i;
+      gimple *mult_stmt = NULL;
+      FOR_EACH_VEC_ELT (mult_stmts, i, mult_stmt)
+	{
+	  gimple_stmt_iterator gsi2 = gsi_for_stmt (mult_stmt);
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Replacing multiplication\n");
+	      print_gimple_stmt (dump_file, mult_stmt, 0, TDF_NONE);
+	      fprintf (dump_file, "with\n");
+	      print_gimple_stmt (dump_file, new_stmt, 0, TDF_NONE);
+	      fprintf (dump_file, "\n");
+	    }
+	  gimple_assign_set_rhs_from_tree (&gsi2, mult_ssa_name);
+	  fold_stmt_inplace (&gsi2);
+	  update_stmt (mult_stmt);
+      }
+    }
+
+  if (has_other_use)
+    {
+      /* Using the two temporaries tmp1, tmp2 from above
+	 the original x is now:
+	 x = tmp1 * tmp2.  */
+      gcc_assert (mult_ssa_name);
+      gcc_assert (sqr_ssa_name);
+
+      gimple_assign_set_rhs1 (stmt, mult_ssa_name);
+      gimple_assign_set_rhs2 (stmt, sqr_ssa_name);
+      gimple_assign_set_rhs_code (stmt, MULT_EXPR);
+      fold_stmt_inplace (def_gsi);
+      update_stmt (stmt);
+    }
+}
 
 /* Look for floating-point divisions among DEF's uses, and try to
    replace them by multiplications with the reciprocal.  Add
@@ -546,6 +721,7 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
   int sqrt_recip_count = 0;
 
   gcc_assert (FLOAT_TYPE_P (TREE_TYPE (def)) && TREE_CODE (def) == SSA_NAME);
+
   threshold = targetm.min_divisions_for_recip_mul (TYPE_MODE (TREE_TYPE (def)));
 
   /* If DEF is a square (x * x), count the number of divisions by x.
@@ -756,7 +932,14 @@ pass_cse_reciprocals::execute (function *fun)
 	      && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != NULL
 	      && FLOAT_TYPE_P (TREE_TYPE (def))
 	      && TREE_CODE (def) == SSA_NAME)
-	    execute_cse_reciprocals_1 (&gsi, def);
+	    {
+	      if (flag_unsafe_math_optimizations
+		  && is_gimple_assign (stmt)
+		  && gimple_assign_rhs_code (stmt) == RDIV_EXPR)
+		optimize_recip_sqrt (&gsi, def);
+	      else
+		execute_cse_reciprocals_1 (&gsi, def);
+	    }
 	}
 
       if (optimize_bb_for_size_p (bb))

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]