[PATCH] Optimise sqrt reciprocal multiplications

Thu Aug 23 10:13:00 GMT 2018

Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com> writes:
> Hi all,
>
> This patch aims to optimise sequences involving uses of 1.0 / sqrt (a) under -freciprocal-math and -funsafe-math-optimizations.
> In particular consider:
>
> x = 1.0 / sqrt (a);
> r1 = x * x;  // same as 1.0 / a
> r2 = a * x; // same as sqrt (a)
>
> If x, r1 and r2 are all used further on in the code, this can be transformed into:
> tmp1 = 1.0 / a
> tmp2 = sqrt (a)
> tmp3 = tmp1 * tmp2
> x = tmp3
> r1 = tmp1
> r2 = tmp2

Nice optimisation :-)  Someone who knows the pass better should review,
but:

There seems to be an implicit assumption that this is a win even
when the r1 and r2 assignments are only conditionally executed.
That's probably true, but it might be worth saying explicitly.

> +/* Return TRUE if USE_STMT is a multiplication of DEF by A.  */
> +
> +static inline bool
> +is_mult_by (gimple *use_stmt, tree def, tree a)
> +{
> +  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
> +      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
> +    {
> +      tree op0 = gimple_assign_rhs1 (use_stmt);
> +      tree op1 = gimple_assign_rhs2 (use_stmt);
> +
> +      return (op0 == def && op1 == a)
> +	      || (op0 == a && op1 == def);
> +    }
> +  return 0;
> +}

Seems like is_square_of could now be a light-weight wrapper around this.

> @@ -652,6 +669,180 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
>    occ_head = NULL;
>  }
>  
> +/* Transform sequences like
> +   x = 1.0 / sqrt (a);
> +   r1 = x * x;
> +   r2 = a * x;
> +   into:
> +   tmp1 = 1.0 / a;
> +   tmp2 = sqrt (a);
> +   tmp3 = tmp1 * tmp2;
> +   x = tmp3;
> +   r1 = tmp1;
> +   r2 = tmp2;
> +   depending on the uses of x, r1, r2.  This removes one multiplication and
> +   allows the sqrt and division operations to execute in parallel.
> +   DEF_GSI is the gsi of the initial division by sqrt that defines
> +   DEF (x in the example abovs).  */
> +
> +static void
> +optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def)
> +{
> +  use_operand_p use_p;
> +  imm_use_iterator use_iter;
> +  gimple *stmt = gsi_stmt (*def_gsi);
> +  tree x = def;
> +  tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt);
> +  tree div_rhs1 = gimple_assign_rhs1 (stmt);
> +
> +  if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME
> +      || TREE_CODE (div_rhs1) != REAL_CST
> +      || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1))
> +    return;
> +
> +  gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name);
> +  if (!is_gimple_call (sqrt_stmt)
> +      || !gimple_call_lhs (sqrt_stmt))
> +    return;
> +
> +  gcall *call = as_a <gcall *> (sqrt_stmt);

Very minor, but:

  gcall *sqrt_stmt
    = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name));
  if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt))
    return;

would avoid the need for the separate as_a<>, and would mean that
we only call gimple_call_* on gcalls.

> +  if (has_other_use)
> +    {
> +      /* Using the two temporaries tmp1, tmp2 from above
> +	 the original x is now:
> +	 x = tmp1 * tmp2.  */
> +      gcc_assert (mult_ssa_name);
> +      gcc_assert (sqr_ssa_name);
> +      gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt);
> +
> +      tree new_ssa_name
> +	= make_temp_ssa_name (TREE_TYPE (a), NULL, "recip_sqrt_transformed");
> +      gimple *new_stmt
> +	= gimple_build_assign (new_ssa_name, MULT_EXPR,
> +			       mult_ssa_name, sqr_ssa_name);
> +      gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT);
> +      gcc_assert (gsi_stmt (gsi2) == stmt);
> +      gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name);
> +      fold_stmt (&gsi2);
> +      update_stmt (stmt);

In this case we're replacing the statement in its original position,
so there's no real need to use a temporary.  It seems better to 
change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same
lhs as before.

> @@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun)
>        if (optimize_bb_for_size_p (bb))
>          continue;

Seems unnecessary to skip the new optimisation when optimising for size.
Like you say, it saves a multiplication overall.  Also:

> +      if (flag_unsafe_math_optimizations)
> +	{
> +	  for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
> +	       !gsi_end_p (gsi);
> +	       gsi_next (&gsi))
> +	    {
> +	      gimple *stmt = gsi_stmt (gsi);
> +
> +	      if (gimple_has_lhs (stmt)
> +		  && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) != NULL
> +		  && FLOAT_TYPE_P (TREE_TYPE (def))
> +		  && TREE_CODE (def) == SSA_NAME
> +		  && is_gimple_assign (stmt)
> +		  && gimple_assign_rhs_code (stmt) == RDIV_EXPR)
> +		optimize_recip_sqrt (&gsi, def);
> +	    }
> +	}

It looks like this could safely be done in one of the existing walks
(e.g. the execute_cse_reciprocals_1 one, if we do this when optimising
for size).

Thanks,
Richard