This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] Optimise sqrt reciprocal multiplications
On Wed, 5 Sep 2018, Kyrill Tkachov wrote:
> On 04/09/18 17:52, Kyrill Tkachov wrote:
> >
> > On 04/09/18 15:31, Richard Biener wrote:
> >> On Tue, 4 Sep 2018, Kyrill Tkachov wrote:
> >>> Hi Richard,
> >>>
> >>> On 31/08/18 12:07, Richard Biener wrote:
> >>>> On Thu, 30 Aug 2018, Kyrill Tkachov wrote:
> >>>>> Ping.
> >>>>>
> >>>>> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01496.html
> >>>>>
> >>>>> Thanks,
> >>>>> Kyrill
> >>>>>
> >>>>> On 23/08/18 18:09, Kyrill Tkachov wrote:
> >>>>>> Hi Richard,
> >>>>>>
> >>>>>> On 23/08/18 11:13, Richard Sandiford wrote:
> >>>>>>> Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> writes:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> This patch aims to optimise sequences involving uses of 1.0 / sqrt
> (a)
> >>>>>>>> under -freciprocal-math and -funsafe-math-optimizations.
> >>>>>>>> In particular consider:
> >>>>>>>>
> >>>>>>>> x = 1.0 / sqrt (a);
> >>>>>>>> r1 = x * x; // same as 1.0 / a
> >>>>>>>> r2 = a * x; // same as sqrt (a)
> >>>>>>>>
> >>>>>>>> If x, r1 and r2 are all used further on in the code, this can be
> >>>>>>>> transformed into:
> >>>>>>>> tmp1 = 1.0 / a
> >>>>>>>> tmp2 = sqrt (a)
> >>>>>>>> tmp3 = tmp1 * tmp2
> >>>>>>>> x = tmp3
> >>>>>>>> r1 = tmp1
> >>>>>>>> r2 = tmp2
> >>>>>>> Nice optimisation :-) Someone who knows the pass better should
> review,
> >>>>>>> but:
> >>>>>> Thanks for the review.
> >>>>>>> There seems to be an implicit assumption that this is a win even
> >>>>>>> when the r1 and r2 assignments are only conditionally executed.
> >>>>>>> That's probably true, but it might be worth saying explicitly.
> >>>>>> I'll admit I had not considered that case.
> >>>>>> I think it won't make a difference in practice, as the really expensive
> >>>>>> operations here
> >>>>>> are the sqrt and the division and they are on the executed path in
> either
> >>>>>> case and them
> >>>>>> becoming independent should be a benefit of its own.
> >>>>>>>> +/* Return TRUE if USE_STMT is a multiplication of DEF by A. */
> >>>>>>>> +
> >>>>>>>> +static inline bool
> >>>>>>>> +is_mult_by (gimple *use_stmt, tree def, tree a)
> >>>>>>>> +{
> >>>>>>>> + if (gimple_code (use_stmt) == GIMPLE_ASSIGN
> >>>>>>>> + && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
> >>>>>>>> + {
> >>>>>>>> + tree op0 = gimple_assign_rhs1 (use_stmt);
> >>>>>>>> + tree op1 = gimple_assign_rhs2 (use_stmt);
> >>>>>>>> +
> >>>>>>>> + return (op0 == def && op1 == a)
> >>>>>>>> + || (op0 == a && op1 == def);
> >>>>>>>> + }
> >>>>>>>> + return 0;
> >>>>>>>> +}
> >>>>>>> Seems like is_square_of could now be a light-weight wrapper around
> this.
> >>>>>> Indeed, I've done the wrapping now.
> >>>>>>>> @@ -652,6 +669,180 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator
> >>>>>>>> *def_gsi, tree def)
> >>>>>>>> occ_head = NULL;
> >>>>>>>> }
> >>>>>>>> +/* Transform sequences like
> >>>>>>>> + x = 1.0 / sqrt (a);
> >>>>>>>> + r1 = x * x;
> >>>>>>>> + r2 = a * x;
> >>>>>>>> + into:
> >>>>>>>> + tmp1 = 1.0 / a;
> >>>>>>>> + tmp2 = sqrt (a);
> >>>>>>>> + tmp3 = tmp1 * tmp2;
> >>>>>>>> + x = tmp3;
> >>>>>>>> + r1 = tmp1;
> >>>>>>>> + r2 = tmp2;
> >>>>>>>> + depending on the uses of x, r1, r2. This removes one
> multiplication
> >>>>>>>> and
> >>>>>>>> + allows the sqrt and division operations to execute in parallel.
> >>>>>>>> + DEF_GSI is the gsi of the initial division by sqrt that defines
> >>>>>>>> + DEF (x in the example abovs). */
> >>>>>>>> +
> >>>>>>>> +static void
> >>>>>>>> +optimize_recip_sqrt (gimple_stmt_iterator *def_gsi, tree def)
> >>>>>>>> +{
> >>>>>>>> + use_operand_p use_p;
> >>>>>>>> + imm_use_iterator use_iter;
> >>>>>>>> + gimple *stmt = gsi_stmt (*def_gsi);
> >>>>>>>> + tree x = def;
> >>>>>>>> + tree orig_sqrt_ssa_name = gimple_assign_rhs2 (stmt);
> >>>>>>>> + tree div_rhs1 = gimple_assign_rhs1 (stmt);
> >>>>>>>> +
> >>>>>>>> + if (TREE_CODE (orig_sqrt_ssa_name) != SSA_NAME
> >>>>>>>> + || TREE_CODE (div_rhs1) != REAL_CST
> >>>>>>>> + || !real_equal (&TREE_REAL_CST (div_rhs1), &dconst1))
> >>>>>>>> + return;
> >>>>>>>> +
> >>>>>>>> + gimple *sqrt_stmt = SSA_NAME_DEF_STMT (orig_sqrt_ssa_name);
> >>>>>>>> + if (!is_gimple_call (sqrt_stmt)
> >>>>>>>> + || !gimple_call_lhs (sqrt_stmt))
> >>>>>>>> + return;
> >>>>>>>> +
> >>>>>>>> + gcall *call = as_a <gcall *> (sqrt_stmt);
> >>>>>>> Very minor, but:
> >>>>>>>
> >>>>>>> gcall *sqrt_stmt
> >>>>>>> = dyn_cast <gcall *> (SSA_NAME_DEF_STMT (orig_sqrt_ssa_name));
> >>>>>>> if (!sqrt_stmt || !gimple_call_lhs (sqrt_stmt))
> >>>>>>> return;
> >>>>>>>
> >>>>>>> would avoid the need for the separate as_a<>, and would mean that
> >>>>>>> we only call gimple_call_* on gcalls.
> >>>>>> Ok.
> >>>>>>>> + if (has_other_use)
> >>>>>>>> + {
> >>>>>>>> + /* Using the two temporaries tmp1, tmp2 from above
> >>>>>>>> + the original x is now:
> >>>>>>>> + x = tmp1 * tmp2. */
> >>>>>>>> + gcc_assert (mult_ssa_name);
> >>>>>>>> + gcc_assert (sqr_ssa_name);
> >>>>>>>> + gimple_stmt_iterator gsi2 = gsi_for_stmt (stmt);
> >>>>>>>> +
> >>>>>>>> + tree new_ssa_name
> >>>>>>>> + = make_temp_ssa_name (TREE_TYPE (a), NULL,
> >>>>>>>> "recip_sqrt_transformed");
> >>>>>>>> + gimple *new_stmt
> >>>>>>>> + = gimple_build_assign (new_ssa_name, MULT_EXPR,
> >>>>>>>> + mult_ssa_name, sqr_ssa_name);
> >>>>>>>> + gsi_insert_before (&gsi2, new_stmt, GSI_SAME_STMT);
> >>>>>>>> + gcc_assert (gsi_stmt (gsi2) == stmt);
> >>>>>>>> + gimple_assign_set_rhs_from_tree (&gsi2, new_ssa_name);
> >>>>>>>> + fold_stmt (&gsi2);
> >>>>>>>> + update_stmt (stmt);
> >>>>>>> In this case we're replacing the statement in its original position,
> >>>>>>> so there's no real need to use a temporary. It seems better to
> >>>>>>> change the rhs_code, rhs1 and rhs2 of stmt in-place, with the same
> >>>>>>> lhs as before.
> >>>>>> Yes, that's cleaner.
> >>>>>>>> @@ -762,6 +953,23 @@ pass_cse_reciprocals::execute (function *fun)
> >>>>>>>> if (optimize_bb_for_size_p (bb))
> >>>>>>>> continue;
> >>>>>>> Seems unnecessary to skip the new optimisation when optimising for
> size.
> >>>>>>> Like you say, it saves a multiplication overall. Also:
> >>>>>> Indeed.
> >>>>>>>> + if (flag_unsafe_math_optimizations)
> >>>>>>>> + {
> >>>>>>>> + for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
> >>>>>>>> + !gsi_end_p (gsi);
> >>>>>>>> + gsi_next (&gsi))
> >>>>>>>> + {
> >>>>>>>> + gimple *stmt = gsi_stmt (gsi);
> >>>>>>>> +
> >>>>>>>> + if (gimple_has_lhs (stmt)
> >>>>>>>> + && (def = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_DEF)) !=
> NULL
> >>>>>>>> + && FLOAT_TYPE_P (TREE_TYPE (def))
> >>>>>>>> + && TREE_CODE (def) == SSA_NAME
> >>>>>>>> + && is_gimple_assign (stmt)
> >>>>>>>> + && gimple_assign_rhs_code (stmt) == RDIV_EXPR)
> >>>>>>>> + optimize_recip_sqrt (&gsi, def);
> >>>>>>>> + }
> >>>>>>>> + }
> >>>>>>> It looks like this could safely be done in one of the existing walks
> >>>>>>> (e.g. the execute_cse_reciprocals_1 one, if we do this when optimising
> >>>>>>> for size).
> >>>>>> You're right. I've moved this into one of the walks above this.
> >>>>>>
> >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu and
> >>>>>> x86_64-unknown-linux-gnu.
> >>>>>> CC'ing richi as he's reviewed these kinds of patches in the past.
> >>>>>>
> >>>>>> Is this ok for trunk?
> >>>> I wonder how it interacts with execute_cse_reciprocals_1 given it
> >>>> introduces a division 1.0 / a which will be not processed by
> >>>> execute_cse_reciprocals_1 given that operates by walking uses of 'a'.
> >>>> That may be just a missed optimization of course.
> >>> Hmm, I believe right now it doesn't interact with
> execute_cse_reciprocals_1 as
> >>> it's either one or the other.
> >>> I've left it as it is for now, but would you like us to call
> >>> execute_cse_reciprocals_1 on the potentially transformed division?
> >> I think that wouldn't "fix" it given execute_cse_reciprocals_1 works by
> >> seeing all reciprocal uses of 'a' so it may have alrady seen two and
> >> you introduced the third. But we can leave "solving" this for
> >> future enhacements (I'd avoid doing two passes over the IL just because
> >> of said possibility right now).
> >>>> + if (has_other_use)
> >>>> + {
> >>>> + /* Using the two temporaries tmp1, tmp2 from above
> >>>> + the original x is now:
> >>>> + x = tmp1 * tmp2. */
> >>>> + gcc_assert (mult_ssa_name);
> >>>> + gcc_assert (sqr_ssa_name);
> >>>> +
> >>>> + gimple_assign_set_rhs1 (stmt, mult_ssa_name);
> >>>> + gimple_assign_set_rhs2 (stmt, sqr_ssa_name);
> >>>> + gimple_assign_set_rhs_code (stmt, MULT_EXPR);
> >>>> + fold_stmt_inplace (def_gsi);
> >>>> + update_stmt (stmt);
> >>>> + }
> >>>>
> >>>> so you are leaving the original stmt in place unchanged even if it is
> >>>> not used? Why? Note that with -fno-call-exceptions this stmt may
> >>>> throw, so you should arrange to code-generate 1./a in place of the
> >>>> original division to preserve EH behavior. Watch out because then
> >>>> with other uses you have to find a place to insert its computation.
> >>> Ok. These are oversights on my part. I've updated the patch to modify the
> >>> original division in place. The multiplication is placed after it.
> >> If the division throws internally you can't insert after it, you
> >> have to insert on the single non-EH edge outgoing from the BB
> >> with the division. Just try a C++ testcase with -fnon-call-exceptions
> >> and a try {} catch(...) {} around.
> >>
> >> Not sure if it's worth to handle so bailing out if
> >> stmt_can_throw_internal (stmt) would be an option as well.
> >>>> + if (is_square_of (stmt2, x))
> >>>> + {
> >>>> + if (!sqr_stmts.contains (stmt2))
> >>>> + sqr_stmts.safe_push (stmt2);
> >>>> + }
> >>>>
> >>>> this is quadratic in the number of square stmts... please consider
> >>>> making sqr_stmts a bitmap of SSA defs (so the stmt you have now
> >>>> is then SSA_NAME_DEF_STMT (ssa_name (bitmap-element))).
> >>> Done. In practice I didn't see there being more than one such use, I
> expect
> >>> them
> >>> to be CSE'd, but maybe if they're in different basic blocks...
> >> Oh, it just occured to me you could use FOR_EACH_IMM_USE_STMT ()
> >> which will only get you each stmt once... ;) Sorry for the misleading
> >> bitmap suggestion.
> >>>> You do not seem to restrict placement of the use stmts but insert
> >>>> before the 1/sqrt(a) stmt. That possibly hoists the multiplications
> >>>> where they are not needed. Consider
> >>>>
> >>>> x = 1./sqrt(a);
> >>>> if (l)
> >>>> l1 = x * 3.;
> >>>> else if (l2)
> >>>> l1 = x * x;
> >>>> else if (l3)
> >>>> l1 = a * x;
> >>>>
> >>>> or similar where on the path to x * 3. you now perform two extra
> >>>> multiplications.
> >>> Ok, I've restricted the optimisation somewhat.
> >>> Now if there's an other use it won't perform the transformation unless
> there
> >>> is already
> >>> a multiplication present on the main path. That way it won't introduce
> >>> multiplications
> >>> on paths where there aren't any.
> >> OK.
> >>>> Your testcases do not cover the case of other uses at all. Or of
> >>>> EH.
> >>> gcc.dg/recip_sqrt_mult_1.c should handle the other uses case as tmp is a
> >>> global
> >>> being written to. I've added more testcases involving uses in different
> basic
> >>> blocks
> >>> and a g++.dg testcase with -fnon-call-exceptions.
> >>> I'm not sure what functionality to test though apart from that it doesn't
> ICE
> >>> and does
> >>> the transformations I expect.
> >> That's good enough I guess. I see you do have a testcase with
> >> try/catch so I wonder why foo4 doesn't ICE when you insert after
> >> the throwing stmt... maybe it doesn't throw? Ah - they probably
> >> fail your new mult_on_main_path test?
> >
> > Yeah. I did manage to reproduce an ICE eventually by removing that check
> > and enabling -ftrapping-math.
> >
> > I think I'll just not enter this at all if stmt_can_throw_internal (stmt)
> like you suggested.
>
>
> And here is the version that does that. It also reverts to using a vector for
> the sqr_stmts for consistency
> and uses FOR_EACH_IMM_USE_STMT to iterate over the use statements.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Ok?
OK.
Thanks,
Richard.
> Thanks,
> Kyrill
>
> 2018-09-03 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * tree-ssa-math-opts.c (is_mult_by): New function.
> (is_square_of): Use the above.
> (optimize_recip_sqrt): New function.
> (pass_cse_reciprocals::execute): Use the above.
>
> 2018-09-03 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * gcc.dg/recip_sqrt_mult_1.c: New test.
> * gcc.dg/recip_sqrt_mult_2.c: Likewise.
> * gcc.dg/recip_sqrt_mult_3.c: Likewise.
> * gcc.dg/recip_sqrt_mult_4.c: Likewise.
> * gcc.dg/recip_sqrt_mult_5.c: Likewise.
> * g++.dg/recip_sqrt_mult_1.C: Likewise.
> * g++.dg/recip_sqrt_mult_2.C: Likewise.