This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH][tree-complex.c] PR tree-optimization/70291: Inline floating-point complex multiplication more aggressively


On Mon, Apr 30, 2018 at 7:41 PM, Kyrill  Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> We can improve the performance of complex floating-point multiplications by
> inlining the expansion a bit more aggressively.
> We can inline complex x = a * b as:
> x = (ar*br - ai*bi) + i(ar*bi + br*ai);
> if (isunordered (__real__ x, __imag__ x))
>   x = __muldc3 (a, b); //Or __mulsc3 for single-precision
>
> That way, in the common case where no NaNs are produced we avoid the
> libgcc call, and we fall back to the NaN-handling code in libgcc only
> if either component of the expansion is NaN.
>
> The implementation is in expand_complex_multiplication in tree-complex.c.
> The above expansion is used at -O1 and higher when not optimising for size;
> at -O0 and -Os a single call to libgcc is emitted instead.
>
> For the code:
> __complex double
> foo (__complex double a, __complex double b)
> {
>   return a * b;
> }
>
> We will now emit at -O2 for aarch64:
> foo:
>         fmul    d16, d1, d3
>         fmul    d6, d1, d2
>         fnmsub  d5, d0, d2, d16
>         fmadd   d4, d0, d3, d6
>         fcmp    d5, d4
>         bvs     .L8
>         fmov    d1, d4
>         fmov    d0, d5
>         ret
> .L8:
>         stp     x29, x30, [sp, -16]!
>         mov     x29, sp
>         bl      __muldc3
>         ldp     x29, x30, [sp], 16
>         ret
>
> Instead of just a branch to __muldc3.
>
> Bootstrapped and tested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> x86_64-unknown-linux-gnu.
>
> Ok for trunk? (GCC 9)

+         /* If optimizing for size or not at all just do a libcall.  */
+         if (optimize == 0 || optimize_function_for_size_p (cfun))
+           {
+             expand_complex_libcall (gsi, ar, ai, br, bi, MULT_EXPR);
+             return;
+           }

use optimize_bb_for_size_p instead please (get BB from the mult stmt).

 /* Expand a complex multiplication or division to a libcall to the c99
+   compliant routines.  Unlike expand_complex_libcall create and insert
+   the call, assign it to an output variable and return that rather than
+   modifying existing statements in place.  */
+
+static tree
+insert_complex_mult_libcall (gimple_stmt_iterator *gsi, tree type, tree ar,
+                             tree ai, tree br, tree bi)
+{

can you please try merging both functions instead?

Also, this shows a possible issue when, with -fnon-call-exceptions, the
original multiplication has EH edges.  I think you want to side-step that
by doing the libcall-only expansion in that case as well (see
stmt_can_throw_internal).

+         tree isunordered_decl = builtin_decl_explicit (BUILT_IN_ISUNORDERED);
+         tree isunordered_res = create_tmp_var (integer_type_node);
+         gimple *tmpr_unord_check
+           = gimple_build_call (isunordered_decl, 2, tmpr, tmpi);
+         gimple_call_set_lhs (tmpr_unord_check, isunordered_res);
+
+         gsi_insert_before (gsi, tmpr_unord_check, GSI_SAME_STMT);
+         gimple *check
+           = gimple_build_cond (NE_EXPR, isunordered_res, integer_zero_node,
+                                NULL_TREE, NULL_TREE);

Why use BUILT_IN_ISUNORDERED rather than a GIMPLE_COND with
UNORDERED_EXPR?  Note again that the comparison might trap/throw with
-fsignaling-nans, so better to avoid this transform when
flag_signaling_nans is set as well...

+         /* We have a conditional block with some assignments in cond_bb.
+            Wire up the PHIs to wrap up.  */
+         if (gimple_in_ssa_p (cfun))
+           {

we are always in SSA form(?)  (probably tree-complex.c can use some TLC here).

+       /* If we are not worrying about NaNs expand to
+         (ar*br - ai*bi) + i(ar*bi + br*ai) directly.  */
+       expand_complex_multiplication_limited_range (gsi, inner_type, ar, ai,
+                                                     br, bi, &rr, &ri);

I think the function is badly named - this isn't about limited ranges,
is it?  That also means we can dispatch to this simple variant not only
for flag_complex_method != 2 but also for !HONOR_NANS &&
!HONOR_INFINITIES?  Maybe that should be done as a followup.

Richard.


> Thanks,
> Kyrill
>
> 2018-04-30  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     PR tree-optimization/70291
>     * tree-complex.c (insert_complex_mult_libcall): New function.
>     (expand_complex_multiplication_limited_range): Likewise.
>     (expand_complex_multiplication): Expand floating-point complex
>     multiplication using the above.
>
> 2018-04-30  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     PR tree-optimization/70291
>     * gcc.dg/complex-6.c: New test.
>     * gcc.dg/complex-7.c: Likewise.

