This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCHv2][GCC] Optimise the fpclassify builtin to perform integer operations when possible

From: Jeff Law <law at redhat dot com>
To: Tamar Christina <Tamar dot Christina at arm dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>
Cc: nd <nd at arm dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "jakub at redhat dot com" <jakub at redhat dot com>, Joseph Myers <joseph at codesourcery dot com>, Michael Meissner <meissner at linux dot vnet dot ibm dot com>, "rguenther at suse dot de" <rguenther at suse dot de>, Moritz Klammler <moritz at klammler dot eu>, Andrew Pinski <pinskia at gmail dot com>
Date: Thu, 20 Oct 2016 16:59:36 -0600
Subject: Re: [PATCHv2][GCC] Optimise the fpclassify builtin to perform integer operations when possible
Authentication-results: sourceware.org; auth=none
References: <VI1PR0801MB2031AD7545A8DAA5F6B9450DFFC10@VI1PR0801MB2031.eurprd08.prod.outlook.com>

On 09/30/2016 07:22 AM, Tamar Christina wrote:

Hi All,

This is v2 of the patch which adds an optimized route to the fpclassify builtin
for floating point numbers which are similar to IEEE-754 in format.

I have addressed most comments from everyone except for two things:

1) Providing a back-end hook to override the functionality. While certainly
   possible the current fpclassify doesn't provide this either. So I'd like to
   treat it as an enhancement rather than an issue.

I think the concern here is PPC, particularly the newer ones which havesignificant hardware support for these kind of characterizations.

Based on the discussions though, I suspect we're going to need somethingnontrivial due to the way the API for __builtin_fpclassify works. Inthe end I can easily see some target way to override the default codesynthesis.

I think these issues should be left for the PPC folks to propose asolution when they're ready to exploit their new hardware. I don'tthink this should block the patch.


2) Doing it in a lowering phase. If the general consensus is that this is the
   path the patch must take then I'd be happy to reconsider. However at this
   this patch does not seem to produce worse code than what there was before.

I think that was a desire from Richi.   I'm a bit torn here.

The code looks more like lowering rather than folding. But it's alsogenerating non-gimple trees and relies on gimple_fold_builtin tore-gimplify the result AFAICT.


Richi -- thoughts?

--

I think its nontrivial to judge worse vs better since it's really afunction of the target's micro-architecture and the context in whichfpclassify is called -- particularly where the input value lives andwhether or not its used in other ways nearby.

In the case where the input value is in memory or not used in floatingpoint arithmetic nearby, your change should be a clear win (with theexception of the latest ppc hardware perhaps).

If the input value is not in memory and used nearby in FP ops, then itgets a lot trickier. We run the risk of making the object addressablewhich means it won't be an SSA_NAME and thus not exposed to the highlevel optimizers.

Richi has indicated that in gimple an object need not be addressablejust because we access random pieces of it, including the ability toavoid marking something as addressable even though we have MEM (&decl)style expressions. I'm not sure how all that works, but trust Richiimplicitly. Additionally you're using VIEW_CONVERT_EXPR now rather thanADDR_EXPR, so that may mitigate things as well.

Finally there's the issue of having to transfer the object between theFP and GP register files which can be highly expensive on somearchitectures. Short of looking at the defs & immediate uses of theinput argument and trying to guess at the cost of moving the objectbetween the register files I don't see a good way to tackle this issue.

I'm inclined to not object on the performance questions. But I wouldlike to hear from Richi WRT lowering vs folding and whether or not hebelieves this belongs elsewhere. If he does, I'd be inclined to suggestearlier rather than later since we do want to expose the generated codeto the gimple optimizers.


Additional implementation comments follow inline.


gcc/
2016-08-25  Tamar Christina  <tamar.christina@arm.com>
	    Wilco Dijkstra  <wilco.dijkstra@arm.com>

	* gcc/builtins.c (fold_builtin_fpclassify): Added optimized version.
	* gcc/real.h (real_format): Added is_ieee_compatible field.
	* gcc/real.c (ieee_single_format): Set is_ieee_compatible flag.
	(mips_single_format): Likewise.
	(motorola_single_format): Likewise.
	(spu_single_format): Likewise.
	(ieee_double_format): Likewise.
	(mips_double_format): Likewise.
	(motorola_double_format): Likewise.
	(ieee_extended_motorola_format): Likewise.
	(ieee_extended_intel_128_format): Likewise.
	(ieee_extended_intel_96_round_53_format): Likewise.
	(ibm_extended_format): Likewise.
	(mips_extended_format): Likewise.
	(ieee_quad_format): Likewise.
	(mips_quad_format): Likewise.
	(vax_f_format): Likewise.
	(vax_d_format): Likewise.
	(vax_g_format): Likewise.
	(decimal_single_format): Likewise.
	(decimal_quad_format): Likewise.
	(iee_half_format): Likewise.
	(mips_single_format): Likewise.
	(arm_half_format): Likewise.
	(real_internal_format): Likewise.


gcc/testsuite/
2016-09-27  Tamar Christina  <tamar.christina@arm.com>

	* gcc.target/aarch64/builtin-fpclassify.c: New codegen test.


gcc-v2-fpclassify.patch


diff --git a/gcc/builtins.c b/gcc/builtins.c
index 9a19a75cc8ed6edb5f543cd7bd26bcc0693e6ebb..1b4878c5ba098dcc0a4a506dbc7959d150cc9028 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -7943,10 +7943,8 @@ static tree
 fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
 {
   tree fp_nan, fp_infinite, fp_normal, fp_subnormal, fp_zero,
-    arg, type, res, tmp;
+    arg, type, res;
   machine_mode mode;
-  REAL_VALUE_TYPE r;
-  char buf[128];

   /* Verify the required arguments in the original call.  */
   if (nargs != 6
@@ -7966,14 +7964,164 @@ fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
   arg = args[5];
   type = TREE_TYPE (arg);
   mode = TYPE_MODE (type);
-  arg = builtin_save_expr (fold_build1_loc (loc, ABS_EXPR, type, arg));
+  const real_format *format = REAL_MODE_FORMAT (mode);
+  const HOST_WIDE_INT type_width = TYPE_PRECISION (type);
+
+  /*
+  For IEEE 754 types:
+
+  fpclassify (x) ->
+       !((exp + 1) & (exp_mask & ~1)) // exponent bits not all set or unset
+	 ? (x & sign_mask == 0 ? FP_ZERO :
+	   (exp & exp_mask == exp_mask
+	      ? (mantisa == 0 ? FP_INFINITE : FP_NAN) :
+	      FP_SUBNORMAL)):
+       FP_NORMAL.
+
+  Otherwise
+
+  fpclassify (x) ->
+       isnan (x) ? FP_NAN :
+	(fabs (x) == Inf ? FP_INFINITE :
+	   (fabs (x) >= DBL_MIN ? FP_NORMAL :
+	     (x == 0 ? FP_ZERO : FP_SUBNORMAL))).
+  */
+
+  /* Check if the number that is being classified is close enough to IEEE 754
+     format to be able to go in the early exit code.  */
+  if (format->is_binary_ieee_compatible
+      && FLOAT_WORDS_BIG_ENDIAN == WORDS_BIG_ENDIAN
+      && !(UNITS_PER_WORD == 4 && type_width == 128)

Why the UNITS_PER_WORD && type width tests against constants?

+
+      /* Extract exp bits from the float, where we expect the exponent to be.
+	 We create a new type because BIT_FIELD_REF does not allow you to
+	 extract less bits than the precision of the storage variable.  */
+      exp_bitfield
+        = fold_build3_loc (loc, BIT_FIELD_REF,
+			   build_nonstandard_integer_type (exp_bits, 0),
+			   int_arg,
+			   build_int_cst (int_type, exp_bits),
+			   build_int_cst (int_type, format->p - 1));
+
+      /* Re-interpret the extracted exponent bits as a 32 bit int.
+	 This allows us to continue doing operations as int_type.  */
+      exp = fold_build1_loc (loc, NOP_EXPR, int_type, exp_bitfield);
+
+      /* Set up some often used constants.  */
+      const_arg0 = build_int_cst (int_arg_type, 0);
+      const_arg1 = build_int_cst (int_arg_type, 1);
+      const_arg2 = build_int_cst (int_arg_type, 2);
+      const0 = build_int_cst (int_type, 0);
+      const1 = build_int_cst (int_type, 1);

Note we have integer_zero_node and integer_one_node. You may be able touse them rather than generating new nodes. If you need the node in adifferent type, you can use fold_convert_const to do that.

+
+      /* b^(p-1) - 1 or 1 << (p - 2)
+	 This creates a mask to be used to check the mantissa value.  */
+      significant_bit = build_int_cst (int_arg_type, format->p - 2);
+      mantissa_mask
+        = fold_build2_loc (loc, MINUS_EXPR, int_arg_type,
+			   fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
+					    const_arg2, significant_bit),
+			   const_arg1);

Can you please double-check this? In particular you seem to begenerating 2 << (p - 2), which doesn't seem to match the comment.


diff --git a/gcc/testsuite/gcc.target/aarch64/builtin-fpclassify.c b/gcc/testsuite/gcc.target/aarch64/builtin-fpclassify.c
new file mode 100644
index 0000000000000000000000000000000000000000..84a73a6483780dac2347e72fa7d139545d2087eb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/builtin-fpclassify.c
@@ -0,0 +1,22 @@
+/* This file checks the code generation for the new __builtin_fpclassify.
+   because checking the exact assembly isn't very useful, we'll just be checking
+   for the presence of certain instructions and the omition of others. */

s/omition/omission/


ISTM you'd want to test this for float, double and long double.


jeff

Follow-Ups:
- Re: [PATCHv2][GCC] Optimise the fpclassify builtin to perform integer operations when possible
  - From: Richard Biener

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]