[PATCH] Optimise the fpclassify builtin to perform integer operations when possible

Jeff Law law@redhat.com
Fri Sep 16 19:53:00 GMT 2016


On 09/12/2016 10:19 AM, Tamar Christina wrote:
> Hi All,
>
> This patch adds an optimized route to the fpclassify builtin
> for floating point numbers which are similar to IEEE-754 in format.
>
> The goal is to make it faster by:
> 1. Trying to determine the most common case first
>    (e.g. the float is a Normal number) and then the
>    rest. The amount of code generated at -O2 is
>    about the same (± 1 instruction), but the code
>    is much better.
> 2. Using integer operations in the optimized path.
>
> At a high level, the optimized path uses integer operations
> to perform the following:
>
>   if (exponent bits aren't all set or unset)
>      return Normal;
>   else if (no bits are set on the number after masking out
> 	   the sign bit)
>      return Zero;
>   else if (exponent has no bits set)
>      return Subnormal;
>   else if (mantissa has no bits set)
>      return Infinite;
>   else
>      return NaN;
>
> In case the optimization can't be applied, the old
> implementation is used as a fall-back.
>
> A limitation of this new approach is that the exponent
> of the floating-point type has to fit in 31 bits and the type
> has to have an IEEE-like format and values for NaN and INF
> (e.g. for NaN and INF all bits of the exponent must be set).
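
For reference, on an IEEE double the early-exit path described above amounts
to roughly the following standalone C (an illustration only, not code from
the patch; the actual return values come from the builtin's first five
arguments, the glibc FP_* macros are only used here for readability):

  #include <math.h>     /* FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO.  */
  #include <stdint.h>
  #include <string.h>

  static int
  classify_double (double d)
  {
    uint64_t bits;
    memcpy (&bits, &d, sizeof bits);           /* Reinterpret the bits.  */

    unsigned int exp = (bits >> 52) & 0x7ff;   /* The 11 exponent bits.  */

    /* Most common case first: exponent neither all zeros nor all ones.  */
    if (((exp + 1) & 0x7fe) != 0)
      return FP_NORMAL;
    /* Nothing set apart from (possibly) the sign bit.  */
    if ((bits & 0x7fffffffffffffffULL) == 0)
      return FP_ZERO;
    /* Exponent all zeros, mantissa non-zero.  */
    if (exp == 0)
      return FP_SUBNORMAL;
    /* Exponent all ones: infinity if the mantissa is zero, NaN otherwise.  */
    if ((bits & 0xfffffffffffffULL) == 0)
      return FP_INFINITE;
    return FP_NAN;
  }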
>
> To determine this IEEE-likeness, a new boolean was added to real_format.
>
> Regression tests were run on aarch64-none-linux and arm-none-linux-gnueabi
> with no regressions. x86 uses its own implementation rather than
> the fpclassify builtin.
>
> As an example, Aarch64 now generates for classification of doubles:
>
> f:
> 	fmov	x1, d0
> 	mov	w0, 7
> 	sbfx	x2, x1, 52, 11
> 	add	w3, w2, 1
> 	tst	w3, 0x07FE
> 	bne	.L1
> 	mov	w0, 13
> 	tst	x1, 0x7fffffffffffffff
> 	beq	.L1
> 	mov	w0, 11
> 	tbz	x2, 0, .L1
> 	tst	x1, 0xfffffffffffff
> 	mov	w0, 3
> 	mov	w1, 5
> 	csel	w0, w0, w1, ne
>
> .L1:
> 	ret
>
> No new tests, as the existing tests already cover this functionality.
> The glibc benchmarks were run against the builtin and show a 31.3%
> performance gain.
>
> Ok for trunk?
>
> Thanks,
> Tamar
>
> PS. I don't have commit rights so if OK can someone apply the patch for me.
>
> gcc/
> 2016-08-25  Tamar Christina  <tamar.christina@arm.com>
> 	    Wilco Dijkstra  <wilco.dijkstra@arm.com>
>
> 	* gcc/builtins.c (fold_builtin_fpclassify): Added optimized version.
> 	* gcc/real.h (real_format): Added is_ieee_compatible field.
> 	* gcc/real.c (ieee_single_format): Set is_ieee_compatible flag.
> 	(mips_single_format): Likewise.
> 	(motorola_single_format): Likewise.
> 	(spu_single_format): Likewise.
> 	(ieee_double_format): Likewise.
> 	(mips_double_format): Likewise.
> 	(motorola_double_format): Likewise.
> 	(ieee_extended_motorola_format): Likewise.
> 	(ieee_extended_intel_128_format): Likewise.
> 	(ieee_extended_intel_96_round_53_format): Likewise.
> 	(ibm_extended_format): Likewise.
> 	(mips_extended_format): Likewise.
> 	(ieee_quad_format): Likewise.
> 	(mips_quad_format): Likewise.
> 	(vax_f_format): Likewise.
> 	(vax_d_format): Likewise.
> 	(vax_g_format): Likewise.
> 	(decimal_single_format): Likewise.
> 	(decimal_quad_format): Likewise.
> 	(ieee_half_format): Likewise.
> 	(mips_single_format): Likewise.
> 	(arm_half_format): Likewise.
> 	(real_internal_format): Likewise.
>
>
> gcc-public.patch
>
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 1073e35b17b1bc1f6974c71c940bd9d82bbbfc0f..58bf129f9a0228659fd3b976d38d021d1d5bd6bb 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -7947,10 +7947,8 @@ static tree
>  fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
>  {
>    tree fp_nan, fp_infinite, fp_normal, fp_subnormal, fp_zero,
> -    arg, type, res, tmp;
> +    arg, type, res;
>    machine_mode mode;
> -  REAL_VALUE_TYPE r;
> -  char buf[128];
>
>    /* Verify the required arguments in the original call.  */
>    if (nargs != 6
> @@ -7970,14 +7968,143 @@ fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
>    arg = args[5];
>    type = TREE_TYPE (arg);
>    mode = TYPE_MODE (type);
> -  arg = builtin_save_expr (fold_build1_loc (loc, ABS_EXPR, type, arg));
> +  const real_format *format = REAL_MODE_FORMAT (mode);
> +
> +  /*
> +  For IEEE 754 types:
> +
> +  fpclassify (x) ->
> +       !((exp + 1) & (exp_mask & ~1)) // exponent bits not all set or unset
> +	 ? (x & sign_mask == 0 ? FP_ZERO :
> +	   (exp & exp_mask == exp_mask
> +	      ? (mantissa == 0 ? FP_INFINITE : FP_NAN) :
> +	      FP_SUBNORMAL)):
> +       FP_NORMAL.
> +
> +  Otherwise
> +
> +  fpclassify (x) ->
> +       isnan (x) ? FP_NAN :
> +	(fabs (x) == Inf ? FP_INFINITE :
> +	   (fabs (x) >= DBL_MIN ? FP_NORMAL :
> +	     (x == 0 ? FP_ZERO : FP_SUBNORMAL))).
> +  */
> +
> +  /* Check if the number that is being classified is close enough to IEEE 754
> +     format to be able to go in the early exit code.  */
> +  if (format->is_binary_ieee_compatible)
> +    {
> +      gcc_assert (format->b == 2);
> +
> +      const tree int_type = integer_type_node;
> +      const int exp_bits  = (GET_MODE_SIZE (mode) * BITS_PER_UNIT) - format->p;
> +      const int exp_mask  = (1 << exp_bits) - 1;
> +
> +      tree exp, specials, exp_bitfield,
> +	   const_arg0, const_arg1, const0, const1,
> +	   not_sign_mask, zero_check, mantissa_mask,
> +	   mantissa_any_set, exp_lsb_set, mask_check;
> +      tree int_arg_type, int_arg;
Style nit.  Just use

   tree exp, specials, exp_bitfield;
   tree const_arg0, const_arg1, etc etc.


> +
> +      /* Re-interpret the float as an unsigned integer type
> +	 with equal precision.  */
> +      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
> +      int_arg = fold_build1_loc (loc, INDIRECT_REF, int_arg_type,
> +		  fold_build1_loc (loc, NOP_EXPR,
> +				   build_pointer_type (int_arg_type),
> +		    fold_build1_loc (loc, ADDR_EXPR,
> +				     build_pointer_type (type), arg)));
Doesn't this make ARG addressable, which in turn means ARG won't be
exposed to the gimple/ssa optimizers?  Or is it the case that when
fpclassify is used its argument is already in memory (and thus addressable)?
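
If making ARG addressable does turn out to be a problem, one alternative
(untested, just a sketch) might be to reinterpret the bits with a
VIEW_CONVERT_EXPR instead of going through a pointer:

      /* Sketch: same-width bit reinterpretation without taking the
	 address of ARG.  */
      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
      int_arg = fold_build1_loc (loc, VIEW_CONVERT_EXPR, int_arg_type, arg);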


> +      /* ~(1 << location_sign_bit).
> +	 This creates a mask that can be used to mask out the sign bit.  */
> +      not_sign_mask = fold_build1_loc (loc, BIT_NOT_EXPR, int_arg_type,
> +			fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
> +			  const_arg1,
> +			  build_int_cst (int_arg_type, format->signbit_rw)));
Formatting nits.  When you have to wrap a call, the arguments are 
formatted like this

foo (arg, arg, arg, ...
     arg, arg);

Given you've got calls to fold_build2_loc, build_int_cst, etc embedded 
inside other calls to fold_build2_loc, I'd just create some temporaries 
to hold the results of the inner calls.  That'll clean up the formatting 
significantly.
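
E.g. for the sign-mask computation above, something along these lines
(untested; the temporary names are just placeholders):

      tree shift_amount = build_int_cst (int_arg_type, format->signbit_rw);
      tree sign_bit = fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
                                       const_arg1, shift_amount);
      not_sign_mask = fold_build1_loc (loc, BIT_NOT_EXPR, int_arg_type,
                                       sign_bit);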


> +					     exp, const1));
> +
> +      /* Combine the values together.  */
> +      specials = fold_build3_loc (loc, COND_EXPR, int_type, zero_check, fp_zero,
> +		   fold_build3_loc (loc, COND_EXPR, int_type, exp_lsb_set,
> +		    fold_build3_loc (loc, COND_EXPR, int_type, mantissa_any_set,
> +		      HONOR_NANS (mode) ? fp_nan : fp_normal,
> +		      HONOR_INFINITIES (mode) ? fp_infinite : fp_normal),
> +		    fp_subnormal));
So this implies you're running on generic, not gimple, right?  Otherwise 
you can't generate these kinds of expressions.


> diff --git a/gcc/real.h b/gcc/real.h
> index 59af580e78f2637be84f71b98b45ec6611053222..36ded57cf4db7c30c935bdb24219a167480f39c8 100644
> --- a/gcc/real.h
> +++ b/gcc/real.h
> @@ -161,6 +161,15 @@ struct real_format
>    bool has_signed_zero;
>    bool qnan_msb_set;
>    bool canonical_nan_lsbs_set;
> +
> +  /* This flag indicates whether the format can be used in the optimized
> +     code paths for the __builtin_fpclassify function and friends.
> +     The format has to have the same NaN and INF representation as normal
> +     IEEE floats (e.g. the exponent has all bits set for those values), the
> +     most significant bit must be the sign bit, followed by at most 32
> +     exponent bits, and the whole representation must fit in an integer
> +     type.  The format must also use base 2.  */
> +  bool is_binary_ieee_compatible;
>    const char *name;
>  };
I think Joseph has already commented on the contents of the initializer
and a few more cases where we can use the optimized paths.

However, I do have a general question.  There are some targets which 
have FPUs that are basically IEEE, but don't support certain IEEE 
features like NaNs, denorms, etc.

Presumably all that's needed is for those targets to define a hook to 
describe which checks will always be false and you can check the hook's 
return value.  Right?


Can you please include some tests to verify you're getting the initial
code generation you want?  Ideally there'd be execution tests too where
you generate one of the special values, then call the __builtin and
verify that you get the expected results back.  The latter in particular
are key since they'll allow us to catch problems much earlier across the
wide variety of targets GCC supports.
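
For instance, an execution test along these lines would exercise the new
path (just a sketch; the dg directives, values and file placement are
illustrative, not a request for this exact test):

  /* { dg-do run } */
  /* { dg-options "-O2" } */

  extern void abort (void);

  /* The builtin returns whichever of its first five arguments matches
     the class of the last argument.  */
  #define CLASSIFY(X) __builtin_fpclassify (0, 1, 2, 3, 4, (X))

  int
  main (void)
  {
    volatile double nan_val = __builtin_nan ("");
    volatile double inf_val = __builtin_inf ();
    volatile double norm = 1.0;
    volatile double denorm = 4.9406564584124654e-324;  /* Smallest denormal.  */
    volatile double zero = 0.0;

    if (CLASSIFY (nan_val) != 0
        || CLASSIFY (inf_val) != 1
        || CLASSIFY (norm) != 2
        || CLASSIFY (denorm) != 3
        || CLASSIFY (zero) != 4)
      abort ();
    return 0;
  }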

I think you already had plans to post an updated patch.  Please include 
the fixes noted above in that update.

And just to be clear, I like where this is going, I just think we're 
going to need a couple iterations to iron everything out.

Jeff


