This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [AArch64] Emit division using the Newton series
- From: Evandro Menezes <e dot menezes at samsung dot com>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>
- Cc: James Greenhalgh <James dot Greenhalgh at arm dot com>, Andrew Pinski <pinskia at gmail dot com>, nd <nd at arm dot com>
- Date: Mon, 04 Apr 2016 14:06:44 -0500
- Subject: Re: [AArch64] Emit division using the Newton series
- References: <56EB0EDF dot 3060401 at samsung dot com> <56F2C329 dot 10405 at samsung dot com> <56FDA311 dot 7090309 at samsung dot com> <AM3PR08MB0088DDE6EA428B37CE090953839A0 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com> <56FED036 dot 2070405 at samsung dot com> <AM3PR08MB00884DBC29E8F0651E1ECEC6839A0 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com> <56FEEE90 dot 3070707 at samsung dot com> <AM3PR08MB008804766694273167F6E8C5839A0 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com> <56FEFBC1 dot 1060308 at samsung dot com>
On 04/01/16 17:52, Evandro Menezes wrote:
On 04/01/16 17:45, Wilco Dijkstra wrote:
Evandro Menezes wrote:
However, I don't think that there's a need to handle any special case
for division. The only case where the approximation differs from
division is when the numerator is infinity and the denominator is zero:
the approximation returns infinity and the division returns NaN. So I
don't think that it's a special case that deserves to be handled. IOW,
the result of the approximate reciprocal is always needed.
No, the result of the approximate reciprocal is not needed.
Basically, an NR approximation produces a correction factor that is very
close to 1.0 and then multiplies it by the previous estimate to get a more
accurate estimate. The final calculation for x * recip(y) is:
result = (reciprocal_correction * reciprocal_estimate) * x
while what I am suggesting is a trivial reassociation:
result = reciprocal_correction * (reciprocal_estimate * x)
The computation of the final reciprocal_correction is on the critical
latency path, while reciprocal_estimate is computed earlier, so we can
compute (reciprocal_estimate * x) without increasing the overall latency.
I.e., we save a multiply on the critical path.
In principle this could be done as a separate optimization pass that tries
to reassociate to reduce latency. However, I'm not too convinced this would
be easy to implement in GCC's scheduler, so it's best to do it explicitly.
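To make the reassociation concrete, here is a minimal C sketch; it is not code from the patch. The helpers recip_estimate and recip_step are hypothetical stand-ins for the FRECPE and FRECPS instructions (the linear guess used here is only an assumption for illustration, not what the hardware computes), and the sketch assumes a positive, finite, normal denominator. Both functions perform the same number of multiplications; the second only moves the multiply by the numerator off the path that waits for the last correction factor.

```c
#include <math.h>

/* Hypothetical stand-in for FRECPE: a coarse estimate of 1/y for y > 0,
   with a relative error of at most 1/17.  */
static double
recip_estimate (double y)
{
  int e;
  double m = frexp (y, &e);   /* y = m * 2^e with 0.5 <= m < 1.  */
  return ldexp (48.0 / 17.0 - 32.0 / 17.0 * m, -e);
}

/* Hypothetical stand-in for FRECPS: the Newton-Raphson correction
   factor 2 - x * y, which approaches 1.0 as x approaches 1/y.  */
static double
recip_step (double x, double y)
{
  return 2.0 - x * y;
}

/* result = (reciprocal_correction * reciprocal_estimate) * num  */
double
div_plain (double num, double den, int iterations)
{
  double x = recip_estimate (den);
  while (iterations-- > 0)
    x = x * recip_step (x, den);
  return x * num;
}

/* result = reciprocal_correction * (reciprocal_estimate * num)  */
double
div_reassoc (double num, double den, int iterations)
{
  double x = recip_estimate (den);
  double corr = 1.0;
  while (iterations-- > 0)
    {
      corr = recip_step (x, den);
      if (iterations > 0)
        x = x * corr;         /* Fold in all but the last correction.  */
    }
  x = x * num;                /* Does not depend on the last correction.  */
  return x * corr;            /* The only multiply after the last step.  */
}
```

With three iterations and this coarse starting estimate, div_reassoc (1.0, 3.0, 3) agrees with 1.0 / 3.0 to better than 1e-9, and the two variants differ only by rounding.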
I think that I see what you mean. I'll hack something tomorrow.
[AArch64] Emit division using the Newton series
2016-04-04 Evandro Menezes <e.menezes@samsung.com>
Wilco Dijkstra <Wilco.Dijkstra@arm.com>
gcc/
* config/aarch64/aarch64-tuning-flags.def
* config/aarch64/aarch64-protos.h
(AARCH64_APPROX_MODE): New macro.
(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}):
New tuning macros.
(tune_params): Add new member "approx_div_modes".
(aarch64_emit_approx_div): Declare new function.
* config/aarch64/aarch64.c
(generic_tunings): New member "approx_div_modes".
(cortexa35_tunings): Likewise.
(cortexa53_tunings): Likewise.
(cortexa57_tunings): Likewise.
(cortexa72_tunings): Likewise.
(exynosm1_tunings): Likewise.
(thunderx_tunings): Likewise.
(xgene1_tunings): Likewise.
(aarch64_emit_approx_div): Define new function.
* config/aarch64/aarch64.md ("div<mode>3"): New expansion.
* config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
* config/aarch64/aarch64.opt (-mlow-precision-div): Add new
option.
* doc/invoke.texi (-mlow-precision-div): Describe new option.
This version of the patch has a shorter dependency chain at the last
iteration of the series.
Thank you for your feedback,
--
Evandro Menezes
From c8d94247e5b3c6120436051c8da11850937b7246 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Mon, 4 Apr 2016 14:02:24 -0500
Subject: [PATCH] [AArch64] Emit division using the Newton series
2016-04-04 Evandro Menezes <e.menezes@samsung.com>
Wilco Dijkstra <Wilco.Dijkstra@arm.com>
gcc/
* config/aarch64/aarch64-tuning-flags.def
* config/aarch64/aarch64-protos.h
(AARCH64_APPROX_MODE): New macro.
(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}):
New tuning macros.
(tune_params): Add new member "approx_div_modes".
(aarch64_emit_approx_div): Declare new function.
* config/aarch64/aarch64.c
(generic_tunings): New member "approx_div_modes".
(cortexa35_tunings): Likewise.
(cortexa53_tunings): Likewise.
(cortexa57_tunings): Likewise.
(cortexa72_tunings): Likewise.
(exynosm1_tunings): Likewise.
(thunderx_tunings): Likewise.
(xgene1_tunings): Likewise.
(aarch64_emit_approx_div): Define new function.
* config/aarch64/aarch64.md ("div<mode>3"): New expansion.
* config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise.
* config/aarch64/aarch64.opt (-mlow-precision-div): Add new option.
* doc/invoke.texi (-mlow-precision-div): Describe new option.
---
gcc/config/aarch64/aarch64-protos.h | 28 +++++++++
gcc/config/aarch64/aarch64-simd.md | 14 ++++-
gcc/config/aarch64/aarch64-tuning-flags.def | 1 -
gcc/config/aarch64/aarch64.c | 98 ++++++++++++++++++++++++++---
gcc/config/aarch64/aarch64.md | 19 ++++--
gcc/config/aarch64/aarch64.opt | 5 ++
gcc/doc/invoke.texi | 10 +++
7 files changed, 161 insertions(+), 14 deletions(-)
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58c9d0d..25102d5 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,32 @@ struct cpu_branch_cost
const int unpredictable; /* Unpredictable branch or optimizing for speed. */
};
+/* Control approximate alternatives to certain FP operators. */
+#define AARCH64_APPROX_MODE(MODE) \
+ ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+ ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+ : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+ ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT \
+ + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
+ : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+ | AARCH64_APPROX_MODE (V2SFmode) \
+ | AARCH64_APPROX_MODE (V4SFmode))
+#define AARCH64_APPROX_DP (AARCH64_APPROX_MODE (DFmode) \
+ | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_DFORM (AARCH64_APPROX_MODE (SFmode) \
+ | AARCH64_APPROX_MODE (DFmode) \
+ | AARCH64_APPROX_MODE (V2SFmode))
+#define AARCH64_APPROX_QFORM (AARCH64_APPROX_MODE (V4SFmode) \
+ | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_SCALAR (AARCH64_APPROX_MODE (SFmode) \
+ | AARCH64_APPROX_MODE (DFmode))
+#define AARCH64_APPROX_VECTOR (AARCH64_APPROX_MODE (V2SFmode) \
+ | AARCH64_APPROX_MODE (V4SFmode) \
+ | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_ALL (-1)
+
struct tune_params
{
const struct cpu_cost_table *insn_extra_cost;
@@ -218,6 +244,7 @@ struct tune_params
} autoprefetcher_model;
unsigned int extra_tuning_flags;
+ unsigned int approx_div_modes;
};
#define AARCH64_FUSION_PAIR(x, name) \
@@ -362,6 +389,7 @@ void aarch64_relayout_simd_types (void);
void aarch64_reset_previous_fndecl (void);
void aarch64_save_restore_target_globals (tree);
void aarch64_emit_approx_rsqrt (rtx, rtx);
+bool aarch64_emit_approx_div (rtx, rtx, rtx);
/* Initialize builtins for SIMD intrinsics. */
void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index bd73bce..99be92e 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1509,7 +1509,19 @@
[(set_attr "type" "neon_fp_mul_<Vetype><q>")]
)
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:VDQF 0 "register_operand")
+ (div:VDQF (match_operand:VDQF 1 "general_operand")
+ (match_operand:VDQF 2 "register_operand")))]
+ "TARGET_SIMD"
+{
+ if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+ DONE;
+
+ operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
[(set (match_operand:VDQF 0 "register_operand" "=w")
(div:VDQF (match_operand:VDQF 1 "register_operand" "w")
(match_operand:VDQF 2 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..f25714c 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -30,4 +30,3 @@
AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b7086dd..21af809 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -414,7 +414,8 @@ static const struct tune_params generic_tunings =
0, /* max_case_values. */
0, /* cache_line_size. */
tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_NONE), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
static const struct tune_params cortexa35_tunings =
@@ -439,7 +440,8 @@ static const struct tune_params cortexa35_tunings =
0, /* max_case_values. */
0, /* cache_line_size. */
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_NONE), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
static const struct tune_params cortexa53_tunings =
@@ -464,7 +466,8 @@ static const struct tune_params cortexa53_tunings =
0, /* max_case_values. */
0, /* cache_line_size. */
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_NONE), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
static const struct tune_params cortexa57_tunings =
@@ -489,7 +492,8 @@ static const struct tune_params cortexa57_tunings =
0, /* max_case_values. */
0, /* cache_line_size. */
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
static const struct tune_params cortexa72_tunings =
@@ -514,7 +518,8 @@ static const struct tune_params cortexa72_tunings =
0, /* max_case_values. */
0, /* cache_line_size. */
tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_NONE), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
static const struct tune_params exynosm1_tunings =
@@ -538,7 +543,8 @@ static const struct tune_params exynosm1_tunings =
48, /* max_case_values. */
64, /* cache_line_size. */
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_APPROX_RSQRT), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
static const struct tune_params thunderx_tunings =
@@ -562,7 +568,8 @@ static const struct tune_params thunderx_tunings =
0, /* max_case_values. */
0, /* cache_line_size. */
tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_NONE), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
static const struct tune_params xgene1_tunings =
@@ -586,7 +593,8 @@ static const struct tune_params xgene1_tunings =
0, /* max_case_values. */
0, /* cache_line_size. */
tune_params::AUTOPREFETCHER_OFF, /* autoprefetcher_model. */
- (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags. */
+ (AARCH64_EXTRA_TUNE_APPROX_RSQRT), /* tune_flags. */
+ (AARCH64_APPROX_NONE) /* approx_div_modes. */
};
/* Support for fine-grained override of the tuning structures. */
@@ -7552,6 +7560,80 @@ aarch64_emit_approx_rsqrt (rtx dst, rtx src)
emit_move_insn (dst, x0);
}
+/* Emit the instruction sequence to compute the approximation for a division. */
+
+bool
+aarch64_emit_approx_div (rtx quo, rtx num, rtx div)
+{
+ machine_mode mode = GET_MODE (quo);
+
+ if (!flag_finite_math_only
+ || flag_trapping_math
+ || !flag_unsafe_math_optimizations
+ || optimize_function_for_size_p (cfun)
+ || !(flag_mlow_precision_div
+ || (aarch64_tune_params.approx_div_modes & AARCH64_APPROX_MODE (mode))))
+ return false;
+
+ /* Estimate the approximate reciprocal. */
+ rtx xrcp = gen_reg_rtx (mode);
+ switch (mode)
+ {
+ case SFmode:
+ emit_insn (gen_aarch64_frecpesf (xrcp, div)); break;
+ case V2SFmode:
+ emit_insn (gen_aarch64_frecpev2sf (xrcp, div)); break;
+ case V4SFmode:
+ emit_insn (gen_aarch64_frecpev4sf (xrcp, div)); break;
+ case DFmode:
+ emit_insn (gen_aarch64_frecpedf (xrcp, div)); break;
+ case V2DFmode:
+ emit_insn (gen_aarch64_frecpev2df (xrcp, div)); break;
+ default:
+ gcc_unreachable ();
+ }
+
+ /* Iterate over the series twice for SF and thrice for DF. */
+ int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2;
+
+ /* Optionally iterate over the series once less for faster performance,
+ while sacrificing the accuracy. */
+ if (flag_mlow_precision_div)
+ iterations--;
+
+ rtx xtmp = gen_reg_rtx (mode);
+ while (iterations--)
+ {
+ switch (mode)
+ {
+ case SFmode:
+ emit_insn (gen_aarch64_frecpssf (xtmp, xrcp, div)); break;
+ case V2SFmode:
+ emit_insn (gen_aarch64_frecpsv2sf (xtmp, xrcp, div)); break;
+ case V4SFmode:
+ emit_insn (gen_aarch64_frecpsv4sf (xtmp, xrcp, div)); break;
+ case DFmode:
+ emit_insn (gen_aarch64_frecpsdf (xtmp, xrcp, div)); break;
+ case V2DFmode:
+ emit_insn (gen_aarch64_frecpsv2df (xtmp, xrcp, div)); break;
+ default:
+ gcc_unreachable ();
+ }
+
+ if (iterations > 0)
+ emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xtmp));
+ }
+
+ if (num != CONST1_RTX (mode))
+ {
+ rtx xnum = force_reg (mode, num);
+ emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xnum));
+ }
+
+ emit_set_insn (quo, gen_rtx_MULT (mode, xrcp, xtmp));
+ return true;
+}
+
/* Return the number of instructions that can be issued per cycle. */
static int
aarch64_sched_issue_rate (void)
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 68676c9..985915e 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4647,11 +4647,22 @@
[(set_attr "type" "fmul<s>")]
)
-(define_insn "div<mode>3"
+(define_expand "div<mode>3"
+ [(set (match_operand:GPF 0 "register_operand")
+ (div:GPF (match_operand:GPF 1 "general_operand")
+ (match_operand:GPF 2 "register_operand")))]
+ "TARGET_FLOAT"
+{
+ if (aarch64_emit_approx_div (operands[0], operands[1], operands[2]))
+ DONE;
+
+ operands[1] = force_reg (<MODE>mode, operands[1]);
+})
+
+(define_insn "*div<mode>3"
[(set (match_operand:GPF 0 "register_operand" "=w")
- (div:GPF
- (match_operand:GPF 1 "register_operand" "w")
- (match_operand:GPF 2 "register_operand" "w")))]
+ (div:GPF (match_operand:GPF 1 "register_operand" "w")
+ (match_operand:GPF 2 "register_operand" "w")))]
"TARGET_FLOAT"
"fdiv\\t%<s>0, %<s>1, %<s>2"
[(set_attr "type" "fdiv<s>")]
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index c637ff4..672f08c 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -153,3 +153,8 @@ mlow-precision-recip-sqrt
Common Var(flag_mrecip_low_precision_sqrt) Optimization
When calculating the reciprocal square root approximation,
uses one less step than otherwise, thus reducing latency and precision.
+
+mlow-precision-div
+Common Var(flag_mlow_precision_div) Optimization
+When calculating the approximate division,
+use one less step than otherwise, thus reducing latency and precision.
\ No newline at end of file
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index e9763d4..297f9aa 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -572,6 +572,7 @@ Objective-C and Objective-C++ Dialects}.
-mtls-size=@var{size} @gol
-mfix-cortex-a53-835769 -mno-fix-cortex-a53-835769 @gol
-mfix-cortex-a53-843419 -mno-fix-cortex-a53-843419 @gol
+-mlow-precision-div -mno-low-precision-div @gol
-mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
-march=@var{name} -mcpu=@var{name} -mtune=@var{name}}
@@ -12921,6 +12922,15 @@ uses one less step than otherwise, thus reducing latency and precision.
This is only relevant if @option{-ffast-math} enables the reciprocal square root
approximation, which in turn depends on the target processor.
+@item -mlow-precision-div
+@item -mno-low-precision-div
+@opindex mlow-precision-div
+@opindex mno-low-precision-div
+When calculating the division approximation,
+uses one less step than otherwise, thus reducing latency and precision.
+This is only relevant if @option{-ffast-math} enables the division
+approximation.
+
@item -march=@var{name}
@opindex march
Specify the name of the target architecture and, optionally, one or
--
1.9.1
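
As a reading aid for the AARCH64_APPROX_MODE macro in the aarch64-protos.h hunk above, the following self-contained C sketch reproduces its bit layout: scalar float modes take the low bits and vector float modes take the bits right after them. The toy_mode enumeration and every TOY_ name are hypothetical stand-ins; GCC's real machine_mode enumeration is generated and much larger, so the actual bit positions in the compiler will differ from the ones here.

```c
/* Hypothetical, much-reduced stand-in for GCC's machine_mode enum.  */
enum toy_mode
{
  TOY_SImode,                                /* Not a float mode.  */
  TOY_SFmode, TOY_DFmode,                    /* Scalar float class.  */
  TOY_V2SFmode, TOY_V4SFmode, TOY_V2DFmode   /* Vector float class.  */
};

#define TOY_MIN_MODE_FLOAT TOY_SFmode
#define TOY_MAX_MODE_FLOAT TOY_DFmode
#define TOY_MIN_MODE_VECTOR_FLOAT TOY_V2SFmode
#define TOY_MAX_MODE_VECTOR_FLOAT TOY_V2DFmode

/* Same shape as AARCH64_APPROX_MODE: one bit per float mode, with the
   vector modes placed after the scalar ones; 0 for any other mode.  */
#define TOY_APPROX_MODE(MODE) \
  ((TOY_MIN_MODE_FLOAT <= (MODE) && (MODE) <= TOY_MAX_MODE_FLOAT) \
   ? (1u << ((MODE) - TOY_MIN_MODE_FLOAT)) \
   : (TOY_MIN_MODE_VECTOR_FLOAT <= (MODE) \
      && (MODE) <= TOY_MAX_MODE_VECTOR_FLOAT) \
     ? (1u << ((MODE) - TOY_MIN_MODE_VECTOR_FLOAT \
               + TOY_MAX_MODE_FLOAT - TOY_MIN_MODE_FLOAT + 1)) \
     : 0u)

#define TOY_APPROX_NONE 0u
#define TOY_APPROX_SP (TOY_APPROX_MODE (TOY_SFmode) \
                       | TOY_APPROX_MODE (TOY_V2SFmode) \
                       | TOY_APPROX_MODE (TOY_V4SFmode))
#define TOY_APPROX_DP (TOY_APPROX_MODE (TOY_DFmode) \
                       | TOY_APPROX_MODE (TOY_V2DFmode))

/* Mirrors the mask test in aarch64_emit_approx_div: nonzero if the
   tuning mask MODES enables the approximation for mode M.  */
static int
toy_approx_enabled (unsigned int modes, enum toy_mode m)
{
  return (modes & TOY_APPROX_MODE (m)) != 0;
}
```

A tuning structure that wanted the approximation only for single precision would set its mask to the SP combination, in which case toy_approx_enabled accepts TOY_V4SFmode and rejects TOY_DFmode.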