This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [AArch64] Emit square root using the Newton series
- From: Evandro Menezes <e dot menezes at samsung dot com>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, nd <nd at arm dot com>
- Date: Mon, 14 Mar 2016 11:39:20 -0500
- Subject: Re: [AArch64] Emit square root using the Newton series
- References: <AM3PR08MB00886499882773F3C8B9F71D83B30 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com> <011d01d17a26$31b3ade0$951b09a0$ at samsung dot com> <AM3PR08MB0088D558E387C1B736785AA883B40 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com> <56E1A7AD dot 90408 at samsung dot com> <AM3PR08MB00886A32AE872304290F0B1483B40 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com> <56E1F1FB dot 6070405 at samsung dot com> <AM3PR08MB0088E16F5745DA0E7D463D2183B50 at AM3PR08MB0088 dot eurprd08 dot prod dot outlook dot com>
On 03/10/16 19:06, Wilco Dijkstra wrote:
> Evandro Menezes <e.menezes@samsung.com> wrote:
>> That's what I had in mind too, but around the approximation for x^-1/2
>> and using masks for vector cases thusly:
>>
>>     fcmne v3.4s, v0.4s, #0.0
>>     frsqrte v1.4s, v0.4s
>>     fmul v2.4s, v1.4s, v1.4s
>>     frsqrts v2.4s, v0.4s, v2.4s
>>     fmul v1.4s, v1.4s, v2.4s
>>     fmul v2.4s, v1.4s, v1.4s
>>     frsqrts v2.4s, v0.4s, v2.4s
>>     fmul v1.4s, v1.4s, v2.4s
>>     and v1.4s, v3.4s
>>     fmul v0.4s, v1.4s, v0.4s
>
> That's possible but the overall latency is higher - according to exynos-1.md the
> above takes 44 cycles while my version would be 37.
I'm currently working to get this prototyped without modifying the
reciprocal square root. Once I'm done, I'll merge both functions
together to generate better code.
I got the scalar version going, but I'm stuck on the vector version.
As you can see above, I need the complement of the mask produced
by FCMEQ to squelch the offending vector elements. However, as FCMEQ
is defined in GCC, it produces an integer vector, and the SIMD AND
only takes integer vectors. I don't see how to pass an FP vector to
AND and then pass its integer result back to an FP insn.
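To make the masking problem concrete, here is a minimal lane-wise sketch in plain C, not GCC RTL. It assumes 32-bit float lanes, and the helper names (model_mask_nonzero, model_and_fp) are hypothetical. It shows only the bit-level idea: the FP data is viewed as integers, ANDed with the mask, and viewed as FP again, with no numeric conversion. In RTL this would presumably mean viewing the FP vector register in the integer vector mode, for instance through a lowpart subreg, but that is an assumption and not something settled in this thread.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4

/* Lane-wise model of the desired compare against zero: all-ones where
   the element is non-zero, all-zeros where it is zero.  This is the
   complement of what an "equal to zero" compare would produce.  */
static void
model_mask_nonzero (uint32_t mask[LANES], const float x[LANES])
{
  for (int i = 0; i < LANES; i++)
    mask[i] = (x[i] != 0.0f) ? UINT32_MAX : 0;
}

/* Lane-wise model of an integer AND applied to FP data: copy the bits
   of each float into an integer, AND them with the mask, and copy the
   bits back.  No value conversion takes place.  */
static void
model_and_fp (float dst[LANES], const float src[LANES],
              const uint32_t mask[LANES])
{
  for (int i = 0; i < LANES; i++)
    {
      uint32_t bits;
      memcpy (&bits, &src[i], sizeof bits);
      bits &= mask[i];
      memcpy (&dst[i], &bits, sizeof bits);
    }
}
```

A lane whose mask is all-zeros comes out as +0.0 whatever it held before, including a NaN, while lanes with an all-ones mask pass through unchanged.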
Here's how the function stands at the moment:
void
aarch64_emit_approx_sqrt (rtx dst, rtx src)
{
  machine_mode mode = GET_MODE (src);
  gcc_assert (GET_MODE_INNER (mode) == SFmode
              || GET_MODE_INNER (mode) == DFmode);
  bool scalar = !VECTOR_MODE_P (mode);
  bool narrow = (mode == V2SFmode);
  rtx xsrc = gen_reg_rtx (mode);
  emit_move_insn (xsrc, src);
  rtx xcc, xne, xmsk;
  if (scalar)
    {
      /* fcmp */
      xcc = aarch64_gen_compare_reg (NE, xsrc, CONST0_RTX (mode));
      xne = gen_rtx_NE (VOIDmode, xcc, const0_rtx);
    }
  else
    {
      machine_mode mcmp = mode_for_vector (int_mode_for_mode
                            (GET_MODE_INNER (mode)), GET_MODE_NUNITS (mode));
      /* fcmne */
      xmsk = gen_reg_rtx (mode);
      /* Just V4SF for now */
      emit_insn (gen_aarch64_cmeqv4sf (xmsk, xsrc, CONST0_RTX (mode)));
      /* TODO: must use the complement of this result. */
    }
  /* Calculate the approximate reciprocal square root. */
  rtx xrsqrt = gen_reg_rtx (mode);
  aarch64_emit_approx_rsqrt (xrsqrt, xsrc);
  /* Calculate the approximate square root. */
  rtx xsqrt = gen_reg_rtx (mode);
  emit_set_insn (xsqrt, gen_rtx_MULT (mode, xrsqrt, xsrc));
  /* Qualify the result for when the input is zero. */
  rtx xdst = gen_reg_rtx (mode);
  if (scalar)
    /* fcsel */
    emit_set_insn (xdst, gen_rtx_IF_THEN_ELSE (mode, xne, xsqrt,
                                               xsrc));
  else
    /* and */
    emit_set_insn (xdst, gen_rtx_AND (mode, xsqrt, xmsk));
  emit_move_insn (dst, xdst);
}
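For reference, the arithmetic that the function above is meant to emit in the scalar case can be modelled in plain C, which also shows why the zero input needs qualifying. This is only a sketch under stated assumptions. The model_* helpers are hypothetical. The initial estimate uses a well-known bit-level trick as a stand-in for FRSQRTE, not the instruction's real lookup. Two Newton steps are used, as in the assembly quoted earlier. Only positive finite inputs and zero are considered.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FRSQRTE: a rough estimate of 1/sqrt(x).
   Like the instruction, it yields +Inf for a zero input.  */
static float
model_rsqrte (float x)
{
  if (x == 0.0f)
    return INFINITY;
  uint32_t bits;
  memcpy (&bits, &x, sizeof bits);
  bits = 0x5f3759dfu - (bits >> 1);  /* Classic bit-level estimate.  */
  float e;
  memcpy (&e, &bits, sizeof e);
  return e;
}

/* Model of FRSQRTS: (3 - a * b) / 2, with 0 * Inf giving 1.5.  */
static float
model_rsqrts (float a, float b)
{
  if ((a == 0.0f && isinf (b)) || (isinf (a) && b == 0.0f))
    return 1.5f;
  return (3.0f - a * b) / 2.0f;
}

/* Estimate followed by two Newton steps: e = e * (3 - x * e * e) / 2.  */
static float
model_approx_rsqrt (float x)
{
  float e = model_rsqrte (x);
  for (int i = 0; i < 2; i++)
    e *= model_rsqrts (x, e * e);
  return e;
}

/* sqrt(x) = x * (1 / sqrt(x)), with no special handling of zero.  */
static float
model_unqualified_sqrt (float x)
{
  return model_approx_rsqrt (x) * x;
}

/* Same, but select the input itself when it is zero (the fcsel).  */
static float
model_approx_sqrt (float x)
{
  float s = model_unqualified_sqrt (x);
  return x != 0.0f ? s : x;
}
```

With a zero input the reciprocal square root stays at +Inf through both steps, so the final multiplication is Inf * 0, which is NaN. The select in the scalar case, or the mask in the vector case, is what turns that back into zero.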
Any help is welcome.
Thank you,
--
Evandro Menezes