[AArch64] Emit square root using the Newton series

Evandro Menezes e.menezes@samsung.com
Mon Mar 14 16:39:00 GMT 2016


On 03/10/16 19:06, Wilco Dijkstra wrote:
> Evandro Menezes <e.menezes@samsung.com> wrote:
>> That's what I had in mind too, but around the approximation for x^-1/2
>> and using masks for vector cases thusly:
>>
>>         fcmne   v3.4s, v0.4s, #0.0
>>         frsqrte v1.4s, v0.4s
>>         fmul    v2.4s, v1.4s, v1.4s
>>         frsqrts v2.4s, v0.4s, v2.4s
>>         fmul    v1.4s, v1.4s, v2.4s
>>         fmul    v2.4s, v1.4s, v1.4s
>>         frsqrts v2.4s, v0.4s, v2.4s
>>         fmul    v1.4s, v1.4s, v2.4s
>>         and     v1.4s, v3.4s
>>         fmul    v0.4s, v1.4s, v0.4s
> That's possible but the overall latency is higher - according to exynos-1.md the
> above takes 44 cycles while my version would be 37.

I'm currently working to get this prototyped without modifying the 
reciprocal square root.  Once I'm done, I'll merge both functions 
together to generate better code.

I got the scalar version going, but I'm stuck on the vector version.
As you can see above, I need the complement of the mask produced by
FCMEQ to squelch the offending vector element.  However, as FCMEQ is
defined in GCC, it produces an integer vector, and the SIMD AND only
takes integer vectors.  I don't see how to pass an FP vector to AND and
then hand the resulting integer vector back to an FP insn.
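
To make the intent of the masking concrete, here is a minimal C sketch
of a single lane of the sequence quoted above.  It is only a model, not
GCC code: model_frsqrte, model_frsqrts and model_approx_sqrt are
hypothetical names, the initial estimate is a well-known bit trick
standing in for the real FRSQRTE table lookup, and returning +inf for a
zero input is an assumption about the instruction's behaviour.  The
point is that the estimate for a zero input is not finite, so the final
multiply by the input yields NaN unless the estimate's bits are first
ANDed with the complemented compare mask.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer and back.  */
static uint32_t f2u (float f) { uint32_t u; memcpy (&u, &f, sizeof u); return u; }
static float u2f (uint32_t u) { float f; memcpy (&f, &u, sizeof f); return f; }

/* Hypothetical stand-in for FRSQRTE: a coarse estimate of 1/sqrt(x).
   Assumption: a zero input yields +inf.  */
static float
model_frsqrte (float x)
{
  if (x == 0.0f)
    return INFINITY;
  return u2f (0x5f3759dfu - (f2u (x) >> 1));
}

/* Hypothetical stand-in for FRSQRTS: the Newton step factor (3 - a * b) / 2.
   The instruction's special-case handling is not modelled.  */
static float
model_frsqrts (float a, float b)
{
  return (3.0f - a * b) * 0.5f;
}

/* One lane of the quoted sequence: estimate, two Newton steps,
   optional masking, then multiply by the input.  */
static float
model_approx_sqrt (float x, int use_mask)
{
  /* Complement of an FCMEQ-style result: all ones unless x == 0.  */
  uint32_t mask = (x == 0.0f) ? 0u : ~0u;

  float y = model_frsqrte (x);
  for (int i = 0; i < 2; i++)
    y = y * model_frsqrts (x, y * y);

  /* The AND works on the bit pattern, hence the integer round trip.  */
  if (use_mask)
    y = u2f (f2u (y) & mask);

  return y * x;
}
```

In this model, model_approx_sqrt (0.0f, 1) returns zero, whereas
model_approx_sqrt (0.0f, 0) returns NaN; nonzero inputs are unaffected
by the mask.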

Here's how the function stands at the moment:

    void
    aarch64_emit_approx_sqrt (rtx dst, rtx src)
    {
       machine_mode mode = GET_MODE (src);
       gcc_assert (GET_MODE_INNER (mode) == SFmode
                   || GET_MODE_INNER (mode) == DFmode);

       bool scalar = !VECTOR_MODE_P (mode);
       bool narrow = (mode == V2SFmode);

       rtx xsrc = gen_reg_rtx (mode);
       emit_move_insn (xsrc, src);

       rtx xcc, xne, xmsk;
       if (scalar)
         {
           /* fcmp */
           xcc = aarch64_gen_compare_reg (NE, xsrc, CONST0_RTX (mode));
           xne = gen_rtx_NE (VOIDmode, xcc, const0_rtx);
         }
       else
         {
           machine_mode mcmp
             = mode_for_vector (int_mode_for_mode (GET_MODE_INNER (mode)),
                                GET_MODE_NUNITS (mode));
           /* fcmne */
           xmsk = gen_reg_rtx (mode);
           /* Just V4SF for now */
           emit_insn (gen_aarch64_cmeqv4sf (xmsk, xsrc, CONST0_RTX (mode)));
           /* TODO: must use the complement of this result.  */
         }

       /* Calculate the approximate reciprocal square root.  */
       rtx xrsqrt = gen_reg_rtx (mode);
       aarch64_emit_approx_rsqrt (xrsqrt, xsrc);

       /* Calculate the approximate square root.  */
       rtx xsqrt = gen_reg_rtx (mode);
       emit_set_insn (xsqrt, gen_rtx_MULT (mode, xrsqrt, xsrc));

       /* Qualify the result for when the input is zero.  */
       rtx xdst = gen_reg_rtx (mode);
       if (scalar)
         /* fcsel */
         emit_set_insn (xdst,
                        gen_rtx_IF_THEN_ELSE (mode, xne, xsqrt, xsrc));
       else
         /* and */
         emit_set_insn (xdst, gen_rtx_AND (mode, xsqrt, xmsk));

       emit_move_insn (dst, xdst);
    }

Any help is welcome.

Thank you,

-- 
Evandro Menezes


