This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

From: Ramana Radhakrishnan <ramana dot gcc at googlemail dot com>
To: Prathamesh Kulkarni <prathamesh dot kulkarni at linaro dot org>
Cc: Ramana Radhakrishnan <ramana dot radhakrishnan at foss dot arm dot com>, gcc Patches <gcc-patches at gcc dot gnu dot org>, Charles Baylis <charles dot baylis at linaro dot org>
Date: Thu, 4 Feb 2016 11:01:09 +0000
Subject: Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
Authentication-results: sourceware.org; auth=none
References: <CAAgBjMk0Hdask2JU8xs4fj_Ai1e0ggxB+h3ayb=NOGQBYJ8ccQ at mail dot gmail dot com> <55BB4127 dot 5050202 at foss dot arm dot com> <CAAgBjMnteaOO=JFifLD1CG7s-JE5r=65wcQm+HBwSMtpwg3ReA at mail dot gmail dot com>

On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
> On 31 July 2015 at 15:04, Ramana Radhakrishnan
> <ramana.radhakrishnan@foss.arm.com> wrote:
>>
>>
>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>> Hi,
>>> This patch tries to implement division with multiplication by
>>> reciprocal using vrecpe/vrecps
>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>> Tested on arm-none-linux-gnueabihf using qemu.
>>> OK for trunk ?
>>>
>>> Thank you,
>>> Prathamesh
>>>
>>
>> I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2 at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.
>>
>> I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable as well as running something like SPEC2k(6) with atleast one iteration to ensure correctness.
> Hi,
> I got results of SPEC2k6 fp benchmarks:
> a15: +0.64% overall, 481.wrf: +6.46%
> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
> a57: +0.35% overall, 481.wrf: +3.84%
> The other benchmarks had (almost) identical results.

Thanks for the benchmarking results -  Please repost the patch with
the changes that I had requested in my previous review - given it is
now stage4 , I would rather queue changes like this for stage1 now.

Thanks,
Ramana

>
> Thanks,
> Prathamesh
>>
>>
>> moving on to the patches.
>>
>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>> index 654d9d5..28c2e2a 100644
>>> --- a/gcc/config/arm/neon.md
>>> +++ b/gcc/config/arm/neon.md
>>> @@ -548,6 +548,32 @@
>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>  )
>>>
>>
>> Please add a comment here.
>>
>>> +(define_expand "div<mode>3"
>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>
>> I want to double check that this doesn't collide with Alan's patches for FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases.
>>
>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>>> +  {
>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>> +
>>> +    /* Reciprocal estimate */
>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>> +
>>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>>> +    for (int i = 0; i < 2; i++)
>>> +      {
>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>> +      }
>>> +
>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>> +    DONE;
>>> +  }
>>> +)
>>> +
>>> +
>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>> new file mode 100644
>>> index 0000000..e562ef3
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>> @@ -0,0 +1,14 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
>>> +/* { dg-add-options arm_v8_neon } */
>>
>> No this is wrong.
>>
>> What is armv8 specific about this test ? This is just like another test that is for Neon. vrecpe / vrecps are not instructions that were introduced in the v8 version of the architecture. They've existed in the base Neon instruction set. The code generation above in the patterns will be enabled when TARGET_NEON is true which can happen when -mfpu=neon -mfloat-abi={softfp/hard} is true.
>>
>>> +
>>> +void
>>> +foo (int len, float * __restrict p, float *__restrict x)
>>> +{
>>> +  len = len & ~31;
>>> +  for (int i = 0; i < len; i++)
>>> +    p[i] = p[i] / x[i];
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>> new file mode 100644
>>> index 0000000..8e15d0a
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>> @@ -0,0 +1,14 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>
>> And likewise.
>>
>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
>>> +/* { dg-add-options arm_v8_neon } */
>>> +
>>> +void
>>> +foo (int len, float * __restrict p, float *__restrict x)
>>> +{
>>> +  len = len & ~31;
>>> +  for (int i = 0; i < len; i++)
>>> +    p[i] = p[i] / x[i];
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>>
>>
>> regards
>> Ramana

Follow-Ups:
- Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
  - From: Prathamesh Kulkarni

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]