Bug 31723 - Use reciprocal and reciprocal square root with -ffast-math
Summary: Use reciprocal and reciprocal square root with -ffast-math
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.3.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Uroš Bizjak
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization, patch
Duplicates: 92042
Depends on: 32383
Blocks:
Reported: 2007-04-27 09:07 UTC by Janne Blomqvist
Modified: 2019-10-10 05:01 UTC
CC: 7 users

See Also:
Host:
Target: i686-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-06-14 09:18:11


Attachments
Sample implementation for fast sqrt (for float) (1.33 KB, text/plain)
2007-05-07 19:38 UTC, Thomas Koenig

Description Janne Blomqvist 2007-04-27 09:07:24 UTC
I did some analysis of why gfortran does badly at the gas_dyn benchmark of the Polyhedron benchmark suite. See my analysis at

http://gcc.gnu.org/ml/fortran/2007-04/msg00494.html

In short, GCC should use reciprocal and reciprocal square root instructions (available in single precision for SSE and Altivec) when possible. These instructions are very fast, a few cycles vs. dozens or hundreds of cycles for normal division and square root instructions. However, as these instructions are accurate only to 12 bits, they should be enabled only with -ffast-math (or some separate option that gets included with -ffast-math).

The following C program demonstrates the issue; for all of the functions it should be possible to use reciprocal and/or reciprocal square root instructions instead of normal div and sqrt:

#include <math.h>

float recip1 (float a)
{
  return 1.0f/a;
}

float recip2 (float a, float b)
{
  return a/b;
}

float rsqrt1 (float a)
{
  return 1.0f/sqrtf(a);
}

float rsqrt2 (float a, float b)
{
  /* Mathematically equivalent to 1/sqrt(b*(1/a))  */
  return sqrtf(a/b);
}

asm output (compiled with -std=c99 -O3 -c -Wall -pedantic -march=k8 -mfpmath=sse -ffast-math -S):

        .file   "recip.c"
        .text
        .p2align 4,,15
.globl recip1
        .type   recip1, @function
recip1:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        movss   .LC0, %xmm0
        divss   8(%ebp), %xmm0
        movss   %xmm0, -4(%ebp)
        flds    -4(%ebp)
        leave
        ret
        .size   recip1, .-recip1
        .p2align 4,,15
.globl recip2
        .type   recip2, @function
recip2:
        pushl   %ebp
        movl    %esp, %ebp
        movss   8(%ebp), %xmm0
        divss   12(%ebp), %xmm0
        movss   %xmm0, 8(%ebp)
        flds    8(%ebp)
        leave
        ret
        .size   recip2, .-recip2
        .p2align 4,,15
.globl rsqrt2
        .type   rsqrt2, @function
rsqrt2:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        movss   8(%ebp), %xmm0
        divss   12(%ebp), %xmm0
        sqrtss  %xmm0, %xmm0
        movss   %xmm0, -4(%ebp)
        flds    -4(%ebp)
        leave
        ret
        .size   rsqrt2, .-rsqrt2
        .p2align 4,,15
.globl rsqrt1
        .type   rsqrt1, @function
rsqrt1:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        movss   .LC0, %xmm0
        sqrtss  8(%ebp), %xmm1
        divss   %xmm1, %xmm0
        movss   %xmm0, -4(%ebp)
        flds    -4(%ebp)
        leave
        ret
        .size   rsqrt1, .-rsqrt1
        .section        .rodata.cst4,"aM",@progbits,4
        .align 4
.LC0:
        .long   1065353216
        .ident  "GCC: (GNU) 4.3.0 20070426 (experimental)"
        .section        .note.GNU-stack,"",@progbits


As can be seen, it uses divss and sqrtss instead of rcpss and rsqrtss. Of course, there are vectorized versions of these functions too, rcpps and rsqrtps, that should be used when appropriate (vectorization is important e.g. for gas_dyn).
Comment 1 Tobias Burnus 2007-04-27 10:16:45 UTC
Comment by Richard Guenther in the same thread:
-----------------
I think that even with -ffast-math, 12 bits of accuracy is not ok.  There is
the possibility of doing another Newton iteration step to improve
accuracy; that would be ok for -ffast-math.  We can, though, add an
extra flag -msserecip or whatever you'd call it to enable use of the
instructions with less accuracy.
Comment 2 Richard Biener 2007-04-27 10:45:36 UTC
Note that SSE can vectorize only the float precision variant, not the double
precision one.  So one needs to carefully either disable vectorization for the
double variant to get reciprocal code, or the other way around.

Note that the function/pattern vectorizer needs to be quite "adjusted" to support
emitting multiple instructions if we don't want to create builtin functions for
the result.  But it's certainly possible.

The easier part is to expand differently.
Comment 3 Janne Blomqvist 2007-04-27 11:27:10 UTC
(In reply to comment #2)
> Note that SSE can vectorize only the float precision variant, not the double
> precision one.  So one needs to carefully either disable vectorization for the
> double variant to get reciprocal code or the other way around.

AFAICS these reciprocal instructions are available only for single precision, both for scalar and packed variants. AltiVec is single precision only; the SSE instructions are

rcpss (single precision scalar reciprocal)
rcpps (single precision packed reciprocal)
rsqrtss (single precision scalar reciprocal square root)
rsqrtps (single precision packed reciprocal square root)

There are no equivalent double precision versions of any of these instructions. Or do you think there would be a speed benefit for double precision to

1. Convert to single precision
2. Calculate rcp(s|p)s or rsqrt(p|s)s
3. Refine with newton iteration

vs. just using div(p|s)d or sqrt(p|s)d?
Comment 4 Janne Blomqvist 2007-04-27 11:29:12 UTC
(In reply to comment #3)
> 1. Convert to single precision
> 2. Calculate rcp(s|p)s or rsqrt(p|s)s
> 3. Refine with newton iteration
> 
> vs. just using div(p|s)d or sqrt(p|s)d?

This should be

1. Convert to single precision
2. Calculate rcp(s|p)s or rsqrt(p|s)s
3. Convert back to double precision
4. Refine with newton iteration

Comment 5 Janne Blomqvist 2007-04-27 12:01:03 UTC
With the benchmarks at http://www.hlnum.org/english/doc/frsqrt/frsqrt.html

I get

~/src/benchmark/rsqrt% g++ -O3 -funroll-loops -ffast-math -funit-at-a-time -march=k8 -mfpmath=sse frsqrt.cc
~/src/benchmark/rsqrt% ./a.out
first example: 1 / sqrt(3)
  exact  = 5.7735026918962584e-01
  float  = 5.7735025882720947e-01, error = 1.7948e-08
  double = 5.7735026918962506e-01, error = 1.3461e-15
second example: 1 / sqrt(5)
  exact  = 4.4721359549995793e-01
  float  = 4.4721359014511108e-01, error = 1.1974e-08
  double = 4.4721359549995704e-01, error = 1.9860e-15

Benchmark

(float)  time for 1.0 / sqrt = 5.96 sec (res = 2.8450581250000000e+05)
(float)  time for      rsqrt = 2.49 sec (res = 2.2360225000000000e+05)
(double)  time for 1.0 / sqrt = 7.35 sec (res = 5.9926234364635509e+05)
(double)  time for      rsqrt = 7.49 sec (res = 5.9926234364355623e+05)
Comment 6 Richard Biener 2007-04-27 12:09:34 UTC
You are right, they are only available for float precision.
Comment 7 Tobias Burnus 2007-04-27 12:41:32 UTC
> (float)  time for 1.0 / sqrt = 5.96 sec (res = 2.8450581250000000e+05)
> (float)  time for      rsqrt = 2.49 sec (res = 2.2360225000000000e+05)
> (double)  time for 1.0 / sqrt = 7.35 sec (res = 5.9926234364635509e+05)
> (double)  time for      rsqrt = 7.49 sec (res = 5.9926234364355623e+05)

On an Athlon 64 2x, the double result is more favourable for rsqrt
(using the system g++ 4.1.2 with g++ -march=opteron -O3 -ftree-vectorize -funroll-loops -funit-at-a-time -msse3 frsqrt.cc; similarly with -ffast-math)

(float)  time for 1.0 / sqrt = 3.76 sec (res = 1.7943843750000000e+05)
(float)  time for      rsqrt = 1.72 sec (res = 1.7943843750000000e+05)
(double)  time for 1.0 / sqrt = 5.15 sec (res = 5.9926234364320245e+05)
(double)  time for      rsqrt = 3.34 sec (res = 5.9926234364320245e+05)
Comment 8 Steven Bosscher 2007-04-27 21:43:38 UTC
I suppose this is something that requires new builtins?
Comment 9 Richard Biener 2007-04-27 22:03:19 UTC
I looked at this some time ago and in principle it doesn't require them.  For the
vectorized call we'd need to support target dependent pattern vectorization;
for the scalar case we would need a new optab to handle 1/x expansion specially.
Now, for 1/sqrt a builtin could make sense, but even that can be handled via
another optab at expansion time.

Now just to find the time and start experimenting...
Comment 10 Thomas Koenig 2007-05-07 19:38:16 UTC
Created attachment 13524 [details]
Sample implementation for fast sqrt (for float)

Here's an implementation which uses two lookup tables (one for the
exponent, one for the mantissa).  Comments are in the code.

This does beat the system sqrt a bit on my Athlon-XP with:

$ gcc -ffast-math -fdump-tree-optimized -O3 -march=athlon-xp -mtune=athlon-xp sqrt-4.c dummy.c  -lm
$ time ./a.out
-3.05539e-08
-8.6503e-09
nan
0

real    0m3.771s
user    0m3.728s
sys     0m0.000s

vs.

$ gcc -ffast-math -fdump-tree-optimized -O3 -march=athlon-xp -mtune=athlon-xp sqrt-5.c dummy.c  -lm
$ time ./a.out
-3.05539e-08
-6.21117e-08
nan
0

real    0m4.314s
user    0m4.272s
sys     0m0.004s

so it's doing pretty well.  Of course, it also loses a few bits
of accuracy.

Of course, this could be done much faster if the lookup tables
were implemented in silicon, which they aren't on my home system :-)

Heavily dependent on IEEE, of course.
Comment 11 Uroš Bizjak 2007-06-10 08:28:11 UTC
I have experimented a bit with rcpss, trying to measure the effect of an additional NR step on performance. The NR step was calculated based on http://en.wikipedia.org/wiki/N-th_root_algorithm, and for N=-1 (1/A) we can simplify to:

x1 = x0 (2.0 - A x0)

To obtain 24-bit precision, we have to use a reciprocal, two multiplies and a subtraction (+ a constant load).

First, please note that the "divss" instruction is quite _fast_, clocking in at 23 cycles, where the approximation with an NR step would sum up to 20 cycles, not counting the load of the constant.

I have checked the performance of the following testcase with various implementations on an x86_64 C2D:

--cut here--
#include <stdio.h>
float test(float a)
{
  return 1.0 / a;
}


int main()
{
  float a = 1.12345;
  volatile float t;
  int i;

  for (i = 1; i < 1000000000; i++)
    {
      t += test (a);
      a += 1.0;
    }

  printf("%f\n", t);

  return 0;
}
--cut here--

divss     : 3.132s
rcpss NR  : 3.264s
rcpss only: 3.080s

To enhance the precision of 1/sqrt(A), an additional NR step is calculated as

x1 = 0.5 X0 (3.0 - A x0 x0 x0)

and considering that sqrtss also clocks in at 23 clocks (_far_ from hundreds of clocks ;) ), an additional NR step just isn't worth it.

The experimental patch:

Index: i386.md
===================================================================
--- i386.md     (revision 125599)
+++ i386.md     (working copy)
@@ -15399,6 +15399,15 @@
 ;; Gcc is slightly more smart about handling normal two address instructions
 ;; so use special patterns for add and mull.
 
+(define_insn "*rcpsf2_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (unspec:SF [(match_operand:SF 1 "nonimmediate_operand" "xm")]
+                  UNSPEC_RCP))]
+  "TARGET_SSE"
+  "rcpss\t{%1, %0|%0, %1}"
+  [(set_attr "type" "sse")
+   (set_attr "mode" "SF")])
+
 (define_insn "*fop_sf_comm_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,x")
        (match_operator:SF 3 "binary_fp_operator"
@@ -15448,6 +15457,29 @@
           (const_string "fop")))
    (set_attr "mode" "SF")])
 
+(define_insn_and_split "*rcp_sf_1_sse"
+  [(set (match_operand:SF 0 "register_operand" "=x")
+       (div:SF (match_operand:SF 1 "immediate_operand" "F")
+               (match_operand:SF 2 "nonimmediate_operand" "xm")))
+   (clobber (match_scratch:SF 3 "=&x"))
+   (clobber (match_scratch:SF 4 "=&x"))]
+  "TARGET_SSE_MATH
+   && operands[1] == CONST1_RTX (SFmode)
+   && flag_unsafe_math_optimizations"
+   "#"
+   "&& reload_completed"
+   [(set (match_dup 3)(match_dup 2))
+    (set (match_dup 4)(match_dup 5))
+    (set (match_dup 0)(unspec:SF [(match_dup 3)] UNSPEC_RCP))
+    (set (match_dup 3)(mult:SF (match_dup 3)(match_dup 0)))
+    (set (match_dup 4)(minus:SF (match_dup 4)(match_dup 3)))
+    (set (match_dup 0)(mult:SF (match_dup 0)(match_dup 4)))]
+{
+  rtx two = const_double_from_real_value (dconst2, SFmode);
+
+  operands[5] = validize_mem (force_const_mem (SFmode, two));
+})
+
 (define_insn "*fop_sf_1_mixed"
   [(set (match_operand:SF 0 "register_operand" "=f,f,x")
        (match_operator:SF 3 "binary_fp_operator"

Based on these findings, I guess that the NR step is just not worth it. If we want a noticeable speed-up on division and square root, we have to use the 12-bit implementations, without any refinements - mainly for benchmarketing, I'm afraid.

BTW: on x86_64, patched gcc compiles the "test" function to:

test:
        movaps  %xmm0, %xmm1
        rcpss   %xmm0, %xmm0
        movss   .LC1(%rip), %xmm2
        mulss   %xmm0, %xmm1
        subss   %xmm1, %xmm2
        mulss   %xmm2, %xmm0
        ret
Comment 12 Uroš Bizjak 2007-06-10 10:47:41 UTC
Here are the results of mubench insn timings for various x86 processors:
http://mubench.sourceforge.net/results.html (your target processor can be benchmarked by downloading mubench from http://mubench.sourceforge.net/index.html).

And finally, an interesting read on how commercial compilers trade accuracy for speed (please read at least the part about the SPEC2006 benchmark): http://www.hpcwire.com/hpc/1556972.html
Comment 13 Janne Blomqvist 2007-06-10 11:06:36 UTC
(In reply to comment #11)

Thanks for the work.

> First, please note that "divss" instruction is quite _fast_, clocking at 23
> cycles, where approximation with NR step would sum up to 20 cycles, not
> counting load of constant.
> 
> I have checked the performance of following testcase with various
> implementetations on x86_64 C2D:
> 
> --cut here--
> float test(float a)
> {
>   return 1.0 / a;
> }
>
> divss     : 3.132s
> rcpss NR  : 3.264s
> rcpss only: 3.080s

Interesting, on ubuntu/i686/K8 I get (average of 3 runs)

divss: 7.485 s
rcpss NR: 9.915 s

> To enhance the precision of 1/sqrt(A), additional NR step is calculated as
> 
> x1 = 0.5 X0 (3.0 - A x0 x0 x0)
> 
> and considering that sqrtss also clocks at 23 clocks (_far_ from hundreds of
> clocks ;) ), additional NR step just isn't worth it.

Well, I suppose it depends on the hardware. IIRC older CPUs did division in microcode whereas at least Core 2 and K8 do it in hardware, so I guess the hundreds of cycles don't apply to current CPUs.

Also, supposedly Penryn will have a much improved divider...

That being said, I think there is still a case for the reciprocal square root, as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn linked to in the first message in this PR (in short, ifort does sqrt(a/b) about twice as fast as gfortran by using reciprocal approximations + NR). If indeed div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks show, then it suggests almost all the performance benefit ifort gets is due to the rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn the sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

> Based on these findings, I guess that NR step is just not worth it. If we want
> to have noticeable speed-up on division and square root, we have to use 12bit
> implementations, without any refinements - mainly for benchmarketing, I'm
> afraid.

I hear that it's possible to pass spec2k6/gromacs without the NR step. Like most MD programs, gromacs spends almost all its time in the force calculations, where the majority of time is spent calculating 1/sqrt(...). So perhaps one should watch out for compilers that get suspiciously high scores on that benchmark. :)

No, I'm not suggesting gcc should do this.
Comment 14 Richard Biener 2007-06-10 12:07:36 UTC
The interesting difference between sqrtss, divss and rcpss, rsqrtss is that
the former have throughput of 1/16 while the latter are 1/1 (latencies compare
21 vs. 3).  This is on K10.  The optimization guide only mentions calculating
the reciprocal y = a/b via rcpss and the square root (!) via rsqrtss
(sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))

So the optimization would be mainly to improve instruction throughput, not
overall latency.
Comment 15 Richard Biener 2007-06-10 12:09:02 UTC
And of course optimizing division or square root this way violates IEEE 754 which
specifies these as intrinsic operations.  So a separate flag from -funsafe-math-optimization should be used for this optimization.
Comment 16 Uroš Bizjak 2007-06-10 16:24:58 UTC
(In reply to comment #13)

> > x1 = 0.5 X0 (3.0 - A x0 x0 x0)

Whoops! One x0 too many above. The correct calculation reads:

rsqrt = 0.5 rsqrt(a) (3.0 - a rsqrt(a) rsqrt(a)).

> Well, I suppose it depends on the hardware. IIRC older cpu:s did division with
> microcode whereas at least core2 and K8 do it in hardware, so I guess the
> hundreds of cycles doesn't apply to current cpu:s. 
> 
> Also, supposedly Penryn will have a much improved divider..

Well, mubench says that on my Core2Duo _all_ sqrt and div functions have a latency of 6 clocks and a reciprocal throughput of 5 clocks. By _all_ I mean divss, divps, divsd, divpd, sqrtss, sqrtps, sqrtsd and sqrtpd. OTOH, rsqrtss and rcpss have a latency of 3 clocks and a reciprocal throughput of 2 clocks. This is just amazing.

> That being said, I think there is still a case for the reciprocal square root,
> as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
> linked to in the first message in this PR (in short, ifort does sqrt(a/b) about
> twice as fast as gfortran by using reciprocal approximations + NR). If indeed
> div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks show, then it
> suggests almost all the performance benefit ifort gets is due to the
> rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn the
> sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

It is true that my "benchmark" only exercises a trivial accumulation function. I can prepare a bunch of expanders to expand:

a / b <=> a [rcpss(b) (2.0 - b rcpss(b))]

a / sqrtss(b) <=> a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))].

sqrtss (a) <=> a 0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a))

The second and third cases indeed look similar...

> I hear that it's possible to pass spec2k6/gromacs without the NR step. As most
> MD programs, gromacs spends almost all it's time in the force calculations,
> where the majority of time is spent calculating 1/sqrt(...). So perhaps one
> should watch out for compilers that get suspiciously high scores on that
> benchmark. :)

Yes, see the HPCwire article in Comment #12.

> No, I'm not suggesting gcc should do this.

;))
Comment 17 Uroš Bizjak 2007-06-10 16:49:25 UTC
(In reply to comment #0)

>   /* Mathematically equivalent to 1/sqrt(b*(1/a))  */
>   return sqrtf(a/b);

Whoa, this one is a little gem, but ATM it works in the opposite direction. At least for -ffast-math we could optimize (a / sqrt (b/c)) into a * sqrt (c/b), thus losing one division. I'm sure that richi knows by heart how to write this kind of folding ;)
Comment 18 Uroš Bizjak 2007-06-10 17:34:23 UTC
(In reply to comment #14)
> The interesting difference between sqrtss, divss and rcpss, rsqrtss is that
> the former have throughput of 1/16 while the latter are 1/1 (latencies compare
> 21 vs. 3).  This is on K10.  The optimization guide only mentions calculating
> the reciprocal y = a/b via rcpss and the square root (!) via rsqrtss
> (sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))
> 
> So the optimization would be mainly to improve instruction throughput, not
> overall latency.

If this is the case, then the middle-end will need to fold sqrtss in a different way for targets that prefer rsqrtss. According to Comment #16, it is better to fold to 1.0/sqrt(c/b) instead of sqrt(b/c) because this way we will lose one multiplication during NR expansion by rsqrt [due to sqrt(x) <=>  x * (1.0 / sqrt(x))].

IMO we need a new tree code to handle the reciprocal sqrt - RSQRT_EXPR - together with proper folding functionality that expands directly to the (NR-enhanced) rsqrt optab. If we consider a*sqrt(b/c), then b/c will be expanded as b * NR-rcp(c) [where NR-rcp stands for NR-enhanced rcp] and sqrt will be expanded as NR-rsqrt. In this case, I see no RTL pass that would be able to combine everything together in order to swap the (b/c) operands to produce the NR-enhanced a*rsqrt(c/b) equivalent.
Comment 19 rguenther@suse.de 2007-06-10 21:39:00 UTC
Subject: Re:  Use reciprocal and reciprocal square root
 with -ffast-math

On Sun, 10 Jun 2007, ubizjak at gmail dot com wrote:

> 
> 
> ------- Comment #18 from ubizjak at gmail dot com  2007-06-10 17:34 -------
> (In reply to comment #14)
> > The interesting difference between sqrtss, divss and rcpss, rsqrtss is that
> > the former have throughput of 1/16 while the latter are 1/1 (latencies compare
> > 21 vs. 3).  This is on K10.  The optimization guide only mentions calculating
> > the reciprocal y = a/b via rcpss and the square root (!) via rsqrtss
> > (sqrt a = 0.5 * a * rsqrtss(a) * (3.0 - a * rsqrtss(a) * rsqrtss(a)))
> > 
> > So the optimization would be mainly to improve instruction throughput, not
> > overall latency.
> 
> If this is the case, then middle-end will need to fold sqrtss in different way
> for targets that prefer rsqrtss. According to Comment #16, it is better to fold
> to 1.0/sqrt(c/b) instead of sqrt(b/c) because this way, we will lose one
> multiplication during NR expansion by rsqrt [due to sqrt(x) <=>  x * (1.0 /
> sqrt(x))].
> 
> IMO we need a new tree code to handle reciprocal sqrt - RSQRT_EXPR, together
> with proper folding functionality that expands directly to (NR-enhanced) rsqrt
> optab. If we consider a*sqrt(b/c), then b/c will be expanded as b* NR-rcp(c)
> [where NR-rcp stands for NR enhanced rcp] and sqrt will be expanded as
> NR-rsqrt. In this case, I see no RTL pass that would be able to combine
> everything together in order to swap (b/c) operands to produce NR-enhanced
> a*rsqrt(c/b) equivalent.

We just need a new builtin function, __builtin_rsqrt and at some stage
replace reciprocals of sqrt with the new builtin.  For example in
tree-ssa-math-opts.c which does the existing reciprocal transforms.
For example a target hook could be provided that would for example look
like

   tree target_fn_for_expr (tree expr);

and return a target builtin decl for the given expression.

And we should start splitting this PR ;)  One for a/sqrt(b/c) and one
for the above transformation.

Richard.
Comment 20 Richard Biener 2007-06-10 21:46:58 UTC
PR32279 for 1/sqrt(x/y) to sqrt(y/x)
Comment 21 Richard Biener 2007-06-10 21:48:42 UTC
The other issue is really about this bug, so not splitting.
Comment 22 tbp 2007-06-11 03:32:33 UTC
I'm a bit late to the debate but...

At some point icc did such transformations (for 1/x and sqrt) but, apparently, they're now removed. It didn't bother to plug every hole (i.e. wrt infinities) but at least got the case of 0 covered even when let loose; it's cheap to do.
I've repeatedly been pointed to the peculiar semantics of -ffast-math in the past, so I know there's little chance for me to succeed, but would it be possible to consider that as an option?

PS: Yes, I do rely on infinities and -ffast-math, and deserve to die in a slow and painful way.
Comment 23 Uroš Bizjak 2007-06-11 05:51:42 UTC
(In reply to comment #22)

> At some point icc did such transformations (for 1/x and sqrt) but, apparently,
> they're now removed. It didn't bother to plug every holes (ie wrt infinities)
> but at least got the case of 0 covered even when set lose; it's cheap to do.
> I've repeatedly been pointed to the peculiar semantic of -ffast-math in the
> past, so i know there's little chance for me to succeed, but would it be
> possible to consider that as an option?

But both rcpss and rsqrtss handle infinities correctly (they return zero) and return [-]inf when [-]0.0 is used as an argument.
Comment 24 tbp 2007-06-11 05:58:20 UTC
Yes, but there's some fuss at 0 when you pile an NR round on top.
Comment 25 Uroš Bizjak 2007-06-13 20:20:49 UTC
RFC patch at http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00916.html
Comment 26 Uroš Bizjak 2007-06-14 09:18:11 UTC
Patch at http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00944.html
Comment 27 Tobias Burnus 2007-06-15 13:23:44 UTC
Cross-pointer: see also PR 32352 (Polyhedron aermod.f90 crashes with out-of-bounds problems due to numerical differences when using rsqrt/-mrecip).
Comment 28 uros 2007-06-16 09:53:03 UTC
Subject: Bug 31723

Author: uros
Date: Sat Jun 16 09:52:48 2007
New Revision: 125756

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=125756
Log:
    PR middle-end/31723
    * hooks.c (hook_tree_tree_bool_null): New hook.
    * hooks.h (hook_tree_tree_bool_null): Add prototype.
    * tree-pass.h (pass_convert_to_rsqrt): Declare.
    * passes.c (init_optimization_passes): Add pass_convert_to_rsqrt.
    * tree-ssa-math-opts.c (execute_cse_reciprocals): Scan for a/func(b)
    and convert it to reciprocal a*rfunc(b).
    (execute_convert_to_rsqrt): New function.
    (gate_convert_to_rsqrt): New function.
    (pass_convert_to_rsqrt): New pass definition.
    * target.h (struct gcc_target): Add builtin_reciprocal.
    * target-def.h (TARGET_BUILTIN_RECIPROCAL): New define.
    (TARGET_INITIALIZER): Initialize builtin_reciprocal with
    TARGET_BUILTIN_RECIPROCAL.
    * doc/tm.texi (TARGET_BUILTIN_RECIPROCAL): Document.

    * config/i386/i386.h (TARGET_RECIP): New define.
    * config/i386/i386.md (divsf3): Expand by calling ix86_emit_swdivsf
    for TARGET_SSE_MATH and TARGET_RECIP when
    flag_unsafe_math_optimizations is set and not optimizing for size.
    (*rcpsf2_sse): New insn pattern.
    (*rsqrtsf2_sse): Ditto.
    (rsqrtsf2): New expander.  Expand by calling ix86_emit_swsqrtsf
    for TARGET_SSE_MATH and TARGET_RECIP when
    flag_unsafe_math_optimizations is set and not optimizing for size.
    (sqrt<mode>2): Expand SFmode operands by calling ix86_emit_swsqrtsf
    for TARGET_SSE_MATH and TARGET_RECIP when
    flag_unsafe_math_optimizations is set and not optimizing for size.
    * config/i386/sse.md (divv4sf): Expand by calling ix86_emit_swdivsf
    for TARGET_SSE_MATH and TARGET_RECIP when
    flag_unsafe_math_optimizations is set and not optimizing for size.
    (*sse_rsqrtv4sf2): Do not export.
    (sqrtv4sf2): Ditto.
    (sse_rsqrtv4sf2): New expander.  Expand by calling ix86_emit_swsqrtsf
    for TARGET_SSE_MATH and TARGET_RECIP when
    flag_unsafe_math_optimizations is set and not optimizing for size.
    (sqrtv4sf2): Ditto.
    * config/i386/i386.opt (mrecip): New option.
    * config/i386/i386-protos.h (ix86_emit_swdivsf): Declare.
    (ix86_emit_swsqrtsf): Ditto.
    * config/i386/i386.c (IX86_BUILTIN_RSQRTF): New constant.
    (ix86_init_mmx_sse_builtins): __builtin_ia32_rsqrtf: New
    builtin definition.
    (ix86_expand_builtin): Expand IX86_BUILTIN_RSQRTF using
    ix86_expand_unop1_builtin.
    (ix86_emit_swdivsf): New function.
    (ix86_emit_swsqrtsf): Ditto.
    (ix86_builtin_reciprocal): New function.
    (TARGET_BUILTIN_RECIPROCAL): Use it.
    (ix86_vectorize_builtin_conversion): Rename from
    ix86_builtin_conversion.
    (TARGET_VECTORIZE_BUILTIN_CONVERSION): Use renamed function.
    * doc/invoke.texi (Machine Dependent Options): Add -mrecip to
    "i386 and x86_64 Options" section.
    (Intel 386 and AMD x86_64 Options): Document -mrecip.

testsuite/ChangeLog:

    PR middle-end/31723
    * gcc.target/i386/recip-divf.c: New test.
    * gcc.target/i386/recip-sqrtf.c: Ditto.
    * gcc.target/i386/recip-vec-divf.c: Ditto.
    * gcc.target/i386/recip-vec-sqrtf.c: Ditto.
    * gcc.target/i386/sse-recip.c: Ditto.


Added:
    trunk/gcc/testsuite/gcc.target/i386/recip-divf.c
    trunk/gcc/testsuite/gcc.target/i386/recip-sqrtf.c
    trunk/gcc/testsuite/gcc.target/i386/recip-vec-divf.c
    trunk/gcc/testsuite/gcc.target/i386/recip-vec-sqrtf.c
    trunk/gcc/testsuite/gcc.target/i386/sse-recip.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-protos.h
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/i386.opt
    trunk/gcc/config/i386/sse.md
    trunk/gcc/doc/invoke.texi
    trunk/gcc/doc/tm.texi
    trunk/gcc/hooks.c
    trunk/gcc/hooks.h
    trunk/gcc/passes.c
    trunk/gcc/target-def.h
    trunk/gcc/target.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-pass.h
    trunk/gcc/tree-ssa-math-opts.c

Comment 29 Uroš Bizjak 2007-06-18 08:56:14 UTC
Patch was committed to SVN, so closing as fixed.
Comment 30 Hongtao.liu 2019-10-10 05:01:14 UTC
*** Bug 92042 has been marked as a duplicate of this bug. ***