This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math
- From: "jb at gcc dot gnu dot org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 10 Jun 2007 11:06:37 -0000
- Subject: [Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math
- References: <bug-31723-11659@http.gcc.gnu.org/bugzilla/>
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
------- Comment #13 from jb at gcc dot gnu dot org 2007-06-10 11:06 -------
(In reply to comment #11)
Thanks for the work.
> First, please note that "divss" instruction is quite _fast_, clocking at 23
> cycles, where approximation with NR step would sum up to 20 cycles, not
> counting load of constant.
>
> I have checked the performance of following testcase with various
> implementations on x86_64 C2D:
>
> --cut here--
> float test(float a)
> {
> return 1.0 / a;
> }
>
> divss : 3.132s
> rcpss NR : 3.264s
> rcpss only: 3.080s
Interesting. On Ubuntu/i686/K8 I get (average of 3 runs):
divss: 7.485 s
rcpss NR: 9.915 s
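For reference, the "rcpss NR" variant can be sketched roughly as below: a
hardware reciprocal estimate x0 followed by one Newton-Raphson step
x1 = x0 (2.0 - a x0). This is only an illustration, not the code from the
patch. The names recip_estimate and recip_nr are made up here, and the
non-SSE fallback (a deliberately coarsened 1/a) is an assumption added so
the sketch builds on any target.

```c
#include <math.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: coarse estimate of 1/a.
 * With SSE this is the rcpss instruction (roughly 12 bits of precision).
 * The fallback is an assumption for portability only: it perturbs the
 * exact quotient so that the refinement step has something to correct. */
static float recip_estimate(float a)
{
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(a)));
#else
    return (1.0f / a) * 1.0002f;
#endif
}

/* One Newton-Raphson step for f(x) = 1/x - a:
 *   x1 = x0 * (2 - a * x0)
 * A relative error e in x0 becomes roughly e*e in x1. */
static float recip_nr(float a)
{
    float x0 = recip_estimate(a);
    return x0 * (2.0f - a * x0);
}
```

Calling recip_nr(a) in place of 1.0f / a in the testcase corresponds to the
"rcpss NR" row; returning recip_estimate(a) directly would correspond to
"rcpss only".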
> To enhance the precision of 1/sqrt(A), additional NR step is calculated as
>
> x1 = 0.5 x0 (3.0 - A x0 x0)
>
> and considering that sqrtss also clocks at 23 clocks (_far_ from hundreds of
> clocks ;) ), additional NR step just isn't worth it.
Well, I suppose it depends on the hardware. IIRC older CPUs did division in
microcode, whereas at least Core 2 and K8 do it in hardware, so I guess the
"hundreds of cycles" figure doesn't apply to current CPUs.
Also, Penryn is supposed to have a much improved divider.
That being said, I think there is still a case for the reciprocal square root,
as evidenced by the benchmarks in comments #5 and #7 as well as my analysis of
gas_dyn linked to in the first message in this PR (in short, ifort does
sqrt(a/b) about twice as fast as gfortran by using reciprocal approximations +
NR). If div(p|s)s is indeed about as fast as rcp(p|s)s, as your benchmarks
show, then that suggests almost all the performance benefit ifort gets is due
to rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn
the sqrt(a/b) loop fills an array, whereas your benchmark accumulates.
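To illustrate the kind of sequence meant here, the following is a minimal
sketch of sqrt(a/b) computed with a reciprocal square root estimate plus one
NR step, using the identity sqrt(a/b) = a / sqrt(a b) for positive a and b.
This is not claimed to be the exact sequence ifort emits. The names
rsqrt_estimate, rsqrt_nr and sqrt_ratio are hypothetical, and the non-SSE
fallback is an assumption added only so the sketch builds on any target.

```c
#include <math.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: coarse estimate of 1/sqrt(a).
 * With SSE this is the rsqrtss instruction (roughly 12 bits of precision).
 * The fallback is an assumption for portability only. */
static float rsqrt_estimate(float a)
{
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));
#else
    return (1.0f / sqrtf(a)) * 1.0002f;
#endif
}

/* One Newton-Raphson step for f(x) = 1/(x*x) - a:
 *   x1 = 0.5 * x0 * (3 - a * x0 * x0) */
static float rsqrt_nr(float a)
{
    float x0 = rsqrt_estimate(a);
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}

/* sqrt(a/b) without a divide or a sqrt instruction, valid for a, b > 0.
 * Note that the product a*b can overflow or underflow for extreme inputs;
 * this sketch ignores that. */
static float sqrt_ratio(float a, float b)
{
    return a * rsqrt_nr(a * b);
}
```

In a loop that fills an array, each sqrt_ratio call is independent of the
others, so the multiplies of neighbouring iterations can overlap; a loop
that accumulates into one variable additionally serializes on the sum.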
> Based on these findings, I guess that NR step is just not worth it. If we want
> to have noticeable speed-up on division and square root, we have to use 12bit
> implementations, without any refinements - mainly for benchmarketing, I'm
> afraid.
I hear that it's possible to pass spec2k6/gromacs without the NR step. Like
most MD programs, gromacs spends almost all its time in the force calculations,
where the majority of time is spent calculating 1/sqrt(...). So perhaps one
should watch out for compilers that get suspiciously high scores on that
benchmark. :)
No, I'm not suggesting gcc should do this.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723