This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math



------- Comment #13 from jb at gcc dot gnu dot org  2007-06-10 11:06 -------
(In reply to comment #11)

Thanks for the work.

> First, please note that "divss" instruction is quite _fast_, clocking at 23
> cycles, where approximation with NR step would sum up to 20 cycles, not
> counting load of constant.
> 
> I have checked the performance of the following testcase with various
> implementations on x86_64 C2D:
> 
> --cut here--
> float test(float a)
> {
>   return 1.0 / a;
> }
>
> divss     : 3.132s
> rcpss NR  : 3.264s
> rcpss only: 3.080s

Interesting, on Ubuntu/i686/K8 I get (average of 3 runs):

divss: 7.485 s
rcpss NR: 9.915 s
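
For reference, the "rcpss NR" variant measured above can be sketched in C roughly as follows. This is only an illustration, not the code gcc generates: the helper names recip_estimate and recip_nr are hypothetical, and the non-SSE branch is a made-up stand-in that truncates an exact quotient to about 12 bits so the sketch also builds on other targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: a reciprocal estimate good to roughly 12 bits. */
float recip_estimate(float a)
{
#if defined(__SSE__)
    /* rcpss: hardware approximation of 1/a */
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(a)));
#else
    /* Assumed stand-in for targets without SSE: compute the exact
       quotient, then clear the low 12 mantissa bits to mimic a
       low-precision estimate. */
    float r = 1.0f / a;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= ~UINT32_C(0xFFF);
    memcpy(&r, &bits, sizeof r);
    return r;
#endif
}

/* Estimate followed by one Newton-Raphson step: x1 = x0 * (2 - a * x0).
   If x0 has relative error e, x1 has relative error of about e * e. */
float recip_nr(float a)
{
    assert(a != 0.0f);
    float x0 = recip_estimate(a);
    return x0 * (2.0f - a * x0);
}
```

Calling recip_nr(3.0f) should return a value within about one part in a million of 1/3, whereas recip_estimate(3.0f) alone is only good to a few parts in ten thousand. The refinement costs two multiplies and a subtract on top of the estimate, which is where the extra cycles in the NR timings come from.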

> To enhance the precision of 1/sqrt(A), additional NR step is calculated as
> 
> x1 = 0.5 x0 (3.0 - A x0 x0)
> 
> and considering that sqrtss also clocks at 23 clocks (_far_ from hundreds of
> clocks ;) ), additional NR step just isn't worth it.
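
As a sketch of the refinement being discussed: one Newton-Raphson step for y = 1/sqrt(A) multiplies the estimate x0 by 0.5 * (3 - A * x0 * x0). The function names below are hypothetical, and the SSE wrapper is only one assumed way of obtaining x0 (via rsqrtss); it is not meant to reproduce any compiler's output.

```c
#include <assert.h>

/* One Newton-Raphson step for 1/sqrt(a), starting from estimate x0.
   If x0 has relative error e, the result has relative error of
   roughly 1.5 * e * e. */
float rsqrt_nr_step(float a, float x0)
{
    assert(a > 0.0f);
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}

#if defined(__SSE__)
#include <xmmintrin.h>

/* Assumed usage on x86 with SSE: take the rsqrtss estimate and
   refine it once. */
float rsqrt_sse_nr(float a)
{
    float x0 = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));
    return rsqrt_nr_step(a, x0);
}
#endif
```

Starting from x0 = 0.49 for A = 4 (true value 0.5), one step gives about 0.4997 and a second step about 0.4999997, so the relative error roughly squares each time. The step itself is three multiplies and a subtract that all depend on the estimate, which is the latency being weighed against sqrtss plus divss.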

Well, I suppose it depends on the hardware. IIRC older CPUs did division in
microcode, whereas at least Core 2 and K8 do it in hardware, so I guess the
hundreds-of-cycles figure doesn't apply to current CPUs.

Also, Penryn will supposedly have a much improved divider.

That being said, I think there is still a case for the reciprocal square root,
as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
linked to in the first message in this PR (in short, ifort does sqrt(a/b) about
twice as fast as gfortran by using reciprocal approximations + NR). If
div(p|s)s is indeed about as fast as rcp(p|s)s, as your benchmarks show, that
suggests almost all of the performance benefit ifort gets is due to
rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn the
sqrt(a/b) loop fills an array, whereas your benchmark accumulates.
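
To make the array-filling pattern concrete, here is a rough sketch of a loop that computes sqrt(a/b) from a reciprocal square root estimate plus one NR step, using the identity sqrt(a/b) = a * (1/sqrt(a*b)) for positive a and b. This is an assumption about the general shape of such code, not the sequence ifort emits and not the gas_dyn source. The names rsqrt_seed and fill_sqrt_ratio are hypothetical, and the non-SSE seed (the well-known 0x5f3759df bit trick refined once) is a coarser stand-in for rsqrtss.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: low-precision estimate of 1/sqrt(v), v > 0. */
float rsqrt_seed(float v)
{
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(v)));
#else
    /* Assumed portable stand-in: integer bit trick, then one NR step.
       Less accurate than the hardware estimate. */
    uint32_t i;
    float y;
    memcpy(&i, &v, sizeof i);
    i = UINT32_C(0x5f3759df) - (i >> 1);
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * v * y * y);
#endif
}

/* Fill out[i] with an approximation of sqrt(a[i] / b[i]) for positive
   inputs. Each iteration is independent of the others, so nothing is
   carried from one element to the next. Note that a[i] * b[i] can
   overflow or underflow where a[i] / b[i] would not. */
void fill_sqrt_ratio(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float p = a[i] * b[i];
        float y = rsqrt_seed(p);            /* y ~ 1/sqrt(a*b) */
        y = 0.5f * y * (3.0f - p * y * y);  /* one NR refinement */
        out[i] = a[i] * y;                  /* a/sqrt(a*b) = sqrt(a/b) */
    }
}
```

For a = {8, 18, 1} and b = {2, 2, 4} the results should come out close to 2, 3 and 0.5. Since no value flows between iterations, a loop of this shape avoids both the divide and the square root instruction entirely, which is not true of a benchmark that measures division alone.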

> Based on these findings, I guess that NR step is just not worth it. If we want
> to have noticeable speed-up on division and square root, we have to use 12bit
> implementations, without any refinements - mainly for benchmarketing, I'm
> afraid.

I hear that it's possible to pass spec2k6/gromacs without the NR step. Like
most MD programs, gromacs spends almost all its time in the force calculations,
where the majority of time is spent calculating 1/sqrt(...). So perhaps one
should watch out for compilers that get suspiciously high scores on that
benchmark. :)

No, I'm not suggesting gcc should do this.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723

