[Bug tree-optimization/88713] Vectorized code slow vs. flang

Wed Jan 23 05:18:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #32 from Chris Elrod <elrodc at gmail dot com> ---
(In reply to Marc Glisse from comment #31)
> (In reply to Chris Elrod from comment #30)
> > gcc calculates the rsqrt directly
> 
> No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
> different formula than rsqrt). vrcp14ps shows that it is computing an
> inverse later. What we need to understand is why gcc doesn't try to generate
> rsqrt (which would also have vrsqrt14ps, but a slightly different formula
> without the comparison with 0 and masking, and without needing an inversion
> afterwards).

Okay, I think I follow you. You're saying that instead of doing this (from
rguenther), which is what we want (minus the comparison with 0 and the masking,
as you note):

 /* rsqrt(a) = -0.5     * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

it is doing this, which also uses the rsqrt instruction:

 /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

and then calculating an inverse approximation of that?

The approximate sqrt followed by the approximate reciprocal was slower on my
computer than just vsqrt followed by div.