[Bug tree-optimization/88713] Vectorized code slow vs. flang

elrodc at gmail dot com gcc-bugzilla@gcc.gnu.org
Wed Jan 23 05:18:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #32 from Chris Elrod <elrodc at gmail dot com> ---
(In reply to Marc Glisse from comment #31)
> (In reply to Chris Elrod from comment #30)
> > gcc calculates the rsqrt directly
> 
> No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
> different formula than rsqrt). vrcp14ps shows that it is computing an
> inverse later. What we need to understand is why gcc doesn't try to generate
> rsqrt (which would also have vrsqrt14ps, but a slightly different formula
> without the comparison with 0 and masking, and without needing an inversion
> afterwards).

Okay, I think I follow you. You're saying that instead of doing this (from
rguenther), which is what we want (also without the comparison to 0 and the
masking, as you note):

 /* rsqrt(a) = -0.5     * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

it is doing this, which also uses the rsqrt instruction:

 /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

and then calculating an inverse approximation of that?
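
To make the difference between the two paths concrete, here is a minimal scalar sketch in C. It is not what gcc emits: approx_rsqrt() is a hypothetical stand-in for the hardware estimate (rsqrtss / vrsqrt14ps) built from a well-known bit trick, and the final reciprocal uses an exact division where the generated code would use vrcp14ps plus a refinement. The function names are mine. One thing the sketch does show is why the sqrt form needs extra care at a = 0: the estimate there is infinite, and a * estimate would be 0 * inf.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rsqrtss / vrsqrt14ps: a crude estimate of
   1/sqrt(a) for positive, normal floats (a few percent relative error). */
static float approx_rsqrt(float a)
{
    uint32_t bits;
    float est;
    memcpy(&bits, &a, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&est, &bits, sizeof est);
    return est;
}

/* The form we want: one Newton-Raphson step applied to the estimate.
   rsqrt(a) = -0.5 * r * (a * r * r - 3.0) */
static float rsqrt_nr(float a)
{
    float r = approx_rsqrt(a);
    return -0.5f * r * (a * r * r - 3.0f);
}

/* The sqrt form: the same step with an extra multiplication by a.
   sqrt(a) = -0.5 * a * r * (a * r * r - 3.0) */
static float sqrt_nr(float a)
{
    float r = approx_rsqrt(a);
    return -0.5f * a * r * (a * r * r - 3.0f);
}

/* The path described above: refined sqrt first, then a reciprocal.
   Assumption: exact division here stands in for an approximate reciprocal. */
static float rsqrt_via_sqrt(float a)
{
    return 1.0f / sqrt_nr(a);
}
```

Both rsqrt_nr(4.0f) and rsqrt_via_sqrt(4.0f) land close to 0.5, but the second needs an extra multiply plus the reciprocal step to get there.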

On my computer, the approximate sqrt followed by the approximate reciprocal
was slower than just vsqrt followed by div.

