[Bug tree-optimization/88713] Vectorized code slow vs. flang

Tue Jan 22 08:35:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #21 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 22 Jan 2019, elrodc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #19 from Chris Elrod <elrodc at gmail dot com> ---
> To add a little more:
> I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
> Julia. Without adding a Newton step, the answers are wrong beyond just a couple
> significant digits.
> With the Newton step, the answers are correct.
> 
> My point is that LLVM-compiled code (Clang/Flang/ispc) is definitely adding
> the Newton step. It gets the correct answer.
> 
> That leaves my best guess for the performance difference as owing to the masked
> "vrsqrt14ps" that gcc is using:
> 
>         vcmpps  $4, %zmm0, %zmm5, %k1
>         vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
> 
> Is there any way for me to test that idea?
> Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and
> benchmark it?

Usually it's easiest to compile to assembler with GCC (-S) and test
this kind of theory by editing the GCC-generated assembly and then
benchmarking that.  Just pass the edited assembly file (.s) to the
gfortran compile command in place of the .f source when linking the
program.
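
For reference, the effect of the Newton step discussed in comment #19
can be sketched numerically.  This is only an illustration, not what GCC
or LLVM emit: the helper names are made up, the arithmetic is done in
double precision, and the hardware estimate is modelled as the exact
1/sqrt(x) with an injected relative error of 2^-14 (the documented bound
for vrsqrt14ps), which is an assumption about the worst case rather than
a measurement.

```python
import math

def approx_rsqrt(x, rel_err=2.0 ** -14):
    """Hypothetical stand-in for a hardware rsqrt estimate:
    exact 1/sqrt(x) perturbed by a fixed relative error (assumed worst case)."""
    return (1.0 / math.sqrt(x)) * (1.0 + rel_err)

def newton_step(x, y):
    """One Newton-Raphson refinement of y ~ 1/sqrt(x): y * (1.5 - 0.5*x*y*y)."""
    return y * (1.5 - 0.5 * x * y * y)

def rel_error(x, y):
    """Relative error of y as an approximation of 1/sqrt(x)."""
    return abs(y * math.sqrt(x) - 1.0)

x = 2.0
y0 = approx_rsqrt(x)      # raw estimate, relative error about 2^-14
y1 = newton_step(x, y0)   # refined estimate, error roughly 1.5 * (2^-14)^2
print("before:", rel_error(x, y0))
print("after: ", rel_error(x, y1))
```

Under these assumptions a single step takes the relative error from
about 6e-5 to below 1e-8, which is smaller than single-precision
rounding error.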