[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

Sun Jan 6 19:38:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #10 from Chris Elrod <elrodc at gmail dot com> ---
(In reply to Thomas Koenig from comment #9)
> Hm.
> 
> It would help if your benchmark was complete, so I could run it.
> 

I don't suppose you happen to have Julia and be familiar with it? If you (or
someone else here) are, I'll attach the code to generate the fake data (the most
important point is that columns 5:10 of BPP are the upper triangle of a 3x3
symmetric positive definite matrix).

I have also already written a manually unrolled version that gfortran likes.

But I could write Fortran code to create an executable and run benchmarks.
What are best practices? system_clock?
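
For reference, a timing harness built on system_clock could look roughly like
the sketch below. It asks for 64-bit counts so the clock resolution is as fine
as the runtime offers. The kernel (axpy_kernel), the array size and the
repetition count are placeholders invented for illustration; they are not the
fpdbacksolve code from this report.

```fortran
module timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Placeholder kernel (hypothetical): y = y + a*x
  subroutine axpy_kernel(a, x, y)
    real, intent(in)    :: a
    real, intent(in)    :: x(:)
    real, intent(inout) :: y(:)
    y = y + a * x
  end subroutine axpy_kernel

  ! Average wall-clock seconds per kernel call over nrep calls.
  function seconds_per_call(nrep, x, y) result(t)
    integer, intent(in) :: nrep
    real, intent(in)    :: x(:)
    real, intent(inout) :: y(:)
    real(real64)   :: t
    integer(int64) :: t0, t1, rate
    integer        :: i
    call system_clock(count_rate=rate)   ! ticks per second
    call system_clock(t0)
    do i = 1, nrep
      call axpy_kernel(1.0e-6, x, y)
    end do
    call system_clock(t1)
    t = real(t1 - t0, real64) / real(rate, real64) / real(nrep, real64)
  end function seconds_per_call
end module timing_sketch

program bench
  use timing_sketch
  implicit none
  real :: x(1024), y(1024)
  call random_number(x)
  y = 0.0
  print '(a, es12.4, a)', 'time per call: ', seconds_per_call(100000, x, y), ' s'
  ! Use the result so the timed loop cannot be discarded as dead code.
  print *, 'checksum: ', sum(y)
end program bench
```

Averaging over many calls keeps the clock granularity from dominating a kernel
that runs in about a microsecond, and printing a checksum gives the optimizer a
reason to keep the loop.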

(In reply to Thomas Koenig from comment #9)
> 
> However, what happens if you put in
> 
>         real, dimension(:) ::  Uix
>         real, dimension(:), intent(in)  ::  x
>         real, dimension(:), intent(in)  ::  S
> 
> ?
> 
> gfortran should not pack then.

You're right! I wasn't able to follow this exactly, because it didn't want me
to defer shape on Uix. Probably because it needs to compile a version of
fpdbacksolve that can be called from the shared library?
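
To illustrate the difference between the two kinds of declaration, here is a
small sketch. The routine names and the array are made up for the example. With
an explicit-shape dummy the callee expects contiguous storage, so passing a
strided section may make the compiler copy it into a temporary first (the
packing discussed in this report). An assumed-shape dummy receives a descriptor
that carries the stride. Whether a copy really happens at a given call site
depends on the compiler and version; gfortran's -fcheck=array-temps should
report such temporaries at run time.

```fortran
module shape_sketch
  implicit none
contains
  ! Explicit-shape dummy: storage is assumed contiguous, so a strided
  ! actual argument may be packed into a temporary and copied back.
  subroutine scale_explicit(n, v)
    integer, intent(in) :: n
    real, intent(inout) :: v(n)
    v = 2.0 * v
  end subroutine scale_explicit

  ! Assumed-shape dummy: the array descriptor carries the stride,
  ! so no packing is needed for a strided section.
  subroutine scale_assumed(v)
    real, intent(inout) :: v(:)
    v = 2.0 * v
  end subroutine scale_assumed
end module shape_sketch

program demo_shape
  use shape_sketch
  implicit none
  real :: a(4, 3)
  a = 1.0
  ! A row of a column-major array is a strided (non-contiguous) section.
  call scale_explicit(3, a(1, :))
  call scale_assumed(a(2, :))
  if (any(a(1:2, :) /= 2.0)) error stop 'unexpected result'
  print *, a(1, :), a(2, :)
end program demo_shape
```

Both calls give the same values; the difference is only in how the argument is
passed.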

Interestingly, with that change, Flang failed to vectorize the code, but
gfortran did. Compilers are finicky.

Flang, original:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     655.827 ns (0.00% GC)
  median time:      665.698 ns (0.00% GC)
  mean time:        689.967 ns (0.00% GC)
  maximum time:     1.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     162

Flang, not specifying shape: # assembly shows it is using xmm

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.086 μs (0.00% GC)
  median time:      8.315 μs (0.00% GC)
  mean time:        8.591 μs (0.00% GC)
  maximum time:     20.299 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

gfortran, transposed version (not vectorizable): 

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.643 μs (0.00% GC)
  median time:      20.901 μs (0.00% GC)
  mean time:        21.441 μs (0.00% GC)
  maximum time:     54.103 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

gfortran, not specifying shape:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.290 μs (0.00% GC)
  median time:      1.316 μs (0.00% GC)
  mean time:        1.347 μs (0.00% GC)
  maximum time:     4.562 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Assembly confirms it is using zmm registers (though this timing is much too
fast for unvectorized code anyway).

For why gfortran is still slower than the Flang version, here is the loop body:

.L16:
        vmovups (%r10,%rax), %zmm0
        vcmpps  $4, %zmm0, %zmm4, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
        vmulps  %zmm0, %zmm1, %zmm2
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  %zmm5, %zmm2, %zmm2
        vaddps  %zmm6, %zmm0, %zmm0
        vmulps  %zmm2, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm8
        vmulps  %zmm0, %zmm8, %zmm0
        vmulps  %zmm0, %zmm8, %zmm0
        vaddps  %zmm8, %zmm8, %zmm8
        vsubps  %zmm0, %zmm8, %zmm8
        vmulps  (%r8,%rax), %zmm8, %zmm9
        vmulps  (%r9,%rax), %zmm8, %zmm10
        vmulps  (%r12,%rax), %zmm8, %zmm8
        vmovaps %zmm9, %zmm3
        vfnmadd213ps    0(%r13,%rax), %zmm9, %zmm3
        vcmpps  $4, %zmm3, %zmm4, %k1
        vrsqrt14ps      %zmm3, %zmm2{%k1}{z}
        vmulps  %zmm3, %zmm2, %zmm3
        vmulps  %zmm2, %zmm3, %zmm1
        vmulps  %zmm5, %zmm3, %zmm3
        vaddps  %zmm6, %zmm1, %zmm1
        vmulps  %zmm3, %zmm1, %zmm1
        vmovaps %zmm9, %zmm3
        vfnmadd213ps    (%rdx,%rax), %zmm10, %zmm3
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm11
        vmulps  %zmm11, %zmm3, %zmm12
        vmovaps %zmm10, %zmm3
        vfnmadd213ps    (%r14,%rax), %zmm10, %zmm3
        vfnmadd231ps    %zmm12, %zmm12, %zmm3
        vcmpps  $4, %zmm3, %zmm4, %k1
        vrsqrt14ps      %zmm3, %zmm1{%k1}{z}
        vmulps  %zmm3, %zmm1, %zmm3
        vmulps  %zmm1, %zmm3, %zmm0
        vmulps  %zmm5, %zmm3, %zmm3
        vmovups (%rcx,%rax), %zmm1
        vaddps  %zmm6, %zmm0, %zmm0
        vmulps  %zmm3, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm2
        vmulps  %zmm0, %zmm2, %zmm0
        vmulps  %zmm0, %zmm2, %zmm0
        vaddps  %zmm2, %zmm2, %zmm2
        vsubps  %zmm0, %zmm2, %zmm0
        vmulps  %zmm0, %zmm11, %zmm3
        vmulps  %zmm12, %zmm3, %zmm3
        vxorps  %zmm7, %zmm3, %zmm3
        vmulps  %zmm1, %zmm3, %zmm2
        vmulps  %zmm3, %zmm9, %zmm3
        vfnmadd231ps    %zmm8, %zmm9, %zmm1
        vfmadd231ps     (%r11,%rax), %zmm0, %zmm2
        vfmadd132ps     %zmm10, %zmm3, %zmm0
        vmulps  %zmm11, %zmm1, %zmm1
        vfnmadd231ps    %zmm0, %zmm8, %zmm2
        vmovups %zmm2, (%rdi,%rax)
        vmovups %zmm1, (%rbx,%rax)
        vmovups %zmm8, (%r15,%rax)
        addq    $64, %rax
        cmpq    %rax, %rsi
        jne     .L16

I see far more arithmetic instructions here. Is that because gcc is adding
Newton-Raphson steps for the reciprocal square roots, and Flang is not?
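
For anyone unfamiliar with the refinement in question, here is a scalar sketch
of one Newton-Raphson step for a reciprocal square root estimate and one for a
reciprocal estimate. Fortran has no portable way to issue vrsqrt14ps, so the
hardware estimate is imitated by perturbing the exact value by a relative error
of 2**(-14). That perturbation and all names here are assumptions for
illustration only.

```fortran
module nr_sketch
  implicit none
contains
  ! One Newton-Raphson step for y ~ 1/sqrt(x):
  !   y1 = y0 * (1.5 - 0.5 * x * y0**2)
  elemental function refine_rsqrt(x, y0) result(y1)
    real, intent(in) :: x, y0
    real :: y1
    y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
  end function refine_rsqrt

  ! One Newton-Raphson step for r ~ 1/a:
  !   r1 = r0 * (2 - a * r0), i.e. 2*r0 - a*r0**2
  elemental function refine_rcp(a, r0) result(r1)
    real, intent(in) :: a, r0
    real :: r1
    r1 = r0 * (2.0 - a * r0)
  end function refine_rcp
end module nr_sketch

program demo_nr
  use nr_sketch
  implicit none
  real :: x, exact, y0, y1
  x = 3.0
  exact = 1.0 / sqrt(x)
  ! Stand-in for a hardware estimate (assumption): relative error 2**(-14).
  y0 = exact * (1.0 + 2.0**(-14))
  y1 = refine_rsqrt(x, y0)
  print *, 'relative error before: ', abs(y0 - exact) / exact
  print *, 'relative error after:  ', abs(y1 - exact) / exact
end program demo_nr
```

A step of this kind roughly squares the relative error, at the cost of a few
extra multiplies and adds per estimate. The vaddps/vsubps pairs after each
vrcp14ps in the listing have the same shape as the 2*r0 - a*r0**2 form.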

Comparing against an MPFR reference, both seem about equally accurate.
Extreme errors in X with gfortran:
-2.676151882353759158425593894760401386764929650751873229109107488336451373232463e-06
1.396013166812755065773272342567265482854011149035436404435394035182092107481168e-05

with Flang:
-3.086256120296619934226727657432734517145988850964563595564605964817907892518832e-05
2.28181026645836083851985181914739792956608114291961973078603672383927518755594e-06

on the data set I benchmarked with.

Anyway, thanks for the prompt responses. 
And my issue was that gfortran didn't vectorize, but your second change fixed
the problem.
It would of course be nice if code written one way were optimized well
across all compilers and versions. But compilers are finicky. Simply reordering
operations and adding/removing temporary declarations in fpdbacksolve would
sometimes cause Flang to fail to vectorize!
Maybe I'll use #ifdefs around the declarations and save the files as .F90...
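
A rough sketch of what that could look like, with a made-up routine standing in
for fpdbacksolve. It relies only on __GFORTRAN__, which gfortran predefines
when preprocessing; every other compiler takes the #else branch. The file needs
a capital-F extension such as .F90 (or -cpp with gfortran) so the preprocessor
runs.

```fortran
! Save with a .F90 extension so the preprocessor runs.
module ifdef_sketch
  implicit none
contains
  ! Hypothetical routine; only the declaration of v differs per compiler.
  subroutine double_in_place(n, v)
    integer, intent(in) :: n
#ifdef __GFORTRAN__
    real, intent(inout) :: v(:)   ! assumed shape for gfortran
#else
    real, intent(inout) :: v(n)   ! explicit shape for other compilers
#endif
    v(1:n) = 2.0 * v(1:n)
  end subroutine double_in_place
end module ifdef_sketch

program demo_ifdef
  use ifdef_sketch
  implicit none
  real :: a(5)
  a = 1.5
  call double_in_place(size(a), a)
  if (any(a /= 3.0)) error stop 'unexpected result'
  print *, a
end program demo_ifdef
```

Keeping the routine in a module means callers get an explicit interface either
way, which the assumed-shape variant requires.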