[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
elrodc at gmail dot com
gcc-bugzilla@gcc.gnu.org
Sun Jan 6 19:38:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #10 from Chris Elrod <elrodc at gmail dot com> ---
(In reply to Thomas Koenig from comment #9)
> Hm.
>
> It would help if your benchmark was complete, so I could run it.
>
I don't suppose you happen to have Julia and be familiar with it? If you (or
someone else here) are, I'll attach the code to generate the fake data (the most
important point is that columns 5:10 of BPP are the upper triangle of a 3x3
symmetric positive definite matrix).
I have also already written a manually unrolled version that gfortran likes.
But I could write Fortran code to create an executable and run benchmarks.
What are best practices? system_clock?
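On the timing question, one common pattern (a sketch, not an established best practice) is to call system_clock with 64-bit integer arguments, repeat the call many times, and print a value derived from the output so the optimizer cannot discard the work. The kernel, array size, and repetition count below are hypothetical stand-ins, not the code from this report.

```fortran
program time_kernel
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 1024, reps = 1000   ! hypothetical sizes
  real :: x(n), y(n)
  integer(int64) :: t0, t1, rate
  integer :: r
  real(real64) :: ns_per_call

  call random_number(x)
  y = 0.0

  call system_clock(count_rate=rate)   ! clock ticks per second
  call system_clock(t0)
  do r = 1, reps
     call kernel(x, y)                 ! stand-in for the routine under test
  end do
  call system_clock(t1)

  ns_per_call = 1.0e9_real64 * real(t1 - t0, real64) &
                / (real(rate, real64) * real(reps, real64))

  ! Printing a checksum keeps the timed loop from being optimized away.
  print '(a, f12.1, a, es12.4)', 'ns per call: ', ns_per_call, &
        '  checksum: ', sum(y)

contains

  ! Hypothetical kernel: accumulates into b so every repetition
  ! depends on the previous one.
  subroutine kernel(a, b)
    real, intent(in)    :: a(:)
    real, intent(inout) :: b(:)
    b = b + sqrt(a)
  end subroutine kernel

end program time_kernel
```

The averaged figure is only roughly comparable to the BenchmarkTools minimum and median times, since it includes loop overhead and any outliers.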
(In reply to Thomas Koenig from comment #9)
>
> However, what happens if you put in
>
> real, dimension(:) :: Uix
> real, dimension(:), intent(in) :: x
> real, dimension(:), intent(in) :: S
>
> ?
>
> gfortran should not pack then.
You're right! I wasn't able to follow this exactly, because it didn't want me
to defer shape on Uix. Probably because it needs to compile a version of
fpdbacksolve that can be called from the shared library?
Interestingly, with that change, Flang failed to vectorize the code, but
gfortran did. Compilers are finicky.
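To illustrate the difference being discussed, here is a minimal sketch with hypothetical routine names (scale_explicit, scale_assumed), not the fpdbacksolve code itself. An explicit-shape dummy expects contiguous storage, so a strided actual argument may be copied into a temporary before the call; an assumed-shape dummy receives a descriptor that carries the stride.

```fortran
module shape_demo
  implicit none
contains

  ! Explicit-shape dummies: the callee assumes contiguous storage, so a
  ! strided actual argument may be packed into a temporary first.
  subroutine scale_explicit(n, x, y)
    integer, intent(in) :: n
    real, intent(in)    :: x(n)
    real, intent(out)   :: y(n)
    y = 2.0 * x
  end subroutine scale_explicit

  ! Assumed-shape dummies: a descriptor with the stride is passed instead.
  subroutine scale_assumed(x, y)
    real, dimension(:), intent(in)  :: x
    real, dimension(:), intent(out) :: y
    y = 2.0 * x
  end subroutine scale_assumed

end module shape_demo

program demo
  use shape_demo
  implicit none
  real :: a(4, 3), y1(3), y2(3)
  integer :: j

  a = reshape([(real(j), j = 1, 12)], [4, 3])

  ! a(2, :) is a row of a column-major array, so it is strided in memory.
  call scale_explicit(3, a(2, :), y1)
  call scale_assumed(a(2, :), y2)

  ! Both variants compute the same values; only the calling convention differs.
  print *, 'results agree: ', all(y1 == y2)
end program demo
```

Assumed-shape dummies need an explicit interface at the call site (here provided by the module), which matters when the routine is called from outside Fortran.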
Flang, original:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 655.827 ns (0.00% GC)
median time: 665.698 ns (0.00% GC)
mean time: 689.967 ns (0.00% GC)
maximum time: 1.061 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 162
Flang, not specifying shape: # assembly shows it is using xmm
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.086 μs (0.00% GC)
median time: 8.315 μs (0.00% GC)
mean time: 8.591 μs (0.00% GC)
maximum time: 20.299 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 3
gfortran, transposed version (not vectorizable):
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.643 μs (0.00% GC)
median time: 20.901 μs (0.00% GC)
mean time: 21.441 μs (0.00% GC)
maximum time: 54.103 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
gfortran, not specifying shape:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.290 μs (0.00% GC)
median time: 1.316 μs (0.00% GC)
mean time: 1.347 μs (0.00% GC)
maximum time: 4.562 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
Assembly confirms it is using zmm registers (but this time is much too fast not
to be vectorized, anyway).
As for why gfortran is still slower than the Flang version, here is the loop body:
.L16:
vmovups (%r10,%rax), %zmm0
vcmpps $4, %zmm0, %zmm4, %k1
vrsqrt14ps %zmm0, %zmm1{%k1}{z}
vmulps %zmm0, %zmm1, %zmm2
vmulps %zmm1, %zmm2, %zmm0
vmulps %zmm5, %zmm2, %zmm2
vaddps %zmm6, %zmm0, %zmm0
vmulps %zmm2, %zmm0, %zmm0
vrcp14ps %zmm0, %zmm8
vmulps %zmm0, %zmm8, %zmm0
vmulps %zmm0, %zmm8, %zmm0
vaddps %zmm8, %zmm8, %zmm8
vsubps %zmm0, %zmm8, %zmm8
vmulps (%r8,%rax), %zmm8, %zmm9
vmulps (%r9,%rax), %zmm8, %zmm10
vmulps (%r12,%rax), %zmm8, %zmm8
vmovaps %zmm9, %zmm3
vfnmadd213ps 0(%r13,%rax), %zmm9, %zmm3
vcmpps $4, %zmm3, %zmm4, %k1
vrsqrt14ps %zmm3, %zmm2{%k1}{z}
vmulps %zmm3, %zmm2, %zmm3
vmulps %zmm2, %zmm3, %zmm1
vmulps %zmm5, %zmm3, %zmm3
vaddps %zmm6, %zmm1, %zmm1
vmulps %zmm3, %zmm1, %zmm1
vmovaps %zmm9, %zmm3
vfnmadd213ps (%rdx,%rax), %zmm10, %zmm3
vrcp14ps %zmm1, %zmm0
vmulps %zmm1, %zmm0, %zmm1
vmulps %zmm1, %zmm0, %zmm1
vaddps %zmm0, %zmm0, %zmm0
vsubps %zmm1, %zmm0, %zmm11
vmulps %zmm11, %zmm3, %zmm12
vmovaps %zmm10, %zmm3
vfnmadd213ps (%r14,%rax), %zmm10, %zmm3
vfnmadd231ps %zmm12, %zmm12, %zmm3
vcmpps $4, %zmm3, %zmm4, %k1
vrsqrt14ps %zmm3, %zmm1{%k1}{z}
vmulps %zmm3, %zmm1, %zmm3
vmulps %zmm1, %zmm3, %zmm0
vmulps %zmm5, %zmm3, %zmm3
vmovups (%rcx,%rax), %zmm1
vaddps %zmm6, %zmm0, %zmm0
vmulps %zmm3, %zmm0, %zmm0
vrcp14ps %zmm0, %zmm2
vmulps %zmm0, %zmm2, %zmm0
vmulps %zmm0, %zmm2, %zmm0
vaddps %zmm2, %zmm2, %zmm2
vsubps %zmm0, %zmm2, %zmm0
vmulps %zmm0, %zmm11, %zmm3
vmulps %zmm12, %zmm3, %zmm3
vxorps %zmm7, %zmm3, %zmm3
vmulps %zmm1, %zmm3, %zmm2
vmulps %zmm3, %zmm9, %zmm3
vfnmadd231ps %zmm8, %zmm9, %zmm1
vfmadd231ps (%r11,%rax), %zmm0, %zmm2
vfmadd132ps %zmm10, %zmm3, %zmm0
vmulps %zmm11, %zmm1, %zmm1
vfnmadd231ps %zmm0, %zmm8, %zmm2
vmovups %zmm2, (%rdi,%rax)
vmovups %zmm1, (%rbx,%rax)
vmovups %zmm8, (%r15,%rax)
addq $64, %rax
cmpq %rax, %rsi
jne .L16
I see far more arithmetic instructions here. Is that because gcc is adding
Newton-Raphson steps for the reciprocal square roots, and Flang is not?
Comparing against MPFR, both seem about equally accurate.
Extreme errors in X with gfortran:
-2.676151882353759158425593894760401386764929650751873229109107488336451373232463e-06
1.396013166812755065773272342567265482854011149035436404435394035182092107481168e-05
with Flang:
-3.086256120296619934226727657432734517145988850964563595564605964817907892518832e-05
2.28181026645836083851985181914739792956608114291961973078603672383927518755594e-06
on the data set I benchmarked with.
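On the Newton-Raphson question: in the listing above, each vrcp14ps is followed by two multiplies, an add of the estimate to itself, and a subtract, which matches the reciprocal refinement r1 = 2*r - a*r*r. Each vrsqrt14ps is followed by multiplies and an add involving zmm5 and zmm6; that would be a refinement step if those registers hold -0.5 and -3.0, which is an assumption since their values are not shown. A minimal sketch of both steps follows, with a perturbed exact value standing in for the hardware estimate, because standard Fortran has no intrinsic for those instructions.

```fortran
module nr_sketch
  implicit none
contains

  ! One Newton-Raphson step refining an estimate y of 1/sqrt(x).
  elemental function refine_rsqrt(x, y) result(y1)
    real, intent(in) :: x, y
    real :: y1
    y1 = 0.5 * y * (3.0 - x * y * y)
  end function refine_rsqrt

  ! One Newton-Raphson step refining an estimate r of 1/a,
  ! written in the 2*r - a*r*r form seen in the assembly.
  elemental function refine_rcp(a, r) result(r1)
    real, intent(in) :: a, r
    real :: r1
    r1 = (r + r) - a * r * r
  end function refine_rcp

end module nr_sketch

program demo
  use nr_sketch
  implicit none
  real :: x, exact, y0, y1, r0, r1

  x = 2.0
  exact = 1.0 / sqrt(x)

  ! Stand-in estimates with a relative error of about 2**(-14).
  y0 = exact * (1.0 + 2.0**(-14))
  r0 = (1.0 / x) * (1.0 + 2.0**(-14))

  y1 = refine_rsqrt(x, y0)
  r1 = refine_rcp(x, r0)

  print *, 'rsqrt rel. error before/after:', &
           abs(y0 - exact) / exact, abs(y1 - exact) / exact
  print *, 'rcp   rel. error before/after:', &
           abs(r0 - 0.5) / 0.5, abs(r1 - 0.5) / 0.5
end program demo
```

Each step roughly squares the relative error, so a 14-bit estimate reaches about single precision after one step, at the cost of the extra multiplies and adds visible in the loop body.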
Anyway, thanks for the prompt responses.
And my issue was that gfortran didn't vectorize, but your second change fixed
the problem.
It would be nice, of course, if code written one way were optimized well
across all compilers and versions. But compilers are finicky. Simply reordering
operations and adding/removing temporary declarations in fpdbacksolve would
sometimes cause Flang to fail to vectorize!
Maybe I'll use #ifdefs around the declarations and save the files as .F90...
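A minimal sketch of that idea, assuming a hypothetical routine named scale and fixed length-3 arrays: __GFORTRAN__ is predefined by gfortran when it preprocesses a file, and the #else branch is taken by every other compiler. Which declaration style each compiler actually prefers would need to be checked per version.

```fortran
! Save with a capital extension (e.g. kernel.F90) so the preprocessor runs.
module kernel_mod
  implicit none
contains

  subroutine scale(x, y)
#ifdef __GFORTRAN__
    ! Assumed-shape declarations, which avoided the packing in gfortran.
    real, dimension(:), intent(in)  :: x
    real, dimension(:), intent(out) :: y
#else
    ! Explicit-shape declarations for other compilers.
    real, dimension(3), intent(in)  :: x
    real, dimension(3), intent(out) :: y
#endif
    y = 2.0 * x
  end subroutine scale

end module kernel_mod

program demo
  use kernel_mod
  implicit none
  real :: x(3), y(3)

  x = [1.0, 2.0, 3.0]
  call scale(x, y)
  print *, y
end program demo
```

Keeping only the declarations inside the conditional leaves a single copy of the arithmetic, so the two variants cannot drift apart.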