[Bug tree-optimization/88713] Vectorized code slow vs. flang
elrodc at gmail dot com
gcc-bugzilla@gcc.gnu.org
Tue Jan 8 02:06:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #18 from Chris Elrod <elrodc at gmail dot com> ---
I can confirm that inlining the packing does allow gfortran to vectorize the
loop, so letting the packing code inline seems (to me) like an optimization
well worth making.
However, performance seems to be about the same as before, still close to 2x
slower than Flang.
There is definitely something interesting going on in Flang's SLP
vectorization, though.
I defined the function:
#ifndef VECTORWIDTH
#define VECTORWIDTH 16
#endif
subroutine vpdbacksolve(Uix, x, S)
real, dimension(VECTORWIDTH,3) :: Uix
real, dimension(VECTORWIDTH,3), intent(in) :: x
real, dimension(VECTORWIDTH,6), intent(in) :: S
real, dimension(VECTORWIDTH) :: U11, U12, U22, U13, U23, U33, &
                                Ui11, Ui12, Ui22, Ui33
U33 = sqrt(S(:,6))
Ui33 = 1 / U33
U13 = S(:,4) * Ui33
U23 = S(:,5) * Ui33
U22 = sqrt(S(:,3) - U23**2)
Ui22 = 1 / U22
U12 = (S(:,2) - U13*U23) * Ui22
U11 = sqrt(S(:,1) - U12**2 - U13**2)
Ui11 = 1 / U11 ! u11
Ui12 = - U12 * Ui11 * Ui22 ! u12
Uix(:,3) = Ui33*x(:,3)
Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) - (U13 * Ui11 + U23 * Ui12) * Uix(:,3)
Uix(:,2) = Ui22*x(:,2) - U23 * Ui22 * Uix(:,3)
end subroutine vpdbacksolve
in a .F90 file, so that VECTORWIDTH can be set appropriately while compiling.
I wanted to modify the Fortran file to benchmark these directly, but I'm pretty
sure Flang cheated in those benchmarks, so instead I compiled each version into
a shared library and benchmarked from Julia:
julia> @benchmark flangvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.104 ns (0.00% GC)
  median time:      15.563 ns (0.00% GC)
  mean time:        16.017 ns (0.00% GC)
  maximum time:     49.524 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark gfortvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.394 ns (0.00% GC)
  median time:      24.562 ns (0.00% GC)
  mean time:        25.600 ns (0.00% GC)
  maximum time:     58.652 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996
That is over 60% faster for Flang, which would account for much, but not all,
of the runtime difference in the actual for loops.
For comparison, the vectorized loop in processbpp covers 16 samples per
iteration. The benchmarks above were with N = 1024, so 1024/16 = 64 iterations.
For the three gfortran benchmarks (each timing 100,000 runs of the loop), each
loop iteration therefore averaged about
  1000 * (1.34003162 + 1.37529969 + 1.36087596) / (3*64) ≈ 21.23 ns
and for Flang:
  1000 * (0.6596010 + 0.6455200 + 0.6132510) / (3*64) ≈ 9.99 ns
so we have about 21 vs 10 ns for the loop body in gfortran vs Flang,
respectively.
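To sanity-check that arithmetic, here is a short Python re-derivation (the
timings are copied from above; 64 = 1024/16 iterations per loop execution,
three timed runs each):

```python
# Re-derive the per-iteration times quoted above.
gfortran_times = [1.34003162, 1.37529969, 1.36087596]
flang_times = [0.6596010, 0.6455200, 0.6132510]
iters = 1024 // 16  # 64 loop iterations per benchmarked run

def per_iteration(times):
    # 1000 * mean(times) / iterations, exactly as computed in the text
    return 1000 * sum(times) / (len(times) * iters)

print(per_iteration(gfortran_times))  # ≈ 21.23
print(per_iteration(flang_times))     # ≈ 9.99
```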
Comparing the asm between:
1. Flang processbpp loop body
2. Flang vpdbacksolve
3. gfortran processbpp loop body
4. gfortran vpdbacksolve
Here are a few things I notice.
1. gfortran always uses masked reciprocal square root operations, so that it
only takes the reciprocal square root of nonzero inputs (the $4 predicate in
the vcmpps below is "not equal", comparing against a zeroed register):
vxorps %xmm5, %xmm5, %xmm5
...
vmovups (%rsi,%rax), %zmm0
vmovups 0(%r13,%rax), %zmm9
vcmpps $4, %zmm0, %zmm5, %k1
vrsqrt14ps %zmm0, %zmm1{%k1}{z}
This might be avx512f specific?
Either way, Flang does not use masks:
vmovups (%rcx,%r14), %zmm4
vrsqrt14ps %zmm4, %zmm5
I'm having a hard time finding any information on what the performance impact
of this may be.
Agner Fog's instruction tables, for example, don't mention mask arguments for
vrsqrt14ps.
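To illustrate what the zero-masked form computes, here is a scalar Python
sketch of the {k1}{z} semantics (the real vrsqrt14ps returns an approximation
of 1/sqrt accurate to about 2^-14, not the exact value computed here):

```python
import math

def masked_rsqrt(x, mask):
    # Zero-masked reciprocal square root: lanes where the mask is
    # False are written as 0.0 instead of computing 1/sqrt.
    return [1.0 / math.sqrt(v) if m else 0.0 for v, m in zip(x, mask)]

x = [4.0, 0.0, 16.0, 1.0]
mask = [v != 0.0 for v in x]   # the vcmpps $4 (not-equal) comparison
print(masked_rsqrt(x, mask))   # [0.5, 0.0, 0.25, 1.0]
```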
2. Within the loop body, Flang has 0 unnecessary vmov(u/a)ps: there are 8 in
total, plus 3 vmuls and 1 vfmsub231ps accessing memory, matching the 12
expected memory accesses per loop iteration (fpdbacksolve's arguments are a
vector of length 3 and another of length 6, and it returns a vector of length
3).
gfortran's loop body has 3 unnecessary vmovaps, copying register contents.
gfortran's vpdbacksolve subroutine has 4 unnecessary vmovaps, copying register
contents.
Flang's vpdbacksolve subroutine has 13 unnecessary vmovaps and a couple of
unnecessary memory accesses. Ouch!
It also shuttles values to and from memory (.BSS4 is static storage, not even
the stack):
vmovaps %zmm2, .BSS4+192(%rip)
...
vmovaps %zmm5, .BSS4+320(%rip)
...
vmovaps .BSS4+192(%rip), %zmm5
... #zmm5 is overwritten in here, I just mean to show the sort of stuff that
goes on
vmulps .BSS4+320(%rip), %zmm5, %zmm0
Some of those moves also don't get used again, and some other things are just
plain weird:
vxorps %xmm3, %xmm3, %xmm3
vfnmsub231ps %zmm2, %zmm0, %zmm3 # zmm3 = -(zmm0 * zmm2) - zmm3
vmovaps %zmm3, .BSS4+576(%rip)
Like, why zero out zmm3 at all? (vxorps on %xmm3 zeroes the whole %zmm3, so
this just computes -(zmm0 * zmm2) via a zeroed accumulator.)
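For reference, vfnmsub231ps computes dst = -(a * b) - dst per lane, so feeding
it a freshly zeroed accumulator just produces the negated product. A scalar
Python sketch:

```python
def vfnmsub231(dst, a, b):
    # vfnmsub231ps semantics: dst = -(a * b) - dst, elementwise
    return [-(x * y) - d for x, y, d in zip(a, b, dst)]

zmm3 = [0.0, 0.0, 0.0, 0.0]    # the vxorps zeroed the register
zmm0 = [1.0, 2.0, 3.0, 4.0]
zmm2 = [0.5, 0.5, 0.5, 0.5]
print(vfnmsub231(zmm3, zmm0, zmm2))  # [-0.5, -1.0, -1.5, -2.0]
```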
I verified that the answers are still correct.
I don't know that much about how compilers and loop vectorizers work, but I'm
guessing that in the loop, Flang managed to prove facts that helped out the
register allocator, and that without them it struggled.
gfortran's vpdbacksolve also did some stuff I don't understand:
vmulps %zmm1, %zmm2, %zmm2
vxorps .LC3(%rip), %zmm2, %zmm2
vmulps %zmm6, %zmm2, %zmm4
This happens in gfortran's loop too, except that the load from .LC3(%rip) is
hoisted out of the loop. gfortran definitely handled register allocation much
better than Flang in the standalone function, although not as well in the
loop.
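That vxorps against a constant is presumably a sign flip: XORing the IEEE-754
sign bit negates each lane, which would make sense if .LC3 holds a per-lane
0x80000000 sign mask (an assumption; the constant's contents aren't shown). A
scalar Python sketch of the idea:

```python
import struct

def xor_sign(x):
    # Flip the IEEE-754 single-precision sign bit, as vxorps with a
    # per-lane 0x80000000 constant would do.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits ^ 0x80000000))[0]

print(xor_sign(1.5))   # -1.5
print(xor_sign(-2.0))  # 2.0
```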
Given that Flang's vpdbacksolve did the worst here, yet was still >60% faster
than gfortran's vpdbacksolve, I don't think we can attribute gfortran's slower
loop to register allocation.
3. Arithmetic instructions:

                      flang       flang          gfortran    gfortran
                      loop body   vpdbacksolve   loop body   vpdbacksolve
   vaddps                 0           0              6           6
   vsubps                 1           1              3           4
   vmulps                20          18             27          29
   total unfused         21          19             36          39
   vfmadd                 5           6              2           3
   vfnmadd                2           4              6           2
   vfmsub                 3           0              0           2
   vfnmsub                0           1              0           0
   total fused           10          11              8           7
   total arithmetic      31          30             44          46
So gfortran's version had more arithmetic instructions overall (though fewer
fused operations), but definitely not by a factor approaching the degree to
which it was slower.