[Bug tree-optimization/88713] Vectorized code slow vs. flang

elrodc at gmail dot com gcc-bugzilla@gcc.gnu.org
Tue Jan 8 02:06:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #18 from Chris Elrod <elrodc at gmail dot com> ---
I can confirm that the inlined packing does allow gfortran to vectorize the
loop. So allowing packing to inline does seem (to me) like an optimization well
worth making.




However, performance seems to be about the same as before, still close to 2x
slower than Flang.


There is definitely something interesting going on in Flang's SLP
vectorization, though.

I defined the function:

#ifndef VECTORWIDTH
#define VECTORWIDTH 16
#endif

    subroutine vpdbacksolve(Uix, x, S)

        real, dimension(VECTORWIDTH,3)              ::  Uix
        real, dimension(VECTORWIDTH,3), intent(in)  ::  x
        real, dimension(VECTORWIDTH,6), intent(in)  ::  S

        real, dimension(VECTORWIDTH)    ::  U11,  U12,  U22,  U13,  U23,  U33, &
                                            Ui11, Ui12, Ui22, Ui33

        U33 = sqrt(S(:,6))

        Ui33 = 1 / U33
        U13 = S(:,4) * Ui33
        U23 = S(:,5) * Ui33
        U22 = sqrt(S(:,3) - U23**2)
        Ui22 = 1 / U22
        U12 = (S(:,2) - U13*U23) * Ui22
        U11 = sqrt(S(:,1) - U12**2 - U13**2)

        Ui11 = 1 / U11 ! u11
        Ui12 = - U12 * Ui11 * Ui22 ! u12
        Uix(:,3) = Ui33*x(:,3)
        Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) - (U13 * Ui11 + U23 * Ui12) * Uix(:,3)
        Uix(:,2) = Ui22*x(:,2) - U23 * Ui22 * Uix(:,3)

    end subroutine vpdbacksolve


in a .F90 file, so that VECTORWIDTH can be set appropriately while compiling.
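For reference, per SIMD lane this is (as far as I can tell) computing
Uix = U \ x, where U is the upper triangular factor of S = U * U' and S is
packed as (S11, S12, S22, S13, S23, S33). A scalar Python restatement of one
lane (my own sketch, not the benchmarked code):

```python
import math

def vpdbacksolve_lane(x, S):
    """One SIMD lane of vpdbacksolve, same operation order as the
    Fortran source. x = (x1, x2, x3); S packed as
    (S11, S12, S22, S13, S23, S33)."""
    S11, S12, S22, S13, S23, S33 = S
    U33 = math.sqrt(S33)
    Ui33 = 1 / U33
    U13 = S13 * Ui33
    U23 = S23 * Ui33
    U22 = math.sqrt(S22 - U23**2)
    Ui22 = 1 / U22
    U12 = (S12 - U13 * U23) * Ui22
    U11 = math.sqrt(S11 - U12**2 - U13**2)
    Ui11 = 1 / U11               # u11 of the inverse
    Ui12 = -U12 * Ui11 * Ui22    # u12 of the inverse
    Uix3 = Ui33 * x[2]
    Uix1 = Ui11 * x[0] + Ui12 * x[1] - (U13 * Ui11 + U23 * Ui12) * Uix3
    Uix2 = Ui22 * x[1] - U23 * Ui22 * Uix3
    return (Uix1, Uix2, Uix3)
```

E.g. with U = [[2,1,1],[0,3,1],[0,0,4]], S = U*U' packs to
(6, 4, 10, 4, 4, 16), and x = U*(1,1,1)' = (4, 4, 4) recovers (1, 1, 1) up
to rounding.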

I wanted to modify the Fortran file to benchmark these directly, but I'm
fairly sure Flang cheated in those benchmarks. So instead I compiled each into
a shared library and benchmarked from Julia:

julia> @benchmark flangvtest($Uix, $x, $S)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.104 ns (0.00% GC)
  median time:      15.563 ns (0.00% GC)
  mean time:        16.017 ns (0.00% GC)
  maximum time:     49.524 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark gfortvtest($Uix, $x, $S)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.394 ns (0.00% GC)
  median time:      24.562 ns (0.00% GC)
  mean time:        25.600 ns (0.00% GC)
  maximum time:     58.652 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

That is over 60% faster for Flang, which would account for much, but not all,
of the runtime difference in the actual loops.

For comparison, the vectorized loop in processbpp covers 16 samples per
iteration. The benchmarks above were with N = 1024, so 1024/16 = 64 iterations.

For the three gfortran benchmarks (each averaging 100,000 runs of the loop),
each loop iteration averaged about
1000 * (1.34003162 + 1.37529969 + 1.36087596) / (3*64) = 21.230246197916664

For Flang, that was:
1000 * (0.6596010 + 0.6455200 + 0.6132510) / (3*64) = 9.991520833333334

so we have about 21 vs 10 ns for the loop body in gfortran vs Flang,
respectively.
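Spelled out (the times are the three loop-benchmark runs quoted earlier; the
factor of 64 is the 1024/16 iterations per run):

```python
gfortran_times = (1.34003162, 1.37529969, 1.36087596)
flang_times = (0.6596010, 0.6455200, 0.6132510)
iters = 64  # N = 1024 samples / 16 samples per vectorized iteration

# nanoseconds per loop iteration, averaged over the three runs
gf_ns = 1000 * sum(gfortran_times) / (3 * iters)
fl_ns = 1000 * sum(flang_times) / (3 * iters)
print(round(gf_ns, 2), round(fl_ns, 2), round(gf_ns / fl_ns, 2))
# -> 21.23 9.99 2.12
```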


Comparing the asm between:
1. Flang processbpp loop body
2. Flang vpdbacksolve
3. gfortran processbpp loop body
4. gfortran vpdbacksolve

Here are a few things I notice.
1. gfortran always uses masked reciprocal square root operations, to make sure
it only takes the reciprocal square root of nonzero numbers:
        vxorps  %xmm5, %xmm5, %xmm5
...
        vmovups (%rsi,%rax), %zmm0
        vmovups 0(%r13,%rax), %zmm9
        vcmpps  $4, %zmm0, %zmm5, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

This might be avx512f specific? 
Either way, Flang does not use masks:

        vmovups (%rcx,%r14), %zmm4
        vrsqrt14ps      %zmm4, %zmm5

I'm having a hard time finding any information on what the performance impact
of this may be.
Agner Fog's instruction tables, for example, don't mention mask arguments for
vrsqrt14ps.
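As a scalar model of what the mask buys (my reading of the asm: vcmpps
predicate 4 is "not equal", so the mask zeroes lanes whose input is 0.0;
presumably gfortran expands sqrt(x) as x * rsqrt(x) plus Newton refinement
under fast-math, and masking keeps x == 0 from producing 0 * Inf = NaN):

```python
import math

def rsqrt_masked(lanes):
    """Scalar model of vrsqrt14ps under a zeroing mask built from
    'input != 0': zero lanes yield 0.0 instead of +Inf."""
    return [1.0 / math.sqrt(v) if v != 0.0 else 0.0 for v in lanes]

def sqrt_via_rsqrt(lanes):
    # sqrt(x) = x * rsqrt(x); without the mask, x == 0 gives 0 * Inf = NaN
    return [v * r for v, r in zip(lanes, rsqrt_masked(lanes))]
```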

2. Within the loop body, Flang has 0 unnecessary vmov(u/a)ps. There are 8 total
plus 3 "vmuls" and 1 vfmsub231ps accessing memory, for the 12 expected per loop
iteration (fpdbacksolve's arguments are a vector of length 3 and another of
length 6; it returns a vector of length 3).

gfortran's loop body has 3 unnecessary vmovaps, copying register contents.

gfortran's vpdbacksolve subroutine has 4 unnecessary vmovaps, copying register
contents.

Flang's vpdbacksolve subroutine has 13 unnecessary vmovaps, and a couple of
unnecessary memory accesses. Ouch!
It also shuttles values to and from memory; not the stack, but a static .BSS
area:

vmovaps %zmm2, .BSS4+192(%rip)
...
vmovaps %zmm5, .BSS4+320(%rip)
...
vmovaps .BSS4+192(%rip), %zmm5
... # zmm5 is overwritten in here; I just mean to show the sort of thing that goes on
vmulps  .BSS4+320(%rip), %zmm5, %zmm0

Some of those moves also don't get used again, and some other things are just
plain weird:
vxorps  %xmm3, %xmm3, %xmm3
vfnmsub231ps    %zmm2, %zmm0, %zmm3 # zmm3 = -(zmm0 * zmm2) - zmm3
vmovaps %zmm3, .BSS4+576(%rip)

Like, why zero out zmm3 first? (Note that the write to xmm3 zero-extends, so
it actually clears the full zmm3.)
I verified that the answers are still correct.
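If I'm reading the encoding right, the vxorps clears all of zmm3 (a write to
xmm3 zero-extends through the full register), so the FMA's addend is zero and
the sequence is just a negated product. Modeled in scalar Python:

```python
def vfnmsub231(a, b, c):
    """vfnmsub231ps semantics per lane: -(a * b) - c."""
    return -(a * b) - c

# with the accumulator xor-zeroed first, this is simply -(a * b)
print(vfnmsub231(3.0, 2.0, 0.0))  # -6.0
```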

I don't know that much about how compilers and loop vectorizers work, but I'm
guessing that in the loop, Flang managed to prove things that helped out the
register allocator, and that without that information it struggled.

gfortran's vpdbacksolve also did some stuff I don't understand:

vmulps  %zmm1, %zmm2, %zmm2
vxorps  .LC3(%rip), %zmm2, %zmm2
vmulps  %zmm6, %zmm2, %zmm4

This happens in gfortran's loop too, except the move from .LC3(%rip) was
hoisted out of the loop.
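My guess is .LC3 holds the sign bit (0x80000000) broadcast across all lanes,
so the vxorps is a sign flip of the product; i.e., gfortran negates with a
constant load where Flang used the xor-zero + vfnmsub trick. A scalar sketch,
assuming that's what .LC3 contains:

```python
import struct

SIGN_MASK = 0x80000000  # assumed contents of one .LC3 lane

def xorps_sign(x):
    """Flip a float's sign by XORing its bit pattern with the sign bit,
    as vxorps with a sign-mask constant does per lane."""
    bits, = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits ^ SIGN_MASK))[0]
```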

gfortran definitely handled register allocation much better in the standalone
function, although not as well in the loop.

Given that Flang's vpdbacksolve did the worst here, but was still >60% faster
than gfortran's vpdbacksolve, I don't think we can attribute gfortran's worse
performance to register allocation.


3. Arithmetic instructions:
vaddps:
flang-loop body: 0
flang-vpdbacksolve: 0
gfortran-loop-body: 6
gfortran-vpdbacksolve: 6

vsubps
flang-loop body: 1
flang-vpdbacksolve: 1
gfortran-loop-body: 3
gfortran-vpdbacksolve: 4

vmulps
flang-loop body: 20
flang-vpdbacksolve: 18
gfortran-loop-body: 27
gfortran-vpdbacksolve: 29

Total unfused operations:
flang-loop body: 21
flang-vpdbacksolve: 19
gfortran-loop-body: 30
gfortran-vpdbacksolve: 33



vfmadd
flang-loop body: 5
flang-vpdbacksolve: 6
gfortran-loop-body: 2
gfortran-vpdbacksolve: 3

vfnmadd
flang-loop body: 2
flang-vpdbacksolve: 4
gfortran-loop-body: 6
gfortran-vpdbacksolve: 2

vfmsub
flang-loop body: 3
flang-vpdbacksolve: 0
gfortran-loop-body: 0
gfortran-vpdbacksolve: 2

vfnmsub
flang-loop body: 0
flang-vpdbacksolve: 1
gfortran-loop-body: 0
gfortran-vpdbacksolve: 0

Total fused operations:
flang-loop body: 10
flang-vpdbacksolve: 11
gfortran-loop-body: 8
gfortran-vpdbacksolve: 7


Total arithmetic operations:
flang-loop body: 31
flang-vpdbacksolve: 30
gfortran-loop-body: 38
gfortran-vpdbacksolve: 40


So gfortran's versions had more arithmetic instructions overall (and fewer
fused operations), but definitely not by a factor approaching the degree to
which they were slower.
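Putting numbers on that, from the totals and timing estimates above:

```python
# instruction-count ratios vs. the measured slowdown (numbers from above)
loop_ops_ratio = 38 / 31      # gfortran vs Flang, loop-body arithmetic ops
func_ops_ratio = 40 / 30      # vpdbacksolve arithmetic ops
time_ratio = 21.23 / 9.99     # per-iteration loop time
print(round(loop_ops_ratio, 2), round(func_ops_ratio, 2), round(time_ratio, 2))
# -> 1.23 1.33 2.13
```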

