I wasn't sure where to put this.
I posted in the Fortran gcc mailing list initially, but was redirected to bugzilla.
I specified RTL-optimization as the component, because the manually unrolled version avoids register spills yet has 13 (unnecessary?) vmovapd instructions between registers, and the loop version is a behemoth of moving data in, out, and between registers.
The failure of the loop might also fall under tree optimization?
Notably, completely unrolling the loop results in less than a third as much assembly as the looped version. Unfortunately, -funroll-loops did not fully unroll it, making the manual unrolling necessary; the assembly is identical whether or not -funroll-loops is used.
Adding the directive:
!GCC$ unroll 31
does lead to complete unrolling, but also to the use of xmm registers instead of zmm, and thus massive amounts of spilling (and probably extremely slow code; I did not benchmark it).
Here is the code (a 16x32 * 32x14 matrix multiplication kernel for avx-512 [the 32 is arbitrary]), sans directive:
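The attachment itself is not inlined here; going by the description (D = A * X, with D 16x14, A 16x32, X 32x14), the kernel has roughly the following shape. The module/subroutine names and exact declarations below are my illustration, not the attached file:

```fortran
module kernels
  implicit none
contains
  ! Sketch of a 16x32 * 32x14 matmul kernel; the real attachment may differ.
  subroutine mul16x32x14(D, A, X)
    real(8), intent(out) :: D(16,14)
    real(8), intent(in)  :: A(16,32), X(32,14)
    integer :: n, j
    D = 0.0d0
    do n = 1, 32        ! the loop the compiler fails to fully unroll
      do j = 1, 14
        ! Each assignment is a simple 16-iteration inner loop: one broadcast
        ! of X(n,j) plus fmas on full-width registers once vectorized.
        D(:,j) = D(:,j) + A(:,n) * X(n,j)
      end do
    end do
  end subroutine mul16x32x14
end module kernels
```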
I compiled with:
gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s
resulting in this assembly (without the directive):
The manually unrolled version has 13 vmovapd instructions that look unnecessary (a vfmadd should have been able to place the result in the correct register directly). 8 of them move from one register to another, and 5 look something like:
vmovapd %zmm20, 136(%rsp)
I suspect there should ideally be 0 of these?
If not, I'd be interested in learning more about why.
This at least seems like an RTL optimization bug/question.
The rest of the generated code looks great to me: just repeated blocks of broadcast loads and vfmadd231pd instructions.
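Along the lines of the following (the register numbers and offsets here are illustrative, not copied from the actual output):

```asm
# one broadcast of an element of X, then fmas into the accumulators for D
vbroadcastsd  1088(%rdx), %zmm30
vfmadd231pd   %zmm28, %zmm30, %zmm0   # zmm0 += zmm28 * zmm30
vfmadd231pd   %zmm29, %zmm30, %zmm1   # zmm1 += zmm29 * zmm30
```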
In the looped code, however, the `vfmadd231pd` instructions are a rare sight between all the register management. The loop code begins at line 1475 in the assembly file.
While the manually unrolled code benchmarked at 135ns, the looped version took 1.4 microseconds on my computer.
Trying to understand more about what it's doing:
- While the manually unrolled code has the expected 868 = (16/8)*(32-1)*14 vfmadds, the looped version has two blocks of 224 = (16/8)*X*14, where X = 8, indicating it is only partially unrolling the loop.
One of these blocks uses xmm registers instead of zmm, so it looks like the compiler mistakenly thinks smaller vectors may be needed to clean something up.
(Maybe it is trying to vectorize across loop iterations rather than within them, in some weird way? I don't know why it would be using all those vpermt2pd instructions otherwise.)
Please supply testcase(s) as Bugzilla attachments, not external links.
At -O3/-Ofast the main issue is early unrolling ('cunrolli') splatting all simple 16-iteration inner loops. After that imho all hope is lost, and yeah, looks like we try to vectorize across the other dimension.
With -O3 -fdisable-tree-cunrolli, or with -O2 -ftree-vectorize we do get the correct vectorization pattern, but a couple of problems remain: after vect, tree optimizations cannot hoist/sink memory references out of the outer loop, leaving 2 loads, 1 load-broadcast and 1 store per each fma. Later, RTL PRE cleans up redundant vector loads, but load-broadcasts and stores remain.
Created attachment 44418 [details]
Code to reproduce slow vectorization pattern and unnecessary loads & stores
(Sorry if this goes to the bottom instead of top, trying to attach a file in place of a link, but I can't edit the old comment.)
Attached is sample code to reproduce the problem in gcc 8.1.1
As observed by amonakov, compiling with -O3/-Ofast reproduces the full problem, e.g.:
gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops -S kernels.f90 -o kernels.s
Compiling with -O3 -fdisable-tree-cunrolli or -O2 -ftree-vectorize fixes the incorrect vectorization pattern, but leaves a lot of unnecessary broadcast loads and stores.
If you see spilling on the manually unrolled loop register pressure is somehow an issue.
Created attachment 44423 [details]
8x16 * 16x6 kernel for avx2.
Here is a scaled-down version that reproduces most of the problem for avx2-capable architectures.
I just used march=haswell, but I think most recent architectures fall under this.
For some, like Zen 1, you may need to add -mprefer-vector-width=256.
To get the inefficiently vectorized loop:
gfortran -march=haswell -Ofast -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2bad.s
To get only the unnecessary loads/stores, use:
gfortran -march=haswell -O2 -ftree-vectorize -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2.s
This file compiles instantly, while the avx512 one can take a couple of seconds with `-O3`.
However, while the manually unrolled version still has `vmovapd` moves between registers, it no longer spills to the stack the way the avx512 kernel does.
Created attachment 44424 [details]
Smaller avx512 kernel that still spills into the stack
This generated 18 `vmovapd` instructions in total (I think there'd ideally be 0) when compiled with:
gfortran -march=skylake-avx512 -mprefer-vector-width=512 -O2 -ftree-vectorize -shared -fPIC -S kernels16x32x13.f90 -o kernels16x32x13.s
4 of these moved data onto the stack, and one moved it from the stack back into a register.
(The other stack accesses were folded into vfmadd instructions as memory operands, e.g.:
`vfmadd213pd 72(%rsp), %zmm11, %zmm15`)
Similar to the larger kernel, using `-O3` instead of `-O2 -ftree-vectorize` eliminated two of the `vmovapd` instructions between registers, but none of the spills.
(In reply to Richard Biener from comment #3)
> If you see spilling on the manually unrolled loop register pressure is
> somehow an issue.
In the matmul kernel:
D = A * X
where D is 16x14, A is 16xN, and X is Nx14 (N arbitrarily set to 32)
The code holds all of D in registers.
16x14 doubles at 8 doubles per register means 28 of the 32 registers.
Then, it loads 1 column of A at a time (2 more registers), and broadcasts elements from the corresponding row in each column of X, updating the corresponding column of D with fma instructions.
By broadcasting 2 at a time, it should be using exactly 32 registers.
For the most part, that is precisely what the manually unrolled code is doing for each column of A.
However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the 32 columns of A, it suddenly spills (all the stack accesses happen for the same column, and none of the others), even though the process is identical for each column.
Switching to a smaller 16x13 output, freeing up 2 registers to allow 4 broadcast loads at a time, still resulted in 4 spills (down from 5) for only column #23 or #25.
I couldn't reproduce the spills in the avx2 kernel.
The smaller kernel has an 8x6 output, taking up 12 of the 16 registers, again leaving 4: 2 for a column of A, and 2 for broadcasts from X at a time. So it's the same pattern.
The smaller kernel does reproduce the problems with the loops: -O3 without `-fdisable-tree-cunrolli` leads to the slow vectorization scheme, while adding that flag, or using `-O2 -ftree-vectorize`, produces the repetitive loads and stores within the loop.
(In reply to Chris Elrod from comment #6)
> However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the
> 32 columns of A
Correction: it was the 16x13 version that used stack data after loading column 25 of A instead of column 23.