This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/68928] New: AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 16 Dec 2015 00:19:16 +0000
- Subject: [Bug target/68928] New: AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68928
Bug ID: 68928
Summary: AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86-64-*-*
I have some suggestions for better code that gcc could use for the
prologue/epilogue when vectorizing loops over unaligned buffers. I haven't
looked at gcc's code, just its output, so I don't know how one might get gcc to
implement these.
---------
Consider the following code:
#include <immintrin.h>

typedef float float_align32 __attribute__ ((aligned (32)));

void floatmul_aligned(float_align32 *a) {
    for (int i=0; i<1024 ; i++)
        a[i] *= 2;
}

void floatmul(float *a) {
    for (int i=0; i<1024 ; i++)
        a[i] *= 2;
}
g++ 5.3.0 -O3 -march=sandybridge emits what you'd expect for the aligned
version:
floatmul_aligned(float*):
        leaq    4096(%rdi), %rax
.L2:
        vmovaps (%rdi), %ymm0
        addq    $32, %rdi
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret
*** off-topic ***
It unfortunately uses 5 fused-domain uops in the loop, meaning it can only issue
one iteration per 2 clocks. Other than unrolling, it would probably be more
efficient to get 2.0f broadcast into %ymm1 and use vmulps (%rdi), %ymm1, %ymm0,
avoiding the separate load.
Doing the loop in reverse order, with an indexed addressing mode counting an
index down to zero, would also keep the loop overhead down to one
decrement-and-branch uop. I know compilers are allowed to re-order memory
accesses, so I assume this would be allowed. However, this wouldn't actually
help on Sandybridge, since it seems that two-register addressing modes might not
micro-fuse on SnB-family CPUs
(http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes).
Agner Fog says he tested and found that 2-reg addressing modes did micro-fuse.
He's probably right, but I don't know what's wrong with my experiment using perf
counters. If they don't fuse, that would make the store 2 uops.
*** back on topic ***
Anyway, that wasn't even what I meant to report. The unaligned case peels off
the potentially-unaligned start/end iterations, and unrolls them into a giant
amount of code. This is unlikely to be optimal outside of microbenchmarks,
since CPUs with a uop-cache suffer from excessive unrolling.
floatmul(float*):
        movq    %rdi, %rax
        andl    $31, %eax
        shrq    $2, %rax
        negq    %rax
        andl    $7, %eax
        je      .L12
        vmovss  (%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, (%rdi)
        cmpl    $1, %eax
        je      .L13
        vmovss  4(%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, 4(%rdi)
        cmpl    $2, %eax
        je      .L14
        vmovss  8(%rdi), %xmm0
        ...
        repeated up to cmpl $6, %eax
        ...
        some loop setup
.L9:
        vmovaps (%rcx,%rax), %ymm0
        addl    $1, %edx
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, (%rcx,%rax)
        addq    $32, %rax
        cmpl    %esi, %edx
        jb      .L9
        ...
        another fully-unrolled up-to-7 iteration cleanup loop
Notice that the vectorized part of the loop now has 6 uops. (Or 7, if the
store can't micro-fuse.) So gcc is even farther from getting this loop to run
at one cycle per iteration. (Which should be possible on Haswell. On SnB/IvB
(and AMD Bulldozer-family), a 256b store takes two cycles anyway.)
Is there any experimental evidence that fully unrolling to make this much code
is beneficial?
The most obvious way to improve on this would be to use a 128b xmm vector for
the first 4 iterations of the prologue/epilogue loops.
Even simply not unrolling the 7-iteration alignment loops might be a win:
every unrolled iteration still has a compare-and-branch, so a rolled-up loop
counting down to zero would have the same per-iteration overhead. All that
changes is branch prediction (one taken branch plus many not-taken ones, vs. a
single loop branch taken n times).
AVX introduces a completely different way to handle this, though: VMASKMOVPS is
usable now, since it doesn't have the non-temporal hint that makes the SSE
version of it nearly useless. According to Agner Fog's insn tables, the
VMASKMOVPS ymm store form is only 4 uops, and has a throughput of one per 2
cycles (SnB/IvB/Haswell). It's quite slow (as a store) on AMD Bulldozer-family
CPUs, though, so this might only be appropriate with -mtune= something other
than AMD.
The trouble is turning a misalignment count into a mask. Most of the useful
instructions (like PSRLDQ, to use on a vector of all-ones) are only available
with immediate counts. Keeping an array of 7 256b masks seems like a big
waste, and having a switch() to run the byte-shift instruction with one of 7
different immediate operands also sucks. (And it doesn't work anyway, because
VPSRLDQ works in-lane, not across both 256b lanes.) PSLLQ can take a shift
count in the low qword of an xmm register, but I'm not sure it helps.
My best idea for generating a mask for VMASKMOVPS requires AVX2: broadcast the
misalignment count to all bytes of a ymm register (VPBROADCASTB). Use VPCMPGTB
with another 256b constant (LSB first): { 7,7,7,7, 6,6,6,6, 5,5,5,5, 4,4,4,4,
3,3,3,3, 2,2,2,2, 1,1,1,1, 0,0,0,0 }. The least significant 4B will always be
0, since the count is in the 0-7 range, and 7>7 is false. (Or put the count
into the 1-8 range so we do 32B of useful work in the aligned case, instead of
zero, without branching.)
The constant could shrink to 64b {7,6,5,4,3,2,1,0}, with an extra instruction:

        vmovd        %eax, %xmm0          # %eax = misalignment count
        vpbroadcastb %xmm0, %xmm0         # broadcast the low byte to 128b
        vpcmpgtb     .constant, %xmm0, %xmm1  # make sure the 64b .constant is
                                              # at least 128b from the end of a cache line
        # Our mask is in the low 8 bytes of %xmm1.  The upper 64b is garbage.
        vpmovsxbd    %xmm1, %ymm1         # all-ones sign-extends to all-ones

Or if reading "garbage" is too hacky, expand to 256b sooner:

        vmovd        %eax, %xmm0          # %eax = misalignment count
        vpbroadcastd %xmm0, %ymm0         # broadcast the low 32b
        vpmovzxbd    .constant, %ymm1     # 2 uops, unlike the other forms
        vpcmpgtd     %ymm1, %ymm0, %ymm2
Haswell, first version:
    vmovd: 1 uop (p5), 1/1.  vpbroadcastb x,x: 1 uop (p5), 1/1.
    vpcmpgtb x,mem: 1 uop (p15), 1/0.5.  vpmovsxbd y,x: 1 uop (p5), 3/1.
    Total: 4 uops, 6c latency, one per 2c throughput (bottleneck on p5).
Second version:
    vmovd: 1 uop, 1/1.  vpbroadcastd y,x: 1 uop, 3/1.
    vpcmpgtd y,y: 1 uop, 1/0.5.  vpmovzxbd ymm, m64: 2 uops (p5+p23), ?/1.
    Total: 5 uops, 5c latency, one per 2c throughput (saturating p5).  The
    movzx load is off the critical path, but can't micro-fuse.
Without AVX2:

#### untested / possibly buggy
        imul        $0x1010101, %edx, %eax   # broadcast the misalignment count
                                             # to all 4 bytes of %eax
        vmovd       %eax, %xmm0
        vpshufd     $0, %xmm0, %xmm0         # broadcast to all bytes of xmm0
        vpcmpgtb    .constant2, %xmm0, %xmm1 # constant2 = .byte 0,0, 1,1, ...
        vpmovsxwd   %xmm1, %xmm2             # [32b:count>0, 32b:count>1, ...]
        vpunpckhwd  %xmm1, %xmm1, %xmm3      # [32b:count>4, 32b:count>5, ...]
        vinsertf128 $1, %xmm3, %ymm2, %ymm2  # combine the mask halves into %ymm2

        # do a masked load, too, to avoid possible NaN slowdowns if there's
        # garbage before the array
        vmaskmovps  (%rdi), %ymm2, %ymm3     # rdi = p & ~0x1F: rounded down to
                                             # previous 32B boundary
        vaddps      %ymm3, %ymm3, %ymm3
        vmaskmovps  %ymm3, %ymm2, (%rdi)
mainloop:
        ...
        # epilogue: invert the mask
Or skip the vinsertf128 and do two separate vaddps xmm / vmaskmovps pairs, but
that's probably worse. The addresses used with vmaskmovps will be aligned: it's
the masking that takes care of the unaligned accesses.
Also, we could broadcast the count just to words and use vpcmpgtw. The reason
for words rather than bytes is that there's no single instruction to pmovsxbd
from the 2nd 32b chunk of a source register (to unpack the high half of the
mask), while punpckhwd same,same can read from the high half of a reg and
double the size of the elements. An extra pmovsxbw or punpcklbw would let us
use a 64b byte constant, though.
Anyway, I don't know whether this idea is generally useful for gcc to handle
arrays that aren't guaranteed to be aligned.
Probably things should be arranged so that in the aligned case, either the mask
generation and vmaskmovps are skipped altogether, or the vmaskmovps does a
full 256b load/store rather than a fully-masked 0-byte load/store. Besides
the obvious reason of avoiding wasted work, AMD Jaguar's VMASKMOVPS takes ~300
clocks for a load with mask=0, vs. 2 clocks (15 cycle latency) in the normal
case. A 256b VMASKMOVPS store on Jaguar has one per 22c throughput, though, and
takes 36 m-ops. So it's not worth using at all when targeting Jaguar, but
avoiding Jaguar's catastrophic mask=0 case is worthwhile even when tuning for
something else.