This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/69622] New: compiler reordering of non-temporal (write-combining) stores produces significant performance hit
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 02 Feb 2016 12:16:00 +0000
- Subject: [Bug target/69622] New: compiler reordering of non-temporal (write-combining) stores produces significant performance hit
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622
Bug ID: 69622
Summary: compiler reordering of non-temporal (write-combining)
stores produces significant performance hit
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu
IDK whether to mark this as "target" or something else. Other architectures
might have similar write-combining stores that are sensitive to writing whole
cache-lines at once.
For background, see this SO question:
http://stackoverflow.com/questions/25778302/wrong-gcc-generated-assembly-ordering-results-in-performance-hit
In an unrolled copy loop, gcc decides to emit vmovntdq stores in a different
order than they appear in the source. There's no correctness issue, but the
number of fill buffers is very limited (maybe 10 or so per core?). So it's
*much* better to write all of one cache line, then all of the next cache line.
See my answer on that SO question for lots of discussion and links.

The poster of that question got a 33% speedup (from ~10.2M packets per second
to ~13.3M packets per second) by putting the loads and stores in source order
in the binary. (Unknown hardware and surrounding code, but presumably this
loop is *the* bottleneck in his app.) Anyway, real numbers show that this
isn't just a theoretical argument that some code would be better.
Compilable test-case that demonstrates the issue:
#include <stddef.h>     // for size_t
#include <stdint.h>
#include <immintrin.h>

//#define compiler_writebarrier()   __asm__ __volatile__ ("")
#define compiler_writebarrier()   // empty

void copy_mcve(void *const destination, const void *const source, const size_t bytes)
{
    __m256i *dst = destination;
    const __m256i *src = source;
    const __m256i *dst_endp = (destination + bytes);  // gcc extension: void* arithmetic

    while (dst < dst_endp) {
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );
        _mm256_stream_si256( dst+0, m0 );
        compiler_writebarrier();  // even one anywhere in the loop is enough for current gcc
        _mm256_stream_si256( dst+1, m1 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+2, m2 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+3, m3 );
        compiler_writebarrier();
        src += 4;
        dst += 4;
    }
}
compiles (with the barriers defined as a no-op) to (gcc 5.3.0 -O3
-march=haswell: http://goo.gl/CwtpS7):
copy_mcve:
        addq    %rdi, %rdx
        cmpq    %rdx, %rdi
        jnb     .L7
.L5:
        vmovdqa 32(%rsi), %ymm2
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqa -64(%rsi), %ymm1
        vmovdqa -32(%rsi), %ymm0
        vmovdqa -128(%rsi), %ymm3
        # If dst is aligned, the four halves of two cache lines are {A B} {C D}:
        vmovntdq %ymm2, -96(%rdi)    # B
        vmovntdq %ymm1, -64(%rdi)    # C
        vmovntdq %ymm0, -32(%rdi)    # D
        vmovntdq %ymm3, -128(%rdi)   # A
        cmpq    %rdi, %rdx
        ja      .L5
        vzeroupper
.L7:    ret
If the output buffer is aligned, that B C D A store ordering maximally
separates the two halves of the first cache line, giving the most opportunity
for partially-full fill buffers to get flushed.
Doing the +32 load first makes no sense with that placement of the
pointer-increment instructions. Doing the +0 load first could save a byte of
code-size by not needing a displacement byte. I'm guessing that's what one
optimizer pass was going for when it put the subs there, but then something
else came along and reordered the loads.

Is there a pass that tries to touch both cache lines as early as possible, to
trigger the loads sooner (assuming the buffer is 64B-aligned)?
Doing the subs after the last store would save another insn byte, because one
of the stores could use an empty displacement as well. That's where clang puts
the pointer increments (and it keeps the loads and stores in source order).
clang also uses vmovaps / vmovntps. That's probably a holdover from saving an
insn byte in the non-VEX encoding of the 128b insn, but it does make the
output work with AVX1 instead of requiring AVX2.
Using a 2-register addressing mode for the loads could save a sub instruction
inside the loop: increment dst normally, but reference src with a 2-register
addressing mode using dst and a register initialized with src-dst. (In the
godbolt link, uncomment the #define ADDRESSING_MODE_HACK. With ugly enough
source, gcc can be bludgeoned into making code like that. It wastes insns in
the intro, though, apparently to avoid 3-component (base+index+disp)
addresses; I've been meaning to check whether that's a factor.)
2-register addressing modes don't seem to micro-fuse in the pipeline on
SnB-family CPUs
(http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes),
but Agner Fog's testing methods found that they do micro-fuse. There's a
theory (discussed in a thread on his blog) that they micro-fuse in the uop
cache only. Anyway, way off topic, sorry.