[Bug rtl-optimization/67072] New: Slow code generated for getting each byte of a 64bit register as a LUT index.
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Thu Jul 30 22:19:00 GMT 2015
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072
Bug ID: 67072
Summary: Slow code generated for getting each byte of a 64bit
register as a LUT index.
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
gcc produces much slower code than what I wrote by hand, for a function that
does LUT lookups for each byte of source data.
intrinsics version (bad code):
https://github.com/pcordes/par2-asm-experiments/blob/master/intrin-pinsrw.c
hand-written asm version (good code):
https://github.com/pcordes/par2-asm-experiments/blob/master/asm-pinsrw.s
Both versions maintain two dependency chains to avoid latency bottlenecks.
My hand-written asm uses movzx from dl and dh, then shifts by 16 to bring the
next two bytes into dl/dh.
    movzx %dl, %eax
    movzx %dh, %ebx
    shr   $16, %rdx        ; set up for next block of this dep chain
    pinsrw $1, 0x0000(%rsi, %rax, 4), %xmm0
    pinsrw $1, 0x0000(%rdi, %rbx, 4), %xmm1
    ; repeat for $2, etc.
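The extraction pattern the asm implements can be sketched in plain C (hypothetical helper, not from the linked repo): take the low byte and the second byte of the source word, then shift right by 16 to bring the next pair into position.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the byte-extraction order the hand-written asm uses:
 * grab the low two bytes (as movzx from %dl / %dh would), then
 * shift the 64-bit source right by 16 for the next pair. */
static void extract_byte_pairs(uint64_t s, uint8_t out[8]) {
    for (size_t i = 0; i < 8; i += 2) {
        out[i]     = (uint8_t)s;          /* movzx %dl, %eax */
        out[i + 1] = (uint8_t)(s >> 8);   /* movzx %dh, %ebx */
        s >>= 16;                         /* shr $16, %rdx   */
    }
}
```

Each out[i] / out[i+1] pair would feed the two LUT lookups (low table and high table) for one 16-bit word of source data.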
Here's a cut-down version that doesn't need any extra headers:
#include <emmintrin.h>
#include <stdint.h>
union wordbytes { uint16_t w; uint8_t b[2]; };
void rs_process_pinsrw_intrin(void* dstvoid, const void* srcvoid, size_t size,
                              const uint32_t* LH)
{
    const uint64_t *src = srcvoid;
    __m128i d, l, h;
    __m128i *dst = dstvoid;
    for (size_t i = 0; i < size/sizeof(d); i += 1) {
#define LO(x) ((uint8_t)x)
//#define HI(x) ((uint8_t)((x >> 8) & 0xff))
//#define HI(x) ( (uint8_t) (((uint16_t)x) >> 8) )  // u8(((u16) s) >> 8): this gets gcc to emit movzx %bh, %eax, unlike the & 0xff version
#define HI(x) (( (union wordbytes) ((uint16_t)x) ).b[1])  // fastest code, but still horrible; WAY too much useless mov
        uint64_t s0 = src[i*2 + 0];
        uint64_t s1 = src[i*2 + 1];
        l = _mm_cvtsi32_si128( LH[      LO(s0)] );  // movd: the lookup for the l/h byte of the first word
        h = _mm_cvtsi32_si128( LH[256 + HI(s0)] );
        s0 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s1)], 4);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 4);
        s1 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s0)], 1);
        h = _mm_insert_epi16(h, LH[256 + HI(s0)], 1);
        s0 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s1)], 5);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 5);
        s1 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s0)], 2);
        h = _mm_insert_epi16(h, LH[256 + HI(s0)], 2);
        s0 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s1)], 6);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 6);
        s1 >>= 16;
        l = _mm_insert_epi16(l, LH[      LO(s0)], 3);
        h = _mm_insert_epi16(h, LH[256 + HI(s0)], 3);
        l = _mm_insert_epi16(l, LH[      LO(s1)], 7);
        h = _mm_insert_epi16(h, LH[256 + HI(s1)], 7);
#undef LO
#undef HI
        d = _mm_xor_si128(l, h);  // gcc 4.9 uses a 3rd xmm reg for no reason, even with l instead of d
        d = _mm_xor_si128(d, dst[i]);
        _mm_store_si128(&dst[i], d);
    }
}
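For reference, the three HI() candidates commented in the source are semantically equivalent; only the code gcc generates for them differs. A minimal standalone check (function versions of the macros; the union variant assumes a little-endian target, as the original does):

```c
#include <stdint.h>

union wordbytes { uint16_t w; uint8_t b[2]; };

/* The three HI() candidates from the report, written as functions. */
static uint8_t hi_and(uint64_t x)   { return (uint8_t)((x >> 8) & 0xff); }
static uint8_t hi_shift(uint64_t x) { return (uint8_t)(((uint16_t)x) >> 8); }
static uint8_t hi_union(uint64_t x) {           /* little-endian only */
    union wordbytes u = { .w = (uint16_t)x };
    return u.b[1];
}
```

All three return the second-lowest byte of x; the point of the bug report is that only the shift and union forms coax gcc toward a movzx from a high-8 register, and even then with redundant mov instructions.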
Compiled with:
gcc -std=gnu11 -O3 -o- -S intrin-pinsrw-simple.c 2>&1 | less
gcc 4.9.2's output for this includes a lot of extra mov instructions. Instead
of shifting s0 and s1 in place in the same register, it copies them to another
register and then shifts the copy by 16, 32, or 48. Maybe gcc would have done
better if it had decided to put s0 and s1 into registers with an addressable
high byte (%ah, %bh, %ch, %dh), instead of r9 and r14.
It runs 35% slower than my hand-tuned asm version on Intel Sandybridge. Uop
throughput is the bottleneck here (4 fused-domain uops per cycle), because
there's a mix of load and ALU uops, and the dep chains are short enough not to
be a problem. My asm version does more PINSRW xmm dep chains in parallel and
merges at the end with punpcklqdq, but that change only gained a couple
percent, IIRC.