[Bug rtl-optimization/67072] New: Slow code generated for getting each byte of a 64bit register as a LUT index.

peter at cordes dot ca gcc-bugzilla@gcc.gnu.org
Thu Jul 30 22:19:00 GMT 2015


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072

            Bug ID: 67072
           Summary: Slow code generated for getting each byte of a 64bit
                    register as a LUT index.
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

gcc produces much slower code than what I wrote by hand, for a function that
does LUT lookups for each byte of source data.

intrinsics version (bad code):
https://github.com/pcordes/par2-asm-experiments/blob/master/intrin-pinsrw.c

hand-written asm version (good code):
https://github.com/pcordes/par2-asm-experiments/blob/master/asm-pinsrw.s

Both versions maintain two dependency chains to avoid latency bottlenecks.

  My hand-written asm uses movzx from dl and dh, then shifts by 16 to bring the
next two bytes into dl/dh.

        movzx           %dl, %eax
        movzx           %dh, %ebx
        shr             $16, %rdx       ; set up for next block of this dep chain
        pinsrw          $1, 0x0000(%rsi, %rax, 4), %xmm0
        pinsrw          $1, 0x0000(%rdi, %rbx, 4), %xmm1
          repeat for $2, etc.



Here's a cut-down version that doesn't need any extra headers:

#include <emmintrin.h>
#include <stdint.h>
union wordbytes { uint16_t w; uint8_t b[2]; };
void rs_process_pinsrw_intrin(void* dstvoid, const void* srcvoid, size_t size, const uint32_t* LH)
{
        const uint64_t *src = srcvoid;
        __m128i d, l, h;
        __m128i *dst = dstvoid;

        for (size_t i = 0; i < size/sizeof(d) ; i+=1) {

#define LO(x) ((uint8_t)x)
//#define HI(x) ((uint8_t)((x >> 8) & 0xff))
//#define HI(x) ( (uint8_t) (((uint16_t)x) >> 8) )  // This gets gcc to emit movz %bh, %eax, unlike the & 0xff version
// u8(((u16) s) >>  8)
#define HI(x) (( (union wordbytes) ((uint16_t)x) ).b[1])  // fastest code, but still horrible; WAY too much useless mov

                uint64_t s0 = src[i*2 + 0];
                uint64_t s1 = src[i*2 + 1];

                l = _mm_cvtsi32_si128( LH[      LO(s0)] );              // movd the lookup for the l/h byte of the first word
                h = _mm_cvtsi32_si128( LH[256 + HI(s0)] );
                s0 >>= 16;
                l = _mm_insert_epi16(l, LH[      LO(s1)], 4);
                h = _mm_insert_epi16(h, LH[256 + HI(s1)], 4);
                s1 >>= 16;

                l = _mm_insert_epi16(l, LH[      LO(s0)], 1);
                h = _mm_insert_epi16(h, LH[256 + HI(s0)], 1);
                s0 >>= 16;
                l = _mm_insert_epi16(l, LH[      LO(s1)], 5);
                h = _mm_insert_epi16(h, LH[256 + HI(s1)], 5);
                s1 >>= 16;

                l = _mm_insert_epi16(l, LH[      LO(s0)], 2);
                h = _mm_insert_epi16(h, LH[256 + HI(s0)], 2);
                s0 >>= 16;
                l = _mm_insert_epi16(l, LH[      LO(s1)], 6);
                h = _mm_insert_epi16(h, LH[256 + HI(s1)], 6);
                s1 >>= 16;

                l = _mm_insert_epi16(l, LH[      LO(s0)], 3);
                h = _mm_insert_epi16(h, LH[256 + HI(s0)], 3);
                l = _mm_insert_epi16(l, LH[      LO(s1)], 7);
                h = _mm_insert_epi16(h, LH[256 + HI(s1)], 7);
#undef LO
#undef HI
                d = _mm_xor_si128(l, h);        // gcc 4.9 uses a 3rd xmm reg for no reason, even with l instead of d
                d = _mm_xor_si128(d, dst[i]);
                _mm_store_si128(&dst[i], d);
        }
}

alias m=less
gcc -std=gnu11 -O3  -o- -S intrin-pinsrw-simple.c 2>&1 | m

gcc 4.9.2's output for this includes a lot of extra mov instructions.  Instead
of shifting s0 and s1 in the same register, it copies them to another register
and then shifts the copy right by 16, 32, or 48.  Maybe gcc would have done
better if it had put s0 and s1 into registers with addressable high-8 halves
(%ah, %bh, %ch, %dh) instead of %r9 and %r14.
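
Roughly the difference I'm describing (an illustrative sketch from memory, not
gcc 4.9.2's verbatim output; the register choices are made up):

        # gcc-style: copy the source, shift the copy, zero-extend the low byte
        mov             %r9, %rax
        shr             $16, %rax
        movzbl          %al, %eax       # byte 2 of s0
        mov             %r9, %rbx
        shr             $32, %rbx
        movzbl          %bl, %ebx       # byte 4 of s0

        # hand-written style: shift the source in place and reuse dl/dh
        movzbl          %dl, %eax
        movzbl          %dh, %ebx
        shr             $16, %rdx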

The intrinsics version runs 35% slower than my hand-tuned asm version on Intel
Sandy Bridge.  Uop throughput is the bottleneck here (the limit is 4
fused-domain uops per cycle): there's a mix of load and ALU uops, and the dep
chains are short enough not to be a problem.  My asm version does more PINSRW
xmm dep chains in parallel and merges them at the end with punpcklqdq, but that
change only gained a couple of percent, IIRC.
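
For reference, the end-of-block merge in my asm version looks roughly like this
(illustrative sketch; the exact register assignment is in asm-pinsrw.s linked
above):

        # each xmm reg holds a half-filled pinsrw chain
        punpcklqdq      %xmm2, %xmm0    # combine low qwords of the two "l" chains
        punpcklqdq      %xmm3, %xmm1    # same for the two "h" chains
        pxor            %xmm1, %xmm0    # l ^ h, as in the C version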


