Bug 82369 - "optimizes" indexed addressing back into two pointer increments
Summary: "optimizes" indexed addressing back into two pointer increments
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: bin cheng
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-09-29 22:53 UTC by Peter Cordes
Modified: 2017-10-04 17:12 UTC

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Peter Cordes 2017-09-29 22:53:49 UTC
gcc defeats this attempt to get it to reduce the front-end bottleneck in this loop (simplified from a version of the loop in pr82356).

Indexing src relative to dst (adding a constant src-dst offset to the moving dst pointer) is easy to do in C, and works well.  But when one pointer advances faster than the other it's very clunky to express in C.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// index src relative to dst, but use a pointer-increment for dst
// so the store still has a simple addressing mode (and can run on port7)
// gcc and clang "optimize" back to two separate pointers, but ICC13 leaves it alone
// Saves one ADD instruction in the loop.
void pack_high8_indexed_src(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;

  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do{
     __m128i v0 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset));
     __m128i v1 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset)+1);
     __m128i res = _mm_packus_epi16(v1,v0);

     _mm_storeu_si128((__m128i*)dstu, res);
     dstu += 16;
     //src += 16;  // 32 bytes
  } while(dstu < end_dst);
}

https://godbolt.org/g/pycLQC
gcc -O3 -mtune=skylake  de-optimizes it to this:

pack_high8_indexed_src:       # gcc and clang do this:
        addq    %rdi, %rdx
.L2:
        movdqu  16(%rsi), %xmm0
        movdqu  (%rsi), %xmm1
        addq    $16, %rdi
        addq    $32, %rsi                # 2 separate pointer increments
        packuswb        %xmm1, %xmm0
        movups  %xmm0, -16(%rdi)
        cmpq    %rdi, %rdx
        ja      .L2
        ret

Intel SnB-family: 7 fused-domain uops.  (The store micro-fuses, and the cmp/ja macro-fuses).  In theory, this bottlenecks on front-end throughput (4 uops per clock), running at 1 iter per 1.75 cycles.  The store uses a simple addressing mode, so its store-address uop can run on port7.  If not for the front-end bottleneck, the back-end could run this at nearly 1 per clock.

ICC13/16/17 compiles it the way I was hoping to hand-hold gcc into producing: 6 fused-domain uops, which should run at 1 iter per 1.5 clocks on SnB/HSW/SKL.  This might also be good on Silvermont, since it's fewer instructions.

Possibly a similar benefit on K10 / BD (although AMD would benefit from plain array indexing, because indexed addressing modes for stores aren't any worse there, AFAIK.  But -mtune=bdver2 doesn't do that either.)

pack_high8_indexed_src:               # ICC17
        lea       (%rdi,%rdi), %rax
        negq      %rax
        addq      %rdi, %rdx
        addq      %rax, %rsi
..B1.2:
        movdqu    16(%rsi,%rdi,2), %xmm1           # src indexed via dst*2
        movdqu    (%rsi,%rdi,2), %xmm0
        packuswb  %xmm0, %xmm1
        movdqu    %xmm1, (%rdi)                    # dst with a simple addressing mode.
        addq      $16, %rdi                        # 16B of dst, 32B of src
        cmpq      %rdx, %rdi
        jb        ..B1.2
        ret

A mov-load with a complex addressing mode is a single uop on all CPUs.  It might have 1c higher latency than a simple addressing mode, but that doesn't matter when the address math is off the critical path.

With unrolling, the actual work is only 4 fused-domain uops for 2x load + pack + store, so the front-end can just barely keep the back-end fed with infinite unrolling.  For any sane unroll factor, saving 1 uop of loop overhead is a slight win.

A store with an indexed addressing-mode can't run on port7 on Haswell/Skylake.  With any unrolling, that would become a bottleneck.  On SnB/IvB, indexed stores are un-laminated into 2 fused-domain uops, so simple array-indexing gets worse with unrolling.


BTW, with an indexed store, we could count a negative index up towards zero.  That would avoid the CMP, since the loop overhead could be just a single macro-fused uop: add $16, %rdx / jnc.  (But only SnB-family macro-fuses add/jcc; AMD and Core2/Nehalem only macro-fuse test/cmp.)  Even on a CPU that doesn't macro-fuse at all (e.g. Silvermont / KNL), dropping the CMP is still good.
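For illustration, an untested sketch of that negative-index version (my naming; assumes bytes is a nonzero multiple of 16).  Note the store becomes indexed, so it can't use port7 on HSW/SKL:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Count a negative byte offset up towards zero so the loop branch can reuse
// the flags set by the ADD, instead of needing a separate CMP.
void pack_high8_negative_index(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uint8_t *dst_end = dst + bytes;
  const uint8_t *src_end = (const uint8_t*)(src + bytes);   // src spans 2*bytes bytes
  ptrdiff_t i = -(ptrdiff_t)bytes;                          // counts up towards 0
  do {
     __m128i v0 = _mm_loadu_si128((const __m128i*)(src_end + 2*i));
     __m128i v1 = _mm_loadu_si128((const __m128i*)(src_end + 2*i) + 1);
     __m128i res = _mm_packus_epi16(v1, v0);
     _mm_storeu_si128((__m128i*)(dst_end + i), res);        // indexed store
     i += 16;
  } while (i != 0);   // hopefully add $16, %reg / jne, macro-fused on SnB-family
}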

---

BTW, with AVX, micro-fused loads are un-laminated on Haswell/Skylake.  e.g.

        vmovdqu   16(%rsi,%rdi,2), %xmm0
        vpackuswb (%rsi,%rdi,2), %xmm0, %xmm1
        vmovdqu   %xmm1, (%rdi)

is 3 fused-domain uops in the decoders/uop cache, but it's 4 fused-domain uops for the issue/rename stage and in the ROB.  The vpackuswb un-laminates.
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes#comment76198723_31027695

So if unrolling with AVX, it's better to do what gcc does and increment 2 separate pointers.  Then we can actually keep the back-end fed and bottleneck on load throughput, store throughput, and shuffle throughput.  gcc can unroll this loop (but clang can't, maybe confused by using integers as pointers.)
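In case it helps, an untested sketch of what I mean by the two-pointer unrolled version (my naming; assumes bytes is a nonzero multiple of 32; build with -mavx to get the VEX encoding discussed above):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// With AVX the indexed-src trick un-laminates anyway (see above), so just
// increment two separate pointers.  Unrolled by 2.
void pack_high8_two_pointers(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do {
     __m128i v0 = _mm_loadu_si128((const __m128i*)src);
     __m128i v1 = _mm_loadu_si128((const __m128i*)src + 1);
     __m128i v2 = _mm_loadu_si128((const __m128i*)src + 2);
     __m128i v3 = _mm_loadu_si128((const __m128i*)src + 3);
     _mm_storeu_si128((__m128i*)dst,     _mm_packus_epi16(v1, v0));
     _mm_storeu_si128((__m128i*)dst + 1, _mm_packus_epi16(v3, v2));
     dst += 32;   // 32 bytes of dst
     src += 32;   // 32 uint16_t elements = 64 bytes of src
  } while (dst < end_dst);
}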

packuswb (%rsi,%rdi,2), %xmm0  could stay micro-fused, because it's a 2-operand instruction with a read-modify destination (not write-only like pabsb).  But we can't use it here because a legacy-SSE memory operand requires 16-byte alignment, and the source uses unaligned loads.  (Of course, with load instead of loadu, this indexing trick would still be a win.)
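An untested sketch of that aligned variant (my naming; assumes src really is 16-byte aligned), where the compiler is at least allowed to fold one load into packuswb's memory operand:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Same as pack_high8_indexed_src, but v0 uses an aligned load so the compiler
// may fold it into a packuswb memory operand, which stays micro-fused.
void pack_high8_indexed_src_aligned(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;
  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do {
     __m128i v1 = _mm_loadu_si128((const __m128i*)(dstu*2 + src_dst_offset) + 1);
     __m128i v0 = _mm_load_si128((const __m128i*)(dstu*2 + src_dst_offset));   // aligned: foldable
     _mm_storeu_si128((__m128i*)dstu, _mm_packus_epi16(v1, v0));
     dstu += 16;
  } while (dstu < end_dst);
}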
Comment 1 Andrew Pinski 2017-09-29 23:03:54 UTC
IVOPTs is doing this.

But the cost must be this way for a reason.
Someone would need to understand how the x86 backend costs this case.

It might be the case that the cost is taking the vector modes into account.
Comment 2 Richard Biener 2017-10-02 08:31:24 UTC
Maybe something for Bin to look at.
Comment 3 bin cheng 2017-10-04 16:20:22 UTC
Given IR dump before IVOPTs:
  <bb 2> [15.00%] [count: INV]:
  _1 = dst_12(D) + bytes_13(D);
  end_dst_14 = (uintptr_t) _1;
  srcu_16 = (uintptr_t) src_15(D);
  dstu_17 = (uintptr_t) dst_12(D);
  _2 = dstu_17 * 2;
  _3 = srcu_16 - _2;

  <bb 3> [100.00%] [count: INV]:
  # dstu_10 = PHI <dstu_17(2), dstu_22(4)>
  _4 = dstu_10 * 2;
  _5 = _3 + _4;
  _6 = (const __m128i_u * {ref-all}) _5;
  _25 = *_6;

When IVOPTs tries to find an address-type IV for "*_6", the base it has looks like:
  (const __m128i_u * {ref-all}) ((uintptr_t) dst_12(D) * 2 + _3)
If we do more aggressive expansion for "_3", we could have:
  (const __m128i_u * {ref-all}) (srcu_16)

So without the expansion, we can't find the base object for the address in alloc_iv; that's why we fail to classify _6 as an address-type IV.  As a result, addressing modes are not considered when choosing candidates, so the wrong candidate is chosen in the end.

OTOH, we surely don't want to do aggressive expansion unconditionally, because that introduces more code into the loop.  One possible fix is to do aggressive expansion for analysis purposes only, rather than unconditionally (i.e. not for code generation).  For example, we could try aggressive expansion when alloc_iv fails to find a base object and see if it can do better.

Thanks,
bin
Comment 4 bin cheng 2017-10-04 17:12:31 UTC
Hmm, with expansion, IVOPTs can find address type uses as:
Group 0:
  Type:	ADDRESS
  Use 0.0:
    At stmt:	_25 = *_6;
    At pos:	*_6
    IV struct:
      Type:	const __m128i_u * {ref-all}
      Base:	(const __m128i_u * {ref-all}) src_15(D)
      Step:	32
      Object:	(void *) src_15(D)
      Biv:	N
      Overflowness wrto loop niter:	Overflow
Group 1:
  Type:	ADDRESS
  Use 1.0:
    At stmt:	*dstu.2_9 = _23;
    At pos:	*dstu.2_9
    IV struct:
      Type:	__m128i_u * {ref-all}
      Base:	(__m128i_u * {ref-all}) dst_12(D)
      Step:	16
      Object:	(void *) dst_12(D)
      Biv:	N
      Overflowness wrto loop niter:	Overflow
Group 2:
  Type:	COMPARE
  Use 2.0:
    At stmt:	if (end_dst_14 > dstu_22)
    At pos:	dstu_22
    IV struct:
      Type:	uintptr_t
      Base:	(uintptr_t) dst_12(D) + 16
      Step:	16
      Biv:	Y
      Overflowness wrto loop niter:	Overflow
Group 3:
  Type:	ADDRESS
  Use 3.0:
    At stmt:	_24 = *_8;
    At pos:	*_8
    IV struct:
      Type:	const __m128i_u * {ref-all}
      Base:	(const __m128i_u * {ref-all}) (src_15(D) + 16)
      Step:	32
      Object:	(void *) src_15(D)
      Biv:	N
      Overflowness wrto loop niter:	Overflow

But it's unlikely we can express all the address-type uses in terms of dstu, because they have different base objects.  In general, we don't allow expressing a reference to one base object using a pointer that points to a different base object.  This case is a bit tricky because the addresses are computed and cast from uintptr_t, which means we can perhaps assume the resulting pointers are valid to point to any address?

Even if it's valid to rewrite the load as MEM[src_dst_offset + dstu << 1], it's hard to do so in the current IVOPTs because it's implemented on the basis of base_object.

Richi, any comments?

Thanks,
bin