Bug 82137 - auto-vectorizing shuffles way too much to avoid duplicate work
Summary: auto-vectorizing shuffles way too much to avoid duplicate work
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2017-09-08 02:07 UTC by Peter Cordes
Modified: 2017-12-14 14:54 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-09-12 00:00:00


Description Peter Cordes 2017-09-08 02:07:04 UTC
(same code as bug 82136, but with aligned pointers, and discussing the overall vectorization strategy)

static const int aligned = 1;

void pairs_double(double blocks[]) {
    if(aligned) blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0 ; i<10240 ; i+=2) {
        double x = blocks[i];
        double y = blocks[i+1];
        blocks[i] = x*y;
        blocks[i+1] = x+y;
    }
}

https://godbolt.org/g/pTY5iU has this + a uint64_t version (with ^ and &)

i.e. load a pair of 64-bit elements and replace them with 2 different combinations of the pair.  See https://stackoverflow.com/questions/46038401/unpack-m128i-m256i-to-m64-mmx-sse2-avx2.

gcc auto-vectorizes this with *way* too much shuffling when AVX / AVX2 are available; the same goes for the integer version.  It's especially bad with AVX1 only, but even the AVX2 code is a mess and will totally bottleneck on shuffle throughput on Intel and AMD CPUs (Ryzen has good shuffle throughput, but lane-crossing shuffles are much more expensive than in-lane ones).

gcc 8.0.8 -O3 -march=haswell

pairs_double:
        leaq    81920(%rdi), %rax
.L2:
        vmovapd (%rdi), %ymm1
        vmovapd 32(%rdi), %ymm0
        addq    $64, %rdi
        vunpcklpd       %ymm0, %ymm1, %ymm2
        vunpckhpd       %ymm0, %ymm1, %ymm0
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vpermpd $68, %ymm0, %ymm3
        vpermpd $238, %ymm0, %ymm0
        vpermpd $68, %ymm1, %ymm2
        vpermpd $238, %ymm1, %ymm1
        vshufpd $12, %ymm3, %ymm2, %ymm2
        vshufpd $12, %ymm0, %ymm1, %ymm0
        vmovapd %ymm2, -64(%rdi)
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret

I haven't worked through the constants to see how it's shuffling, but the same function with uint64_t (and operators ^ &) uses 4 shuffles per 2 vectors after the actual work:

        # vpand / vpxor in ymm0 / ymm1
        vpunpcklqdq     %ymm0, %ymm1, %ymm2
        vpunpckhqdq     %ymm0, %ymm1, %ymm0
        vperm2i128      $32, %ymm0, %ymm2, %ymm1
        vperm2i128      $49, %ymm0, %ymm2, %ymm0
        vmovdqa %ymm1, -64(%rdi)
        vmovdqa %ymm0, -32(%rdi)
 
It uses equivalent shuffles to prepare for vpand/vpxor, so vunpckl/hpd + 2x vperm2f128 should do the same shuffle.  Using split stores would help a lot on Intel and AMD: 2x vmovdqa xmm, then 2x vextractf128 xmm.  SnB/HSW/SKL vextractf128 to memory is 2 fused-domain uops (Agner Fog's table) but there's no ALU uop.  It's just a non-micro-fused store.  Since gcc's code horribly bottlenecks on shuffles, split stores would be a win here.

It's a huge win on Ryzen, where vperm2f128 is 8 uops with 3c throughput.  (-march=ryzen enables -mprefer-avx128, but -march=haswell running on Ryzen is probably a relevant use-case.  When all else is equal for Intel, do something that doesn't suck on Ryzen.)
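
Something like this helper illustrates the split-store idea (just a sketch, not compiler output; the function name is made up, and lo / hi stand for the two vunpckl/hpd results, i.e. {m0,a0 | m2,a2} and {m1,a1 | m3,a3}):

#include <immintrin.h>

/* Store the two unpck results as four 128b halves instead of using two
   lane-crossing vperm2f128 + two 256b stores.  out must be 16B-aligned,
   which holds when blocks is 64B-aligned as in the example. */
static inline void store_pairs_split(double *out, __m256d lo, __m256d hi) {
    _mm_store_pd(out,     _mm256_castpd256_pd128(lo));    // vmovapd xmm -> [out]
    _mm_store_pd(out + 2, _mm256_castpd256_pd128(hi));    // vmovapd xmm -> [out+16]
    _mm_store_pd(out + 4, _mm256_extractf128_pd(lo, 1));  // vextractf128 $1 -> [out+32]
    _mm_store_pd(out + 6, _mm256_extractf128_pd(hi, 1));  // vextractf128 $1 -> [out+48]
}

No shuffle-port uops at all, and nothing that's painful on Ryzen.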

----

However, a different strategy entirely is *MUCH* better for most CPUs, and as a bonus it only needs AVX1 and no lane-crossing shuffles:

Do far less shuffling in exchange for doing more mul/add (or whatever), by allowing duplicates.  Doing the same operation twice is allowed, even when it can raise FP exceptions (in C and C++).

    b0     b1       |    b2       b3       # load 256b
    b1     b0       |    b3       b2       # vshufpd or vpermilpd for in-lane swap

    b0*b1  b1*b0    |    b2*b3    b3*b2    # vmulpd
    b0+b1  b1+b0    |    b2+b3    b3+b2    # vaddpd

    b0*b1  b0+b1    |    b2*b3    b2+b3    # vblendpd
                                           # store 256b

Per two 128b pairs, that's 1 load, 1 shuffle, 2x FP math, 1 immediate-blend, 1 store.  No lane-crossing shuffles is a huge plus for Ryzen (vperm2f128 is very slow).

That's 6 fused-domain uops on Intel, and should easily run at 1 iter per 1.5 cycles (+ loop overhead), without bottlenecking on any ports.  The bottleneck is the front-end, so unrolling helps.

That's 0.75c per 128b pair.  (Slower on SnB/IvB, bottlenecking on load/store throughput.)

Immediate-blend can run on any port on HSW+, or p0/p5 on SnB (and good throughput on BD/Ryzen) so it's much more throughput-friendly than using vunpcklpd to mix the mul and add result vectors.  If the operation hadn't been commutative (e.g. sub instead of add), you might have to use vpunpck.

This same strategy is also optimal for integer booleans with AVX2.  (Or possibly even with AVX1, but using VXORPS / VANDPS will bottleneck on port5 on SnB-Broadwell, since it wasn't until Skylake that FP booleans run on p015).
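
In intrinsics the whole strategy is just (sketch for illustration only; the function name is made up and I haven't benchmarked this exact code):

#include <immintrin.h>

/* One 256b vector = two pairs; caller passes a 64B-aligned pointer like the
   original example. */
void pairs_double_blend(double blocks[]) {
    blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0; i < 10240; i += 4) {
        __m256d v  = _mm256_load_pd(blocks + i);   // b0 b1 | b2 b3
        __m256d sw = _mm256_permute_pd(v, 0x5);    // b1 b0 | b3 b2   (vpermilpd, in-lane swap)
        __m256d p  = _mm256_mul_pd(v, sw);         // products, duplicated within each pair
        __m256d s  = _mm256_add_pd(v, sw);         // sums, duplicated within each pair
        __m256d r  = _mm256_blend_pd(p, s, 0xA);   // b0*b1 b0+b1 | b2*b3 b2+b3
        _mm256_store_pd(blocks + i, r);
    }
}

That should compile to essentially the 6-uop loop body counted above (load / vpermilpd / vmulpd / vaddpd / vblendpd / store), plus loop overhead.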
Comment 1 Richard Biener 2017-09-12 08:59:55 UTC
Interesting idea.  It's probably a bit hard to make the vectorizer do this
though, given its current structure and the fact that it would have to
cost the extra ops against the saved shuffling (the extra cost of the ops
would depend on the availability of spare execution resources, for example).

In general the interleaving strategy looks like a bit of a dead end with
AVX (and possibly worse on AVX512).  This is how extract even gets expanded:

;; vect_perm_even_21 = VEC_PERM_EXPR <vect_x_13.2_26, vect_x_13.3_22, { 0, 2, 4, 6 }>;

(insn 18 16 17 (set (reg:V4DF 103)
        (vec_select:V4DF (vec_concat:V8DF (reg:V4DF 95 [ vect_x_13.2 ])
                (reg:V4DF 93 [ vect_x_13.3 ]))
            (parallel [
                    (const_int 0 [0])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2])
                    (const_int 6 [0x6])
                ]))) "t2.c":6 {*avx_unpcklpd256}
     (nil))

(insn 17 18 0 (set (reg:V4DF 92 [ vect_perm_even_21 ])
        (vec_select:V4DF (reg:V4DF 103)
            (parallel [
                    (const_int 0 [0])
                    (const_int 2 [0x2])
                    (const_int 1 [0x1])
                    (const_int 3 [0x3])
                ]))) "t2.c":6 4183 {avx2_permv4df_1}
     (nil))
Comment 2 Peter Cordes 2017-09-12 18:23:23 UTC
(In reply to Richard Biener from comment #1)
> Interesting idea.  It's probably a bit hard to make the vectorizer do this
> though, given its current structure and the fact that it would have to
> cost the extra ops against the saved shuffling (the extra cost of the ops
> would depend on the availability of spare execution resources, for example).

 Shuffles take execution resources just like anything else (except for load+broadcast (vbroadcastss or movddup, or even movshdup) which is done as part of a load uop in Ryzen and Intel CPUs).

On Haswell and later Intel, there's only one shuffle port, but multiple ports for almost everything else (especially on Skylake).  Shuffles are very often the bottleneck if you're not careful, or shuffle + something else that also only runs on port 5 (Intel).

Shuffle instructions also of course take front-end throughput resources, and that's another common bottleneck in code with a mix of ops for different ports (e.g. loads, stores, ALU, integer loop overhead.)  Without loop unrolling, front-end or memory bandwidth are the most likely bottlenecks for load+ALU+store loops.  (Latency is often a bottleneck for reduction loops, because gcc is bad at unrolling with multiple accumulators last I looked.)

A cost model that includes the front-end, and the limited throughput of shuffles, would be a *very* good thing to have.  Especially when using shuffles that take multiple instructions because of limited flexibility in the instruction set.

 
> In general the interleaving strategy looks like a bit of a dead end with
> AVX.

And in bug 82136 you said:
> (I guess AVX256 doesn't have full width extract even/odd
> and interleave high/low ...)

That's correct.  AVX1 / AVX2 have very few 2-input lane-crossing shuffles.  (Most of the 2-input shuffles work in two separate 128b lanes, like 2 separate 128b instructions glued together).

I think the only ones that exist are vperm2i/f128 which shuffle with 128b granularity.


> (and possibly worse on AVX512).

Actually it's much better on AVX512.  It mostly drops the in-lane limitations of AVX1/AVX2.  It has powerful 2-input lane-crossing shuffles like vpermt2d / ps or vpermt2q / pd that can do the shuffles gcc wants with one instruction each, using a shuffle control constant in another vector.

Clang and gcc both use them with `-march=skylake-avx512`: https://godbolt.org/g/SzwxNA.  (Actually, gcc uses vpermi2pd instead of vpermt2pd, costing it an extra 2 vmovapd instead of clobbering the load result or add results with the 2nd shuffle.  Clang gets it right; see the clang tab on that godbolt link.)

It's back down to 4 shuffles per 2 loads / 2 stores / add+mul, same as in the 128b case with unpckl/hpd.  I.e. 2 shuffles per vector of results, but it saves a mul+add vs. the shuffle/blend AVX1 strategy.  It also "tricks" gcc into unrolling and doing 2 vectors per loop iteration, reducing loop overhead for the non-PGO default of not unrolling.
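
For reference, the whole AVX512 strategy in intrinsics looks about like this (illustrative sketch only; the index vectors are written out by hand, the function name is made up, and _mm512_permutex2var_pd is the intrinsic behind vpermt2pd / vpermi2pd):

#include <immintrin.h>

void pairs_double_avx512(double blocks[]) {
    const __m512i even = _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14);   // x elements of both inputs
    const __m512i odd  = _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15);   // y elements of both inputs
    const __m512i lo   = _mm512_setr_epi64(0, 8, 1, 9, 2, 10, 3, 11);    // re-interleave pairs 0..3
    const __m512i hi   = _mm512_setr_epi64(4, 12, 5, 13, 6, 14, 7, 15);  // re-interleave pairs 4..7
    for (int i = 0; i < 10240; i += 16) {               // 2 vectors = 8 pairs per iteration
        __m512d a = _mm512_load_pd(blocks + i);
        __m512d b = _mm512_load_pd(blocks + i + 8);
        __m512d x = _mm512_permutex2var_pd(a, even, b); // deinterleave: all x
        __m512d y = _mm512_permutex2var_pd(a, odd,  b); // deinterleave: all y
        __m512d p = _mm512_mul_pd(x, y);
        __m512d s = _mm512_add_pd(x, y);
        _mm512_store_pd(blocks + i,     _mm512_permutex2var_pd(p, lo, s));
        _mm512_store_pd(blocks + i + 8, _mm512_permutex2var_pd(p, hi, s));
    }
}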

KNL has lower throughput for vpermt2pd (one per 2c vs. 1c for vpermilps), but it's still only 1 uop.  Since Knight's Landing bottlenecks almost completely on front-end decode (2 instructions per clock), it takes about 8 clocks to issue the loop, which matches the time for 4 two-input shuffles.  If out-of-order execution + hyperthreading can hide the latency, it should be fine.

Skylake-avx512 (aka SKX) doesn't care about 2 or 3 input shuffles vs. 1 input (same throughput for vpermilpd vs. vpermt2pd https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel0050654_SkylakeX_InstLatX64.txt).  Lane-crossing shuffles have higher latency, but the shuffle isn't part of a loop-carried dependency.

SKX shuts down the vector ALUs on port 1 while 512b uops are in flight, so the shuffle+blend strategy with duplicated work would probably bottleneck on ALU throughput and still only manage one vector per 2 clocks.  clang's output (what gcc should emit for its current strategy) is 15 front-end uops, or 3.75 cycles for the front-end, so gcc will bottleneck just barely on shuffle throughput, assuming vmulpd and vaddpd never get scheduled onto port 5 and steal a cycle from a shuffle.  (That should be rare, because there's much less pressure on port 0.)
Comment 3 rguenther@suse.de 2017-09-13 07:44:28 UTC
On Tue, 12 Sep 2017, peter at cordes dot ca wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
> 
> --- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
> (In reply to Richard Biener from comment #1)
> > Interesting idea.  It's probably a bit hard to make the vectorizer do this
> > though, given its current structure and the fact that it would have to
> > cost the extra ops against the saved shuffling (the extra cost of the ops
> > would depend on the availability of spare execution resources, for example).
> 
>  Shuffles take execution resources just like anything else (except for
> load+broadcast (vbroadcastss or movddup, or even movshdup) which is done as
> part of a load uop in Ryzen and Intel CPUs).

I was concerned about cases where we don't just have one operation but
a computation sequence where you go from SHUF + n * op + SHUF to
n * 2 * op + BLEND.  GCC already knows how to handle a mix of two
operators & blend for addsubpd support -- it just does the blending
in a possibly unwanted position and nothing later optimizes shuffles
through transparent operations.  Extending this support to handle
plus and mult yields

pairs_double:
.LFB0:
        .cfi_startproc
        leaq    81920(%rdi), %rax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovapd (%rdi), %ymm0
        addq    $32, %rdi
        vpermpd $160, %ymm0, %ymm2
        vpermpd $245, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rax, %rdi
        jne     .L2
        vzeroupper

the vectorizer generates

  <bb 3> [50.00%] [count: INV]:
  # ivtmp.17_24 = PHI <ivtmp.17_2(2), ivtmp.17_1(3)>
  _3 = (void *) ivtmp.17_24;
  vect_x_14.2_23 = MEM[(double *)_3];
  vect_x_14.3_22 = VEC_PERM_EXPR <vect_x_14.2_23, vect_x_14.2_23, { 0, 0, 2, 2 }>;
  vect_y_15.7_10 = VEC_PERM_EXPR <vect_x_14.2_23, vect_x_14.2_23, { 1, 1, 3, 3 }>;
  vect__7.8_9 = vect_y_15.7_10 * vect_x_14.3_22;
  vect__7.9_40 = vect_y_15.7_10 + vect_x_14.3_22;
  _35 = VEC_PERM_EXPR <vect__7.8_9, vect__7.9_40, { 0, 5, 2, 7 }>;
  MEM[(double *)_3] = _35;
  ivtmp.17_1 = ivtmp.17_24 + 32;
  if (ivtmp.17_1 != _5)
    goto <bb 3>; [98.00%] [count: INV]

in this case.  Not sure if this is optimal.  Now the question is
whether it is profitable to enable this for more than the plus/minus
combination in general.  As I noted GCC does a poor job optimizing
for example

static const int aligned = 1;

void pairs_double(double blocks[]) {
    if(aligned) blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0 ; i<10240 ; i+=2) {
        double x = blocks[i];
        double y = blocks[i+1];
        double tem = x * y;
        double tem2 = x + y;
        tem = tem + 3;
        tem2 = tem2 * 2;
        blocks[i] = tem;
        blocks[i+1] = tem2;
    }
}

which ends up as

  <bb 3> [50.00%] [count: INV]:
  # ivtmp.19_26 = PHI <ivtmp.19_2(2), ivtmp.19_1(3)>
  _3 = (void *) ivtmp.19_26;
  vect_x_12.2_31 = MEM[(double *)_3];
  vect_x_12.3_30 = VEC_PERM_EXPR <vect_x_12.2_31, vect_x_12.2_31, { 0, 0, 2, 2 }>;
  vect_y_13.7_24 = VEC_PERM_EXPR <vect_x_12.2_31, vect_x_12.2_31, { 1, 1, 3, 3 }>;
  vect_tem_14.8_23 = vect_y_13.7_24 * vect_x_12.3_30;
  vect_tem_14.9_22 = vect_y_13.7_24 + vect_x_12.3_30;
  _21 = VEC_PERM_EXPR <vect_tem_14.8_23, vect_tem_14.9_22, { 0, 5, 2, 7 }>;
  vect_tem_16.10_7 = _21 + { 3.0e+0, 2.0e+0, 3.0e+0, 2.0e+0 };
  vect_tem_16.11_41 = _21 * { 3.0e+0, 2.0e+0, 3.0e+0, 2.0e+0 };
  _42 = VEC_PERM_EXPR <vect_tem_16.10_7, vect_tem_16.11_41, { 0, 5, 2, 7 }>;
  MEM[(double *)_3] = _42;
  ivtmp.19_1 = ivtmp.19_26 + 32;
  if (ivtmp.19_1 != _5)
    goto <bb 3>; [98.00%] [count: INV]

and thus

.L2:
        vmovapd (%rdi), %ymm0
        addq    $32, %rdi
        vpermpd $160, %ymm0, %ymm2
        vpermpd $245, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vaddpd  %ymm3, %ymm0, %ymm1
        vmulpd  %ymm3, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rax, %rdi
        jne     .L2

We are missing a pass that would move and combine VEC_PERM_EXPRs across operations.
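
To illustrate, for the second example a single in-lane permute plus one final blend per store is enough once the +3 / *2 are folded into the duplicated lanes before blending -- roughly what such a pass should produce (hand-written sketch, not vectorizer output; the function name is made up):

#include <immintrin.h>

void pairs_double2_folded(double blocks[]) {
    blocks = __builtin_assume_aligned(blocks, 64);
    const __m256d c3 = _mm256_set1_pd(3.0);
    const __m256d c2 = _mm256_set1_pd(2.0);
    for (int i = 0; i < 10240; i += 4) {
        __m256d v  = _mm256_load_pd(blocks + i);
        __m256d sw = _mm256_permute_pd(v, 0x5);                    // swap within each pair
        __m256d t1 = _mm256_add_pd(_mm256_mul_pd(v, sw), c3);      // x*y + 3 (all lanes)
        __m256d t2 = _mm256_mul_pd(_mm256_add_pd(v, sw), c2);      // (x+y) * 2 (all lanes)
        _mm256_store_pd(blocks + i, _mm256_blend_pd(t1, t2, 0xA)); // one blend per store
    }
}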

Thanks for the detailed write-up on how AVX512 works compared to AVX2;
it does indeed look like an improvement!  (Until AMD cripples it
by splitting it into two AVX256 halves... ;))