Bug 80820 - _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on: 81616
Reported: 2017-05-18 20:25 UTC by Peter Cordes
Modified: 2019-04-18 14:07 UTC
CC List: 4 users

See Also:
Target: x86_64-*-*, i?86-*-*
Known to work:
Known to fail:
Last reconfirmed: 2017-05-19 00:00:00


Description Peter Cordes 2017-05-18 20:25:53 UTC
gcc with -mtune=generic likes to bounce through memory when moving data from integer registers to xmm for things like _mm_set_epi32.

There are 3 related tuning issues here:

* -mtune=haswell -mno-sse4 still uses one store/reload for _mm_set_epi64x.

* -mtune=znver1 should definitely favour movd/movq instead of store/reload.
  (Ryzen has 1 m-op movd/movq between vector and integer with 3c latency, shorter than store-forwarding.  All the reasons to favour store/reload on other AMD uarches are gone.)

* -mtune=generic should probably favour movd/movq.  I think it's better for a weighted-average of CPUs we care about for -mtune=generic.  Most of the text below is an attempt to back up this claim, but I don't have hardware to test with so all I can do is look at Agner Fog's tables and microarch pdf.

 movd is about break-even on Bulldozer, better on SnB-family, much better on Core2/Nehalem, and significantly worse only on AMD K8/K10.  Or maybe use a hybrid strategy that does half with movd and half with store/reload, which can actually be better than either strategy alone on Bulldozer and SnB-family.


The tune=haswell issue is maybe separate from the others, since gcc already knows that bouncing through memory isn't the optimal strategy.

#include <immintrin.h>
__m128i combine64(long long a, long long b) {
  return _mm_set_epi64x(b,a);
}

gcc8 -O3 -mtune=haswell emits:

        movq    %rsi, -16(%rsp)
        movq    %rdi, %xmm0
        movhps  -16(%rsp), %xmm0

(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819 for the wasted store with -msse4 -mno-avx).

I think what clang and ICC do is optimal for the SSE2-only case, for Intel CPUs and Ryzen:

        movq    %rsi, %xmm1
        movq    %rdi, %xmm0
        punpcklqdq      %xmm1, %xmm0
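As a lane-order sanity check, the intended result can be verified with plain SSE2 intrinsics (a hypothetical harness, not part of the report; `combine64_lanes` is an illustration-only helper):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* The report's test case: _mm_set_epi64x(e1, e0) places e0 in the low
   qword, so combine64(a, b) puts a in lane 0 and b in lane 1. */
static inline __m128i combine64(long long a, long long b) {
    return _mm_set_epi64x(b, a);
}

/* Hypothetical helper (not in the report) to inspect the lanes. */
static inline void combine64_lanes(int64_t out[2], long long a, long long b) {
    _mm_storeu_si128((__m128i *)out, combine64(a, b));
}
```

Whatever instruction selection the compiler picks (store/reload, movq+movhps, or movq+punpcklqdq), the lane contents must come out the same.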

_mm_set_epi32(d,c,b,a) with -mtune=haswell gives us the expected movd/punpck (without SSE4), no store/reload.


Using movd or movq instead of a store/reload is a code-size win: movd %eax, %xmm0 is 4 bytes (or 5 with a REX prefix for movq or high registers).  Store/reload to -0x10(%rsp) is 10, 11, or 12 bytes, depending on operand size and high register(s).

movd int->xmm is lower latency than store/reload on most CPUs, especially Intel SnB-family where it's 1c latency, and also AMD Ryzen.   On SnB family, store/reload's only advantage is rare cases where port5 is a throughput bottleneck and latency isn't important.

It replaces a store and a load uop with 1 ALU uop on Intel Core2 and later, and Atom/Silvermont/KNL.  Also 1 uop on VIA Nano.

movd int->xmm is 2 ALU uops on AMD K10/Bulldozer-family and Jaguar, and P4, and 3 on K8/Bobcat.  It never costs any more total uops for the front-end (since a movd load is 2 uops on K8/Bobcat), but decoding a multi-uop instruction can sometimes be a bottleneck (especially on K8 where a 3 m-op instruction is a "vectorpath" (microcode)).

Store/reload has one per clock throughput on every CPU, AFAIK.  On most CPUs that have much weight in -mtune=generic, movd's throughput is one-per-clock or better.  (According to Agner Fog's tables, only Bobcat, K8/K10, and P4 have throughput of one per 2 or 3 clocks for movd/movq int->xmm).  The biggest problem is K10, with something like one per 2.8c throughput (according to a couple reports from http://users.atw.hu/instlatx64/, e.g.  http://users.atw.hu/instlatx64/AuthenticAMD0100FA0_K10_Thuban_InstLatX64.txt).  Agner Fog says 3, but none of these are measuring with other instructions mixed in.

Some CPUs have better than one-per-clock throughput for movd/movq: Core2 is 0.5c, and Nehalem is 0.33c.  So do we hurt them a lot to help Phenom II?  I'd guess that Core2+Nehalem have somewhat more weight in tune=generic than K10.  Some AMD Phenom II CPUs are still around, though.  (But we could exclude them for code built with -mssse3.)


Probably the deciding factor for tune=generic is whether it hurts AMD Bulldozer-family significantly or at all.  It looks like there's not much difference either way: similar throughput and latency.

However, store/reload may have an advantage when two cores in a cluster are competing for their shared vector unit.  Probably both of movd's macro-ops need to run on the shared vector unit, but for store/reload maybe only the load needs the shared resource.  IDK if this is correct or relevant, though.  Probably -mtune=bdver* should keep using store/reload, but this might not be enough of a reason to stop -mtune=generic from using movd.

Agner Fog's microarch pdf (Bulldozer section 18.11) says:

  > Nevertheless, I cannot confirm that it is faster to move data from a general purpose register
  > to a vector register through a memory intermediate, as recommended in AMD's optimization guide.

That AMD optimization guide advice may have been left over from K8/K10, where movd/movq from integer->vector has bad throughput.

As far as latency goes, scalar store -> vector reload is 10c on Bulldozer according to Agner Fog's numbers, while movd/movq is 10c on Bulldozer/Piledriver, and 5c on Steamroller.  (Steamroller also appears to have reduced the store-forwarding latency to 6c.  Agner's tables are supposed to have the store+load latencies add up to the store-forwarding latency.)

Store/reload is 2 instructions / 2 m-ops, but movd or movq is 1 instruction / 2 m-ops.  This is mostly ok for the decoders, but bdver1 can't decode in a 2-2 pattern (ver2/ver3 can).

Scheduling instructions to avoid consecutive multi-uop instructions may help decode throughput on bdver1.  But pairs of 2 m-op instructions are good on bdver2 and later.

With SSE4, pinsrd/q is probably good, because it's still only 2 m-ops on Bulldozer-family.  Indeed, -mtune=bdver1 uses 2x store/reload and 2x pinsrd for _mm_set_epi32(d,c,b,a):

        movl    %edx, -12(%rsp)
        movd    -12(%rsp), %xmm1
        movl    %edi, -12(%rsp)
        movd    -12(%rsp), %xmm0
        pinsrd  $1, %ecx, %xmm1
        pinsrd  $1, %esi, %xmm0
        punpcklqdq      %xmm1, %xmm0
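For reference, the source that presumably produces the bdver1 sequence above is the 32-bit analogue of the earlier combine64 test case (a sketch; `combine32_lanes` is an illustration-only helper, not from the report):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* 32-bit analogue of the report's combine64: _mm_set_epi32(e3,e2,e1,e0)
   puts e0 in the lowest dword, so a..d land in lanes 0..3. */
static inline __m128i combine32(int a, int b, int c, int d) {
    return _mm_set_epi32(d, c, b, a);
}

/* Hypothetical helper (not in the report) to inspect the lanes. */
static inline void combine32_lanes(int32_t out[4], int a, int b, int c, int d) {
    _mm_storeu_si128((__m128i *)out, combine32(a, b, c, d));
}
```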

Even better would probably be

        movd    %edx, %xmm1
        movl    %edi, -12(%rsp)
        pinsrd  $1, %ecx, %xmm1    # for bdver2, schedule so it can decode in a 2-2 pattern with the other pinsrd
        movd    -12(%rsp), %xmm0
        pinsrd  $1, %esi, %xmm0
        punpcklqdq      %xmm1, %xmm0

The store/reload can happen in parallel with the direct movd int->xmm1.  This would be pretty reasonable for tune=generic, and should run well on Intel SnB-family CPUs.


For -msse4 -mtune=core2, -mtune=nehalem, probably this is optimal:

        movd    %edi, %xmm0
        pinsrd  $1, %esi, %xmm0
        pinsrd  $2, %edx, %xmm0
        pinsrd  $3, %ecx, %xmm0

movd can run on any port and pinsrd is only 1 uop.  So this has a total latency of 2 + 3*1 = 5c on Core2 Wolfdale.  (First-gen core2 doesn't have SSE4.1).  Front-end bottlenecks are more common on Core2/Nehalem since they don't have a uop-cache, so fewer instructions is probably a good bet even at the expense of latency.

It might not be worth the effort to get gcc to emit this for Core2/Nehalem, since they're old and getting less relevant all the time.

It may also be good for -mtune=silvermont or KNL, though, since they also have 1-uop pinsrd/q.  But with 3c latency for pinsrd, the lack of ILP may be a big problem.  Also, decode on Silvermont without VEX will stall if the pinsrd needs a REX prefix (too many prefixes).  KNL should always use VEX or EVEX to avoid that.
Comment 1 Richard Biener 2017-05-19 08:45:51 UTC
Confirmed.  We need to revisit a lot of the little details for generic tuning for recent GCC.
Comment 2 Peter Cordes 2017-05-19 21:47:42 UTC
See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833.

gcc -m32 does an even worse job of getting int64_t into an xmm reg, e.g. as part of a 64-bit atomic store.

We get a store-forwarding failure from code like this, even with -march=haswell

        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        movq    (%esp), %xmm0
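The kind of source that produces this pattern is roughly the following (a sketch; the report does not give the exact test case, and `store64_via_xmm` is a hypothetical name):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sketch: an 8-byte store done through an xmm register, as used for
   64-bit atomic stores in 32-bit code.  With -m32, the int64_t value
   arrives in a register pair, and gcc reassembles it with two 32-bit
   stores plus a movq reload -- the store-forwarding failure above. */
static inline void store64_via_xmm(int64_t *p, int64_t v) {
    _mm_storel_epi64((__m128i *)p, _mm_set_epi64x(0, v));
}
```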
Comment 3 Peter Cordes 2017-05-19 21:59:58 UTC
Also, going the other direction is not symmetric.  On some CPUs, a store/reload strategy for xmm->int might be better even if an ALU strategy for int->xmm is best.

Also, the choice can depend on chunk size, since loads are cheap (2 per clock for AMD since K8 and Intel since SnB).  And store-forwarding works.

Doing the first one with movd and the next with store/reload might be good, too, on some CPUs, especially if there's independent work that can consume the movd result early.

I also discussed some of this at the bottom of the first post in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833.
Comment 4 Venkataramanan 2017-08-22 06:42:50 UTC
(In reply to Peter Cordes from comment #0)
> gcc with -mtune=generic likes to bounce through memory when moving data from
> integer registers to xmm for things like _mm_set_epi32.
> There are 3 related tuning issues here:
> * -mtune=haswell -mno-sse4 still uses one store/reload for _mm_set_epi64x.
> * -mtune=znver1 should definitely favour movd/movq instead of store/reload.
>   (Ryzen has 1 m-op movd/movq between vector and integer with 3c latency,
> shorter than store-forwarding.  All the reasons to favour store/reload on
> other AMD uarches are gone.)

Yes for Ryzen, using direct move instructions should be better than using store-forwarding.
Comment 5 Peter Cordes 2018-06-10 02:01:29 UTC
AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop replacement for vpinsrq/d, which is 2 uops on Intel/AMD.

See my answer on https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs.  I don't have access to real hardware, but according to reported uop counts, this should be very good: 1 uop per instruction on Skylake-avx512 or KNL.

vmovq         xmm0, rax                        1 uop p5   2c latency
vpbroadcastq  xmm0{k1}, rdx   ; k1 = 0b0010    1 uop p5   3c latency
vpbroadcastq  ymm0{k2}, rdi   ; k2 = 0b0100    1 uop p5   3c latency
vpbroadcastq  ymm0{k3}, rsi   ; k3 = 0b1000    1 uop p5   3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64.

(For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine 256-bit halves.  Also means only 3 k registers instead of 7)

vpbroadcastq  zmm0{k4}, rcx   ; k4 = 0b10000    3c latency
... filling up the ZMM reg

Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

    mov      eax, 2          # 0b0010
    kmovw    k1, eax
    KSHIFTLW k2, k1, 1
    KSHIFTLW k3, k1, 2

  #  KSHIFTLW k4, k1, 3

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving from integer registers would just cost extra instructions to set up integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts, not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.  We could start with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs. mov eax,2 / kmovw k1, eax being p0156 + p5.

Loading k registers from memory is not helpful: according to IACA, it costs 3 uops.  (But that includes p237, and a store-AGU uop makes no sense, so it might be wrong.)
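The masked-broadcast idea above can be sketched with intrinsics roughly as follows (assumptions: `_mm256_mask_set1_epi64` is the AVX-512VL merge-masking broadcast intrinsic, the compiler materializes the k-register constants itself, and a YMM version is shown for brevity; `gather4_i64`/`gather4_to` are hypothetical names):

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch of comment 5's idea: each masked vpbroadcastq merges one
   64-bit element into the vector with a single uop (per reported
   uop counts on SKX/KNL). */
__attribute__((target("avx512f,avx512vl")))
static __m256i gather4_i64(int64_t a, int64_t b, int64_t c, int64_t d) {
    __m256i v = _mm256_castsi128_si256(_mm_cvtsi64_si128(a)); /* vmovq */
    v = _mm256_mask_set1_epi64(v, 0x2, b); /* lane 1 */
    v = _mm256_mask_set1_epi64(v, 0x4, c); /* lane 2 */
    v = _mm256_mask_set1_epi64(v, 0x8, d); /* lane 3 */
    return v;
}

/* Helper so callers built without AVX-512 flags can inspect the lanes. */
__attribute__((target("avx512f,avx512vl")))
static void gather4_to(int64_t out[4], int64_t a, int64_t b,
                       int64_t c, int64_t d) {
    _mm256_storeu_si256((__m256i *)out, gather4_i64(a, b, c, d));
}
```

The `target` attributes let this compile without global -mavx512vl; a caller should still gate execution on a runtime CPUID check.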
Comment 6 Andrew Pinski 2018-11-12 01:58:35 UTC
*** Bug 87976 has been marked as a duplicate of this bug. ***