Bug 80833 - 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
Summary: 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-05-19 21:41 UTC by Peter Cordes
Modified: 2018-06-10 01:02 UTC
CC List: 0 users

See Also:
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-05-22 00:00:00


Attachments
Prototype patch (1.62 KB, patch)
2017-05-24 13:33 UTC, Uroš Bizjak

Description Peter Cordes 2017-05-19 21:41:42 UTC
This affects 64-bit atomic loads/stores, as well as _mm_set_epi64x intrinsics.

gcc -m32 copies int64_t data into xmm registers with the worst possible strategy: two 32-bit stores and a 64-bit vector load, causing a store-forwarding stall.

Similarly, for getting 64-bit integers back out of xmm registers, gcc's store/reload strategy may not be optimal on Intel CPUs (though it doesn't cause a store-forwarding stall).


#include <immintrin.h>
#include <stdint.h>
__m128i combine64(int64_t a, int64_t b) {
  return _mm_set_epi64x(b, a|1);
}

b is loaded directly from memory, but the low half of `a` is modified so it can't be, letting us observe gcc's int64->xmm strategy.

gcc8-snapshot -m32 -march=haswell -mno-avx -O3 emits:

        subl    $28, %esp
        movl    32(%esp), %eax
        movl    36(%esp), %edx   # the untouched upper half of `a`
        orl     $1, %eax

        # `a` is in integer regs.  The next three lines are gcc's typical pattern for int64 -> xmm
        movl    %edx, 12(%esp)
        movl    %eax, 8(%esp)
        movq   8(%esp), %xmm0    # guaranteed store-forwarding stall, except with -mtune=atom

        movhps 40(%esp), %xmm0   # store-forwarding stall if the caller used scalar stores.
        addl    $28, %esp
        ret

A slight improvement would be to  orl $1, 4(%esp)  to do a|1 in-place, instead of copying it.  But that still has an extra load-store round-trip, so it's not good even on Atom where it wouldn't cause a store-forwarding stall.

-----

For data coming from integer registers, clearly we should use whatever strategy is optimal for _mm_set_epi32, as discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820.  movd / pinsrd should be good for -mtune=intel and for Ryzen.

(But in a case like this, with one simple integer operation on data starting in memory, avoiding integer registers entirely by doing that operation with a vector-integer instruction is worth considering; see below.)

----

For integers that are coming from memory, it matters a lot whether we expect them to have been written recently with narrow stores, except with -mtune=atom, which doesn't stall on narrow stores -> wide load.

e.g. -mtune=atom could use

        mov        $1, %eax
        movd     %eax, %xmm0   # or load the vector constant from memory
        por   4(%esp), %xmm0
        ret

If the two integers aren't adjacent, movq / movhps is good.  Other CPUs can use this, too, if reading memory that we expect wasn't recently written with narrow stores.

Doing 64-bit integer ops in vector regs is an even bigger win for ops with carry from low to high, like shift or add.  For example, if it was `a++` instead of `a|1`:

        movq    4(%esp), %xmm0   # load `a`
        pcmpeqd %xmm1,%xmm1
        psubq   %xmm1,%xmm0      # a -= -1.  1 uop vs. 3 for add/adc on Haswell
        movhps  12(%esp), %xmm0  # merge in `b`
        ret

If we were worried about store-forwarding stalls, we'd get `a` into xmm0 in two halves before the psubq.

Scalar 64-bit integer ops in vector regs may be useful in general in 32-bit code in some cases, especially if it helps with register pressure.
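
At the source level, the kind of code that benefits looks roughly like the sketch below (hypothetical function name; it mirrors the `a++` variant above, taking the operands through pointers so the loads are explicit):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: increment `a` entirely in the vector domain, then merge `b`
   into the high half, as in the movq/psubq/movhps sequence above. */
__m128i combine64_inc(const int64_t *pa, const int64_t *pb)
{
  __m128i va = _mm_loadl_epi64((const __m128i *) pa);   /* movq load of a */
  __m128i m1 = _mm_cmpeq_epi32(_mm_setzero_si128(),
                               _mm_setzero_si128());    /* all-ones (-1), off the critical path */
  va = _mm_sub_epi64(va, m1);                           /* a -= -1, i.e. a++  (psubq) */
  __m128d hi = _mm_loadh_pd(_mm_castsi128_pd(va),
                            (const double *) pb);       /* movhps-style merge of b */
  return _mm_castpd_si128(hi);
}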

----

For function args, especially near the start of a function, we should assume that anything other than separate 32-bit loads will cause a store-forwarding stall, incurring an extra 10 to 12 cycles of latency beyond the usual store-forwarding latency, depending on uarch.  i.e. very expensive, and worth spending a lot of instructions to avoid if it's at all likely, maybe even if not part of a long dep chain (since out-of-order execution may have trouble hiding that much latency).

clang4.0 -O3 -march=sandybridge -mno-avx emits this, which is very good if we have to assume that the function args were recently stored in 32-bit chunks:

        movl    4(%esp), %eax
        orl     $1, %eax
        movd    %eax, %xmm0
        pinsrd  $1, 8(%esp), %xmm0   # taking the unmodified half of `a` directly from the original memory location is a good optimization
        pinsrd  $2, 12(%esp), %xmm0
        pinsrd  $3, 16(%esp), %xmm0
        retl

pinsrd with a memory source is 2 fused-domain uops, but only one of them is an ALU uop (for the shuffle port).  The other is the load.  It never micro-fuses.

gcc's usual _mm_set_epi32 strategy of doing two 64-bit halves and merging with a shuffle would also work well with memory source data.
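
A sketch of that shape for the case where each pair of 32-bit elements happens to be adjacent in memory (hypothetical function name), so each 64-bit half is a single movq load:

#include <immintrin.h>
#include <stdint.h>

/* Sketch: build a vector from two 64-bit memory halves with movq loads
   and a punpcklqdq merge.  The two halves need not be adjacent to each other. */
__m128i combine_two_halves(const int64_t *plo, const int64_t *phi)
{
  __m128i lo = _mm_loadl_epi64((const __m128i *) plo);  /* movq */
  __m128i hi = _mm_loadl_epi64((const __m128i *) phi);  /* movq */
  return _mm_unpacklo_epi64(lo, hi);                    /* punpcklqdq */
}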



For CPUs where int->xmm is not fast, doing the OR (or whatever other ALU operation) with a vector instruction is even more attractive than on Intel, even if we still have to load 32 bits at a time.

It's also good on Intel CPUs if we can hoist the vector constant out of a loop, since int->xmm needs a port5 uop, which competes with shuffles, especially on Haswell and later, which only have 1 shuffle unit.


        movd    4(%esp), %xmm0       # a (low half)

        movd    12(%esp), %xmm2      # b
        pinsrd  $1, 16(%esp), %xmm2

        pcmpeqd %xmm1,%xmm1
        psrld   $31, %xmm1           # or load a constant from memory
        por     %xmm1, %xmm0         # a |= 1

        pinsrd  $1,  8(%esp), %xmm0  # then merge the high half of `a`, replacing the garbage in element 1

        punpcklqdq %xmm2,%xmm0
        retl

This only has 3 port5 uops, and a latency on Haswell of 3 cycles from when the first 2 loads are ready.  Since out-of-order CPUs typically run uops in oldest-ready order (with adjustments for higher-latency uops to avoid writeback conflicts), I scheduled this so the two movd loads come first, allowing the three port5 uops to run in three consecutive cycles.  (First the pinsrd of the high half of b, since the two load uops from the pinsrd instructions should be ready the cycle after the two movd uops.)  I have no idea if this really would help avoid extra resource-conflict latency for the critical path, but it can't hurt.



----------


This also affects 64-bit atomic stores and loads.

#include <atomic>
#include <stdint.h>

int64_t load64(std::atomic<int64_t> *p) {
  return p->load(std::memory_order_acquire) + 1;
}

gcc8 -m32 -mno-avx -march=haswell -O3

        subl    $12, %esp
        movl    16(%esp), %eax   # function arg
        movq    (%eax), %xmm0    # 64-bit atomic load

        movq    %xmm0, (%esp)    # gcc's store/reload strategy
        movl    (%esp), %eax
        movl    4(%esp), %edx

        addl    $1, %eax         # a++ in integer regs
        adcl    $0, %edx
        addl    $12, %esp
        ret

It would be cheaper to do a++ with paddq or psubq while we still have the value in xmm0, even counting the cost of generating the constant on the fly.  That takes 1 ALU uop for pcmpeqd, and is off the critical path.  ADC is 2 uops on Intel CPUs before Broadwell.

A lower-latency xmm->int strategy would be:

        movd    %xmm0, %eax
        pextrd  $1, %xmm0, %edx

Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other CPUs where an FP shuffle has bypass delay between integer ops)

        movd     %xmm0, %eax
        movshdup %xmm0, %xmm0  # saves 1B of code-size vs. psrldq, I think.
        movd     %xmm0, %edx

Or without SSE3,

        movd     %xmm0, %eax
        psrldq   $4,  %xmm0    # 1 m-op cheaper than pshufd on K8
        movd     %xmm0, %edx
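
For reference, the SSE4 movd/pextrd variant expressed at the source level (a sketch with a hypothetical function name; in 32-bit code the int64_t return value lands in edx:eax, so a compiler should be able to map this onto the two-instruction sequence above):

#include <immintrin.h>
#include <stdint.h>

/* Sketch (assumes SSE4.1): extract both 32-bit halves without a
   store/reload round trip. */
int64_t xmm_to_int64(__m128i v)
{
  uint32_t lo = (uint32_t) _mm_cvtsi128_si32(v);     /* movd   %xmm0, %eax */
  uint32_t hi = (uint32_t) _mm_extract_epi32(v, 1);  /* pextrd $1, %xmm0, %edx */
  return (int64_t) (((uint64_t) hi << 32) | lo);
}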


movd xmm->r32 is efficient on K10 (1 m-op with 3c latency), unlike movd r32->xmm.

On Bulldozer-family, it's only 1 m-op, and has 8c latency (or 4 on Steamroller).  Store-forwarding latency is high on Bulldozer, so movd %xmm0, %eax / pextrd is probably a win.

So avoiding a store/reload is probably a good strategy for -mtune=generic, and -mtune=bdver*.

-mtune=k8 should probably store/reload for this direction, too, because movd %xmm0,%eax is 3 m-ops with 2c latency.  And it can do 2 loads per clock, and I think forwarding both 4-byte halves of an 8-byte store works.


------

Atomic stores are more of a problem, since two 32b stores can't store-forward to a 64b load (except on Atom).


int64_t store64(std::atomic<int64_t> *p, int64_t a) {
  p->store(a, std::memory_order_release);
  return a;  // tempt gcc into loading into integer regs instead of movq
}

gcc -m32 -march=sandybridge -O3 emits the following for gcc4.7 through gcc8-snapshot (other than a regression to fildq/fistpq with gcc4.8):

        subl    $12, %esp
        movl    20(%esp), %eax
        movl    24(%esp), %edx
        movl    16(%esp), %ecx

        # at this point, the int64 and the pointer are in integer regs, like would be typical as part of a real function.
        # The next three lines are the issue
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        vmovq   (%esp), %xmm0   # store-forwarding failure

        vmovq   %xmm0, (%ecx)
        addl    $12, %esp
        ret

If the function is void, then gcc uses a movq load of the function arg (causing a store-forwarding stall if the caller used narrow stores).

If it weren't for that, it would probably be optimal to load it twice:

        movq    8(%esp), %xmm0
        movl    4(%esp), %ecx

        movl    8(%esp), %eax   # return value
        movl    12(%esp), %edx

        vmovq   %xmm0, (%ecx)
        ret

Loads are cheap: AMD since K8 and Intel since SnB can execute two loads per clock.  Unless we bottleneck on load uops (not the same thing as memory bandwidth), other CPUs like Silvermont and Nehalem will probably do well with this, too.  (Again, except for the store-forwarding issue from the caller writing the args).
Comment 1 Peter Cordes 2017-05-19 21:45:25 UTC
See https://godbolt.org/g/krXH9M for the functions I was looking at.
Comment 2 Peter Cordes 2017-05-20 01:38:51 UTC
On most CPUs, psrldq / movd is optimal for xmm[1] -> int without SSE4.  On SnB-family, movd runs on port0, and psrldq can run on port5, so they can execute in parallel.  (And the second movd can run the next cycle).

I'd suggest using movd/psrldq/movd for -mtune=generic.  (Or pshuflw to copy+shuffle if it's useful to not destroy the value in the xmm reg while extracting to integer.  pshuflw is faster than pshufd on old CPUs, and the same on current CPUs).

But for some CPUs, this is better:

    movd    %xmm0, %eax
    psrlq   $32, %xmm0
    movd    %xmm0, %edx

A 64-bit shift by 32 is much better than PSRLDQ on some CPUs, especially SlowShuffle CPUs (where xmm pshufd is slower than 64-bit granularity shuffles).

* P4: 2c latency instead of 4, and twice the throughput
* Pentium M: 2 uops instead of 4.
* Core2 merom/conroe:  1 uop instead of 2
* K8/K10: same as PSRLDQ
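
In intrinsics terms, the two shift variants correspond roughly to this sketch (hypothetical helper names; both need only SSE2):

#include <immintrin.h>

/* Sketch: two ways to bring element 1 down to element 0 before a
   movd-style extract. */
static inline int hi32_via_psrldq(__m128i v)
{
  return _mm_cvtsi128_si32(_mm_srli_si128(v, 4));   /* psrldq $4, %xmm0 */
}

static inline int hi32_via_psrlq(__m128i v)
{
  return _mm_cvtsi128_si32(_mm_srli_epi64(v, 32));  /* psrlq $32, %xmm0 */
}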
Comment 3 Peter Cordes 2017-05-20 01:39:14 UTC
Atom's movd xmm->int is slower (lat=4, rtput=2) than its movd int->xmm (lat=3, rtput=1), which is opposite of every other CPU (except Silvermont where they're the same throughput but xmm->int is 1c slower).  So very likely store/reload is the way to go for -mtune=atom, since store-forwarding is so amazingly fast (1c latency).  But maybe with SSE4 pextrd, the code-size saving is worth it.
Comment 4 Peter Cordes 2017-05-20 01:41:25 UTC
I don't think it's worth anyone's time to implement this in 2017, but using MMX regs for 64-bit store/load would be faster on really old CPUs that split 128b vector insns into two halves, like K8 and Pentium M.  Especially with -mno-sse2 (e.g. Pentium3 compat), where movlps has a false dependency on the old value of the xmm reg, but movq mm0 doesn't.  (No SSE2 means we can't MOVQ or MOVSD to an XMM reg.)

MMX is also a saving in code-size: one fewer prefix byte vs. SSE2 integer instructions.  It's also another set of 8 registers for 32-bit mode.

But Skylake has lower throughput for the MMX versions of some instructions than for the XMM version.  And SSE4 instructions like PEXTRD don't have MMX versions, unlike SSSE3 and earlier (e.g. pshufb mm0, mm1 is available, and on Conroe it's faster than the xmm version).
Comment 5 Richard Biener 2017-05-22 07:46:19 UTC
There's some related bugs.  I think there is no part of the compiler that specifically tries to avoid store forwarding issues.
Comment 6 Peter Cordes 2017-05-22 20:27:27 UTC
(In reply to Richard Biener from comment #5)
> There's some related bugs.  I think there is no part of the compiler that
> specifically tries to avoid store forwarding issues.

Ideally the compiler would keep track of how stores were done (or likely done for code that's not visible), but without that:

For data coming from integer registers, a pure ALU strategy (with movd/q and punpck or pinsrd/q) should be a win on all CPUs over narrow-store -> wide-load, except maybe in setup for long loops where small code / fewer uops is a win, or if there's other work that hides the latency of a store-forwarding stall.

The other exception is maybe -mtune=atom without SSE4.  (But ALU isn't bad there, so adding a special-case just for old in-order Atom might not make sense.)

---

For data starting in memory, a simple heuristic might be: vector loads wider than a single integer reg are good for arrays, or for anything other than scalar locals / args on the stack that happen to be contiguous.

We need to make sure such a heuristic never results in auto-vectorizing with movd/pinsrd loads from arrays, instead of movdqu.  However, it might be appropriate to use a movd/pinsrd strategy for _mm_set_epi32, even if the data happens to be contiguous in memory.  In that case, the programmer can use a _mm_loadu_si128 (and a struct or array to ensure adjacency).
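
For example (a sketch with hypothetical names), the programmer-visible difference between gathering scalars and loading data that is known to be contiguous:

#include <immintrin.h>
#include <stdint.h>

/* Scalars gathered into a vector: fair game for a movd/pinsrd strategy. */
__m128i gather_scalars(int32_t a, int32_t b, int32_t c, int32_t d)
{
  return _mm_set_epi32(d, c, b, a);
}

struct four_ints { int32_t v[4]; };

/* Data the programmer knows is contiguous: spell it as a load. */
__m128i load_contiguous(const struct four_ints *p)
{
  return _mm_loadu_si128((const __m128i *) p->v);   /* single movdqu */
}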

It's less clear what to do about int64_t in 32-bit mode, though, without a good mechanism to track how it was recently written.  Always using movd/pinsrd for locals / args is not horrible, but would suck for structs in memory if the programmer is assuming that they'll get an efficient MOVQ/MOVHPS.


A function that takes a read-write int64_t *arg might often get called right after the pointed-to data is written.  In 32-bit code, we need it in integer registers to do anything but copy it.  If we're just copying it somewhere else, hopefully a store-forwarding stall isn't part of the critical path.  I'm not sure how long it takes for a store to complete, and no longer need to be forwarded.  The store buffer can't commit stores to L1 until they retire (and then it has to go in-order to preserve x86 memory ordering), so even passing a pointer on the stack (store/reload with successful forwarding) probably isn't nearly enough latency for a pair of stores in the caller to be actually committed to L1.

A store-forwarding "stall" doesn't actually stall the whole pipeline, or even unrelated memory ops, AFAIK.  My understanding is that it just adds latency to the affected load while out-of-order execution continues as normal.  There may be some throughput limitations on how many failed-store-forwarding loads can be in flight at once: I think it works by scanning the store buffer for all overlapping stores, if the last store that wrote any of the bytes isn't able to use the forwarding fast-case (either because of sub-alignment restrictions or only partial overlap).  It doesn't have to drain the store buffer, though.

Obviously every uarch can have its own quirks, but this seems the most likely explanation for a latency penalty that's a constant number of cycles.

AFAIK, the store-forwarding stall penalty can't start until the load-address is ready, since AFAIK no major x86 CPUs do address-prediction for loads.  So the 6 + 10c latency for an SSE load on SnB with failed store-forwarding would be from when the address becomes ready to when the value becomes ready.  I might be mistaken, though.  Maybe it helps if the store executed several cycles before the load-address was ready, so 32-bit code using a MOVQ xmm load on an int64_t* won't suffer as badly if it got the address from a stack arg, and did some other work before the load.
Comment 7 Uroš Bizjak 2017-05-24 13:33:19 UTC
Created attachment 41412 [details]
Prototype patch

Patch that emits mov/pinsr or mov/pextr pairs for DImode (x86_32) and TImode (x86_64) moves.
Comment 8 Uroš Bizjak 2017-05-24 13:37:25 UTC
The patch from comment #7 generates:

a) DImode move for 32 bit targets:

--cut here--
long long test (long long a)
{
  asm ("" : "+x" (a));
  return a;
}
--cut here--

gcc -O2 -msse4.1 -mtune=intel -mregparm=2:

        movd    %eax, %xmm0
        pinsrd  $1, %edx, %xmm0
        movq    %xmm0, (%esp)       <<-- unneeded store due to RA problem
        movd    %xmm0, %eax
        pextrd  $1, %xmm0, %edx
        leal    12(%esp), %esp

b) TImode move for 64 bit targets:

--cut here--
__int128 test (__int128 a)
{
  asm ("" : "+x" (a));
  return a;
}
--cut here--

gcc -O2 -msse4.1 -mtune=intel

        movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0
        pextrq  $1, %xmm0, %rdx
        movq    %xmm0, %rax
Comment 9 Uroš Bizjak 2017-05-24 13:44:18 UTC
(In reply to Uroš Bizjak from comment #8)
>         movq    %xmm0, (%esp)       <<-- unneeded store due to RA problem

For some reason, reload "fixes" direct DImode register moves and passes the value via memory. Later passes partially merge these moves, but leave the above insn.
Comment 10 Uroš Bizjak 2017-05-24 13:51:39 UTC
(In reply to Peter Cordes from comment #0)

> Scalar 64-bit integer ops in vector regs may be useful in general in 32-bit
> code in some cases, especially if it helps with register pressure.

We have a scalar-to-vector pass (-mstv) that does the above, but it chooses not to convert the above code due to costs.
Comment 11 Uroš Bizjak 2017-05-24 13:55:21 UTC
(In reply to Peter Cordes from comment #0)
> A lower-latency xmm->int strategy would be:
> 
>         movd    %xmm0, %eax
>         pextrd  $1, %xmm0, %edx

Proposed patch implements the above for generic moves.

> Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other
> CPUs where an FP shuffle has bypass delay between integer ops)
> 
>         movd     %xmm0, %eax
>         movshdup %xmm0, %xmm0  # saves 1B of code-size vs. psrldq, I think.
>         movd     %xmm0, %edx
> 
> Or without SSE3,
> 
>         movd     %xmm0, %eax
>         psrldq   $4,  %xmm0    # 1 m-op cheaper than pshufd on K8
>         movd     %xmm0, %edx

The above two proposals are not suitable for generic moves. We should not clobber the input value, and we are not allowed to use a temporary.
Comment 12 Uroš Bizjak 2017-05-24 13:57:22 UTC
(In reply to Peter Cordes from comment #4)
> MMX is also a saving in code-size: one fewer prefix byte vs. SSE2 integer
> instructions.  It's also another set of 8 registers for 32-bit mode.

After touching an MMX register, the compiler needs to emit an emms insn, so MMX moves are practically unusable as generic moves.
Comment 13 uros 2017-05-30 17:18:56 UTC
Author: uros
Date: Tue May 30 17:18:25 2017
New Revision: 248691

URL: https://gcc.gnu.org/viewcvs?rev=248691&root=gcc&view=rev
Log:
	PR target/80833
	* config/i386/constraints.md (Yd): New constraint.
	(Ye): Ditto.
	* config/i386/i386.md (*movti_internal): Add (?r, Ye)
	and (?Yd, r) alternatives.  Update insn attributes.
	(*movdi_internal): Add (?r, *Ye) and (?*Yd, r) alternatives.
	Update insn attributes.
	(double-mode inter-unit splitters): Add new GR<->XMM splitters.

testsuite/ChangeLog:

	PR target/80833
	* gcc.target/i386/pr80833-1.c: New test.
	* gcc.target/i386/pr80833-2.c: Ditto.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr80833-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr80833-2.c
Modified:
    trunk/gcc/config/i386/constraints.md
    trunk/gcc/config/i386/i386.md
Comment 14 Peter Cordes 2018-06-10 01:02:49 UTC
I happened to look at this old bug again recently.

re: extracting the high of the low two 32-bit elements:

(In reply to Uroš Bizjak from comment #11)
> > Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other
> > CPUs where an FP shuffle has bypass delay between integer ops)
> > 
> >         movd     %xmm0, %eax
> >         movshdup %xmm0, %xmm0  # saves 1B of code-size vs. psrldq, I think.
> >         movd     %xmm0, %edx
> > 
> > Or without SSE3,
> > 
> >         movd     %xmm0, %eax
> >         psrldq   $4,  %xmm0    # 1 m-op cheaper than pshufd on K8
> >         movd     %xmm0, %edx
> 
> The above two proposals are not suitable for generic moves. We should not
> clobber the input value, and we are not allowed to use a temporary.

SSE3 movshdup broadcasts the high element within each pair of 32-bit elements, so

   movshdup  %xmm0, %xmm1
   movd      %xmm1, %eax

saves a byte of code vs. pshufd / movd, saves a uop on Merom, and avoids a flt->int bypass.  (According to Agner Fog's tables, pshufd is in the flt->int domain, i.e. it wants its input in the float domain, while movshdup, ironically, is only an integer shuffle.)
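
Expressed at the source level, that non-destructive extraction is roughly the following sketch (hypothetical function name; assumes SSE3):

#include <immintrin.h>

/* Sketch: extract element 1 without clobbering the source register,
   using movshdup into a scratch register. */
static inline int hi32_via_movshdup(__m128i v)
{
  __m128 dup = _mm_movehdup_ps(_mm_castsi128_ps(v));  /* movshdup %xmm0, %xmm1 */
  return _mm_cvtsi128_si32(_mm_castps_si128(dup));    /* movd     %xmm1, %eax */
}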

Probably not worth looking for that optimization, though, because it's not worth using universally (Nehalem has worse latency for float shuffles between int instructions).


With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom / K8 (slowshuffle CPUs where PSHUFD is multiple uops).  It's not slower on any current CPUs.  I could imagine some future CPU having better throughput for 32-bit element size shuffles than 16-bit, though.  That's already the case for wider lane-crossing shuffles (VPERMW YMM is multiple uops on Skylake-AVX512).  This would be a definite win for tune=core2 or k8, and Pentium M, but those are so old it's probably not worth adding extra code to look for it.

I think it's pretty future-proof, though, unless Intel or AMD add an extra shuffle unit for element sizes of 32-bit or wider on another port.