Bug 67461 - Multiple atomic stores generate a StoreLoad barrier between each one, not just at the end
Summary: Multiple atomic stores generate a StoreLoad barrier between each one, not just at the end
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 5.2.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
 
Reported: 2015-09-05 12:44 UTC by Peter Cordes
Modified: 2021-10-02 18:43 UTC

Description Peter Cordes 2015-09-05 12:44:24 UTC
Multiple atomic stores in a row generate multiple barriers.

I noticed this while playing around with the same code that led to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458.  This is a separate issue, but if I've left out any context here, see that bug.

I suspect this is a case of correctness trumping performance, since atomics are still new.  These are probably just missed optimizations for unusual use-cases, and fixing them is unlikely to benefit real code (unless it over-uses atomics).

#include <atomic>
std::atomic<int> a, c;
void simple_set(void){ a=1; a=1; a=3; c=2; a=3; }

compiles to (x86, g++ 5.2.0 -O3, on godbolt.org):

	movl	$1, %eax
	movl	%eax, a(%rip)  # a=1
	mfence
	movl	%eax, a(%rip)  # a=1
	movl	$3, %eax
	mfence
	movl	%eax, a(%rip)  # a=3
	mfence
	movl	$2, c(%rip)    # c=2
	mfence
	movl	%eax, a(%rip)  # a=3
	mfence
	ret

First, does the C++ standard actually require the machine code to perform multiple back-to-back stores to the same variable when there are no intervening loads or stores in the source?  I would have thought that at least a=1; a=1; would collapse to a single store.

Consider
  initially: a=0
  thread1: a=1; a=1;
  thread2: tmp=a.exchange(2); tmp2=a;

These operations have to happen in some order, but isn't the compiler allowed to make decisions at compile time that eliminate some of the possible orders?  e.g. collapsing both a=1 operations into a single store would make this interleaving impossible:

  a=1; tmp=a.exchange(2); a=1; tmp2=a; 

But the remaining orderings are valid, and I think it would be an error for software to depend on that interleaved ordering being possible.  Does the standard require generation of machine code that can end up with tmp=1, tmp2=1?  If it does, then this isn't a bug.  >.<
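
For concreteness, here's a runnable version of that scenario (the std::thread scaffolding and main() are my illustration, not part of the original test case):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> a{0};

int main() {
    int tmp = -1, tmp2 = -1;

    std::thread t1([]  { a = 1; a = 1; });   // thread1
    std::thread t2([&] {                     // thread2
        tmp  = a.exchange(2);   // seq_cst exchange
        tmp2 = a;               // seq_cst load
    });
    t1.join();
    t2.join();

    // With the two a=1 stores kept separate, tmp==1 && tmp2==1 is possible
    // (the exchange can land between them).  If the compiler collapses them
    // into one store, that outcome can never be observed.
    std::printf("tmp=%d tmp2=%d\n", tmp, tmp2);
}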

More generally, collapsing a=1; a=3;  into a single store should be ok for the same reason.

A producer thread doing stores separated by StoreStore barriers, feeding a consumer thread doing loads separated by LoadLoad barriers, gives no guarantee that the consumer won't miss some events.
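
Here's a sketch of that point (the event counter and the polling loop are my invention, just to illustrate): even with every store ordered, a polling consumer can see the counter jump straight from 0 to 3, missing the intermediate values.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> event{0};

int main() {
    std::thread producer([] {
        for (int i = 1; i <= 3; ++i)
            event.store(i, std::memory_order_release);  // stores stay ordered
    });
    std::thread consumer([] {
        int last = 0;
        while (last != 3) {
            int v = event.load(std::memory_order_acquire);  // loads stay ordered
            if (v != last) {
                std::printf("saw %d\n", v);  // may well print only "saw 3"
                last = v;
            }
        }
    });
    producer.join();
    consumer.join();
}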

-----------

There are no loads between the stores, so I don't understand having multiple StoreLoad barriers (mfence), unless that's just a missing optimization, too.

Are the mfence instructions between the stores supposed to protect a signal handler from something?  An interrupt could come in after the first store, but before the first mfence.  (clang uses xchg, which is implicitly locked, for each seq_cst atomic store; that leaves no window for an interrupt between the store and the barrier.)

 I guess if the signal handler sees a=3, it knows that the mfence between a=1 and a=3 has already happened, but not necessarily the mfence after a=3.

 If these extra mfences in a sequence of stores are just for the potential benefit of a signal handler, doesn't that already happen as part of the context switch to/from the kernel?  It seems very inefficient that a sequence of stores to multiple atomic variables produces multiple mfences.

 It's worse on ARM, where there's a full memory barrier before and after every atomic store, so two stores in a row produce two consecutive memory barriers:

	dmb	sy
	movs	r2, #1
	str	r2, [r3]   # a=1
	dmb	sy
	dmb	sy
	str	r2, [r3]   # a=1
	dmb	sy
Comment 1 Andrew Pinski 2015-12-28 05:04:01 UTC
Hmm, I think there needs to be a barrier between each store as each store needs to be observed by the other threads.
Comment 2 Peter Cordes 2016-02-05 03:25:10 UTC
(In reply to Andrew Pinski from comment #1)
> Hmm, I think there needs to be a barrier between each store as each store
> needs to be observed by the other threads.

On x86, stores are already ordered wrt. other stores.  A full barrier (including a StoreLoad barrier) after the last store will prevent it from passing (appearing to happen after) any subsequent loads.

StoreStore, LoadLoad, and LoadStore barriers are implicit between every memory operation on x86 (except non-temporal ones): http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/

I *think* that's enough for sequential consistency.  If *I'm* misunderstanding this (which is possible), then please clue me in.
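
For reference, the case where the StoreLoad barrier actually matters is the classic store-buffer litmus test (my example, not from this bug): sequential consistency forbids both threads loading 0, and that guarantee is exactly what a full barrier between each thread's store and its subsequent load provides.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

int main() {
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);  // mov + mfence on x86
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    // seq_cst forbids r1==0 && r2==0; with only release/acquire
    // (no StoreLoad barrier), that outcome would be allowed.
    std::printf("r1=%d r2=%d\n", r1, r2);
}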

There's definitely a problem on ARM, though.  There's no way two consecutive dmb sy instructions are useful.
Comment 3 Andrew Pinski 2016-08-20 17:37:15 UTC
I should correct myself: there should be a barrier between each store only if we have

__atomic_store(...,__ATOMIC_SEQ_CST)
__atomic_store(...,__ATOMIC_SEQ_CST)
__atomic_store(...,__ATOMIC_SEQ_CST)

and even then, this sequence can be converted into:
__atomic_store(...,__ATOMIC_RELEASE)
__atomic_store(...,__ATOMIC_RELEASE)
__atomic_store(...,__ATOMIC_SEQ_CST)
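
At the C++ source level (my paraphrase of the above, with a made-up variable name), that transformation is:

#include <atomic>

std::atomic<int> x;

// What the user wrote: three seq_cst stores, which currently emit a
// full barrier after each one on x86.
void seq_cst_stores() {
    x.store(1, std::memory_order_seq_cst);
    x.store(2, std::memory_order_seq_cst);
    x.store(3, std::memory_order_seq_cst);
}

// What it could legally be compiled as: release stores (free on x86,
// where ordinary stores are already StoreStore-ordered) with the single
// full barrier coming from the final seq_cst store.
void weakened_equivalent() {
    x.store(1, std::memory_order_release);
    x.store(2, std::memory_order_release);
    x.store(3, std::memory_order_seq_cst);
}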
Comment 4 Peter Cordes 2016-11-15 23:26:26 UTC
(In reply to Andrew Pinski from comment #1)
> Hmm, I think there needs to be a barrier between each store as each store
> needs to be observed by the other threads.

As we agreed earlier, a full barrier is only needed after the last store, and any weaker barriers happen for free on x86.

But it turns out that the standard doesn't require other threads to even have a possibility of observing each store separately.  It's legal for observers to see a=1; a=1; a=3; happen all at once, so we can decide at compile time that that's what every observer will always see (by collapsing them into one store).

This is a fairly aggressive optimization that violates the principle of least surprise for some users, and there is ongoing discussion about clarifying the rules and giving programmers control over such optimizations.


See http://wg21.link/p0062, "When should compilers optimize atomics?",

and also http://wg21.link/n4455, "No Sane Compiler Would Optimize Atomics".

Also, there was some discussion about this on http://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num.  See the last section of my answer there (the accepted one) for the results of discussing this with Richard Hodges in comments on his answer.