Bug 65697 - __atomic memory barriers not strong enough for __sync builtins
Summary: __atomic memory barriers not strong enough for __sync builtins
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 5.0
Importance: P3 normal
Target Milestone: 5.3
Assignee: mwahab
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2015-04-08 11:25 UTC by Matthew Wahab
Modified: 2015-10-05 08:30 UTC
8 users

See Also:
Host:
Target: aarch64 arm
Build:
Known to work:
Known to fail: 4.9.4
Last reconfirmed: 2015-06-11 00:00:00


Attachments
potential patch to add MEMMODEL_SYNC (3.47 KB, patch)
2015-04-29 16:04 UTC, Andrew Macleod
implement SYNC flag for memory model (9.01 KB, patch)
2015-05-06 14:25 UTC, Andrew Macleod

Description Matthew Wahab 2015-04-08 11:25:23 UTC
The __sync builtins are implemented by expanding them to the equivalent __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers. This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow some data references to move across the operation (see https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) while a __sync full barrier should block all movement (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins).

This is a problem for AArch64, where data references after the barrier could be speculated ahead of it.

For example, compiling with aarch64-none-linux-gnu at -O2,
-----
int foo = 0;
int bar = 0;

int T1(void)
{
  int x = __sync_fetch_and_add(&foo, 1);
  return bar;
}
----
produces
----
T1:
	adrp	x0, .LANCHOR0
	add	x0, x0, :lo12:.LANCHOR0
.L2:
	ldaxr	w1, [x0]       ; load-acquire (__sync_fetch_and_add)
	add	w1, w1, 1
	stlxr	w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
	cbnz	w2, .L2
	ldr	w0, [x0, 4]    ; load (return bar)
	ret
----
With this code, the load can be executed ahead of the store-release which ends the __sync_fetch_and_add. A correct implementation should emit a dmb instruction after the cbnz.

GCC info:
gcc version 5.0.0 20150407 (experimental) 
Target: aarch64-none-linux-gnu
Comment 1 Matthew Wahab 2015-04-08 11:33:52 UTC
I'm working on this but it isn't obvious how to fix it. The strong barriers aren't needed for the __atomics and will have an effect on performance, so simply strengthening the MEMMODEL_SEQ_CST implementation in the backend isn't a good solution.

It looks like the __sync builtins are always expanded out to the __atomics.  This is done in the middle-end and there doesn't seem to be any way for a backend to intervene. 

I'm currently preparing a patch that introduces a new MEMMODEL tag to coretypes.h/memmodel to specify the __sync style full barriers. This can then be used in the __sync to __atomic expansions and, because the MEMMODEL argument to the __atomics is eventually passed to the backends, this is enough to enable backends to know when the stronger barrier is needed.

The drawback with this is that a number of backends need to be updated to recognize the new MEMMODEL tag. For each, the minimum change is to make the new MEMMODEL tag a synonym for MEMMODEL_SEQ_CST. The backends that would need updating are arm, aarch64, i386, ia64, pa, rs6000, s390, sparc. (A subsequent patch would make the aarch64 backend generate the stronger barrier for the new MEMMODEL tag.)
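
As a rough sketch of what that minimal per-backend change might look like (illustrative only: MEMMODEL_SYNC and emit_post_barrier_p are made-up names here, not existing GCC identifiers):
----
/* Sketch: a backend that needs nothing extra simply treats the new
   value like MEMMODEL_SEQ_CST; one that does (e.g. aarch64) can key a
   trailing barrier off it.  */
static bool
emit_post_barrier_p (enum memmodel model)
{
  switch (model)
    {
    case MEMMODEL_SYNC:
      /* __sync-style full barrier: e.g. emit a trailing dmb after the
	 exclusive-access loop.  */
      return true;
    default:
      /* MEMMODEL_SEQ_CST and weaker models keep their current mapping.  */
      return false;
    }
}
----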

If there's a better way of doing it, let me know.
Comment 2 Jakub Jelinek 2015-04-08 16:16:36 UTC
I doubt the __sync_* builtins are meant to be any stronger than __ATOMIC_SEQ_CST is.
Comment 3 Andrew Macleod 2015-04-08 17:34:35 UTC
Note that atomic MEMMODEL_SEQ_CST is the original implementation of __sync.  It was expanded to allow other memory models and the name was changed... but under the covers it still implemented the same as it was before.
Comment 4 Matthew Wahab 2015-04-09 09:53:23 UTC
There is a difference between __syncs and __atomics, assuming that the __atomics are meant to match the C11/C++11 memory model. With C11 atomics, a barrier for an operation on an object only affects memory references related to that object. So __atomic_add_fetch(&foo, 1, __ATOMIC_SEQ_CST) is only a full barrier for references that can affect the value of foo; other references can be moved around it. The documentation for the __sync builtins is clear that __sync_fetch_and_add is a barrier that affects all memory references.

There are C11 fences which make __ATOMIC_SEQ_CST act as a barrier for all memory references but these are distinct from the operations on a memory location.
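
To illustrate the distinction (a sketch only; in real multithreaded code the plain accesses to bar would need their own synchronization):
----
#include <stdatomic.h>

atomic_int foo;
int bar;

int rmw_only (void)
{
  /* seq-cst operation: a full barrier only for references related to foo.  */
  atomic_fetch_add_explicit (&foo, 1, memory_order_seq_cst);
  return bar;
}

int rmw_plus_fence (void)
{
  atomic_fetch_add_explicit (&foo, 1, memory_order_relaxed);
  /* seq-cst fence: the form that acts as a barrier for all memory
     references.  */
  atomic_thread_fence (memory_order_seq_cst);
  return bar;
}
----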

Descriptions of C11 atomics are a little opaque; the relevant sections in the C11 standard seem to be 5.1.2.4-9 Note 4 ("There is a separate order for each atomic object") and 7.17.3-7 (limiting memory_order_seq_cst to affected locations). Torvald Riegel's email, linked above, explains it better. It's from a similar discussion about the AArch64 implementation of MEMMODEL_SEQ_CST for the __atomic builtins.
Comment 5 torvald 2015-04-09 10:08:12 UTC
(In reply to Matthew Wahab from comment #0)
> The __sync builtins are implemented by expanding them to the equivalent
> __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers.
> This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow
> some data references to move across the operation (see
> https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html)

I don't think that's a precise characterization.  The difference discussed in that thread is that seq-cst fences have a stronger effect than seq-cst stores.  This is not really a question of MEMMODEL_SEQ_CST but what you apply it to.

> while a __sync full
> barrier should block all movement
> (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.
> html#_005f_005fsync-Builtins).

But in your code example below, you didn't use __sync_synchronize.  The other __sync operations do not document that they are a full barrier, AFAICS.  (I'm not sure whether there is a wide assumption that they do.)

> This is problem for Aarch64 where data references after the barrier could be
> speculated ahead of it.
> 
> For example, compiling with aarch64-none-linux-gnu at -O2,
> -----
> int foo = 0;
> int bar = 0;
> 
> int T1(void)
> {
>   int x = __sync_fetch_and_add(&foo, 1);
>   return bar;
> }

This test case is problematic in that it has a data race on bar unless bar is never modified by another thread, in which case it would be fine to load bar before the fetch_add.  It also loads bar irrespective of which value the fetch_add actually returns.  I'm aware that the __sync builtins offer no MEMMODEL_RELAXED, and so lots of code simply has a plain memory access.  

Nonetheless, I think you should also look at a test case that is properly synchronized, data-race-free C11 code with the builtins, and see whether that is compiled differently (ie, either use a relaxed MO load for bar or make the load of bar conditional on the outcome of the fetch_add).
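
For concreteness, one data-race-free variant along those lines could look like this (a sketch only, reusing the names from the report and making bar atomic):
----
#include <stdatomic.h>

atomic_int foo;
atomic_int bar;

int T1 (void)
{
  int x = atomic_fetch_add_explicit (&foo, 1, memory_order_seq_cst);
  (void) x;
  /* Relaxed-MO load instead of the plain (racy) access to bar.  */
  return atomic_load_explicit (&bar, memory_order_relaxed);
}
----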

To be on the safe side, you can also use the cppmem tool to verify what the C11 model requires the compiled code to guarantee.  Here's what you probably want to achieve:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  {{{ {
        r1=cas_strong(foo, 0, 1);
        r2=bar.load(memory_order_relaxed);
      }
  ||| {
        bar.store(42, memory_order_relaxed);
        foo.store(1, memory_order_release);
      }
  }}};
  return 0; }

This tries to test whether "ordered" stores to bar and foo are observed in this order in the first thread (it uses a CAS instead of fetch_add).  This gives me 3 consistent executions, of which one is the one in which the CAS succeeds (ie, like fetch_add would do), and this one returns a value of 42 for the load of bar.

> ----
> produces
> ----
> T1:
> 	adrp	x0, .LANCHOR0
> 	add	x0, x0, :lo12:.LANCHOR0
> .L2:
> 	ldaxr	w1, [x0]       ; load-acquire (__sync_fetch_and_add)
> 	add	w1, w1, 1
> 	stlxr	w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
> 	cbnz	w2, .L2
> 	ldr	w0, [x0, 4]    ; load (return bar)
> 	ret
> ----
> With this code, the load can be executed ahead of the store-release which
> ends the __sync_fetch_and_add. A correct implementation should emit a dmb
> instruction after the cbnz.

I'm not sufficiently familiar with ARM to assess whether that's correct.  Is this ARMv8?  It seems to be consistent with the mapping to machine code that GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), which does not include a DMB.  However, it also shows no difference between an acq_rel CAS and a seq_cst CAS.  Could it be that the CAS sequence itself (ie, the machine code for it) prevents the reordering you are concerned about at the HW level?
Comment 6 torvald 2015-04-09 10:11:50 UTC
(In reply to torvald from comment #5)
> (In reply to Matthew Wahab from comment #0)
> > while a __sync full
> > barrier should block all movement
> > (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.
> > html#_005f_005fsync-Builtins).
> 
> But in your code example below, you didn't use __sync_synchronize.  The
> other __sync operations do not document that they are a full barrier,
> AFAICS.  (I'm not sure whether there is a wide assumption that they do.)

No, you're right.  They are "in most cases" considered to be a full barrier.
Comment 7 Matthew Wahab 2015-04-09 10:47:14 UTC
(In reply to torvald from comment #5)
> (In reply to Matthew Wahab from comment #0)
> > The __sync builtins are implemented by expanding them to the equivalent
> > __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers.
> > This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow
> > some data references to move across the operation (see
> > https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html)
> 
> I don't think that's a precise characterization.  The difference discussed
> in that thread is that seq-cst fences have a stronger effect than seq-cst
> stores.  This is not really a question of MEMMODEL_SEQ_CST but what you
> apply it to.

Ok, my point was just that an __sync operation has a stronger barrier than an __atomic operation.

> This test case is problematic in that it has a data race on bar unless bar
> is never modified by another thread, in which case it would be fine to load
> bar before the fetch_add.  It also loads bar irrespective of which value the
> fetch_add actually returns.  I'm aware that the __sync builtins offer no
> MEMMODEL_RELAXED, and so lots of code simply has a plain memory access.  
> 
> Nonetheless, I think you should also look at a test case that is properly
> synchronized, data-race-free C11 code with the builtins, and see whether
> that is compiled differently (ie, either use a relaxed MO load for bar or
> make the load of bar conditional on the outcome of the fetch_add).
> 
> To be on the safe side, you can also use the cppmem tool to verify what the
> C11 model requires the compiled code to guarantee.  Here's what you probably
> want to achieve:

I agree that this wouldn't affect valid C11 code (because of data-races) but my understanding is that __sync builtins don't require a C11 model. The problem is that the __syncs are being implemented using atomic operations that do assume a C11 model and its notion of program validity. This looks like it would lead to differences in behaviour when code, using only the __sync builtins and knowing nothing of C11, is moved between targets with different memory models.

 
> > ----
> > produces
> > ----
> > T1:
> > 	adrp	x0, .LANCHOR0
> > 	add	x0, x0, :lo12:.LANCHOR0
> > .L2:
> > 	ldaxr	w1, [x0]       ; load-acquire (__sync_fetch_and_add)
> > 	add	w1, w1, 1
> > 	stlxr	w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
> > 	cbnz	w2, .L2
> > 	ldr	w0, [x0, 4]    ; load (return bar)
> > 	ret
> > ----
> > With this code, the load can be executed ahead of the store-release which
> > ends the __sync_fetch_and_add. A correct implementation should emit a dmb
> > instruction after the cbnz.
> 
> I'm not sufficiently familiar with ARM to assess whether that's correct.  Is
> this ARMv8?  It seems to be consistent with the mapping to machine code that
> GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html),
> which does not include a DMB.  However, it also shows no difference between
> an acq_rel CAS and a seq_cst CAS.  Could it be that the CAS sequence itself
> (ie, the machine code for it) prevents the reordering you are concerned
> about at the HW level?

Yes, it's ARMv8. It's consistent with the code expected for the atomics used to implement the __sync_fetch_and_add. I believe that acquire-release will normally be treated as seq-cst for the aarch64 back-end (I'd have to double-check though).
Comment 8 Jonathan Wakely 2015-04-09 11:10:07 UTC
(In reply to Matthew Wahab from comment #7)
> I agree that this wouldn't affect valid C11 code (because of data-races) but
> my understanding is that __sync builtins don't require a C11 model. The

You say that like it's a good thing :-)

They don't require a memory model only because there wasn't a cross-platform one that existed at the time.

> problem is that the __syncs are being implemented using atomic operations
> that do assume a C11 model and its notion of program validity. This looks
> like it would lead to differences in behaviour when code, using only the
> __sync builtins and knowing nothing of C11, is moved between targets with
> different memory models.

It seems unsurprising to me that you'll get different behaviour when trying to use a program written with no formal memory model on platforms with different memory models.
Comment 9 Matthew Wahab 2015-04-09 11:25:50 UTC
(In reply to Jonathan Wakely from comment #8)
> (In reply to Matthew Wahab from comment #7)
> > I agree that this wouldn't affect valid C11 code (because of data-races) but
> > my understanding is that __sync builtins don't require a C11 model. The
> 
> You say that like it's a good thing :-)
> 
> They don't require a memory model only because there wasn't a cross-platform
> one that existed at the time.

That's certainly changed; I'm not sure it's made things any simpler though.

> It seems unsurprising to me that you'll get different behaviour when trying
> to use a program written with no formal memory model on platforms with
> different memory models.

I don't think it's the programs that are at fault here. I'd expect programs to only rely on what the documentation tells them to expect, so, if the manual says that something is a full barrier, it seems reasonable for code to assume a full barrier will be provided.
Comment 10 Andrew Macleod 2015-04-09 11:51:44 UTC
(In reply to Matthew Wahab from comment #7)
> (In reply to torvald from comment #5)
> > (In reply to Matthew Wahab from comment #0)

> > 
> > I don't think that's a precise characterization.  The difference discussed
> > in that thread is that seq-cst fences have a stronger effect than seq-cst
> > stores.  This is not really a question of MEMMODEL_SEQ_CST but what you
> > apply it to.
> 
> Ok, my point was just that an __sync operation has a stronger barrier than
> an __atomic operation.
> 

I think that is just an interpretation problem or flaw in my documentation. It was never meant to indicate any difference between __sync and MEMMODEL_SEQ_CST. 

Have you tried it on a gcc before __atomic and __sync were merged? Probably GCC 4.7 and earlier.  The code should be identical.  Of course, changes to the machine description patterns could result in differences I suppose...  but the intent is that SEQ_CST and __sync are the same... hence the code base was merged.

The intention was also to deprecate __sync so I wouldn't put too much thought into it.
Comment 11 Matthew Wahab 2015-04-09 14:03:29 UTC
(In reply to Andrew Macleod from comment #10)
> (In reply to Matthew Wahab from comment #7)
>
> > Ok, my point was just that an __sync operation has a stronger barrier than
> > an __atomic operation.
> > 
> 
> I think that is just an interpretation problem or flaw in my documentation.
> It was never meant to indicate any difference between __sync and
> MEMMODEL_SEQ_CST. 

The gcc manual page for the __atomics does suggest that __ATOMIC_SEQ_CST is a full barrier and so should be equivalent to the __sync full barrier. There is a difference though, and that does mean that the semantics of the __sync builtins were changed by the merge with the __atomics.

> Have you tried it on a gcc before __atomic and __sync were merged? probably
> GCC 4.7 and earlier.  The code should be identical.  Of course, changes to
> the machine description patterns could result in differences I suppose... 
> but the intent is that SEQ_CST and __sync are the same... hence the code
> base was merged. 

AArch64 support first went into GCC 4.8 and the __atomic/__sync merge went into GCC 4.7.

I took a look at the __sync implementation in GCC 4.6: the builtins would expand to the sync_* instructions if a backend provided them and fall back to a compare-swap loop otherwise.

For backends that provided sync_* instructions, it looks like the operations were treated as a full barrier (applying to all memory references). This is still the case for the __atomics on some targets; e.g. i386 uses lock <inst> which apparently acts as a full barrier. 

The compare-swap loop didn't emit any kind of barrier.

> The intention was also to deprecate __sync so I wouldn't put too much
> thought into it.

New code should certainly be pointed at the __atomics but there's likely to be a lot of existing code using __syncs. It should be possible to move it to a new architecture without having to do potentially painful rewrites.
Comment 12 James Greenhalgh 2015-04-10 15:08:54 UTC
There are two problems here, one of which concerns me more in the real world, and both of which rely on races if you are in the C/C++11 model - there isn't much I can do about that as the __sync primitives are legacy and give stronger guarantees about "visible memory references" (which includes ALL global data).

Consider:

int x = 0;
int y = 0;
int z = 0;

void foo (void)
{
  x = 1;
  __sync_fetch_and_add (&y, 1);
  z = 1;
}

My reading of what a full barrier implies would suggest that we should never have a modification order which is any of:

  z = 1, x = 1, y = 0+1
  z = 1, y = 0+1, x = 1
  x = 1, z = 1, y = 0+1

GCC 5.0-ish (recent trunk) will emit:

foo:
	adrp	x3, .LANCHOR0
	mov	w1, 1
	add	x0, x3, :lo12:.LANCHOR0
	add	x2, x0, 4
	str	w1, [x3, #:lo12:.LANCHOR0]
.L2:
	ldaxr	w3, [x2]
	add	w3, w3, w1
	stlxr	w4, w3, [x2]
	cbnz	w4, .L2
	str	w1, [x0, 8]
	ret

Dropping some of the context and switching to pseudo-asm we have:

	str	1, [x]
.L2:
	ldaxr	tmp, [y]
	add	tmp, tmp, 1
	stlxr	flag, tmp, [y]
	cbnz	flag, .L2
	str	1, [z]

As far as I understand it, the memory ordering guarantees of the half-barrier LDAXR/STLXR instructions do not prevent this being reordered to an execution which looks like:

	ldaxr	tmp, [y]
	str	1, [z]
	str	1, [x]
	add	tmp, tmp, 1
	stlxr	flag, tmp, [y]
	cbnz	flag, .L2
   
which gives one of the modification orders we wanted to forbid above. Similar reordering can give the other undesirable modification orders.

As mentioned, this reordering is permitted under C11, as the stores to x and z are racy - it is allowed by the CPPMEM/Cambridge documentation of what a SEQ_CST operation must do. However, my feeling is that it is at the very least *surprising* for the __sync primitives to allow this ordering, and more likely points to a break in the AArch64 implementation.

Fixing this requires a hard memory barrier - but I really don't want to see us penalise C11 SEQ_CST to get the right behaviour out of __sync_fetch_and_add...

The second problem relies on a similar argument to show that in a __sync_lock_test_and_set operation we can reorder a later access before the "set" half of the operation. i.e. that

  __sync_lock_test_and_set (&y, 1)
  x = 1;

Can be seen as

  tmp = load_acquire (y)
  store (x, 1)
  store (y, 1)

Giving an incorrect modification order.

The AArch64 parts are something I can argue through with smarter people than me, but I think my comments on the __sync primitives are correct?
Comment 13 Andrew Haley 2015-04-15 10:16:18 UTC
There's surely a documentation problem here.

GCC defines this:

`__ATOMIC_SEQ_CST'
     Full barrier in both directions and synchronizes with acquire
     loads and release stores in all threads.

But LDAXR/STLXR doesn't do that, and there's no write barrier at all when the compare fails.  If the intention really is to map onto C++11, this specification is wrong.

But if this specification is correct, then

bool gcc_cas() {
  int expected = 1;
  int desired = 2;
  return __atomic_compare_exchange(&barf, &expected, &desired, false,
                                   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

gcc_cas():
	sub	sp, sp, #16
	mov	w1, 1
	adrp	x0, .LANCHOR0
	str	w1, [sp,12]
	add	x0, x0, :lo12:.LANCHOR0
	mov	w2, 2
.L10:
	ldaxr	w3, [x0]
	cmp	w3, w1
	bne	.L11
	stlxr	w4, w2, [x0]
	cbnz	w4, .L10
.L11:
	cset	w0, eq
	add	sp, sp, 16
	ret

is wrong.
Comment 14 mwahab 2015-04-15 11:11:30 UTC
(In reply to Andrew Haley from comment #13)
> There's surely a documentation problem here.
> 
> GCC defines this:
> 
> `__ATOMIC_SEQ_CST'
>      Full barrier in both directions and synchronizes with acquire
>      loads and release stores in all threads.

The documentation is a little misleading. Barriers in __atomic operations are slightly weaker than barriers in __atomic fences; the differences are rather subtle though. For the __sync implementation a fence would be ok but an operation is too weak.

> But LDAXR/STLXR doesn't do that, and there's no write barrier at all when
> the compare fails.  If the intention really is to map onto c++11, this
> specification is wrong.

The LDAXR/STLXR sequences rely on the C11/C++11 prohibition of data races. That the __atomic builtins assume this restriction is implied by the references to C11/C++11 in the documentation but should probably be made clearer. I'll try to write a patch, if nobody else gets there first.

I'll have to look at the compare-exchange code; it does look like it goes wrong.
Comment 15 Jakub Jelinek 2015-04-15 11:54:19 UTC
(In reply to mwahab from comment #14)
> The LDAXR/STLXR sequences rely on the C11/C++11 prohibition of data races.
> That the __atomic builtins assume this restriction is implied by the
> references to C11/C++11 in the documentation but should probably be made
> clearer. I'll try to write a patch, if nobody else gets there first.

The __atomic builtins are used far more widely than just as the underlying implementation for C11/C++11 atomics.  And, even on very weakly ordered architectures like PowerPC, __atomic_ CAS __ATOMIC_SEQ_CST emits a sync (full barrier) before the LL/SC loop and isync after the loop, so it behaves as documented currently.
Comment 16 Andrew Haley 2015-04-15 12:43:23 UTC
(In reply to mwahab from comment #14)
> (In reply to Andrew Haley from comment #13)

> > But LDAXR/STLXR doesn't do that, and there's no write barrier at all when
> > the compare fails.  If the intention really is to map onto c++11, this
> > specification is wrong.
> 
> The LDAXR/STLXR sequences rely on the C11/C++11 prohibition of data races.
> That the __atomic builtins assume this restriction is implied by the
> references to C11/C++11 in the documentation but should probably be made
> clearer. I'll try to write a patch, if nobody else gets there first.

Right.

> I'll have to look at the compare-exchange code, it does look like it goes
> wrong.

Well... goes wrong how?  If it's intended to be a C++11 sequentially-consistent CAS, it's just fine.
Comment 17 mwahab 2015-04-15 13:01:01 UTC
According to the GCC documentation, __atomic_compare_exchange(ptr, exp, des, ..) is: if (*ptr == *exp) *ptr = *des; else *exp = *ptr;

On AArch64, the else (*ptr != *exp) branch is a store rather than a store-release.

This doesn't appear in your example but with
----
int cas(int* barf, int* expected, int* desired)
{
  return __atomic_compare_exchange_n(barf, expected, desired, 0,
				     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
----
cas:
	ldr	w3, [x1]
.L3:
	ldaxr	w4, [x0]
	cmp	w4, w3
	bne	.L4
	stlxr	w5, w2, [x0]  ; store-release
	cbnz	w5, .L3
.L4:
	cset	w0, eq
	cbz	w0, .L6
	ret
	.p2align 3
.L6:
	str	w4, [x1]  ; store, no barrier.
	ret
----

This looks odd to me but I'd need to look into it more.
Comment 18 Andrew Haley 2015-04-15 14:10:50 UTC
(In reply to mwahab from comment #17)

> ----
> int cas(int* barf, int* expected, int* desired)
> {
>   return __atomic_compare_exchange_n(barf, expected, desired, 0,
> 				     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> }
> ----
> cas:
> 	ldr	w3, [x1]
> .L3:
> 	ldaxr	w4, [x0]
> 	cmp	w4, w3
> 	bne	.L4
> 	stlxr	w5, w2, [x0]  ; store-release
> 	cbnz	w5, .L3
> .L4:
> 	cset	w0, eq
> 	cbz	w0, .L6
> 	ret
> 	.p2align 3
> .L6:
> 	str	w4, [x1]  ; store, no barrier.
> 	ret
> ----
> 
> This looks odd to me but I'd need look into it more.

That looks fine to me: if the CAS fails, the prev value -> *expected.
Comment 19 mwahab 2015-04-15 14:33:30 UTC
(In reply to Andrew Haley from comment #18)
> (In reply to mwahab from comment #17)
> 
> > ----
> > int cas(int* barf, int* expected, int* desired)
> > {
> >   return __atomic_compare_exchange_n(barf, expected, desired, 0,
> > 				     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> > }
> > ----
> > cas:
> > 	ldr	w3, [x1]
> > .L3:
> > 	ldaxr	w4, [x0]
> > 	cmp	w4, w3
> > 	bne	.L4
> > 	stlxr	w5, w2, [x0]  ; store-release
> > 	cbnz	w5, .L3
> > .L4:
> > 	cset	w0, eq
> > 	cbz	w0, .L6
> > 	ret
> > 	.p2align 3
> > .L6:
> > 	str	w4, [x1]  ; store, no barrier.
> > 	ret
> > ----
> > 
> > This looks odd to me but I'd need look into it more.
> 
> That looks fine to me: if the CAS fails, the prev value -> *expected.

It looks inconsistent with C11 S7.17.7.4-2 (C++11 S29.6.4-21) "Further, if the comparison is true, memory is affected according to the value of success, and if the comparison is false, memory is affected according to the value of failure." (where success and failure are the memory model arguments.) In this case, the write to *exp should be memory_order_seq_cst.
Comment 20 Andrew Haley 2015-04-15 14:49:42 UTC
(In reply to mwahab from comment #19)
> (In reply to Andrew Haley from comment #18)
> 
> It looks inconsistent with C11 S7.17.7.4-2 (C++11 S29.6.4-21) "Further, if
> the comparison is true, memory is affected according to the value of
> success, and if the comparison is false, memory is affected according to the
> value of failure." (where success and failure are the memory model
> arguments.) In this case, the write to *exp should be memory_order_seq_cst.

But no store actually takes place, so the only effect is that of the read. You can't have a sequentially consistent store without a store.
Comment 21 torvald 2015-04-15 15:49:43 UTC
(In reply to Andrew Haley from comment #20)
> (In reply to mwahab from comment #19)
> > (In reply to Andrew Haley from comment #18)
> > 
> > It looks inconsistent with C11 S7.17.7.4-2 (C++11 S29.6.4-21) "Further, if
> > the comparison is true, memory is affected according to the value of
> > success, and if the comparison is false, memory is affected according to the
> > value of failure." (where success and failure are the memory model
> > arguments.) In this case, the write to *exp should be memory_order_seq_cst.
> 
> But no store actually takes place, so the only effect is that of the read.
> You can't have a sequentially consistent store without a store.

I agree. If you continue reading in the C++11 paragraph that you cited, you'll see that if just one MO is provided and the CAS fails, an acq_rel MO is downgraded to acquire and a release MO to relaxed.  This is consistent with no update of the atomic variable (note that expected is not atomic, so applying an MO to accesses to it is not meaningful in any case).  However, if the provided MO is seq_cst, I think a failed CAS still needs to be equivalent to a seq_cst load.
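
Stated compactly (a sketch of the C++11 rule only; failure_order is a made-up helper, not a standard or GCC function):
----
#include <stdatomic.h>

/* When a compare-exchange is given a single memory order, the failure
   order is derived from it as below.  Note that seq_cst is not
   downgraded, so a failed seq_cst CAS still acts as a seq_cst load.  */
memory_order
failure_order (memory_order success)
{
  switch (success)
    {
    case memory_order_acq_rel: return memory_order_acquire;
    case memory_order_release: return memory_order_relaxed;
    default:                   return success;
    }
}
----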
Comment 22 torvald 2015-04-15 20:05:07 UTC
(In reply to James Greenhalgh from comment #12)
> There are two problems here, one of which concerns me more in the real
> world, and both of which rely on races if you are in the C/C++11 model -
> there isn't much I can do about that as the __sync primitives are legacy and
> give stronger guarantees about "visible memory references" (which includes
> ALL global data).
> 
> Consider:
> 
> int x = 0;
> int y = 0;
> int z = 0;
> 
> void foo (void)
> {
>   x = 1;
>   __sync_fetch_and_add (&y, 1);
>   z = 1;
> }
> 
> My reading of what a full barrier implies would suggest that we should never
> have a modification order which is any of:
> 
>   z = 1, x = 1, y = 0+1
>   z = 1, y = 0+1, x = 1
>   x = 1, z = 1, y = 0+1

At least in C11/C++11, modification orders are per memory location.  If we want to have something that orders accesses to x, y, and z, we need to model the observer side too (which the C11 model does, for example).  I'm mentioning that because at the very least, we have compiler reordering happening at the observer side; if a programmer would need to use, for example, __sync builtins to order the observers, then this means we "just" have to consider combinations of those on the modification and the observation side.

(And because I'm not sure what you think "modification order" is, I can't comment further.)

> GCC 5.0-ish (recent trunk) will emit:
> 
> foo:
> 	adrp	x3, .LANCHOR0
> 	mov	w1, 1
> 	add	x0, x3, :lo12:.LANCHOR0
> 	add	x2, x0, 4
> 	str	w1, [x3, #:lo12:.LANCHOR0]
> .L2:
> 	ldaxr	w3, [x2]
> 	add	w3, w3, w1
> 	stlxr	w4, w3, [x2]
> 	cbnz	w4, .L2
> 	str	w1, [x0, 8]
> 	ret
> 
> Dropping some of the context and switching to pseudo-asm we have:
> 
> 	str	1, [x]
> .L2:
> 	ldaxr	tmp, [y]
> 	add	tmp, tmp, 1
> 	stlxr	flag, tmp, [y]
> 	cbnz	flag, .L2
> 	str	1, [z]
> 
> As far as I understand it, the memory ordering guarantees of the
> half-barrier LDAXR/STLXR instructions do not prevent this being reordered to
> an execution which looks like:
> 
> 	ldaxr	tmp, [y]
> 	str	1, [z]
> 	str	1, [x]
> 	add	tmp, tmp, 1
> 	stlxr	flag, tmp, [y]
> 	cbnz	flag, .L2
>    
> which gives one of the modification orders we wanted to forbid above.
> Similar reordering can give the other undesirable modification orders.

(I'm not familiar with the ARM model, so please bear with me.)

This is reordering the HW can do?  Or are you concerned about the compiler backend?
Would the reordered str always become visible atomically with the stlxr?
Would it become visible even if the LLSC fails, thus potentially storing more than once to z?  This would be surprising in that the ldaxr would then have to be reloaded too potentially after the store to z, I believe (at least for a strong CAS) -- which would break the acquire MO on the load.

> As mentioned, this reordering is permitted under C11, as the stores to x and
> z are racy - this permits the CPPMEM/Cambridge documentation of what an
> SEQ_CST must do.

That's true in this example, but at the instruction level (assuming HW reordering is what you're concerned about), an atomic relaxed-MO load isn't distinguishable from a normal memory access, right?  So, this isn't strictly about DRF, but about the difference between C11 seq_cst fences and seq_cst RMW ops.

> However, my feeling is that it is at the very least
> *surprising* for the __sync primitives to allow this ordering, and more
> likely points to a break in the AArch64 implementation.

With the caveat that given that __sync isn't documented in great detail, a lot of interpretations might happen in practice, so there might be a few surprises to some people :)

> Fixing this requires a hard memory barrier - but I really don't want to see
> us penalise C11 SEQ_CST to get the right behaviour out of
> __sync_fetch_and_add...

I agree.
Comment 23 James Greenhalgh 2015-04-15 20:49:06 UTC
(In reply to torvald from comment #22)
> (In reply to James Greenhalgh from comment #12)
> > There are two problems here, one of which concerns me more in the real
> > world, and both of which rely on races if you are in the C/C++11 model -
> > there isn't much I can do about that as the __sync primitives are legacy and
> > give stronger guarantees about "visible memory references" (which includes
> > ALL global data).
> > 
> > Consider:
> > 
> > int x = 0;
> > int y = 0;
> > int z = 0;
> > 
> > void foo (void)
> > {
> >   x = 1;
> >   __sync_fetch_and_add (&y, 1);
> >   z = 1;
> > }
> > 
> > My reading of what a full barrier implies would suggest that we should never
> > have a modification order which is any of:
> > 
> >   z = 1, x = 1, y = 0+1
> >   z = 1, y = 0+1, x = 1
> >   x = 1, z = 1, y = 0+1
> 
> At least in C11/C++11, modification orders are per memory location.  If we
> want to have something that orders accesses to x, y, and z, we need to model
> the observer side too (which the C11 model does, for example).  I'm
> mentioning that because at the very least, we have compiler reordering
> happening at the observer side; if a programmer would need to use, for
> example, __sync builtins to order the observers, then this means we "just"
> have to consider combinations of those on the modification and the
> observation side.
> 
> (And because I'm not sure what you think "modification order" is, I can't
> comment further.)

Yes. I'm sorry,  I was imprecise in my language as I couldn't quite find the right phrase in specifications - what I am trying to get at is: the order in which observers in the system are able to observe the writes to x, y, z. Later in this post I'll call this "observation order", until I can find a better terminology.

I believe the __sync builtins should guarantee that an observer could trust:

  assert (z == 1 && y == 1 && x == 1);

to hold.
  
That is, the write to z is observed if and only if the writes to y and x have also been observed.

> > GCC 5.0-ish (recent trunk) will emit:
> > 
> > foo:
> > 	adrp	x3, .LANCHOR0
> > 	mov	w1, 1
> > 	add	x0, x3, :lo12:.LANCHOR0
> > 	add	x2, x0, 4
> > 	str	w1, [x3, #:lo12:.LANCHOR0]
> > .L2:
> > 	ldaxr	w3, [x2]
> > 	add	w3, w3, w1
> > 	stlxr	w4, w3, [x2]
> > 	cbnz	w4, .L2
> > 	str	w1, [x0, 8]
> > 	ret
> > 
> > Dropping some of the context and switching to pseudo-asm we have:
> > 
> > 	str	1, [x]
> > .L2:
> > 	ldaxr	tmp, [y]
> > 	add	tmp, tmp, 1
> > 	stlxr	flag, tmp, [y]
> > 	cbnz	flag, .L2
> > 	str	1, [z]
> > 
> > As far as I understand it, the memory ordering guarantees of the
> > half-barrier LDAXR/STLXR instructions do not prevent this being reordered to
> > an execution which looks like:
> > 
> > 	ldaxr	tmp, [y]
> > 	str	1, [z]
> > 	str	1, [x]
> > 	add	tmp, tmp, 1
> > 	stlxr	flag, tmp, [y]
> > 	cbnz	flag, .L2
> >    
> > which gives one of the modification orders we wanted to forbid above.
> > Similar reordering can give the other undesirable modification orders.
> 
> (I'm not familiar with the ARM model, so please bear with me.)
> 
> This is reordering the HW can do?  Or are you concerned about the compiler
> backend?

The architectural description of the memory model - this is reordering the HW is permitted to do.

> Would the reordered str always become visible atomically with the stlxr?
> Would it become visible even if the LLSC fails, thus potentially storing
> more than once to z?  This would be surprising in that the ldaxr would then
> have to be reloaded too potentially after the store to z, I believe (at
> least for a strong CAS) -- which would break the acquire MO on the load.

The STR would not become visible until after the stlxr had resolved as successful; however, it can take a position in the observation order ahead of the stlxr, such that the ordering of writes seen by observers would not be program order.

> > As mentioned, this reordering is permitted under C11, as the stores to x and
> > z are racy - this permits the CPPMEM/Cambridge documentation of what an
> > SEQ_CST must do.
> 
> That's true in this example, but at the instruction level (assuming HW
> reordering is what you're concerned about), a atomic relaxed-MO load isn't
> distinguishable from a normal memory access, right?  So, it's not DRF what
> this is strictly about, but the difference between C11 seq_cst fences and
> seq_cst RMW ops.

Absolutely! I was trying to preempt "That program is racy" responses :).

> > However, my feeling is that it is at the very least
> > *surprising* for the __sync primitives to allow this ordering, and more
> > likely points to a break in the AArch64 implementation.
> 
> With the caveat that given that __sync isn't documented in great detail, a
> lot of interpretations might happen in practice, so there might be a few
> surprises to some people :)

:) Agreed.

> > Fixing this requires a hard memory barrier - but I really don't want to see
> > us penalise C11 SEQ_CST to get the right behaviour out of
> > __sync_fetch_and_add...
> 
> I agree.
Comment 24 torvald 2015-04-15 22:12:53 UTC
I think we need to at least clarify the documentation of __atomic, probably also of __sync; we might also have to change the implementation of __sync builtins on some archs.

First, I think the __atomic builtins should strictly follow the C11 model -- one major reason being that it's not worthwhile for us to support a GCC-specific memory model, which is complex.  Therefore, I think we should clarify in the __atomic docs that C++11 is the reference semantics, and that any additional text we provide (e.g., to describe what MEMMODEL_SEQ_CST does) is just some hand-wavy attempt at making things look simpler.  I'd choose C++11 over C11 because, although the models are supposed to be equal, C++11 has more detailed docs in the area of atomics and synchronization in general, IMO.  There are also more people involved in C++ standardization than C standardization.

The only exception that I see is that we might want to give more guarantees wrt data-race-freedom (DRF) as far as compiler reordering is concerned.  I'm mentioning just the compiler because I'm not aware of GCC supporting C11/C++11 on any archs whose ISA distinguishes between atomic and nonatomic accesses (ie, where DRF is visible at the ISA level); however, I've heard of upcoming archs that may do that.  Trying to promise some support for non-DRF programs that synchronize might make transition of legacy code to __atomic builtins easier, but it's a slippery slope; I don't think it's worthwhile for us to try to give really precise and hard guarantees, just because of the complexity involved.  I'm not quite sure what's best here.

Second, in C11 there is a difference between a memory access or read-modify-write op (RMW) that is seq-cst, and a seq-cst fence.  For reference, try this example in cppmem:

int main() {
  atomic_int x = 0; 
  atomic_int y = 0; 
  atomic_int d1 = 0; 
  atomic_int d2 = 0; 
  {{{ {
        x.store(1, memory_order_relaxed);
        // atomic_thread_fence(memory_order_seq_cst);
        r3=cas_strong(d1, 0, 1);
        r1=y.load(memory_order_relaxed).readsvalue(0);
      }
  ||| {
        y.store(1, memory_order_relaxed);
        // atomic_thread_fence(memory_order_seq_cst);
        r4=cas_strong(d2, 0, 1);
        r2=x.load(memory_order_relaxed).readsvalue(0);
      }
  }}};
  return 0; }

This is Dekker synchronization: It doesn't work (ie, both threads can read value 0) with seq-cst RMWs to separate locations (d1 and d2) even though those RMWs will be totally ordered; however, it does work when using fences (see the comments).  Therefore, for __atomic, it's not only the MEMMODEL_* value that counts, but also which operation it is applied to.  Perhaps we need to point this out in the docs.
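
For readers more comfortable with plain C11 than cppmem syntax, the same shape can be written roughly as follows (thread bodies only; a sketch, with the harness and the readsvalue() checks omitted):
----
#include <stdatomic.h>

atomic_int x, y, d1, d2;

void thread1 (void)
{
  atomic_store_explicit (&x, 1, memory_order_relaxed);
  /* atomic_thread_fence (memory_order_seq_cst);  -- uncommented, this
     makes the Dekker pattern work.  */
  int expected = 0;
  atomic_compare_exchange_strong (&d1, &expected, 1);  /* seq-cst CAS */
  int r1 = atomic_load_explicit (&y, memory_order_relaxed);
  (void) r1;
}

void thread2 (void)
{
  atomic_store_explicit (&y, 1, memory_order_relaxed);
  /* atomic_thread_fence (memory_order_seq_cst);  -- uncommented, this
     makes the Dekker pattern work.  */
  int expected = 0;
  atomic_compare_exchange_strong (&d2, &expected, 1);  /* seq-cst CAS */
  int r2 = atomic_load_explicit (&x, memory_order_relaxed);
  (void) r2;
}
----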


However, our current docs for __sync indicate that most of the __sync RMWs are "full barriers". The latter has a very TSO-like description, and my guess would be that most people might understand it this way too (ie, that it would at least be as strong as a seq-cst C11 fence).  Also, Andrew mentioned offline that prior documentation seemed to have described the __sync RMWs as
  __sync_synchronize();
  operation...
  __sync_synchronize();

For x86 and similar, a C11-seq-cst RMW will be a full barrier; Jakub says the same holds for PowerPC currently.  AFAIU, on ARM it doesn't (but it kind of surprises me that HW would reorder into an LLSC region; my guess would be that this complicates matters, and I can't see a benefit).

So, at least on ARM, there seems to be a difference between a __sync RMW that is documented to be at least as strong as a C11 seq-cst fence, and how we implement a C11 seq-cst fence.  This difference is observable by a program, but doing so requires a non-DRF program (but there are no relaxed-MO __sync variants, so programs that used __sync might just have relied on non-DRF accesses).

I would guess that programs that actually contain synchronization code affected by this difference are rather rare.  Normal locks and such are not affected, nor are many common synchronization algorithms such as for linked lists etc (eg, which work fine with acq_rel MO ops).  I'd say that the full fences are really necessary mostly if one wants to synchronize using separate locations and without a global common order for those accesses (eg, using several mem locations and asymmetric access patterns to tune contention favorably for common cases (eg, Dekker-like stuff if made asymmetric; libitm uses that)).  I guess it's much more common to "route" the synchronization through centralized or hierarchical locations so that eventually there is a tie-breaker.  For example, AFAIR there's no use of a seq-cst fence in current glibc.

Thus, we need to at least improve the __sync docs.  If we want to really make them C11-like, we need to update the docs, and there might be legacy code assuming different semantics that breaks.  If we don't, we need to update the implementation of __sync RMWs.  I don't know what would be the easiest way to do that:
1) We could put __synch_synchronize around them on all archs, and just don't care about the performance overhead (thinking that we want to deprecate them anyway).  But unless the backends optimize unnecessary sync ops (eg, on x86, remove mfence adjacent to a lock'd RMW), there could be performance degradation for __sync-using code.
2) We could make __atomic RMWs with MEMMODEL_SEQ_CST stronger than required by C11.  This could result in other C11-conforming compilers to produce more efficient synchronization code, and is thus not a good option IMO.
3) We could do something just on ARM (and scan other arcs for similar issues).  That's perhaps the cleanest option.

Thoughts?
Comment 25 Andrew Macleod 2015-04-16 03:44:50 UTC
My opinion:

1) is undesirable... even though it could possibly accelerate the conversion of legacy sync to atomic calls... I fear it would instead just cause frustration, annoyance and worse.  I don't think we need to actually penalize sync for no reason better than a 1 in a somethillion shot in the dark.

2) Similar reasoning, I don't think we need to penalize SEQ_CST everywhere for a legacy sync call that probably doesn't need stronger code either.

which leaves option 3)

There are 2 primary options I think

a) introduce an additional memory model... MEMMODEL_SYNC or something which is even stronger than SEQ_CST.  This would involve fixing any places that assume SEQ_CST is the highest.  And then we use this from the places that expand sync calls as the memory model.  

or
b) When we are expanding sync calls, we first look for target __sync patterns instead of atomic patterns.  If they exist, we use those. Otherwise we simply expand like we do today.  


(b) may be easier to implement, but puts more onus on the target.. probably involves more complex patterns since you need both atomic and sync patterns. My guess is some duplication of code will occur here.  But the impact is only to specific targets.

(a) Is probably cleaner... just touches a lot more code.  Since we're in stage 1, maybe (a) is a better long term solution...?   



Documentation needs updating for sure... The rules have changed under us since originally SEQ_CST and sync were intended to be the same thing... Guess I shouldn't have tied our horse to the C++11 cart :-)
Comment 26 mwahab 2015-04-16 08:11:38 UTC
(In reply to torvald from comment #21)
> (In reply to Andrew Haley from comment #20)
> > (In reply to mwahab from comment #19)
> > > (In reply to Andrew Haley from comment #18)
> > > 
> > > It looks inconsistent with C11 S7.17.7.4-2 (C++11 S29.6.4-21) "Further, if
> > > the comparison is true, memory is affected according to the value of
> > > success, and if the comparison is false, memory is affected according to the
> > > value of failure." (where success and failure are the memory model
> > > arguments.) In this case, the write to *exp should be memory_order_seq_cst.
> > 
> > But no store actually takes place, so the only effect is that of the read.
> > You can't have a sequentially consistent store without a store.
> 
> I agree. If you continue reading in the C++11 paragraph that you cited,
> you'll see that if just one MO is provided and the CAS fails, an acq_rel MO
> is downgraded to acquire and a release MO to relaxed.  This is consistent
> with no update of the atomic variable (note that expected is not atomic, so
> applying an MO to accesses to it is not meaningful in any case).  However,
> if the provided MO is seq_cst, I think a failed CAS still needs to be
> equivalent to a seq_cst load.

Yes, the last two sentences in the C++11 paragraph make it clear: "If the operation returns true, these operations are atomic read-modify-write operations (1.10). Otherwise, these operations are atomic load operations."  In that case, the Aarch64 code looks ok.
Comment 27 mwahab 2015-04-16 08:50:27 UTC
(In reply to Andrew Macleod from comment #25)
> My opinion:
> 
> 1) is undesirable... even though it could possibly accelerate the conversion
> of legacy sync to atomic calls... I fear it would instead just cause
> frustration, annoyance and worse.  I don't think we need to actually
> penalize sync for no reason better than a 1 in a somethillion shot in the
> dark.
> 
> 2) Similar reasoning, I don't think we need to penalize SEQ_CST everywhere
> for a legacy sync call that probably doesn't need stronger code either.
> 
> which leaves option 3)
> 
> There are 2 primary options I think
> 
> a) introduce an additional memory model... MEMMODEL_SYNC or something which
> is even stronger than SEQ_CST.  This would involve fixing any places that
> assume SEQ_CST is the highest.  And then we use this from the places that
> expand sync calls as the memory model.  
>
> or
> b) When we are expanding sync calls, we first look for target __sync
> patterns instead of atomic patterns.  If they exist, we use those. Otherwise
> we simply expand like we do today.  
> 
> 
> (b) may be easier to implement, but puts more onus on the target.. probably
> involves more complex patterns since you need both atomic and sync patterns.
> My guess is some duplication of code will occur here.  But the impact is
> only to specific targets.
> 
> (a) Is probably cleaner... just touches a lot more code.  Since we're in
> stage 1, maybe (a) is a better long term solution...?   
> 

Adding a new barrier specifier to enum memmodel seems the simplest approach. It would mean that all barriers used in gcc are represented in the same way, and that would make them more visible and easier to understand than splitting them between the enum memmodel and the atomic/sync patterns.

I'm testing a patch-set for option (a). It touches several backends but all the changes are very simple since the new barrier is the same as MEMMODEL_SEQ_CST for most targets.

One thing though is that the aarch64 code for __sync_lock_test_and_set may have a similar problem with a too-weak barrier. It's not confirmed that there is a problem but, since it may mean more work in this area, I thought I'd mention it.
Comment 28 James Greenhalgh 2015-04-16 09:00:52 UTC
(In reply to torvald from comment #24)
> I think we need to at least clarify the documentation of __atomic, probably
> also of __sync; we might also have to change the implementation of __sync
> builtins on some archs.
> 
> First, I think the __atomic builtins should strictly follow the C11 model --
> one major reason being that it's not worthwhile for us to support a
> GCC-specific memory model, which is complex.  Therefore, I think we should
> clarify in the __atomic docs that C++11 is the reference semantics, and that
> any additional text we provide (e.g., to describe what MEMMODEL_SEQ_CST
> does) is just some hand-wavy attempt of makings things look simpler.  I'd
> choose C++11 over C11 because, although the models are supposed to be equal,
> C++11 has more detailed docs in the area of atomics and synchronization in
> general, IMO.  There are also more people involved in C++ standardization
> than C standardization.

Agreed.

> The only exception that I see is that we might want to give more guarantees
> wrt data-race-freedom (DRF) as far as compiler reordering is concerned.  I'm
> mentioning just the compiler because I'm not aware of GCC supporting
> C11/C++11 on any archs whose ISA distinguishes between atomic and nonatomic
> accesses (ie, where DRF is visible at the ISA level); however, I've heard of
> upcoming archs that may do that.  Trying to promise some support for non-DRF
> programs that synchronize might make transition of legacy code to __atomic
> builtins easier, but it's a slippery slope; I don't think it's worthwhile
> for us to try to give really precise and hard guarantees, just because of
> the complexity involved.  I'm not quite sure what's best here.

I use enough whiteboard space trying to work through the well researched, formalized, and discussed C++11 semantics; I would not like to see a subtly different GNU++11 semantics!! Not least because I worry for the programmer who thinks they've understood the guarantees of the specification and reverse engineers GCC's output for confirmation, only to find GCC is giving stronger guarantees than is strictly necessary.
 
> <snip>
>
> Thus, we need to at least improve the __sync docs.  If we want to really
> make them C11-like, we need to update the docs, and there might be legacy
> code assuming different semantics that breaks.  If we don't, we need to
> update the implementation of __sync RMWs.  I don't know what would be the
> easiest way to do that:
>
> 1) We could put __synch_synchronize around them on all archs, and just don't
> care about the performance overhead (thinking that we want to deprecate them
> anyway).  But unless the backends optimize unnecessary sync ops (eg, on x86,
> remove mfence adjacent to a lock'd RMW), there could be performance
> degradation for __sync-using code.

I'm with Andrew, I don't see the need to penalise everyone to ensure AArch64/ARM memory ordering (Though I agree it would be nice as a stick to encourage people to move forwards).

> 2) We could make __atomic RMWs with MEMMODEL_SEQ_CST stronger than required
> by C11.  This could result in other C11-conforming compilers to produce more
> efficient synchronization code, and is thus not a good option IMO.

I would not like to see this implemented.

> 3) We could do something just on ARM (and scan other arcs for similar
> issues).  That's perhaps the cleanest option.

Which leaves 3). From Andrew's two proposed solutions:

> a) introduce an additional memory model... MEMMODEL_SYNC or something
> which is even stronger than SEQ_CST.  This would involve fixing any places
> that assume SEQ_CST is the highest.  And then we use this from the places
> that expand sync calls as the memory model.

This seems a sensible approach and leads to a nicer design, but I worry that it might be overkill for primitives which we ultimately want to reduce support for and eventually deprecate.

> b) When we are expanding sync calls, we first look for target __sync
> patterns instead of atomic patterns.  If they exist, we use those.
> Otherwise we simply expand like we do today.  
>
> (b) may be easier to implement, but puts more onus on the target..
> probably involves more complex patterns since you need both atomic and
> sync patterns. My guess is some duplication of code will occur here.  But
> the impact is only to specific targets.

When I looked at this problem internally a few weeks back, this is exactly how I expected things to work (we can add the documentation for the pattern names to the list of things which need fixing, as it took me a while to figure out why my new patterns were never expanded!).

I don't think this is particularly onerous for a target. The tough part in all of this is figuring out the minimal cost instructions at the ISA level to use for the various __atomic primitives. Extending support to a stronger model should be straightforward given explicit documentation of the stronger ordering requirements.

Certainly, a target would have to do the same legwork for a) regardless, and would have to pollute their atomic patterns with checks and code paths for MEMMODEL_SYNC.

This also gives us an easier route to fixing any issues with the acquire/release __sync primitives (__sync_lock_test_and_set and __sync_lock_release) if we decide that these also need to be stronger than their C++11 equivalents.
Comment 29 mwahab 2015-04-16 09:18:08 UTC
(In reply to mwahab from comment #27)
> (In reply to Andrew Macleod from comment #25)
> > My opinion:
> > 
> > 1) is undesirable... even though it could possibly accelerate the conversion
> > of legacy sync to atomic calls... I fear it would instead just cause
> > frustration, annoyance and worse.  I don't think we need to actually
> > penalize sync for no reason better than a 1 in a somethillion shot in the
> > dark.
> > 
> > 2) Similar reasoning, I don't think we need to penalize SEQ_CST everywhere
> > for a legacy sync call that probably doesn't need stronger code either.
> > 
> > which leaves option 3)
> > 
> > There are 2 primary options I think


There may be a third option, which is to set up a target hook to allow the sync expansions in the middle end to be overridden. This limits the changes to the backends that need the different semantics without having to continue with the sync_ patterns (which aren't in aarch64 for example). There's space in the memmodel values to support target-specific barriers so, for aarch64, this would allow the atomics code to be reused. I don't know much about the target hook infrastructure though so I don't know how disruptive a new hook would be.
Comment 30 mwahab 2015-04-16 10:10:11 UTC
(In reply to James Greenhalgh from comment #28)

> Which leaves 3). From Andrew's two proposed solutions:
> 
> > a) introduce an additional memory model... MEMMODEL_SYNC or something
> > which is even stronger than SEQ_CST.  This would involve fixing any places
> > that assume SEQ_CST is the highest.  And then we use this from the places
> > that expand sync calls as the memory model.
> 
> This seems a sensible approach and leads to a nicer design, but I worry that
> it might be overkill for primitives which we ultimately want to reduce
> support for and eventually deprecate.

I don't understand why it would be overkill. It's already expected that not all barriers will be meaningful for all targets, and target code to handle redundant barriers usually comes down to a few clauses in a conditional statement.

> > (b) may be easier to implement, but puts more onus on the target..
> > probably involves more complex patterns since you need both atomic and
> > sync patterns. My guess is some duplication of code will occur here.  But
> > the impact is only to specific targets.
> 
> When I looked at this problem internally a few weeks back, this is exactly
> how I expected things to work (we can add the documentation for the pattern
> names to the list of things which need fixing, as it took me a while to
> figure out why my new patterns were never expanded!).
> 
> I don't think this is particularly onerous for a target. The tough part in
> all of this is figuring out the minimal cost instructions at the ISA level
> to use for the various __atomic primitives. Extending support to a stronger
> model, should be straightforward given explicit documentation of the
> stronger ordering requirements.
>

My objection to using sync patterns is that it does take more work, both for the initial implementation and for continuing maintenance. It also means adding sync patterns to targets that would not otherwise need them and preserving a part of the gcc infrastructure that is only needed for a legacy feature and could otherwise be targeted for removal.

> Certainly, a target would have to do the same legwork for a) regardless, and
> would have to pollute their atomic patterns with checks and code paths for
> MEMMODEL_SYNC.

Actually, the changes needed for (a) are very simple. The checks and code-paths for handling barriers are already there, and reusing them for MEMMODEL_SYNC is trivial. The resulting code is much easier to understand, and to fix safely, when it is all in one place than when it is spread out across patterns and functions. 

> This also gives us an easier route to fixing any issues with the
> acquire/release __sync primitives (__sync_lock_test_and_set and
> __sync_lock_release) if we decide that these also need to be stronger than
> their C++11 equivalents.

This seems like a chunky workaround to avoid having to add a barrier representation to gcc.  It's a reworking of the middle-end, not necessarily a straightforward one, to use patterns that would need to be added to architectures that don't currently have them and at least checked in the architectures that do.

This seems to come down to what enum memmodel is supposed to be for. If it is intended to represent the barriers provided by C11/C++11 atomics and nothing else, then a workaround seems unavoidable. If it is meant to represent the barriers gcc needs to compile user code then, it seems to me, it would be better to just add the barrier to the enum and update code where necessary. 

Since memmodel is relatively recent, was merged from what looks like the C++ memory model branch (cxx-mem-model), and doesn't seem to have changed since then, it's maybe not surprising that it doesn't already include everything needed by gcc. I don't see that adding to it should be prohibited, provided the additions can be shown to be strictly required.
Comment 31 Andrew Macleod 2015-04-16 11:37:32 UTC
(In reply to mwahab from comment #30)
> (In reply to James Greenhalgh from comment #28)
> 
> > I don't think this is particularly onerous for a target. The tough part in
> > all of this is figuring out the minimal cost instructions at the ISA level
> > to use for the various __atomic primitives. Extending support to a stronger
> > model, should be straightforward given explicit documentation of the
> > stronger ordering requirements.
> >
> 
> My objection to using sync patterns is that it does take more work, both for
> the initial implementation and for continuing maintenance. It also means
> adding sync patterns to targets that would not otherwise need them and
> preserving a part of the gcc infrastructure that is only needed for a legacy
> feature and could otherwise be targeted for removal.
> 

Targets that don't need special sync patterns (i.e. most of them) simply don't provide them.  The expanders see no sync pattern and use SEQ_CST, exactly as they do today.

Sync patterns would only be provided by targets which do need to do something different.  This means there is no potential for introducing bugs on unaffected targets.

> > This also gives us an easier route to fixing any issues with the
> > acquire/release __sync primitives (__sync_lock_test_and_set and
> > __sync_lock_release) if we decide that these also need to be stronger than
> > their C++11 equivalents.
> 
> This seems like a chunky workaround to avoid having to add a barrier
> representation to gcc.  It's a, not necessarily straightforward, reworking
> of the middle-end to use patterns that would need to be added to
> architectures that don't currently have them and at least checked in the
> architectures that do.

Well, it actually returns us to the situation before they were merged. We used to look for sync patterns too... until I thought they were redundant.

> 
> This seems to come down to what enum memmodel is supposed to be for. If it
> is intended to represent the barriers provided by C11/C+11 atomics and
> nothing else than a workaround seems unavoidable. If it is meant to
> represent barriers needed by gcc to compile user code than, it seems to me,
> that it would be better to just add the barrier to the enum and update code
> where necessary. 
> 
> Since memmodel is relatively recent, was merged from what looks like the C++
> memory model branch (cxx-mem-model), and doesn't seem to have changed since
> then, it's maybe not surprising that it doesn't already include every thing
> needed by gcc. I don't see that adding to it should be prohibited, provided
> the additions can be show to be strictly required.

The intention was to deprecate __sync and support just the c++ memory model.  SEQ_CST == SYNC was the original intent and understanding, thus the code merge.  If we decide we want/need to provide a stronger form of SEQ_CST and call it SYNC, then we can do that as required... but I'm not envisioning a lot of future additional memory models. Although never say never I suppose.
Comment 32 torvald 2015-04-16 11:54:44 UTC
(In reply to James Greenhalgh from comment #28)
> (In reply to torvald from comment #24)
> > 3) We could do something just on ARM (and scan other arcs for similar
> > issues).  That's perhaps the cleanest option.
> 
> Which leaves 3). From Andrew's two proposed solutions:

3) Also seems best to me.  2) is worst, 1) is too much of a stick.

> This also gives us an easier route to fixing any issues with the
> acquire/release __sync primitives (__sync_lock_test_and_set and
> __sync_lock_release) if we decide that these also need to be stronger than
> their C++11 equivalents.

I don't think we have another case of different __sync vs. __atomic semantics in the case of __sync_lock_test_and_set.  The current specification makes it clear that this is an acquire barrier, and the way it describes the semantics (i.e., loads and stores that are program-order before the acquire op can move to after it) seems consistent with the effects C11 specifies for acquire MO (with perhaps the distinction that C11 is clear that an acquire needs to be paired with some release op to create an ordering constraint).

I'd say this basically also applies to __sync_lock_release, with the exception that the current documentation does not mention that stores can be speculated to before the barrier.  That seems to be an artefact of a TSO model.
However, I don't think this matters much, because what the release barrier allows one to do is reason that if one sees the barrier to have taken place (e.g., observes that the lock has been released), then all ops before the barrier will also be visible.
It does not guarantee that if one observes an effect that is after the barrier in program order, the barrier itself will necessarily have taken effect.  To be able to make that observation, one would have to ensure, using __sync ops, that the other effect is indeed after the barrier, which would mean using a release op for the other effect -- which would take care of things.

If everyone agrees with this reasoning, we probably should add documentation explaining this.

However, I guess some people relying on data races in their programs could (mis?)understand the __sync_lock_release semantics to mean that it is a means to get the equivalent of a C11 release *fence* -- which it is not, because the fence would apply to the (erroneously non-atomic) store after the barrier, which could lead one to believe that if one observes the store after the barrier, the fence must also be in effect.  Thoughts?
Comment 33 mwahab 2015-04-16 12:13:04 UTC
(In reply to torvald from comment #32)
> (In reply to James Greenhalgh from comment #28)
> > (In reply to torvald from comment #24)
> > > 3) We could do something just on ARM (and scan other arcs for similar
> > > issues).  That's perhaps the cleanest option.
> > 
> > Which leaves 3). From Andrew's two proposed solutions:
> 
> 3) Also seems best to me.  2) is worst, 1) is too much of a stick.
> 
> > This also gives us an easier route to fixing any issues with the
> > acquire/release __sync primitives (__sync_lock_test_and_set and
> > __sync_lock_release) if we decide that these also need to be stronger than
> > their C++11 equivalents.
> 
> I don't think we have another case of different __sync vs. __atomics
> semantics in case of __sync_lock_test_and_set.  The current specification
> makes it clear that this is an acquire barrier, and how it describes the
> semantics (ie, loads and stores that are program-order before the acquire op
> can move to after it) , this seems to be consistent with the effects C11
> specifies for acquire MO (with perhaps the distinction that C11 is clear
> that acquire needs to be paired with some release op to create an ordering
> constraint).

Thanks, I suspect that the acquire barrier may not be as much of an issue as I had remembered. (The issue came up while I was trying to understand the C11 semantics.)

The test case (aarch64) I have is:
----
int foo = 0;
int bar = 0;
int T5(void)
{
  int x = __sync_lock_test_and_set(&foo, 1);
  return bar;
}
----
.L11:
	ldaxr	w2, [x0]     ; load-acquire
	stxr	w3, w1, [x0] ; store
	cbnz	w3, .L11
	ldr	w0, [x0, 4]  ; load
	ret
----
My concern was that the load could be speculated ahead of the store. Since the store marks the end of the barrier, that could make it appear as if the load had completed before the acquire-barrier.

In retrospect, I don't think that there will be a problem because any load that could be moved would have to end up with the same value as if it had not moved.
Comment 34 Andrew Macleod 2015-04-16 12:25:25 UTC
> However, I guess some people relying on data races in their programs could
> (mis?)understand the __sync_lock_release semantics to mean that it is a
> means to get the equivalent of a C11 release *fence* -- which it is not
> because the fence would apply to the (erroneously non-atomic) store after
> the barrier, which could one lead to believe that if one observes the store
> after the barrier, the fence must also be in effect.  Thoughts?

Before we get too carried away, maybe we should return to what we *think* __sync is supposed to do. It represents a specific definition by Intel. From the original documentation for __sync "back in the day", all legacy uses of sync should expect this behaviour:


"The following builtins are intended to be compatible with those described
in the "Intel Itanium Processor-specific Application Binary Interface",
section 7.4.  As such, they depart from the normal GCC practice of using
the ``__builtin_'' prefix, and further that they are overloaded such that
they work on multiple types."

The definition of "barrier" from that documentation is :

acquire barrier : Disallows the movement of memory references to visible data from before the intrinsic (in program order) to after the intrinsic (this behavior is desirable at lock-release operations, hence the name).

release barrier: Disallows the movement of memory references to visible data from after the intrinsic (in program order) to before the intrinsic (this behavior is desirable at lock-acquire operations, hence the name).

full barrier: disallows the movement of memory references to visible data past the intrinsic (in either direction), and is thus both an acquire and a release barrier. A barrier only restricts the movement of memory references to visible data across the intrinsic operation: between synchronization operations (or in their absence), memory references to visible data may be freely reordered subject to the usual data-dependence constraints.

Caution: Conditional execution of a synchronization intrinsic (such as within an if or a while statement) does not prevent the movement of memory references to visible data past the overall if or while construct.
Comment 35 James Greenhalgh 2015-04-16 13:26:54 UTC
(In reply to torvald from comment #32)
> (In reply to James Greenhalgh from comment #28)
> > This also gives us an easier route to fixing any issues with the
> > acquire/release __sync primitives (__sync_lock_test_and_set and
> > __sync_lock_release) if we decide that these also need to be stronger than
> > their C++11 equivalents.
> 
> I don't think we have another case of different __sync vs. __atomics
> semantics in case of __sync_lock_test_and_set.  The current specification
> makes it clear that this is an acquire barrier, and how it describes the
> semantics (ie, loads and stores that are program-order before the acquire op
> can move to after it) , this seems to be consistent with the effects C11
> specifies for acquire MO (with perhaps the distinction that C11 is clear
> that acquire needs to be paired with some release op to create an ordering
> constraint).

I think that the question is which parts of an RMW operation with MEMMODEL_ACQUIRE semantics are ordered. My understanding is that in C++11 MEMMODEL_ACQUIRE only applies to the "load" half of the operation. So an observer to:

  atomic_flag_test_and_set_explicit(foo, memory_order_acquire)
  atomic_store_explicit (bar, 1, memory_order_relaxed)

Is permitted to observe a write to bar before a write to foo (but not before the read from foo).

My reading of the Itanium ABI is that the acquire barrier applies to the entire operation (Andrew, I think you copied these over exactly backwards in comment 34 ;) ):

  "Disallows the movement of memory references to visible data from
   after the intrinsic (in program order) to before the intrinsic (this
   behavior is desirable at lock-acquire operations, hence the name)."

The definition of __sync_lock_test_and_set is:

  "Behavior:
   • Atomically store the supplied value in *ptr and return the old value
     of *ptr. (i.e.)
       { tmp = *ptr; *ptr = value; return tmp; }
   • Acquire barrier."

So by the strict letter of the specification, no memory references to visible data should be allowed to move from after the entire body of the intrinsic to before it. That is to say in:

  __sync_lock_test_and_set (foo, 1)
  bar = 1

an observer should not be able to observe the write to bar before the write to foo. This is a difference from the C++11 semantics.

I'm not worried about __sync_lock_release, I think the documentation is strong enough and unambiguous.
Comment 36 mwahab 2015-04-16 13:29:41 UTC
(In reply to Andrew Macleod from comment #31)
> 
> 
> Targets that don't need special sync patterns (ie most of them) simply don't
> provide them.   The expanders see no sync pattern and use SEQ_CST, exactly
> like they do today.

For current targets that may be true, but it assumes that no new target will need the special sync patterns.

> sync patterns would only be provided by targets which do need to do
> something different.   This means there is no potential bug introduction on
> unaffected targets.. 

I was thinking of existing sync patterns in current backends which may have been made redundant by the atomics builtins but are still there. There's a danger that they'll get used even if they're not preferred for that target. There's also the question of whether they're getting tested, assuming that they were only ever generated for the __sync builtins.

> Well, it actually returns back to the situation before they were merged. We
> use to look for sync patterns too... until I thought they were redundant.

I believe that they are redundant (for __sync builtins anyway) since it looks like everything could be done through the atomics + new barrier.

> The intention was to deprecate __sync and support just the c++ memory model.
> SEQ_CST == SYNC was the original intent and understanding, thus the code
> merge.  If we decide we want/need to provide a stronger form of SEQ_CST and
> call it SYNC, then we can do that as required... but I'm not envisioning a
> lot of future additional memory models. Although never say never I suppose.

I don't expect that many models will be needed, the set should certainly be kept as small as possible. I think that extending it in this case is justified because of the gap between what is needed and what is now available. 

The choice seems to be between:
- continuing the move away from the syncs to the atomics, which makes the __sync and the __atomic builtins rely on one infrastructure; and
- keeping both the atomics and the syncs infrastructures, with individual targets choosing between them.

The first option seems better, not least because it reduces the number of things that need to be understood when dealing with synchronization primitives.
Comment 37 torvald 2015-04-17 15:38:38 UTC
(In reply to James Greenhalgh from comment #35)
> (In reply to torvald from comment #32)
> > (In reply to James Greenhalgh from comment #28)
> > > This also gives us an easier route to fixing any issues with the
> > > acquire/release __sync primitives (__sync_lock_test_and_set and
> > > __sync_lock_release) if we decide that these also need to be stronger than
> > > their C++11 equivalents.
> > 
> > I don't think we have another case of different __sync vs. __atomics
> > semantics in case of __sync_lock_test_and_set.  The current specification
> > makes it clear that this is an acquire barrier, and how it describes the
> > semantics (ie, loads and stores that are program-order before the acquire op
> > can move to after it) , this seems to be consistent with the effects C11
> > specifies for acquire MO (with perhaps the distinction that C11 is clear
> > that acquire needs to be paired with some release op to create an ordering
> > constraint).
> 
> I think that the question is which parts of a RMW operation with
> MEMMODEL_ACQUIRE semantics is ordered. My understanding is that in C++11
> MEMMODEL_ACQUIRE only applies to the "load" half of the operation. So an
> observer to:
> 
>   atomic_flag_test_and_set_explicit(foo, memory_order_acquire)
>   atomic_store_exlicit (bar, 1, memory_model_relaxed)
 
That depends on your observer.  When reasoning about C11, please let's use complete examples (and ideally, run them through cppmem).  For example, if the observation is done by a pair of relaxed-MO loads, bar==1 && foo==0 can be observed:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  {{{ {
    r1 = cas_strong(foo, 0, 1);
    bar.store(1, memory_order_relaxed);
  } ||| {
    r2 = bar.load(memory_order_relaxed).readsvalue(1); 
    r3 = foo.load(memory_order_relaxed).readsvalue(0); 
  } }}};
  return 0; }

> Is permitted to observe a write to bar before a write to foo (but not before
> the read from foo).

How does one observe a load in C11 (ie, so that "before the read from foo" is defined)?  You can constrain the reads-from of a load, but the load itself is not observable as an effect.

I think I get what you're trying to say regarding the "load half", but I would phrase it differently: if there could be another release store to foo that the CAS/TAS reads-from, then moving the store to bar before the load would risk creating a cycle between inter-thread happens-before and sequenced-before-created intra-thread happens-before.  (Note that the latter isn't visible to other threads though, except through additional inter-thread synchronization.)

Perhaps one could also say that moving the store to bar before the store to foo could result in the CAS/TAS not being atomic -- but this is only reliably observable if the foo observation is a CAS/TAS itself (so it's totally ordered with the other CAS), or if there is an inter-thread happens-before between the observation of bar and the store to bar (which means using a release store for bar).

I'm discussing these aspects because I think it matters which observations are actually possible in a reliable way.  It matters for C11, and it may matter for the __sync semantics as well because while the __sync functions promise TSO, we don't really do so for all (nonatomic) loads or stores anywhere else in the program...

IOW, can we actually come up with reliable and correct counter-examples (either using the C11 or __sync interfaces) for the guarantees we think ARM might not provide?

> My reading of the Itanium ABI is that the acquire barrier applies to the
> entire operation (Andrew, I think you copied these over exactly backwards in
> comment 34 ;) ):
> 
>   "Disallows the movement of memory references to visible data from
>    after the intrinsic (in program order) to before the intrinsic (this
>    behavior is desirable at lock-acquire operations, hence the name)."
> 
> The definition of __sync_lock_test_and_set is:
> 
>   "Behavior:
>    • Atomically store the supplied value in *ptr and return the old value
>      of *ptr. (i.e.)
>        { tmp = *ptr; *ptr = value; return tmp; }
>    • Acquire barrier."

Hmm.  I don't think this list is meant to be a sequence; __sync_lock_release has two items, first the assignment to the memory location, and *second* a release barrier.  If this were supposed to be a sequence, it wouldn't be correct for lock releases.  It rather seems that, somehow, the whole intrinsic is supposed to be a barrier, and be atomic.  Similarly, the __sync RMW ops just have a single full barrier listed under behavior.

> So by the strict letter of the specification, no memory references to
> visible data should be allowed to move from after the entire body of the
> intrinsic to before it. That is to say in:
> 
>   __sync_lock_test_and_set (foo, 1)
>   bar = 1
> 
> an observer should not be able to observe the write to bar before the write
> to foo. This is a difference from the C++11 semantics.

Can you clarify what this observer would look like?  I think we should look at both code examples that just use __sync for concurrent accesses, and examples that also use normal memory accesses as if those were guaranteed to be atomic.  Neither the __sync docs nor the psABI guarantee any of the latter AFAIK.  I don't think it would be unreasonable to argue that __sync doesn't make promises about non-DRF normal accesses, and so, strictly speaking, maybe programs can't in fact rely on any of the behaviors that are complicated to implement for ARM.  That's why I'd like to distinguish those two cases.

Currently, I couldn't say with confidence which semantics __sync really should have, and which semantics we can actually promise in practice.  It feels as if we need to have a more thorough model / understanding and more detailed info about the trade-offs before we can make a decision.  It feels kind of risky to just make a quick decision here that seems alright but which we don't truly understand -- OTOH, we want to deprecate __sync eventually, so maybe just going the super-conservative route is most practical.

> I'm not worried about __sync_lock_release, I think the documentation is
> strong enough and unambiguous.

Are you aware that GCC's __sync docs disallow store-store reordering across __sync_lock_release, whereas the psABI docs don't?
Comment 38 torvald 2015-04-17 15:48:30 UTC
(In reply to Andrew Macleod from comment #34)
> > However, I guess some people relying on data races in their programs could
> > (mis?)understand the __sync_lock_release semantics to mean that it is a
> > means to get the equivalent of a C11 release *fence* -- which it is not
> > because the fence would apply to the (erroneously non-atomic) store after
> > the barrier, which could one lead to believe that if one observes the store
> > after the barrier, the fence must also be in effect.  Thoughts?
> 
> before we get too carried away, maybe we should return to what we *think*
> __sync are suppose to do. It represents a specific definition by intel..
> From the original documentation for __sync "back in the day", and all legacy
> uses of sync should expect this behaviour:

The problem I see with that is that I don't think that just looking at the psABI gives you enough information to really reason about what you are allowed to do or not.  Strictly speaking, the psABI doesn't give you guarantees about normal memory accesses that are not data-race-free (through use of the __sync builtins).  Nonetheless, legacy code does use them in a combination with the __sync builtins.

Also, if you look at the IA-64 __sync_lock_release vs. GCC docs' __sync_lock_release, the latter is like x86/TSO.  Do you have any info on which other semantics __sync was supposed to adhere to?

One potential way to solve it would be to just require code that uses __sync to more or less implement an IA-64 or x86 memory model, modulo allowing compiler-reordering and optimization between adjacent non-__sync memory accesses.  This could be inefficient on ARM (see James' examples) and perhaps Power too (or not -- see Jakub's comments).
Comment 39 Andrew Macleod 2015-04-17 18:00:48 UTC
(In reply to torvald from comment #38)
> (In reply to Andrew Macleod from comment #34)
> > > However, I guess some people relying on data races in their programs could
> > > (mis?)understand the __sync_lock_release semantics to mean that it is a
> > > means to get the equivalent of a C11 release *fence* -- which it is not
> > > because the fence would apply to the (erroneously non-atomic) store after
> > > the barrier, which could one lead to believe that if one observes the store
> > > after the barrier, the fence must also be in effect.  Thoughts?
> > 
> > before we get too carried away, maybe we should return to what we *think*
> > __sync are suppose to do. It represents a specific definition by intel..
> > From the original documentation for __sync "back in the day", and all legacy
> > uses of sync should expect this behaviour:
> 
> The problem I see with that is that I don't think that just looking at the
> psABI gives you enough information to really reason about what you are
> allowed to do or not.  Strictly speaking, the psABI doesn't give you
> guarantees about normal memory accesses that are not data-race-free (through
> use of the __sync builtins).  Nonetheless, legacy code does use them in a
> combination with the __sync builtins.
> 
> Also, if you look at the IA-64 __sync_lock_release vs. GCC docs'
> __sync_lock_release, the latter is like x86/TSO.  Do you have any info on
> which other semantics __sync was supposed to adhere to?
>
 
no, __sync was simply an implementation of psABI back when it was new... I'm not aware of any additions, enhancements or guarantees that were added when it was ported to other arch's.  

Terminology was much looser 14 years ago :-)  That's one of the reasons we want to migrate to __atomic... it is supposedly more precisely defined, whereas __sync had some hand-waving. We're now experiencing some different interpretations of that.  Regardless of the documentation, we didn't think we'd be supporting something stronger than SEQ_CST, since they were supposed to be equivalent... 

when rth gets back he may have some opinion as well.



> One potential way to solve it would be to just require code that uses __sync
> to more or less implement an IA-64 or x86 memory model, modulo allowing
> compiler-reordering and optimization between adjacent non-__sync memory
> accesses.  This could be inefficient on ARM (see James' examples) and
> perhaps Power too (or not -- see Jakub's comments).
Comment 40 mwahab 2015-04-20 13:42:32 UTC
(In reply to Andrew Macleod from comment #25)
 
> Documentation needs updating for sure... The rules have changed under us
> since originally SEQ_CST and sync were intended to be the same thing...
> Guess I shouldn't have tied our horse to the C++11 cart :-)

I've submitted a patch to try to update the __atomics documentation: https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01025.html.
Comment 41 mwahab 2015-04-20 15:17:09 UTC
(In reply to torvald from comment #38)
> (In reply to Andrew Macleod from comment #34)
> 
> Also, if you look at the IA-64 __sync_lock_release vs. GCC docs'
> __sync_lock_release, the latter is like x86/TSO.  Do you have any info on
> which other semantics __sync was supposed to adhere to?
> 
> One potential way to solve it would be to just require code that uses __sync
> to more or less implement an IA-64 or x86 memory model, modulo allowing
> compiler-reordering and optimization between adjacent non-__sync memory
> accesses.  This could be inefficient on ARM (see James' examples) and
> perhaps Power too (or not -- see Jakub's comments).

If the __sync barriers are as described in the GCC manual, i.e. that a barrier is atomic and its restrictions apply to all data references, then the Aarch64 backend doesn't currently emit strong enough barriers.

For MEMMODEL_SEQ_CST, the problem was visible enough and the solution I suggested (extending the set of available memmodel types) was simple enough that the changes it would need could be justified. I don't think that's true for the MEMMODEL_ACQUIRE case which seems to be much less likely to be seen and would be rather more disruptive.

I believe that Aarch64 is the only current target where the code needs to be strengthened. Since extending the set of memmodels is difficult to justify and (IMO) so is resurrecting the __sync patterns, I suggest just adding a target hook to allow the expansion of __sync calls to be overridden. That will allow Aarch64 to set a target-specific memmodel value, as is currently allowed, which can then be passed through the existing __atomics mechanisms in the middle-end through to the Aarch64 backend. No other backend will need to be touched.

If it happens that future architectures have a similar problem then we can reconsider whether any changes need to be made in the target-independent expansions.

Does that sound like a reasonable approach?
Comment 42 Richard Henderson 2015-04-21 18:51:20 UTC
(In reply to Andrew Macleod from comment #39)
> no, __sync was simply an implementation of psABI back when it was new... I'm
> not aware of any additions, enhancements or guarantees that were added when
> it was ported to other arch's.  
> 
> Terminology was much looser 14 years ago :-)  That's one of the reasons we
> want to migrate to __atomic... it is supposedly more precisely defined,
> whereas __sync had some hand-waving.  We're now experiencing some different
> interpretations of that.    Regardless of the documentation, we didn't think
> we'd be supporting something stronger than SEQ_CST since they were suppose
> to be equivalent... 

Exactly right.  I don't believe there's anywhere we can look for 
more definitive semantics than the psABI.  And as already explored
here, that's not entirely helpful.
Comment 43 James Greenhalgh 2015-04-28 15:32:01 UTC
(In reply to torvald from comment #37)
> (In reply to James Greenhalgh from comment #35)
> > (In reply to torvald from comment #32)
> > > (In reply to James Greenhalgh from comment #28)
> > > > This also gives us an easier route to fixing any issues with the
> > > > acquire/release __sync primitives (__sync_lock_test_and_set and
> > > > __sync_lock_release) if we decide that these also need to be stronger than
> > > > their C++11 equivalents.
> > > 
> > > I don't think we have another case of different __sync vs. __atomics
> > > semantics in case of __sync_lock_test_and_set.  The current specification
> > > makes it clear that this is an acquire barrier, and how it describes the
> > > semantics (ie, loads and stores that are program-order before the acquire op
> > > can move to after it) , this seems to be consistent with the effects C11
> > > specifies for acquire MO (with perhaps the distinction that C11 is clear
> > > that acquire needs to be paired with some release op to create an ordering
> > > constraint).
> > 
> > I think that the question is which parts of a RMW operation with
> > MEMMODEL_ACQUIRE semantics is ordered. My understanding is that in C++11
> > MEMMODEL_ACQUIRE only applies to the "load" half of the operation. So an
> > observer to:
> > 
> >   atomic_flag_test_and_set_explicit(foo, memory_order_acquire)
> >   atomic_store_exlicit (bar, 1, memory_model_relaxed)
>  
> That depends on your observer.  When reasoning about C11, please let's use
> complete examples (and ideally, run them through cppmem).  For example, if
> the observation is done by a pair of relaxed-MO loads, bar==1 && foo==0 can
> be observed:
> 
> int main() {
>   atomic_int foo = 0; 
>   atomic_int bar = 0; 
>   {{{ {
>     r1 = cas_strong(foo, 0, 1);
>     bar.store(1, memory_order_relaxed);
>   } ||| {
>     r2 = bar.load(memory_order_relaxed).readsvalue(1); 
>     r3 = foo.load(memory_order_relaxed).readsvalue(0); 
>   } }}};
>   return 0; }

Thanks for that, and sorry again for my sloppy terminology. I did try to write a
cppmem example, but I couldn't find the syntax for a CAS. If I could have, this is
what I would have written, and it gets the results I had expected and was trying to express.

> 
> > Is permitted to observe a write to bar before a write to foo (but not before
> > the read from foo).
> 
> How does one observe a load in C11 (ie, so that "before the read from foo"
> is defined)?  You can constrain the reads-from of a load, but the load
> itself is not observable as an effect.

Sorry, this is ARM memory model terminology leaking across to my C11 examples, but
even then what I was saying doesn't make sense. To observe a load in the ARM model:

"A read of a location in memory is said to be observed by an observer when a
 subsequent write to the location by the same observer has no effect on the value
 returned by the read."
 
But you are right, this is not interesting to model.


> I think I get what you're trying to say regarding the "load half" but I
> would phrase it differently: If there could be another release store to foo
> that the CAS/TAS reads-from, then moving the store to bar before the load
> would risk creating a cycle between inter-thread happens-before and
> sequenced-before-created intra-thread happens-before.  (Note that the later
> isn't visible to other threads though, except through additional
> inter-thread synchronization.)
> 
> Perhaps one could also say that moving the store to bar before the store to
> foo could result in the CAS/TAS not being atomic -- but this is just
> reliably observable if the foo observation is a CAS/TAS by itself (so it's
> totally ordered with the other CAS), or if there is a inter-thread
> happens-before between the observation of bar and the store to bar (which
> means using a release store for bar).
> 
> I'm discussing these aspects because I think it matters which observations
> are actually possible in a reliable way.  It matters for C11, and it may
> matter for the __sync semantics as well because while the __sync functions
> promise TSO, we don't really do so for all (nonatomic) loads or stores
> anywhere else in the program...
> 
> IOW, can we actually come up with reliable and correct counter-examples
> (either using the C11 or __sync interfaces) for the guarantees we think ARM
> might not provide?
> 
> > My reading of the Itanium ABI is that the acquire barrier applies to the
> > entire operation (Andrew, I think you copied these over exactly backwards in
> > comment 34 ;) ):
> > 
> >   "Disallows the movement of memory references to visible data from
> >    after the intrinsic (in program order) to before the intrinsic (this
> >    behavior is desirable at lock-acquire operations, hence the name)."
> > 
> > The definition of __sync_lock_test_and_set is:
> > 
> >   "Behavior:
> >    • Atomically store the supplied value in *ptr and return the old value
> >      of *ptr. (i.e.)
> >        { tmp = *ptr; *ptr = value; return tmp; }
> >    • Acquire barrier."
> 
> Hmm.  I don't think this list is meant to be a sequence; __sync_lock_release
> has two items, first the assignment to the memory location, and *second* a
> release barrier.  If this were supposed to be a sequence, it wouldn't be
> correct for lock releases.  It rather seems that, somehow, the whole
> intrinsic is supposed to be a barrier, and be atomic.  Similarly, the __sync
> RMW ops just have a single full barrier listed under behavior.

Yes, I agree and that was the interpretation I was getting at. I believe that
this is a consequence of the Itanium "cmpxchg" instruction, which implements the
intrinsic in one instruction, and thus does not need to care about its decomposition.

> > So by the strict letter of the specification, no memory references to
> > visible data should be allowed to move from after the entire body of the
> > intrinsic to before it. That is to say in:
> > 
> >   __sync_lock_test_and_set (foo, 1)
> >   bar = 1
> > 
> > an observer should not be able to observe the write to bar before the write
> > to foo. This is a difference from the C++11 semantics.
> 
> Can you clarify how this observer would look like?  I think we should look
> at both code examples that just use __sync for concurrent accesses, and
> examples that also use normal memory accesses as if those would be
> guaranteed to be atomic.  None of the __sync docs nor the psABI guarantee
> any of the latter AFAIK.  I don't think it would be unreasonable to argue
> that __sync doesn't make promises about non-DRF normal accesses, and so,
> strictly speaking, maybe programs can't in fact rely on any of the behaviors
> that are complicated to implement for ARM.  That's why I'd like to
> distinguish those two cases.

Sure, it would look much like your cppmem example above, though you can add some
barriers to the observer if you want to make a stronger example:

bar = 0, foo = 0;

thread_a {
  __sync_lock_test_and_set (foo, 1)
  bar = 1
}

thread_b {
  /* If we can see the write to bar, the write
     to foo must also have happened.  */
  if (bar) /* Reads 1.  */
   assert (foo) /* Should never read 0.  */
}

> Currently, I couldn't say with confidence which semantics __sync really
> should have, and which semantics we can actually promise in practice.  It
> feels as if we need to have a more thorough model / understanding and more
> detailed info about the trade-offs before we can make a decision.  It feels
> kind of risky to just make a quick decision here that seems alright but
> which we don't truly understand -- OTOH, we want to deprecate __sync
> eventually, so maybe just going the super-conservative route is most
> practical.

Yes agreed.

> > I'm not worried about __sync_lock_release, I think the documentation is
> > strong enough and unambiguous.
> 
> Are you aware that the GCC's __sync disallow store-store reordering across
> __sync_lock_release, whereas the psABI docs don't?

No I was not, and even looking for the exact text you were referring to, it took
me three attempts to spot it. Yes, I understand now why you are concerned about
the GCC wording. Perhaps this is just an artefact of a mistake in transcribing
the psABI?

AArch64 is certainly not providing that guarantee just now.
Comment 44 mwahab 2015-04-29 08:47:17 UTC
(In reply to James Greenhalgh from comment #43)
> (In reply to torvald from comment #37)
> 
> > > I'm not worried about __sync_lock_release, I think the documentation is
> > > strong enough and unambiguous.
> > 
> > Are you aware that the GCC's __sync disallow store-store reordering across
> > __sync_lock_release, whereas the psABI docs don't?
> 
> No I was not, and even looking exactly for the text you were referring to,
> it took
> me three attempts to spot it. Yes, I understand now why you are concerned
> about
> the GCC wording. Perhaps this is just an artefact of a mistake transcribing
> the psABI?
> 
> AArch64 is certainly not providing that guarantee just now.

I wasn't able to find the text, could you copy it in a reply? 
In the GCC manual, the only description of __sync_lock_release behaviour is in the last paragraph. That description seems consistent with the function being a release barrier and with the current Aarch64 code generated for it.
Comment 45 James Greenhalgh 2015-04-29 12:20:28 UTC
(In reply to mwahab from comment #44)
> (In reply to James Greenhalgh from comment #43)
> > (In reply to torvald from comment #37)
> > 
> > > > I'm not worried about __sync_lock_release, I think the documentation is
> > > > strong enough and unambiguous.
> > > 
> > > Are you aware that the GCC's __sync disallow store-store reordering across
> > > __sync_lock_release, whereas the psABI docs don't?
> > 
> > No I was not, and even looking exactly for the text you were referring to,
> > it took
> > me three attempts to spot it. Yes, I understand now why you are concerned
> > about
> > the GCC wording. Perhaps this is just an artefact of a mistake transcribing
> > the psABI?
> > 
> > AArch64 is certainly not providing that guarantee just now.
> 
> I wasn't able to find the text, could you copy it in a reply. 
> In the GCC manual, the only description of __sync_lock_release behaviour is
> in the last paragraph. That descriptions seems consistent with the function
> being a release barrier and with the current Aarch64 code generated for it.

Yes, so this part is fine...

  void __sync_lock_release (type *ptr, ...)

  This built-in function releases the lock acquired by
  __sync_lock_test_and_set. Normally this means writing the
  constant 0 to *ptr.

  This built-in function is not a full barrier, but rather a release barrier.
  This means that all previous memory stores are globally visible, and all
  previous memory loads have been satisfied,

And this final sentence is buggy by omission of a mention of memory writes:

  but following memory reads are not prevented from being speculated to
  before the barrier.

Which can be read as forbidding store-store reordering across the barrier.
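
To make the concern concrete, here is a minimal sketch of the store-store question; the variable names are invented and this is not from any real code base, it only restates the point above:
----
int data;
int lock = 1;
int flag;

void unlock_then_set_flag (void)
{
  data = 42;                    /* previous store: globally visible
                                   before the release on both readings */
  __sync_lock_release (&lock);  /* writes 0 to lock; release barrier   */
  flag = 1;                     /* the question: may this store become
                                   visible before the store to lock?
                                   psABI: yes; the GCC wording, read
                                   literally: no                       */
}
----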

As I say though, I think this is an artefact of transcribing the documentation, rather than an attempt to provide stronger ordering semantics in the GCC implementation.
Comment 46 mwahab 2015-04-29 13:26:39 UTC
(In reply to James Greenhalgh from comment #45)
> (In reply to mwahab from comment #44)
>
> And this final sentence is buggy by omission of a mention of memory writes:
> 
>   but following memory reads are not prevented from being speculated to
>   before the barrier.
> 
> Which can be read as forbidding store-store reordering across the barrier.

It doesn't say anything about store-store reordering so it's trumped by the stated intention to be compatible with the psABI.

> As I say though, I think this is an artefact of transcribing the
> documentation, rather than an attempt to provide stronger ordering semantics
> in the GCC implementation.

Agreed.
Comment 47 Andrew Macleod 2015-04-29 16:04:38 UTC
Created attachment 35425 [details]
potential patch to add MEMMODEL_SYNC

I don't know where we've finally settled on this, but I did prototype out what's involved in adding a new MEMMODEL_SYNC which can be identified and made stronger than MEMMODEL_SEQ_CST if so desired. Normally it will behave exactly like MEMMODEL_SEQ_CST.  This bootstraps on x86-64 with no testsuite regressions.

Other than using the new memmodel in the expanders for the __sync routines, there were 3 common idioms that needed changing (sketched below):
- a few places assumed SEQ_CST was the end of the enumeration; those were changed to check for MEMMODEL_LAST instead. 
- if (model == MEMMODEL_SEQ_CST) needed changing to if (model >= MEMMODEL_SEQ_CST).
- case MEMMODEL_SEQ_CST: needed a case MEMMODEL_SYNC: added.
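
As a rough illustration, here is a self-contained sketch of those three idioms; the enum layout and helper function are invented for illustration and are not the actual GCC definitions:
----
#include <stdio.h>

enum memmodel {
  MEMMODEL_RELAXED,
  MEMMODEL_CONSUME,
  MEMMODEL_ACQUIRE,
  MEMMODEL_RELEASE,
  MEMMODEL_ACQ_REL,
  MEMMODEL_SEQ_CST,
  MEMMODEL_SYNC,   /* new: stronger than SEQ_CST, used by the __sync expanders */
  MEMMODEL_LAST    /* idiom 1: end-of-enumeration checks now use this          */
};

static void expand_full_barrier (enum memmodel model)
{
  /* idiom 2: '==' becomes '>=' so MEMMODEL_SYNC also takes this path.  */
  if (model >= MEMMODEL_SEQ_CST)
    printf ("emit a full barrier\n");

  /* idiom 3: switches gain a MEMMODEL_SYNC case next to SEQ_CST.  */
  switch (model)
    {
    case MEMMODEL_SYNC:   /* a target may choose to emit something stronger here */
    case MEMMODEL_SEQ_CST:
      printf ("sequentially consistent (or stronger) handling\n");
      break;
    default:
      printf ("weaker memory model\n");
      break;
    }
}

int main (void)
{
  expand_full_barrier (MEMMODEL_SYNC);
  return 0;
}
----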

It's actually not that invasive, and should have no impact on any existing target.  If a target feels it needs to strengthen the __sync cases, it can look for MEMMODEL_SYNC in its rtl atomic patterns and do something different.

I did a quick and dirty test on x86_64 to make the atomic_sub pattern issue an extra lock if it sees MEMMODEL_SYNC, and that worked fine.

MEMMODEL_SYNC is not exposed in any way to the user; it is only generated from the __sync expanders and the compare_swap loop generator.

If someone wants to fool around with it and see if this resolves their issues, go for it.  

If we decide we don't need to support a stronger __sync model and leave things as originally intended, then we simply don't use the patch.

If we do decide to go ahead with it, I will also do a comparison run on the __sync tests in the testsuite to verify that the code generated is in fact identical and I didn't miss anything.
Comment 48 mwahab 2015-04-30 08:20:03 UTC
(In reply to Andrew Macleod from comment #47)
> Created attachment 35425 [details]
> potential patch to add MEMMODEL_SYNC
> 
> I don't know where we've finally settled on this, but I did prototype out
> what involved in adding a new MEMMODEL_SYNC which can be identified and made
> stronger than MEMMODEL_SEQ_CST if so desired. Normally it will behave
> exactly like MEMMODEL_SEQ_CST.   This bootstraps on x86-64 with no testsuite
> regressions.
> 

I've been working on patches for this. Adding a new MEMMODEL type was my first approach because it means that gcc has one representation for all the memory models it supports. I decided not to submit the patch because Aarch64 also needs a tag for the __sync acquire used in __sync_lock_test_and_set. Adding that touches more backends than MEMMODEL_SYNC does, and some of those changes are not obvious (for the mips backend in particular). The reasons why Aarch64 needs a new acquire barrier are rather tenuous (they're in the earlier comments) and don't seem to justify the possibly invasive changes that would be needed. 

The approach I'm working on now is to add a target hook to allow a backend to supply its own expansion of calls to the __sync builtins. That makes for smaller changes in the target-independent code and lets the Aarch64 backend add the new barriers as target-specific memory models. The actual code generation goes through the infrastructure for the atomics.

Adding the __sync barriers to coretypes.h is the better approach if more than one backend has this problem. Since it seems that only Aarch64 is affected, I think it's better to go with the target hook. If a new architecture gets added with the same problem, it will be easy enough to switch to the coretypes.h approach.
Comment 49 torvald 2015-05-01 12:52:53 UTC
(In reply to James Greenhalgh from comment #43)
> (In reply to torvald from comment #37)
> > (In reply to James Greenhalgh from comment #35)
> > > So by the strict letter of the specification, no memory references to
> > > visible data should be allowed to move from after the entire body of the
> > > intrinsic to before it. That is to say in:
> > > 
> > >   __sync_lock_test_and_set (foo, 1)
> > >   bar = 1
> > > 
> > > an observer should not be able to observe the write to bar before the write
> > > to foo. This is a difference from the C++11 semantics.
> > 
> > Can you clarify how this observer would look like?  I think we should look
> > at both code examples that just use __sync for concurrent accesses, and
> > examples that also use normal memory accesses as if those would be
> > guaranteed to be atomic.  None of the __sync docs nor the psABI guarantee
> > any of the latter AFAIK.  I don't think it would be unreasonable to argue
> > that __sync doesn't make promises about non-DRF normal accesses, and so,
> > strictly speaking, maybe programs can't in fact rely on any of the behaviors
> > that are complicated to implement for ARM.  That's why I'd like to
> > distinguish those two cases.
> 
> Sure, it would look much like your cppmem example above, though you can add
> some
> barriers to the observer if you want to make a stronger example
> 
> bar = 0, foo = 0;
> 
> thread_a {
>   __sync_lock_test_and_set (foo, 1)
>   bar = 1
> }
> 
> thread_b {
>   /* If we can see the write to bar, the write
>      to foo must also have happened.  */
>   if (bar) /* Reads 1.  */
>    assert (foo) /* Should never read 0.  */
> }

This is the case of allowing non-DRF normal accesses.  The *other* case I was thinking about is what the test would have to look like when *not* allowing them.  One way to do it would be:

thread_a {
  __sync_lock_test_and_set (foo, 1)
  __sync_lock_test_and_set (bar, 1) // or __sync_lock_release, or __sync RMW
}

thread_b {
  if (__sync_fetch_and_add (bar, 0))
    assert (foo)  // DRF if thread_a's write is the final one
}

In this case, would the current ARM implementation still produce insufficient code?  If not, at least in this test case, we could argue that there's nothing wrong with what ARM does.  (The question of whether we want to require DRF strictly for __sync usage is of course still open.)

> > > I'm not worried about __sync_lock_release, I think the documentation is
> > > strong enough and unambiguous.
> > 
> > Are you aware that the GCC's __sync disallow store-store reordering across
> > __sync_lock_release, whereas the psABI docs don't?
> 
> No I was not, and even looking exactly for the text you were referring to,
> it took
> me three attempts to spot it. Yes, I understand now why you are concerned
> about
> the GCC wording. Perhaps this is just an artefact of a mistake transcribing
> the psABI?

I suppose so, but can't say for sure.  Somebody might have had x86 TSO in mind when writing that, and perhaps not just accidentally.

However, we say psABI is the reference spec, so I agree we should change the GCC __sync_lock_release docs accordingly.
Comment 50 Andrew Macleod 2015-05-06 14:25:18 UTC
Created attachment 35478 [details]
implement SYNC flag for memory model

> I've been working on patches for this. Adding a new MEMMODEL type was my
> first approach because it means that gcc has one representation for all the
> memory models it supports. I decided not to submit the patch because Aarch64
> also needs an tag for the __sync acquire used in __sync_lock_test_and_set.
> Adding that touches more backends than the MEMMODEL_SYNC and some of those
> changes are not obvious (for the mips backend in particular). The reasons
> why Aarch64 needs a new acquire barrier are rather tenuous (they're in the
> earlier comments) and don't seem to justify the possibly invasive changes
> that would be needed. 
> 
> The approach I'm working on now is to add a target hook to allow a backend
> to supply its own expansion of calls to the __sync builtins. That makes for
> smaller changes in the target-independent code and lets the Aarch64 backend
> add the new barriers as target-specific memory models. The actual code
> generation goes through the infrastructure for the atomics.
> 
> Adding the __sync barriers to coretypes.h is the better approach if more
> than one backend has this problem. Since it seems that only Aarch64 is
> affected, I think its better to go with the target hook. If a new
> architecture gets added with the same problem, it will be easy enough to
> switch to the coretypes.h approach.

Well, if it affects some target now, it's likely to affect some target in the future as well. Didn't someone also mention there is a potential issue with PPC somewhere too?

In any case, I'm not a fan of adding target hooks for this... We ought to just bite the bullet and integrate it cleanly into the memory model. 

I have a patch which adds a SYNC flag to the upper bit of the memory model. It modifies all the generic code and target patterns to use access routines (declared in tree.h) instead of hard-coding masks and flags (which probably should have been done in the first place). This makes carrying around the SYNC flag transparent until you want to look for it.

There are new sync enum variants for SYNC_ACQUIRE, SYNC_SEQ_CST, and SYNC_RELEASE. 

The new access routines ignore the sync flag, so "is_mm_acquire (model)" will be true for MEMMODEL_ACQUIRE and MEMMODEL_SYNC_ACQUIRE. If a target doesn't care about differentiating sync (which is most targets) it will be transparent.

It is also simple to check for the presence of sync, "is_mm_sync (model)", or for a particular MEMMODEL_SYNC variant, "model == MEMMODEL_SYNC_SEQ_CST". A quick hack to a couple of x86 patterns indicates this works and that I can issue extra/different barriers for sync variants.
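
For reference, a self-contained sketch of the flag-plus-accessor scheme follows; the numeric values and this cut-down enum are assumptions for illustration only, the real definitions being in coretypes.h and tree.h:
----
#include <stdbool.h>
#include <stdio.h>

#define MEMMODEL_BASE_MASK 0x7        /* low bits hold the C++11 model   */
#define MEMMODEL_SYNC      (1 << 3)   /* upper bit marks a __sync origin */

enum memmodel {
  MEMMODEL_RELAXED      = 0,
  MEMMODEL_ACQUIRE      = 2,
  MEMMODEL_RELEASE      = 3,
  MEMMODEL_SEQ_CST      = 5,
  MEMMODEL_SYNC_ACQUIRE = MEMMODEL_ACQUIRE | MEMMODEL_SYNC,
  MEMMODEL_SYNC_RELEASE = MEMMODEL_RELEASE | MEMMODEL_SYNC,
  MEMMODEL_SYNC_SEQ_CST = MEMMODEL_SEQ_CST | MEMMODEL_SYNC
};

/* The accessors strip the SYNC flag, so targets that don't care never see it.  */
static enum memmodel memmodel_base (enum memmodel model)
{
  return (enum memmodel) (model & MEMMODEL_BASE_MASK);
}

static bool is_mm_acquire (enum memmodel model)
{
  return memmodel_base (model) == MEMMODEL_ACQUIRE;
}

static bool is_mm_sync (enum memmodel model)
{
  return (model & MEMMODEL_SYNC) != 0;
}

int main (void)
{
  enum memmodel model = MEMMODEL_SYNC_ACQUIRE;
  /* True for both MEMMODEL_ACQUIRE and MEMMODEL_SYNC_ACQUIRE...  */
  printf ("is_mm_acquire: %d\n", is_mm_acquire (model));
  /* ...while is_mm_sync lets a target emit a stronger barrier.  */
  printf ("is_mm_sync: %d\n", is_mm_sync (model));
  return 0;
}
----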

This compiles on all targets, but is only runtime tested on x86_64-unknown-linux-gnu with no regressions.

With this you should be able to easily issue whatever different insn sequence you need to within an atomic pattern.   Give it a try.  If it works as required, then we'll see about executing on all targets for testing...
Comment 51 mwahab 2015-05-06 15:58:12 UTC
(In reply to Andrew Macleod from comment #50)
> Created attachment 35478 [details]
> implement SYNC flag for memory model
> 
> > Adding the __sync barriers to coretypes.h is the better approach if more
> > than one backend has this problem. Since it seems that only Aarch64 is
> > affected, I think its better to go with the target hook. If a new
> > architecture gets added with the same problem, it will be easy enough to
> > switch to the coretypes.h approach.
> 
> Well, if it affects some target now, it likely to affect some target in the
> future as well. Didn't someone also mention there is a potential issue with
> PPC somewhere too?

The earlier comment (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24) suggests that PPC generates full barriers so won't be affected by this. It looks like Aarch64 is the only current target with the problem.

> In any case, I'm not a fan of adding target hooks for this... We ought to
> just bite the bullet and integrate it cleanly into the memory model. 

OK.

> I have a patch which adds a SYNC flag to the upper bit of the memory model.
> It modifies all the generic code and target patterns to use access routines
> (declared in tree.h) instead of hard coding masks and flags (which probably
> should have be done in the first place). This makes carrying around the SYNC
> flag transparent until you want to look for it.

From a quick look, this seems to do what's needed to allow the Aarch64 backend to distinguish the sync and atomic code.
 
> This compiles on all targets, but is only runtime tested on
> x86_64-unknown-linux-gnu with no regressions.

The mips backend was the only one that stood out as needing some care, because the way it uses the memory models (e.g. in the function mips_process_sync_loop) is a little different from the other backends.

> With this you should be able to easily issue whatever different insn
> sequence you need to within an atomic pattern.   Give it a try.  If it works
> as required, then we'll see about executing on all targets for testing...

I'll give it a go.
Comment 52 Andrew Macleod 2015-05-06 16:26:18 UTC
(In reply to mwahab from comment #51)

> The mips backend was the only one that stood out as needing some care,
> because the way it uses the memory models (e.g. in function
> mips_process_sync_loop) is a little different from the backends.
> 

Yeah, it looks a bit wonky, but it doesn't need anything special when I looked closer. The 11 and 10 are special overrides on a couple of patterns; the rest are just normal memory model values from a specified operand.

I did miss a conversion there that has no real effect but should be put in place...  I'll add it to my patches.


*** mips.c	2015-05-06 12:15:04.145423200 -0400
--- BAK/mips.c	2015-05-06 12:12:57.265671466 -0400
*************** mips_process_sync_loop (rtx_insn *insn,
*** 13111,13117 ****
        model = MEMMODEL_ACQUIRE;
        break;
      default:
!       model = memmodel_from_int (INTVAL (operands[memmodel_attr]));
      }
  
    mips_multi_start ();
--- 13111,13117 ----
        model = MEMMODEL_ACQUIRE;
        break;
      default:
!       model = (enum memmodel) INTVAL (operands[memmodel_attr]);
      }
  
    mips_multi_start ();
Comment 53 mwahab 2015-05-07 13:25:16 UTC
(In reply to Andrew Macleod from comment #50)
> Created attachment 35478 [details]
> implement SYNC flag for memory model
> 
> This compiles on all targets, but is only runtime tested on
> x86_64-unknown-linux-gnu with no regressions.

The patch passes check-gcc for aarch64-none-linux-gnu with no new failures. I'll also run the tests on arm.

> With this you should be able to easily issue whatever different insn
> sequence you need to within an atomic pattern.   Give it a try.  If it works
> as required, then we'll see about executing on all targets for testing...

I modified the Aarch64 backend to use the new is_mm_sync test to generate stronger barriers when needed and it all works as required.
Comment 54 mwahab 2015-05-08 08:01:41 UTC
(In reply to mwahab from comment #53)
> (In reply to Andrew Macleod from comment #50)
> > Created attachment 35478 [details]
> > implement SYNC flag for memory model
> > 
> > This compiles on all targets, but is only runtime tested on
> > x86_64-unknown-linux-gnu with no regressions.
> 
> The patch passes check-gcc for aarch64-none-linux-gnu with no new failures.
> I'll also run the tests on arm.
> 

Also passes check-gcc for arm-none-eabi.
Comment 55 James Greenhalgh 2015-05-11 13:44:13 UTC
(In reply to torvald from comment #49)
> > bar = 0, foo = 0;
> > 
> > thread_a {
> >   __sync_lock_test_and_set (foo, 1)
> >   bar = 1
> > }
> > 
> > thread_b {
> >   /* If we can see the write to bar, the write
> >      to foo must also have happened.  */
> >   if (bar) /* Reads 1.  */
> >    assert (foo) /* Should never read 0.  */
> > }
> 
> This is the case of allowing non-DRF normal accesses.  The *other* case I
> was thinking about is how the test would have to look like when *not*
> allowing them.  One way to do it would be:
> 
> thread_a {
>   __sync_lock_test_and_set (foo, 1)
>   __sync_lock_test_and_set (bar, 1) // or __sync_lock_release, or __sync RMW
> }
> 
> thread_b {
>   if (__sync_fetch_and_add (bar, 0))
>     assert (foo)  // DRF if thread_a's write is the final one
> }
> 
> In this case, would the current ARM implementation still produce
> insufficient code?  If not, at least in this test case, we could argue that
> there's nothing wrong with what ARM does.  (The question whether we want to
> require DRF strictly for __sync usage is of course still open.)

In this case, the current implementation would be fine. Thread A looks like this:

thread_a:
	adrp	x0, foo
	mov	w1, 1
	ldr	x0, [x0, #:lo12:foo]
.L2:
	ldaxr	w2, [x0] /* Load acquire foo.  */
	stxr	w3, w1, [x0] /* Store release foo.  */
	cbnz	w3, .L2 /* Branch if not exclusive access.  */
	adrp	x0, bar
	ldr	x0, [x0, #:lo12:bar]
.L3:
	ldaxr	w2, [x0] /* Load acquire bar.  */
	stxr	w3, w1, [x0] /* Store release bar.  */
	cbnz	w3, .L3
	ret

And the architecture gives a specific requirement on the ordering of store-release and load-acquire:

A Store-Release followed by a Load-Acquire is observed in program order by any observers that are in both:
  — The shareability domain of the address accessed by the Store-Release.
  — The shareability domain of the address accessed by the Load-Acquire.

So yes, I think in this case we could argue that there is nothing wrong with what ARM does. However, I would expect non-DRF code to be much more common in the wild, so I think we still need to deal with this issue. (It is a shame that the DRF code you provided will suffer an extra barrier once Matthew's and Andrew's work is applied, but this is a corner case we probably don't want to put too much thought into working around.)
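
For anyone wanting to experiment, a compilable rendering of that DRF variant could look like the following. This is an illustration built from the quoted pseudo-code, not a committed testcase; thread-creation boilerplate is left out.

----
/* Both threads access the shared flags only through __sync builtins, so the
   program is data-race-free.  Each function is intended to run in its own
   thread; inspect the generated AArch64/ARM assembly to see which barriers
   the compiler emits around the operations.  */
int foo, bar;

void
thread_a (void)
{
  __sync_lock_test_and_set (&foo, 1);
  __sync_lock_test_and_set (&bar, 1);
}

void
thread_b (void)
{
  if (__sync_fetch_and_add (&bar, 0))
    {
      /* If bar was observed as 1, foo must already be 1.  */
      if (__sync_fetch_and_add (&foo, 0) == 0)
        __builtin_trap ();
    }
}
----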
Comment 56 mwahab 2015-05-11 15:58:33 UTC
(In reply to James Greenhalgh from comment #55)
> (In reply to torvald from comment #49)
>
> > This is the case of allowing non-DRF normal accesses.  The *other* case I
> > was thinking about is what the test would have to look like when *not*
> > allowing them.  One way to do it would be:
> > 
> > thread_a {
> >   __sync_lock_test_and_set (foo, 1)
> >   __sync_lock_test_and_set (bar, 1) // or __sync_lock_release, or __sync RMW
> > }
>
> [..] (It is
> a shame that the DRF code you provided will suffer an extra barrier once
> Matthew's and Andrew's work is applied, but this is a corner case we
> probably don't want to put too much thought into working around.)

I'm not familiar with the code, but I would have thought it would be straightforward to optimize away the first dmb. It seems like a simple peephole could spot a sequence of two atomic operations, both with __sync barriers, and replace the first with the equivalent __atomic barrier.
Comment 57 Andrew Macleod 2015-05-12 20:02:19 UTC
Author: amacleod
Date: Tue May 12 20:01:47 2015
New Revision: 223096

URL: https://gcc.gnu.org/viewcvs?rev=223096&root=gcc&view=rev
Log:
2015-05-12  Andrew MacLeod  <amacleod@redhat.com>

	PR target/65697
	* coretypes.h (MEMMODEL_SYNC, MEMMODEL_BASE_MASK): New macros.
	(enum memmodel): Add SYNC_{ACQUIRE,RELEASE,SEQ_CST}.
	* tree.h (memmodel_from_int, memmodel_base, is_mm_relaxed,
	is_mm_consume,is_mm_acquire, is_mm_release, is_mm_acq_rel,
	is_mm_seq_cst, is_mm_sync): New accessor functions.
	* builtins.c (expand_builtin_sync_operation,
	expand_builtin_compare_and_swap): Use MEMMODEL_SYNC_SEQ_CST.
	(expand_builtin_sync_lock_release): Use MEMMODEL_SYNC_RELEASE.
	(get_memmodel,  expand_builtin_atomic_compare_exchange,
	expand_builtin_atomic_load, expand_builtin_atomic_store,
	expand_builtin_atomic_clear): Use new accessor routines.
	(expand_builtin_sync_synchronize): Use MEMMODEL_SYNC_SEQ_CST.
	* optabs.c (expand_compare_and_swap_loop): Use MEMMODEL_SYNC_SEQ_CST.
	(maybe_emit_sync_lock_test_and_set): Use new accessors and
	MEMMODEL_SYNC_ACQUIRE.
	(expand_sync_lock_test_and_set): Use MEMMODEL_SYNC_ACQUIRE.
	(expand_mem_thread_fence, expand_mem_signal_fence, expand_atomic_load,
	expand_atomic_store): Use new accessors.
	* emit-rtl.c (need_atomic_barrier_p): Add additional enum cases.
	* tsan.c (instrument_builtin_call): Update check for memory model beyond
	final enum to use MEMMODEL_LAST.
	* c-family/c-common.c: Use new accessor for memmodel_base.
	* config/aarch64/aarch64.c (aarch64_expand_compare_and_swap): Use new
	accessors.
	* config/aarch64/atomics.md (atomic_load<mode>,atomic_store<mode>,
	aarch64_load_exclusive<mode>, aarch64_store_exclusive<mode>,
	mem_thread_fence, *dmb): Likewise.
	* config/alpha/alpha.c (alpha_split_compare_and_swap,
	alpha_split_compare_and_swap_12): Likewise.
	* config/arm/arm.c (arm_expand_compare_and_swap,
	arm_split_compare_and_swap, arm_split_atomic_op): Likewise.
	* config/arm/sync.md (atomic_load<mode>, atomic_store<mode>,
	atomic_loaddi): Likewise.
	* config/i386/i386.c (ix86_destroy_cost_data, ix86_memmodel_check):
	Likewise.
	* config/i386/sync.md (mem_thread_fence, atomic_store<mode>): Likewise.
	* config/ia64/ia64.c (ia64_expand_atomic_op): Add new memmodel cases and
	use new accessors.
	* config/ia64/sync.md (mem_thread_fence, atomic_load<mode>,
	atomic_store<mode>, atomic_compare_and_swap<mode>,
	atomic_exchange<mode>): Use new accessors.
	* config/mips/mips.c (mips_process_sync_loop): Likewise.
	* config/pa/pa.md (atomic_loaddi, atomic_storedi): Likewise.
	* config/rs6000/rs6000.c (rs6000_pre_atomic_barrier,
	rs6000_post_atomic_barrier): Add new cases.
	(rs6000_expand_atomic_compare_and_swap): Use new accessors.
	* config/rs6000/sync.md (mem_thread_fence): Add new cases.
	(atomic_load<mode>): Add new cases and use new accessors.
	(store_quadpti): Add new cases.
	* config/s390/s390.md (mem_thread_fence, atomic_store<mode>): Use new
	accessors.
	* config/sparc/sparc.c (sparc_emit_membar_for_model): Use new accessors.
	* doc/extend.texi: Update docs to indicate 16 bits are used for memory
	model, not 8.



Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/builtins.c
    trunk/gcc/c-family/c-common.c
    trunk/gcc/config/aarch64/aarch64.c
    trunk/gcc/config/aarch64/atomics.md
    trunk/gcc/config/alpha/alpha.c
    trunk/gcc/config/arm/arm.c
    trunk/gcc/config/arm/sync.md
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sync.md
    trunk/gcc/config/ia64/ia64.c
    trunk/gcc/config/ia64/sync.md
    trunk/gcc/config/mips/mips.c
    trunk/gcc/config/pa/pa.md
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/config/rs6000/sync.md
    trunk/gcc/config/s390/s390.md
    trunk/gcc/config/sparc/sparc.c
    trunk/gcc/coretypes.h
    trunk/gcc/doc/extend.texi
    trunk/gcc/emit-rtl.c
    trunk/gcc/optabs.c
    trunk/gcc/tree.h
    trunk/gcc/tsan.c
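
To make the ChangeLog above easier to follow: the core of the change is that the memory-model encoding gains a SYNC flag plus SYNC_* variants of the affected models, with accessors that let the middle end and the backends query them uniformly. A minimal sketch of that layout, using the ChangeLog's names but with illustrative bit positions rather than the committed definitions:

----
/* Sketch of the extended memory-model layout; the flag's bit position and
   the accessor bodies are assumptions made for illustration.  */
#define MEMMODEL_SYNC       (1 << 15)          /* marks __sync-originated models */
#define MEMMODEL_BASE_MASK  (MEMMODEL_SYNC - 1)

enum memmodel
{
  MEMMODEL_RELAXED,
  MEMMODEL_CONSUME,
  MEMMODEL_ACQUIRE,
  MEMMODEL_RELEASE,
  MEMMODEL_ACQ_REL,
  MEMMODEL_SEQ_CST,
  MEMMODEL_LAST,
  /* The __sync builtins now expand with these instead of the plain models.  */
  MEMMODEL_SYNC_ACQUIRE = MEMMODEL_ACQUIRE | MEMMODEL_SYNC,
  MEMMODEL_SYNC_RELEASE = MEMMODEL_RELEASE | MEMMODEL_SYNC,
  MEMMODEL_SYNC_SEQ_CST = MEMMODEL_SEQ_CST | MEMMODEL_SYNC
};

/* Accessors in the spirit of the new tree.h helpers.  */
static inline enum memmodel
memmodel_base (enum memmodel model)
{
  return (enum memmodel) (model & MEMMODEL_BASE_MASK);
}

static inline int
is_mm_sync (enum memmodel model)
{
  return (model & MEMMODEL_SYNC) != 0;
}

static inline int
is_mm_seq_cst (enum memmodel model)
{
  return memmodel_base (model) == MEMMODEL_SEQ_CST;
}
----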
Comment 58 mwahab 2015-06-01 15:18:52 UTC
Author: mwahab
Date: Mon Jun  1 15:18:19 2015
New Revision: 223983

URL: https://gcc.gnu.org/viewcvs?rev=223983&root=gcc&view=rev
Log:
        PR target/65697
	* config/aarch64/aarch64.c (aarch64_emit_post_barrier):New.
	(aarch64_split_atomic_op): Check for __sync memory models, emit
	appropriate initial loads and final barriers.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
Comment 59 mwahab 2015-06-01 15:21:34 UTC
Author: mwahab
Date: Mon Jun  1 15:21:02 2015
New Revision: 223984

URL: https://gcc.gnu.org/viewcvs?rev=223984&root=gcc&view=rev
Log:
	PR target/65697
	* config/aarch64/aarch64.c (aarch64_split_compare_and_swap): Check
	for __sync memory models, emit initial loads and final barriers as
	appropriate.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
Comment 60 mwahab 2015-06-01 15:25:09 UTC
Author: mwahab
Date: Mon Jun  1 15:24:37 2015
New Revision: 223986

URL: https://gcc.gnu.org/viewcvs?rev=223986&root=gcc&view=rev
Log:
	PR target/65697
	* gcc.target/aarch64/sync-comp-swap.c: New.
	* gcc.target/aarch64/sync-comp-swap.x: New.
	* gcc.target/aarch64/sync-op-acquire.c: New.
	* gcc.target/aarch64/sync-op-acquire.x: New.
	* gcc.target/aarch64/sync-op-full.c: New.
	* gcc.target/aarch64/sync-op-full.x: New.
	* gcc.target/aarch64/sync-op-release.c: New.
	* gcc.target/aarch64/sync-op-release.x: New.


Added:
    trunk/gcc/testsuite/gcc.target/aarch64/sync-comp-swap.c
    trunk/gcc/testsuite/gcc.target/aarch64/sync-comp-swap.x
    trunk/gcc/testsuite/gcc.target/aarch64/sync-op-acquire.c
    trunk/gcc/testsuite/gcc.target/aarch64/sync-op-acquire.x
    trunk/gcc/testsuite/gcc.target/aarch64/sync-op-full.c
    trunk/gcc/testsuite/gcc.target/aarch64/sync-op-full.x
    trunk/gcc/testsuite/gcc.target/aarch64/sync-op-release.c
    trunk/gcc/testsuite/gcc.target/aarch64/sync-op-release.x
Modified:
    trunk/gcc/testsuite/ChangeLog
Comment 61 Ramana Radhakrishnan 2015-06-11 07:35:01 UTC
Well, confirmed at least. And at the minute it is fixed on trunk - not sure whether we are asking for backports for this?
Comment 62 mwahab 2015-06-11 08:03:04 UTC
(In reply to Ramana Radhakrishnan from comment #61)
> Well, confirmed at least. And at the minute it is fixed on trunk - not sure
> whether we are asking for backports for this?

Marcus has asked for this to be backported to 5.1 and maybe 4.9. I'm looking at it.
Comment 63 mwahab 2015-06-29 16:04:05 UTC
Author: mwahab
Date: Mon Jun 29 16:03:34 2015
New Revision: 225132

URL: https://gcc.gnu.org/viewcvs?rev=225132&root=gcc&view=rev
Log:
2015-06-29  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* config/arm/arm.c (arm_split_atomic_op): For ARMv8, replace an
	initial acquire barrier with final barrier.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
Comment 64 mwahab 2015-06-29 16:09:42 UTC
Author: mwahab
Date: Mon Jun 29 16:09:10 2015
New Revision: 225133

URL: https://gcc.gnu.org/viewcvs?rev=225133&root=gcc&view=rev
Log:
2015-06-29  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* config/arm/arm.c (arm_split_compare_and_swap): For ARMv8, replace an
	initial acquire barrier with final barrier.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
Comment 65 mwahab 2015-06-29 16:12:43 UTC
Author: mwahab
Date: Mon Jun 29 16:12:12 2015
New Revision: 225134

URL: https://gcc.gnu.org/viewcvs?rev=225134&root=gcc&view=rev
Log:
2015-06-29  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* gcc.target/arm/armv8-sync-comp-swap.c: New.
	* gcc.target/arm/armv8-sync-op-acquire.c: New.
	* gcc.target/arm/armv8-sync-op-full.c: New.
	* gcc.target/arm/armv8-sync-op-release.c: New.


Added:
    trunk/gcc/testsuite/gcc.target/arm/armv8-sync-comp-swap.c   (with props)
    trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-acquire.c   (with props)
    trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-full.c   (with props)
    trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-release.c   (with props)
Modified:
    trunk/gcc/ChangeLog

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-comp-swap.c
            ('svn:eol-style' added)

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-comp-swap.c
            ('svn:keywords' added)

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-acquire.c
            ('svn:eol-style' added)

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-acquire.c
            ('svn:keywords' added)

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-full.c
            ('svn:eol-style' added)

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-full.c
            ('svn:keywords' added)

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-release.c
            ('svn:eol-style' added)

Propchange: trunk/gcc/testsuite/gcc.target/arm/armv8-sync-op-release.c
            ('svn:keywords' added)
Comment 66 mwahab 2015-08-05 11:21:32 UTC
Author: mwahab
Date: Wed Aug  5 11:20:59 2015
New Revision: 226618

URL: https://gcc.gnu.org/viewcvs?rev=226618&root=gcc&view=rev
Log:
	Backport from trunk
	2015-05-12  Andrew MacLeod  <amacleod@redhat.com>

	PR target/65697
	* coretypes.h (MEMMODEL_SYNC, MEMMODEL_BASE_MASK): New macros.
	(enum memmodel): Add SYNC_{ACQUIRE,RELEASE,SEQ_CST}.
	* tree.h (memmodel_from_int, memmodel_base, is_mm_relaxed)
	(is_mm_consume,is_mm_acquire, is_mm_release, is_mm_acq_rel)
	(is_mm_seq_cst, is_mm_sync): New accessor functions.
	* builtins.c (expand_builtin_sync_operation)
	(expand_builtin_compare_and_swap): Use MEMMODEL_SYNC_SEQ_CST.
	(expand_builtin_sync_lock_release): Use MEMMODEL_SYNC_RELEASE.
	(get_memmodel,  expand_builtin_atomic_compare_exchange)
	(expand_builtin_atomic_load, expand_builtin_atomic_store)
	(expand_builtin_atomic_clear): Use new accessor routines.
	(expand_builtin_sync_synchronize): Use MEMMODEL_SYNC_SEQ_CST.
	* optabs.c (expand_compare_and_swap_loop): Use MEMMODEL_SYNC_SEQ_CST.
	(maybe_emit_sync_lock_test_and_set): Use new accessors and
	MEMMODEL_SYNC_ACQUIRE.
	(expand_sync_lock_test_and_set): Use MEMMODEL_SYNC_ACQUIRE.
	(expand_mem_thread_fence, expand_mem_signal_fence, expand_atomic_load)
	(expand_atomic_store): Use new accessors.
	* emit-rtl.c (need_atomic_barrier_p): Add additional enum cases.
	* tsan.c (instrument_builtin_call): Update check for memory model beyond
	final enum to use MEMMODEL_LAST.
	* c-family/c-common.c: Use new accessor for memmodel_base.
	* config/aarch64/aarch64.c (aarch64_expand_compare_and_swap): Use new
	accessors.
	* config/aarch64/atomics.md (atomic_load<mode>,atomic_store<mode>)
	(aarch64_load_exclusive<mode>, aarch64_store_exclusive<mode>)
	(mem_thread_fence, *dmb): Likewise.
	* config/alpha/alpha.c (alpha_split_compare_and_swap)
	(alpha_split_compare_and_swap_12): Likewise.
	* config/arm/arm.c (arm_expand_compare_and_swap)
	(arm_split_compare_and_swap, arm_split_atomic_op): Likewise.
	* config/arm/sync.md (atomic_load<mode>, atomic_store<mode>)
	(atomic_loaddi): Likewise.
	* config/i386/i386.c (ix86_destroy_cost_data, ix86_memmodel_check):
	Likewise.
	* config/i386/sync.md (mem_thread_fence, atomic_store<mode>): Likewise.
	* config/ia64/ia64.c (ia64_expand_atomic_op): Add new memmodel cases and
	use new accessors.
	* config/ia64/sync.md (mem_thread_fence, atomic_load<mode>)
	(atomic_store<mode>, atomic_compare_and_swap<mode>)
	(atomic_exchange<mode>): Use new accessors.
	* config/mips/mips.c (mips_process_sync_loop): Likewise.
	* config/pa/pa.md (atomic_loaddi, atomic_storedi): Likewise.
	* config/rs6000/rs6000.c (rs6000_pre_atomic_barrier)
	(rs6000_post_atomic_barrier): Add new cases.
	(rs6000_expand_atomic_compare_and_swap): Use new accessors.
	* config/rs6000/sync.md (mem_thread_fence): Add new cases.
	(atomic_load<mode>): Add new cases and use new accessors.
	(store_quadpti): Add new cases.
	* config/s390/s390.md (mem_thread_fence, atomic_store<mode>): Use new
	accessors.
	* config/sparc/sparc.c (sparc_emit_membar_for_model): Use new accessors.
	* doc/extend.texi: Update docs to indicate 16 bits are used for memory
	model, not 8.


Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/builtins.c
    branches/gcc-5-branch/gcc/c-family/c-common.c
    branches/gcc-5-branch/gcc/config/aarch64/aarch64.c
    branches/gcc-5-branch/gcc/config/aarch64/atomics.md
    branches/gcc-5-branch/gcc/config/alpha/alpha.c
    branches/gcc-5-branch/gcc/config/arm/arm.c
    branches/gcc-5-branch/gcc/config/arm/sync.md
    branches/gcc-5-branch/gcc/config/i386/i386.c
    branches/gcc-5-branch/gcc/config/i386/sync.md
    branches/gcc-5-branch/gcc/config/ia64/ia64.c
    branches/gcc-5-branch/gcc/config/ia64/sync.md
    branches/gcc-5-branch/gcc/config/mips/mips.c
    branches/gcc-5-branch/gcc/config/pa/pa.md
    branches/gcc-5-branch/gcc/config/rs6000/rs6000.c
    branches/gcc-5-branch/gcc/config/rs6000/sync.md
    branches/gcc-5-branch/gcc/config/s390/s390.md
    branches/gcc-5-branch/gcc/config/sparc/sparc.c
    branches/gcc-5-branch/gcc/coretypes.h
    branches/gcc-5-branch/gcc/doc/extend.texi
    branches/gcc-5-branch/gcc/emit-rtl.c
    branches/gcc-5-branch/gcc/optabs.c
    branches/gcc-5-branch/gcc/tree.h
    branches/gcc-5-branch/gcc/tsan.c
Comment 67 mwahab 2015-08-05 11:29:59 UTC
Author: mwahab
Date: Wed Aug  5 11:29:28 2015
New Revision: 226619

URL: https://gcc.gnu.org/viewcvs?rev=226619&root=gcc&view=rev
Log:
	Backport from trunk.
	2015-06-01  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* config/aarch64/aarch64.c (aarch64_emit_post_barrier): New.
	(aarch64_split_atomic_op): Check for __sync memory models, emit
	appropriate initial loads and final barriers.


Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/aarch64/aarch64.c
Comment 68 mwahab 2015-08-05 11:40:57 UTC
Author: mwahab
Date: Wed Aug  5 11:40:25 2015
New Revision: 226620

URL: https://gcc.gnu.org/viewcvs?rev=226620&root=gcc&view=rev
Log:
	Backport from trunk.
	2015-06-01  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* config/aarch64/aarch64.c (aarch64_split_compare_and_swap): Check
	for __sync memory models, emit initial loads and final barriers as
	appropriate.


Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/aarch64/aarch64.c
Comment 69 mwahab 2015-08-05 11:49:15 UTC
Author: mwahab
Date: Wed Aug  5 11:48:43 2015
New Revision: 226621

URL: https://gcc.gnu.org/viewcvs?rev=226621&root=gcc&view=rev
Log:
	Backport from trunk
	2015-06-01  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* gcc.target/aarch64/sync-comp-swap.c: New.
	* gcc.target/aarch64/sync-comp-swap.x: New.
	* gcc.target/aarch64/sync-op-acquire.c: New.
	* gcc.target/aarch64/sync-op-acquire.x: New.
	* gcc.target/aarch64/sync-op-full.c: New.
	* gcc.target/aarch64/sync-op-full.x: New.
	* gcc.target/aarch64/sync-op-release.c: New.
	* gcc.target/aarch64/sync-op-release.x: New.


Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-comp-swap.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-comp-swap.x
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-op-acquire.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-op-acquire.x
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-op-full.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-op-full.x
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-op-release.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/sync-op-release.x
Modified:
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
Comment 70 mwahab 2015-08-05 13:28:13 UTC
Author: mwahab
Date: Wed Aug  5 13:27:41 2015
New Revision: 226625

URL: https://gcc.gnu.org/viewcvs?rev=226625&root=gcc&view=rev
Log:
	Backport from trunk:
	2015-06-29  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* config/arm/arm.c (arm_split_atomic_op): For ARMv8, replace an
	initial acquire barrier with final barrier.


Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/arm/arm.c
Comment 71 mwahab 2015-08-05 13:40:46 UTC
Author: mwahab
Date: Wed Aug  5 13:40:14 2015
New Revision: 226627

URL: https://gcc.gnu.org/viewcvs?rev=226627&root=gcc&view=rev
Log:
	Backport from trunk:
	2015-06-29  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* config/arm/arm.c (arm_split_compare_and_swap): For ARMv8,
	replace an initial acquire barrier with final barrier.


Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/arm/arm.c
Comment 72 mwahab 2015-08-05 13:43:36 UTC
Author: mwahab
Date: Wed Aug  5 13:43:04 2015
New Revision: 226628

URL: https://gcc.gnu.org/viewcvs?rev=226628&root=gcc&view=rev
Log:
	Backport from trunk:
	2015-06-29  Matthew Wahab  <matthew.wahab@arm.com>

	PR target/65697
	* gcc.target/arm/armv8-sync-comp-swap.c: New.
	* gcc.target/arm/armv8-sync-op-acquire.c: New.
	* gcc.target/arm/armv8-sync-op-full.c: New.
	* gcc.target/arm/armv8-sync-op-release.c: New.


Added:
    branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/armv8-sync-comp-swap.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/armv8-sync-op-acquire.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/armv8-sync-op-full.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/armv8-sync-op-release.c
Modified:
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
Comment 73 Ramana Radhakrishnan 2015-09-30 03:07:26 UTC
Now fixed, Matthew?

Ramana
Comment 74 mwahab 2015-10-05 08:08:41 UTC
(In reply to Ramana Radhakrishnan from comment #73)
> Now fixed, Matthew?

It's fixed in trunk and gcc-5 but not in gcc-4.9.
Comment 75 Ramana Radhakrishnan 2015-10-05 08:30:45 UTC
Fixed on 5 branch and trunk.