[PATCH, AArch64 v2 05/11] aarch64: Emit LSE st<op> instructions

Wed Oct 31 19:42:00 GMT 2018

On Wed, Oct 31, 2018 at 04:38:53PM +0000, Richard Henderson wrote:
> On 10/31/18 3:04 PM, Will Deacon wrote:
> > The example test above uses relaxed atomics in conjunction with an acquire
> > fence, so I don't think we can actually use ST<op> at all without a change
> > to the language specification. I previouslyyallocated P0861 for this purpose
> > but never got a chance to write it up...
> > 
> > Perhaps the issue is a bit clearer with an additional thread (not often I
> > say that!):
> > 
> > 
> > P0 (atomic_int* y,atomic_int* x) {
> >   atomic_store_explicit(x,1,memory_order_relaxed);
> >   atomic_thread_fence(memory_order_release);
> >   atomic_store_explicit(y,1,memory_order_relaxed);
> > }
> > 
> > P1 (atomic_int* y,atomic_int* x) {
> >   atomic_fetch_add_explicit(y,1,memory_order_relaxed);	// STADD
> >   atomic_thread_fence(memory_order_acquire);
> >   int r0 = atomic_load_explicit(x,memory_order_relaxed);
> > }
> > 
> > P2 (atomic_int* y) {
> >   int r1 = atomic_load_explicit(y,memory_order_relaxed);
> > }
> > 
> > 
> > My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
> > this test has executed. However, if the relaxed add in P1 compiles to
> > STADD and the subsequent acquire fence is compiled as DMB LD, then we
> > don't have any ordering guarantees in P1 and the forbidden result could
> > be observed.
> 
> I suppose I don't understand exactly what you're saying.

Apologies, I'm probably not explaining things very well. I'm trying to
avoid getting into the C11 memory model relations if I can help it, hence
the example.

> I can see that, yes, if you split the fetch-add from the acquire in P1 you get
> the incorrect results you describe.  But isn't that a bug in the test itself?

Per the C11 memory model, the test above is well-defined and if r1 == 2
then it is required that r0 == 1. With your proposal, this is not guaranteed
for AArch64, and it would be possible to end up with r1 == 2 and r0 == 0.

> Why would not the only correct version have
> 
> P1 (atomic_int* y, atomic_int* x) {
>   atomic_fetch_add_explicit(y, 1, memory_order_acquire);
>   int r0 = atomic_load_explicit(x, memory_order_relaxed);
> }
> 
> at which point we won't use STADD for the fetch-add, but LDADDA.

That would indeed work correctly, but the problem is that the C11 memory
model doesn't rule out the previous test as something which isn't portable.

> If the problem is more fundamental than this, would you have another go at
> explaining?  In particular, I don't see the difference between
> 
> 	ldadd	val, scratch, [base]
>   vs
> 	stadd	val, [base]
> 
> and
> 
> 	ldaddl	val, scratch, [base]
>   vs
> 	staddl	val, [base]
> 
> where both pairs of instructions have the same memory ordering semantics.
> Currently we are always producing the ld version of each pair.

Aha, maybe this is the problem. An acquire fence on AArch64 is implemented
using a DMB LD instruction, which orders prior reads against subsequent
reads and writes. However, the architecture says:

  | The ST<OP> instructions, and LD<OP> instructions where the destination
  | register is WZR or XZR, are not regarded as doing a read for the purpose
  | of a DMB LD barrier.

and so therefore an ST atomic is not affected by a subsequent acquire fence,
whereas an LD atomic is.

Does that help at all?

Will