[PATCH, AArch64 v2 05/11] aarch64: Emit LSE st<op> instructions
Will Deacon
will.deacon@arm.com
Wed Oct 31 19:42:00 GMT 2018
On Wed, Oct 31, 2018 at 04:38:53PM +0000, Richard Henderson wrote:
> On 10/31/18 3:04 PM, Will Deacon wrote:
> > The example test above uses relaxed atomics in conjunction with an acquire
> > fence, so I don't think we can actually use ST<op> at all without a change
> > to the language specification. I previously allocated P0861 for this purpose
> > but never got a chance to write it up...
> >
> > Perhaps the issue is a bit clearer with an additional thread (not often I
> > say that!):
> >
> >
> > P0 (atomic_int* y,atomic_int* x) {
> >   atomic_store_explicit(x,1,memory_order_relaxed);
> >   atomic_thread_fence(memory_order_release);
> >   atomic_store_explicit(y,1,memory_order_relaxed);
> > }
> >
> > P1 (atomic_int* y,atomic_int* x) {
> >   atomic_fetch_add_explicit(y,1,memory_order_relaxed); // STADD
> >   atomic_thread_fence(memory_order_acquire);
> >   int r0 = atomic_load_explicit(x,memory_order_relaxed);
> > }
> >
> > P2 (atomic_int* y) {
> >   int r1 = atomic_load_explicit(y,memory_order_relaxed);
> > }
> >
> >
> > My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
> > this test has executed. However, if the relaxed add in P1 compiles to
> > STADD and the subsequent acquire fence is compiled as DMB LD, then we
> > don't have any ordering guarantees in P1 and the forbidden result could
> > be observed.
>
> I suppose I don't understand exactly what you're saying.
Apologies, I'm probably not explaining things very well. I'm trying to
avoid getting into the C11 memory model relations if I can help it, hence
the example.
> I can see that, yes, if you split the fetch-add from the acquire in P1 you get
> the incorrect results you describe. But isn't that a bug in the test itself?
Per the C11 memory model, the test above is well-defined and if r1 == 2
then it is required that r0 == 1. With your proposal, this is not guaranteed
for AArch64, and it would be possible to end up with r1 == 2 and r0 == 0.
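
For readers who want to run the three-thread test outside a litmus tool, here
is a minimal sketch using pthreads. The harness (count_forbidden, the p0/p1/p2
thread functions and the file-scope variables) is hypothetical and not part of
the patch; only the atomic operations come from the test above. A clean run
proves nothing about the architecture, since weak behaviours can be rare and
thread start-up costs limit how often the accesses actually race.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared locations from the litmus test; both start at 0. */
static atomic_int x, y;

/* Values observed by P1 and P2, read by the harness after the joins. */
static int r0, r1;

static void *p0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return NULL;
}

static void *p1(void *arg)
{
    (void)arg;
    /* Result discarded: this is the operation that could become STADD. */
    atomic_fetch_add_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    r0 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

static void *p2(void *arg)
{
    (void)arg;
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

/*
 * Runs the test `iterations` times and returns how often the outcome
 * r0 == 0 && r1 == 2 was seen, or -1 if a thread could not be created.
 */
long count_forbidden(long iterations)
{
    void *(*const body[3])(void *) = { p0, p1, p2 };
    long forbidden = 0;

    for (long i = 0; i < iterations; i++) {
        pthread_t t[3];
        int started = 0;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        r0 = r1 = -1;

        while (started < 3 &&
               pthread_create(&t[started], NULL, body[started], NULL) == 0)
            started++;
        for (int j = 0; j < started; j++)
            pthread_join(t[j], NULL);
        if (started < 3)
            return -1;

        if (r0 == 0 && r1 == 2)
            forbidden++;
    }
    return forbidden;
}
```

Calling count_forbidden(n) from a small main() and checking for a non-zero
result is enough to flag a miscompilation; a proper litmus runner would
synchronise the thread start to make the race far more likely.
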
> Why would not the only correct version have
>
> P1 (atomic_int* y, atomic_int* x) {
>   atomic_fetch_add_explicit(y, 1, memory_order_acquire);
>   int r0 = atomic_load_explicit(x, memory_order_relaxed);
> }
>
> at which point we won't use STADD for the fetch-add, but LDADDA.
That would indeed work correctly, but the problem is that the C11 memory
model doesn't rule out the previous test as non-portable.
> If the problem is more fundamental than this, would you have another go at
> explaining? In particular, I don't see the difference between
>
> ldadd val, scratch, [base]
> vs
> stadd val, [base]
>
> and
>
> ldaddl val, scratch, [base]
> vs
> staddl val, [base]
>
> where both pairs of instructions have the same memory ordering semantics.
> Currently we are always producing the ld version of each pair.
Aha, maybe this is the problem. An acquire fence on AArch64 is implemented
using a DMB LD instruction, which orders prior reads against subsequent
reads and writes. However, the architecture says:
| The ST<OP> instructions, and LD<OP> instructions where the destination
| register is WZR or XZR, are not regarded as doing a read for the purpose
| of a DMB LD barrier.
and therefore an ST atomic is not affected by a subsequent acquire fence,
whereas an LD atomic is.
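
The rule quoted above can be written down as a small predicate. This is only
an illustrative sketch: the enum, struct and function names are hypothetical
and do not correspond to anything in GCC or in the architecture documents.

```c
#include <stdbool.h>

/* Hypothetical classification of the two LSE atomic forms discussed here. */
enum lse_form {
    LSE_LD_OP,  /* e.g. LDADD: returns the old value in a register */
    LSE_ST_OP   /* e.g. STADD: no destination register */
};

struct lse_insn {
    enum lse_form form;
    bool dest_is_zero_reg;  /* LD<OP> targeting WZR/XZR; unused for ST<OP> */
};

/*
 * Returns true if a later DMB LD treats this instruction as a prior read,
 * following the architecture text quoted above.
 */
bool counts_as_read_for_dmb_ld(struct lse_insn insn)
{
    if (insn.form == LSE_ST_OP)
        return false;
    return !insn.dest_is_zero_reg;
}
```

Under this model, only an LD<OP> with a real destination register is ordered
by the DMB LD that implements the acquire fence in P1.
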
Does that help at all?
Will