Bug 59448 - Code generation doesn't respect C11 address-dependency
Summary: Code generation doesn't respect C11 address-dependency
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: unknown
Importance: P3 normal
Target Milestone: 5.0
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2013-12-10 12:07 UTC by algrant
Modified: 2022-01-15 01:53 UTC
CC: 11 users

See Also:
Host:
Target: arm aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-10-28 00:00:00


Attachments
promote memory order consume to acquire (334 bytes, patch)
2014-10-28 12:44 UTC, Andrew Macleod

Description algrant 2013-12-10 12:07:52 UTC
int f2(int const *p, int const *q)
  {
    int flag = *p;
    return flag ? *(q + flag - flag) : 0;
  }

The evaluation *(q + flag - flag) is ordered after *p by 5.1.2.4#14;
writing it this way is a recognized way of forcing the ordering.
AArch64 GCC 4.8 -O2 --std=c11 generates

f2:
        ldr     w2, [x0]
        mov     w0, 0
        cbz     w2, .L2
        ldr     w0, [x1]
.L2:

which doesn't preserve the ordering in the target memory model.
The compiler needs to introduce an address dependency, e.g.

        and     w2, w2, #0
        ldr     w0, [x1, w2]

or insert a memory barrier instruction.  According to the target
architecture's own rules, this is sufficient to order the loads.

In general, the C source-level address dependency (defined by the 
syntactic rules of 5.1.2.4#14) might not involve arithmetic.
For example:

  inline int makedep(int flag, ...) { return flag; }  // variadic

  int f3(int const *p, int const *q) {
    int flag = *p;
    return flag ? makedep(*q, flag) : 0;
  }

An access might even be formally dependent on multiple expressions.
The compiler just has to keep track.
Comment 1 Richard Earnshaw 2013-12-10 14:19:17 UTC
I suspect this will affect any target with relaxed HW ordering of loads.
Comment 2 algrant 2013-12-10 16:43:10 UTC
Just realised that the last example is bogus; it should read:

  inline int const *makedep(int const *a, ...) { return a; }  // variadic

  int f3(int const *p, int const *q) {
    int flag = *p;
    return flag ? *makedep(q, flag) : 0;
  }

In C++, makedep could be a template - the counterpart of kill_dependency.
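
For illustration only (a sketch, not part of the original report), such a
template could be a variadic function template that returns its pointer
argument unchanged while the extra arguments keep the call formally
dependency-carrying, i.e. roughly the inverse of std::kill_dependency,
which removes a value from a dependency chain:

  // hypothetical C++11 helper, names invented for illustration
  template <typename T, typename... Deps>
  inline T const *makedep(T const *a, Deps...) { return a; }

  int f3(int const *p, int const *q) {
    int flag = *p;
    return flag ? *makedep(q, flag) : 0;
  }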
Comment 3 jsm-csl@polyomino.org.uk 2013-12-10 17:47:35 UTC
The code generated appears fully in accordance with the semantics of C11.  
You refer to 5.1.2.4#14, the definition of "carries a dependency".  The 
term "carries a dependency" is used outside its own definition only in the 
definition of "dependency-ordered before".  This testcase contains no 
instances of "dependency-ordered before", because any such instance must 
directly or indirectly involve a case of "A performs a release operation 
on an atomic object M, and, in another thread, B performs a consume 
operation on M and reads a value written by any side effect in the release 
sequence headed by A", and the test involves no atomic objects.

"dependency-ordered before", in turn, is only used in the definition of 
"inter-thread happens before", which is only used in the definition of 
"happens before".

Please provide a complete testcase, using _Atomic or <stdatomic.h> (and so 
using GCC 4.9, of course) that demonstrates any bug: that is, that does 
not contain a data race according to the C11 definition, but where GCC has 
reordered code in a way that introduces one (and so one can be 
demonstrated by enough iterations of the threads in the testcase) - or 
where the code generated fails some other semantics associated with 
"happens before", such as those for "visible side effect" and "visible 
sequence of side effects".  I expect any such bug, and fix, would probably 
not be a front-end issue but an issue with the atomic built-in functions 
failing to ensure appropriate ordering in the presence of dependencies 
involving atomic operations (dependencies not involving such operations 
having no effects on C11 semantics).
Comment 4 algrant 2013-12-12 18:07:01 UTC
So using g++,

  #include <atomic>
  int f1(std::atomic<int> const *p, std::atomic<int> const *q)
  {
    int flag = p->load(std::memory_order_consume);
    return flag ? (q + flag - flag)->load(std::memory_order_relaxed) : 0;
  }

demonstrates the same lack of ordering.  You suggest that this might
be a problem with the atomic built-ins - and yes, if this had been a
load-acquire, it would be a problem with the built-in not introducing a
barrier or using a load-acquire instruction.  But for a load-consume on
this architecture, no barrier is necessary to separate the load-consume
from a load that is address-dependent on it.  The programmer wrote a
dependency but the compiler lost track of it.

It's not necessary to demonstrate failure - there's an architectural 
race condition here.  Even if it doesn't fail now there's no guarantee
it will never fail on future more aggressively reordering cores.
Comment 5 jsm-csl@polyomino.org.uk 2013-12-16 14:38:35 UTC
On Thu, 12 Dec 2013, algrant at acm dot org wrote:

> demonstrates the same lack of ordering.  You suggest that this might
> be a problem with the atomic built-ins - and yes, if this had been a
> load-acquire, it would be a problem with the built-in not introducing a
> barrier or using a load-acquire instruction.  But for a load-consume on
> this architecture, no barrier is necessary to separate the load-consume
> from a load that is address-dependent on it.  The programmer wrote a
> dependency but the compiler lost track of it.

"address-dependent" is not a C standard concept.  As far as I can tell, at 
least as regards C there are no such ordering constraints between 
non-atomic operations, only between operations at least one of which is 
atomic - thus, it is the responsibility of the atomic built-in functions 
to ensure whatever ordering may be required.  (Whereas the parts of the 
memory model defining what counts as a "memory location" *do* have 
implications in the absence of atomics, restricting the code sequences 
that can be used for struct modifications and preventing speculative 
stores.)

> It's not necessary to demonstrate failure - there's an architectural 
> race condition here.  Even if it doesn't fail now there's no guarantee
> it will never fail on future more aggressively reordering cores.

You still need to provide a testcase (a complete program that can be 
compiled and linked with current GCC) that (a) does not show undefined 
behavior, (b) that, you justify by reference to the standard definitions 
and how they apply to source code constructs, must exhibit specific 
observable behavior (values printed, assertions passed, etc. - *not* just 
ordering of loads at the architectural level), and (c) that, you justify 
by reference to the architecture definition if not to actual observed 
failure, could fail to meet the requirements for observable behavior, 
given the code generated by GCC and the behavior permitted by the 
architecture for that code.

I am unable to tell what code you envisage running in another thread or 
what observable failure you think could result from "lack of ordering", 
because you have not provided a complete testcase.  Thus, I am unable to 
tell if there is a genuine bug here at all.  The standard definitions 
associated with atomicity are extremely complicated; you need to be very 
careful about identifying exactly how particular definitions apply to 
particular source code constructs and so how you deduce the requirements 
on behavior of a particular program, for a bug report to be of any use.
Comment 6 algrant 2014-01-20 08:52:36 UTC
Here is a complete C++11 test case.  The code generated for the two loads
in thread B doesn't maintain the required ordering:

        ...
        ldrb    w1, [x0]
        uxtb    w1, w1
        adrp    x0, .LANCHOR0
        ldr     w2, [x3]
        ...

According to the architecture specification, to achieve the ordering it's
sufficient to use the result of the first load in the calculation of the
address of the second, even if it's done in such a way as to have no
dependence on the value.

/* Reproducer for 59448: compile with -std=c++11 -lpthread */

#include <atomic>
#include <thread>
#include <stdio.h>
#include <assert.h>

static std::atomic<bool> flag(0);
static int data;

/*
Thread A tries to release a data value 1 to thread B via a
release/consume sequence.

In the demonstration, this is tried repeatedly.  The iterations of
the two threads are synchronized via a release/acquire sequence
from B to A.  This is not relevant to the ordering problem.
*/

void threadA(void)
{
  for (;;) {
    data = 1;                                           // A.A
    flag.store(1, std::memory_order_release);           // A.B
    /* By 1.9#14, A.A is sequenced before A.B */
    /* Now wait for thread B to accept the data. */
    while (flag.load(std::memory_order_acquire) == 1);
    assert(data == 0);
  }
}

void threadB(void)
{
  for (;;) {
    int f, d;
    do {
      f = flag.load(std::memory_order_consume);             // B.A
      d = *(&data + f - f);                                 // B.B
      /* By 1.10#9, B.A carries a dependency to B.B */
      /* By 1.10#10, A.B is dependency-ordered before B.B */
      /* By 1.10#11, A.A inter-thread happens before B.B */
      /* By 1.10#12, A.A happens before B.B */
      /* By 1.10#13, A.A is a visible side-effect with respect to B.B
                     and the value determined by B.B shall be the
                     value (1) stored by A.A. */
    } while (!f);
    assert(d == 1);
    /* Release thread A to start another iteration. */
    data = 0;
    flag.store(0, std::memory_order_release);
  }
}

int main(void)
{
  std::thread a(&threadA);
  std::thread b(&threadB);
  a.join();
  b.join();
  return 0;
}
Comment 7 Andrew Pinski 2014-01-20 09:02:10 UTC
>*(&data + f - f);  

This could be undefined behavior if f is not 0 or 1.
Comment 8 algrant 2014-01-20 10:13:13 UTC
I don't see how f could be anything other than 0 or 1 here, but to make it
even clearer that there is no UB, define flag this way:

  static std::atomic<unsigned int> flag(0);

and calculate the address this way:

      d = *(&data + (f - f));                                 // B.B

The ordering requirements are the same, and there is no possibility of UB.
Comment 9 jsm-csl@polyomino.org.uk 2014-01-20 14:21:17 UTC
Thanks for the testcase.  At this point I think we need Andrew MacLeod or 
Torvald Riegel to review it and assess what should happen where to fix 
this (both for direct uses of the atomic built-in functions, and for 
_Atomic / std::atomic).

Since the original bug report it has come to my attention that there are 
known ambiguities in how the C11/C++11 memory model is mapped to processor 
operations on some architectures, which will need to be addressed in 
platform ABIs for interoperation between implementations (when one thread 
is built with one implementation and the other thread with another 
implementation, for example).

http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

(It is not clear to me if research in this area is sufficiently advanced 
to provide a good way for ABIs to specify this that is unambiguous and 
does not inhibit optimizations within a single implementation.)

It will also of course be necessary for ABIs to specify such things as 
alignment for C11 _Atomic for that to interoperate between implementations 
(for GCC at present, if the size of the type is 8 / 16 / 32 / 64 / 128 
bits, the alignment gets increased to match that of the corresponding core 
type).  But first it may take a while for things to work reliably within 
single implementations.
Comment 10 torvald 2014-01-23 22:28:30 UTC
(In reply to algrant from comment #6)
> Here is a complete C++11 test case.  The code generated for the two loads
> in thread B doesn't maintain the required ordering:

The C++ test has a race condition because the nonatomic load from data is not ordered wrt. the nonatomic store to data in the other thread.  Please fix that and report back whether this changes anything.  (I believe the use of relaxed-MO atomic accesses should be sufficient.)
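
A sketch of that suggestion (assumed details, not the reporter's later corrected testcase): making data an atomic object and loading it with relaxed ordering removes the race, while the address computation still formally carries the dependency from the consume load:

#include <atomic>

static std::atomic<unsigned int> flag(0);
static std::atomic<int> data(0);   // atomic now, so the concurrent access is no longer a data race

void threadB_body(void)            // hypothetical fragment of threadB's loop
{
  unsigned int f;
  int d;
  do {
    f = flag.load(std::memory_order_consume);                  // B.A
    // relaxed atomic load through an address that still depends on f
    d = (&data + (f - f))->load(std::memory_order_relaxed);    // B.B
  } while (!f);
  (void)d;
  // threadA's "data = 1;" would likewise become
  // data.store(1, std::memory_order_relaxed);
}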

Nonetheless, if there is a fixed (data-race-free) similar test, then I believe this fixed test case to be valid in the sense that, AFAIU the standards' requirements (both C++11's and C11's; they are equal in this aspect), code as in this test case is supposed to carry the dependencies forward.

However, I cannot see a *practical* implementation of this requirement.  AFAIU, dependencies are carried forward through loads/stores too (5.1.2.4#14):
"An evaluation A carries a dependency to an evaluation B if: [...]A writes a scalar object or bit-field M, B reads from M the value written by A, and A is sequenced before B".
Because a compiler cannot, in general, analyze which piece of code stored last to M, it would have to conservatively assume that a dependency is carried through M.  In turn, it could never optimize expressions such as "(x-x)", at least if those affect atomic operations on M in some way, irrespective of how remotely that might have happened.  (I'm restricting my statement to atomic accesses because I'm not quite sure whether there's some other reason why nonatomic accesses couldn't be affected by this.)  OTOH, maybe there are factors that I'm not aware of and that effectively constrain carries-a-dependency to something that doesn't widely prohibit optimizations.
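
A hypothetical illustration of why that is hard to track (not from the discussion): the write to and the read from the scalar M may sit in different translation units, so the compiler cannot tell, when compiling the reader, whether the value it loads was produced by a consume load:

#include <atomic>

int scratch;   // ordinary scalar object M, visible to both functions

// Imagine writer() and reader() are defined in separate translation units
// and are called one after the other from the same thread.

void writer(std::atomic<int> const &flag)
{
  // A0: consume load.  A: store to M whose value comes from A0.
  scratch = flag.load(std::memory_order_consume);
}

int reader(int const *payload)
{
  // B: reads M.  If B reads the value written by A, and A is sequenced
  // before B (writer() ran earlier in this thread), then A0 carries a
  // dependency to B.  Compiling reader() in isolation, the compiler cannot
  // rule that out, so it would have to refrain from folding (g - g) to 0.
  int g = scratch;
  return *(payload + (g - g));
}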

I don't have significant background information on the design of this part of the memory model (nor the implementability of this part), but I'll ask around at the C++ meeting in two weeks.

> According to the architecture specification, to achieve the ordering it's
> sufficient to use the result of the first load in the calculation of the
> address of the second, even if it's done in such a way as to have no
> dependence on the value.

And that's certainly a feasible *architecture* specification.  ISTM that the language standard tried to model this, but then didn't give the compiler enough leeway to still be able to optimize code.  The carries-a-dependency rules seem like an unusual shortcut from the language level to requirements on the generated code, without the usual indirection using the observable behavior of the abstract machine and as-if.  For example -- and this isn't a precise definition of course -- if the standard required that "f" (in your test case) be needed to process the evaluation, then things like "f-f" could be optimized out, and the compiler might just do the right thing with standard optimizations.

As a short-term fix, I believe we can promote all memory_order_consume atomics to memory_order_acquire, when generating code with consume-MO atomics.  GCC is at fault here, not the code in the builtins or in libatomic.  Thus, this would prevent errors when GCC compiled the code and a correct atomics implementation is used (e.g., third-party libatomic).

This won't help when another (potentially correct) compiler generated the code with the consume-MO atomics, and GCC-compiled code fails to carry the dependencies forward.  We could try to change libatomic's consume-MO implementation to prevent that, but this would only catch cases in which libatomic is actually used and not the other compiler's builtins or similar.  Thus, changing libatomic does not seem to be necessary.

Richard, Andrew: Thoughts?
Comment 11 algrant 2014-02-17 10:26:47 UTC
Where do you get that this is racy if the access to data is not atomic?  By design, release/acquire and release/consume sequences don't require wholesale changes to the way the data payload (in the general case, multiple fields within a structure) is first constructed and then used.  1.10#13 makes clear that as a result of the intra-thread sequencing between atomic and non-atomic operations (1.9#14), and the inter-thread ordering between atomic operations (1.10 various), there is a resulting ordering on operations to "ordinary" (sic) objects.  Please see the references to the C++ standard in the source example, for the chain of reasoning here.
Comment 12 torvald 2014-02-17 21:02:58 UTC
(In reply to algrant from comment #11)
> Where do you get that this is racy if the access to data is not atomic?

In threadB(), you do:

      f = flag.load(std::memory_order_consume);             // B.A
      d = *(&data + f - f);                                 // B.B

That reads from data irrespective of the value of f, so will read when data is actually not "owned" by threadB.  You can either make the accesses to data atomic, or move the load from data to after checking that flag equals 1.

> By
> design, release/acquire and release/consume sequences don't require
> wholesale changes to the way the data payload (in the general case, multiple
> fields within a structure) is first constructed and then used.  1.10#13
> makes clear that as a result of the intra-thread sequencing between atomic
> and non-atomic operations (1.9#14), and the inter-thread ordering between
> atomic operations (1.10 various), there is a resulting ordering on
> operations to "ordinary" (sic) objects.  Please see the references to the
> C++ standard in the source example, for the chain of reasoning here.

I'm very much aware of the rules, I believe.  But as far as I can see, there's a data race in your test program, which results in undefined behavior.
Comment 13 algrant 2014-02-17 22:22:03 UTC
I see what you mean - there is a race if threadB reads the data when the
flag is not set, i.e. in the case when the read value is never used.
Moving the read of data after the test (as in the fragment in comment 4)
would fix the race.
Comment 14 Ramana Radhakrishnan 2014-10-28 10:53:05 UTC
This appears to have now morphed into a standards issue and possibly affects all targets that have a weak memory model. Anyway confirming this.
Comment 15 Andrew Macleod 2014-10-28 12:44:05 UTC
Created attachment 33831 [details]
promote memory order consume to acquire

So have we concluded that we should promote memory_order_consume to memory_order_acquire for now?  

That change should be fairly trivial: builtins.c::get_memmodel would simply change any consume to acquire, so all consumes would be promoted to acquire.

I don't know if libatomic does much with consume, but once compiled with this patch in the compiler, it will also generate acquire code for consumes.

Here's an untested patch in case someone wants to experiment.
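
To make the idea concrete, here is a self-contained model of the promotion (a sketch with invented names; in GCC itself the attached patch would presumably map MEMMODEL_CONSUME to MEMMODEL_ACQUIRE inside get_memmodel):

#include <atomic>

// Model: wherever a user-supplied memory order is about to drive code
// generation, consume is simply treated as acquire.
inline std::memory_order promote_consume(std::memory_order mo)
{
  return mo == std::memory_order_consume ? std::memory_order_acquire : mo;
}

int load_with(std::atomic<int> const &x, std::memory_order mo)
{
  return x.load(promote_consume(mo));   // a consume request becomes an acquire load
}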
Comment 16 jsm-csl@polyomino.org.uk 2014-10-28 13:32:40 UTC
My view is promote consume to acquire until the standards committees and 
formal model people have sorted out what to do with it.
Comment 17 torvald 2014-10-28 17:01:40 UTC
(In reply to Andrew Macleod from comment #15)
> So have we concluded that we should promote memory_order_consume to
> memory_order_acquire for now?  

I also think that this is the best way forward.  I believe everyone in ISO C++ SG1 agreed that this is basically a defect in the standard.

What I haven't thought through is how to deal with carries_dependency (7.6.4 in C++11): For GCC code generated after we promote consume to acquire, it can safely be ignored; but should GCC code be linked to code generated by another compiler that does not promote and expects the code to preserve dependencies, this won't work.

I am not aware of any shipping compiler that would actually try to preserve dependencies, and nobody else mentioned any during the discussion of this topic in ISO C++ SG1.  Thus, we could assume that there are no such other compilers, and make it part of the ABI (assumptions) that consume is promoted to acquire in a correct compiler.

Alternatively, we could try to be conservative and add an acquire barrier before the function body if any parameter of the function has carries_dependency; and, likewise, add an acquire barrier after every call to a function which has carries_dependency.

I don't have more input from the ISO C side, but I would guess that the situation there is similar.

Thoughts?
Comment 18 Tim Northover 2014-10-28 17:10:55 UTC
> I am not aware of any shipping compiler that would actually try to preserve dependencies, and nobody else mentioned any during the discussion of this topic in ISO C++ SG1.

In case the data point is useful, Clang promotes consume to acquire at the moment (though I wouldn't rule out bugs even in that choice).
Comment 19 jsm-csl@polyomino.org.uk 2014-10-29 00:07:36 UTC
On Tue, 28 Oct 2014, torvald at gcc dot gnu.org wrote:

> Alternatively, we could try to be conservative and add an acquire barrier
> before the function body if any parameter of the function has
> carries_dependency; and, likewise, add an acquire barrier after every call to a
> function which has carries_dependency.
> 
> I don't have more input from the ISO C side, but I would guess that the
> situation there is similar.

As previously noted:

* ISO C does not have carries_dependency (or attributes at all).

* carries_dependency is about increasing optimization, not decreasing it.  
If it affects when barriers are added, it does so by allowing some 
barriers to be omitted that would otherwise be required.
Comment 20 torvald 2014-10-29 09:04:44 UTC
(In reply to joseph@codesourcery.com from comment #19)
> * carries_dependency is about increasing optimization, not decreasing it.  
> If it affects when barriers are added, it does so by allowing some 
> barriers to be omitted that would otherwise be required.

That's not quite true, unfortunately, AFAIU.  I agree that generally, attributes are supposed to be "ignorable" -- but in the case of carries_dependency, C++11 explicitly requires that all declarations of a function either have the attribute or none has (see 7.6.4p2).  This is because you need that to actually exploit the attribute; it's a contract between callers and callees.  If a compiler does try to preserve dependencies (or not across function boundaries), then ignoring the attribute is fine.  But if there should be a compiler out there that doesn't, and GCC-generated code is supposed to link to that other compiler's code, then we need to do something to make this work.
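
An illustrative C++11 fragment of that contract (invented example; GCC itself does not exploit the attribute): every declaration of the function, in every translation unit, has to name the attribute so that callers and callee agree on whether dependencies survive the call:

#include <atomic>

// C++11 7.6.4p2: the attribute must appear on the first declaration, and a
// translation unit that declares the function without it, while another
// declares it with it, makes the program ill-formed, no diagnostic required.
int use_payload(int *p [[carries_dependency]]);     // declaration seen by callers

int use_payload(int *p [[carries_dependency]])      // definition repeats it
{
  return *p;   // callee relies on p still carrying its dependency
}

int caller(std::atomic<int *> const &head)
{
  int *p = head.load(std::memory_order_consume);
  // A dependency-preserving implementation may omit a barrier at this call
  // because of the attribute; the conservative approach from comment 17
  // would otherwise surround calls like this with acquire barriers.
  return use_payload(p);
}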
Comment 21 torvald 2014-10-30 20:41:18 UTC
(In reply to torvald from comment #17)
> (In reply to Andrew Macleod from comment #15)
> > So have we concluded that we should promote memory_order_consume to
> > memory_order_acquire for now?  
> 
> I also think that this is the best way forward.  I believe everyone in ISO
> C++ SG1 agreed that this is basically a defect in the standard.

To clarify, my impression from the discussion was that there was general consensus that memory_order_consume as specified now wouldn't achieve what it was intended to in practice, due to the implementation issues.  IOW, it's not useful even though it's not a defect in the sense of being inconsistent or plain wrong.
Comment 22 preshing 2014-10-30 22:13:38 UTC
(In reply to algrant from comment #13)
> I see what you mean - there is a race if threadB reads the data when the
> flag is not set, i.e. in the case when the read value is never used.
> Moving the read of data after the test (as in the fragment in comment 4)
> would fix the race.

algrant, could you post the complete assembly listing for threadB(), using your version of AArch64 g++, after all the corrections to the example have been taken into account, with -O2? For convenience, I've posted the corrected example below.

When I compile the example using PowerPC g++ 4.9.1, it seems that memory_order_consume is already promoted to memory_order_acquire, as evidenced by the cmpw;bne-;isync sequence following the load from flag, which acts as an acquire barrier. I've posted the PowerPC assembly listing below as well.

Is the bug specific to AArch64?

-------------

/* Reproducer for 59448: compile with -std=c++11 -lpthread -O2 */

#include <atomic>
#include <thread>
#include <stdio.h>
#include <assert.h>

static std::atomic<unsigned int> flag(0);
static int data;

/*
Thread A tries to release a data value 1 to thread B via a
release/consume sequence.

In the demonstration, this is tried repeatedly.  The iterations of
the two threads are synchronized via a release/acquire sequence
from B to A.  This is not relevant to the ordering problem.
*/

void threadA(void)
{
  for (;;) {
    data = 1;                                           // A.A
    flag.store(1, std::memory_order_release);           // A.B
    /* By 1.9#14, A.A is sequenced before A.B */
    /* Now wait for thread B to accept the data. */
    while (flag.load(std::memory_order_acquire) == 1);
    assert(data == 0);
  }
}

void threadB(void)
{
  for (;;) {
    int f, d;
    do {
      f = flag.load(std::memory_order_consume);             // B.A
    } while (!f);
    /* By 1.10#9, B.A carries a dependency to B.B */
    /* By 1.10#10, A.B is dependency-ordered before B.B */
    /* By 1.10#11, A.A inter-thread happens before B.B */
    /* By 1.10#12, A.A happens before B.B */
    /* By 1.10#13, A.A is a visible side-effect with respect to B.B
                   and the value determined by B.B shall be the
                   value (1) stored by A.A. */
    d = *(&data + (f - f));                                 // B.B
    assert(d == 1);
    /* Release thread A to start another iteration. */
    data = 0;
    flag.store(0, std::memory_order_release);
  }
}

int main(void)
{
  std::thread a(&threadA);
  std::thread b(&threadB);
  a.join();
  b.join();
  return 0;
}

------------- PowerPC listing of threadB():

_Z7threadBv:
.LFB2302:
    .cfi_startproc
    lis 10,.LANCHOR0@ha
    li 8,0
    la 10,.LANCHOR0@l(10)
.L16:
    lwz 9,4(10)
    cmpw 7,9,9      <------- acquire barrier
    bne- 7,$+4      <-------
    isync           <-------
    cmpwi 7,9,0
    beq+ 7,.L16
    lwz 9,0(10)
    cmpwi 7,9,1
    bne- 7,.L22
    stw 8,0(10)
    lwsync
    stw 8,4(10)
    b .L16
.L22:
    mflr 0
    lis 6,.LANCHOR1@ha
    stwu 1,-16(1)
    .cfi_def_cfa_offset 16
    .cfi_register 65, 0
    lis 3,.LC2@ha
    lis 4,.LC1@ha
    la 6,.LANCHOR1@l(6)
    la 3,.LC2@l(3)
    la 4,.LC1@l(4)
    li 5,47
    addi 6,6,16
    stw 0,20(1)
    .cfi_offset 65, 4
    bl __assert_fail
    .cfi_endproc
Comment 23 preshing 2014-11-24 12:13:45 UTC
Hi,

I went ahead and verified this bug using a cross-compiler built from GCC 4.9.2 sources. The bug indeed exists and happens when compiling for AArch64, but not PowerPC. Andrew's patch fixes it (changing the first ldr instruction to an ldar in this case). Full AArch64 assembly listings below.

I've also written a blog post on this subject in the hope of clarifying the issue for anyone determined enough to make sense of it: http://preshing.com/20141124/fixing-gccs-implementation-of-memory_order_consume

Andrew's patch, if it works the way I understand it, seems like the correct thing for GCC to do until somebody figures out how to safely implement the "efficient" compiler strategy for consume semantics.

I guess the next step is to run the test suite on a few platforms to make sure there are no regressions, then submit?

Cheers,
Jeff

------------------ AArch64 listing of threadB() without Andrew's patch:

_Z7threadBv:
.LFB2304:
        .cfi_startproc
        adrp    x1, .LANCHOR0
        stp     x29, x30, [sp, -16]!
        .cfi_def_cfa_offset 16
        .cfi_offset 29, -16
        .cfi_offset 30, -8
        add     x1, x1, :lo12:.LANCHOR0
        add     x29, sp, 0
        .cfi_def_cfa_register 29
.L10:
        add     x0, x1, 8
        ldr     w0, [x0]
        cbz     w0, .L10
        ldr     w0, [x1]
        cmp     w0, 1
        bne     .L15
        str     wzr, [x1]
        add     x0, x1, 8
        stlr    wzr, [x0]
        b       .L10
.L15:
        adrp    x3, .LANCHOR1
        adrp    x0, .LC2
        adrp    x1, .LC1
        add     x3, x3, :lo12:.LANCHOR1
        add     x0, x0, :lo12:.LC2
        add     x1, x1, :lo12:.LC1
        mov     w2, 47
        add     x3, x3, 16
        bl      __assert_fail
        .cfi_endproc

------------------ AArch64 listing of threadB() with Andrew's patch:

_Z7threadBv:
.LFB2304:
        .cfi_startproc
        adrp    x1, .LANCHOR0
        stp     x29, x30, [sp, -16]!
        .cfi_def_cfa_offset 16
        .cfi_offset 29, -16
        .cfi_offset 30, -8
        add     x1, x1, :lo12:.LANCHOR0
        add     x29, sp, 0
        .cfi_def_cfa_register 29
.L10:
        add     x0, x1, 8
        ldar    w0, [x0]
        cbz     w0, .L10
        ldr     w0, [x1]
        cmp     w0, 1
        bne     .L15
        str     wzr, [x1]
        add     x0, x1, 8
        stlr    wzr, [x0]
        b       .L10
.L15:
        adrp    x3, .LANCHOR1
        adrp    x0, .LC2
        adrp    x1, .LC1
        add     x3, x3, :lo12:.LANCHOR1
        add     x0, x0, :lo12:.LC2
        add     x1, x1, :lo12:.LC1
        mov     w2, 47
        add     x3, x3, 16
        bl      __assert_fail
        .cfi_endproc
Comment 24 Andrew Macleod 2015-01-14 13:59:08 UTC
Author: amacleod
Date: Wed Jan 14 13:58:35 2015
New Revision: 219601

URL: https://gcc.gnu.org/viewcvs?rev=219601&root=gcc&view=rev
Log:

2015-01-14  Andrew MacLeod  <amacleod@redhat.com>

	PR middle-end/59448
	* builtins.c (get_memmodel): Promote consume to acquire always.
	* testsuite/gcc.dg/atomic-invalid.c: Remove obselete test for illegal
	consume in an atomic_exchange.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/builtins.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/atomic-invalid.c
Comment 25 torvald 2016-01-21 18:37:01 UTC
(In reply to Andrew Macleod from comment #24)
> Author: amacleod
> Date: Wed Jan 14 13:58:35 2015
> New Revision: 219601
> 
> URL: https://gcc.gnu.org/viewcvs?rev=219601&root=gcc&view=rev
> Log:
> 
> 2015-01-14  Andrew MacLeod  <amacleod@redhat.com>
> 
> 	PR middle-end/59448
> 	* builtins.c (get_memmodel): Promote consume to acquire always.
> 	* testsuite/gcc.dg/atomic-invalid.c: Remove obselete test for illegal
> 	consume in an atomic_exchange.

To summarize, Jeff Preshing confirmed that Andrew's patch fixes the problem (see Comment 23).  Everyone in ISO C++ seems to agree that promoting consume to acquire is the right approach until we have a different language-level facility that is actually implementable.  Therefore, I'm closing this as fixed.