int f2(int const *p, int const *q) { int flag = *p; return flag ? *(q + flag - flag) : 0; } The evaluation *(q + flag - flag) is ordered after *p by 5.1.2.4#14; writing it this way is a recognized way of forcing the ordering. AArch64 GCC 4.8 -O2 --std=c11 generates f2: ldr w2, [x0] mov w0, 0 cbz w2, .L2 ldr w0, [x1] .L2 which doesn't preserve the ordering in the target memory model. The compiler needs to introduce an address dependency, e.g. and w2, w2, #0 ldr w0, [x1, w2] or insert a memory barrier instruction. According to the target architecture's own rules, this is sufficient to order the loads. In general, the C source-level address dependency (defined by the syntactic rules of 5.1.2.4#14) might not involve arithmetic. For example: inline int makedep(int flag, ...) { return flag; } // variadic int f3(int const *p, int const *q) { int flag = *p; return flag ? makedep(*q, flag) : 0; } An access might even be formally dependent on multiple expressions. The compiler just has to keep track.
I suspect this will affect any target with relaxed HW ordering of loads.
Just realised that that last example is bogus, it should read: inline int const *makedep(int const *a, ...) { return a; } // variadic int f3(int const *p, int const *q) { int flag = *p; return flag ? *makedep(q, flag) : 0; } In C++, makedep could be a template - the counterpart of kill_dependency.
The code generated appears fully in accordance with the semantics of C11. You refer to 5.1.2.4#14, the definition of "carries a dependency". The term "carries a dependency" is used outside its own definition only in the definition of "dependency-ordered before". This testcase contains no instances of "dependency-ordered before", because any such instance must directly or indirectly involve a case of "A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A", and the test involves no atomic objects. "dependency-ordered before", in turn, is only used in the definition of "inter-thread happens before", which is only used in the definition of "happens before". Please provide a complete testcase, using _Atomic or <stdatomic.h> (and so using GCC 4.9, of course) that demonstrates any bug: that is, that does not contain a data race according to the C11 definition, but where GCC has reordered code in a way that introduces one (and so one can be demonstrated by enough iterations of the threads in the testcase) - or where the code generated fails some other semantics associated with "happens before", such as those for "visible side effect" and "visible sequence of side effects". I expect any such bug, and fix, would probably not be a front-end issue but an issue with the atomic built-in functions failing to ensure appropriate ordering in the presence of dependencies involving atomic operations (dependencies not involving such operations having no effects on C11 semantics).
So using g++, #include <atomic> int f1(std::atomic<int> const *p, std::atomic<int> const *q) { int flag = p->load(std::memory_order_consume); return flag ? (q + flag - flag)->load(std::memory_order_relaxed) : 0; } demonstrates the same lack of ordering. You suggest that this might be a problem with the atomic built-ins - and yes, if this had been a load-acquire, it would be a problem with the built-in not introducing a barrier or using a load-acquire instruction. But for a load-consume on this architecture, no barrier is necessary to separate the load-consume from a load that is address-dependent on it. The programmer wrote a dependency but the compiler lost track of it. It's not necessary to demonstrate failure - there's an architectural race condition here. Even if it doesn't fail now there's no guarantee it will never fail on future more aggressively reordering cores.
On Thu, 12 Dec 2013, algrant at acm dot org wrote: > demonstrates the same lack of ordering. You suggest that this might > be a problem with the atomic built-ins - and yes, if this had been a > load-acquire, it would be a problem with the built-in not introducing a > barrier or using a load-acquire instruction. But for a load-consume on > this architecture, no barrier is necessary to separate the load-consume > from a load that is address-dependent on it. The programmer wrote a > dependency but the compiler lost track of it. "address-dependent" is not a C standard concept. As far as I can tell, at least as regards C there are no such ordering constraints between non-atomic operations, only between operations at least one of which is atomic - thus, it is the responsibility of the atomic built-in functions to ensure whatever ordering may be required. (Whereas the parts of the memory model defining what counts as a "memory location" *do* have implications in the absence of atomics, restricting the code sequences that can be used for struct modifications and preventing speculative stores.) > It's not necessary to demonstrate failure - there's an architectural > race condition here. Even if it doesn't fail now there's no guarantee > it will never fail on future more aggressively reordering cores. You still need to provide a testcase (a complete program that can be compiled and linked with current GCC) that (a) does not show undefined behavior, (b) that, you justify by reference to the standard definitions and how they apply to source code constructs, must exhibit specific observable behavior (values printed, assertions passed, etc. - *not* just ordering of loads at the architectural level), and (c) that, you justify by reference to the architecture definition if not to actual observed failure, could fail to meet the requirements for observable behavior, given the code generated by GCC and the behavior permitted by the architecture for that code. I am unable to tell what code you envisage running in another thread or what observable failure you think could result from "lack of ordering", because you have not provided a complete testcase. Thus, I am unable to tell if there is a genuine bug here at all. The standard definitions associated with atomicity are extremely complicated; you need to be very careful about identifying exactly how particular definitions apply to particular source code constructs and so how you deduce the requirements on behavior of a particular program, for a bug report to be of any use.
Here is a complete C++11 test case. The code generated for the two loads in thread B doesn't maintain the required ordering: ... ldrb w1, [x0] uxtb w1, w1 adrp x0, .LANCHOR0 ldr w2, [x3] ... According to the architecture specification, to achieve the ordering it's sufficient to use the result of the first load in the calculation of the address of the second, even if it's done in such a way as to have no dependence on the value. /* Reproducer for 59448: compile with -std=c++11 -lpthread */ #include <atomic> #include <thread> #include <stdio.h> #include <assert.h> static std::atomic<bool> flag(0); static int data; /* Thread A tries to release a data value 1 to thread B via a release/consume sequence. In the demonstration, this is tried repeatedly. The iterations of the two threads are synchronized via a release/acquire sequence from B to A. This is not relevant to the ordering problem. */ void threadA(void) { for (;;) { data = 1; // A.A flag.store(1, std::memory_order_release); // A.B /* By 1.9#14, A.A is sequenced before A.B */ /* Now wait for thread B to accept the data. */ while (flag.load(std::memory_order_acquire) == 1); assert(data == 0); } } void threadB(void) { for (;;) { int f, d; do { f = flag.load(std::memory_order_consume); // B.A d = *(&data + f - f); // B.B /* By 1.10#9, B.A carries a dependency to B.B */ /* By 1.10#10, A.B is dependency-ordered before B.B */ /* By 1.10#11, A.A intra-thread happens before B.B */ /* By 1.10#12, A.A happens before B.B */ /* By 1.10#13, A.A is a visible side-effect with respect to B.B and the value determined by B.B shall be the value (1) stored by A.A. */ } while (!f); assert(d == 1); /* Release thread A to start another iteration. */ data = 0; flag.store(0, std::memory_order_release); } } int main(void) { std::thread a(&threadA); std::thread b(&threadB); a.join(); b.join(); return 0; }
>*(&data + f - f); This could be undefined behavior if f is not 0 or 1.
I don't see how f can not be 0 or 1 here, but to make this even more clear that there is no UB, define flag this way: static std::atomic<unsigned int> flag(0); and calculate the address this way: d = *(&data + (f - f)); // B.B The ordering requirements are the same, and there is no possibility of UB.
Thanks for the testcase. At this point I think we need Andrew MacLeod or Torvald Riegel to review it and assess what should happen where to fix this (both for direct uses of the atomic built-in functions, and for _Atomic / std::atomic). Since the original bug report it has come to my attention that there are known ambiguities in how the C11/C++11 memory model is mapped to processor operations on some architectures, which will need to be addressed in platform ABIs for interoperation between implementations (when one thread is built with one implementation and the other thread with another implementation, for example). http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html (It is not clear to me if research in this area is sufficiently advanced to provide a good way for ABIs to specify this that is unambiguous and does not inhibit optimizations within a single implementation.) It will also of course be necessary for ABIs to specify such things as alignment for C11 _Atomic for that to interoperate between implementations (for GCC at present, if the size of the type is 8 / 16 / 32 / 64/ 128 bits, the alignment gets increased to match that of the corresponding core type). But first it may take a while for things to work reliably within single implementations.
(In reply to algrant from comment #6) > Here is a complete C++11 test case. The code generated for the two loads > in thread B doesn't maintain the required ordering: The C++ test has a race condition because the nonatomic load from data is not ordered wrt. the nonatomic store to data in the other thread. Please fix that and report back whether this changes anything. (I believe the use of relaxed-MO atomic accesses should be sufficient.) Nonetheless, if there is a fixed (data-race-free) similar test, then I believe this fixed test case to be valid in the sense that AFAIU the standards' requirements (both C++11's and C11's; they are equal in this aspect), code as in this test case is supposed to be carrying the dependencies forward. However, I cannot see a *practical* implementation of this requirement. AFAIU, dependencies are carried forward through loads/stores too (5.1.2.4#14): "An evaluation A carries a dependency to an evaluation B if: [...]A writes a scalar object or bit-field M, B reads from M the value written by A, and A is sequenced before B". Because a compiler cannot, in general, analyze which piece of code stored last to M, it would have to conservatively assume that a dependency is carried through M. In turn, it could never optimize expressions such as "(x-x)", at least if those affect atomic operations on M in some way, irrespective of how remotely that might have happened. (I'm restricting my statement to atomic accesses because I'm not quite sure whether there's some other reason why nonatomic accesses couldn't be affected by this.) OTOH, maybe there are factors that I'm not aware of and that effectively constrain carries-a-dependency to something that doesn't widely prohibit optimizations. I don't have significant background information on the design of this part of the memory model (nor the implementability of this part), but I'll ask around at the C++ meeting in two weeks. > According to the architecture specification, to achieve the ordering it's > sufficient to use the result of the first load in the calculation of the > address of the second, even if it's done in such a way as to have no > dependence on the value. And that's certainly a feasible *architecture* specification. ISTM that the language standard tried to model this, but then didn't give the compiler enough leeway to still be able to optimize code. The carries-a-depency rules seem like an unusual shortcut from the language level to requirements on the generated code, without the usual indirection using the observable behavior of the abstract machine and as-if. For example -- and this isn't a precise definition of course -- if the standard would require that "f" (in your test case) is needed to process the evaluation, then things like "f-f" could be optimized out, and the compiler might just do the right thing with standard optimizations. As a short-term fix, I believe we can promote all memory_order_consume atomics to memory_order_acquire, when generating code with consume-MO atomics. GCC is at fault here, not the code in the builtins or in libatomic. Thus, this would prevent errors when GCC compiled the code and a correct atomics implementation is used (e.g., third-party libatomic). This won't help when another (potentially correct) compiler generated the code with the consume-MO atomics, and GCC-compiled code fails to carry the dependencies forward. We could try to change libatomic's consume-MO implementation to prevent that, but this would any catch cases in which libatomic is actually used and not the other compiler's builtins or similar. Thus, changing libatomic does not seem to be necessary. Richard, Andrew: Thoughts?
Where do you get that this is racy if the access to data is not atomic? By design, release/acquire and release/consume sequences don't require wholesale changes to the way the data payload (in the general case, multiple fields within a structure) is first constructed and then used. 1.10#13 makes clear that as a result of the intra-thread sequencing between atomic and non-atomic operations (1.9#14), and the inter-thread ordering between atomic operations (1.10 various), there is a resulting ordering on operations to "ordinary" (sic) objects. Please see the references to the C++ standard in the source example, for the chain of reasoning here.
(In reply to algrant from comment #11) > Where do you get that this is racy if the access to data is not atomic? In threadB(), you do: f = flag.load(std::memory_order_consume); // B.A d = *(&data + f - f); // B.B That reads from data irrespective of the value of f, so will read when data is actually not "owned" by threadB. You can either make the accesses to data atomic, or move the load from data to after checking that flag equals 1. > By > design, release/acquire and release/consume sequences don't require > wholesale changes to the way the data payload (in the general case, multiple > fields within a structure) is first constructed and then used. 1.10#13 > makes clear that as a result of the intra-thread sequencing between atomic > and non-atomic operations (1.9#14), and the inter-thread ordering between > atomic operations (1.10 various), there is a resulting ordering on > operations to "ordinary" (sic) objects. Please see the references to the > C++ standard in the source example, for the chain of reasoning here. I'm very much aware of the rules, I believe. But as far as you can see, there's a data race in your test program, which results in undefined behavior.
I see what you mean - there is a race if threadB reads the data when the flag is not set, i.e. in the case when the read value is never used. Moving the read of data after the test (as in the fragment in comment 4) would fix the race.
This appears to have now morphed into a standards issue and possibly affects all targets that have a weak memory model. Anyway confirming this.
Created attachment 33831 [details] promote memory order consume to acquire So have we concluded that we should promote memory_order_consume to memory_order_acquire for now? That change should be fairly trivial. builtins.c::get_memmodel would simply change any consumes to acquire. All consumes would then be promoted to acquire. I don't know if libatomic does much with consume, but once compiled with this patch in the compiler, it will also generate acquire code for consumes. Here's an untested patch provided if someone wants to experiment.
My view is promote consume to acquire until the standards committees and formal model people have sorted out what to do with it.
(In reply to Andrew Macleod from comment #15) > So have we concluded that we should promote memory_order_consume to > memory_order_acquire for now? I also think that this is the best way forward. I believe everyone in ISO C++ SG1 agreed that this is basically a defect in the standard. What I haven't thought through is how to deal with with carries_dependency (7.6.4 in C++11): For GCC code generated after we promote consume to acquire, it can safely be ignored; but should GCC code be linked to code generated by another compiler that does not promote and expects the code to preserve dependencies, this won't work. I am not aware of any shipping compiler that would actually try to preserve dependencies, and nobody else mentioned any during the discussion of this topic in ISO C++ SG1. Thus, we could assume that there are no such other compilers, and make it part of the ABI (assumptions) that consume is promoted to acquire in a correct compiler. Alternatively, we could try to be conservative and add an acquire barrier before the function body if any parameter of the function has carries_dependency; and, likewise, add an acquire barrier after every call to a function which has carries_dependency. I don't have more input from the ISO C side, but I would guess that the situation there is similar. Thoughts?
> I am not aware of any shipping compiler that would actually try to preserve dependencies, and nobody else mentioned any during the discussion of this topic in ISO C++ SG1. In case the data point is useful, Clang promotes consume to acquire at the moment (though I wouldn't rule out bugs even in that choice).
On Tue, 28 Oct 2014, torvald at gcc dot gnu.org wrote: > Alternatively, we could try to be conservative and add an acquire barrier > before the function body if any parameter of the function has > carries_dependency; and, likewise, add an acquire barrier after every call to a > function which has carries_dependency. > > I don't have more input from the ISO C side, but I would guess that the > situation there is similar. As previously noted: * ISO C does not have carries_dependency (or attributes at all). * carries_dependency is about increasing optimization, not decreasing it. If it affects when barriers are added, it does so by allowing some barriers to be omitted that would otherwise be required.
(In reply to joseph@codesourcery.com from comment #19) > * carries_dependency is about increasing optimization, not decreasing it. > If it affects when barriers are added, it does so by allowing some > barriers to be omitted that would otherwise be required. That's not quite true, unfortunately, AFAIU. I agree that generally, attributes are supposed to be "ignorable" -- but in the case of carries_dependency, C++11 explicitly requires that all declarations of a function either have the attribute or none has (see 7.6.4p2). This is because you need that to actually exploit the attribute; it's a contract between callers and callees. If a compiler does try to preserve dependencies (or not across function boundaries), then ignoring the attribute is fine. But if there should be a compiler out there that doesn't, and GCC-generated code is supposed to link to that other compiler's code, then we need to do something to make this work.
(In reply to torvald from comment #17) > (In reply to Andrew Macleod from comment #15) > > So have we concluded that we should promote memory_order_consume to > > memory_order_acquire for now? > > I also think that this is the best way forward. I believe everyone in ISO > C++ SG1 agreed that this is basically a defect in the standard. To clarify, my impression from the discussion was that there was general consensus that memory_order_consume as specified now wouldn't achieve what it was intended to in practice, due to the implementation issues. IOW, it's not useful even though it's not a defect in the sense of being inconsistent or plain wrong.
(In reply to algrant from comment #13) > I see what you mean - there is a race if threadB reads the data when the > flag is not set, i.e. in the case when the read value is never used. > Moving the read of data after the test (as in the fragment in comment 4) > would fix the race. algrant, could you post the complete assembly listing for threadB(), using your version of AArch64 g++, after all the corrections to the example have been taken into account, with -O2? For convenience, I've posted the corrected example below. When I compile the example using PowerPC g++ 4.9.1, it seems that memory_order_consume is already promoted to memory_order_acquire, as evidenced by the cmpw;bne-;isync sequence following the load from flag, which acts as an acquire barrier. I've posted the PowerPC assembly listing below as well. Is the bug specific to AArch64? ------------- /* Reproducer for 59448: compile with -std=c++11 -lpthread -O2 */ #include <atomic> #include <thread> #include <stdio.h> #include <assert.h> static std::atomic<unsigned int> flag(0); static int data; /* Thread A tries to release a data value 1 to thread B via a release/consume sequence. In the demonstration, this is tried repeatedly. The iterations of the two threads are synchronized via a release/acquire sequence from B to A. This is not relevant to the ordering problem. */ void threadA(void) { for (;;) { data = 1; // A.A flag.store(1, std::memory_order_release); // A.B /* By 1.9#14, A.A is sequenced before A.B */ /* Now wait for thread B to accept the data. */ while (flag.load(std::memory_order_acquire) == 1); assert(data == 0); } } void threadB(void) { for (;;) { int f, d; do { f = flag.load(std::memory_order_consume); // B.A } while (!f); /* By 1.10#9, B.A carries a dependency to B.B */ /* By 1.10#10, A.B is dependency-ordered before B.B */ /* By 1.10#11, A.A intra-thread happens before B.B */ /* By 1.10#12, A.A happens before B.B */ /* By 1.10#13, A.A is a visible side-effect with respect to B.B and the value determined by B.B shall be the value (1) stored by A.A. */ d = *(&data + (f - f)); // B.B assert(d == 1); /* Release thread A to start another iteration. */ data = 0; flag.store(0, std::memory_order_release); } } int main(void) { std::thread a(&threadA); std::thread b(&threadB); a.join(); b.join(); return 0; } ------------- PowerPC listing of threadB(): _Z7threadBv: .LFB2302: .cfi_startproc lis 10,.LANCHOR0@ha li 8,0 la 10,.LANCHOR0@l(10) .L16: lwz 9,4(10) cmpw 7,9,9 <------- acquire barrier bne- 7,$+4 <------- isync <------- cmpwi 7,9,0 beq+ 7,.L16 lwz 9,0(10) cmpwi 7,9,1 bne- 7,.L22 stw 8,0(10) lwsync stw 8,4(10) b .L16 .L22: mflr 0 lis 6,.LANCHOR1@ha stwu 1,-16(1) .cfi_def_cfa_offset 16 .cfi_register 65, 0 lis 3,.LC2@ha lis 4,.LC1@ha la 6,.LANCHOR1@l(6) la 3,.LC2@l(3) la 4,.LC1@l(4) li 5,47 addi 6,6,16 stw 0,20(1) .cfi_offset 65, 4 bl __assert_fail .cfi_endproc
Hi, I went ahead and verified this bug using a cross-compiler built from GCC 4.9.2 sources. The bug indeed exists and happens when compiling for AArch64, but not PowerPC. Andrew's patch fixes it (changing the first ldr instruction to an ldar in this case). Full AArch64 assembly listings below. I've also written a blog post on this subject in the hope of clarifying the issue for anyone determined enough to make sense of it: http://preshing.com/20141124/fixing-gccs-implementation-of-memory_order_consume Andrew's patch, if it works the way I understand it, seems like the correct thing for GCC to do until somebody figures out how to safely implement the "efficient" compiler strategy for consume semantics. I guess the next step is to run the test suite on a few platforms to make sure there are no regressions, then submit? Cheers, Jeff ------------------ AArch64 listing of threadB() without Andrew's patch: _Z7threadBv: .LFB2304: .cfi_startproc adrp x1, .LANCHOR0 stp x29, x30, [sp, -16]! .cfi_def_cfa_offset 16 .cfi_offset 29, -16 .cfi_offset 30, -8 add x1, x1, :lo12:.LANCHOR0 add x29, sp, 0 .cfi_def_cfa_register 29 .L10: add x0, x1, 8 ldr w0, [x0] cbz w0, .L10 ldr w0, [x1] cmp w0, 1 bne .L15 str wzr, [x1] add x0, x1, 8 stlr wzr, [x0] b .L10 .L15: adrp x3, .LANCHOR1 adrp x0, .LC2 adrp x1, .LC1 add x3, x3, :lo12:.LANCHOR1 add x0, x0, :lo12:.LC2 add x1, x1, :lo12:.LC1 mov w2, 47 add x3, x3, 16 bl __assert_fail .cfi_endproc ------------------ AArch64 listing of threadB() with Andrew's patch: _Z7threadBv: .LFB2304: .cfi_startproc adrp x1, .LANCHOR0 stp x29, x30, [sp, -16]! .cfi_def_cfa_offset 16 .cfi_offset 29, -16 .cfi_offset 30, -8 add x1, x1, :lo12:.LANCHOR0 add x29, sp, 0 .cfi_def_cfa_register 29 .L10: add x0, x1, 8 ldar w0, [x0] cbz w0, .L10 ldr w0, [x1] cmp w0, 1 bne .L15 str wzr, [x1] add x0, x1, 8 stlr wzr, [x0] b .L10 .L15: adrp x3, .LANCHOR1 adrp x0, .LC2 adrp x1, .LC1 add x3, x3, :lo12:.LANCHOR1 add x0, x0, :lo12:.LC2 add x1, x1, :lo12:.LC1 mov w2, 47 add x3, x3, 16 bl __assert_fail .cfi_endproc
Author: amacleod Date: Wed Jan 14 13:58:35 2015 New Revision: 219601 URL: https://gcc.gnu.org/viewcvs?rev=219601&root=gcc&view=rev Log: 2015-01-14 Andrew MacLeod <amacleod@redhat.com> PR middle-end/59448 * builtins.c (get_memmodel): Promote consume to acquire always. * testsuite/gcc.dg/atomic-invalid.c: Remove obselete test for illegal consume in an atomic_exchange. Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/atomic-invalid.c
(In reply to Andrew Macleod from comment #24) > Author: amacleod > Date: Wed Jan 14 13:58:35 2015 > New Revision: 219601 > > URL: https://gcc.gnu.org/viewcvs?rev=219601&root=gcc&view=rev > Log: > > 2015-01-14 Andrew MacLeod <amacleod@redhat.com> > > PR middle-end/59448 > * builtins.c (get_memmodel): Promote consume to acquire always. > * testsuite/gcc.dg/atomic-invalid.c: Remove obselete test for illegal > consume in an atomic_exchange. To summarize, Jeff Preshing confirmed that Andrew's patch fixes the problem (see Comment 23). Everyone in ISO C++ seems to agree that promoting consume to acquire is the right approach until we have a different language-level facility that is actually implementable. Therefore, I'm closing this as fixed.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html