Created attachment 23656 [details]
Patch alters DMB placement and adds CLREXNE

The ARMv6+ implementation of __sync_*_compare_and_swap branches on a failed compare. There are two (theoretical, as I understand it) flaws in this branch path. One, it skips past the memory barrier at the end of the critical region, which could allow memory accesses to be speculated in. Two, it doesn't perform a CLREX (or, for older ARMv6, a dummy STREX) to clear the local monitor.

This may not be a practical problem in most userland code, but it's at least not technically correct according to the ARM docs, and it interferes with auditing locking code using valgrind/qemu/etc.

The attached patch fixes these two issues for ARMv7-A / Thumb2 targets. However, the "clrexne" part of it is not correct for older ARMv6 variants which lack CLREX or the ability to make it conditional (or for assemblers which can't handle the Thumb2 "it" opcode when assembling for ARM).
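For reference, the generated sequence under discussion comes from calls like the following (a minimal sketch with illustrative names):

    #include <stdint.h>

    static volatile uint32_t word;

    /* A bare use of the primitive in question; on ARMv6+ GCC expands
       this to the dmb/ldrex/cmp/strex loop quoted below. */
    uint32_t try_swap(uint32_t expected, uint32_t desired)
    {
        return __sync_val_compare_and_swap(&word, expected, desired);
    }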
The code sequence in question looks like this:

            dmb     sy
    1:      ldrex   r0, ...
            cmp     r0, r1
            bne     2f
            strex   r4, r2, ..
            teq     r4, #0
            bne     1b
            dmb     sy
    2:

The ARM Architecture Reference Manual (section 3.4.5) requires that there be no explicit memory references between the LDREX and the STREX; this requirement does not extend to speculative loads in branch shadows. An LDREX without a following STREX is explicitly permitted by the reference manual (section 3.4.5). The CLREX instruction is provided for use by context-switch code, in order to prevent a false-positive STREX following a context switch (section 3.4.4). The inclusion of a CLREX instruction in the above fragment is not required by the ARM Architecture Reference Manual.
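For readers without the manual to hand, the loop above implements roughly the following, sketched in C. This is illustrative only: store_exclusive is a hypothetical stand-in for the STREX (returning nonzero on success), and in hardware the load/store pair is guarded by the exclusive monitor rather than expressed this way.

    #include <stdint.h>

    /* Hypothetical stand-in for strex: stores v iff the exclusive monitor
       opened by the matching ldrex is still held; nonzero on success. */
    extern int store_exclusive(volatile uint32_t *p, uint32_t v);

    uint32_t compare_and_swap(volatile uint32_t *p,
                              uint32_t oldval, uint32_t newval)
    {
        uint32_t cur;
        /* leading dmb sy: order earlier accesses before the exchange */
        do {
            cur = *p;                /* ldrex: load, open the monitor   */
            if (cur != oldval)
                return cur;          /* bne 2f: skips the trailing dmb  */
        } while (!store_exclusive(p, newval));  /* teq/bne 1b: retry    */
        /* trailing dmb sy: executed only when the swap went ahead */
        return cur;
    }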
Please insert the text of your citations, since the ARM ARM is not a public document.

I think I'm persuaded that the CLREX isn't necessary -- although I'm going to keep it in my compiler so that I can monitor ldrex/strex pairing accurately in valgrind/qemu.

How about the branch past the DMB? What's actually required in order to get "ordered" semantics? If it's safe to skip the trailing DMB in the "compare failed" case, maybe it can be disposed of entirely?
The primitive is required to have lock semantics; therefore the load of the old value must be followed by a DMB in the case where the old-value comparison succeeds and the swap goes ahead. In the branch-out case the final DMB serves no purpose: the swap did not occur and no lock was taken. Therefore the branch over the DMB is OK.
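To make the lock-semantics argument concrete, here is the usual pattern (a minimal sketch using the GCC builtins; lock and shared_data are illustrative names):

    static volatile int lock = 0;
    static int shared_data;

    void update_shared(int v)
    {
        /* Take the lock: only a *successful* swap admits us, and the
           trailing dmb on that path keeps the accesses below from being
           speculated back past the acquire. */
        while (!__sync_bool_compare_and_swap(&lock, 0, 1))
            ;                        /* spin: the failed path takes no lock */
        shared_data = v;             /* ordered after the acquire           */
        __sync_lock_release(&lock);  /* release, with the needed barrier    */
    }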
OK, that's a clear explanation of why the DMB is necessary in the case where both the compare and the store succeed (neither branch is taken; at a higher semantic level, a lock is acquired, if that's what the atomic is being used for). For future reference, I would appreciate having those ARM ARM quotations, along with some indication of how load scheduling interacts with a branch past a memory barrier.

Suppose that the next instruction after label "2" is a load. On some ARMv7 implementations, I presume that this load can get issued speculatively as early as label "1", due to the "bne 2f" branch shadow, which skips the trailing dmb. I gather that the intention is that, if this branch is not taken (and thus we execute through the trailing dmb), the fetch results from the branch shadow should be discarded, and the load re-issued (with, in a multi-core device, the appropriate ordering guarantee with respect to the strex). If this interpretation is more or less right, and the shipping silicon behaves as intended, then the branch past the dmb may be harmless -- although I might argue that it wastes memory bandwidth in what is usually the common case (compare-and-swap succeeds) in exchange for a slight reduction in latency in the other case.

Yet that's still not quite the documented semantics of the GCC compare-and-swap primitive, which is supposed to be totally ordered whether or not the swap succeeds. When I write a lock-free algorithm, I may well rely on the guarantee that the value read in the next line of code was actually fetched after the value fetched by the ldrex. In fact, I have code that does rely on this guarantee; if thread A sees that thread B has altered the atomic location, then it expects to be able to see all data that thread B wrote before issuing its compare-and-swap.

Here's the problem case:

    thread A:                      thread B:

    dmb
    <speculated load of Y>
                                   store Y
                                   dmb
                                   ldrex X
                                   cmp
                                   bne (doesn't branch)
                                   strex X
    ldrex X
    cmp
    bne (branches)
    load Y (speculated above)

Is there something I'm not seeing that prevents this?
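For concreteness, the problem case rendered as C against the documented full-barrier semantics (a minimal sketch; thread_A, thread_B, X, and Y are illustrative names, and the two functions are assumed to run on separate cores):

    static int Y;                /* payload written by thread B           */
    static volatile int X = 0;   /* location updated by compare-and-swap  */

    /* Thread B: publish Y, then announce it by swapping X from 0 to 1.
       The CAS succeeds, so both barriers in its expansion execute. */
    void thread_B(void)
    {
        Y = 42;
        __sync_bool_compare_and_swap(&X, 0, 1);
    }

    /* Thread A: attempt the same swap. If it fails, B got there first,
       and the full-barrier semantics promise that A now sees Y == 42.
       With the branch past the trailing dmb, the load of Y can be
       satisfied speculatively before the ldrex of X, returning stale Y. */
    int thread_A(void)
    {
        if (!__sync_bool_compare_and_swap(&X, 0, 1))
            return Y;    /* relies on ordering even when the swap fails */
        return -1;       /* A won the race; nothing to read             */
    }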
Michael:

I think I agree with you on the need for the barrier in the branch-out case; gcc's info page (section 6.49, 'Built-in functions for atomic memory access') states:

-----
In most cases, these builtins are considered a "full barrier". That is, no memory operand will be moved across the operation, either forward or backward. Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.
-----

so it does look like that last barrier would be needed to stop a subsequent load floating backwards before the ldrex.

If I understand correctly, however, most cases wouldn't need it: most uses take the compare&swap to acquire some form of lock, and then, once you know you have the lock, go and do your accesses. In that case the ordering is guaranteed, whereas if you couldn't take the lock you wouldn't perform the subsequent access anyway.

Dave
(In reply to comment #5)
> If I understand correctly, however, most cases wouldn't need it: most uses
> take the compare&swap to acquire some form of lock, and then, once you know
> you have the lock, go and do your accesses. In that case the ordering is
> guaranteed, whereas if you couldn't take the lock you wouldn't perform the
> subsequent access anyway.

Yes, that fits my understanding. It's only when you actually use the compare-and-swap as a compare-and-swap that you can get bitten. I expect that it is quite hard to hit this in the 32-bit case, but with your 64-bit atomics and a dual-core system it should be a little easier to expose. I have an implementation of Michael-Scott lock-free queues (which rely on applying DCAS to a counter+pointer), in which I currently use the assembly cmpxchg64 equivalent we discussed; I'll adapt it to use the GCC intrinsic and re-test.
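For concreteness, the counter+pointer pattern looks roughly like this (a sketch with illustrative names; it assumes the target provides the 64-bit __sync CAS mentioned above):

    #include <stdint.h>

    /* Tag a pointer-sized value with a version counter so that the
       64-bit CAS can detect ABA reuse (the Michael-Scott requirement). */
    union tagged {
        struct {
            uint32_t ptr;    /* node address or index        */
            uint32_t count;  /* bumped on every modification */
        } t;
        uint64_t raw;
    };

    static volatile uint64_t head;

    /* Attempt to advance head from 'expected' to 'new_ptr', bumping the
       version; returns nonzero iff the DCAS succeeded. */
    int try_advance(union tagged expected, uint32_t new_ptr)
    {
        union tagged desired;
        desired.t.ptr = new_ptr;
        desired.t.count = expected.t.count + 1;
        return __sync_bool_compare_and_swap(&head, expected.raw,
                                            desired.raw);
    }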
(In reply to comment #5)
> Michael:
>
> I think I agree with you on the need for the barrier in the branch-out case;
> gcc's info page (section 6.49, 'Built-in functions for atomic memory access')
> states:

I agree. The DMB needs to be included in the branch-out case in order to comply with the "full barrier" requirement.

/Marcus
So I think we agree that the CLREX is needless, but the DMB should move after the branch target. Does that make this bug "confirmed"? (I don't feel the need for patch credit. :-)
It is confirmed - though it's valid only for 4.6.0 onwards in the FSF tools; 4.5.x FSF doesn't have inline sync primitives.

cheers
Ramana
Author: rsandifo
Date: Fri Oct 14 14:38:42 2011
New Revision: 179980

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=179980
Log:
gcc/
2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	PR target/48126
	* config/arm/arm.c (arm_output_sync_loop): Move label before barrier.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
Dave's patch was applied to trunk. This isn't a regression, so it probably isn't suitable for backporting.
Author: jye2
Date: Fri Jun 8 06:58:25 2012
New Revision: 188327

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188327
Log:
Backport mainline r179607, r179979, r179980, r181416, r182014

2012-06-08  Joey Ye  <joey.ye@arm.com>

	Backport r182014 from mainline.
	2011-12-05  Kazu Hirata  <kazu@codesourcery.com>

	PR target/51408
	* config/arm/arm.md (*minmax_arithsi): Always require the else
	clause in the MINUS case.

	Backport r181416 from mainline.
	2011-11-16  Richard Earnshaw  <rearnsha@arm.com>
		    Bernd Schmidt  <bernds@coudesourcery.com>
		    Sebastian Huber  <sebastian.huber@embedded-brains.de>

	PR target/49641
	* config/arm/arm.c (store_multiple_sequence): Avoid cases where
	the base reg is stored iff compiling for Thumb1.

	Backport r179980 from mainline.
	2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	PR target/48126
	* config/arm/arm.c (arm_output_sync_loop): Move label before barrier.

	Backport r179979 from mainline.
	2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	* config/arm/arm.h (TARGET_HAVE_DMB_MCR): MCR not available in Thumb1.

	Backport r179607 from mainline.
	2011-10-06  Bernd Schmidt  <bernds@codesourcery.com>

	PR target/49049
	* config/arm/arm.md (arm_subsi3_insn): Lose the last alternative.

Testsuites:

	Backport r182014 from mainline
	2011-12-05  Kazu Hirata  <kazu@codesourcery.com>

	PR target/51408
	* gcc.dg/pr51408.c: New.

	Backport r181416 from mainline
	2011-11-16  Richard Earnshaw  <rearnsha@arm.com>
		    Bernd Schmidt  <bernds@coudesourcery.com>
		    Sebastian Huber  <sebastian.huber@embedded-brains.de>

	PR target/49641
	* gcc.target/arm/pr49641.c: New test.

	Backport r179607 from mainline
	2011-10-06  Bernd Schmidt  <bernds@codesourcery.com>

	PR target/49049
	* gcc.c-torture/compile/pr49049.c: New test.

Added:
    branches/ARM/embedded-4_6-branch/gcc/testsuite/gcc.c-torture/compile/pr49049.c
    branches/ARM/embedded-4_6-branch/gcc/testsuite/gcc.dg/pr51408.c
    branches/ARM/embedded-4_6-branch/gcc/testsuite/gcc.target/arm/pr49641.c
Modified:
    branches/ARM/embedded-4_6-branch/gcc/ChangeLog.arm
    branches/ARM/embedded-4_6-branch/gcc/config/arm/arm.c
    branches/ARM/embedded-4_6-branch/gcc/config/arm/arm.h
    branches/ARM/embedded-4_6-branch/gcc/config/arm/arm.md
    branches/ARM/embedded-4_6-branch/gcc/testsuite/ChangeLog.arm
Author: jye2
Date: Tue Jun 19 03:01:10 2012
New Revision: 188766

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188766
Log:
2012-06-18  Joey Ye  <joey.ye@arm.com>

	Backported from mainline
	2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	PR target/48126
	* config/arm/arm.c (arm_output_sync_loop): Move label before barrier.

Modified:
    branches/gcc-4_6-branch/gcc/ChangeLog
    branches/gcc-4_6-branch/gcc/config/arm/arm.c