Bug 48126 - arm_output_sync_loop: misplaced memory barrier
Summary: arm_output_sync_loop: misplaced memory barrier
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.6.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2011-03-14 23:11 UTC by Michael K. Edwards
Modified: 2012-06-19 03:01 UTC
CC: 4 users

See Also:
Host:
Target: arm
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-06-24 13:55:55


Attachments
Patch alters DMB placement and adds CLREXNE (543 bytes, patch)
2011-03-14 23:11 UTC, Michael K. Edwards

Description Michael K. Edwards 2011-03-14 23:11:58 UTC
Created attachment 23656 [details]
Patch alters DMB placement and adds CLREXNE

The ARMv6+ implementation of __sync_*_compare_and_swap branches on a failed compare.  There are two (theoretical, as I understand it) flaws in this branch path.  One, it skips past the memory barrier at the end of the critical region, which could allow memory accesses to be speculated in.  Two, it doesn't perform a clrex (or, for older ARMv6, a dummy strex) to clear the local monitor.  This may not be a practical problem in most userland code, but it's at least not technically correct according to ARM docs, and it interferes with auditing locking code using valgrind/qemu/etc.

The attached patch fixes these two issues for ARMv7-a / Thumb2 targets.  However, the "clrexne" part of it is not correct for older ARMv6 variants which lack clrex or the ability to add a conditional on it (or for assemblers which can't handle the Thumb2 "it" opcode when assembling for ARM).
Comment 1 Marcus Shawcroft 2011-03-17 16:14:36 UTC
The code sequence in question looks like this:
    dmb     sy
1:
    ldrex   r0, ...
    cmp     r0, r1
    bne     2f
    strex   r4, r2, ...
    teq     r4, #0
    bne     1b
    dmb     sy
2:

The ARM Architecture Reference Manual (section 3.4.5) requires that
there are no explicit memory references between the LDREX and the
STREX; this requirement does not extend to speculative loads in
branch shadows.

An LDREX without a following STREX is explicitly permitted by the
reference manual (section 3.4.5).

The CLREX instruction is provided for use by context switch code in
order to prevent a false positive STREX following a context switch
(section 3.4.4). The inclusion of a CLREX instruction in the above
fragment is not required by the ARM architecture reference manual.
Comment 2 Michael K. Edwards 2011-03-17 18:14:11 UTC
Please insert the text of your citations, since the ARM ARM is not a public document.

I think I'm persuaded that the CLREX isn't necessary -- although I'm going to keep it in my compiler so that I can monitor ldrex/strex pairing accurately in valgrind/qemu.

How about the branch past the DMB?  What's actually required in order to get "ordered" semantics?  If it's safe to skip the trailing DMB in the "compare failed" case, maybe it can be disposed of entirely?
Comment 3 Marcus Shawcroft 2011-05-24 13:37:03 UTC
The primitive is required to have lock semantics; therefore the load of the old value must be followed by a dmb in the case where the old-value comparison succeeds and the swap goes ahead.  In the branch-out case the final dmb serves no purpose: the swap did not occur and no lock was taken.  Therefore the branch over the dmb is ok.
Comment 4 Michael K. Edwards 2011-05-24 16:38:41 UTC
OK, that's a clear explanation of why the DMB is necessary in the case where both the compare and the store succeed (neither branch is taken; at a higher semantic level, a lock is acquired, if that's what the atomic is being used for).  For future reference, I would appreciate having those ARM ARM quotations, along with some indication of how load scheduling interacts with a branch past a memory barrier.

Suppose that the next instruction after label "2" is a load.  On some ARMv7 implementations, I presume that this load can get issued speculatively as early as label "1", due to the "bne 2f" branch shadow, which skips the trailing dmb.  I gather that the intention is that, if this branch is not taken (and thus we execute through the trailing dmb), the fetch results from the branch shadow should be discarded, and the load re-issued (with, in a multi-core device, the appropriate ordering guarantee with respect to the strex).

If this interpretation is more or less right, and the shipping silicon behaves as intended, then the branch past the dmb may be harmless -- although I might argue that it wastes memory bandwidth in what is usually the common case (compare-and-swap succeeds) in exchange for a slight reduction in latency in the other case.

Yet that's still not quite the documented semantics of the GCC compare-and-swap primitive, which is supposed to be totally ordered whether or not the swap succeeds.  When I write a lock-free algorithm, I may well rely on the guarantee that the value read in the next line of code was actually fetched after the value fetched by the ldrex.  In fact, I have code that does rely on this guarantee; if thread A sees that thread B has altered the atomic location, then it expects to be able to see all data that thread B wrote before issuing its compare-and-swap.  Here's the problem case:

thread A:                 thread B:

dmb
<speculated load of Y>
                          store Y
                          dmb
                          ldrex X
                          cmp
                          bne (doesn't branch)
                          strex X
ldrex X
cmp
bne (branches)
load Y (speculated above)

Is there something I'm not seeing that prevents this?
Comment 5 Dr. David Alan Gilbert 2011-06-22 16:40:07 UTC
Michael:
I think I agree with you on the need for the barrier in the branch-out case;
gcc's info page (section 6.49, 'Built-in functions for atomic memory access') states:

-----
 In most cases, these builtins are considered a "full barrier".  That
is, no memory operand will be moved across the operation, either
forward or backward.  Further, instructions will be issued as necessary
to prevent the processor from speculating loads across the operation
and from queuing stores after the operation.
------

so it does look like that last barrier would be needed to stop a subsequent load floating backwards before the ldrex.

If I understand correctly, however, most cases wouldn't need it.  I think most uses are compare&swap to take some form of lock, and then, once you know you have the lock, you go and do your accesses; in that case the ordering is guaranteed, whereas if you couldn't take the lock you wouldn't perform the subsequent accesses anyway.


Dave
Comment 6 Michael K. Edwards 2011-06-22 19:00:54 UTC
(In reply to comment #5)

> If I understand correctly, however, most cases wouldn't need it.  I think
> most uses are compare&swap to take some form of lock, and then, once you
> know you have the lock, you go and do your accesses; in that case the
> ordering is guaranteed, whereas if you couldn't take the lock you wouldn't
> perform the subsequent accesses anyway.

Yes, that fits my understanding.  It's only when you actually use the compare-and-swap as a compare-and-swap that you can get bit.  I expect that it is quite hard to hit this in the 32-bit case, but with your 64-bit atomics and a dual-core system it should be a little easier to expose.  I have an implementation of Michael-Scott lock-free queues (which rely on applying DCAS to a counter+pointer), in which I currently use the assembly cmpxchg64 equivalent we discussed; I'll adapt it to use the GCC intrinsic and re-test.
Comment 7 Marcus Shawcroft 2011-06-23 09:32:59 UTC
(In reply to comment #5)
> Michael:
> I think I agree with you on the need for the barrier in the branch out case;
> gcc's info page (section 6.49 'Built-in functions for atomic memory access')
> state:

I agree. The DMB needs to be included in the branch out case in order to comply with the "full barrier" requirement.

/Marcus
Comment 8 Michael K. Edwards 2011-06-24 11:28:53 UTC
So I think we agree that the CLREX is needless, but the DMB should move after the branch target.  Does that make this bug "confirmed"?  (I don't feel the need for patch credit. :-)
Comment 9 Ramana Radhakrishnan 2011-06-24 13:55:55 UTC
It is confirmed - though it's valid only for 4.6.0 onwards in the FSF tools. 

4.5.x FSF doesn't have inline sync primitives. 

cheers
Ramana
Comment 10 Richard Sandiford 2011-10-14 14:38:48 UTC
Author: rsandifo
Date: Fri Oct 14 14:38:42 2011
New Revision: 179980

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=179980
Log:
gcc/
2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	PR target/48126
	* config/arm/arm.c (arm_output_sync_loop): Move label before barrier.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
Comment 11 Richard Sandiford 2011-10-14 14:45:37 UTC
Dave's patch was applied to trunk.  This isn't a regression,
so it probably isn't suitable for backporting.
Comment 12 jye2 2012-06-08 06:58:32 UTC
Author: jye2
Date: Fri Jun  8 06:58:25 2012
New Revision: 188327

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188327
Log:
Backport mainline r179607, r179979, r179980, r181416, r182014
2012-06-08  Joey Ye  <joey.ye@arm.com>

	Backport r182014 from mainline.
	2011-12-05  Kazu Hirata  <kazu@codesourcery.com>

	PR target/51408
	* config/arm/arm.md (*minmax_arithsi): Always require the else
	clause in the MINUS case.

	Backport r181416 from mainline.
	2011-11-16  Richard Earnshaw  <rearnsha@arm.com>
		Bernd Schmidt <bernds@codesourcery.com>
		Sebastian Huber <sebastian.huber@embedded-brains.de>

	PR target/49641
	* config/arm/arm.c (store_multiple_sequence): Avoid cases where
	the base reg is stored iff compiling for Thumb1.

	Backport r179980 from mainline.
	2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	PR target/48126
	* config/arm/arm.c (arm_output_sync_loop): Move label before barrier.

	Backport r179979 from mainline.
	2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	* config/arm/arm.h (TARGET_HAVE_DMB_MCR): MCR Not available in Thumb1.

	Backport r179607 from mainline.
	2011-10-06  Bernd Schmidt  <bernds@codesourcery.com>

	PR target/49049
	* config/arm/arm.md (arm_subsi3_insn): Lose the last alternative.

Testsuites:
	Backport r182014 from mainline
	2011-12-05  Kazu Hirata  <kazu@codesourcery.com>

	PR target/51408
	* gcc.dg/pr51408.c: New.

	Backport r181416 from mainline
	2011-11-16  Richard Earnshaw  <rearnsha@arm.com>
		Bernd Schmidt <bernds@codesourcery.com>
		Sebastian Huber <sebastian.huber@embedded-brains.de>

	PR target/49641
	* gcc.target/arm/pr49641.c: New test.

	Backport r179607 from mainline
	2011-10-06  Bernd Schmidt  <bernds@codesourcery.com>

	PR target/49049
	* gcc.c-torture/compile/pr49049.c: New test.


Added:
    branches/ARM/embedded-4_6-branch/gcc/testsuite/gcc.c-torture/compile/pr49049.c
    branches/ARM/embedded-4_6-branch/gcc/testsuite/gcc.dg/pr51408.c
    branches/ARM/embedded-4_6-branch/gcc/testsuite/gcc.target/arm/pr49641.c
Modified:
    branches/ARM/embedded-4_6-branch/gcc/ChangeLog.arm
    branches/ARM/embedded-4_6-branch/gcc/config/arm/arm.c
    branches/ARM/embedded-4_6-branch/gcc/config/arm/arm.h
    branches/ARM/embedded-4_6-branch/gcc/config/arm/arm.md
    branches/ARM/embedded-4_6-branch/gcc/testsuite/ChangeLog.arm
Comment 13 jye2 2012-06-19 03:01:16 UTC
Author: jye2
Date: Tue Jun 19 03:01:10 2012
New Revision: 188766

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188766
Log:
2012-06-18  Joey Ye  <joey.ye@arm.com>

	Backported from mainline
	2011-10-14  David Alan Gilbert  <david.gilbert@linaro.org>

	PR target/48126
	* config/arm/arm.c (arm_output_sync_loop): Move label before barrier.

Modified:
    branches/gcc-4_6-branch/gcc/ChangeLog
    branches/gcc-4_6-branch/gcc/config/arm/arm.c