Consider the code:

#include <stdatomic.h>

atomic_uint foo;
atomic_uint bar;
int glob;

int main(void)
{
  glob = atomic_compare_exchange_strong (&foo, &bar, 0);
  return glob;
}

At -O2 -march=armv7-a GCC 5 generates:

main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        movw    r2, #:lower16:bar
        movw    r3, #:lower16:foo
        movt    r2, #:upper16:bar
        movt    r3, #:upper16:foo
        mov     r0, #0
        str     lr, [sp, #-4]!
        ldr     r1, [r2]
        dmb     sy
.L3:
        ldrex   ip, [r3]
        cmp     ip, r1
        bne     .L4
        strex   lr, r0, [r3]
        cmp     lr, #0
        bne     .L3
.L4:
        movw    r3, #:lower16:glob
        moveq   r1, #1
        movne   r1, r0
        movt    r3, #:upper16:glob
        dmb     sy
        mov     r0, r1
        strne   ip, [r2]
        str     r1, [r3]
        ldr     pc, [sp], #4

GCC 6 generates:

main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        movw    r2, #:lower16:bar
        movt    r2, #:upper16:bar
        movw    r3, #:lower16:foo
        mov     r0, #0
        ldr     r1, [r2]
        movt    r3, #:upper16:foo
        dmb     ish
        ldrex   ip, [r3]
        cmp     ip, r1
        bne     .L14
        str     lr, [sp, #-4]!   // Stack push moved below barrier
.L8:
        strex   lr, r0, [r3]
        cmp     lr, #0
        bne     .L3
.L4:
        movw    r3, #:lower16:glob
        movt    r3, #:upper16:glob
        moveq   r1, #1
        movne   r1, r0
        dmb     ish
        mov     r0, r1
        strne   ip, [r2]
        str     r1, [r3]
        ldr     pc, [sp], #4     // pop stack and return
.L3:
        ldrex   ip, [r3]
        cmp     ip, r1
        beq     .L8
        b       .L4
.L14:
        movw    r3, #:lower16:glob
        movt    r3, #:upper16:glob
        moveq   r1, #1
        movne   r1, r0
        dmb     ish
        mov     r0, r1
        strne   ip, [r2]
        str     r1, [r3]
        bx      lr               // simple return

Notice how the stack push "str lr, [sp, #-4]!" got moved below the barrier "dmb ish". It turns out that this is not fatal from a correctness perspective: if that push executes after the load-exclusive, then on ARMv7-A it may clear the exclusive monitor (the architecture leaves it implementation-defined whether an intervening store clears the monitor). That makes the store-exclusive after .L8 fail, which sends control flow to .L3 to replay the load-exclusive/store-exclusive pair without the intervening stack push.
So it all sort of works out in the end, but it is really suboptimal and not the sequence one would expect for this code.

This occurs during shrink-wrapping; -fno-shrink-wrap "fixes" it.
Not allowing the prologue to be moved over a clobber of scratch memory would pessimize quite a lot of code, I think. The problem here is that you really cannot allow any store insn inside the loop at all. Maybe the loop should be just one RTL insn?

Confirmed, btw.
If you disallow this memory clobber from being copied (via the cannot_copy_insn_p hook), do you then get what you want?
(In reply to Segher Boessenkool from comment #2)
> If you disallow this memory clobber from being copied (via the
> cannot_copy_insn_p hook), do you then get what you want?

Catching the memory barrier pattern in cannot_copy_insn_p does indeed get the same sequence as GCC 5. But I don't see documentation for it in tm.texi?

As for the single-RTL-insn idea: we keep the compare-exchange operation as a single insn up until after reload (to prevent the register allocator doing any funny business with spills) and split it after reload. We don't have precedent in the arm backend for keeping splittable insns any later than that (for example until split4), and I don't know what the splitting predicate would look like.

Considering that the memory barrier insn is represented as a clobber of a BLKmode MEM, doesn't that conceptually mean that it potentially clobbers all of memory, and thus no memory operation should be moved past it?
(In reply to ktkachov from comment #3)
> (In reply to Segher Boessenkool from comment #2)
> > If you disallow this memory clobber from being copied (via the
> > cannot_copy_insn_p hook), do you then get what you want?
>
> Catching the memory barrier pattern in cannot_copy_insn_p does indeed get
> the same sequence as GCC 5.

Good to hear. This, together with this pattern being a very late split as you say, sounds to me like a good way forward. Probably make some special unspec for this so as not to pessimise other barrier uses.

> But I don't see documentation for it in tm.texi ?

It is DEFHOOK_UNDOC in target.def, although it does have documentation. It is the only undoc hook there that does not say "documenting this requires a GFDL license grant", so maybe just an oversight. The doc is a trivial one-liner anyway.

> Considering that the memory barrier insn is represented as a clobber of a
> BLKmode MEM doesn't that conceptually mean that it potentially clobbers all
> of memory and thus no memory operations should be moved past it?

Yes, but shrink-wrapping isn't actually moving any instructions; it decides where to put *new* instructions. Putting the prologue later than some barrier is perfectly fine in general, and quite beneficial.
I'll propose a patch to disallow copying of load-exclusive insns
(In reply to ktkachov from comment #5)
> I'll propose a patch to disallow copying of load-exclusive insns

https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00211.html
Author: ktkachov
Date: Thu Mar 3 17:25:43 2016
New Revision: 233941

URL: https://gcc.gnu.org/viewcvs?rev=233941&root=gcc&view=rev
Log:
[ARM] PR rtl-optimization/69904: Disallow copying/duplicating of load-exclusive operations

	PR rtl-optimization/69904
	* config/arm/arm.c (arm_cannot_copy_insn_p): Return true for
	load-exclusive instructions.

	* gcc.target/arm/pr69904.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr69904.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/testsuite/ChangeLog
Fixed.