Consider the code:

#include <stdatomic.h>

atomic_uint foo;
atomic_uint bar;
int glob;

int main(void)
{
  glob = atomic_compare_exchange_strong (&foo, &bar, 0);
  return glob;
}

At -O2 -march=armv7-a GCC 5 generates:

main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        movw    r2, #:lower16:bar
        movw    r3, #:lower16:foo
        movt    r2, #:upper16:bar
        movt    r3, #:upper16:foo
        mov     r0, #0
        str     lr, [sp, #-4]!
        ldr     r1, [r2]
        dmb     sy
.L3:
        ldrex   ip, [r3]
        cmp     ip, r1
        bne     .L4
        strex   lr, r0, [r3]
        cmp     lr, #0
        bne     .L3
.L4:
        movw    r3, #:lower16:glob
        moveq   r1, #1
        movne   r1, r0
        movt    r3, #:upper16:glob
        dmb     sy
        mov     r0, r1
        strne   ip, [r2]
        str     r1, [r3]
        ldr     pc, [sp], #4

GCC 6 generates:

main:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        movw    r2, #:lower16:bar
        movt    r2, #:upper16:bar
        movw    r3, #:lower16:foo
        mov     r0, #0
        ldr     r1, [r2]
        movt    r3, #:upper16:foo
        dmb     ish
        ldrex   ip, [r3]
        cmp     ip, r1
        bne     .L14
        str     lr, [sp, #-4]!   // Stack push moved below barrier
.L8:
        strex   lr, r0, [r3]
        cmp     lr, #0
        bne     .L3
.L4:
        movw    r3, #:lower16:glob
        movt    r3, #:upper16:glob
        moveq   r1, #1
        movne   r1, r0
        dmb     ish
        mov     r0, r1
        strne   ip, [r2]
        str     r1, [r3]
        ldr     pc, [sp], #4     // pop stack and return
.L3:
        ldrex   ip, [r3]
        cmp     ip, r1
        beq     .L8
        b       .L4
.L14:
        movw    r3, #:lower16:glob
        movt    r3, #:upper16:glob
        moveq   r1, #1
        movne   r1, r0
        dmb     ish
        mov     r0, r1
        strne   ip, [r2]
        str     r1, [r3]
        bx      lr               // simple return

Notice how the stack push "str lr, [sp, #-4]!" got moved below the barrier "dmb ish". It turns out that this is not fatal from a correctness perspective: if that push executes after the load-exclusive, then on ARMv7-A it may clear the exclusive monitor (the architecture leaves it implementation-defined whether an intervening store clears the monitor). That makes the store-exclusive after .L8 fail, which sends control flow to .L3 to replay the load-exclusive/store-exclusive pair without the intervening stack push.
So it all sort of works out in the end, but it is really suboptimal and not the sequence one would expect for this code.

This occurs during shrink-wrapping; -fno-shrink-wrap "fixes" it.
Not allowing the prologue to be moved over a clobber of scratch memory would pessimize quite a lot of code, I think. The problem here is that you really cannot allow any store insn inside the loop at all. Maybe the loop should be just one RTL insn?

Confirmed, btw.
If you disallow this memory clobber from being copied (via the cannot_copy_insn_p hook), do you then get what you want?
(In reply to Segher Boessenkool from comment #2)
> If you disallow this memory clobber from being copied (via the
> cannot_copy_insn_p hook), do you then get what you want?

Catching the memory barrier pattern in cannot_copy_insn_p does indeed get the same sequence as GCC 5. But I don't see documentation for it in tm.texi?

As for the single-RTL-insn idea: we keep the compare-exchange operation as a single insn up until after reload (to prevent the register allocator doing any funny business with spills) and split it after reload. We don't have precedent in the arm backend for keeping splittable insns any later than that (for example until split4), and I don't know what the splitting predicate would look like.

Considering that the memory barrier insn is represented as a clobber of a BLKmode MEM, doesn't that conceptually mean that it potentially clobbers all of memory, and thus no memory operation should be moved past it?
(In reply to ktkachov from comment #3)
> (In reply to Segher Boessenkool from comment #2)
> > If you disallow this memory clobber from being copied (via the
> > cannot_copy_insn_p hook), do you then get what you want?
>
> Catching the memory barrier pattern in cannot_copy_insn_p does indeed get
> the same sequence as GCC 5.

Good to hear. This, together with this pattern being a very late split as you say, sounds to me like a good way forward. Probably make some special unspec for this so as not to pessimise other barrier uses.

> But I don't see documentation for it in tm.texi ?

It is DEFHOOK_UNDOC in target.def, although it does have documentation. It is the only undoc hook there that does not say "documenting this requires a GFDL license grant", so maybe just an oversight. The doc is a trivial one-liner anyway.

> Considering that the memory barrier insn is represented as a clobber of a
> BLKmode MEM doesn't that conceptually mean that it potentially clobbers all
> of memory and thus no memory operations should be moved past it?

Yes, but shrink-wrapping isn't actually moving any instructions; it decides where to put *new* instructions. Putting the prologue later than some barrier is perfectly fine in general, and quite beneficial.
I'll propose a patch to disallow copying of load-exclusive insns
(In reply to ktkachov from comment #5)
> I'll propose a patch to disallow copying of load-exclusive insns

https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00211.html
Author: ktkachov
Date: Thu Mar 3 17:25:43 2016
New Revision: 233941

URL: https://gcc.gnu.org/viewcvs?rev=233941&root=gcc&view=rev
Log:
[ARM] PR rtl-optimization/69904: Disallow copying/duplicating of load-exclusive operations

	PR rtl-optimization/69904
	* config/arm/arm.c (arm_cannot_copy_insn_p): Return true for
	load-exclusive instructions.

	* gcc.target/arm/pr69904.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr69904.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/testsuite/ChangeLog
Fixed.