Bug 68908

Summary: inefficient code for _Atomic operations
Product: gcc Reporter: Martin Sebor <msebor>
Component: cAssignee: Marek Polacek <mpolacek>
Status: RESOLVED FIXED    
Severity: normal CC: amacleod, jakub, jsm28, rth
Priority: P3 Keywords: missed-optimization
Version: 6.0   
Target Milestone: 6.0   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2015-12-15 00:00:00

Description Martin Sebor 2015-12-15 03:54:19 UTC
GCC 6.0 for powerpc64le emits the following inefficient code for an atomic increment.  (The bogus warning is the subject of bug 68907).

$ cat ~/tmp/t.c && /build/gcc-trunk/gcc/xgcc -B /build/gcc-trunk/gcc -O2 -S -Wall -o/dev/stdout ~/tmp/t.c
typedef _Atomic int AI;

void increment (AI *ai)
{
    ++*ai;
}
	.file	"t.c"
	.machine power8
	.abiversion 2
	.section	".toc","aw"
	.section	".text"
/home/remote/msebor/tmp/t.c: In function ‘increment’:
/home/remote/msebor/tmp/t.c:5:5: warning: right-hand operand of comma expression has no effect [-Wunused-value]
     ++*ai;
     ^~~~~

	.align 2
	.p2align 4,,15
	.globl increment
	.type	increment, @function
increment:
	sync
	lwz 9,0(3)
	cmpw 7,9,9
	bne- 7,$+4
	isync
	extsw 9,9
	stw 9,-16(1)
	.p2align 4,,15
.L6:
	lwz 8,-16(1)
	addi 9,9,1
	sync
.L2:
	lwarx 10,0,3
	cmpw 0,10,8
	bne- 0,.L3
	stwcx. 9,0,3
	bne- 0,.L2
.L3:
	isync
	mfcr 9,128
	rlwinm 9,9,3,1
	beq 0,.L4
	stw 10,-16(1)
.L4:
	andi. 10,9,0x1
	bnelr- 0
	lwa 9,-16(1)
	b .L6
	.long 0
	.byte 0,0,0,0,0,0,0,0
	.size	increment,.-increment
	.ident	"GCC: (GNU) 6.0.0 20151214 (experimental)"
	.section	.note.GNU-stack,"",@progbits


Clang 3.8 emits the expected:

increment:                              # @increment
.Lfunc_begin0:
# BB#0:                                 # %entry
	li 4, 1
	sync
.LBB0_1:                                # %entry
                                        # =>This Inner Loop Header: Depth=1
	lwarx 5, 0, 3
	add 5, 4, 5
	stwcx. 5, 0, 3
	bne	 0, .LBB0_1
# BB#2:                                 # %entry
	lwsync
	blr
Comment 1 Martin Sebor 2015-12-15 04:12:09 UTC
Interestingly, it does emit equally as efficient code as Clang for __atomic_fetch_add:

void fetch_and_add (AI *ai)
{
    __atomic_fetch_add (ai, 1, __ATOMIC_SEQ_CST);
}

i.e.,

fetch_and_add:
	sync
.L2:
	lwarx 9,0,3
	addi 9,9,1
	stwcx. 9,0,3
	bne- 0,.L2
	isync
	blr
Comment 2 Jakub Jelinek 2015-12-15 08:24:21 UTC
Doesn't seem to be ppc64le specific in any way, and doesn't affect just preincrement.
Try:
typedef _Atomic int AI;
AI i;

void
fn1 (AI * ai)
{
  ++*ai;
}

void
fn2 (AI * ai)
{
  (*ai)++;
}

void
fn3 (AI * ai)
{
  *ai += 6;
}

void
fn4 (void)
{
  ++i;
}

void
fn5 (void)
{
  i++;
}

void
fn6 (void)
{
  i += 2;
}
and you'll see even on x86_64-linux that all the sequences use the generic CAS instructions instead of __atomic_fetch_add etc.

The comment above build_atomic_assign even says this:
"Also note that the compiler is simply issuing the generic form of the atomic operations."

So, the question is, should we add smarts to the FE to optimize the cases already when emitting them (this would be similar to what omp-low.c does when expanding #pragma omp atomic, see:
          /* When possible, use specialized atomic update functions.  */
          if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
              && store_bb == single_succ (load_bb)
              && expand_omp_atomic_fetch_op (load_bb, addr,
                                             loaded_val, stored_val, index))
            return;
), or should we add some pattern matching in some pass that would try to detect these rather complicated patterns like:
  <bb 2>:
  _5 = __atomic_load_4 (ai_3(D), 5);
  _6 = (int) _5;
  D.1768 = _6;

  <bb 3>:
  # prephitmp_17 = PHI <_6(2), pretmp_16(4)>
  _9 = prephitmp_17 + 1;
  _10 = (unsigned int) _9;
  _12 = __atomic_compare_exchange_4 (ai_3(D), &D.1768, _10, 0, 5, 5);
  if (_12 != 0)
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:
  pretmp_16 = D.1768;
  goto <bb 3>;

(with the casts in there optional) and convert those to the more efficient __atomic_* calls if possible?  Note one issue is that the pattern involves non-SSA loads/stores (the D.1768 var above) and we'd need to prove that the var is used only in those two places and nowhere else.
Comment 3 Marek Polacek 2015-12-15 15:21:41 UTC
Inviting Richard and Andrew to comment.
I'm afraid that the pattern-matching approach would be quite fragile, but as Jakub pointed out elsewhere, it'd have the advantage that even user written code with manual atomic_load and compare_and_swap could be optimized into the fetch_and_* patterns.
Comment 4 Richard Henderson 2015-12-15 15:44:26 UTC
I think we should rather handle this in the front end than with
quite complex pattern matching.

If we want to do any complex logic with atomics in the middle-end,
we should change their representation so that we don't have to
struggle with a sequence of builtins.  Which is clearly a subject
for gcc7 at minimum.
Comment 5 Marek Polacek 2015-12-15 15:50:53 UTC
Thanks, moving to C FE component then.
Comment 6 Martin Sebor 2015-12-15 16:24:49 UTC
(In reply to Jakub Jelinek from comment #2)
> Doesn't seem to be ppc64le specific in any way, and doesn't affect just
> preincrement.

The inefficiency I was pointing out was the redundant syncs above the loop on powerpc64.  The x86_64 assembly looks fairly efficient both ways.

I also intentionally focused the bug on the increment expression and didn't mention others like compound assignment because I expected the former to be more common.  But I suppose ++a really should be equally as efficient as a += 1 which shouldn't be any less efficient than a += X for any arbitrary X.

If it's preferable to treat this as a generic opportunity to improve the efficiency of all atomic expressions (perhaps along with those discussed on the Wiki: https://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations) that sounds great.
Comment 7 Andrew Macleod 2015-12-15 16:44:42 UTC
(In reply to Richard Henderson from comment #4)
> I think we should rather handle this in the front end than with
> quite complex pattern matching.
> 
> If we want to do any complex logic with atomics in the middle-end,
> we should change their representation so that we don't have to
> struggle with a sequence of builtins.  Which is clearly a subject
> for gcc7 at minimum.

Yes, I think anything more complex than this should be part of an atomics optimization framework using a new set of ATOMIC gimple ops rather than builtins.

For the purpose of this PR we ought to just fix it in the FE.
Comment 8 jsm-csl@polyomino.org.uk 2015-12-15 21:49:05 UTC
I'm fine with making the front end smarter.  Note that if either side of 
the assignment is of floating-point type, you need to keep the existing 
logic; if you're adding to / subtracting from a pointer, you need to 
ensure the multiplication by the size of the pointer target type still 
occurs; and if the arithmetic operation might be sanitized, you probably 
need to keep the existing logic as well (but otherwise, if the 
__atomic_fetch_* operations never have undefined overflow, it should be 
safe to do the operation in the type of the LHS).
Comment 9 Marek Polacek 2015-12-16 09:54:11 UTC
Ok then, let me try that.
Comment 10 Marek Polacek 2015-12-17 16:12:08 UTC
While working on this, I've found something weird, as shown in the following testcase:

int
main (void)
{
  _Atomic _Bool a = 1;
  __builtin_printf ("%d\n", a);
  __atomic_fetch_add (&a, 1, __ATOMIC_SEQ_CST);
  __builtin_printf ("%d\n", a);

  _Atomic _Bool b = 1;
  __builtin_printf ("%d\n", b);
  ++b;
  __builtin_printf ("%d\n", b);

  if (a != b)
    __builtin_abort ();
}
crashes, i.e. given _Atomic _Bool x = 1;
++x == 1, but
__atomic_fetch_add (&x, 1, __ATOMIC_SEQ_CST) == 2.

And because of that my patch to implement this optimization regresses c11-atomic-exec-2.c and c11-atomic-exec-3.c, where we deal with incrementing atomic bool.
Comment 11 Jakub Jelinek 2015-12-17 16:18:49 UTC
IMHO you should use the current code for _Bool atomic increments/decrements, _Bool has just doesn't behave like an integer.
Only normal integers (with the precision the same as mode's precision, properly aligned; perhaps enums too) and pointer types should be optimized.
Comment 12 Marek Polacek 2015-12-17 16:21:51 UTC
Okay, thanks, with other types the optimization seems to work (yay!).
Comment 13 Martin Sebor 2015-12-17 16:29:28 UTC
(In reply to Marek Polacek from comment #10)

C doesn't allow the atomic_fetch_xxx operations to be used with the atomic_bool type (it says they're "not applicable" without spelling what that means, but that will be the subject of a future defect; IMO, it should be a constraint and require an error).  If GCC accepts it, it should be changed to either reject it or at least issue a warning.
Comment 14 Marek Polacek 2015-12-17 16:32:56 UTC
(In reply to Martin Sebor from comment #13)
> (In reply to Marek Polacek from comment #10)
> 
> C doesn't allow the atomic_fetch_xxx operations to be used with the
> atomic_bool type (it says they're "not applicable" without spelling what
> that means, but that will be the subject of a future defect; IMO, it should
> be a constraint and require an error).  If GCC accepts it, it should be
> changed to either reject it or at least issue a warning.

I see, thanks.  We should have a PR then; I'll open one.  Shouldn't be hard to fix, one would think.
Comment 15 jsm-csl@polyomino.org.uk 2015-12-17 16:45:58 UTC
Note that we also have bug 64843 for atomic_fetch_* on pointers (with a 
suggested approach for how stdatomic.h could handle the multiplication by 
the pointer target size).
Comment 16 Marek Polacek 2016-01-04 12:27:40 UTC
Author: mpolacek
Date: Mon Jan  4 12:27:08 2016
New Revision: 232052

URL: https://gcc.gnu.org/viewcvs?rev=232052&root=gcc&view=rev
Log:
	PR c/68908
	* c-typeck.c (build_atomic_assign): Improve commentary.  Add
	optimization to use __atomic_fetch_* built-in if possible.

	* gcc.dg/atomic/c11-atomic-exec-6.c: New test.
	* gcc.dg/atomic/c11-atomic-exec-7.c: New test.
	* gcc.dg/atomic/stdatomic-op-5.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-6.c
    trunk/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-7.c
    trunk/gcc/testsuite/gcc.dg/atomic/stdatomic-op-5.c
Modified:
    trunk/gcc/c/ChangeLog
    trunk/gcc/c/c-typeck.c
    trunk/gcc/testsuite/ChangeLog
Comment 17 Marek Polacek 2016-01-04 12:30:00 UTC
Done for GCC 6.
Comment 18 Jakub Jelinek 2016-02-19 22:21:56 UTC
*** Bug 69878 has been marked as a duplicate of this bug. ***