GCC 6.0 for powerpc64le emits the following inefficient code for an atomic increment.  (The bogus warning is the subject of bug 68907.)

$ cat ~/tmp/t.c && /build/gcc-trunk/gcc/xgcc -B /build/gcc-trunk/gcc -O2 -S -Wall -o/dev/stdout ~/tmp/t.c
typedef _Atomic int AI;

void increment (AI *ai)
{
    ++*ai;
}

	.file	"t.c"
	.machine power8
	.abiversion 2
	.section	".toc","aw"
	.section	".text"
/home/remote/msebor/tmp/t.c: In function ‘increment’:
/home/remote/msebor/tmp/t.c:5:5: warning: right-hand operand of comma expression has no effect [-Wunused-value]
     ++*ai;
     ^~~~~
	.align 2
	.p2align 4,,15
	.globl increment
	.type	increment, @function
increment:
	sync
	lwz 9,0(3)
	cmpw 7,9,9
	bne- 7,$+4
	isync
	extsw 9,9
	stw 9,-16(1)
	.p2align 4,,15
.L6:
	lwz 8,-16(1)
	addi 9,9,1
	sync
.L2:
	lwarx 10,0,3
	cmpw 0,10,8
	bne- 0,.L3
	stwcx. 9,0,3
	bne- 0,.L2
.L3:
	isync
	mfcr 9,128
	rlwinm 9,9,3,1
	beq 0,.L4
	stw 10,-16(1)
.L4:
	andi. 10,9,0x1
	bnelr- 0
	lwa 9,-16(1)
	b .L6
	.long 0
	.byte 0,0,0,0,0,0,0,0
	.size	increment,.-increment
	.ident	"GCC: (GNU) 6.0.0 20151214 (experimental)"
	.section	.note.GNU-stack,"",@progbits

Clang 3.8 emits the expected:

increment:                              # @increment
.Lfunc_begin0:
# BB#0:                                 # %entry
	li 4, 1
	sync
.LBB0_1:                                # %entry
                                        # =>This Inner Loop Header: Depth=1
	lwarx 5, 0, 3
	add 5, 4, 5
	stwcx. 5, 0, 3
	bne 0, .LBB0_1
# BB#2:                                 # %entry
	lwsync
	blr
Interestingly, it does emit code as efficient as Clang's for __atomic_fetch_add:

void fetch_and_add (AI *ai)
{
    __atomic_fetch_add (ai, 1, __ATOMIC_SEQ_CST);
}

i.e.,

fetch_and_add:
	sync
.L2:
	lwarx 9,0,3
	addi 9,9,1
	stwcx. 9,0,3
	bne- 0,.L2
	isync
	blr
Doesn't seem to be ppc64le specific in any way, and doesn't affect just preincrement.  Try:

typedef _Atomic int AI;
AI i;

void fn1 (AI *ai) { ++*ai; }
void fn2 (AI *ai) { (*ai)++; }
void fn3 (AI *ai) { *ai += 6; }
void fn4 (void) { ++i; }
void fn5 (void) { i++; }
void fn6 (void) { i += 2; }

and you'll see even on x86_64-linux that all the sequences use the generic CAS instructions instead of __atomic_fetch_add etc.  The comment above build_atomic_assign even says this: "Also note that the compiler is simply issuing the generic form of the atomic operations."

So, the question is, should we add smarts to the FE to optimize the cases already when emitting them (this would be similar to what omp-low.c does when expanding #pragma omp atomic, see:

  /* When possible, use specialized atomic update functions.  */
  if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
      && store_bb == single_succ (load_bb)
      && expand_omp_atomic_fetch_op (load_bb, addr,
				     loaded_val, stored_val, index))
    return;

), or should we add some pattern matching in some pass that would try to detect these rather complicated patterns like:

  <bb 2>:
  _5 = __atomic_load_4 (ai_3(D), 5);
  _6 = (int) _5;
  D.1768 = _6;

  <bb 3>:
  # prephitmp_17 = PHI <_6(2), pretmp_16(4)>
  _9 = prephitmp_17 + 1;
  _10 = (unsigned int) _9;
  _12 = __atomic_compare_exchange_4 (ai_3(D), &D.1768, _10, 0, 5, 5);
  if (_12 != 0)
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:
  pretmp_16 = D.1768;
  goto <bb 3>;

(with the casts in there optional) and convert those to the more efficient __atomic_* calls if possible?  Note one issue is that the pattern involves non-SSA loads/stores (the D.1768 var above) and we'd need to prove that the var is used only in those two places and nowhere else.
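To make the shape of that gimple concrete, here is a minimal hand-written C sketch (not actual FE output) of the generic load/compare-exchange loop that build_atomic_assign currently emits for ++*ai:

/* Sketch of the generic form: load once, then retry a
   compare-exchange until no other thread intervened.  */
void increment_generic (_Atomic int *ai)
{
  int old = __atomic_load_n (ai, __ATOMIC_SEQ_CST);
  int new;
  do
    new = old + 1;
  while (!__atomic_compare_exchange_n (ai, &old, new,
                                       /*weak=*/0,
                                       __ATOMIC_SEQ_CST,
                                       __ATOMIC_SEQ_CST));
}

The D.1768 variable in the dump plays the role of old here: __atomic_compare_exchange_4 writes the observed value back into it on failure, which is exactly the non-SSA store that makes the pattern awkward to match.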
Inviting Richard and Andrew to comment.  I'm afraid that the pattern-matching approach would be quite fragile, but as Jakub pointed out elsewhere, it'd have the advantage that even user-written code with manual atomic_load and compare_and_swap could be optimized into the fetch_and_* patterns.
I think we should rather handle this in the front end than with quite complex pattern matching.

If we want to do any complex logic with atomics in the middle-end, we should change their representation so that we don't have to struggle with a sequence of builtins.  Which is clearly a subject for gcc7 at minimum.
Thanks, moving to C FE component then.
(In reply to Jakub Jelinek from comment #2)
> Doesn't seem to be ppc64le specific in any way, and doesn't affect just
> preincrement.

The inefficiency I was pointing out was the redundant syncs above the loop on powerpc64.  The x86_64 assembly looks fairly efficient both ways.

I also intentionally focused the bug on the increment expression and didn't mention others like compound assignment because I expected the former to be more common.  But I suppose ++a really should be just as efficient as a += 1, which in turn shouldn't be any less efficient than a += X for arbitrary X.

If it's preferable to treat this as a generic opportunity to improve the efficiency of all atomic expressions (perhaps along with those discussed on the Wiki: https://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations), that sounds great.
(In reply to Richard Henderson from comment #4)
> I think we should rather handle this in the front end than with
> quite complex pattern matching.
>
> If we want to do any complex logic with atomics in the middle-end,
> we should change their representation so that we don't have to
> struggle with a sequence of builtins.  Which is clearly a subject
> for gcc7 at minimum.

Yes, I think anything more complex than this should be part of an atomics optimization framework using a new set of ATOMIC gimple ops rather than builtins.  For the purpose of this PR we ought to just fix it in the FE.
I'm fine with making the front end smarter. Note that if either side of the assignment is of floating-point type, you need to keep the existing logic; if you're adding to / subtracting from a pointer, you need to ensure the multiplication by the size of the pointer target type still occurs; and if the arithmetic operation might be sanitized, you probably need to keep the existing logic as well (but otherwise, if the __atomic_fetch_* operations never have undefined overflow, it should be safe to do the operation in the type of the LHS).
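As a minimal sketch of the straightforward safe case (plain integer LHS, sequentially consistent ordering, no sanitization), the compound assignment can map directly onto the builtin, whose arithmetic wraps rather than overflowing:

/* Sketch of the intended lowering, not the actual c-typeck.c code:
   *ai += v becomes a single fetch-and-add instead of the generic
   load/compare-exchange loop.  */
void add_int (_Atomic int *ai, int v)
{
  __atomic_fetch_add (ai, v, __ATOMIC_SEQ_CST);
}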
Ok then, let me try that.
While working on this, I've found something weird, as shown in the following testcase:

int
main (void)
{
  _Atomic _Bool a = 1;
  __builtin_printf ("%d\n", a);
  __atomic_fetch_add (&a, 1, __ATOMIC_SEQ_CST);
  __builtin_printf ("%d\n", a);

  _Atomic _Bool b = 1;
  __builtin_printf ("%d\n", b);
  ++b;
  __builtin_printf ("%d\n", b);

  if (a != b)
    __builtin_abort ();
}

This aborts; i.e., given _Atomic _Bool x = 1;, after ++x we have x == 1, but after __atomic_fetch_add (&x, 1, __ATOMIC_SEQ_CST) we have x == 2.  And because of that, my patch to implement this optimization regresses c11-atomic-exec-2.c and c11-atomic-exec-3.c, where we deal with incrementing an atomic _Bool.
IMHO you should use the current code for _Bool atomic increments/decrements; _Bool just doesn't behave like an integer.  Only normal integers (with precision the same as the mode's precision, properly aligned; perhaps enums too) and pointer types should be optimized.
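A small standalone illustration (not from the testsuite) of why _Bool misbehaves: C defines ++b as b = b + 1 with a conversion back to _Bool, which saturates at 1, whereas a fetch-and-add does a plain add on the object representation:

#include <stdio.h>

int
main (void)
{
  _Bool b = 1;
  ++b;                 /* b = (_Bool) (b + 1); any nonzero converts to 1 */
  printf ("%d\n", b);  /* prints 1 */

  unsigned char c = 1;
  ++c;                 /* a plain byte add, as __atomic_fetch_add does */
  printf ("%d\n", c);  /* prints 2 */
  return 0;
}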
Okay, thanks, with other types the optimization seems to work (yay!).
(In reply to Marek Polacek from comment #10)

C doesn't allow the atomic_fetch_xxx operations to be used with the atomic_bool type (it says they're "not applicable" without spelling out what that means, but that will be the subject of a future defect; IMO, it should be a constraint and require an error).  If GCC accepts it, it should be changed to either reject it or at least issue a warning.
(In reply to Martin Sebor from comment #13)
> (In reply to Marek Polacek from comment #10)
>
> C doesn't allow the atomic_fetch_xxx operations to be used with the
> atomic_bool type (it says they're "not applicable" without spelling out
> what that means, but that will be the subject of a future defect; IMO, it
> should be a constraint and require an error).  If GCC accepts it, it
> should be changed to either reject it or at least issue a warning.

I see, thanks.  We should have a PR then; I'll open one.  Shouldn't be hard to fix, one would think.
Note that we also have bug 64843 for atomic_fetch_* on pointers (with a suggested approach for how stdatomic.h could handle the multiplication by the pointer target size).
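For illustration, a hedged sketch of the scaling involved (advance_by is a hypothetical helper, and this assumes GCC's __atomic_fetch_add on a pointer object adds the raw value in bytes, which is the behavior bug 64843 is about):

#include <stddef.h>

/* Sketch: to give atomic *pp += n the C11 pointer-arithmetic
   semantics, the addend must be scaled by the target size before
   being handed to the builtin.  */
int *
advance_by (int * _Atomic *pp, ptrdiff_t n)
{
  return __atomic_fetch_add (pp, n * (ptrdiff_t) sizeof (int),
                             __ATOMIC_SEQ_CST);
}

This is the multiplication the front end (or stdatomic.h, per the suggestion in bug 64843) has to insert on its own.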
Author: mpolacek
Date: Mon Jan  4 12:27:08 2016
New Revision: 232052

URL: https://gcc.gnu.org/viewcvs?rev=232052&root=gcc&view=rev
Log:
	PR c/68908
	* c-typeck.c (build_atomic_assign): Improve commentary.  Add
	optimization to use __atomic_fetch_* built-in if possible.

	* gcc.dg/atomic/c11-atomic-exec-6.c: New test.
	* gcc.dg/atomic/c11-atomic-exec-7.c: New test.
	* gcc.dg/atomic/stdatomic-op-5.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-6.c
    trunk/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-7.c
    trunk/gcc/testsuite/gcc.dg/atomic/stdatomic-op-5.c
Modified:
    trunk/gcc/c/ChangeLog
    trunk/gcc/c/c-typeck.c
    trunk/gcc/testsuite/ChangeLog
Done for GCC 6.
*** Bug 69878 has been marked as a duplicate of this bug. ***