
[Bug target/65697] __atomic memory barriers not strong enough for __sync builtins


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #24 from torvald at gcc dot gnu.org ---
I think we need to at least clarify the documentation of __atomic, probably
also of __sync; we might also have to change the implementation of __sync
builtins on some archs.

First, I think the __atomic builtins should strictly follow the C11 model --
one major reason being that it's not worthwhile for us to support a
GCC-specific memory model, which is complex.  Therefore, I think we should
clarify in the __atomic docs that C++11 is the reference semantics, and that
any additional text we provide (e.g., to describe what MEMMODEL_SEQ_CST does)
is just a hand-wavy attempt at making things look simpler.  I'd choose
C++11 over C11 because, although the two models are supposed to be equivalent,
C++11 has more detailed wording in the area of atomics and synchronization in
general, IMO.  There are also more people involved in C++ standardization than
in C standardization.
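
Just to illustrate the intended correspondence (a hand-wavy sketch of my own,
not proposed docs text):

  /* Each __ATOMIC_* memmodel is meant to have exactly the semantics of the
     corresponding C11/C++11 memory_order_* value.  */
  int data;
  int flag;

  void
  producer (void)
  {
    data = 42;
    /* Same semantics as atomic_store_explicit (&flag, 1, memory_order_release).  */
    __atomic_store_n (&flag, 1, __ATOMIC_RELEASE);
  }

  int
  consumer (void)
  {
    /* Same semantics as atomic_load_explicit (&flag, memory_order_acquire).  */
    if (__atomic_load_n (&flag, __ATOMIC_ACQUIRE) == 1)
      return data;  /* Must read 42 if the acquire load saw the store.  */
    return -1;
  }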

The only exception that I see is that we might want to give more guarantees wrt
data-race-freedom (DRF) as far as compiler reordering is concerned.  I'm
mentioning just the compiler because I'm not aware of GCC supporting C11/C++11
on any archs whose ISA distinguishes between atomic and nonatomic accesses (ie,
where DRF is visible at the ISA level); however, I've heard of upcoming archs
that may do that.  Trying to promise some support for non-DRF programs that
synchronize might make the transition of legacy code to the __atomic builtins
easier, but it's a slippery slope; I don't think it's worthwhile for us to try
to give really precise and hard guarantees, simply because of the complexity
involved.  I'm not quite sure what's best here.
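
For instance (my example, not taken from any existing code), the kind of
non-DRF legacy pattern in question would be something like:

  int data;
  int ready;  /* plain int, not atomic: the accesses below are a data race */

  void
  writer (void)
  {
    data = 42;
    __sync_synchronize ();  /* full barrier */
    ready = 1;              /* plain store */
  }

  int
  reader (void)
  {
    while (!ready)  /* plain load; a conforming compiler may hoist it */
      ;
    __sync_synchronize ();
    return data;
  }

Such code is non-DRF under C11/C++11, yet it relies on the compiler not
moving the plain accesses across the full barriers.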

Second, in C11 there is a difference between a memory access or
read-modify-write op (RMW) that is seq-cst, and a seq-cst fence.  For
reference, try this example in cppmem:

int main() {
  atomic_int x = 0; 
  atomic_int y = 0; 
  atomic_int d1 = 0; 
  atomic_int d2 = 0; 
  {{{ {
        x.store(1, memory_order_relaxed);
        // atomic_thread_fence(memory_order_seq_cst);
        r3=cas_strong(d1, 0, 1);
        r1=y.load(memory_order_relaxed).readsvalue(0);
      }
  ||| {
        y.store(1, memory_order_relaxed);
        // atomic_thread_fence(memory_order_seq_cst);
        r4=cas_strong(d2, 0, 1);
        r2=x.load(memory_order_relaxed).readsvalue(0);
      }
  }}};
  return 0; }

This is Dekker synchronization: it doesn't work (ie, both threads can read
value 0) with seq-cst RMWs to separate locations (d1 and d2), even though
those RMWs are totally ordered; however, it does work when the (commented-out)
seq-cst fences are used instead.  Therefore, for __atomic, it's not only the
MEMMODEL_* value that counts but also which operation it is applied to.
Perhaps we need to point this out in the docs.
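
In terms of the __atomic builtins, that distinction would look roughly like
this (my sketch; thread0 and thread1 run concurrently):

  int x, y, d1, d2;
  int r1, r2;

  void
  thread0 (void)
  {
    __atomic_store_n (&x, 1, __ATOMIC_RELAXED);
    /* With this fence, at least one of r1/r2 must read 1.  Replacing it
       with the commented-out seq-cst CAS on the separate location d1 does
       not give that guarantee: both r1 and r2 can still read 0.  */
    __atomic_thread_fence (__ATOMIC_SEQ_CST);
    /* int e = 0;
       __atomic_compare_exchange_n (&d1, &e, 1, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);  */
    r1 = __atomic_load_n (&y, __ATOMIC_RELAXED);
  }

  void
  thread1 (void)
  {
    __atomic_store_n (&y, 1, __ATOMIC_RELAXED);
    __atomic_thread_fence (__ATOMIC_SEQ_CST);
    /* int e = 0;
       __atomic_compare_exchange_n (&d2, &e, 1, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);  */
    r2 = __atomic_load_n (&x, __ATOMIC_RELAXED);
  }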


However, our current docs for __sync state that most of the __sync RMWs are
"full barriers".  That term has a very TSO-like description, and my guess
would be that most people understand it this way too (ie, as being at least
as strong as a C11 seq-cst fence).  Also, Andrew mentioned offline that prior
documentation seems to have described the __sync RMWs as
  __sync_synchronize();
  operation...
  __sync_synchronize();

For x86 and similar, a C11 seq-cst RMW will be a full barrier; Jakub says the
same holds for PowerPC currently.  AFAIU, on ARM it doesn't (though it kind of
surprises me that HW would reorder into an LL/SC region; my guess would be
that this complicates matters, and I can't see a benefit).

So, at least on ARM, there seems to be a difference between a __sync RMW,
which is documented to be at least as strong as a C11 seq-cst fence, and how
we implement a C11 seq-cst fence.  This difference is observable by a program,
but observing it requires a non-DRF program (and there are no relaxed-MO
__sync variants, so programs that used __sync may well have relied on non-DRF
accesses).

I would guess that programs containing synchronization code actually affected
by this difference are rather rare.  Normal locks and such are not affected,
nor are many common synchronization algorithms such as those for linked lists
(eg, algorithms that work fine with acq_rel MO ops).  I'd say that the full
fences are really necessary mostly if one wants to synchronize through
separate locations without a global common order for those accesses (eg,
using several memory locations and asymmetric access patterns to tune
contention favorably for the common cases; Dekker-like schemes made
asymmetric are an example, and libitm uses that).  I guess it's much more
common to "route" the synchronization through centralized or hierarchical
locations so that eventually there is a tie-breaker.  For example, AFAIR
there's no use of a seq-cst fence in current glibc.

Thus, we need to at least improve the __sync docs.  If we want to really make
the __sync builtins C11-like, we need to update the docs, and legacy code that
assumed the stronger semantics might break.  If we don't, we need to update
the implementation of the __sync RMWs.  I don't know what would be the easiest
way to do that:
1) We could put __sync_synchronize around them on all archs and just not care
about the performance overhead (thinking that we want to deprecate them
anyway); see the rough sketch after this list.  But unless the backends
optimize away unnecessary sync ops (eg, on x86, remove an mfence adjacent to
a lock'd RMW), there could be a performance degradation for __sync-using code.
2) We could make __atomic RMWs with MEMMODEL_SEQ_CST stronger than required by
C11.  This could result in other C11-conforming compilers producing more
efficient synchronization code than we do, and is thus not a good option IMO.
3) We could do something just on ARM (and scan other archs for similar
issues).  That's perhaps the cleanest option.
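
To illustrate option 1 at the source level (a rough sketch of mine; the real
change would of course be in how the __sync RMWs are expanded, and the
function name is made up):

  /* Make a __sync-style RMW at least as strong as a C11 seq-cst fence by
     surrounding the underlying RMW with full barriers.  */
  int
  sync_fetch_and_add_strong (int *p, int v)
  {
    __sync_synchronize ();  /* full barrier before the RMW */
    int old = __atomic_fetch_add (p, v, __ATOMIC_SEQ_CST);
    __sync_synchronize ();  /* full barrier after the RMW */
    return old;
  }

On x86, the mfence/lock'd-RMW/mfence sequence this would produce is exactly
the kind of redundancy mentioned above.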

Thoughts?

