Bug 51766 - [4.7 regression] sync_fetch_and_xxx atomicity
Summary: [4.7 regression] sync_fetch_and_xxx atomicity
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.7.0
: P1 normal
Target Milestone: 4.7.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-01-05 15:00 UTC by David Edelsohn
Modified: 2012-01-12 20:33 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.6.0, 4.6.1
Known to fail: 4.7.0
Last reconfirmed: 2012-01-05 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Edelsohn 2012-01-05 15:00:35 UTC
During the C++11 memory model conversion, expand_builtin_sync_operation was updated.  __sync_fetch_and_xxx and __sync_xxx_and_fetch now invoke expand_atomic_fetch_op with MEMMODEL_SEQUENTIAL_CST.

  return expand_atomic_fetch_op (target, mem, val, code, MEMMODEL_SEQ_CST,
                                 after);

This has changed the memory model semantics of the builtins and causes the rs6000/POWER architecture to emit a heavy-weight "sync" instruction instead of a light-weight "lwsync" instruction, which is a regression.
Comment 1 Richard Biener 2012-01-09 15:40:10 UTC
The docs of __sync_* say

This builtin is not a full barrier, but rather an @dfn{acquire barrier}.
This means that references after the builtin cannot move to (or be
speculated to) before the builtin, but previous memory stores may not
be globally visible yet, and previous memory loads may not yet be
satisfied.

But it is not exactly clear to which builtins this applies.  Thus, is
the intended behavior actually target depedent?
Comment 2 Jonathan Wakely 2012-01-09 15:45:06 UTC
(In reply to comment #1)
> The docs of __sync_* say
> 
> This builtin is not a full barrier, but rather an @dfn{acquire barrier}.
> This means that references after the builtin cannot move to (or be
> speculated to) before the builtin, but previous memory stores may not
> be globally visible yet, and previous memory loads may not yet be
> satisfied.
> 
> But it is not exactly clear to which builtins this applies.  Thus, is
> the intended behavior actually target depedent?

It refers to __sync_lock_test_and_set only (it says "this builtin" and follows that one)

And "This builtin is not a full barrier, but rather a release barrier." refers to __sync_lock_release.

All the others are full barriers.  It says above them "In most cases, these builtins are considered a full barrier." and only __sync_lock_test_and_set and __sync_lock_release specify different barrier semantics.
Comment 3 David Edelsohn 2012-01-09 16:49:10 UTC
> It says above them "In most cases, these
> builtins are considered a full barrier." and only __sync_lock_test_and_set and
> __sync_lock_release specify different barrier semantics.

The next sentence is:

"That is, no memory operand will be moved across the operation, either forward or
backward."

Note that this refers to memory operands, not memory operations -- memory stores and memory loads referenced in documentation of the other sync builtins.  In other words, one could interpret "full memory barrier" as:

asm volatile ("" : : : "memory");

that refers to a GCC scheduling barrier.

The GCC documentation references Intel processors, which do not have have a distinction between instructions for RELEASE, ACQ_REL and SEQ_CST semantics.

The basic problem is that the GCC builtins and atomic instruction semantics were designed for Intel processors that do not provide the level of granularity implemented in POWER processors.  The POWER port implemented lighter weight ACQ_REL semantics. Retrofitting the original builtins on the new C++11 memory model semantics and imposing SEQ_CST interpretation has changed the behavior and performance on POWER, but not on other targets.
Comment 4 Richard Biener 2012-01-10 09:43:16 UTC
(In reply to comment #3)
> > It says above them "In most cases, these
> > builtins are considered a full barrier." and only __sync_lock_test_and_set and
> > __sync_lock_release specify different barrier semantics.
> 
> The next sentence is:
> 
> "That is, no memory operand will be moved across the operation, either forward
> or
> backward."
> 
> Note that this refers to memory operands, not memory operations -- memory
> stores and memory loads referenced in documentation of the other sync builtins.
>  In other words, one could interpret "full memory barrier" as:
> 
> asm volatile ("" : : : "memory");
> 
> that refers to a GCC scheduling barrier.
> 
> The GCC documentation references Intel processors, which do not have have a
> distinction between instructions for RELEASE, ACQ_REL and SEQ_CST semantics.
> 
> The basic problem is that the GCC builtins and atomic instruction semantics
> were designed for Intel processors that do not provide the level of granularity
> implemented in POWER processors.  The POWER port implemented lighter weight
> ACQ_REL semantics. Retrofitting the original builtins on the new C++11 memory
> model semantics and imposing SEQ_CST interpretation has changed the behavior
> and performance on POWER, but not on other targets.

But for more precise semantics we now have the __atomic_* builtins, right?
And the __sync_* ones are deprecated.  I don't see how we can preserve
old behavior for the __sync_* ones without adding a new target hook.
The documentation would need to be adjusted, of course, I'm not sure
that different atomic semantics are "useful" for these "portable"
synchronization primitives?

Thus, fixing the libstdc++ side seems worthwhile, but I'm not sure
with respect to the deprecated __sync builtins?
Comment 5 David Edelsohn 2012-01-10 14:39:16 UTC
I understand that fixing __sync_* is a hassle.  This is why I opened a separate bug for libstdc++.

While __sync_* is deprecated in favor of __atomic_*, use of __sync_* for portability is fairly pervasive in FOSS applications that need it because of its implementation in GCC.  Most programmers do not know about memory models and do not care about memory models.  And it will take time for programmers to switch to __atomic_*, if they even bother to choose a memory model and don't introduce a bug.

The basic problem is MEMMODEL_SEQ_CST only makes a performance difference for POWER and developers are going to continue to use __sync_* builtins for a while.  This change in default behavior only hurts performance for applications on POWER relative to all other architectures, which sucks. :-(
Comment 6 Richard Biener 2012-01-10 14:48:54 UTC
(In reply to comment #5)
> I understand that fixing __sync_* is a hassle.  This is why I opened a separate
> bug for libstdc++.
> 
> While __sync_* is deprecated in favor of __atomic_*, use of __sync_* for
> portability is fairly pervasive in FOSS applications that need it because of
> its implementation in GCC.  Most programmers do not know about memory models
> and do not care about memory models.  And it will take time for programmers to
> switch to __atomic_*, if they even bother to choose a memory model and don't
> introduce a bug.
> 
> The basic problem is MEMMODEL_SEQ_CST only makes a performance difference for
> POWER and developers are going to continue to use __sync_* builtins for a
> while.  This change in default behavior only hurts performance for applications
> on POWER relative to all other architectures, which sucks. :-(

Yes, I see that.  But my question is - did a developer reading the documentation
get _correct_ code on POWER (which uses a laxer memory model than documented!)
in all cases?  Or can you construct a testcase that works fine on IA64
while surprisingly (after reading docs) does not work on POWER?  Thus, didn't
we simply fix a wrong-code bug (albeit by producing slower code)?
Comment 7 Jonathan Wakely 2012-01-10 15:29:09 UTC
(In reply to comment #3)
> > It says above them "In most cases, these
> > builtins are considered a full barrier." and only __sync_lock_test_and_set and
> > __sync_lock_release specify different barrier semantics.
> 
> The next sentence is:
> 
> "That is, no memory operand will be moved across the operation, either forward
> or
> backward."

It continues:
"Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation."

Doesn't that rule out lwsync, which allows loads to be moved to before the store preceding the lwsync?


Isn't the fact users will continue to use the __sync builtins all the more reason they should work as documented?
Comment 8 David Edelsohn 2012-01-10 18:08:23 UTC
For the way that programmers use __sync_* builtins, release or acquire-release semantics are sufficient.  As we see in libstdc++, release semantics are overly strict when incrementing the reference, as opposed to destroying an object.

Again, there is no cost to Intel specifying sequential consistency as opposed to a slightly weaker memory model.  Intel chose a memory model that matched and benefited their architecture.  IBM should have the freedom to choose memory models that benefit its architectures.
Comment 9 Jonathan Wakely 2012-01-10 18:20:00 UTC
(In reply to comment #8)
> For the way that programmers use __sync_* builtins, release or acquire-release
> semantics are sufficient.

Are you claiming you know how all programmers have used those builtins?

>  As we see in libstdc++, release semantics are overly
> strict when incrementing the reference, as opposed to destroying an object.
> 
> Again, there is no cost to Intel specifying sequential consistency as opposed
> to a slightly weaker memory model.  Intel chose a memory model that matched and
> benefited their architecture.  IBM should have the freedom to choose memory
> models that benefit its architectures.

How or why Intel chose those semantics is not really relevant, the fact is that with the exception of lock_test_and_set and lock_release the __sync builtins are documented to be full barriers, and always have been documented as full barriers, both in GCC's and Intel's docs. If someone uses them in their code relying on the fact they'll get a full barrier, then someone else runs that code on POWER and there's a bug because it isn't a full barrier, who is to blame for the bug?

That said, I don't have any personal interest in what they do on POWER, as long as I'm not asked to deal with any libstdc++ bugs resulting from the builtins not behaving as documented.
Comment 10 Jonathan Wakely 2012-01-10 18:22:01 UTC
(In reply to comment #8)
> IBM should have the freedom to choose memory
> models that benefit its architectures.

I meant to add that the new __atomic builtins give that freedom, by design.

Changing the documented behaviour of the __sync builtins to do something they weren't designed for doesn't seem like a good idea - if they're inefficient then convince people not to use them.
Comment 11 David Edelsohn 2012-01-10 18:33:56 UTC
Jonathan,

I understand that the new __atomic_* builtins provide that flexibility.

I also am pointing out that GCC followed Intel semantics and Intel chose semantics to benefit them.  Why should GCC prefer and impose semantics that advantage a particular architecture or vendor?

You can argue that the __sync_* definition is the playing field and the rule and POWER has to deal with it.  And I am responding that the builtins were implemented appropriate for POWER and now GCC has moved the goal post.
Comment 12 Jakub Jelinek 2012-01-10 18:46:59 UTC
If the PPC implementation of __sync_fetch_and_* and __sync_*_and_fetch was intentionally implemented to violate the documentation, then the documentation should have been adjusted at that point to note that on PPC it behaves differently.  Otherwise it should be just considered a bug in the implementation that it didn't implement the documented semantics.
Comment 13 Andrew Macleod 2012-01-10 19:13:44 UTC
I don't think it matters what the original rationale was for implementing
__sync. They are documented as being seq-cst.

I would argue that power implementing them not as seq-cst is not truly
compliant.  It shouldn't be too difficult to write a test case using __sync
which expects seq-cst as documented that would pass on all targets except
power.

It may be that power-specific programmers have been taking advantage of this,
but doesn't it has to have the potential to break generic code?... (lucky TM only needed relaxed atomics!)

That said... it would be possible to simply have an optional macro with a
default definition of 
  #define __SYNC_LEGACY_MODE   MEMMODEL_SEQ_CST
and if a port like power wants something different, it can define
  #define __SYNC_LEGACY_MODE   MEMMODEL_ACQ_REL

then use that value instead of MEMMODEL_SEQ_CST for legacy __sync routines.

Would that satisfy all your requirements?  shouldn't really need any other
changes then.  Although as Jakub says, it should then be clearly documented.
Comment 14 David Edelsohn 2012-01-10 20:25:57 UTC
Jakub,

The POWER implementation of __sync_* was not written intentionally to violate the documentation.  I don't think those of us who implemented the feature on POWER realized the documentation was trying to require sequential consistency.

Given this clarification, the POWER maintainers need to figure out what implementation is appropriate.

The lwsync implementation was more than sufficient for the common use of atomic ops, like fetch_and_<op>.  That other architectures provide stronger semantics is a bonus.
Comment 15 David Edelsohn 2012-01-12 20:33:57 UTC
It looks like POWER needs to accept and deal with the sequentially consistent semantics now specified for the __sync_* intrinsics, so I am closing this bug as WONTFIX.  But we still need to fix the libstdc++ regression created by this change.