This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Call for compiler help/advice: atomic builtins for v3


Richard Henderson wrote:

> To keep all this in perspective, folks should remember that atomic
> operations are *slow*.  Very very slow.  Orders of magnitude slower
> than function calls.  Seriously.  Taking p4 as the extreme example,
> one can expect a null function call in around 10 cycles, but a locked
> memory operation to take 1000.  Usually things aren't that bad, but
> I believe some poor design decisions were made for p4 here.  But even
> on a platform without such problems you can expect a factor of 30
> difference.

Apologies in advance if the following is not relevant...


Even on a P4, inlining may enable compiler optimizations. One case is when the compiler can see that the return value of __sync_fetch_and_or (for instance) isn't used. It can then emit a single wait-free "lock or" instead of a "lock cmpxchg" loop (MSVC 8 does this for _InterlockedOr).
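
To make the distinction concrete, here is a minimal sketch of the two call patterns. The variable and function names (flags, set_flag, set_flag_and_get_old) are made up for illustration, and the instruction sequences mentioned in the comments are what an x86 compiler could plausibly emit, not a guarantee about any particular GCC version.

```c
/* Hypothetical shared flag word, for illustration only. */
static unsigned flags = 0;

/* Result discarded: once the builtin is inlined, the compiler can see
   that the old value is unused and may emit a single "lock or". */
static void set_flag(unsigned mask)
{
    (void)__sync_fetch_and_or(&flags, mask);
}

/* Result used: the old value must be returned, and x86 has no
   instruction that ORs and fetches, so a "lock cmpxchg" retry loop
   is the likely code. */
static unsigned set_flag_and_get_old(unsigned mask)
{
    return __sync_fetch_and_or(&flags, mask);
}
```

If the builtin is an opaque library call, the compiler cannot tell these two cases apart and has to generate the general form for both.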

Another case is when inlining results in a sequence of K adjacent __sync_fetch_and_add( &x, 1 ) operations. These can legally be replaced with a single __sync_fetch_and_add( &x, K ).
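
As a sketch of that transformation with K = 3 (the names x, bump_three_times and bump_by_three are hypothetical, and whether a given compiler actually performs the merge is a separate question):

```c
/* Hypothetical shared counter, for illustration only. */
static int x = 0;

/* What the code might look like after inlining: three adjacent
   atomic increments of the same object. */
static void bump_three_times(void)
{
    __sync_fetch_and_add(&x, 1);
    __sync_fetch_and_add(&x, 1);
    __sync_fetch_and_add(&x, 1);
}

/* The merged form: one atomic operation with the same final effect
   on x. Other threads simply never observe the two intermediate
   values, which they were not guaranteed to observe anyway. */
static void bump_by_three(void)
{
    __sync_fetch_and_add(&x, 3);
}
```

The merged version pays for one locked operation instead of three, which matters given the per-operation costs quoted above.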

Currently the __sync_* intrinsics seem to be fully fenced (each one acts as a full barrier), but if acquire/release/unordered variants are added, other platforms may also suffer from lack of inlining. On a PowerPC an unordered atomic increment is pretty much the same speed as an ordinary increment (when there is no contention).
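
For illustration, here is roughly what the contrast looks like in source. This sketch assumes the __atomic builtins that later GCC releases added, which take an explicit memory-order argument; they did not exist when this thread was written, and the names counter, ordered_increment and relaxed_increment are hypothetical.

```c
/* Hypothetical shared counter, for illustration only. */
static long counter = 0;

/* Fully ordered increment: the __sync_* builtin acts as a full
   barrier, so the compiler must emit whatever fencing the target
   requires around the atomic update. */
static long ordered_increment(void)
{
    return __sync_fetch_and_add(&counter, 1);
}

/* Unordered ("relaxed") increment: only the update itself has to be
   atomic, so no extra barrier instructions are required. */
static long relaxed_increment(void)
{
    return __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}
```

The cheaper the operation itself becomes, the larger the relative cost of an out-of-line function call wrapped around it.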

