Atomic Optimization Issues

There are a number of optimization issues regarding atomics that remain unclear. This page will attempt to sort out the issues and describe how GCC will approach the subject.

There are a number of subjects to cover:

Impact on existing optimizations

How does the existence of atomic operations affect the compiler's ability to optimize code?

First, there is no impact on thread-local optimizations. The impact is limited to optimizations on shared memory which may be visible to another thread.
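For instance (pseudocode in the style used below, with a hypothetical local 't'), a purely thread-local value can still be moved or folded freely across any atomic operation:

int t = 0;               // thread-local: no other thread can observe 't'
a.store (10, release);
t = t + 1;               // freely movable or foldable across the atomic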

All atomic operations are handled as __sync built-in operations, which the optimizers currently treat as function calls with unknown side effects. This causes them to behave as full barriers to any kind of code motion on shared memory. This conservative approach results in correct code, but is not always ideal. To improve the situation, at least two things are required: a precise determination of what code motion is valid around each memory order, and changes to the optimizers to exploit it. Both are covered in the sections that follow.
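To illustrate the current conservative treatment (pseudocode, hypothetical shared variables), even a relaxed store pins the shared stores around it today, although the sections below show this is unnecessary:

x = 1;                   // today: may not sink below the built-in call
a.store (10, relaxed);   // expanded as a call with unknown side effects
y = 1;                   // today: may not hoist above the built-in call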

What's valid

RELAXED

Relaxed mode operation has no synchronization side effects. This would indicate it could be treated like a normal shared memory access, but there appears to be a wrinkle. It is required to retain its ordering within a thread, and if I read the standard correctly [n3242.1.10.19]:

The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations.

This would seem to indicate that you cannot perform common subexpression elimination on two relaxed atomic loads, even if they are the only atomic operations present.

Well, that would appear to rule out treating them exactly like regular variables. It does not affect other shared memory optimizations, however: a relaxed operation is not a barrier to regular code motion in either direction, so existing optimizations can treat it as having no side effects.
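In the same pseudocode style, the CSE restriction looks like this (hypothetical atomic 'x'):

r1 = x.load (relaxed);
r2 = x.load (relaxed);   // under the reading above, folding this into
                         // 'r2 = r1' would be invalid, even though no
                         // other synchronization is present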

RELEASE and ACQUIRE

When a thread executes an acquire operation, it synchronizes with some store operation in another thread, and all shared memory stores before that store must be visible.

Take for example:

     Thread 1                            Thread 2
y = 0;
x = 0;                              while (a.load (acquire) != 10)
a.store (10, release);                ;
y = 1;                              r1 = x;     // x must == 0
while (b.load (acquire) != 20)      b.store (20, release);
  ;                                 while (c.load (acquire) != 30)
x = 1;                                ;
c.store (30, release);              r2 = y;     // y must == 1

This fairly simple example provides a clue to what is possible to optimize and what isn't.

In Thread 1, 'y = 0' and 'x = 0' must be executed before the a.store in order to be properly visible in other threads. This indicates that a release is a sink barrier: code may not be moved from above it to below. Any non-atomic code which comes after a release has no guarantees, so 'y = 1' could be hoisted above the release.

In the same vein, 'r1 = x' in Thread 2 depends on the value of 'x' set in another thread. It cannot be executed before the acquire, indicating that an acquire is a hoist barrier: code may not be moved from below it to above. Any non-atomic code which comes before the load has no guarantees and could potentially be moved after it.

These traits would prevent any of the assignments to 'x', 'r1', or 'r2' from being moved.

There is some flexibility with 'y = 1'. The only guarantee in thread 1 is that it happens-before c.store. 'y = 1' could be moved above a.store since the release is a barrier only to sinking, or it could be moved below b.load since an acquire is only a barrier to hoisting code.

Some possible Thread 1 transformations:

    Thread 1a - original             Thread 1b - hoist y               Thread 1c - sink y
y = 0;                           y = 0; // Now a dead store!       y = 0;
x = 0;                           x = 0;                            x = 0;
a.store (10, release);           y = 1; // hoisted                 a.store (10, release);
y = 1;                           a.store (10, release);            while (b.load (acquire) != 20)
while (b.load (acquire) != 20)   while (b.load (acquire) != 20)      ;
  ;                                ;                               x = 1;
x = 1;                           x = 1;                            y = 1; // sunk
c.store (30, release);           c.store (30, release);            c.store (30, release);

If some other thread synchronizes with a.store, and then proceeds to use the value of 'y':

     Thread 3
while (a.load (acquire) != 10)
  ;
r3 = y;

A data race on 'y' has now been created, but put that aside for the moment. The synchronization guarantees that 'y = 0' will be seen, but Thread 1 may have moved on and executed 'y = 1' already. 'y' could legally have an expected value of either 0 or 1. If an optimization had decided to legally move 'y = 1' above the a.store (Thread 1b), Thread 3 would still always see a valid value for 'y'; it just happens to always be 1.

Alternatively, if 'y = 1' were to be moved past the b.load (Thread 1c), there is still no impact on Thread 3's results. 'y' will always at least have the guaranteed value of 0, and if Thread 1 happens to have synchronized with some other thread on b.store, it's possible that 'y = 1' has already been executed, which still leaves the outcome of Thread 3 correct.

Also note that the original Thread 2 will always behave as expected with all three versions of Thread 1.

ACQ_REL

Acquire-release atomic operations logically behave as both modes simultaneously, and therefore act as a barrier to any kind of movement in either direction.
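For instance (pseudocode, hypothetical variables), no shared memory access may move in either direction across an acquire-release read-modify-write:

x = 1;                          // may not sink below the exchange
r1 = a.exchange (10, acq_rel);  // barrier to both hoisting and sinking
y = 1;                          // may not hoist above the exchange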

SEQ_CST

Seq-cst is also a barrier to movement in both directions. The difference is primarily in the type of hardware synchronizations which are issued. Seq-cst is required to enforce a system-wide consistency between all threads, while release and acquire operations only guarantee synchronization directly between two threads.
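A classic illustration of the difference (Dekker-style, with hypothetical atomics 'x' and 'y' initially 0):

     Thread 1                            Thread 2
x.store (1, seq_cst);               y.store (1, seq_cst);
r1 = y.load (seq_cst);              r2 = x.load (seq_cst);

The single total order required by seq-cst forbids the outcome r1 == 0 && r2 == 0; with release stores and acquire loads alone, that outcome would be permitted.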

CONSUME

Consume is a special case of relaxing hardware synchronization for an acquire operation. It will be treated as an acquire as far as the optimizers are concerned for the moment.

The fine tuning of what a consume does is covered in a separate document; in short, it simply prevents dependency trees from being broken. So for example:

r1 = x.load(memory_order_relaxed);
r3 = &a + r1 - r1;     // address computation depends on the value of r1
r2 = *r3;

With a relaxed load, r3 can be optimized to '&a', and then there is no dependency between r1 and r3 any more. With memory_order_consume, however, the compiler must maintain the dependency, which will continue to prevent r3 from being hoisted above the atomic load. If the optimization is performed anyway, an acquire fence would have to be inserted to prevent the code motion, i.e.:

r1 = x.load(memory_order_consume);
mem_thread_fence(memory_order_acquire);   // fence substitutes for the broken dependency
r3 = &a;
r2 = *r3;
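For reference, the canonical use of consume is publishing a pointer, where the dereference itself carries the dependency (hypothetical 'head' pointer published by another thread):

p = head.load (consume);
r2 = p->value;           // address depends on 'p'; may not be hoisted
                         // above the load, with no extra fence needed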

Optimizer Changes

Most optimizations will likely have to be examined one at a time to see how best they fit this barrier model, but a pattern may emerge along the way. It may be possible to simply teach the alias oracle about acquire and release barriers, but the odds are that will be tricky, since there is no indication of direction in the oracle, only may-use or may-set attributes for function calls and builtins.

It may be that we need some additional enabling optimizations. If these aren't automatically handled elsewhere, it should be possible to identify and perform them during a future atomic optimization phase. Again, this will be developed as the requirements are fleshed out.

A further avenue of exploration is the potential to provide diagnostics on data races when the optimizations encounter certain conditions. For instance, if hoisting a store above a release barrier enables a dead store removal, that indicates a potential data race. These will be identified along the way as well.

Dead Store Elimination

DSE typically performs a reverse walk of the control flow graph, finds two consecutive stores to the same location with no intervening uses, and eliminates the first one. This has the same effect as sinking the first store to the second one and then eliminating it. This indicates that DSE can look backwards through one or more acquire barriers, since they allow code to be moved from above them to below.

Ignoring the data race on 'x' for the moment, consider:

a.store (10, release);
x = 1;
while (b.load (acquire) != 20)
  ;
x = 2;
c.store (30, release);
x = 3;

Starting at the bottom with 'x = 3', DSE would begin a reverse walk through the function. Since stores can't be sunk below the release (c.store), 'x = 2' must remain. DSE then considers 'x = 2': looking back through the acquire, 'x = 1' would be considered a dead store and could be eliminated.

Notice that if 'x = 3' were hoisted above 'c.store' (which would be legal), it would enable DSE to see that 'x = 2' is a dead store as well. Also notice that this would signify a potential data race on 'x' and the possibility of issuing a diagnostic.
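Applying both observations, the sequence could legally be reduced to:

a.store (10, release);
while (b.load (acquire) != 20)
  ;
x = 3;                   // hoisted above c.store; 'x = 1' and 'x = 2' eliminated
c.store (30, release);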

It looks like it should be easy to teach this directly to DSE. It already processes built-ins and simply assumes they are all barriers. It should be simple enough to make it look for the presence of a memory order parameter and allow it to look through relaxed or acquire operations.
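A rough sketch of the check, using GCC's memory model enumeration but otherwise hypothetical (not actual GCC internals):

/* Hypothetical sketch: may DSE's reverse walk scan past an atomic
   operation with the given memory model when looking for an earlier,
   now-dead store?  */
static bool
dse_may_scan_past_atomic (enum memmodel model)
{
  switch (model)
    {
    case MEMMODEL_RELAXED:   /* No ordering side effects.  */
    case MEMMODEL_CONSUME:   /* Treated as acquire for now.  */
    case MEMMODEL_ACQUIRE:   /* Hoist barrier only: sinking the earlier
                                store past it is legal.  */
      return true;
    default:                 /* release, acq_rel, and seq_cst act as sink
                                barriers; the earlier store must stay.  */
      return false;
    }
}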

All the rest

A quick rundown of all the optimizations GCC is considering is located here: Optimization Details

Optimizations On Atomic Operations

At the moment we do nothing about optimizing the atomic operations themselves.

It is unlikely that they can be integrated into existing optimizations directly, so one possibility is to create an Optimize Atomics pass which would be called one or more times and look for opportunities.

If enough optimization opportunities eventually present themselves, it may be possible to instead create smaller specialized passes, such as Atomic_DSE or Atomic_DCE, and simply call them from the pass manager at the end of the existing DSE and DCE passes when atomics are present.

See also Optimization Details for more.

Improving Code Generation

The first thing a port can do is provide specialized routines for the built-ins which efficiently utilize whatever synchronization instructions the architecture provides.

This will result in much better code than the default, which wraps every operation in __sync_synchronize() calls, but there will still often be room for improvement. Consider:

a.load(seq_cst);
b.store(1, seq_cst);  
c.load(seq_cst);

Every architecture represents these synchronizations differently, making it almost impossible to handle this in a machine-independent way. On PowerPC, this may result in a code sequence such as:

hwsync; ld; cmp; bc; isync       // a.load(seq_cst);
hwsync; st                       // b.store(1, seq_cst);

A peephole optimizer could look for common sequences such as cmp; bc; isync; hwsync; recognize that the hwsync accomplishes everything that is required, and rewrite the code as:

hwsync; ld;                      // a.load(seq_cst);
hwsync; st                       // b.store(1, seq_cst);

There is also room for improvement in the default case. When atomics are wrapped in __sync_synchronize(), too many barriers are issued for back-to-back statements, i.e.:

__sync_synchronize();  a.load();  __sync_synchronize();
__sync_synchronize();  b.store(1);    __sync_synchronize();
__sync_synchronize();  c.load();  __sync_synchronize();

can be reduced to:

__sync_synchronize();  a.load();
__sync_synchronize();  b.store(1);
__sync_synchronize();  c.load();  __sync_synchronize();

I considered the possibility of trying to expose the release, acquire, and consume synchronizations as separate opcodes or built-ins and such, but it's far too complicated. There is no consistency between architectures, and there will still be peephole opportunities. If the use of atomics ever becomes so prolific that they are everywhere, and it's demonstrable that we might gain something from it, then we'll look into it again :-)
