Codegen

Current Status

Most memory models consist of up to four modes: relaxed, consume, release/acquire, and sequentially consistent. The C++11 memory model also splits release/acquire into its separate components for finer-grained control, so there is a release mode for stores and an acquire mode for loads.
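As an illustration of the release/acquire split, here is a minimal sketch of one thread publishing data with a release store and another picking it up with an acquire load. The names payload, ready, producer, consumer and run_message_passing are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
static int payload = 0;                 // ordinary (non-atomic) data
static std::atomic<bool> ready{false};  // flag that publishes the data

void producer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // release mode: stores
}

int consumer() {
    // acquire mode: loads; pairs with the release store in producer()
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // the write to payload is visible once ready is seen
}

// Runs one producer and one consumer and returns what the consumer read.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;
}
```

Weakening either side to relaxed would remove the guarantee that the consumer sees the write to payload.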

Not every architecture provides all the facilities needed to implement these memory models. We have a base implementation which, in the presence of just a test-and-swap primitive, can implement all the atomic operations correctly, although inefficiently.
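A rough sketch of how such a base implementation could work, assuming the single primitive behaves like an atomic test-and-set (modelled here with std::atomic_flag). The class locked_int and the helper count_with_threads are hypothetical and are not taken from the libstdc++ headers.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical lock-based fallback: every operation takes a spin lock built
// from one test-and-set flag, so it is correct but not lock-free.
class locked_int {
    int value_;
    std::atomic_flag guard_ = ATOMIC_FLAG_INIT;

    void lock()   { while (guard_.test_and_set(std::memory_order_acquire)) { } }
    void unlock() { guard_.clear(std::memory_order_release); }

public:
    explicit locked_int(int v = 0) : value_(v) {}

    int load() { lock(); int v = value_; unlock(); return v; }
    void store(int v) { lock(); value_ = v; unlock(); }
    int fetch_add(int d) {
        lock();
        int old = value_;
        value_ = old + d;
        unlock();
        return old;
    }
};

// Several threads increment one shared counter; returns the final value.
int count_with_threads(int threads, int iters) {
    locked_int counter(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i) counter.fetch_add(1);
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Every operation pays for a lock acquire and release, which is the inefficiency mentioned above.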

When hardware provides additional instructions which can provide lock-free primitives, a more advanced implementation (include/bits/atomic_2.h) is used which makes use of a number of __sync builtins which can be provided by the port. These consist of:

This is where the implementation stands today. Currently the C++11 load, store, and compare_exchange operations are built in templates using __sync_synchronize when required and then the basic instruction. This is suboptimal when the architecture supports instructions which more closely map to the individual memory model modes for each operation.
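To make the fence-plus-instruction style concrete, here is a simplified sketch and not the actual header code. The functions fenced_load and fenced_store are hypothetical, and a relaxed __atomic_load_n / __atomic_store_n (builtins available in GCC 4.7 and later, and in Clang) stands in for "the basic instruction".

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helpers: the memory order only decides where a full
// __sync_synchronize barrier is placed around the plain access.
template <typename T>
T fenced_load(const T* p, std::memory_order m) {
    if (m == std::memory_order_seq_cst) __sync_synchronize();
    T v = __atomic_load_n(p, __ATOMIC_RELAXED);   // the basic instruction
    if (m != std::memory_order_relaxed) __sync_synchronize();
    return v;
}

template <typename T>
void fenced_store(T* p, T v, std::memory_order m) {
    if (m != std::memory_order_relaxed) __sync_synchronize();
    __atomic_store_n(p, v, __ATOMIC_RELAXED);     // the basic instruction
    if (m == std::memory_order_seq_cst) __sync_synchronize();
}
```

In this style an acquire load always costs a full barrier, even on hardware that has a cheaper instruction sequence for acquire.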

For instance, even though these loads are both memory acquire operations, the optimal hardware sequence on a PowerPC for

are different, with significant performance impacts. Currently there is no decent way to resolve this.

Resolving The Issue

The new approach will be to provide a builtin __sync routine for each atomic operation that takes a memory order as a parameter. These builtins will then later expand into a default set of instructions unless the machine description for a given architecture provides a better option.

The current definition of libstdc++'s bits/atomic_2.h include file will need to be changed. It is currently incorrect anyway (in a few cases), but most importantly it does not allow specialization. Most of the methods will be changed to a pattern similar to:

  assert (memory_order != disallowed modes)
  emit_builtin_operation (memory_order)
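A minimal sketch of what this pattern might look like for load and store. The wrapper type sketch_atomic_int and the helper to_builtin_order are hypothetical. __atomic_store_n and __atomic_load_n are the builtins that GCC 4.7 and later (and Clang) provide, used here only to show the shape of the pattern.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: map std::memory_order onto the builtin's constants.
inline int to_builtin_order(std::memory_order m) {
    switch (m) {
        case std::memory_order_relaxed: return __ATOMIC_RELAXED;
        case std::memory_order_consume: return __ATOMIC_CONSUME;
        case std::memory_order_acquire: return __ATOMIC_ACQUIRE;
        case std::memory_order_release: return __ATOMIC_RELEASE;
        case std::memory_order_acq_rel: return __ATOMIC_ACQ_REL;
        case std::memory_order_seq_cst: return __ATOMIC_SEQ_CST;
    }
    return __ATOMIC_SEQ_CST;
}

// Hypothetical wrapper: each method rejects the disallowed modes and then
// hands the memory order straight to a single builtin.
struct sketch_atomic_int {
    int value;

    void store(int v, std::memory_order m = std::memory_order_seq_cst) {
        assert(m != std::memory_order_acquire);
        assert(m != std::memory_order_acq_rel);
        assert(m != std::memory_order_consume);
        __atomic_store_n(&value, v, to_builtin_order(m));
    }

    int load(std::memory_order m = std::memory_order_seq_cst) const {
        assert(m != std::memory_order_release);
        assert(m != std::memory_order_acq_rel);
        return __atomic_load_n(&value, to_builtin_order(m));
    }
};
```

Because the header no longer builds the sequence itself, the choice of instructions for each mode is left entirely to the builtin expansion.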

std::memory_order_relaxed operations have no memory synchronization characteristics at all, so these operations should hopefully be emitted simply as an instruction rather than a builtin (i.e., immediate expansion), but that would be handled by the builtin operation emitter. This would allow the optimizers to treat them as cleanly as possible. They will still be atomic instructions, but they can be optimized.

I am not aware of any reason right now why relaxed mode operations should ever need to be special cased for some architecture beyond issuing the atomic instruction. If anyone can point out a reason, then relaxed mode operations will simply be emitted as a builtin __sync function with the relaxed parameter. This is a trivial change.

All other modes have synchronization side effects, and we supply a default for each of the possible method/memory order combinations.

The std::memory_order_seq_cst version defaults to

    __sync_synchronize
      atomic operation
    __sync_synchronize

This should ensure that the operation executes correctly on most architectures. If this is insufficient for some architecture, it will have to provide a correct sequence in the machine description (apparently Itanium falls into this category).

All the other required synchronization modes will default to whatever the std::memory_order_seq_cst version is defined as.
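The default expansion can be modelled roughly as follows for fetch_add. This is a sketch under assumptions: default_fetch_add is a hypothetical name, and a relaxed __atomic_fetch_add (a GCC 4.7+ / Clang builtin) stands in for the bare atomic operation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical model of the default: relaxed emits only the atomic
// operation; every other mode falls back to the seq_cst sequence, i.e. the
// operation bracketed by two full barriers.
inline int default_fetch_add(int* p, int delta, std::memory_order m) {
    if (m == std::memory_order_relaxed) {
        return __atomic_fetch_add(p, delta, __ATOMIC_RELAXED);
    }
    __sync_synchronize();
    int old = __atomic_fetch_add(p, delta, __ATOMIC_RELAXED);
    __sync_synchronize();
    return old;
}
```

A target that has a cheaper acquire or release sequence would replace the non-relaxed branch through its machine description.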

When there is a better sequence for one or more operations, the machine description can provide the sequence and override the default version. This will allow much better sequences for any architecture with better hardware support for the various memory modes.

If the machine description provides a pattern for one of these __sync builtins, then this pattern is responsible for handling all the various memory model options.

Optimizations

At this point, we aren't going to worry about how to optimize atomic instructions with synchronization side effects. How synchronization is handled appears to vary significantly between architectures. The best approach may be simply to have a machine-dependent peephole optimizer run after builtin expansion to look for sequences of synchronizations that can be eliminated or replaced; this is likely to produce the most efficient results. When the defaults are being used, there may be a lot of redundant __sync_synchronize() calls, and we may want to look into eliminating those during expansion. (Imagine 3 seq_cst loads in a row: we'd end up with 6 __sync_synchronize() calls.)
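The redundancy can be seen in a toy model of the expanded instruction stream. The Insn enum and the three functions below are hypothetical. The only peephole rule modelled is dropping a barrier that immediately follows another barrier, which is an assumption about what such a pass might do and not a description of an existing one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction stream: a full barrier or a plain load.
enum class Insn { Fence, Load };

// Default expansion of n consecutive seq_cst loads: fence, load, fence each.
std::vector<Insn> expand_seq_cst_loads(std::size_t n) {
    std::vector<Insn> out;
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(Insn::Fence);
        out.push_back(Insn::Load);
        out.push_back(Insn::Fence);
    }
    return out;
}

// Peephole: a fence directly after another fence adds nothing, so drop it.
std::vector<Insn> merge_adjacent_fences(const std::vector<Insn>& in) {
    std::vector<Insn> out;
    for (Insn i : in) {
        if (i == Insn::Fence && !out.empty() && out.back() == Insn::Fence) {
            continue;
        }
        out.push_back(i);
    }
    return out;
}

std::size_t count_fences(const std::vector<Insn>& v) {
    std::size_t n = 0;
    for (Insn i : v) {
        if (i == Insn::Fence) ++n;
    }
    return n;
}
```

For three loads the expansion contains 6 fences, and merging adjacent ones leaves 4. Whether more can be removed depends on the architecture.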

All builtins will appear as function calls, and therefore act as shared memory barriers to normal optimizations. The optimizers (at least initially) will not distinguish between the various memory model modes, but this is an ongoing area of investigation. At the moment, it looks like it may be possible to treat release and acquire modes as directional barriers (i.e., you can sink some shared memory code through an acquire, and hoist some code past a release). This would probably need to be handled with some sort of attribute on the builtin so the optimizers would understand the nature of the barrier, but I'll treat this entire subject in a different document.

What Is Needed

List of all the required builtins which will take a memory model parameter and the valid modes. All required routines in this table will be prefixed with __atomic_.

Name               Valid Modes
store              seq_cst, release, relaxed
load               seq_cst, acquire, consume, relaxed
exchange           seq_cst, acq_rel, release, acquire, relaxed
compare_exchange   seq_cst, acq_rel, release, acquire, consume, relaxed
fetch_add          seq_cst, acq_rel, release, acquire, consume, relaxed
fetch_sub          seq_cst, acq_rel, release, acquire, consume, relaxed
fetch_and          seq_cst, acq_rel, release, acquire, consume, relaxed
fetch_or           seq_cst, acq_rel, release, acquire, consume, relaxed
fetch_xor          seq_cst, acq_rel, release, acquire, consume, relaxed
flag_test_and_set  seq_cst, acq_rel, release, acquire, consume, relaxed
flag_clear *       seq_cst, release, ?? consume ??, relaxed
thread_fence       seq_cst, acq_rel, release, acquire, consume, relaxed
signal_fence       seq_cst, acq_rel, release, acquire, consume, relaxed

* I don't think atomic_flag_clear should be able to use consume order, but the latest standard I've seen (n3242 29.7.7) doesn't actually seem to prohibit it...
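The valid-mode table lends itself to a simple lookup that a builtin expander or a header assert could consult. The sketch below encodes only the store, load and exchange rows exactly as listed above. The names Op, order_bit and mode_is_valid are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical operation tags for three rows of the table.
enum class Op { store, load, exchange };

// One bit per memory order, so a row of the table becomes a bitmask.
inline unsigned order_bit(std::memory_order m) {
    return 1u << static_cast<unsigned>(m);
}

// Returns true when the table lists mode m as valid for operation op.
inline bool mode_is_valid(Op op, std::memory_order m) {
    unsigned allowed = 0;
    switch (op) {
        case Op::store:
            allowed = order_bit(std::memory_order_seq_cst)
                    | order_bit(std::memory_order_release)
                    | order_bit(std::memory_order_relaxed);
            break;
        case Op::load:
            allowed = order_bit(std::memory_order_seq_cst)
                    | order_bit(std::memory_order_acquire)
                    | order_bit(std::memory_order_consume)
                    | order_bit(std::memory_order_relaxed);
            break;
        case Op::exchange:
            allowed = order_bit(std::memory_order_seq_cst)
                    | order_bit(std::memory_order_acq_rel)
                    | order_bit(std::memory_order_release)
                    | order_bit(std::memory_order_acquire)
                    | order_bit(std::memory_order_relaxed);
            break;
    }
    return (allowed & order_bit(m)) != 0;
}
```

The remaining rows would be added the same way once the open question about flag_clear is settled.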

This should be simple and flexible enough to be used by TM, OpenMP, and others. It's really just an expansion of what we already have to handle memory model variations. Note that some of these built-ins already exist, providing a relaxed mode implementation. Up until now the other variations were constructed within the C++ header files, and now we push them out to be built-in intrinsics to allow machine-dependent specialization.

None: Atomic/GCCMM/CodeGen (last edited 2012-03-22 21:09:57 by 209)