External Library Interface

GCC provides atomic operations via the older __sync routines, and newer __sync_mem routines (which include a memory model). Whenever possible, the compiler turns these atomic routines into lock-free instructions sequences, falling back on a compare and swap loop if the architecture doesn't directly support the operation.

However, there are cases when this isn't possible:

The architecture lacks a compare_and_swap, or otherwise doesn't support atomic operations
The size of the object doesn't map to a supported atomic size
The program has to interact with other architectures which don't support atomics.

GCC's implementation leaves the built-in function as a call to be resolved later if lock free instructions could not be generated.

Earlier versions of libstdc++-v3 for c++0x provided a locked implementation of the atomic classes which were to be utilized in these cases. This does not interact well with other potential implementations by other compilers. It also doesn't provide a very good upgrade path for legacy code being moved to improved hardware. And it didn't work very well with GCC's implementation of __sync functions since there was no consistent way to tell if something was lock free or not.

The preferred solution (from discussion with Lawrence Crowl) being looked at is to always fall back to a standardized library call when lock-free instructions cannot be produced, and forgetting any other attempts at implementation. Since lock-free and locked versions cannot be utilized simultaneously on any particular object, the compiler must consistently produce either a library call or lock free instructions for atomic operations on any specific object. This will prevent code from being locked into less efficient or less portable implementations.

Lawrence's upgrade example goes something like this:

Compile a library for 386. The 386 does not support a lock-free atomic int, so the compiler generates function calls.
Compile an application for 486. The 486 does support a lock-free atomic int, so the compiler generates machine instructions.
Link those two and run them on a 486. The operating system was built knowing that it's architectural level is 486, so its atomic ABI library implements the atomic int with lock-free machine instructions.
The application code compiled for 486 is as fast as it possibly can be.
The library code compiled for 386 has the overhead of a function call, but is otherwise as fast as it can be. However, when the library was built, it was built for locking, and that was much slower. So, we have improved the performance of the library simply by running it on a newer machine.

Libstdc++-v3 changes - only lock free routines are supplied

Libstdc++-v3 will continue to use lock free routines whenever the target architecture supports it. When a lock free routine cannot be supplied by the compiler, a standardized external call will be made to perform the operation.

The compiler must ensure that for any given object, it either always inlines lock free routines, or calls the external routines. For any given object, these cannot be intermixed. Ie, you can't perform a lock-free store on an object while attempting to perform a load with the external routines. All operations on the object must be one or the other.

These external routines will not be supplied by libstdc++-v3, but rather provided elsewhere to ensure compatibility between various compilers or operating systems.

Implementation of the external routines must agree with the compiler for any given object size. If the compiler generates lock-free code for an object, the library must also implement lock free for that size as well. It is conforming for the compiler to simply always call the library, so both must agree on the implementation. This means that a global decision is made on whether a target has lock-free support for a specific size object.

The existing implementation of locked atomics contained in libstdc++-v3 will be removed and repackaged in some form as a set of routines which can be built as library to satisfy all the external references and provide a locked implementation when someone explicitly requires it. For a given target, this library will build lock-free routines for sizes which are supported, and provide locks for all the remaining routines.

At the moment, there are no plans to add a flag to disable inlining lock-free routines in the compiler completely (which would result in nothing but external calls). This could be useful if a program is being compiled on an architecture which supports lock free operations yet somehow needs to interact with another architecture which does not. Until a need is determined such a flag will not be implemented, but would be easy to do so when required.

The Library interface

In order to satisfy the C++11 atomic functionality, GCC is looking to implement calls to all the required routines utilizing the following pattern:

First, provide a generic routine to handle objects of all sizes. This routine effectively treats the object as a 'slab of memory' and performs the operations on memory which is passed in by the caller.

  void __atomic_exchange (size_t obj_size_in_bytes, void *mem, const void *value, void *return_value, enum memory_model model)

A typical caller, such as the C++ atomic template for exchange would provide any required temporaries, relieving the library of memory allocation responsibilities.

template class <T>
T exchange (T *mem, T val, memory_order m)
{
  T tmp;
  __atomic_exchange (sizeof (T), mem, &val, &tmp, m);
  return tmp;
}

Second, also provide a set of optimized routines for 1, 2, 4, 8, and 16 bytes values which forego the slab of memory approach and pass the required information by value. These map to the sizes which different architectures may provide as atomic instruction sequences.

  I1  __atomic_exchange_1  (I1 *mem, I1 val, enum memory_model model)
  I2  __atomic_exchange_2  (I2 *mem, I2 val, enum memory_model model)
  I4  __atomic_exchange_4  (I4 *mem, I4 val, enum memory_model model)
  I8  __atomic_exchange_8  (I8 *mem, I8 val, enum memory_model model)
  I16 __atomic_exchange_16 (I16 *mem, I16 val, enum memory_model model)

How the library decides to deal with these internally is up to the library implementation, but the interface should be fully exposed since the compiler may generate any of these calls. A lazy compiler may simply call the generic routine, bypassing the optimized versions, so the generic routine should be prepared to handle all sizes. It is up to the implementation of the library to decide whether to call the optimized versions from within or not.

Calls to __atomic_exchange () with a compile time constant 'size' matching an optimized routine can be converted by the compiler into a direct call to that routine. Enabling optimization should fold away all the address taken, extra copies, and other bits, resulting in either just the call or an inlined instruction sequence.

Alignment

All the optimized routines expect that the object will be properly aligned for a data type of the specified size. Results are undefined for objects not properly aligned if they are invoked directly. (ie, may or may not work as expected). The compiler will not map the generic routine to an optimized routine unless the alignment is correct.

__atomic_is_lock_free takes 2 parameters, the first is the size of the object, the second is a pointer to the object. If the second parameter is 0, then the lock-free property for a properly aligned object is returned, other wise the compiler will examine the type of the object and try to determine whether the object is always lock free or not.

If it cannot be determined until runtime, the library has the option to use lock free routines if an object is the right size and proper alignment. This means that objects of the same size may not all be either locked or lock-free.

Homogeneous library parameters

Note that the library routines require that all objects are of one size. The C and C++ standards allow for the size of an object (obj) to be different than the size of the atomic version of that object (_Atomic(obj). This enables the compiler to pad a 3 byte structure into a 4 byte atomic structure, and then use lock free operations on it which wouldn't otherwise be possible.

In order to inline lock-free instructions if sizeof(_Atomic(obj)) != sizeof (obj) either the compiler or the language wrapper code (ie C++ template) has to take care of additional padding in _Atomic(obj). Since the compiler has to be able to do this work, it may as well perform it before calling the library as well. If the library were to also have its own implementation of padding, it could introduce potential differences and may cause incorrect interaction between library and non-library code.

In order to provide this functionality, the compiler could define _Atomic(obj) as a union between Obj, and the approriately sized lock free unsigned integer

struct AtomicObj {
     union {
          Obj data;   // sub data
          UINT_POWER2(sizeof(Obj)) all;  // Next size power of two integer
     };
};

A call to exchange from the c++ templates using the generic interface might then look something like:

Obj AtomicExchange (AtomicObj *ptr, Obj val, memory_order m)
{
  AtomicObj tmp1, tmp2;
  tmp1.all = 0;       // clear entire lock free word
  tmp1.data = val;    // copy in actual sub data
  atomic_exchange (sizeof(AtomicObj), ptr, &tmp1, &tmp2);
  return  tmp2.data;  // return sub data.
}

If this maps to a lock-free word size, then with optimization all the unnecessary copies and addr_taken stuff gets optimized away. If AtomicObj and Obj are the same size, then the assignment 'tmp1.all = 0;' will also get optimized away.

list of library routines

In order to fully satisfy the C++11 spec, GCC is planning on making external calls to the following routines which have both a generic and 5 optimized routines:

  void __atomic_load (size_t size, void *mem, void *return, int model)
  I1  __atomic_load_1  (I1 *mem, int model)
  I2  __atomic_load_2  (I2 *mem, int model)
  I4  __atomic_load_4  (I4 *mem, int model)
  I8  __atomic_load_8  (I8 *mem, int model)
  I16 __atomic_load_16 (I16 *mem, int model)

  void __atomic_store (size_t size, void *mem, void *val, int model)
  void __atomic_store_1  (I1 *mem, I1 val, int model)
  void __atomic_store_2  (I2 *mem, I2 val, int model)
  void __atomic_store_4  (I4 *mem, I4 val, int model)
  void __atomic_store_8  (I8 *mem, I8 val, int model)
  void __atomic_store_16 (I16 *mem, I16 val, int model)

  void __atomic_exchange (size_t size, void *mem, void *val, void *return, int model)
  I1  __atomic_exchange_1  (I1 *mem, I1 val, int model)
  I2  __atomic_exchange_2  (I2 *mem, I2 val, int model)
  I4  __atomic_exchange_4  (I4 *mem, I4 val, int model)
  I8  __atomic_exchange_8  (I8 *mem, I8 val, int model)
  I16 __atomic_exchange_16 (I16 *mem, I16 val, int model)

  bool __atomic_compare_exchange (size_t size, void *obj, void *expected, void *desired, int success, int failure)
  bool  __atomic_compare_exchange_1  (I1 *mem, I1 *expected, I1 desired, int success, int failure)
  bool  __atomic_compare_exchange_2  (I2 *mem, I2 *expected, I2 desired, int success, int failure)
  bool  __atomic_compare_exchange_4  (I4 *mem, I4 *expected, I4 desired, int success, int failure)
  bool  __atomic_compare_exchange_8  (I8 *mem, I8 *expected, I8 desired, int success, int failure)
  bool  __atomic_compare_exchange_16 (I16 *mem, I16 *expected, I16 desired, int success, int failure)

The __atomic_compare_exchange routine in the library is a strong version only. If the compare_exchange_weak cannot be inlined as a lock-free instruction sequence, it loses much of its usefulness, so the compiler should map compare_exchange_weak to call the strong version in the external library. If this is demonstrated as an issue, the weak variant can be added later.

The following routines will not have a generic version as they are integral only operations. Only the 5 size optimized versions are provided.

   I1  __atomic_fetch_add_1  (I1 *mem, I1 val, int model)
   I2  __atomic_fetch_add_2  (I2 *mem, I2 val, int model)
   I4  __atomic_fetch_add_4  (I4 *mem, I4 val, int model)
   I8  __atomic_fetch_add_8  (I8 *mem, I8 val, int model)
   I16 __atomic_fetch_add_16 (I16 *mem, I16 val, int model)

   I1  __atomic_fetch_sub_1  (I1 *mem, I1 val, int model)
   I2  __atomic_fetch_sub_2  (I2 *mem, I2 val, int model)
   I4  __atomic_fetch_sub_4  (I4 *mem, I4 val, int model)
   I8  __atomic_fetch_sub_8  (I8 *mem, I8 val, int model)
   I16 __atomic_fetch_sub_16 (I16 *mem, I16 val, int model)

   I1  __atomic_fetch_and_1  (I1 *mem, I1 val, int model)
   I2  __atomic_fetch_and_2  (I2 *mem, I2 val, int model)
   I4  __atomic_fetch_and_4  (I4 *mem, I4 val, int model)
   I8  __atomic_fetch_and_8  (I8 *mem, I8 val, int model)
   I16 __atomic_fetch_and_16 (I16 *mem, I16 val, int model)

   I1  __atomic_fetch_or_1  (I1 *mem, I1 val, int model)
   I2  __atomic_fetch_or_2  (I2 *mem, I2 val, int model)
   I4  __atomic_fetch_or_4  (I4 *mem, I4 val, int model)
   I8  __atomic_fetch_or_8  (I8 *mem, I8 val, int model)
   I16 __atomic_fetch_or_16 (I16 *mem, I16 val, int model)

   I1  __atomic_fetch_xor_1  (I1 *mem, I1 val, int model)
   I2  __atomic_fetch_xor_2  (I2 *mem, I2 val, int model)
   I4  __atomic_fetch_xor_4  (I4 *mem, I4 val, int model)
   I8  __atomic_fetch_xor_8  (I8 *mem, I8 val, int model)
   I16 __atomic_fetch_xor_16 (I16 *mem, I16 val, int model)

In addition, the following routine will used to determine whether an object is lock free or not:

  bool __atomic_is_lock_free (size_t object_size, void *ptr)

memory_order

If the library is standardized then the values for the memory_order enumeration must also be standardized. GCC is currently using:

      memory_order_relaxed == 0
      memory_order_consume == 1
      memory_order_acquire == 2
      memory_order_release == 3
      memory_order_acq_rel == 4
      memory_order_seq_cst == 5

volatility

At the moment, GCC is doing nothing different with volatile objects.

Treating volatile memory separately from non-volatile was considered due to the nature of fall back implementations. Often, if there is not direct native support, implementation of an atomic operation is performed with a compare_and_swap loop. This means multiple loads may happen to perform an operation which may technically violate the description of volatile. Furthermore, x86-64 can perform all 16 byte atomic operations with instruction sequences... except a load. Load's can be implemented by doing a single compare_exchange with 0, but if the value is already 0 a 'harmless' store would be performed. On a volatile object this is likely to be a stronger violation than multiple loads.

Jeffrey Yasskin points out here that volatile and non-volatile for the same size cannot be segregated:

typedef pair<int*, int64> DWord;
std::atomic<DWord> shared_var;

void thread1() {
  use(shared_var.load());
}

void thread2() {
  // It's legal to add "volatile" in a pointer or reference cast.
  volatile std::atomic<DWord>* v_shared_var = &shared_var;
  // Now this looks identical to an access to a real volatile object.
  v_shared_var->store(DWord(ptr, val));
}

This examples shows that an object can be accessed in both a volatile and non-volatile way, and must therefore have the same implementation.

As he suggests, these library routines will be documented to say that volatile atomic accesses may be implemented with more than one instruction-level access, which may include a 'harmless' store of the same value. If this continues to be an issue for some architecture, then all the lock-free instructions will have to be disabled for objects of the offending size on that architecture.

GCC intrinsics

GCC provides a routine to determine whether an object is always lock free or not

  bool __atomic_always_lock_free (size_t object_size, T *ptr)

If ptr is NULL, then it considers alignment of the typical integral object of object_size. This always evaluates to a compile time constant. The compiler uses this to determine whether to inline an instruction sequence or not.

There are 4 generic memory routines as follows:

void __atomic_load (T* object, T* return_value, memory_order m)
void __atomic_store (T* object, T* new_value, memory_order m)
void __atomic_exchange (T* object, T* new_value, T* return_value, memory_order m)
void __atomic_compare_exchange (T* object, T* expected_value, T* new_value, bool weak, memory_order success, memory_order fail)

The compiler will confirm all the required parameters are of type T *, and if sizeof(T) maps to one of the 5 optimized versions, the call will be translated into the appropriate form, ie for sizeof (T) == 4:

I4 __atomic_load_4 (I4 *obj, memory_order m)

If the compiler can generate instructions for that, ie __atomic_always_lock_free(4, 0) is true, it will be inlined.

If sizeof(T) does not map to one of these versions, then the call will have the additional size_t parameter added, and the external reference will be left as the library expects it. ie

__atomic_load (size_t size, void *object, void *return_value, memory_order m)

In the case of __atomic_compare_exchange, the 'weak' parameter is simply be dropped for the library call.

When a parameter is known to be an integral value that matches one of the 5 optimized variations, the optimzied format is still available for use by specifying an _n as a suffix. ie.

void __atomic_store_n (T *obj, T val, memory_order m)
T    __atomic_load_n (T *obj, memory_order m)
T    __atomic_exchange_n (T* obj, T val, memory_order m)
bool __atomic_compare_exchange_n (T* obj, T* expected, T val, bool weak, memory_order success, memory_order_failure)

It is an error to specify the _n variant and NOT resolve to a properly sized type.

The remaining __atomic calls consist of the __fetch_OP routines. These do not require a generic version since they are type dependent in order to execute arithmetic or a logical operation.

T __atomic_fetch_add (T* object, T value, memory_order m)
T __atomic_fetch_sub (T* object, T value, memory_order m)
T __atomic_fetch_and (T* object, T value, memory_order m)
T __atomic_fetch_or  (T* object, T value, memory_order m)
T __atomic_fetch_xor (T* object, T value, memory_order m)
T __atomic_fetch_nand (T* object, T value, memory_order m)

GCC is also consolidating with the original set of circa 2003 __sync atomic operations which are heavily used in openMP and other places. The functionality of these routines have been made available with the addition of a memory model parameter with the __atomic routines. The original __sync routines provided a __sync_bool_compare_and_swap as well as a __sync_val_compare_and_swap, but nothing which implemented both a bool and val returned from the same call. This made implementing a weak version of CAS impossible.

GCC looks for an rtl pattern for __atomic_compare_and_swap which has a parameter for the weak condition and a memory model and bool/value results. Internally __atomic_compare_exchange will be mapped to wrapper code to satisfy it's parameter constraints and utilize this RTL pattern. If that fails to produce lock-free instructions, then an external call to __atomic_compare_exchange will be made. If a target does not support weak compare_and_swap, it can simply ignore the parameter.

The original __sync routines also have a full set of __sync_OP_and_fetch which return the value after the operation is complete. The new __atomic routines also provide these. If they cannot be inlined as lock-free, they will be implemented by calling the single external library routine and doing the appropriate math to get the correct value. ie: for i = __atomic_add_fetch (ptr, j, model) the code generated will be

  tmp = __atomic_fetch_add (ptr, j, model);
  i = tmp + j;

GCC provides all these routines because it may be possible to expose additional efficiencies on some architectures when generating lock free instructions. The RTL patterns also allow for patterns of the form atomic_OP which are further optimized for cases when the return value discarded.

GCC also has a __atomic_fetch_and_nand operation. Although the C++ spec does not utilize this, it is prudent to provide a set of external routines in the library for GCC compatibility going forward.

For GCC 4.7 the compiler isn't using any new __atomic rtl patterns to implement these builtins. When a new pattern is not available, the implementation falls back to the original __sync rtl patterns if they are present. The original patterns are almost exclusively SEQ_CST, so the memory model parameter may not provide the expected efficiencies. This however does allow the new __atomic builtins to immediately function on any target which already supports the __sync routines.

The __sync builtins have also been adjusted to first look for a new __atomic pattern and use that in SEQ_CST mode if available. This provides an easy upgrade path for any target maintainer, simply rename and adjust the __sync pattern to utilize the memory model, and both the new and legacy routines will just work. There is no need to maintain 2 different sets of atomic operation patterns. It is expected that the next release of GCC will have all targets migrated to the new patterns.