Executive Summary
The C++ memory model is designed to provide predictable results in a parallel environment as well as in a sequential one. This memory model is also being adopted by the C standard. There are two primary components:
- Optimizations are no longer allowed to introduce data races. Generally this means you can't introduce loads or stores of cross-thread visible data that were not present before.
- The new atomic types.
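The first component can be illustrated with a small sketch. The function names are hypothetical and the "transformed" version is hand-written to resemble what a register-promoting optimization might produce; it is not actual GCC output. Both functions compute the same result in a single thread, but the second one writes to hits even when no flag is set, which is a store that was not present in the source.

```cpp
#include <cassert>
#include <vector>

// As written: `hits` is stored to only when some flag is non-zero.
// If every flag is zero, this function never writes to `hits`.
void count_as_written(int& hits, const std::vector<int>& flags) {
    for (int f : flags)
        if (f)
            ++hits;
}

// Hand-written stand-in for a transformed version (assumption, not GCC output):
// the counter is kept in a local and written back unconditionally.
void count_as_transformed(int& hits, const std::vector<int>& flags) {
    int tmp = hits;          // load that happens even if no flag is set
    for (int f : flags)
        if (f)
            ++tmp;
    hits = tmp;              // unconditional store: a new write to shared data
}
```

If another thread updates hits while every flag is zero, the original code is race-free, but the transformed code can overwrite that update with a stale value.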
The data-race component of the C++11 memory model can be implemented by adding compiler flags along the following lines:
--param allow-load-data-races={0,1} - Disable or enable optimizing loads which introduce data races.
--param allow-store-data-races={0,1} - Disable or enable optimizing stores which introduce data races.
--param allow-packed-load-data-races={0,1} - Disable or enable load and mask sequences, use byte sequences instead.
--param allow-packed-store-data-races={0,1} - Disable or enable mask and store sequences, use byte sequences instead.
These 4 flags will implement the required restrictions in the optimizer, and allow them to be turned on or off as required for testing or by knowledgeable users.
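The packed variants can be pictured with the following sketch. The struct and function names are hypothetical, and the second function only models a mask-and-store sequence by copying the whole four-byte object out and back; real generated code would use a word load, a mask and a word store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Four adjacent one-byte fields that share a 32-bit word.
struct Packed {
    std::uint8_t a, b, c, d;
};

// Byte sequence: only the byte holding `b` is written.
void set_b_bytewise(Packed& p, std::uint8_t v) {
    p.b = v;
}

// Model of a mask-and-store sequence: the whole word is read,
// the new value is merged in, and the whole word is written back.
void set_b_word_rmw(Packed& p, std::uint8_t v) {
    unsigned char word[sizeof(Packed)];
    std::memcpy(word, &p, sizeof word);   // reads a, c and d as well
    word[offsetof(Packed, b)] = v;        // merge the new byte
    std::memcpy(&p, word, sizeof word);   // rewrites a, c and d with old values
}
```

In a single thread the two functions are indistinguishable. If another thread stores to a, c or d between the read and the write-back, the second version can silently undo that store, which is the kind of race the packed flags are meant to control.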
Exposure to the normal users can be provided through something along the lines of:
-fmemory-model=c++0x - Disable data races as per architectural requirements.
-fmemory-model=safe - Disable all data race introductions. (enforce all 4 internal restrictions.)
-fmemory-model=single - Allow all data race introductions, as is done today. (relax all 4 internal restrictions.)
The -fmemory-model=c++0x option disables only the data races that must be disabled for compliance on the target. Architectures with no hardware support for data race detection need only disable the store data races; all other architectures must disable all four.
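The relationship between the user-level option and the four internal parameters could be sketched as follows. This is an illustration only: the enum, struct and function names are hypothetical, and it assumes that "the store data races" covers both the plain and the packed store parameters.

```cpp
#include <cassert>

// Hypothetical model of the three proposed -fmemory-model values.
enum class MemoryModel { Cxx0x, Safe, Single };

// One field per proposed --param; true means the race may be introduced.
struct RaceParams {
    bool allow_load;
    bool allow_store;
    bool allow_packed_load;
    bool allow_packed_store;
};

RaceParams params_for(MemoryModel model, bool target_detects_races) {
    const RaceParams forbid_all{false, false, false, false};
    const RaceParams allow_all{true, true, true, true};

    if (model == MemoryModel::Safe)   return forbid_all;  // enforce all 4
    if (model == MemoryModel::Single) return allow_all;   // relax all 4

    // c++0x: depends on the architecture.
    if (target_detects_races) return forbid_all;
    // Assumption: both store parameters count as "store data races".
    return RaceParams{true, false, true, false};
}
```

For example, params_for(MemoryModel::Cxx0x, false) leaves the two load parameters enabled and disables the two store parameters.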
The -fmemory-model flag itself doesn't limit what the user can use the program for; it simply tells the optimizers what the limitations are regarding synchronization with, and awareness of, other threads.
Optimizations must be audited for situations which would break compliance, and modified to check these flags. A conformance testsuite is being developed to help find these locations and then ensure they aren't accidentally re-enabled.
GCC has made the decision that optimizations will be allowed to introduce new load data races, as long as the results are thrown away. It will remain this way until it causes an issue with targeted hardware. This is also allowed by the latest draft standard [N3242.1.10.23]:
Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ]
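A speculative read of this kind might look like the sketch below. The function names are hypothetical and the second function is written by hand to resemble a hoisted load; it is not actual compiler output.

```cpp
#include <cassert>

// As written: `shared` is read only when `use` is true.
int sum_as_written(bool use, const int& shared, int x) {
    if (use)
        return x + shared;
    return x;
}

// Hand-written stand-in for a version with the load hoisted above the test.
// When `use` is false the loaded value is simply thrown away.
int sum_speculative(bool use, const int& shared, int x) {
    int tmp = shared;        // speculative load, not present on this path before
    if (use)
        return x + tmp;
    return x;
}
```

The results are identical, so on hardware that tolerates racy loads the extra read is harmless. On a machine with hardware race detection, the read on the use == false path could be reported as a race that the source program does not contain.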
The other required aspect of memory model compliance is implementing the atomic types and operations. Atomic types are defined such that no other thread may ever see an "in between" state. That is, if 3 stores are needed to change the value of a class, no thread may read a value from the class in which only a subset of the 3 required stores has been performed. The C++ model also provides a memory ordering parameter which affects what kinds of code motion are valid.
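The "no in between state" guarantee can be exercised with a small sketch using std::atomic from the C++11 standard library. The Rgb type and the count_torn_reads function are hypothetical names chosen for illustration; the explicit pad field keeps the object at 4 bytes so that it is likely, though not guaranteed, to map onto a native atomic instruction.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Three logical fields; changing the colour means changing all three.
struct Rgb {
    std::uint8_t r, g, b, pad;
};

// Returns how many reads observed a mixture of two different writes.
int count_torn_reads(int iterations) {
    std::atomic<Rgb> colour(Rgb{0, 0, 0, 0});

    std::thread writer([&] {
        for (int i = 0; i < iterations; ++i) {
            std::uint8_t v = (i & 1) ? 255 : 0;
            // The second argument is the memory ordering parameter.
            colour.store(Rgb{v, v, v, 0}, std::memory_order_release);
        }
    });

    int torn = 0;
    for (int i = 0; i < iterations; ++i) {
        Rgb seen = colour.load(std::memory_order_acquire);
        if (seen.r != seen.g || seen.g != seen.b)
            ++torn;          // would indicate a partially performed update
    }
    writer.join();
    return torn;
}
```

Because every store writes three equal fields and every load is atomic, the function is expected to return 0 for any iteration count.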
The simplest mechanism to implement the atomic feature is to use mutual exclusion locks on each type. When a value is being written, the lock is acquired and all reads are held up until the lock is released. This is undesirable in practice because it creates a bottleneck and performs poorly. There are other options, and the initial implementation uses a more efficient variation of locking.
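The simple lock-per-object scheme can be sketched as below. LockedAtomic and Triple are hypothetical names, and this is the naive scheme described above, not the more efficient locking variation used by the initial implementation.

```cpp
#include <cassert>
#include <mutex>

// Naive atomic wrapper: every access takes the object's own mutex.
template <typename T>
class LockedAtomic {
public:
    explicit LockedAtomic(T initial) : value_(initial) {}

    T load() const {
        std::lock_guard<std::mutex> guard(mutex_);  // readers wait for writers
        return value_;
    }

    void store(T desired) {
        std::lock_guard<std::mutex> guard(mutex_);  // held for the whole write
        value_ = desired;
    }

    T exchange(T desired) {
        std::lock_guard<std::mutex> guard(mutex_);
        T previous = value_;
        value_ = desired;
        return previous;
    }

private:
    mutable std::mutex mutex_;
    T value_;
};

// A type that needs three stores to update.
struct Triple {
    long a, b, c;
};
```

A value such as LockedAtomic<Triple> can then be loaded and stored from several threads without any thread observing a partial update, at the cost of serializing every access through the mutex.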
The ultimate goal is to produce lock-free atomic types. Most modern architectures provide hardware instructions for 1, 2, 4 and 8 byte atomic types. GCC already exposes these basic instructions on targets that provide them. The challenge is providing lock-free behavior for types of other sizes, which have no native support. This will be covered in a future section.
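Whether a given atomic type is lock-free on a particular target can be queried through the standard C++11 interface, as in this sketch. The helper name native_lock_free is hypothetical. The example deliberately queries only small built-in types; on some toolchains, querying larger types may require linking an additional atomic support library.

```cpp
#include <atomic>
#include <cassert>

// Reports whether std::atomic<T> is lock-free for this object on this target.
template <typename T>
bool native_lock_free() {
    std::atomic<T> probe{};
    return probe.is_lock_free();
}

// The ATOMIC_*_LOCK_FREE macros give the same information at compile time:
// 0 = never lock-free, 1 = sometimes lock-free, 2 = always lock-free.
constexpr int int_lock_free_level = ATOMIC_INT_LOCK_FREE;
```

Calling native_lock_free<char>(), native_lock_free<short>(), native_lock_free<int>() and native_lock_free<long long>() covers the 1, 2, 4 and 8 byte cases on typical targets.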
Atomic types are also defined to be synchronization points for cross-thread communication. The compiler is being modified to emit these synchronization elements as well; see the Codegen section. This restricts which optimizations can be performed around atomic types, and the optimizer needs to be taught about these restrictions. This is covered in the Optimizations section.
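The role of an atomic as a synchronization point can be sketched with the usual message-passing pattern. The function name pass_message is hypothetical. The release store and acquire load are what prevent the compiler and hardware from moving the write to payload after the flag store, or the read of payload before the flag load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hands `value` to a second thread through a plain int guarded by an atomic flag.
int pass_message(int value) {
    int payload = 0;                     // ordinary, non-atomic data
    std::atomic<bool> ready(false);      // synchronization point
    int received = -1;

    std::thread consumer([&] {
        // Acquire: later reads may not be moved above this load.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        received = payload;              // guaranteed to see the producer's write
    });

    payload = value;
    // Release: earlier writes may not be moved below this store.
    ready.store(true, std::memory_order_release);

    consumer.join();
    return received;
}
```

If the store to ready were an ordinary store, an optimizer would be free to reorder it with the write to payload, and the consumer could read a stale value.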
Once all the atomic types and operations are supported in a lock-free way, the next step is to provide lock-free versions of much of the standard template library.