The other big impact is the introduction of atomic types into the standard. There are atomic versions of most built-in types, as well as user types with the atomic attribute. At the basic level, an atomic type is defined such that all reads and writes must only ever be seen by other threads as a single operation. For instance, a store to a long long may be 2 or more instructions on some architectures. A parallel thread reading from that memory location could read the value between the two writes and come up with a value which should never exist. This is not allowed... parallel threads should never see a partially completed atomic object.
All atomic operations are implemented to ensure memory coherency per atomic object. Atomic loads and/or stores may require different or additional instructions in order to achieve this. This is critical for different threads performing read/modify/write atomic operations, so they don't interfere with each other. Each atomic object is required to have a single modification order. Threads may see sub-sequences, but values cannot be seen out of order. It amounts to serializing atomic writes. If the values 2 and 4 are written to an atomic variable by 2 different threads, and they turn out to be written in the order 2,4 for some execution, no thread reading that atomic variable during that execution will ever see the sequence 4,2. So there may be some additional cost to read or write an atomic value.
The second aspect of the memory model involves the thread synchronization mechanism built into the atomic types, which is used to prevent data races. A synchronization mode is attached to atomic operations based on what other shared memory variables also need to be synchronized. This gives the programmer some level of control over when and how much synchronization actually happens. There is also an impact on what compiler optimizations are allowed to be performed on and around atomics. The synchronization mode may also have a significant impact on the overall run-time performance of the system.
There is a paper which tries to describe this (A Less formal explanation of the Proposed C++ concurrency memory model) which is better than the draft standard, but still leaves unanswered questions for those unfamiliar with the standard.
Anthony Williams is producing a book titled “C++ Concurrency In Action: Practical Multithreading” which is due to be released in Sept 2010, but can be purchased in an early release program. This book appears to give a much more rounded description of the memory model and its impact, along with general threading and such.
The atomic types
The first issue is implementing the atomic types themselves. Many architectures provide instructions for atomic operations on integral sizes, such as adding, incrementing, subtracting, etc. The operations are usually provided in 1-, 2-, 4- and 8-byte variations. This allows you to easily work with atomic types which map to these sizes.
Some architectures do not provide this nice variety of operations, but do provide the basic compare_and_swap operation. This instruction can be used to emulate the other operations with a load/manipulate/compare-and-swap loop; it just isn't as efficient.
There are also a few architectures which have no atomic support, but the Linux kernel does provide some support for software atomic sequences, and a few of the ports provide atomic support using this approach.
The remaining architectures simply do not support atomic operations. All of this is documented in the status section of the Atomics wiki page.
Some built-in types may have sizes which do not map to a provided atomic operation, and user defined classes can be any arbitrary size. User classes which happen to be the same size as one of the built-in supported types can be mapped to the same integral routines, and therefore be satisfied easily. (I.e., a user type which is 4 bytes long can use the 4-byte builtin atomic operations.) It's the other sizes which are more interesting.
There are various approaches that can be taken. The simplest approach is to simply add a mutual exclusion lock to the class, so that any reads or writes would be held up until modifications are complete. This could be sped up by providing separate read and write locks so that multiple reads can proceed simultaneously, and only a write would hold everything up. And there are further enhancements which can easily be implemented to provide initial compliance with the standard.
The problem with locks is that if a thread which has a lock suddenly dies for some reason, the lock will never be released, and the entire system comes to a grinding halt. This has triggered a lot of research into lock-free algorithms, many of which it would be nice to incorporate into the standard library. The basis of all this work requires that there is a lock-free implementation of atomic types.
When available, the hardware operations are already lock free, so it's only the non-builtin sizes for which a solution is required. There are various approaches, and at the moment investigations involve maintaining instances of these types with a pointer, and keeping the contents in dynamic memory. When a write operation is performed, new memory is allocated and the new value constructed there. The write is atomically performed by doing a compare-and-swap on the pointer. If the pointer value hasn't changed, then the pointer 'instantly' refers to the entire newly created value. Any new reads will now pick up this new value. Any reads which were using the old value are still referring to the original memory, and still get the original values. The main issue left to resolve is how to dispose of memory once it is no longer needed; investigations are still ongoing.