Data Races

The compiler is not allowed to perform any optimization which may introduce a potential data race on shared memory. This generally involves not introducing stores on code paths which would not have performed the store otherwise, and taking care with loads. There is a paper by Hans-J. Boehm (Concurrency memory model compiler consequences) which describes this aspect of the memory model with some concrete examples of how various optimizations in the compiler need attention. That paper covers the subject pretty well, so I won't go into any specific details, rather just summarize it here.

Accessing Packed Data

The new memory model defines each individual field/object as being its own “memory location”. The compiler is no longer allowed to write to unrelated memory locations because that can introduce a data race between threads. In practice, this means the current technique of packing small data fields into a single word and accessing it with various combinations of full word loads, masks, shifts, and stores is no longer acceptable.

This most commonly occurs in the form of bitfields and small data types in a structure. On some architectures partial word reads are much slower. Typically the entire word is read, the irrelevant parts masked off, bits are shifted, and whats left is the correct data. Similarly, writes are performed by reading the entire word, clearing the bits containing the field, masking the data into the correct bits, and writing the entire word back.

This can be fairly efficient, but the problem is that a second thread may trying to write to another field in that same word, and that introduces a potential data race.

struct x { char a; char b; } S = { 'x','y' }; // structure fits in a single word


thread1() { S.a = 'a'; }

thread2() { S.b = 'b'; }

If thread1 and thread2 both read the original full-word value of S (which is {'x','y'}), manipulate that value to set the required field, and then write the result back, one of the results will be overwritten by the other thread. The final result will be either {'a','y'} or {'x','b'}. The 2 threads are “racing” to see who completes their operation first... This new data race introduced by storing the unrelated values of 'S' back into memory could be addressed by not allowing bitfields or smaller-than-word sized data to share words. This solution is not practical since many existing structures would blow up in size tremendously.

This could also be addressed by replacing the current load/mask/store sequence with a load/mask/compare-and-swap sequence, and if the compare-and-swap fails, restart the sequence with the load again.

(For those not aware, compare-and-swap is an atomic operation most modern architectures provide in which you specify both the original value, and what you want to change it to. The processor then ensures that the memory location still contains the original value before it performs the store atomically. If the memory location contains something different, the operation fails and the store is not performed. This allows one to be sure the contents haven't been changed since it was last examined.)

This wouldn't be very speed efficient, but would resolve the data race problem on architectures which support that built-in atomic construct. Using this operation to prevent data races between word-sharing fields technically violates the standard, because a store is still performed into the other location, it just happens to be the identical value. Another concern is that this sequence tends to be slow, but it might be worth experimenting with at some point to measure what the performance impact actually is. Of course, it is not an option on architectures with hardware which can detect data races as the store then becomes observable.

In any case, neither option is really considered acceptable, so the recommended option is to access these fields using smaller-than-word operations, so half-word or byte loads and stores. This needs to be enforced in GCC. On architectures which do not have hardware support for data race detection, we can still load these values using shift and masks, as there is no way to detect that the non-relevant portion of the data is being loaded. (Since the other parts are being thrown away by the compiler).

It is not practical to implement single bit loads and stores on most hardware, so the standards committee has seen fit to decide that all contiguous bitfields can be considered a single memory location for the purpose of introducing data races. (Meaning that, yes, 2 threads can still data race and overwrite bitfields in the same word.) The only exception is when bitfields are separated by a zero-length bitfield or they happen to be contiguous, but occur in different structures/unions. Those conditions force alignments on those fields anyway, and thus the compiler can fall back to smaller loads and stores again. In these cases, the programmer will have to add appropriate locking for thread safe accesses.

Avoiding Speculation

Most forms of speculation involve introducing loads and/or stores on code paths which may not have executed a load or store to that location before. This new load or store may introduce a data race between threads which was not present before. This is not allowed in the standard, and these types of optimizations must be disabled. This applies to loads and stores of cross-thread visible data (shared memory) only, purely thread local optimizations are still allowed.

An examination of each optimization which may perform these types of transformations is underway, and new flags will control the individual aspects of compliance. This allows individual components of the model to be fine tuned and tested for compliance, or turned off completely if compliance isn't required.

There are cases of optimizations where an alternative implementation can be provided which do not introduce data races, but they tend to be very special purpose and care needs to be taken. The paper quoted earlier has a few examples, but an example here may help to make things clear:

The following accumulation transformation:

  for (p = q; p; p = p->next)
    if (p->data > 0) ++count;

can no longer be transformed to:

  int tmp = count;

  for (p = q; p; p = p->next)
    if (p->data > 0) ++tmp;
  count = tmp;

The rationale is that in the case of q == NULL, the new transformation performs a store of “count = count” at the end of the loop. There was no store executed by this snippet before the transformation, and another thread could be counting on this and write a different value into count when q == NULL. This first thread could now overwrite that value with the original value of count. This was not possible in the program before, thus the semantics have been changed and a new data race introduced by the store.

The transformation could be allowed if it were transformed into:

 int tmp = 0;

 for (p = q; p; p = p->next)
  if (p->data > 0) ++tmp; if (tmp != 0) count += tmp;

Now there is no extra store on that code path in the optimized code.

Speculative loads may also introduce data races, and are disallowed in all source-to-source translators. The optimizer however has some extra freedoms. A speculative load which the has its results thrown away are considered to not have changed the semantics of the program, and are therefore allowed. The compiler must ensure that it doesn't later introduce a data race with that load. If a future target architecture has hardware race detection abilities then new races will again be disallowed. This is why there are separate flags to disable load and store operations...

GCC is currently not going to invest the time, effort, and required code obfuscation to prevent speculative loads until such hardware exists.

Disallowing speculative loads will impact a number of optimizations, including scheduling, common subexpression elimination, etc. The effect is you can no longer move a load above a control flow construct if there wasn't already a load on that path. Note again that this is only loads of shared memory. Local variables do not have these restrictions, assuming their addresses never escape.

 If (x) 
   val = global * 2;

can no longer be transformed to

 reg = global; 
 if (x) 
  val = reg * 2;

unless there is already a load of 'global' on the path leading to the 'if'.

None: Atomic/GCCMM/DataRaces (last edited 2011-06-08 22:49:49 by AndrewMacLeod)