This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: should sync builtins be full optimization barriers?


On 09/09/2011 09:09 PM, Geert Bosch wrote:
> For the C++0x atomic types there are:
>
> void A::store(C desired, memory_order order = memory_order_seq_cst) volatile;
> void A::store(C desired, memory_order order = memory_order_seq_cst);
>
> where the first variant (with order = memory_order_relaxed)
> would allow fences to be omitted, while still preventing the compiler from
> reordering memory accesses, IIUC.

I thought the volatile qualifiers were actually there for type correctness, so the compiler wouldn't complain when these methods are used on volatile objects. I.e., you can't call a non-volatile member function on a volatile object, or something like that.
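
A minimal sketch of that point (the 'flag' object and 'signal_ready' function are made up for illustration): without the volatile-qualified overload, the call below would be ill-formed.

#include <atomic>

volatile std::atomic<int> flag{0};   // a volatile atomic object

void signal_ready() {
    // This resolves to the volatile-qualified overload:
    //   void store(int, memory_order) volatile noexcept;
    // Without that overload, calling store() on 'flag' would not compile.
    flag.store(1);
}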


The different memory models are meant to provide some level of consistency in how these atomic operations are treated.

If you use seq-cst, all shared-memory optimizations will be inhibited across the operation, and you will see the behaviour you expect across the system. The cost can be significant on some architectures if the code is at all performance-sensitive.

The memory models expose the different kinds of lower-cost synchronization available in the hardware. The behaviour that other threads can potentially see under these different models is also reflected in which optimizations are allowed.
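
As a sketch of what the cheaper models buy you (the names 'payload', 'ready', 'producer', and 'consumer' are made up for illustration): release/acquire pairing orders just the accesses that need ordering, without the full cost of seq-cst on many targets.

#include <atomic>

int payload = 0;                    // plain data, published via the flag below
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // ordinary store
    ready.store(true, std::memory_order_release);   // release: payload write can't sink below this
}

int consumer() {
    while (!ready.load(std::memory_order_acquire))
        ;                                           // spin until published
    return payload;                                 // acquire pairing guarantees we see 42
}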

Back to the original example:

tail->value = othervalue;      // global variable write
atomic_exchange (&var, tail);  // acquire operation

Although the optimizer moving the store of tail->value to AFTER the exchange seems very wrong on the surface, it's really emulating what another thread could possibly see. When another thread synchronizes and reads 'var', an acquire operation doesn't force outstanding stores to be fully flushed, so the other thread has no guarantee that the store to tail->value has happened yet, even though it gets the expected value of 'var'. That is why it is valid for the optimizer to move the store. For this program to work as the user expects, the atomic exchange has to have at least release semantics, if not something stronger. Using the new builtins and specifying a more appropriate memory model would resolve the issue.
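
A sketch of the corrected call using the new __atomic builtins (the 'node' type, 'var', and the 'publish' wrapper are made up for illustration; only the memory model choice is the point):

// Release, not acquire: the tail->value store cannot sink past the exchange,
// in the compiler or in the hardware.
struct node { int value; struct node *next; };
node *var;

void publish(node *tail, int othervalue) {
    tail->value = othervalue;                           // must be visible before 'var' is
    __atomic_exchange_n(&var, tail, __ATOMIC_RELEASE);
}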

As it turns out, the sample program would never have failed on x86 without the optimizer, since XCHG has an implicit lock and is really seq-cst by nature. But if this program were compiled on another architecture where the instruction actually DID have only the documented acquire semantics, the exact same failure could be triggered by the hardware rather than the optimizer, so the bug would still be there, and bloody hard to find.

Allowing the optimizers to move things based on the memory model actually increases the chances of detecting an error :-) I've started a summary of what the optimizers can and can't do here: http://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations/Details It's further down the list of todos, but eventually we'll get there.

Note that this code movement the optimizer performed cannot be detected by a single-threaded program. It satisfies all the various data dependencies in order to move the store, and any operation which utilizes the value will see the store as it should. So as expected, this code "bug" would still only show up with multiple threads; it's just more likely to with optimization.


> To be honest, I can't quite see the use of completely unordered
> atomic operations, where we don't even prohibit compiler optimizations.
> It would seem if we guarantee that a variable will not be accessed
> concurrently from any other thread, we wouldn't need the operation
> to be atomic in the first place. That said, it's quite likely I'm
> missing something here.

There is no guarantee it isn't being accessed concurrently; we are only guaranteeing that if it is accessed from another thread, it won't be a partially written value. If you read a 64-bit value on a 32-bit machine, you need to be guaranteed that both halves are fully written before any read can happen. That's the bare-minimum guarantee of an atomic.
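
A sketch of that bare-minimum guarantee (the 'counter', 'writer', and 'reader' names are made up for illustration): even fully relaxed atomics rule out torn values.

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter{0};

void writer() {
    // Relaxed: no ordering, no fences, but still a single indivisible write.
    counter.store(0x1111111122222222ULL, std::memory_order_relaxed);
}

std::uint64_t reader() {
    // Can observe 0 or the full value, never one 32-bit half without the other.
    return counter.load(std::memory_order_relaxed);
}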

> For Ada, all atomic accesses are always memory_order_seq_cst, and we
> just care about being able to optimize accesses if we know they'll be
> done from the same processor. For the C++11 model, thinking about
> the semantics of any memory orders other than memory_order_seq_cst
> and their interaction with operations with different ordering semantics
> makes my head hurt.

I had many headaches over a long period wrapping my head around it, but ultimately it maps pretty closely to various hardware implementations. Best bet? Just use seq-cst until you discover you have a performance problem!! I expect that's why it's the default :-)
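
That default is built into the C++11 interface itself, per the declarations quoted at the top; a trivial sketch:

#include <atomic>

std::atomic<int> x{0};

int f() {
    x.store(1);        // same as x.store(1, std::memory_order_seq_cst)
    return x.load();   // same as x.load(std::memory_order_seq_cst)
}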

There is a longer-term plan to optimize the actual atomic operations as well, but that's still drawing-board stuff until we have a solid implementation.

Andrew

