This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: C++0x Memory model and gcc
Michael Matz wrote:
Hi,
On Mon, 17 May 2010, Andrew MacLeod wrote:
The guarantees you seem to want to establish by the proposed memory model.
Possibly I misunderstood.
I'm not 100% sure of the guarantees you want to establish. The proposed
model seems to merge multiple concepts together, all related to
memory-access ordering and atomicity, but with different scope and
difficulty of guaranteeing.
I think the standard is excessively confusing and overly academic. I
even find the term "memory model" adds to the confusion. Some effort was
clearly involved in defining behaviour for hardware which does not yet
exist, but which the language is "prepared" for. I was particularly unhappy
that they merged the whole synchronization thing into an atomic load or
store, at least originally. I would hazard a guess that it evolved to
this state based on the observation that synchronization is almost
inevitably required when an atomic is being accessed. That's just a guess,
however.
However, there is some fundamental goodness in it once you sort through it.
Let's see if I can paraphrase normal uses and map them to the standard :-)
The normal case would be when you have a system-wide lock, and when you
acquire the lock, you expect everything which occurred before the lock
to be completed.
ie
process1: otherglob = 2; global = 10; set atomic_lock(1);
process2: wait (atomic_lock() == 1); print (global);
You expect 'global' in process 2 to always be 10. You are in effect
using the lock as a ready flag for 'global'.
In order for that to happen in a consistent manner, there is more
involved than just waiting for the lock. If process 1 and process 2 are
running on different machines, process 1 will have to flush its cache all
the way to memory, and process 2 will have to wait for that flush to
complete and become visible before it can proceed with loading the proper
value of 'global'. Otherwise the results will not be as expected.
That's the synchronization model which maps to the default, or
'sequentially consistent', C++ model. The cache flushing and whatever
else is required is built into the library routines for performing
atomic loads and stores. There is no mechanism to specify that this lock
is for the value of 'global', so the standard extends the definition of
the lock to say it applies to *all* shared memory written before the
atomic lock value is set. So
process3: wait (atomic_lock() == 1); print (otherglob);
will also work properly. This memory model will always involve some
form of synchronization instructions, and potentially waiting on other
hardware to complete. I don't know much about this, but I'm told
machines are starting to provide instructions to accomplish this type of
synchronization. The obvious conclusion is that once the hardware can
do this synchronization with a few instructions, the entire library call
to set or read an atomic and perform the synchronization may be inlinable
without a call of any kind, just straight-line instructions. At that
point, the optimizer will need to understand that those instructions are
barriers.
If you are using an atomic variable simply as a variable, and don't
care about the synchronization aspects (ie, you just want to always see
a valid value for the variable), then that maps to the 'relaxed' mode.
There may be some academic babble about certain provisions, but this is
effectively what it boils down to. The relaxed mode is what you use when
you don't care about all that memory flushing and just want to see the
values of the atomic itself. So this is the fastest model, but you can't
depend on the values of other shared variables. This is also what you
get when you use the basic atomic store and load macros in C.
The sequential mode has the possibility of being VERY slow if you have a
widely distributed system. That's where the third mode comes in: the
release/acquire model. Proper use of it can remove many of the waits
present in the sequential model, since different processes don't have to
wait for *all* cache flushes, just the ones directly related to a
specific atomic variable in a specific other process. The model is
provided to allow code to run more efficiently, but it requires a better
understanding of the subtleties of multi-processor side effects in the
code you write. I still don't really get it completely, but I'm not
implementing the synchronization parts, so I only need to understand
some of it :-) It is also possible to optimize these operations, ie you
can do CSE and dead store elimination, which can help the code run
faster. That comes later, though.
The optimization flags I'm currently working on are orthogonal to all
this, even though the work uses the term memory model. When a program is
written for multi-processing, the programmer usually attempts to write it
such that there are no data races; otherwise there may be
inconsistencies during execution. If a program has been developed and
is data-race free, the flags are meant to guarantee that the resulting
code will also be data-race free, regardless of whether optimization is
on or off.
Does that make anything clearer? It's true that a bunch of these things
are all intertwined, and that's one of the reasons it comes across as
being so complicated.
It's up to the library guys to make whatever process synchronization is
required happen; I leave that to them. They say they have a handle on
it; we'll see. When they do, then we might get to inline it and do some
interesting things.
Andrew