[RFC][PATCH 0/5] arch: atomic rework

Paul E. McKenney paulmck@linux.vnet.ibm.com
Tue Feb 18 16:47:00 GMT 2014


On Tue, Feb 18, 2014 at 03:33:35PM +0000, Peter Sewell wrote:
> Hi Paul,
> 
> On 18 February 2014 14:56, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
> >> Several of you have said that the standard and compiler should not
> >> permit speculative writes of atomics, or (effectively) that the
> >> compiler should preserve dependencies.  In simple examples it's easy
> >> to see what that means, but in general it's not so clear what the
> >> language should guarantee, because dependencies may go via non-atomic
> >> code in other compilation units, and we have to consider the extent to
> >> which it's desirable to limit optimisation there.
> >>
> >> For example, suppose we have, in one compilation unit:
> >>
> >>     void f(int ra, int*rb) {
> >>       if (ra==42)
> >>         *rb=42;
> >>       else
> >>         *rb=42;
> >>     }
> >
> > Hello, Peter!
> >
> > Nice example!
> >
> > The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
> > says the following about the control dependency in the above construct:
> >
> > ------------------------------------------------------------------------
> >
> >         q = ACCESS_ONCE(a);
> >         if (q) {
> >                 barrier();
> >                 ACCESS_ONCE(b) = p;
> >                 do_something();
> >         } else {
> >                 barrier();
> >                 ACCESS_ONCE(b) = p;
> >                 do_something_else();
> >         }
> >
> > The initial ACCESS_ONCE() is required to prevent the compiler from
> > proving the value of 'a', and the pair of barrier() invocations are
> > required to prevent the compiler from pulling the two identical stores
> > to 'b' out from the legs of the "if" statement.
> 
> thanks
> 
> > ------------------------------------------------------------------------
> >
> > So yes, current compilers need significant help if it is necessary to
> > maintain dependencies in that sort of code.
> >
> > Similar examples came up in the data-dependency discussions in the
> > standards committee, which led to the [[carries_dependency]] attribute for
> > C11 and C++11.  Of course, current compilers don't have this attribute,
> > and the current Linux kernel code doesn't have any other marking for
> > data dependencies passing across function boundaries.  (Maybe some time
> > as an assist for detecting pointer leaks out of RCU read-side critical
> > sections, but efforts along those lines are a bit stalled at the moment.)
> >
> > More on data dependencies below...
> >
> >> and in another compilation unit the bodies of two threads:
> >>
> >>     // Thread 0
> >>     r1 = x;
> >>     f(r1,&r2);
> >>     y = r2;
> >>
> >>     // Thread 1
> >>     r3 = y;
> >>     f(r3,&r4);
> >>     x = r4;
> >>
> >> where accesses to x and y are annotated C11 atomic
> >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
> >>
> >> (Of course, this is an artificial example, to make the point below as
> >> simply as possible - in real code the branches of the conditional
> >> might not be syntactically identical, just equivalent after macro
> >> expansion and other optimisation.)
> >>
> >> In the source program there's a dependency from the read of x to the
> >> write of y in Thread 0, and from the read of y to the write of x on
> >> Thread 1.  Dependency-respecting compilation would preserve those and
> >> the ARM and POWER architectures both respect them, so the reads of x
> >> and y could not give 42.
> >>
> >> But a compiler might well optimise the (non-atomic) body of f() to
> >> just *rb=42, making the threads effectively
> >>
> >>     // Thread 0
> >>     r1 = x;
> >>     y = 42;
> >>
> >>     // Thread 1
> >>     r3 = y;
> >>     x = 42;
> >>
> >> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> >> permit those two reads to see 42. That is moreover actually observable
> >> on current ARM hardware.
> >
> > I do agree that this could happen on current compilers and hardware.
> >
> > Agreed, but as Peter Zijlstra noted in this thread, this optimization
> > is to a control dependency, not a data dependency.
> 
> Indeed.  In principle (again as Hans has observed) a compiler might
> well convert between the two, e.g. if operating on single-bit values,
> or where value-range analysis has shown that a variable can only
> contain one of a small set of values.   I don't know whether that
> happens in practice?  Then there are also cases where a compiler is
> very likely to remove data/address dependencies, eg if some constant C
> is #define'd to be 0 then an array access indexed by x * C will have
> the dependency on x removed.  The runtime and compiler development
> costs of preventing that are also unclear to me.
> 
> Given that, whether it's reasonable to treat control and data
> dependencies differently seems to be an open question.

Here is another (admittedly fanciful and probably buggy) implementation
of f() that relies on data dependencies (according to C11 and C++11),
but which could not be relied on to preserve those data dependencies
given current pre-C11 compilers:

	int arr[2] = { 42, 43 };
	int *bigarr;

	int f(int ra)
	{
		return arr[ra != 42];
	}

	// Thread 0
	r1 = atomic_load_explicit(&gidx, memory_order_consume);
	r2 = bigarr[f(r1)];

	// Thread 1
	r3 = random() % BIGARR_SIZE;
	bigarr[r3] = some_integer();
	atomic_store_explicit(&gidx, r3, memory_order_release);

	// Main program
	bigarr = kmalloc(BIGARR_SIZE * sizeof(*bigarr), ...);
	// Note: bigarr currently contains pre-initialization garbage
	// Spawn threads 0 and 1

Many compilers would be happy to convert f() into something like the
following:

	int f(int ra)
	{
		if (ra == 42)
			return arr[0];
		else
			return arr[1];
	}

And many would argue that this is a perfectly reasonable conversion.

However, this breaks the data dependency, and allows Thread 0's load
from bigarr[] to be speculated, so that r2 might end up containing
pre-initialization garbage.  This is why the consume.2014.02.16c.pdf
document advises against attempting to carry dependencies through
relational operators and booleans (&& and ||) when using current compilers
(hmmm...  I need to state that advice more strongly).  And again,
this is one of the reasons for the [[carries_dependency]] attribute in
C11 -- to signal the compiler to be careful in a given function.

Again, this example is fanciful.  It is intended to illustrate a data
dependency that could be broken given current compilers and hardware.
It is -not- intended as an example of good code for the Linux kernel,
much the opposite, in fact.

That said, I would very much welcome a more realistic example.

> >> So as far as we can see, either:
> >>
> >> 1) if you can accept the latter behaviour (if the Linux codebase does
> >>    not rely on its absence), the language definition should permit it,
> >>    and current compiler optimisations can be used,
> >>
> >> or
> >>
> >> 2) otherwise, the language definition should prohibit it but the
> >>    compiler would have to preserve dependencies even in compilation
> >>    units that have no mention of atomics.  It's unclear what the
> >>    (runtime and compiler development) cost of that would be in
> >>    practice - perhaps Torvald could comment?
> >
> > For current compilers, we have to rely on coding conventions within
> > the Linux kernel in combination with non-standard extensions to gcc
> > and specified compiler flags to disable undesirable behavior.  I have a
> > start on specifying this in a document I am preparing for the standards
> > committee, a very early draft of which may be found here:
> >
> > http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf
> >
> > Section 3 shows the results of a manual scan through the Linux kernel's
> > dependency chains, and Section 4.1 lists a probably incomplete (and no
> > doubt erroneous) list of coding standards required to make dependency
> > chains work on current compilers.  Any comments and suggestions are more
> > than welcome!
> 
> Thanks, that's very interesting (especially the non-local dependency chains).
> 
> At a first glance, the "4.1 Rules for C-Language RCU Users" seem
> pretty fragile - they're basically trying to guess the limits of
> compiler optimisation smartness.

Agreed, but that is the world we currently must live in, given pre-C11
compilers and the tepid implementations of memory_order_consume in
the current C11 implementations that I am aware of.  As long as the
Linux kernel must live in this world for some time to come, I might as
well document the limitations, fragile though they might be.

> >> For more context, this example is taken from a summary of the thin-air
> >> problem by Mark Batty and myself,
> >> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> >> dependencies via other compilation units was AFAIK first pointed out
> >> by Hans Boehm.
> >
> > Nice document!
> >
> > One point of confusion for me...  Example 4 says "language must allow".
> > Shouldn't that be "language is permitted to allow"?  Seems like an
> > implementation is always within its rights to avoid an optimization if
> > its implementation prevents it from safely detecting the opportunity
> > for that optimization.  Or am I missing something here?
> 
> We're saying that the language definition must allow it, not that any
> particular implementation must be able to exhibit it.

Ah, got it.  You had me worried there for a bit!  ;-)

							Thanx, Paul


