This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: C provenance semantics proposal
- From: Peter Sewell <Peter dot Sewell at cl dot cam dot ac dot uk>
- To: Richard Biener <richard dot guenther at gmail dot com>
- Cc: Jeff Law <law at redhat dot com>, GCC Development <gcc at gcc dot gnu dot org>, cl-c-memory-object-model <cl-c-memory-object-model at lists dot cam dot ac dot uk>
- Date: Wed, 17 Apr 2019 10:15:18 +0100
- Subject: Re: C provenance semantics proposal
- References: <CAHWkzRTd-AsOxOckRaDAbyqWQf_tytFACubN9wpi=NG6=ha_jA@mail.gmail.com> <ddf469fd-685c-8f99-9164-bb62ec435685@redhat.com> <CAHWkzRTp8fFqXo7M5U5idHubxg3Q7rJ6GCqkG+o1-T8V8vCaYg@mail.gmail.com> <CAFiYyc0Tc4Et8ND73Zb14goRs95ZwuCE48wrGB=JXjSTTjgwcA@mail.gmail.com>
- Reply-to: Peter dot Sewell at cl dot cam dot ac dot uk
On 17/04/2019, Richard Biener <richard.guenther@gmail.com> wrote:
> On Fri, Apr 12, 2019 at 5:31 PM Peter Sewell <Peter.Sewell@cl.cam.ac.uk>
> wrote:
>>
>> On Fri, 12 Apr 2019 at 15:51, Jeff Law <law@redhat.com> wrote:
>> >
>> > On 4/2/19 2:11 AM, Peter Sewell wrote:
>> > > Dear all,
>> > >
>> > > continuing the discussion from the 2018 GNU Tools Cauldron, we
>> > > (the WG14 C memory object model study group) now
>> > > have a detailed proposal for pointer provenance semantics, refining
>> > > the "provenance not via integers (PNVI)" model presented there.
>> > > This will be discussed at the ISO WG14 C standards committee at the
>> > > end of April, and comments from the GCC community before then would
>> > > be very welcome. The proposal reconciles the needs of existing code
>> > > and the behaviour of existing compilers as well as we can, but it
>> > > doesn't
>> > > exactly match any of the latter, so we'd especially like to know
>> > > whether
>> > > it would be feasible to implement - our hope is that it would only
>> > > require
>> > > minor changes. It's presented in three documents:
>> > >
>> > > N2362 Moving to a provenance-aware memory model for C: proposal for
>> > > C2x
>> > > by the memory object model study group. Jens Gustedt, Peter Sewell,
>> > > Kayvan Memarian, Victor B. F. Gomes, Martin Uecker.
>> > > This introduces the proposal and gives the proposed change to the
>> > > standard
>> > > text, presented as change-highlighted pages of the standard
>> > > (though one might want to read the N2363 examples before going into
>> > > that).
>> > > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf
>> > >
>> > > N2363 C provenance semantics: examples.
>> > > Peter Sewell, Kayvan Memarian, Victor B. F. Gomes, Jens Gustedt,
>> > > Martin Uecker.
>> > > This explains the proposal and its design choices with discussion of
>> > > a
>> > > series of examples.
>> > > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
>> > >
>> > > N2364 C provenance semantics: detailed semantics.
>> > > Peter Sewell, Kayvan Memarian, Victor B. F. Gomes.
>> > > This gives a detailed mathematical semantics for the proposal
>> > > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
>> > >
>> > > In addition, at http://cerberus.cl.cam.ac.uk/cerberus we provide an
>> > > executable version of the semantics, with a web interface that
>> > > allows one to explore and visualise the behaviour of small test
>> > > programs, stepping through and seeing the abstract-machine
>> > > memory state including provenance information. N2363 compares
>> > > the results of this for the example programs with gcc, clang, and icc
>> > > results, though the tests are really intended as tests of the
>> > > semantics
>> > > rather than compiler tests, so one has to interpret this with care.
>> > THanks. I just noticed this came up in EuroLLVM as well. Getting
>> > some standards clarity in this space would be good.
>> >
>> > Richi is in the best position to cover for GCC, but I suspect he's
>> > buried with gcc-9 issues as we approach the upcoming release.
>> > Hopefully
>> > he'll have time to review this once crunch time has past. I think more
>> > than anything sanity checking the proposal's requirements vs what can
>> > be
>> > reasonably implmemented is most important at this stage.
>>
>> Indeed. We talked with him at the GNU cauldron, without uncovering
>> any serious problems, but more detailed review from an implementability
>> point of view would be great. For the UB mailing list we just made
>> a brief plain-text summary of the proposal (leaving out all the examples
>> and standards diff, and glossing over some details). I'll paste that
>> in below in case it's helpful. The next WG14 meeting is the week of
>> April 29; comments before then would be particularly useful if that's
>> possible.
>>
>> best,
>> Peter
>>
>> C pointer values are typically represented at runtime as simple
>> concrete numeric values, but mainstream compilers routinely exploit
>> information about the "provenance" of pointers to reason that they
>> cannot alias, and hence to justify optimisations. This is
>> long-standing practice, but exactly what it means (what programmers
>> can rely on, and what provenance-based alias analysis is allowed to
>> do), has never been nailed down. That's what the proposal does.
>>
>>
>> The basic idea is to associate a *provenance* with every pointer
>> value, identifying the original storage instance (or allocation, in
>> other words) that the pointer is derived from. In more detail:
>>
>> - We take abstract-machine pointer values to be pairs (pi,a), adding a
>> provenance pi, either @i where i is a storage instance ID, or the
>> *empty* provenance, to their concrete address a.
>>
>> - On every storage instance creation (of objects with static, thread,
>> automatic, and allocated storage duration), the abstract machine
>> nondeterministically chooses a fresh storage instance ID i (unique
>> across the entire execution), and the resulting pointer value
>> carries that single storage instance ID as its provenance @i.
>>
>> - Provenance is preserved by pointer arithmetic that adds or subtracts
>> an integer to a pointer.
>>
>> - At any access via a pointer value, its numeric address must be
>> consistent with its provenance, with undefined behaviour
>> otherwise. In particular:
>>
>> -- access via a pointer value which has provenance a single storage
>> instance ID @i must be within the memory footprint of the
>> corresponding original storage instance, which must still be
>> live.
>>
>> -- all other accesses, including those via a pointer value with
>> empty provenance, are undefined behaviour.
>>
>> Regarding such accesses as undefined behaviour is necessary to make
>> optimisation based on provenance alias analysis sound: if the standard
>> did define behaviour for programs that make provenance-violating
>> accesses, e.g.~by adopting a concrete semantics, optimisation based on
>> provenance-aware alias analysis would not be sound. In other words,
>> the provenance lets one distinguish a one-past pointer from a pointer
>> to the start of an adjacently-allocated object, which otherwise are
>> indistinguishable.
>>
>> All this is for the C abstract machine as defined in the standard:
>> compilers might rely on provenance in their alias analysis and
>> optimisation, but one would not expect normal implementations to
>> record or manipulate provenance at runtime (though dynamic or static
>> analysis tools might).
>>
>>
>> Then, to support low-level systems programming, C provides many other
>> ways to construct and manipulate pointer values:
>>
>> - casts of pointers to integer types and back, possibly with integer
>> arithmetic, e.g.~to force alignment, or to store information in
>> unused bits of pointers;
>>
>> - copying pointer values with memcpy;
>>
>> - manipulation of the representation bytes of pointers, e.g.~via user
>> code that copies them via char* or unsigned char* accesses;
>>
>> - type punning between pointer and integer values;
>>
>> - I/O, using either fprintf/fscanf and the %p format, fwrite/fread on
>> the pointer representation bytes, or pointer/integer casts and
>> integer I/O;
>>
>> - copying pointer values with realloc; and
>>
>> - constructing pointer values that embody knowledge established from
>> linking, and from constants that represent the addresses of
>> memory-mapped devices.
>>
>>
>> A satisfactory semantics has to address all these, together with the
>> implications on optimisation. We've explored several, but our main
>> proposal is "PNVI-ae-udi" (provenance not via integers,
>> address-exposed, user-disambiguation).
>>
>> This semantics does not track provenance via integers. Instead, at
>> integer-to-pointer cast points, it checks whether the given address
>> points within a live object that has previously been *exposed* and, if
>> so, recreates the corresponding provenance.
>>
>> A storage instance is deemed exposed by a cast of a pointer to it to
>> an integer type, by a read (at non-pointer type) of the representation
>> of the pointer, or by an output of the pointer using %p.
>
> So this is not what GCC implements which tracks provenance through
> non-pointer types to a limited extent when only copying is taking place.
>
> Your proposal makes
>
> int a, b;
> int *p = &a;
> int *q = &b;
> uintptr_t pi = (uintptr_t)p; //expose
> uintptr_t qi = (uintptr_t)q; //expose
> pi += 4;
> if (pi == qi)
> *(int *)pi = 1;
>
> well-defined since (int *)pi now has the provenance of &b.
Yes. (Just to be clear: it's not that we think the above example is
desirable in itself, but it's well-defined as a consequence of what
we do to make other common idioms, eg pointer bit manipulation,
well-defined.)
> Note GCC, when tracking provenance of non-pointer type
> adds like in
>
> int *p = &a;
> uintptr_t pi = (uintptr_t)p;
> pi += 4;
>
> considers pi to have provenance "anything" (not sure if you
> have something like that) since we add 4 which has provenance
> "anything" to pi which has provenance &a.
We don't at present have a provenance "anything", but if the gcc
"anything" means that it's assumed that it might alias with anything,
then it looks like gcc's implementing a sound approximation to
the proposal here?
best,
Peter
>> The user-disambiguation refinement adds some complexity but supports
>> roundtrip casts, from pointer to integer and back, of pointers that
>> are one-past a storage instance.
>