[PATCH] doc: clarify the situation with pointer arithmetic

Wed Jan 29 09:01:00 GMT 2020

On Tue, Jan 28, 2020 at 1:24 PM Uecker, Martin
<Martin.Uecker@med.uni-goettingen.de> wrote:
>
> Am Dienstag, den 28.01.2020, 11:01 +0100 schrieb Richard Biener:
> > On Tue, Jan 28, 2020 at 8:20 AM Alexander Monakov <amonakov@ispras.ru> wrote:
> > >
> > > On Tue, 28 Jan 2020, Uecker, Martin wrote:
> > >
> > > > > (*) this also shows the level of "obfuscation" needed to fool compilers
> > > > > to lose provenance knowledge is hard to predict.
> > > >
> > > > Well, this is exactly the problem we want to address by defining
> > > > a clear way to do this. Casting to an integer would be the way
> > > > to state: "consider the pointer as escaped and forget the
> > > > provenance"  and casting an integer to a  pointer would
> > > > mean "this pointer may point to all objects whose pointer has
> > > > escaped". This would give the programmer explicit control about
> > > > this aspect and make most existing code using pointer-to-integer
> > > > casts well-defined. At the same time, this should be simple
> > > > to add to existing points-to analysis.
> > >
> > > Can you explain why you make it required for the compiler to treat the
> > > points-to set unnecessarily broader than it could prove? In the Matlab
> > > example, there's a simple chain of computations that the compiler is
> > > following to prove that the pointer resulting from the final cast is
> > > derived from exactly one other pointer (no other pointers have
> > > participated in the computations).
> > >
> > > Or, in other words:
> > >
> > > is there an example where a programmer can distinguish between the
> > > requirement you explain above vs. the fine-grained interpretation
> > > that GIMPLE aims to implement (at least as I understand it), which is:
> > >
> > >   when the program creates a pointer by means of non-pointer computations
> > >   (casts, representation access, etc), the resulting pointer may point to:
> > >
> > >     * any object which address could have participated in the computation
> > >       (which is at worst the entire set of "exposed" objects up to that
> > >        program point, but can be much narrower if the compiler can see
> > >        the entire chain of computations)
> > >
> > >     * any objects which is not "exposed" but could have known address, e.g.
> > >       because it is placed at a specific address during linking
> >
> > Note for the current PTA implementation there's almost no cases we can
> > handle conservatively enough.  Consider the simple
> >
> >  int a[4];
> >  int *p = &a[1];
> >  uintptr_t pi = (uintptr_t)p;
> >  pi += 4;
> >  int *q = (int *)pi;
> >
> > our PTA knows that p points to a (but not the exact offset), same for pi
> > (the cast doesn't change the value).  Now you add 4 - this could lead
> > you outside of 'a' so the points-to set becomes 'a and anything'.
>
> PVNI would say that 'a' gets exposed in the first case and
> then 'q' can point to all exposed objects because of the
> second cast.
>
> A correct and conservative implementation would do this:
> In the PTA you would need to mark the address of a escaped
> and then later assign a conservative points-to set (all
> escaped objects) to q.

I see (I'm asking all these questions to see what implementing -fpta-pnvi
would mean).  That might be a feasible implementation route.  How
does "escaped" as used in your answer apply when doing inter-procedural
analysis?  I guess we would need to assume points-to sets include
automatic variables that escaped in the caller.

> Yes, this limits optimizations, but I do not think this is
> terrible. (optimizations could be re-enabled with a compiler
> option)

We'll see.

> > I'm also not sure what PVNI does to
> >
> >  int a[4];
> >  int *p = &a[1];
> >  p += 10;
> >  uintptr_t pi = (uintptr_t)p;
> >  p = (int *)pi;
> >
> > we assume that p points to a even after p += 10 (but it of course points
> > outside of the object - obvious here, but not necessarily in more
> > obfuscated cases).
>
> This is UB because a pointer to outside of the object is formed.
>
> > Now, can we assume pi points to a?  The cast isn't value-changing.  Do we have
> > to assume (int *)pi points to anything?  So, is
> >
> >  p = (int *)(uintptr_t)p;
> >
> > something like "laundering" a pointer?  We don't preserve such simple round-trip
> > casts since they are value-preserving.  Are they provenance preserving?
>
> Yes, such a pair would be "laundering" as it allows 'p' to then
> point to any exposed object provenance-wise.
>
> For such casts the FE would maybe add a marker. Maybe a calls
> to builtin functions 'builtin_expose(a)' and 'builting_bless'.
> (having those builtins would be interesting on its own, btw).

Uh, ok.

> Having said this, some optimizations could still be allowed using
> the "as-if" rule and other lines of reasoning. Specifally, PVNI
> states that 'p' gets assigned the provenance of the object the
> integer values is the address of. So if the compiler can proof
> that the address belongs to certain objects it can reassign the
> points-to set to the new 'p'. Only if there is ambiguity between
> which objects the address belongs to, the reasoning needs to
> be more conservative.
>
> For example:
> int a[3];
> int b[3];
>
> &b; // b also exposed
> int *p = (int*)(uintptr_t)&a[3];
>
> Here, p could point to the one-after address of 'a' or the
> first element of 'b'. (but only because 'b' was also exposed).
>
> If the compiler can prove that something like this can not
> happen (e.g. by considering offsets), it can still do some
> tracking of points-to sets.

That's probably the very case that we'll get wrong since
we definitely won't be able to reliably preserve these
kind of laundering points...

I guess they could be obfuscated like

union { void *p; long l; } u;

u.p = p;
u.l = u.l;
p = u.p;

where GCC (and IIRC also now the standard) allows the
type-punning when reading u.p via u.l and vice versa.
That is, the "conversions" might be hidden in the
memory access types.  That means (our PTA tracks
points-to sets of memory) that all stores of pointers
(even to automatic vars that are themselves not exposed)
make them escaped and CSE would need to desperately
avoid CSEing the load from u.p to p or insert laundering
operations.

I guess I'd me much more happy if PVNI said that when
an integer is converted to a pointer and the integer
is value-equivalent to pointers { p1, p2, ... } then
the provenance of the resulting pointer is
that of p1 (or p2, ... which is semantically equivalent)
and when two pointers p1 and p2 are
value-equivalent and their provenance is not the
same then the behavior is undefined.  That is,

int a, b;
int  *p = &a + 1;
int *q = &b;
if (p == q)
  ... undefined ...

Richard.

> Best,
> Martin
>
>
>
> > Richard.
> >
> > > Thanks.
> > > Alexander