Re: [PATCH] Make strlen range computations more conservative

On 08/01/2018 01:19 AM, Richard Biener wrote:
On Tue, 31 Jul 2018, Martin Sebor wrote:

On 07/31/2018 09:48 AM, Jakub Jelinek wrote:
On Tue, Jul 31, 2018 at 09:17:52AM -0600, Martin Sebor wrote:
On 07/31/2018 12:38 AM, Jakub Jelinek wrote:
On Mon, Jul 30, 2018 at 09:45:49PM -0600, Martin Sebor wrote:
Even without _FORTIFY_SOURCE GCC diagnoses (some) writes past
the end of subobjects by string functions.  With _FORTIFY_SOURCE=2
it calls abort.  This is the default on popular distributions,

Note that _FORTIFY_SOURCE=2 is the mode that goes beyond what the
standard requires and imposes extra requirements.  So from what this
mode accepts or rejects we shouldn't determine what is or isn't
considered valid.

I'm not sure what the additional requirements are but the ones
I am referring to are the enforcing of struct member boundaries.
This is in line with the standard requirements of not accessing
[sub]objects via pointers derived from other [sub]objects.

In the middle-end the distinction between what was originally a reference
to subobjects and what was a reference to objects is quickly lost
(whether through SCCVN or other optimizations).
We've run into this many times with the __builtin_object_size already.
So, if e.g.
struct S { char a[3]; char b[5]; } s = { "abc", "defg" };
strlen ((char *) &s) is well defined but
strlen (s.a) is not in C, for the middle-end you might not figure out which
one is which.
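
To make the layout in this example concrete, here is a minimal sketch
(not from the thread; the helper name whole_object_len is hypothetical).
It scans the whole object with memchr, bounded by the object's size, so
it does not depend on which subobject a pointer was derived from.  It
assumes a C compiler, since initializing char a[3] with "abc" is valid
in C but not in C++.

```c
#include <stddef.h>
#include <string.h>

struct S { char a[3]; char b[5]; };

/* "abc" fills a[] completely, leaving no room for a terminating NUL;
   b[] holds "defg" followed by its NUL.  */
static struct S s = { "abc", "defg" };

/* Hypothetical helper: length of the character sequence starting at
   the first byte of the whole object, never reading past its end.  */
static size_t whole_object_len (const struct S *obj)
{
  const char *base = (const char *) obj;
  const char *nul = memchr (base, '\0', sizeof *obj);
  return nul ? (size_t) (nul - base) : sizeof *obj;
}
```

Called as whole_object_len (&s), this finds the NUL stored in s.b, past
the end of s.a, which is the case the two strlen calls above disagree
about.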

Yes, I'm aware of the middle-end transformation to MEM_REF
-- it's one of the reasons why detecting invalid accesses
by the middle end warnings, including -Warray-bounds,
-Wformat-overflow, -Wsprintf-overflow, and even -Wrestrict,
is less than perfect.

But is strlen(s.a) also meant to be well-defined in the middle
end (with the semantics of computing the length of "abcdefg"?)


And if so, what makes it well defined?

The fact that strlen takes a char * argument and thus inline-expansion
of a trivial implementation like

 int len = 0;
 for (; *p; ++p)
   ++len;

will have

 p = &s.a;

and the middle-end doesn't reconstruct s.a[..] from the pointer.

Certainly not every "strlen" has these semantics.  For example,
this open-coded one doesn't:

  int len = 0;
  for (int i = 0; s.a[i]; ++i)
    ++len;

It computes 2 (with no warning for the out-of-bounds access).


If that's not a problem then why is it one when strlen() does
the same thing?  Presumably the answer is: "because here
the access is via array indexing and in strlen via pointer
dereferences."  (But in C there is no difference between
the two.  Also see below.)

So if the standard doesn't guarantee it and different kinds
of accesses behave differently, how do we explain what "works"
and what doesn't without relying on GCC implementation details?

In the middle-end accesses via pointers - accesses where the
access path is not visible in the access itself - are not
constrained by the "access" path of how the pointer was built.

I have seen, and I think shown in this discussion, examples
where this is not so.  For instance:

  struct S { char a[1], b[1]; };

  void f (struct S *s, int i)
  {
    char *p = &s->a[i];
    char *q = &s->b[0];

    char x = *p;
    *q = 11;

    if (x != *p)            // folded to false
      __builtin_abort ();   // eliminated
  }

Is this a bug?  (I hope not.)

If we can't then the only language we have in common with users
is the standard.  (This, by the way, is what the C memory model
group is trying to address -- the language or feature that's
missing from the standard that says when, if ever, these things
might be valid.)

Well, you simply have to not compare apples and oranges,
a strlen implementation that isn't a strlen implementation
and strlen.

As I'm sure you know, the C standard doesn't differentiate
between the semantics of array subscript expressions and
pointer dereferencing.  They both mean the same thing.
(Nothing prevents an implementation from defining strlen
as a macro that expands into a loop using array indices
for array arguments.)
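
As an illustration of that equivalence (a sketch only; len_by_pointer
and len_by_index are hypothetical names, not anything in GCC or libc),
the two loops below are the same computation as far as the standard is
concerned, since E1[E2] is defined as (*((E1)+(E2))).  They are meant to
be applied to a properly terminated string, so neither one reads out of
bounds.

```c
#include <stddef.h>

/* Length computed by advancing and dereferencing a pointer.  */
static size_t len_by_pointer (const char *p)
{
  size_t len = 0;
  for (; *p; ++p)
    ++len;
  return len;
}

/* Length computed with an array subscript expression.  */
static size_t len_by_index (const char *a)
{
  size_t len = 0;
  for (size_t i = 0; a[i]; ++i)
    ++len;
  return len;
}
```

For any valid string the two return the same value; the question in
this thread is only whether they may differ once the access runs past
the end of a member array.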

But this, I suspect, might be behind the disagreement.  You
seem to think in terms of GIMPLE and GCC internals, and have
a clear idea in your head what's meant to be valid and what
isn't.  I suspect only a few GCC developers think this way.
Most of the rest of us think in terms of the language
specification.  Not just because that's the contract between
programmers and the compiler, but also because it's the only
specification available (the GCC internals manual doesn't go
into nearly enough detail to even hint at what the answers
to some of these questions might be).


