Bug 86265

Summary: Wrong code on an invalid code starting with r255790
Product: gcc Reporter: Martin Liška <marxin>
Component: tree-optimizationAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED DUPLICATE    
Severity: normal CC: msebor, rguenth
Priority: P3 Keywords: wrong-code
Version: unknown   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2018-06-21 00:00:00

Description Martin Liška 2018-06-21 11:59:11 UTC
Starting from the mentioned revision we do:

$ cat tester.c
#include <stdio.h>
#include <string.h>

#define min(a, b) (((a) < (b)) ? (a) : (b))

struct S {
  char data[4];
  char fallout[100];
};

int main(int argc, char **argv) {
  struct S s;
  strncpy(s.data, argv[1], 4 + 100);
  int length = min(strlen(s.data), 4);
  printf("length: %d\n", length);
  return 0;
}

$ gcc tester.c -O3 && ./a.out 123456
length: 6

before the mentioned revision we did:
length: 4.

optimized dumps:
after revision:


;; Function main (main, funcdef_no=11, decl_uid=2580, cgraph_uid=11, symbol_order=11) (executed once)

main (int argc, char * * argv)
{
  struct S s;
  char * _1;
  long unsigned int _2;
  int iftmp.0_3;

  <bb 2> [local count: 1073741825]:
  _1 = MEM[(char * *)argv_4(D) + 8B];
  strncpy (&s.data, _1, 104);
  _2 = strlen (&s.data);
  iftmp.0_3 = (int) _2;
  printf ("length: %d\n", iftmp.0_3);
  s ={v} {CLOBBER};
  return 0;

}

before:


;; Function main (main, funcdef_no=11, decl_uid=2580, cgraph_uid=11, symbol_order=11) (executed once)

main (int argc, char * * argv)
{
  struct S s;
  char * _1;
  long unsigned int _2;
  int iftmp.0_3;
  long unsigned int _10;

  <bb 2> [local count: 1073741825]:
  _1 = MEM[(char * *)argv_4(D) + 8B];
  strncpy (&s.data, _1, 104);
  _2 = strlen (&s.data);
  if (_2 <= 3)
    goto <bb 4>; [50.00%]
  else
    goto <bb 3>; [50.00%]

  <bb 3> [local count: 536870913]:

  <bb 4> [local count: 1073741825]:
  # _10 = PHI <_2(2), 4(3)>
  iftmp.0_3 = (int) _10;
  printf ("length: %d\n", iftmp.0_3);
  s ={v} {CLOBBER};
  return 0;

}
Comment 1 Martin Liška 2018-06-21 12:03:18 UTC
It's invalid because we properly identify that strlen (s.data) <= 4, thus no check is needed.
Comment 2 Richard Biener 2018-06-21 12:50:07 UTC
(In reply to Martin Liška from comment #1)
> It's invalid because we properly identify that strlen (s.data) <= 4, thus no
> check is needed.

No, we can't conclude that.  See PR86259.
Comment 3 Richard Biener 2018-06-21 12:50:53 UTC
Duplicate even.

*** This bug has been marked as a duplicate of bug 86259 ***
Comment 4 Martin Sebor 2018-06-21 15:39:46 UTC
Using strncpy or strlen to cross subobject boundaries isn't valid.  GCC catches the strncpy bug with -Warray-bounds:

pr86265.c: In function ‘main’:
...
In file included from /usr/include/string.h:630,
                 from pr86265.c:2:
pr86265.c:13:3: warning: ‘__builtin_strncpy’ offset [5, 104] from the object at ‘s’ is out of the bounds of referenced subobject ‘data’ with type ‘char[4]’ at offset 0 [-Warray-bounds]
   strncpy(s.data, argv[1], 4 + 100);
   ^~~~~~~

Use memcpy (&s, ...) and memchr (&s, ...) to get that effect.  To make it strictly valid, be sure to pass to the functions the address of the enclosing object rather than a pointer to one of its members:

  struct S s = { "", "" };
  memcpy (&s, argv[1], strlen (argv[1]));
  int length = min((char*)memchr (&s, 0, sizeof s) - (char*)&s, 4);

(-Warray-bounds lets raw memory functions like memcpy and memchr cross subobject boundaries without being diagnosed.)
Comment 5 rguenther@suse.de 2018-06-22 07:27:01 UTC
On Thu, 21 Jun 2018, msebor at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86265
> 
> --- Comment #4 from Martin Sebor <msebor at gcc dot gnu.org> ---
> Using strncpy or strlen to cross subobject boundaries isn't valid.  GCC catches
> the strncpy bug with -Warray-bounds:
> 
> pr86265.c: In function ‘main’:
> ...
> In file included from /usr/include/string.h:630,
>                  from pr86265.c:2:
> pr86265.c:13:3: warning: ‘__builtin_strncpy’ offset [5, 104] from the object at
> ‘s’ is out of the bounds of referenced subobject ‘data’ with type ‘char[4]’ at
> offset 0 [-Warray-bounds]
>    strncpy(s.data, argv[1], 4 + 100);
>    ^~~~~~~
> 
> Use memcpy (&s, ...) and memchr (&s, ...) to get that effect.  To make it
> strictly valid, be sure to pass to the functions the address of the enclosing
> object rather than a pointer to one of its members:
> 
>   struct S s = { "", "" };
>   memcpy (&s, argv[1], strlen (argv[1]));
>   int length = min((char*)memchr (&s, 0, sizeof s) - (char*)&s, 4);
> 
> (-Warray-bounds lets raw memory functions like memcpy and memchr cross
> subobject boundaries without being diagnosed.)

I believe the warning (and policy like _FORTIFY_SOURCE=n) is simply
a policy deemed good for security reasons.  We may not take advantage
of policy violations in code generation though.
Comment 6 Martin Sebor 2018-06-25 22:28:23 UTC
The strlen range optimization doesn't take advantage of undefined behavior -- like all other optimizations, it simply assumes code is free of it.

I have two goals for the warnings I work on: a) most important is to find bugs in user code, and b) less important is to drive improvements to help GCC better analyze source code and emit more efficient object code.

By relying on valid calls to strcpy() writing only into the destination array and not beyond, and reading only from the source array and not beyond, GCC can safely assume that other members of the same struct or other elements of the same array of structs than the one written to are unchanged by the strcpy() call.  For instance, in the following, the tests can safely be eliminated:

  struct A {
    char a[4];
    int i;
  };

  void f (struct A *a)
  {
    int i = a[0].i + a[1].i;

    __builtin_strcpy (a[0].a, a[1].a);

    if (i != a[0].i + a[1].i)
      __builtin_abort ();
  }

There is no reason not to take advantage of this except to cater to exceptionally poorly written (and I'd say exceedingly rare) code, and thus penalize the overwhelming majority of code that doesn't violate the basic rules of the language.
Comment 7 rguenther@suse.de 2018-06-26 06:59:58 UTC
On Mon, 25 Jun 2018, msebor at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86265
> 
> --- Comment #6 from Martin Sebor <msebor at gcc dot gnu.org> ---
> The strlen range optimization doesn't take advantage of undefined behavior --
> like all other optimizations, it simply assumes code is free of it.
> 
> I have two goals for the warnings I work on: a) most important is to find bugs
> in user code, and b) less important is to drive improvements to help GCC better
> analyze source code and emit more efficient object code.
> 
> By relying on valid calls to strcpy() writing only into the destination array
> and not beyond, and reading only from the source array and not beyond, GCC can
> safely assume that other members of the same struct or other elements of the
> same array of structs than the one written to are unchanged by the strcpy()
> call.  For instance, in the following, the tests can safely be eliminated:
> 
>   struct A {
>     char a[4];
>     int i;
>   };
> 
>   void f (struct A *a)
>   {
>     int i = a[0].i + a[1].i;
> 
>     __builtin_strcpy (a[0].a, a[1].a);
> 
>     if (i != a[0].i + a[1].i)
>       __builtin_abort ();
>   }
> 
> There is no reason not to take advantage of this except to cater to
> exceptionally poorly written (and I'd say exceedingly rare) code, and thus
> penalize the overwhelming majority of code that doesn't violate the basic rules
> of the language.

The "basic rules of the language" are hard to understand.  Not only 
because GCC chooses to apply them selectively (not to mem* but to str*).

As GCC developer I'm more concerned about applying rules of the C
language to the GIMPLE IL without much consideration.  I guess
exceptions for mem* also exist because other FEs _do_ emit calls
to those functions as part of their IL lowering to GENERIC.  Now
I would fully expect they do the same for at least a subset of
str* simply because when strings are a first-level entity in a
programming language and you are using the C runtime you have
no choice but using them.

Until now the GIMPLE IL rule was that ADDR_EXPR expressions are
just address computations and you are not allowed to make
any assumptions about the address use based on the structure
of a component reference appearing inside it.  This very rule
made it possible to aggressively propagate into address-computations
but at the same time restricted propagation into dereferences.

That str* argument ADDR_EXPRs now carry semantic value (same
issue applies to the personally "much loved" __builtin_object_size)
is disturbing.  This basically means we may not propagate
into address computations as much as we do?

That is, enforcing C language rules to do more optimization
without evaluating what (invalid under those C language rules)
transforms GCC does itself lead to wrong-code bugs in the past.
Doing that for diagnostics only can only lead to false positive
diagnostics...
Comment 8 Martin Sebor 2018-06-26 17:53:27 UTC
I would expect the additional detail (about the structure of data) to only help improve things, not ever make them worse, or introduce bugs into correct code.

But relying on the structure of data is not novel -- the ability is an important part of the language, and GCC already takes advantage of it in some cases.  For instance, it eliminates the test below:

  struct S { char a[4]; int b; char c[1]; };

  void f (struct S *s, int i)
  {
    int c = s->c[0];
    s->a[i] = 'a';
    if (c != s->c[0])
      __builtin_abort ();
  }

even though calling f() with i set to offsetof (struct S, c) writes 'a' into s->c.

GCC doesn't yet eliminate the test when the write is by a built-in like memcpy() or strcpy() but there is no reason for it not to.