This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: array bounds checking?


Jim Wilson <wilson@cygnus.com> writes:

> 	Is this no longer proposed for the backend?  If not, I can probably do
> 	it for Fortran with some advice on writing the tree code, for which I
> 	couldn't find a suitable example.
> 
> It is probably more a question of when it will happen.  It is likely that
> this support will appear in the middle-end eventually, but it may not happen
> soon enough for your purposes.

Last year, I did a complete array and pointer bounds checking
implementation for C and C++ in gcc-2.7.2.  I have every intention of
getting this merged into gcc-2.8.x and/or egcs, but my paid work has
been placed unrelenting demands on my time.  It should be adaptable to
g77 as well.

I would be *very* grateful for some assistance with merging, cleanup
suggested by kenner, and testing on more targets.  So far, my only
runtime testing has been for i960 and i386.

Here's a detailed technical summary of what I've done:

------------------------------------------------------------------------------
	Technical Specification for
	Bounded Pointers in GCC

	by Greg McGary <gkm@gnu.ai.mit.edu>
	revised April 8, 1997

In the following discussion, `gcc' shall refer to both the GNU C
compiler and the GNU C++ compiler.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>>>

* Fine grained bounds checking of pointer, array and object references
  of all storage classes: static, heap and automatic.

* Runtime overhead of 50-75% and space overhead of 50-75%: This should
  be small enough that programmers will find it acceptable to enable
  bounded-pointers during development, only disabling them for
  production releases.

* Escape mechanism to override defaults and explicitly enable or
  disable checking of specific types or declarations.

* Design & implementation acceptable as a permanent feature of gcc.

* Ability to mix checked and unchecked object code without
  recompilation: The option `--(un)bounded-libraries' allows you to
  tell gcc what default boundedness is possessed by pointers whose
  declarations appear system header files have, and whose machine
  representations occur in system libraries.  This allows you to
  conveniently mix bounds-checked application code with unchecked
  libraries.

  Bounded pointers can't do anything to patch the unchecked objects
  the way Purify does, but a program compiled with bounded pointers
  can at least inter-operate with code compiled without.  However,
  bounded pointers offer functionality that Purify can only dream
  about, since Purify is constrained by a commercial business model
  wherein it must operate on code available in object-form only.
  Since the GNU system has no such constraint, we are free to
  implement bounds checking a better way, and since the GNU system
  has a complete C library, it's possible to build applications that
  contain 100% bounds-checked code.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Non-Goals:
>>>>>>>>>>>>>>>>

Bounded pointers are no help for the following problems.  Bounded
pointers are just one piece, albeit a major one, in a complete
system for automatic detection of memory usage bugs.

* Detection of memory leaks for heap-allocated memory.

* Detection of the use of uninitialized memory:  References through a
  valid pointer will not detect the (un)initialized condition of the
  referent object.  However, references through NULL pointers can be
  caught by bounded pointers, and if gcc is optionally instructed to
  initialize all otherwise uninitialized pointer variables with
  automatic storage class, some level of uninitialized variable
  checking is possible.

* Detection of the use of stale or dangling pointers: A bounded
  pointer appears valid if its value falls within the interval
  [base..extent).  However, the referent object might be an automatic
  variable that's no longer in scope, or a heap variable that's been
  released.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Bounded Pointers with Checks Prior to Dereference:
>>>>>>>>>>>>>>>>

All pointer objects (unless explicitly overridden), comprise three
words: the pointer, a base and an extent.  The base is lowest valid
address value that the pointer may assume, while the extent is the
lowest address above the highest valid address value.

Dereferencing a pointer causes gcc to generate code to check the
pointer against its base & extent, and to call an error-function if
the bounds are violated.  (An alternate bounds-checking approach is to
check the results of all pointer arithmetic.  The drawback here is
that loops commonly auto-increment past the extent of an array in the
loop-exit test.)

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Error functions:
>>>>>>>>>>>>>>>>

(Note: the design of this facility is not yet final, and will change)

The default error function is `abort'.  The user may override this
with a custom error function.  If a function with the name
`__array_bounds_violation' has been declared at the time gcc generates
bounds checks for an array, it will be called instead of `abort'.  If
a function with the name `__pointer_bounds_violation' has been
declared at the time gcc generates bounds checks for a pointer, it
will be called instead of `abort'.

The function `__array_bounds_violation' accepts zero or more
arguments from the following list:

    char const *file_name;
    int line_number;
    char const *array_name;
    int upper_bound;
    char const *index_name;
    int bad_index;

The function `__pointer_bounds_violation' accepts zero or more
arguments from the following list:

    char const *file_name;
    int line_number;
    char const *pointer_name;
    int upper_bound;
    int bad_index;

Gcc looks for these arguments by name and in this order, so the
user-declared argument names and types *must* match those in this
list, and appear in this order, though unwanted arguments may be
freely omitted.

Here is an example definition of an array error function that accepts
all possible arguments:

    void
    __array_bounds_violation (char const *file_name, int line_number,
			      char const *array_name, int upper_bound,
			      char const *index_name, int bad_index)
    {
      printf ("%s:%d: array bounds violation: %s[%s == %d] bounds: 0..%d\n",
	      file_name, line_number, array_name,
	      index_name, bad_index, upper_bound);
    }

Here is an example definition of an array error function that accepts
all possible arguments:

    void
    __pointer_bounds_violation (char const *file_name, int line_number,
				char const *pointer_name,
				int upper_bound, int bad_index)
    {
      printf ("%s:%d: pointer bounds violation: %s, index: %d, bounds: 0..%d\n",
	      file_name, line_number, pointer_name, bad_index, upper_bound);
    }

Here are example functions that only report filename and line number:

    void
    __array_bounds_violation (char const *file_name, int line_number)
    {
      printf ("%s:%d: array bounds violation\n", file_name, line_number);
    }

    void
    __pointer_bounds_violation (char const *file_name, int line_number)
    {
      printf ("%s:%d: array bounds violation\n", file_name, line_number);
    }

In all cases, these functions return without causing the program to
crash, so potentially many violations may be reported in one run.
If you wish to cause a crash at the point of a violation, simply
add a call to `abort' after printing the message.


The type of the inner bounded object determines which of these is
called, *not* the syntax of the expression in which it is used.  If
the inner bounded object is a pointer, then we use
`__pointer_bounds_violation', if it's an array, we use
`__array_bounds_violation'.  e.g.,

    char array[10]
    char *pointer = array + 5;

    array[i] = 1;	/* calls __array_bounds_violation */
    pointer[i] = 2;	/* calls __pointer_bounds_violation */
    *(array + i) = 3;	/* calls __array_bounds_violation */
    *(pointer + i) = 4;	/* calls __pointer_bounds_violation */
    
For `__pointer_bounds_violation', the value of `bad_index' is computed
as (__ptrvalue__ (p) - __ptrbase__ (p)), and the value of
`upper_bound' is computed as (__ptrextent__ (p) - __ptrbase__ (p) - 1)
(dereferencing a NULL pointer will show an upper_bound of -1).  The
value of `pointer_name' is the expression that's dereferenced, e.g.,
"(pointer+i)" for *both* of the above example assignments using
`pointer'.

The most convenient way of ensuring that gcc sees the declarations of
your reporting function for all compilation units, to put them into a
header file, then use gcc's `-include <file>' command-line option.
This relieves you of the burden of changing the source code.

Caveats:

* As is also the case with compiler syntax errors, it is possible for
  bounds violations to cascade, or for one violation to be reported
  many times.

* The above feature is no substitute for a symbolic debugger.  The
  reporting functions above only indicate the source code location of
  the topmost frame of the call stack.  It is often necessary to
  inspect the complete program state in order to diagnose the cause of
  the bounds violation and identify the best fix.

* The verbose reporting functions can be very expensive in text space
  overhead, which is proportional to the number of arguments passed.
  If you're already in the habit of using gdb to debug bounds violations,
  you're probably better off just using the default `abort'.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Other Checks:
>>>>>>>>>>>>>>>>

Pointer subtraction could cause gcc to generate checks that the
operands refer to the same object.  Low-level systems code should
probably disable this form of checking by operating on unbounded
pointers.

Use of uninitialized pointers and the NULL pointer can be trapped by
automatically by initializing the bounded-pointer object to a value
that is guaranteed to trigger a bounds violation upon dereference.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Bounded Pointer Declarations:
>>>>>>>>>>>>>>>>

The bounded or unbounded state of a pointer is determined by a
type-qualifier of either `__bounded__' or `__unbounded__'.  With the
`-fno-bounded-pointers' switch (the gcc default), in the absence of an
explicit qualifier, a pointer decl is considered `__unbounded__'.  With
the `-fbounded-pointers' switch, in the absence of an explicit
qualifier, a pointer decl is considered `__bounded__'.  A pointer
qualified implicitly or explicitly as `__bounded__' occupies three
words of storage for the pointer itself, a base and an extent.  A
pointer qualified implicitly or explicitly as `__unbounded__' occupies
a single word, as is traditional for C pointers.

If a file is compiled with `-fno-bounded-pointers' *and* contains no
explicit `__bounded__' qualifiers, then the resultant object will be
the same as if it were compiled with a version of gcc without this
feature.  `-fno-bounded-pointers' will *not* disable the effect of an
explicit `__bounded__' qualifier.  If the user desires that behavior,
s/he should include add `-D__bounded__= -D__unbounded__=' to the gcc
command-line to elide the qualifiers.

Header files included from specific directories can be designated as
containing pointer declarations that are bounded or unbounded.

    --unbounded-includes DIRECTORY
    -iunbounded DIRECTORY
	When processing an include file whose directory prefix matches
	DIRECTORY, make the pointer declarations unbounded by default.

    --bounded-includes DIRECTORY
    -ibounded DIRECTORY
	When processing an include file whose directory prefix matches
	DIRECTORY, make the pointer declarations bounded by default.

    --unbounded-standard-includes
    -iunboundedstdinc
	When processing an include file whose directory prefix matches
	one of the standard system include directories, make the
	pointer declarations unbounded by default.

    --bounded-standard-includes
    -iboundedstdinc
	When processing an include file whose directory prefix matches
	one of the standard system include directories, make the
	pointer declarations bounded by default.

Type qualifiers represent fine-grained control over the boundedness of
pointers.  Command-line switches represent coarse-grained control.
Intermediate-grained control is provided by the following syntax:

    __bounded__ {
	/* pointer declarations are bounded by default
	   within this block. */
    }

    __unbounded__ {
	/* pointer declarations are unbounded by default
	   within this block. */
    }


The __bounded__ and __unbounded__ keywords may be regarded as prefix
operators that apply to blocks.  This syntax may also appear at
toplevel (file scope).

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Runtime pointer boundary checks:
>>>>>>>>>>>>>>>>

The command-line switch `-fcheck-bounds' controls generation of
runtime code to perform bounds checks on bounded pointers.  The effect
of this switch is orthogonal to `-fbounded-pointers'.
`-fbounded-pointers' controls how much space (one word or three words)
is occupied by pointers passed as function arguments, or stored in
memory with storage class static, whereas `-fcheck-bounds' controls
whether or not gcc generates runtime code to validate bounded
pointers.  When `-fcheck-bounds' is specified in conjunction with
`-fno-bounded-pointers', gcc treats pointer variables having automatic
storage-class as bounded, and those passed as arguments or having
static storage-class as unbounded.  This gives as much bounds checking
as can be accomplished without altering the static storage layout or
function calling conventions.  Generally, this means that all array
references and some pointer dereferences can be checked.

The following syntax offers finer grained control over generation of
runtime pointer boundary checks:

    __check_bounds__ {
	/* generate range checks for bounded pointers. */
    }

    __no_check_bounds__ {
	/* Don't generate range checks for bounded pointers. */
    }

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Selection of Bounded-Pointer Components:
>>>>>>>>>>>>>>>>

The builtin prefix operators `__ptrvalue__ <expr>', `__ptrbase__
<expr>', `__ptrextent__ <expr>', where <expr> is a pointer-valued
expression, extract the individual components of a bounded pointer.
These operators have the same syntax and precedence as `sizeof'.

Applied to an unbounded pointer, `__ptrbase__' returns UNKNOWN_BASE and
`__ptrextent__' returns UNKNOWN_EXTENT, in other words, the most permissive range
of values possible.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Optimizations:
>>>>>>>>>>>>>>>>

These two optimizations are essential to meeting both the time and
space performance goals:

(1) It is unnecessary to recheck bounds of a pointer that remains
    unchanged since the last check.  The gcc back-end must be enhanced
    to eliminate the redundant checks.

(2) If a bounded-pointer loop-induction variable increments
    monotonically the check against the base need be performed only
    once at the top of the loop.  Similarly, if such a variable
    decrements monotonically, the check against the extent need only
    be performed once.

    Even better, if gcc can statically determine the range of values a
    bounded-pointer loop-induction variable will assume, it need only
    check the variable's minimum value against the base, and the
    variable's maximum value against the extent, performing both
    checks once before the loop.

    Since these are generally valid loop optimizations, they should be
    implemented in a general way.

This third item is not an optimization per se, but rather describes
storage allocation policy that allows the optimizer to function on
bounded-pointer variables with automatic storage class:

(3) The three components of a bounded-pointer triple (pointer, base,
    extent) should be assigned to registers whenever registers are
    available.  Although it is tempting to think of a bounded pointer
    object as a three-word structure, there is no requirement that the
    three components of a bounded-pointer variable with automatic
    storage class occupy contiguous stack slots, unless the address of
    the variable is taken, in which case it must be laid out
    contiguously as (pointer, base, extent).  If in registers, there
    should be no requirement that they occupy contiguous registers
    (although there is an advantage to doing so on architectures that
    can move, load or store multiple registers with a single
    instruction), nor should there be any restriction against
    assigning some components to registers and others to stack slots.
    As an automatic variable, unless its address is taken, a
    bounded-pointer may be considered to be individually declared
    variables whose register assignment is done independently.
    Bounded pointers passed as arguments must be assigned contiguous
    positions in the proper order under the target's argument passing
    convention, either as contiguous registers or as contiguous stack
    slots (or possibly a mixture, if the three components of the
    argument straddle the boundary between args passed in registers
    and those passed on the stack).  Note that gcc already handles
    automatic storage for the two components (real & imaginary) of its
    builtin complex type in this way, using the CONCAT RTL expression
    to hold the two components.  Bounded pointers employ the CONCAT3
    RTL expression type.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> External Arrays:
>>>>>>>>>>>>>>>>

When arrays are declared `extern' with array-bounds omitted, a
compilation unit has no way of knowing the bounds of the array at
compile time.  In this case, gcc emits the extent as a SYMBOL_REF to a
synthetic symbol called `foo.ext' (for an array named `foo').  If such
a reference is generated for an array defined in an object not
compiled with `-fbounded-pointers' or `-fcheck-bounds', then this
reference will be undefined.  In this case, if the GNU ld (or collect)
can be modified to synthesize the definition, or set it to
UNKNOWN_EXTENT if for some reason it can't determine the size of the
array.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Incomplete Structures:
>>>>>>>>>>>>>>>>

GNU C allows variables of incomplete structure type to be declared
extern.  E.g., the following from libio.h:

    extern struct _IO_FILE_plus _IO_stdin_;
    #define _IO_stdin ((_IO_FILE*)(&_IO_stdin_))

Gcc can't determine the extent of _IO_stdin for ordinary user programs
whose compilation units do not include the declaration of this
structure.  This is currently an unsolved problem.  A probable
solution is to mark the extent of all public static-storage-class
structs with a synthetic symbol exactly as is now done for arrays, as
described above.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Variably Sized Arrays in Structs:
>>>>>>>>>>>>>>>>

A common technique for coding a fixed header followed by a variably
sized body is to declare a structure such as this:

    struct foo
    {
      int fixed1;
      int fixed2;
      char name[1];
    };

Where `name' is the variably sized member.  Gcc even allows the size
of such an array to be specified as 0.  Such structures are always
dynamically allocated.  In this case, the extent of the variably sized
array is the same as the extent of the enclosing structure.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Function Pointers:
>>>>>>>>>>>>>>>>

To save space, pointers to functions are `__unbounded__' by default,
since programs don't generally perform arithmetic on them.
`-Wpointer-arith' will give compiler warnings about arithmetic on
function pointers, and so long one adheres to the the discipline of
avoiding casts of function pointers to non-function-pointer types,
that should be sufficient to ensure safety in this context.
When a function is called through a pointer, the pointer value will be
implicitly checked against the range [start, etext).  Calls through
trampolines will have such checks disabled.

	Explicit Casts:

By default, casts have no effect on the bounds of a pointer.
E.g.:

    char chars[10];		/* bounds are [chars..chars+10): 10 bytes */
    int *icp = (int *)chars;	/* bounds remain [chars..chars+10): 10 bytes */
    int i;			/* bounds are [&i..&i+1): 4 bytes */
    char *cip = (char *)&i	/* bounds remain [&i..&i+1): 4 bytes */

If the programmer desires a cast to reset bounds, the __bounded__
qualifier must be applied to the pointer type in the cast expression.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Heap Storage Allocation (Malloc):
>>>>>>>>>>>>>>>>

A special bounds-aware malloc and C++ `new' operator returns
bounded pointers with a base & extent covering the allocated region.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Stack Storage Allocation (alloca):
>>>>>>>>>>>>>>>>

Just as with malloc, a bounded-pointer aware alloca returns bounded
pointers.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> UNKNOWN:
>>>>>>>>>>>>>>>>

A special bit-patterns represent unknown values of base & extent.  The
value of UNKNOWN_BASE is generally 0, and the UNKNOWN_EXTENT value is
generally ~0 (all 1s).  The values are chosen to yield the maximally
permissive bounds.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> String and memory library functions:
>>>>>>>>>>>>>>>>

String & mem functions and any other library primitives that must be
fast can be coded so that the boundary checks are performed once at
the beginning, followed by the real work which is done using simple
unbounded pointers, e.g.:

    char *
    strcpy (char *dest, char const *src)
    {
      char const *dest_0 = dest;
    #if BOUNDED_POINTERS
      /* strlen will check bounds of src */
      char const *__unbounded__ dest_end = dest + 1 + strlen (src);
      if ((__ptrvalue__ dest) < (__ptrbase__ dest)
	  || (dest_end >= (__ptrextent__ dest)))
	abort ();
    #endif
      do
	*(__ptrvalue__ dest)++ = *(__ptrvalue__ src);
      while (*(__ptrvalue__ src)++);
      return dest_0;
    }

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> argv and environ arrays:
>>>>>>>>>>>>>>>>

The program argument string vector (argv) and environment string
vector (environ) come from the OS as arrays of unbounded pointers.
The interim solution is to apply explcit __unbounded__ qualifiers
to both of these declarations, e.g.:

    char *__unbounded__ *__unbounded__ argv;

This approach, while expedient, has the disadvantages of requiring
source code changes to all programs and creates an unnecessary
bounds-checking loophole.  A better solution is to recreate both
vectors with bounded-pointers very early in the startup sequence.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mixing Bounded & Unbounded Pointers at Source Level:
>>>>>>>>>>>>>>>>

Semantics of assignment & argument passing for mixed bounded and
unbounded pointers is as follows:

lhs (or param)	rhs (or arg)	semantics
--------------	------------	---------
bounded		unbounded	lhs of base & extent set to UNKNOWN
unbounded	bounded		base & extent of rhs are discarded

Two types must have identical __unbounded__ qualifiers at all but the
outermost pointer level in order to be compatible.  e.g.,

    char *__unbounded__ *a;
    char *__unbounded__ *__unbounded__ b;
    char **__unbounded__ c;

`a' and `b' are compatible, but `c' is compatible with neither `a' nor
`b'.  The incompatibility arises from the different sizes of bounded
and unbounded pointer objects.

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mixing Compiled Modules, some with Bounded
>>>>>>>>>>>>>>>> Pointers and others with Unbounded Pointers:
>>>>>>>>>>>>>>>>

As long as the module interface declarations are compiled with
boundedness type-qualifiers that match the module's object code, all
should be well, e.g.:

    /* My libc doesn't grok bounded pointers 8^( */
    extern int strcmp (char const *__unbounded__ x, char const *__unbounded__ y);

    void
    my_strcmp (char const *x, char const *y)
    {
      return strcmp (x, y);
    }

The above example is contrived.  Here's another contrived example:

    /* My libc doesn't grok bounded pointers 8^( */
    __unbounded__ {
    #include <string.h>
    #include <stdlib.h>
    #include <stdio.h>
    }

    ...

The easiest way to handle the problem of properly declaring standard
system library interfaces whose pointers are unbounded is to use the
option `--unbounded-standard-includes' (also known as `-iunboundedstdinc').

>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation Status:
>>>>>>>>>>>>>>>>

As of the first week of March, 1997, the full bounded pointer
implementation for C is 95% complete, though bugs surely remain.
glibc has been ported to bounded-pointers for Linux on the x86, which
involved adding bounded-pointer checks to the top of assorted
assembler-language string functions, and putting wrappers around the
system call entry-points to check the bounds of pointer arguments and
convert the bounded pointers to unbounded prior to trapping to the OS.

glibc-2.0.1 successfully compiles with `-fbounded-pointers', several
utilities from textutils (md5sum, fmt and sort) run to completion for
the small number of test cases tried so far.  The next step is to test
more GNU packages compiled with `-fbounded-pointers' and linked with
the bounded-pointer version of glibc.

For the C++ front-end, only `-fcheck-bounds' is complete and tested.
Much of the support for bounded-pointers is present in the C++
front-end and back-end but is still incomplete and untested.

Remaining tasks are:

* Add bounded-pointer support to libstdc++.
* Implement optimizations to eliminate redundant checks.
* Test, debug, test, debug, test, debug...


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]