This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

My project


Dear all,
a few weeks ago I began adding a few new extension to gcc, just for
fun.  Since now it seems that this project could actually see the
light, I believe it is time to tell you about it.

I will first give you a fast overview of my project, then I will fill
in some details and, finally, I will tell you about the current
project status togheter with my future plans and previsions.

Please note: I used markers "*", "**",... to mark the beginning of
sections, subsections, and so on... ==> You can read this text using
emacs outline mode.

* Overview

My goal is to add to C a couple of constructs which will help the
programmer to avoid two quite common pitfalls which are often
exploited in attacks, that is,

  - buffer overflows

  - passing to "critical" functions (exec, printf, ...)
    parameters which are controlled by the user

A possible extension of this project is the construction of few new
library functions which would help in constructing more secure
software (e.g., a popen()-like function which does not invoke the
shell, a checked_fopen() function which avoids race conditions, and so
on...)  This, of course, does not require to modify the compiler and I
will not further elaborate about it here.

** Buffer overflows

Of course a possible solution to the buffer overflows problem is one
of the "bounds-checking" patches that are currently being developed,
but, as I understand, the added checks can sensibly affect the program
performance.  In order to make less severe such a loss of performance,
I added a new control structure.  Let me introduce it with an example

    void
    sum_vect (double *res, double *a, double *b, int dim)
    {
      int n;

      loop (n=0:dim-1 ;; dst ?- res [[n]] ; A ?- a [[n]] ; B ?- b [[n]])

        {
          dst = A+B;
        }
    }

I know, the syntax is quite ugly, but I will change it (if I can think
something better...:-)

In the code the loop, as you can guess, is executed `dim' times with n
assuming the integer values between 0 and dim-1.  "Variables" `dst',
`A' and `B' declared inside the loop structure are called "loop
macros".  For example, with "dst ?- res[[n]]" I say to the compiler
that every instance of "dst" in this loop must be replaced by
"res[n]".  The advantage of this structure is that the compiler knows
in advance which elements will be accessed and it can check the bounds
before the loop starts.  Inside the loop the elements of `res', `a'
and `b' will be accessed without any check, saving time.

The `loop' structure has few other features, including the possibility
of writing a "bounded while", which will be illustrated in the
"Details" section (together with an explanation of why I am using the
weird "double brackets" `[[...]]').

** Critical parameters

Some functions of the C library are quite critical from a security
point of view since if some of their parameters are controlled by the
user, a security hole can be introduced.  The most trivial example is
the system() function, a less obvious example are the functions in the
printf() family which could be used to mount an attack if the `format'
parameter is controlled by the user.  Although the printf() case is
handled by another gcc patch, my proposal introduces a more general
approach to the problem.

In order to avoid this type of problem, I introduced the new keyword
`trusted'.  Let me explain with an example.  If one declare the
printf() prototype as follows

  printf(trusted char *fmt, ...)

the compiler will know that the first parameter of the printf() must
be a "trusted" value.  What is a trusted value?  It is a value which
cannot be influenced by the user; for example,

       * a constant is a trusted value

       * a pure function of trusted values is trusted

For example, with this scheme I cannot say

     int main(int argc, char **argv)
      {
        ...
 printf(argv[1], ...)
      }

since the values in `argv' are not trusted, but I can write

      printf("%d", ...)

since a constant is a trusted value.  The keyword `trusted' can also
be used to declare trusted variables as in

      trusted char s[128];

In this case I can assign to `s' only trusted values and I can use

      printf(s, ...)

since I know that `s' will hold a trusted value, but I cannot use

      scanf("%s", s)

since scanf() could write an untrusted value in s.

I can also declare a trusted function as in

      trusted char *f(...)

With such a declaration I know that f() will return only trusted
values. The keyword `trusted' can also be used in a cast-like
statement in order to enforce trustiness.  For example, in

    char *s;
    ...
    scanf("%s", s);
    if (check_format (s))
       printf((trusted) s, ...);

the function `check_format (s)' returns true if `s' can be used as a
format of printf().  If `s' is a good format string, it is declared
trusted and passed to printf().

* Details

** `loop' structure

The syntax of the loop structure is the following

   ?loop> := 'loop' '(' ?index_list> [';;' ?macro_list> [';;' ?cond>]]
')'

   ?index_list> := ?single_index> | ?index_list> ';' ?single_index>
   ?single_index> := ?idx> '=' ?start> ':' [ ?step> ':' ] ?end>
   ?idx> := IDENTIFIER
   ?start> := ?expr>
   ?step>  := ?expr>
   ?end>   := ?expr>

   ?macro_list> := ?single_macro> | ?macro_list> ; ?single_macro>
   ?single_macro> := ?macro> '?-' ?base> '[[' ?delta> ']]'
   ?macro> := IDENTIFIER
   ?base> := ?expr>
   ?delta> := ?expr>

   ?cond> := ?expr>

*** ?index_list>

The ?index_list> part is a list of expressions like `i=0:N' or
`i=0:a+b:2*N-1'.  In the first case, the variable `i' assumes the
values between 0 and N (inclusive), in the second case `i' is
incremented of a+b at each iteration.  (If you think that this
notation it smells a bit like Matlab, you are right... ;-)

Observe that a construct like

 loop (i = 1:N; j = 1:N ;; ...)

is legal and it corresponds to having two nested loops.  Two indices
which appears in the same `loop' structure will be called "twin
indices".  Because of reasons that will become clear in the following,
the `start', `step' and `end' values of an index cannot depend on the
value of a twin index of its.  For example, a construct like

        loop (i = 1:N ; j = i:N ;; ...)

is illegal, since the start value of `j' depends on the twin index
`i'.  In some sense, you should be able to change the order of twin
indices without changing the behavior of the program.

A peculiar characteristic of the loop structure is that the `start',
`step' and `end' values are computed before the loop starts and then
"frozen".  For example, the following loop

    N=5;
    loop (i=1:N)
      {
         N=10;
  ....
      }

executes 5 times.  This is necessary if the bound check is to be done
at the beginning of the loop.  Consider, for example, the following
code

 int a[8], b[8];
 N=5;
 loop (i=1:N ;; src ?- a[[i]]; dst ?- b[[i]])
   {
     N=10;
     dst=src;
   }

If the `end' value was not frozen at the beginning of the loop, one
would have an undetected buffer overflow.

*** ?macro_list>

An example of a declaration in the ?macro_list> is

       src ?- buf[[2*n+k]]

where `src' is the "macro", `buf' is the "macro base" and `2*n+k' is
the "macro delta".  The reason for using the double brackets `[[...]]'
is that this syntax makes easier to separate the base from the delta.
In a future I plan to remove this by writing a procedure which
analyzes expressions like `buf[2*n+k]' and
`foo->buf[strlen(c)].s[2*k+n]'.  Observe that in the last expression
`foo->buf[strlen(c)].s' is the base and `2*k+n' is the delta.

A special condition on the delta is that it must be a _linear_
function of the loop index, that is, a construct like

        loop (i=1:N; j=1:M ;; src ?- c[[ (i-1) + N*(j-1)]])

is legal, but

        loop (i=1:N; j=1:M ;; src ?- c[[ i*j ]])

is not.  The reason for imposing such a restriction is that we want to
check for bound violations before the loop start.  If the delta is a
linear function of the indices, it can be brought in the form

        k_0 + k_1*i_1 + ... + k_N*i_N                     (*)

where k_0...k_N are constant values and i_1...i_N are loop indices.
The values of constants k_i, as the `start', `step' and `end' values,
are computed at the beginning of the loop and then frozen.  For
example, code

   int a[10];

   K=2;
   loop (i=0:4 ;; A ?- a[[i*K]])
    {
       printf("%d", A);
       K=3;
    }

will print the values a[0], a[2], a[4], a[6] and a[8].  If we did not
freeze K, the last iteration would try to access a[12].

When delta is in the form (*) it is quite easy to compute the maximum
and the minimum value it will assume during the loop by replacing each
i_n with its start or end value, depending on sign of k_n.  Note that
if we allowed deltas like

 2 + i_1*i_2 + i_3*i_2

or, even worse,

  30*sin(i_1*2*pi)

it would be very difficult to compute the extreme values of delta.

The motivation of requiring the independence of twin indices is for
checking more efficiently the bounds.  Consider the following code for
matrix multiplication

       loop (i=0:N-1)
         {
    loop(j=0:M-1 ;; dst ?- a[[i+j*N]])
      {
        acc = 0.0;

        loop (k=0:L-1 ;; s1 ?- b[[i+k*N]] ; s2 ?- c[[k+j*L]])
          acc += s1*s2;

          dst = acc;
              }
         }

In this code the check for `a[[i+j*N]]' is repeated N times and the
checks for `b[[i+k*N]]' and `c[[k+j*L]]' N*M times.  However, if the
same code is written by using twin indices

     loop (i=0:N*M-1 ;; dst ?- a[[i]])
       dst=0.0;

     loop (i=0:N-1 ; j=0:M-1 ; k=0:L-1 ;;
           dst ?- a[[i+j*N]] ; s1 ?- b[[i+k*N]] ; s2 ?- c[[k+j*L]]))
        dst += s1*s2;

the checks for dst, s1 and s2 can be done only once, since the values
assumed by each index are independent from the values of its twins.

Note that index independence is necessary in order to optimize the
checks; consider, for example, the following convolution code

  loop(m=0:L-1 ;; dst ?-res[[m]])
    {
      acc = 0.0;

      first = (1-M+m > 0) ? (1-M+m) : 0;
      last = (N-1 ? m) ? (N-1) : m;
      loop(n=first:last ;; src1 ?- a[[n]] ; src2 ?- b[[m-n]])
   acc += src1*src2;

      dst = acc;
    }

in this case the indices `n' and `m' are not independent, since the
`start' and `end' values of `n' depend on `m'.  In this case it is not
possible to do checks for `src1' and `src2' at the beginning of the
first loop, since we cannot predict the values that `n' will assume.

In the case of the matrix product without twin indices it should be
possible to check for index independence and move the checks outside
the first loop, but this requires the analysis of the loop code after
the _whole_ more external loop has been compiled, since someone could
write something like

   N=5;
   M=8;
   loop(i=1:N)
     {
       loop(j=1:M ;; a ?- b[[i+j]])   /* (1) */
         {
           ...
         }

       M += i;
     }

When the compiler arrives at the line (1), `M' seems to be independent
from `i', but it is not so.

*** ?cond>

The third part of the `loop' construct allows one to write a "bounded
while", that is, a loop while-like (i.e. a loop that ends when a given
condition becomes false), but that will be executed at most a bounded
number of times.  Actually, the construction

    loop (... ;; ... ;; ?cond>)
      {
        ...
             }

is equivalent to

    loop (... ;; ...)
      {
        if (! ?cond>) break;
        ...
             }

The bounded loop construction can be used to write, for example, the
strncpy() function

    char *strncpy(char *dest, const char *src, size_t n)
      {
        int i;
 loop (i=0:n-2 ;; d ?- dest[[i]] ; s ?- src[[i] ;; d=s);
 d[n-1]='\0';
      }

In this case the compiler knows that no elements outside the range
src[0]...src[n-2] and dst[0]...dst[n-2] will be accessed and it can do
the check before the loop starts.

** `trusted' values

*** Trusted pointers

A peculiar characteristic of trusted pointers is that they cannot be
assigned to untrusted variables.  An example should help to clarify
why

     trusted char t[128];
     char *s;

     s=t;
     strncpy(s, argv[1], 128);

It is clear that assigning the address of a trusted variable to an
untrusted one, it is possible to bypass the compiler checks.  By the
same token, one cannot pass a trusted value by address to a function
unless  the parameter is declared `trusted'; for example, the code

    trusted char t[128];
    scanf("%s", t);

is illegal because the second parameter of scanf() is not trusted.  It
is worth emphasizing that if one declares

     void f(trusted char *p)

it will not be possible to write untrusted values in p.  A problem
with the previous declaration is that it will be impossible to pass
untrusted values to f() even if the value of `p' is not critical.  A
possible solution is to declare `p' const.  Since a const value will
not be written it should be safe to pass a pointer to trusted to a
const parameter.

If one wants to be paranoid, this would weaken the compiler controls
since one could write something like

  void f(const char *p)
    {
      char *q = p;

      scanf ("%s", q);
    }

but it would be quite perverted... (note that this bypasses also the
controls on `const')

*** Reliable functions

As anticipated, a "pure" function of trusted values is a trusted
value.  Because of this, one can write

 char *f(char *s) __const__;

        trusted char *t;

 t=f("foo");

since we know that the value returned by f() depends only on its
argument.  However, sometimes the requirement of being __const__ could
be too strong.  For example, suppose we want to write an
interpreter-like program which read commands from the standard input
and execute them, maybe by calling some external program.  In order to
know which program to call, the interpreter looks in a file (belonging
to root) with lines of type

  ?com>  ?external program>

Let search(char *s) be the function which searches `s' in such a file
and construct a command to be passed to system().

The function search() is clearly not a pure function, since it reads
an external file, but we would like to declare that the return value
of search() is trusted if its input value is trusted.  We can achieve
such a goal by declaring search() reliable, that is,

 reliable char *search (char *s)

In other words, the keyword `reliable' means that one can trust the
return value of search(s) if `s' is trusted, even if search() is not a
`const' function.


*** Reliable parameters

Remember that we cannot pass a trusted pointer to an untrusted
parameter unless the parameter is declared const.  A different
solution to this problem is the introduction of the concept of
"reliable parameter".  If we write

     void f(reliable char *p)

we declare that it is safe to pass a trusted address to f() since the
memory pointed by p will not be written _or_ it will be written only
with trusted values.  For example, one could write a "trusted_scanf"
as follows

     trusted_scan (reliable char *s)
       {
         char buf[1024];
  int i;

  scanf("%s", buf);
  if (is_ok (buf))
           for (i=0; s[i]=(trusted) buf[i]; i++);
         else
    abort ();
       }

Note the cast to `trusted' is necessary to write s[i], since s is
declared reliable.

This example raises a question: how can we copy a trusted string into
a trusted variable by using strcpy()?   If we declare it as

         char *strcpy(char *dest, const char *src);

we cannot pass a trusted pointer as dest, we cannot (obviously)
declare dest `const' and if we declare it `reliable'

         char *strcpy(reliable char *dest, const char *src);

we could do something like

         trusted s[128];

  strcpy(s, argv[1]);

In order to avoid this we will say that the value written in a
reliable parameter is granted to be trusted __if_all_the_function
parameters_are_trusted__.

In some sense, this is the same idea of a reliable function: if a
pointer parameter is not const, it is reasonable to assume that the
function will use it as an output parameter and an output declared
reliable is trusted only if the function parameters are trusted.


*** Implementation

The `trusted' concept could be implemented only at the syntax level,
if there was not a little problem...  Consider the following code

   trusted char *s, *t;
   int i;

   for (i=0; i?10; i++)
     s[i] = t[i];

This code is illegal because `t[i]' is a function of `t' (which is
trusted) and `i' (which is not trusted).  Therefore, the value `t[i]'
is not trusted and it cannot be assigned to s[i].  In order to avoid
this problem one could declare `i' trusted or cast t[i] to trusted.
Both solution, if used often, would make the code quite cumbersome.

It is worth observing that it is unfair to say that t[i] is untrusted,
since `i' is not controlled by the user (it assumes only the values in
the range 0...9).  This suggests of classifying the values used in a
program in three broad classes

 - "trusted values"
      i.e., values that we can trust, like constants and
      variables declared trusted

        - "untrusted values"
      i.e., values that we cannot trust. Examples:

         - values returned by untrusted/unreliable functions
  - global variables not declared trusted
  - values read from the external environment e.g. argv[1]

        - "maybe-trusted values"
      i.e., values like the `i' variable in the previous
      example.

If we do not introduce the maybe-trusted class, the best choice is the
paranoid one and we would have to consider every maybe-trusted value
as untrusted.  However, if we want to implement the maybe-trusted
class we need to do some analysis on the code in order to check if
untrusted values are assigned to maybe-trusted variables.

Here it is a fast sketch of how this check can be implemented.
Consider the following code

  trusted int N;
  int M;

  main (int argc, char **argv)
    {
      int i, j, k;

      j = 10;
      k = 2*j+N;  /* (1) */

      i = strlen (argv[1]);
      k += j+i;   /* (2) */
      j += M;
      if (N%2)
        i = (trusted) j;
      else
        i = M;

      /* (3) */
    }

The code uses 6 variables: N, M, argv, i, j and k (argc is not used).
Associate with such a code a (column) vector of 6 boolean values, each
value is associated to a variable and it is true if the variable is
untrusted.  For example, before the execution of line (1) the vector
will contain

     [ F  T   T    T  F  T ]^t
       N  M  argv  i  j  k

Observe that N is trusted because it was declared trusted and that M
and argv are untrusted because they come from "outside" the function
main().

Suppose now that a statement is of the form

     y = f(x, w, ..., z)

where f() is a pure function (if it is not, its return value is
untrusted).  It is clear that y is untrusted as soon as one parameter
of f is untrusted.  It is easy to see that the boolean value
associated to y can be computed as F^t V where F and V are both
boolean vector, F(i) is true if and only if the i-th variable is a
parameter of f() and V(i) is true iff the i-th variable is untrusted.
(It is clear that the product F^t V is computed as a normal matrix
product, but with multiplication and addition replaced by boolean AND
and OR).

For example, the product associated with the line (1) in the previous
example is

            [ T  F   F    F  T  F ]  [ F  T   T    T  F  T ]^t
       N  M  argv  i  j  k      N  M  argv  i  j  k

  F-vector  V-vector

It is easy to see that the result of such a product is `F', meaning
that `k' after the execution of (1) holds a trusted value.

As a second example, consider line (2); the corresponding product is

            [ F  F   F    T  T  T ]  [ F  T   T    T  F  F ]^t
       N  M  argv  i  j  k      N  M  argv  i  j  k

  F-vector  V-vector

Observe that `i' is untrusted since it holds the value strlen(argv[1])
and argv[1] is untrusted.  It is easy to see that the result of such a
product is `T' meaning that k now holds an untrusted value.

In general, it is possible to see that one can construct a graph whose
edges are labelled with boolean matrix describing how un-trustiness is
propagated.  By constructing something similar to the transitive
closure, one can determine if a given variable may hold at a given
time an untrusted value.

For example, at line (3) the variable `i' can contain a trusted value
(if N is odd) or an untrusted one (if N is even).  In this case, the
V-vector relative to line (3) will associate to `i' a `T' value since
it could happen that `i' contains an untrusted value.

* Current status

** `loop' structure

- At the moment the loop structure is recognized and (it seems)
  correctly compiled, but I must do some more extensive testing.

- The bound checks are not included yet because I would like to merge
  my code with one of the patches for bound checking and I have still
  to study them.

- The twin indices are still to be implemented, since I thought about
  them while writing this....

- The analysis code to check index independence is still to be written

** `trusted' values

At the moment the modified gcc recognizes the `trusted' and `reliable'
keywords and catches things like

      printf(argv[1], ...);

where a surely untrusted value is assigned to a trusted
variable/parameter, but it does not catch

             char *s=argv[1];
      printf(s, ...);

since during syntax analysis `s' is only "maybe-untrusted".  To catch
errors like this requires the V-vector analysis sketched before and
this is still to be implemented.

** "To do" things

- Merge with some other bound check patch

- Remove the weird double brackets

- Analysis to discover index independence

- Trustiness analysis

** Previsions

Well, as I told you at the beginning, I am doing this just for fun
during my spare time, so I cannot grant that I will ever finish this.
Anyhow, since it is really fun, I am quite optimist.

As you could see from my current status report, I have almost finished
what can I do during syntax analysis and my next planned step is to
dive myself into RTL gore details in order to do trustiness and
index-independence analysis.  (BTW, does anyone know if there is some
part of gcc source that I can reuse?)



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]