RFC: stack/heap collision vulnerability and mitigation with GCC

Mon Jun 19 17:07:00 GMT 2017

As some of you are likely aware, Qualys has just published fairly
detailed information on using stack/heap clashes as an attack vector.
Eric B, Michael M -- sorry I couldn't say more when I contacted you about
-fstack-check and some PPC specific stuff.  This has been under embargo
for the last month.

--

http://www.openwall.com/lists/oss-security/2017/06/19/1

Obviously various vulnerabilities pointed out in that advisory are being
mitigated, particularly those found within glibc.  But those are really
just scratching the surface of this issue.

At its core, this chained attack relies first upon using various
techniques to bring the stack and heap close together.  Then the
exploits rely on large stack allocations to "jump the guard".  Once the
guard has been jumped, the stack and heap have collided and all hell
breaks loose.

The "jump the guard" step can be mitigated with help from the compiler.
We just have to ensure that, as we allocate chunks of stack space, we
touch each allocated page.  That ensures that the guard page is hit.
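
To make the "jump the guard" step concrete, here is a toy model in C.  It
is only a sketch, not compiler output: struct stack_model, touch(),
alloc_unprobed() and alloc_probed() are names invented for this example,
and it assumes 4k pages and a single guard page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical model of a downward-growing stack with one guard page. */
struct stack_model {
    uintptr_t sp;        /* current stack pointer */
    uintptr_t guard_lo;  /* lowest address of the guard page */
    uintptr_t guard_hi;  /* one past the highest address of the guard page */
    bool faulted;        /* set when a touch lands inside the guard page */
};

/* Model a memory reference: it faults only if it hits the guard page. */
static void touch(struct stack_model *s, uintptr_t addr)
{
    if (addr >= s->guard_lo && addr < s->guard_hi)
        s->faulted = true;
}

/* Unprobed: drop sp by the whole size, then touch only the new top. */
static void alloc_unprobed(struct stack_model *s, size_t size)
{
    s->sp -= size;
    touch(s, s->sp);
}

/* Probed: allocate at most one page at a time and touch each chunk,
   stopping as soon as the guard page has been hit. */
static void alloc_probed(struct stack_model *s, size_t size)
{
    assert(s->guard_hi - s->guard_lo >= PAGE_SIZE);
    while (size > 0 && !s->faulted) {
        size_t step = size < PAGE_SIZE ? size : PAGE_SIZE;
        s->sp -= step;
        touch(s, s->sp);
        size -= step;
    }
}
```

In this model a single large allocation lands below the guard page without
ever referencing it, while the page-at-a-time version cannot move sp past
the guard without touching it first.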

This sounds a whole lot like -fstack-check and initially that's what
folks were hoping could be used to eliminate this class of problems.

--

Unfortunately, -fstack-check is actually not well suited for our purposes.

Some background.  -fstack-check was designed primarily for Ada's needs.
It assumes the whole program is compiled with -fstack-check and it is
designed to ensure there is enough stack space left so that if the
program hits the guard (say via infinite recursion) the program can
safely call into a signal handler and raise an exception.

To ensure there's always enough space to meet that design requirement,
-fstack-check probes stack space ahead of the actual need of the code.

The assumption that all code was compiled with -fstack-check allows for
elision of some stack probes as they are assumed to have been probed by
earlier callers in the call chain.  This elision is safe in an
environment where all callers use -fstack-check, but fatally flawed in a
mixed environment.

Most ports first probe by pages for whatever space is requested, then
after all probing is done, they actually allocate space.  This runs
afoul of valgrind in various unpleasant ways (including crashing
valgrind on two targets).

Only x86-linux currently uses a "moving sp" allocation and probing
strategy.  That is, it actually allocates space, then probes the space.

--

After much poking around I concluded that we really need to implement
allocation and probing via a "moving sp" strategy.  Probing into
unallocated areas runs afoul of valgrind, so that's a non-starter.

Allocating stack space, then probing the pages within the space is
vulnerable to async signal delivery between the allocation point and the
probe point.  If that occurs the signal handler could end up running on
a stack that has collided with the heap.

Ideally we would allocate and probe a page as an atomic unit (which is
feasible on PPC).  Where ISA restrictions prevent that, we allocate a
page, then probe it with a separate instruction.  The latter still
has a race, but the async signal would have to arrive in a single
instruction window.

A key point to remember is that you can never have an allocation
(potentially using more than one allocation site) which is larger than a
page without probing the page.

Furthermore, we cannot assume that earlier functions in the call stack
were compiled with stack checking enabled.  Thus we cannot make any
assumptions about what pages other functions in the call stack have
probed or not probed.

Finally, we need not ensure the ability to handle a signal at stack
overflow.  It is fine for the kernel to halt the process immediately if
it detects a reference to the guard page.

--

With all that in mind, we also want to be as efficient as possible and I
think we do pretty well on x86 and ppc.  On x86, the call instruction
itself stores into the stack, and on ppc the stack is only supposed to be
allocated via the store-with-base-register-modification instructions
which also store into *sp.

Those "implicit probes" allow us to greatly reduce the amount of probing
we do on those architectures.  If a function allocates less than a page
of space, no probing is needed -- this covers the vast majority of
functions.  Furthermore, if we allocate N pages + M bytes of residuals,
we need only explicitly probe the N pages, but not any of the residual
allocation.
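
As a rough illustration of that arithmetic (a sketch only: split_frame()
and ASSUMED_PAGE_SIZE are names made up for this example, and the 4k page
size is an assumption), the number of explicit probes for a static frame
on a target with implicit probes is just the number of whole pages:

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4k pages */

/* Split a frame into N whole pages plus M residual bytes.  With an
   implicit probe (the x86 call, the ppc store-with-update), only the
   N whole pages need an explicit probe; the residual needs none. */
static void split_frame(size_t frame_size, size_t *n_pages, size_t *residual)
{
    *n_pages = frame_size / ASSUMED_PAGE_SIZE;
    *residual = frame_size % ASSUMED_PAGE_SIZE;
    assert(*n_pages * ASSUMED_PAGE_SIZE + *residual == frame_size);
}

/* Explicit probes needed for a frame of the given size. */
static size_t explicit_probe_count(size_t frame_size)
{
    size_t n_pages, residual;
    split_frame(frame_size, &n_pages, &residual);
    return n_pages;
}
```

So a 200 byte frame needs no explicit probe, and a frame of three pages
plus 200 bytes needs three.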

On glibc, we end up creating probes in ~1.5% of the functions on those
two architectures.  We could probably do even better on PPC, but we
currently assume 4k pages which is overly conservative on that target.

aarch64 is significantly worse.  There are no implicit probes we can
exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
So we have to track the distance to the most recent probe and when that
distance grows too large, we have to emit a probe.  Of course we have to
make worst case assumptions at function entry.
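
The bookkeeping could look roughly like the following.  This is a sketch
of the idea, not the aarch64 patch: struct probe_tracker, tracker_init(),
tracker_alloc() and PROBE_LIMIT are hypothetical names, the limit is
assumed to be one 4k page, and the worst-case distance assumed at function
entry is left as a parameter because its value is target specific.

```c
#include <assert.h>
#include <stddef.h>

#define PROBE_LIMIT 4096u  /* assumption: never go a full page unprobed */

/* Hypothetical prologue-time state. */
struct probe_tracker {
    size_t unprobed;  /* bytes allocated since the most recent probe */
    size_t probes;    /* explicit probes emitted so far */
};

/* At function entry nothing is known about the caller, so start from a
   worst-case distance supplied by the target. */
static void tracker_init(struct probe_tracker *t, size_t entry_unprobed)
{
    assert(entry_unprobed <= PROBE_LIMIT);
    t->unprobed = entry_unprobed;
    t->probes = 0;
}

/* Record one prologue allocation, emitting a probe each time the
   unprobed distance reaches the limit. */
static void tracker_alloc(struct probe_tracker *t, size_t size)
{
    while (size > 0) {
        size_t room = PROBE_LIMIT - t->unprobed;
        size_t step = size < room ? size : room;
        t->unprobed += step;
        size -= step;
        if (t->unprobed == PROBE_LIMIT) {
            t->probes++;      /* a real prologue would store to *sp here */
            t->unprobed = 0;  /* distance restarts at the probed address */
        }
    }
}
```

With a pessimistic entry distance, even a few small allocations can force
a probe that would be unnecessary if the caller were known to have probed
recently.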

s390 is much like aarch64 in that it doesn't have implicit probes.
However, it has simpler prologue code.

Dynamic (alloca) space is handled fairly generically with simple code to
allocate a page and probe the just allocated page.

Michael Matz has suggested some generic support so that we don't have to
write target specific code for each and every target we support.  The
idea is to have a helper function which allocates and probes stack
space.  The port can then call that helper function from within its
prologue generator.  I think this is wise -- I wouldn't want to go
through this exercise on every port.

--

So, time to open the discussion to questions & comments.

I've got patches I need to clean up and post for comments that implement
this for x86, ppc, aarch64 and s390.  x86 and ppc are IMHO in good
shape.  There's an unhandled case for s390.  I've got evaluation still
to do on aarch64.

Jeff