Split Stacks in GCC

Ian Lance Taylor

The goal of split stacks is to permit a discontiguous stack which is grown automatically as needed. This means that you can run multiple threads, each starting with a small stack, and have the stack grow and shrink as required by the program. It is then no longer necessary to think about stack requirements when writing a multi-threaded program. The memory usage of a typical multi-threaded program can decrease significantly, as each thread does not require a worst-case stack size. It becomes possible to run millions of threads (either full NPTL threads or co-routines) in a 32-bit address space.

Basic implementation

The stack will have a guaranteed zone which is always available. The size of the guard area will be target specific: TARGET_GUARANTEED_STACK_ZONE_SIZE. Each function which is not a leaf function, or whose stack frame is larger than TARGET_GUARANTEED_STACK_ZONE_SIZE, will have to verify that it has enough space in the current stack to execute. We will also have a target dependent amount of slop; this is the amount of stack space required to allocate more stack space.

The basic verification will be a comparison between the stack pointer and the current bottom of the stack plus the guaranteed zone size. This will have to be the first operation in the function, and will also be target specific. It must be fast, as it will be executed by each called function.

There are two cases to consider. For functions which require a stack frame less than TARGET_GUARANTEED_STACK_ZONE_SIZE, we can do a simple comparison between the stack pointer and the stack limit. For functions which require a larger stack frame, we must do a comparison including the size of the stack frame.

Possible strategies:

  1. Reserve a register to hold the bottom of the stack plus the guaranteed size. This will have to be a callee-saved register. Each function with a small stack frame then starts with a two instruction sequence:

            cmp     %sp,%reg
            bg      1f
            expand stack
  2. Use a TLS variable. In the general case, in a shared library, this will require calling the __tls_get_addr function. That means that that function will have to work without requiring any additional stack space. This is infeasible unless the whole system is compiled with split stacks. It would require LD_BIND_NOW to be set, so that the __tls_get_addr function is resolved at program startup time. Even that is probably insufficient unless we can ensure that the space for the variable is fully allocated. In general I don't think we can ensure this, because dlopen can cause a thread to require more space for TLS variables, and that space will be allocated on the first call to __tls_get_addr. So I don't think this approach can work in general.

  3. Have the stack always end at a N-bit boundary. E.g., if we always allocate stack segments as a multiple of 4K, then align each one so that the stack always ends at a 12-bit boundary. Then the amount of space remaining on the stack is SP & 0xfff. We then have a four instruction sequence:

            mov     %sp,%junkreg
            and     $0xfff,%junkreg
            cmp     $framesize plus slop,%junkreg
            bg      1f
            expand stack
  4. Introduce a new function call which handles the comparison of the stack pointer and the stack expansion. This function call would be part of the threading library and can use special knowledge--e.g, the ability to quickly locate information about the stack of the current frame. This function call would have to use a special calling sequence, with a single argument which is the size of the stack frame. Then each function would start with a call to this function. On some targets this function call could perhaps be combined with a function call required by PIC code--e.g., __i686.get_pc_thunk.bx.

  5. Reuse the stack protector support field. When using glibc each thread descriptor has a field used by the stack protector. We can reuse that field to hold the stack limit. Of course it is then not possible to use split stacks in conjunction with stack protector. This is probably the best solution.

Expanding the stack

Expanding the stack requires allocating additional memory. This additional memory will have to be allocated using only the stack space slop. In general this will require using a temporary alternate stack. This alternate stack will have to be thread specific. Ideally, all of the functions used to allocate additional stack space must be compiled to not use a split stack. A new function attribute will be introduced to mean that the stack should not be split. It will also work to ensure that the stack is large enough that they do not need to split the stack during the allocation call.

After expanding the stack, the function will copy any stack based parameters from the old stack to the new stack (in C++ this may require using a copy or move constructor; in such a case it may be better to always use the address of the parameter, and not copy it). For varargs functions, this is impossible in general, so we will compile varargs functions differently: they will use an argument pointer which is not necessarily based on the frame pointer. For functions which return objects on the stack, the objects will be returned on the old stack. This should normally happen automatically, as the initial hidden parameter will naturally point to the old stack.

When expanding the stack, the return address of the function will be munged to point to a function which will release the allocated stack block and reset the stack pointer to the caller. The address of the old stack block, and the old stack pointer, will have been saved somewhere in the new stack block.

Backward compatibility

We want to be able to use split stack programs on systems with prebuilt libraries compiled without split stacks. This means that we need to ensure that there is sufficient stack space before calling any such function.

Each object file compiled in split stack mode will be annotated to indicate that the functions use split stacks. There will be no support for compiling some functions in a single file in split stack mode and some not. In ELF this will be represented using a SHT_NOTE section (with the SHF_ALLOC flag clear), which will be present in object files compiled in split stack mode. Using ld -r to link together objects with this note and without this note will produce an error.

When the linker links an executable or shared library, it will look for calls from split-stack code to non-split-stack code. This will include calls to non-split-stack shared libraries (thus, a program linked against a split-stack shared library may fail if at runtime the dynamic linker finds a non-split-stack shared library; it might be desirable to use a PT_NOTE, or perhaps a new segment type, to detect this situation).

For calls from split-stack code to non-split-stack code, the linker will change the initial instructions in the split-stack (caller) function. This means that the linker will have to have special knowledge of the instructions that the compiler emits. The effect of the changes will be to increase the required framesize by a number large enough to reasonably work for a non-split-stack. This will be a target dependent number; the default will be something like 64K. Note that this large stack will be released when the split-stack function returns. Note that I'm disregarding the case of split-stack code in a shared library calling non-split-stack code in the main executable; that seems like an unlikely problem.

Function pointers are a tricky case. In general we don't know whether a function pointer points to split-stack code. Therefore, all calls through a function pointer will be modified to call (or jump to) a special function __fnptr_morestack. This will use a target specific function calling sequence, and will be implemented as though it were itself a function call instruction. That is, all the parameters will be set up, and then the code will jump to __fnptr_morestack. The __fnptr_morestack function takes two parameters: the function pointer to call, and the number of bytes of arguments pushed on the stack.


The debugger will have to learn about split stack mode. There are two major issues: unwinding the stack, and finding the arguments to a varargs function.

Initially we can require that every function which may grow the stack should have a frame pointer. That should permit gdb to unwind the stack by using the current frame pointer to find the previous frame. Ideally we do not want to require this. When not using a frame pointer, gdb will have to see that the return address has been changed to point to the stack reset routine, and use some target dependent operations to find the caller's frame.

For a varargs function, the varargs will be accessed via an argument pointer rather than on the stack. gdb may need to be adjusted to understand this.

Implementation steps: