Split Stacks in GCC

Ian Lance Taylor

The goal of split stacks is to permit a discontiguous stack which is grown automatically as needed. This means that you can run multiple threads, each starting with a small stack, and have the stack grow and shrink as required by the program. It is then no longer necessary to think about stack requirements when writing a multi-threaded program. The memory usage of a typical multi-threaded program can decrease significantly, as each thread does not require a worst-case stack size. It becomes possible to run millions of threads (either full NPTL threads or co-routines) in a 32-bit address space.

This is currently implemented for 32-bit and 64-bit x86 targets running GNU/Linux in gcc 4.6.0 and later. For full functionality you must be using the gold linker, which you can get by building binutils 2.21 or later with --enable-gold.

Basic implementation

The stack will have a guaranteed zone which is always available. The size of the guard area will be target specific. It will include enough stack space to actually allocate more stack space. Each function will have to verify that it has enough space in the current stack to execute.

The basic verification will be a comparison between the stack pointer and the current bottom of the stack plus the guaranteed zone size. This will have to be the first operation in the function, and will also be target specific. It must be fast, as it will be executed by each called function.

There are two cases to consider. For functions which require a stack frame less than the size of the guaranteed guard area, we can do a simple comparison between the stack pointer and the stack limit. For functions which require a larger stack frame, we must do a comparison including the size of the stack frame.

Possible strategies:

Reserve a register to hold the bottom of the stack plus the guaranteed size. This will have to be a callee-saved register. Each function with a small stack frame then starts with a two instruction sequence:
```
        cmp     %sp,%reg
        bg      1f
        expand stack
    1:
```
Use a TLS variable. In the general case, in a shared library, this will require calling the __tls_get_addr function. That means that that function will have to work without requiring any additional stack space. This is infeasible unless the whole system is compiled with split stacks. It would require LD_BIND_NOW to be set, so that the __tls_get_addr function is resolved at program startup time. Even that is probably insufficient unless we can ensure that the space for the variable is fully allocated. In general I don't think we can ensure this, because dlopen can cause a thread to require more space for TLS variables, and that space will be allocated on the first call to __tls_get_addr. So I don't think this approach can work in general.
Have the stack always end at a N-bit boundary. E.g., if we always allocate stack segments as a multiple of 4K, then align each one so that the stack always ends at a 12-bit boundary. Then the amount of space remaining on the stack is SP & 0xfff. We then have a four instruction sequence:
```
        mov     %sp,%junkreg
        and     $0xfff,%junkreg
        cmp     $framesize plus slop,%junkreg
        bg      1f
        expand stack
    1:
```
Introduce a new function call which handles the comparison of the stack pointer and the stack expansion. This function call would be part of the threading library and can use special knowledge--e.g, the ability to quickly locate information about the stack of the current frame. This function call would have to use a special calling sequence, with a single argument which is the size of the stack frame. Then each function would start with a call to this function. On some targets this function call could perhaps be combined with a function call required by PIC code--e.g., __i686.get_pc_thunk.bx.
Reuse the stack protector support field. When using glibc each thread descriptor has a field used by the stack protector. We can reuse that field to hold the stack limit. Of course it is then not possible to use split stacks in conjunction with stack protector.
At least on x86, arrange to allocate a new field in the TCB header accessible via %fs or %gs. This is probably the best solution, and it is the one implemented for i386 and x86_64.

Expanding the stack

Expanding the stack requires allocating additional memory. This additional memory will have to be allocated using only the stack space slop. All of the functions used to allocate additional stack space must be compiled to not use a split stack. A new function attribute, no_split_stack will be introduced to mean that the stack should not be split. It would also work to ensure that the stack is large enough that they do not need to split the stack during the allocation call.

After expanding the stack, the function will copy any stack based parameters from the old stack to the new stack. Fortunately, all C++ objects which require a copy or move constructor are implicitly passed by reference, so copying the parameters on the stack is OK. For varargs functions, this is impossible in general, so we will compile varargs functions differently: they will use an argument pointer which is not necessarily based on the frame pointer. For functions which return objects on the stack, the objects will be returned on the old stack. This should normally happen automatically, as the initial hidden parameter will naturally point to the old stack.

When expanding the stack, the return address of the function will be munged to point to a function which will release the allocated stack block and reset the stack pointer to the caller. The address of the old stack block, and the old stack pointer, will have been saved somewhere in the new stack block.

Backward compatibility

We want to be able to use split stack programs on systems with prebuilt libraries compiled without split stacks. This means that we need to ensure that there is sufficient stack space before calling any such function.

Each object file compiled in split stack mode will be annotated to indicate that the functions use split stacks. This should probably be annotated with a note but there is no general support for creating arbitrary notes in GNU as. Therefore, each object file compiled in split stack mode will have an empty section with a special name: .note.GNU-split-stack. If an object file compiled in split stack mode includes some functions with the no_split_stack attribute, then the object file will also have a .note.GNU-no-split-stack section. This will tell the linker that some functions may not have the expected split stack prologue.

When the linker links an executable or shared library, it will look for calls from split-stack code to non-split-stack code. This will include calls to non-split-stack shared libraries (thus, a program linked against a split-stack shared library may fail if at runtime the dynamic linker finds a non-split-stack shared library; it might be desirable to use a new segment type to detect this situation).

For calls from split-stack code to non-split-stack code, the linker will change the initial instructions in the split-stack (caller) function. This means that the linker will have to have special knowledge of the instructions that the compiler emits. The effect of the changes will be to increase the required framesize by a number large enough to reasonably work for a non-split-stack. This will be a target dependent number; the default will be something like 64K. Note that this large stack will be released when the split-stack function returns. Note that I'm disregarding the case of split-stack code in a shared library calling non-split-stack code in the main executable; that seems like an unlikely problem.

Function pointers are a tricky case. In general we don't know whether a function pointer points to split-stack code. Therefore, all calls through a function pointer will be modified to call (or jump to) a special function __fnptr_morestack. This will use a target specific function calling sequence, and will be implemented as though it were itself a function call instruction. That is, all the parameters will be set up, and then the code will jump to __fnptr_morestack. The __fnptr_morestack function takes two parameters: the function pointer to call, and the number of bytes of arguments pushed on the stack. (This is not yet implemented.)

Debugging

The debugger will have to learn about split stack mode. There are two major issues: unwinding the stack, and finding the arguments to a varargs function.

Fortunately DWARF is complex enough to represent the unusual return sequence used by a split stack function. Therefore, the only major issue is that gdb expects all stack address to monotonically decrease. Some marker will be needed to tell gdb to disable this sanity check.

For a varargs function, the varargs will be accessed via an argument pointer rather than on the stack. gdb may need to be adjusted to understand this.

Implementation steps:

Add a new target function bool supports_split_stack() along with TARGET_SUPPORTS_SPLIT_STACK.
Add -fsplit-stack option to gcc. This will only be permitted if targetm.supports_split_stack() returns true.
Add no_split_stack function attribute. This will be ignored if -fno-split-stack.
If -fsplit-stack is used, emit a split-stack note into the object file.
Write the __generic_morestack routine. It must allocate the required amount of memory without itself splitting the stack. This is likely to be OS dependent.
alloca and dynamic arrays must change to check for available stack space. If there is not enough available, they must call a heap allocation routine. The space must be released on function exit.
For each target:
- Define the function entry sequence for small stack frame functions and large stack frame functions.
- Define the calling convention for the __morestack routine.
- Implement this sequence in the prologue generation code when -fsplit-stack is used and the no_split_stack attribute is not used.
- Implement varargs differently if necessary, using a separate argument pointer.
- Define the calling convention for the __fnptr_morestack routine. Change the call define_expands to use the new calling convention for calls to a REG.
- Write the __morestack routine. This will have to gets its argument in a target dependent fashion. It will have to save registers and switch to an alternate stack. Then it can call __generic_morestack. It must change the return value of the calling function to point to __releasestack. The function entry sequence must permit this to be done reliably.
- Write the __releasestack routine. It must retrieve the original stack pointer and original return address from the allocated stack, release the allocated stack, and adjust the stack pointer to point to the old stack.
- Write the __morestack_nosplit routine. This is called when split-stack code calls no-split-stack code. It is similar to __morestack, but uses a target dependent large stack size.
- Write the __fnptr_morestack routine. In some cases it may be straightforward to examine the code at the function pointer to see if it checks for a split stack. If it does, the code can jump straight to the called function. Otherwise, it must allocate stack as with __morestack_nosplit, and then copy the argument bytes over to the new stack.
Change the linker to look for the new split-stack note.
If ld -r is used to link an object with the split-stack note with an object without the split-stack note, issue an error.
For each target:
- If code in a split-stack object calls code in a non-split-stack object, change the initial starting sequence of the code to always call __morestack_nosplit.
- In a link which includes split-stack objects, if a function with a non-split-stack object is referenced in a way other than calling it (e.g., taking its address), then instead take the address of a linker-generated stub function. The stub function will call __morestack_nosplit, copy parameters to the new stack if necessary, and then jump to the real function. Copying parameters will have to make an assumption about the number of parameters, so if too many parameters are passed this will fail. I'm not sure how best to handle this unlikely case. We should only allocate a new stack if called from a split-stack function.
Adjust gdb to understand the new code.
Figure out the best way to incorporate this efficiently into glibc. It might be appropriate to compile glibc both with and without split-stack, and to somehow arrange for split-stack code to get the split-stack glibc. It would be best if both the program linker and the dynamic linker could agree, or if the program linker could somehow know that a split-stack glibc would be used at runtime.
Ideally the glibc dynamic linker would know whether a program expected a split-stack version of a shared library, and refuse to link it with a non-split-stack version of that library.