This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
random commentary on -fsplit-stack (and a bug report)
- From: "Jay Freeman (saurik)" <saurik at saurik dot com>
- To: gcc at gcc dot gnu dot org
- Date: Tue, 28 Feb 2012 05:50:11 +0000 (UTC)
- Subject: random commentary on -fsplit-stack (and a bug report)
Hello. A couple years ago I got really excited about the gcc "split stacks" feature that was being developed, and I recently noticed that it is ready to use now. I have therefore spent the last few days trying it with one of my side-projects (currently just a toy, but something I hope to have in production one day: an event mediator that makes use of light-weight coroutine-based threads to implement various protocols).
Yesterday, I integrated support for the new -fsplit-stack libgcc __splitstack_*context functions (the ones that were added to allow coroutine libraries to save/restore the splitstack context; I've linked to the relevant mailing list threads below this paragraph), and I noticed that using __splitstack_releasecontext didn't actually seem to cause anything to get deallocated (watching with strace: mmap, no munmap).
After staring at it some in gdb, I figured out why: a pointer is being passed as if it were a pointer to a pointer rather than a direct pointer. The compiler does not catch the mismatch because the value is cast and marshaled through the void * array that is used to store split stack contexts (which, btw, might be better represented internally as a struct, to avoid issues like this ;P).
    __thread struct stack_segment *__morestack_segments
      __attribute__ ((visibility ("default")));

    __splitstack_getcontext (void *context[NUMBER_OFFSETS])
      memset (context, 0, NUMBER_OFFSETS * sizeof (void *));
      context[MORESTACK_SEGMENTS] = (void *) __morestack_segments;

    __splitstack_releasecontext (void *context)
      __morestack_release_segments (context[MORESTACK_SEGMENTS], 1);

    struct dynamic_allocation_blocks *
    __morestack_release_segments (struct stack_segment **pp, int free_dynamic)
As demonstrated by these snippets, __morestack_segments is a pointer to a stack_segment; it is stored in the context as a void *, but it is then taken out of the context and passed directly to __morestack_release_segments, which in turn expects a pointer to a pointer to a stack_segment, not just a bare pointer to a stack_segment. Probably quite simple to fix (although it might be more complex than just "add a &").
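To make the shape of the mismatch concrete, here is a minimal self-contained sketch. It is not the libgcc code: `struct segment`, `release_segments`, `save_context`, `release_context`, `make_chain`, and the `CTX_*` constants are hypothetical stand-ins I made up for illustration. It shows a release routine that wants the address of the head pointer, and a caller that gives it one by going through a local variable instead of handing over the stored head itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct stack_segment. */
struct segment {
  struct segment *next;
};

/* Hypothetical context layout: slot 0 holds the head of the segment list. */
enum { CTX_SEGMENTS = 0, CTX_SLOTS = 10 };

/* Takes a pointer to the head pointer (the same shape as the pp parameter
   quoted above), frees every segment, clears the head, and returns how
   many segments were freed. */
static int release_segments(struct segment **pp) {
  int freed = 0;
  struct segment *s = *pp;
  while (s != NULL) {
    struct segment *next = s->next;
    free(s);
    s = next;
    ++freed;
  }
  *pp = NULL;
  return freed;
}

/* Builds a chain of n heap-allocated segments and returns its head. */
static struct segment *make_chain(int n) {
  struct segment *head = NULL;
  for (int i = 0; i < n; ++i) {
    struct segment *s = malloc(sizeof *s);
    assert(s != NULL);
    s->next = head;
    head = s;
  }
  return head;
}

/* Stores the head itself in the context, as the getcontext snippet does. */
static void save_context(void *context[CTX_SLOTS], struct segment *head) {
  context[CTX_SEGMENTS] = (void *) head;
}

/* Passing context[CTX_SEGMENTS] straight to release_segments would make the
   callee read the first word of the first segment as if it were the head
   pointer. Going through a local gives it a real struct segment **. */
static int release_context(void *context[CTX_SLOTS]) {
  struct segment *head = context[CTX_SEGMENTS];
  int freed = release_segments(&head);
  context[CTX_SEGMENTS] = head;
  return freed;
}
```

With this arrangement, saving a three-segment chain and then calling release_context frees all three segments and leaves the slot null. Whether the real fix can be this simple depends on details of libgcc I have not checked.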
While I am sending an e-mail regarding -fsplit-stack, though, I figured I would also mention some design issues I've noticed while using it. Some of these may just be "me being stupid" (as I've only been looking at this in depth over the last few days), but I at least have had this idea "on the back burner" for a long time now, and am actually integrating and consuming the APIs that are resulting. Feel free to ignore me.
1) The current implementation (maybe this is intended to change?) uses mmap() to allocate stack segments, which means that every allocation involves a system call, a lock in the kernel on a slow data structure (anon_vma), and has some non-zero probability of ending up with a separate VMA (which is not only slow, but in my understanding uses up a limited resource: you can only have 64k VMAs per process).
Is it possible to instead expose the functionality for allocating stack segments out of libgcc for easy replacement by coroutine runtimes? I would really love to be able to use my existing memory manager to allocate the stack segments. I realize that this allocation routine would need to be able to operate with almost no stack: that isn't a problem (I can always pivot to another stack if I need any stack).
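As a rough sketch of what such a hook could look like (an assumption on my part: libgcc exposes nothing like this today, and `segment_allocator`, `set_segment_allocator`, and the pool functions are hypothetical names), the allocation and release of segments could go through a replaceable table of function pointers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook table that a coroutine runtime could replace. */
struct segment_allocator {
  void *(*alloc)(size_t size);
  void (*release)(void *p, size_t size);
};

/* Default behaviour for this sketch only; the real code uses mmap. */
static void *default_alloc(size_t size) { return malloc(size); }
static void default_release(void *p, size_t size) { (void) size; free(p); }

static struct segment_allocator current_allocator = {
  default_alloc, default_release
};

/* A runtime would call this once at startup to install its own manager. */
static void set_segment_allocator(struct segment_allocator a) {
  assert(a.alloc != NULL && a.release != NULL);
  current_allocator = a;
}

/* Example replacement: a trivial fixed pool that makes no system calls.
   A real pool would also guarantee stack alignment for each slot. */
enum { POOL_SLOTS = 4, SLOT_SIZE = 4096 };
static unsigned char pool[POOL_SLOTS][SLOT_SIZE];
static int pool_used[POOL_SLOTS];

static void *pool_alloc(size_t size) {
  if (size > SLOT_SIZE)
    return NULL;
  for (int i = 0; i < POOL_SLOTS; ++i) {
    if (!pool_used[i]) {
      pool_used[i] = 1;
      return pool[i];
    }
  }
  return NULL;
}

static void pool_release(void *p, size_t size) {
  (void) size;
  for (int i = 0; i < POOL_SLOTS; ++i) {
    if (p == (void *) pool[i])
      pool_used[i] = 0;
  }
}
```

Neither pool function needs more than a few words of stack, which is the property the allocation routine would have to keep.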
2) I had seen a discussion on the mailing list regarding argument copying, and I must say I'm somewhat confused as to why it is sufficient to memcpy the arguments from the old stack to the new one: if I have an argument with a non-POD type that has a non-trivial copy constructor, it would seem like I need a copy operation to be able to use the object from the new stack (maybe, for example, it has an internal pointer).
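The internal-pointer concern can be shown in plain C with a hypothetical type (`small_string` and its helpers are made up for this illustration and have nothing to do with how GCC actually passes arguments). A byte-wise copy leaves the copy's pointer aimed at the original object, while a copy routine that knows about the pointer fixes it up:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical type whose data pointer refers to its own buffer. */
struct small_string {
  char buf[16];
  char *data;
};

static void small_string_init(struct small_string *s, const char *text) {
  assert(strlen(text) < sizeof s->buf);
  strcpy(s->buf, text);
  s->data = s->buf;
}

/* What a non-trivial copy constructor would do: copy the bytes, then
   re-point data at the destination's own buffer. */
static void small_string_copy(struct small_string *dst,
                              const struct small_string *src) {
  memcpy(dst->buf, src->buf, sizeof dst->buf);
  dst->data = dst->buf;
}

/* Nonzero when the internal pointer refers to the object's own buffer. */
static int small_string_is_self_contained(const struct small_string *s) {
  return s->data == s->buf;
}
```

After `memcpy(&b, &a, sizeof b)`, `b.data` still points into `a.buf`; after `small_string_copy(&c, &a)`, `c.data` points into `c.buf`.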
3) If I have either blocked signals on my thread or have set up an alternate signal stack with sigaltstack, I can get away with super-tiny stacks. However, allocate_segment has a minimum stack size of MINSIGSTKSZ (I presume to allow for signals), which on some systems I use (such as Mac OS X) I've seen be set as high as 32kB. (Meanwhile, MINSIGSTKSZ on Linux is smaller than a page, so mmap() can't even allocate it.)
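The arithmetic behind that complaint can be sketched as follows. This is only my approximation of the kind of clamping involved, not the code in allocate_segment; `segment_size` is a hypothetical helper, and the sizes used in the note after it (2048 bytes, 32kB, 4096-byte pages) are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Raises the request to the minimum size, then rounds up to a whole
   number of pages. page_size must be a power of two. */
static size_t segment_size(size_t requested, size_t min_size,
                           size_t page_size) {
  assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
  size_t size = requested < min_size ? min_size : requested;
  return (size + page_size - 1) & ~(page_size - 1);
}
```

With a 2048-byte minimum and 4096-byte pages, a 512-byte request still costs a full 4096-byte page; with a 32kB minimum, the same request costs 32768 bytes.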
4) 10 64-bit words for the splitstack context is a really large amount of space. :( I don't even have that much CPU-state (there are only 8 registers that really need to be saved when switching between coroutines). Considering the stack segments form a doubly-linked-list, it would seem like MORESTACK_SEGMENTS and CURRENT_SEGMENT are redundant. I also feel like CURRENT_STACK could be worked around fairly well.
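As a sketch of the redundancy I mean (using a hypothetical `struct seg`, and assuming that a current segment exists whenever any segment exists, which may not hold in libgcc), the head of the list can be recovered from the current segment by walking the prev links:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical doubly-linked segment record. */
struct seg {
  struct seg *prev;
  struct seg *next;
};

/* Walks prev links from the current segment back to the first one. */
static struct seg *first_segment(struct seg *current) {
  if (current == NULL)
    return NULL;
  while (current->prev != NULL)
    current = current->prev;
  return current;
}
```

The trade-off is a walk proportional to the number of segments at release time in exchange for one fewer saved word per context.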
5) As currently implemented, the stack space check is added to every single function. However, some functions do not actually use the stack (or might even be avoiding memory accesses entirely). When I look at the disassembly of my project, I see many references to __morestack and "cmp %fs:0x70,%rsp" in functions that would otherwise be just a few instructions long. Functions that don't use stack should avoid the check.
6) I have noticed that the purpose of having split stacks seems largely hobbled by the way the linker enforces humungous stacks on outgoing calls to non-split-stack code, even if that code isn't called. As an incredibly painful example: __splitstack_getcontext is not compiled with split-stack support, which means that the function I have to switch coroutines (called from every coroutine) allocates stack.
To explain what I mean by "even if that code isn't called": my code hardly ever throws exceptions, but because I support them I end up with _Unwind_resume in most of my functions; I thereby get burned with giant stacks. It would seem more ideal (although I see how this would be much more difficult) if there were some way to only allocate the larger stack as the call is made to the non-split-stack function, not when entering the split-stack one.
A lot of these problems would be solved if libgcc (and whatever friends, such as libsupc++) were themselves compiled with -fsplit-stack. Of course, I can't imagine that anyone would want to pay the performance penalty for that globally ;P. So, is there some plan to either do that for the entire build, or to provide alternative versions of those libraries that can be linked to while using -fsplit-stack in your own code?
That said, I don't think that that entirely does away with this "uncalled function drags in stack requirements" problem, as I want to say the core issue comes down to how this interacts with the inliner. In many of these cases, the call to the non-split-stack function is in some leaf function of a giant call graph that was flattened to a single massive function during the optimization pass.
The result is that if you ever interact with non-split-stack code anywhere, you really need to be quite explicit about __noinline__ to keep it from tainting the stack requirements of other functions. Part of me feels like there must be a better way of handling the stack expansions (such as by putting it at the call-site in situations like this), although I realize that might be difficult with the linker in charge of it.
A specific idea that might help, however, is to set things up so that the PLT actually handles the stack increases when you are linking to functions that are in a dynamic library. That way, calls to open (for example) would not cause the function that called it to suddenly require a large stack, but instead only as control is transferred to open would the stack size increase. (This might be quite complex, though.)
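The __noinline__ workaround from point 6 looks roughly like this. `format_error` and `checked_add` are hypothetical names, and snprintf simply stands in for any library function that was not built with -fsplit-stack; the sketch shows where the attribute goes, and I have not measured its effect on stack requirements.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cold path that calls into library code. Marking it
   noinline keeps that call from being folded into its callers. */
__attribute__ ((__noinline__))
static int format_error(char *out, size_t out_size, int code) {
  assert(out != NULL && out_size > 0);
  return snprintf(out, out_size, "error %d", code);
}

/* Hypothetical hot path: it only reaches library code when the sum is
   negative, so the common case never touches format_error. */
static int checked_add(int a, int b, char *err, size_t err_size) {
  int sum = a + b;
  if (sum < 0)
    format_error(err, err_size, sum);
  return sum;
}
```

Without the attribute, the optimizer is free to inline format_error into checked_add, at which point checked_add itself contains the call to non-split-stack code.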
7) Using the linker to handle the transition between split-stack and non-split-stack code seems like a good way to solve the problem of "we need large stacks when hitting external code", but in staring at the resulting code I have in my project I'm seeing that it isn't reliable: if you have a pointer to a function the linker will not know what you are calling. In my case, this is coming up often due to using std::function.
More awkwardly, split-stack functions that mention (but do not call) non-split-stack functions (such as to return their address) are being mis-flagged by the linker. Honestly, I question whether the linker fundamentally has enough information about what is going on to be able to make sufficiently accurate decisions with regards to stack constraints to warrant the painful abstraction breakage that split-stack uses. :(
That said, I don't have a better solution to suggest right now (I really want to say that having attributes available to declare split-stack functionality in the code would be better, but that has other ramifications). I do have concerns, though, that because of attempts to keep the ABI fixed, decisions made now (when there seems to be only a single major user, Go) will lock in how the mechanism is capable of functioning in the future.
Well, if you did read any of that, thanks for taking the time to do so. I really appreciate this feature you've been working on, and have been excited by it for a while now. When I ran into the aforementioned bug in the splitstack context implementation, I figured I'd send an e-mail that explained it, and I hope that I then didn't waste too much of people's time with my other split-stack related musings. ;P
If some of these things are in the "yeah, changing that would be interesting, we just don't have many people working on the feature" category, I'd be happy to throw some patches towards it. I hesitate to just start sending patches over the wall, however, without first doing some kind of verification that I have any clue what I'm doing; I certainly am not certain how things would be prioritized, or even really who is working on it. ;P
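The two cases from point 7 can be reduced to a few lines. The names `int_fn`, `pick_handler`, and `invoke` are hypothetical, and abs is just a convenient libc function that I assume was not built with -fsplit-stack. One function only mentions abs by returning its address; the other calls through a pointer whose target cannot be seen at its call site.

```c
#include <assert.h>
#include <stdlib.h>

typedef int (*int_fn)(int);

/* Mentions abs without ever calling it: only the address is taken. */
static int_fn pick_handler(void) {
  return abs;
}

/* Calls whatever it is handed. Nothing at this call site names the
   target, so a tool that looks at symbol references cannot tell whether
   the callee is split-stack code. */
static int invoke(int_fn fn, int value) {
  assert(fn != NULL);
  return fn(value);
}
```

Here pick_handler references a non-split-stack symbol but needs no extra stack, while invoke references none and may need it.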
Jay Freeman (saurik)