This is the mail archive of the
mailing list for the GCC project.
Re: Split Stack performance, signals
- From: Keith Randall <keith dot randall at gmail dot com>
- To: Anders Oleson <anders at openpuma dot org>
- Cc: gcc at gcc dot gnu dot org
- Date: Sun, 13 Sep 2015 00:17:34 -0700
- Subject: Re: Split Stack performance, signals
- Authentication-results: sourceware.org; auth=none
- References: <CAKTo2EkTGrb9_ucRVXOVB3FBiVBcqW6qOkHKuxUhZfdFPrjQBA at mail dot gmail dot com>
- Reply-to: keithr at alum dot mit dot edu
I can tell you a lot about the Go stack-copy implementation, I worked
on a lot of it. The big drawback is that you need to ensure that you
have accurate stack maps for all frames on the stack. This is
something we can do in Go with some work but would be more problematic
for C. Most of that work is necessary for precise garbage collection
(GC) anyway, so any language with precise GC gets this for free. Some
1) You need to be able to find any pointers from heap->stack
efficiently. For Go, these were things like defer records, panic
records, and channel queues.
2) You need to be able to find contiguous memory for the stacks. We
haven't seen this to be a problem in real-world servers.
3) You need to figure out when to shrink a stack. In Go we shrink
stacks at GC time by finding stacks less than 1/4 used and shrinking
them by half. Shrinking logic is complicated by the fact that the
shrinkee may be in a syscall, in which case shrinking is not possible
because you might have passed a stack pointer into that syscall.
The big advantage is that you only have to copy once (amortized). The
stack grows as big as you need it and then you don't need to do any
more work from then on. This saves lots of morestack calls, but
perhaps more importantly it saves you from having to optimize
everything from morestack on down. morestack is almost never called,
so you can write it in a high-level language, put lots of debugging
checks in it, etc.
Another advantage is runtime predictability. Split stacks were not a
large performance problem, but they were a large source of unexplained
performance variance. We'd often see apparently no-op changes make
runtime vary by +/- 5% because they change frame sizes by just a
little bit and trigger or untrigger the hot split. So if you test a
code change and get a 4% improvement, is your code really better code,
or did you just get lucky? Maybe your code change slows things down
by 1% but happens to avoid a hot split.
On Sat, Sep 12, 2015 at 11:38 PM, Anders Oleson <firstname.lastname@example.org> wrote:
> I have been experimenting with -fsplit-stack, in particular related to
> performance issues I have read about in rust and go relative to a "hot
> split" where a function call in a tight loop continuously crosses a
> split. Originally I was interested as an analogy to approximate
> performance issues with other similar stack manipulations. I think the
> original idea is that __morestack would be called extremely rarely.
> Unfortunately there is the edge case that a stack segment boundary
> falls right in the middle of some hot function call.
> I wrote a minimal routine called in a loop a few hundred million times
> and I adjusted the stack usages until I found the split. I was really
> surprised to see how bad the hot-split problem can be. Presumably the
> actual allocation is done on the first split. The subsequent ones
> should just link to the next segment "quickly" so the one-time
> allocation cost can be ignored. I'm using an older stock Ubuntu x64
> GCC 4.8.4 btw., but things don't appear to have changed recently.
> It is taking >4000 clocks per __morestack call (about 1us on 4GHz Haswell)!
> The difference between a non-split call and one with the split prolog
> in the optimistic case where __morestack is not called for comparison
> is < 1 clock. So the prolog is fairly efficient in execution time. It
> is unfortunately large at around 36 bytes per function. I was able to
> nibble away slightly at both the speed and size, but there isn't much
> to gain.
> From examining the __morestack code, I found that the sigprocmask
> system call is being called (twice?!) per __morestack, even when it
> should just need to switch to the next allocated segment. I did read
> the reason for that change: to allow signal handlers to be
> split-stack, (ignoring that detail for the moment). A quick experiment
> shows that removing the calls to __morestack_block_signals and
> __morestack_unblock_signals brings the overhead of the hot split down
> to around 60 clocks which is much more reasonable.
> However in concept simply switching stack segments *should* not be
> hugely expensive. I made a proof-of-concept that only does the very
> minimal work to switch from one segment to another. This is done in
> assembler (conceptually in __morestack) and eliminates the call out to
> "C" on the likely hot path where the boundary has already been crossed
> and the next segment is already big enough. If you cache the details
> of the next/prev segments (if present) in the space right below the
> bottom (limit) of each stack segment, you can shrink the time down to
> 5-6 clocks. This is probably close to the achievable floor, which was
> in part what I was trying to find out.
> prolog overhead, no call to __morestack : < 1 clock
> stock call to __morestack (hot): > 4000 clocks
> without signal blocking: < 60 clocks
> potential best case: < 6 clocks
> I have noticed that both Go and Rust have now abandoned the split
> stack approach due to performance issues. In Go, the idea to have
> zillions of tiny (go)co-routines or green threads is closer to my
> interest area than the Rust use. Even on x64, I think there are still
> reasons for wanting to break out of needing large linear stacks. Or it
> may be useful to other embedded applications. But in Go, apparently
> the fix is to copy the stack (all of it?) which seems pretty drastic,
> expensive and really tricky. At least it would only happen once. I was
> wondering if there was any thought into doing more work to optimize
> the -fsplit-stack? Does the Go stack-copy implementation have other
> Another area I didn't explore was that certain leaf and small routines
> with known maximum stack usage could avoid needing the prolog.This
> might ameliorate much of the size issue.
> Bottom line is that I don't know whether this is something anyone
> still has any interest in, but in theory at least the "hot-split"
> problem could be improved significantly. At least I learned what I was
> trying to, and I put this out in case it is of use/interest to anyone.
> -- Anders