This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Split Stack performance, signals

From: Anders Oleson <anders at openpuma dot org>
To: gcc at gcc dot gnu dot org
Date: Sat, 12 Sep 2015 23:38:00 -0700
Subject: Split Stack performance, signals
Authentication-results: sourceware.org; auth=none

I have been experimenting with -fsplit-stack, in particular related to
performance issues I have read about in rust and go relative to a "hot
split" where a function call in a tight loop continuously crosses a
split. Originally I was interested as an analogy to approximate
performance issues with other similar stack manipulations. I think the
original idea is that __morestack would be called extremely rarely.
Unfortunately there is the edge case that a stack segment boundary
falls right in the middle of some hot function call.

I wrote a minimal routine called in a loop a few hundred million times
and I adjusted the stack usages until I found the split. I was really
surprised to see how bad the hot-split problem can be. Presumably the
actual allocation is done on the first split. The subsequent ones
should just link to the next segment "quickly" so the one-time
allocation cost can be ignored. I'm using an older stock Ubuntu x64
GCC 4.8.4 btw., but things don't appear to have changed recently.

It is taking >4000 clocks per __morestack call (about 1us on 4GHz Haswell)!

The difference between a non-split call and one with the split prolog
in the optimistic case where __morestack is not called for comparison
is < 1 clock. So the prolog is fairly efficient in execution time. It
is unfortunately large at around 36 bytes per function. I was able to
nibble away slightly at both the speed and size, but there isn't much
to gain.

>From examining the __morestack code, I found that the sigprocmask
system call is being called (twice?!) per __morestack, even when it
should just need to switch to the next allocated segment. I did read
the reason for that change: to allow signal handlers to be
split-stack, (ignoring that detail for the moment). A quick experiment
shows that removing the calls to __morestack_block_signals and
__morestack_unblock_signals brings the overhead of the hot split down
to around 60 clocks which is much more reasonable.

However in concept simply switching stack segments *should* not be
hugely expensive. I made a proof-of-concept that only does the very
minimal work to switch from one segment to another. This is done in
assembler (conceptually in __morestack) and eliminates the call out to
"C" on the likely hot path where the boundary has already been crossed
and the next segment is already big enough. If you cache the details
of the next/prev segments (if present) in the space right below the
bottom (limit) of each stack segment, you can shrink the time down to
5-6 clocks. This is probably close to the achievable floor, which was
in part what I was trying to find out.

Summary:
  prolog overhead, no call to __morestack : < 1 clock
  stock call to __morestack (hot): > 4000 clocks
  without signal blocking: < 60 clocks
  potential best case: < 6 clocks

I have noticed that both Go and Rust have now abandoned the split
stack approach due to performance issues. In Go, the idea to have
zillions of tiny (go)co-routines or green threads is closer to my
interest area than the Rust use. Even on x64, I think there are still
reasons for wanting to break out of needing large linear stacks. Or it
may be useful to other embedded applications. But in Go, apparently
the fix is to copy the stack (all of it?) which seems pretty drastic,
expensive and really tricky. At least it would only happen once. I was
wondering if there was any thought into doing more work to optimize
the -fsplit-stack? Does the Go stack-copy implementation have other
issues?

Another area I didn't explore was that certain leaf and small routines
with known maximum stack usage could avoid needing the prolog.This
might ameliorate much of the size issue.

Bottom line is that I don't know whether this is something anyone
still has any interest in, but in theory at least the "hot-split"
problem could be improved significantly. At least I learned what I was
trying to, and I put this out in case it is of use/interest to anyone.

-- Anders

Follow-Ups:
- Re: Split Stack performance, signals
  - From: Keith Randall
- Re: Split Stack performance, signals
  - From: Ian Lance Taylor

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]