Bug 92172 - ARM Thumb2 frame pointers inconsistent with clang
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: c
Version: 8.3.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: ABI
Depends on:
Blocks:
 
Reported: 2019-10-21 23:02 UTC by Seth LaForge
Modified: 2020-03-18 00:07 UTC

See Also:
Host:
Target: arm
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Seth LaForge 2019-10-21 23:02:42 UTC
This is a bit of a feature request, which has been rejected before, but I think there are compelling reasons to reconsider.

The issue is described pretty well in this gcc-patches thread:
https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg195725.html

And in this clang bug:
https://bugs.llvm.org/show_bug.cgi?id=18505

The request is to provide an option to make gcc's frame pointer behavior consistent with clang, either with a special flag, or by default.

The behavior of frame pointers on ARM is a mess: AAPCS does not define it, the obsolete ARM-Thumb Procedure Call Standard (ATPCS) recommends a frame layout different from both GCC's and clang's, and ARM's obsolete armcc compiler implements yet other semantics.

However, as of 2014, ARM's standard toolchain is "ARM Compiler 6", which packages clang:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.comp6/index.html

The Keil embedded toolchain, which is pretty industry-standard for ARM embedded development, uses armclang:
http://www.keil.com/support/man/docs/armclang_ref/armclang_ref_vvi1466179578564.htm

Addressing some of the objections to modifying the frame layout from the gcc-patches thread:

Wilco Dijkstra <https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg195782.html>:
> However changing the frame pointer like in the proposed patch
> will have a much larger cost - both in performance and codesize. You'd be 
> lucky if it is less than 10%. This is due to placing the frame pointer at the
> top rather than the bottom of the frame, and that is very inefficient in Thumb-2.

I don't understand this objection. For a simple function the additional overhead is literally nothing - for example <https://godbolt.org/z/BhvM2t>, GCC generates:
        push    {r3, r4, r7, lr}
        add     r7, sp, #0
while clang adds a small constant to make r7 point to the previous r7 on the stack, with lr immediately above - zero overhead:
        push    {r4, r6, r7, lr}
        add     r7, sp, #8
For a more complex function where the compiler has to spill r8-r11 one extra instruction is required to generate the right frame layout - gcc generates:
        push    {r3, r4, r5, r6, r7, r8, r9, lr}
        add     r7, sp, #0
While clang generates:
        push    {r4, r5, r6, r7, lr}
        add     r7, sp, #12
        push.w  {r8, r9, r11}
Push (stmdb) instructions take, at least on Cortex-M3, 1+N cycles, where N is the number of registers saved. So clang's frame pointer approach takes one extra cycle and 4 extra bytes.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337e/BABBCJII.html

> Doing real unwinding is also far more accurate than frame pointer based
> unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
> non-leaf functions and shrinkwrapped functions - and this breaks callgraph
> profiling).

This is true, but doing real unwinding is prohibitively expensive in an embedded systems context, in which one has only hundreds of KiB of code storage and RAM.

Richard Earnshaw <https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg196444.html>:
> I object to another hack going in for another ill-specified frame
> pointer variant until such time as the ABI is updated to sort this out
> properly.
>
> So until the ABI sanctions a proper inter-function frame chain record,
> GCC will only support local use of the frame pointer and no chaining.

Since this is not defined by the ABI, the ABI is unlikely to specify it any time soon. However, ARM seems to have blessed clang as the official ARM compiler, so it's a de facto standard at this point.

Richard Earnshaw <https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg196488.html>:
> On entry to a function the code has to save the existing frame register.
> It doesn't know (can't trivially know) whether the caller is code
> compiled in Arm state or Thumb state.  So how can it save the caller's
> frame register if they are not the same?
>
> Furthermore, the 'other' frame register (ie r7 in Arm state, r11 in
> Thumb) is available as a call-saved register, so can contain any random
> value.  If you try to use that random value during a frame chain walk
> your program will most like take an access violation.  It will certainly
> give you a garbage frame chain.

This is true - you cannot safely walk the stack frames if Thumb and ARM functions are intermixed. However, in the situations where this feature is most useful, this is not a problem. Deeply embedded codebases are typically compiled in their entirety with a single compiler and a single instruction set. Most microcontrollers use Cortex-M cores, which don't even implement the ARM instruction set, so by definition ARM code will not be present!

Someone wrote something like:
> The extra overhead of frame pointers will remove the benefit of thumb instructions - 
> why not just use ARM instructions?

As noted above, there exist many MCUs for which ARM mode is not implemented.

I have two applications motivating me to want this fixed. I'm working on safety-critical firmware running on small microcontrollers.

1) In case of a crash, it would be extremely helpful to be able to have the embedded firmware relay back a simple stack trace. Integrating libunwind and including the unwind tables in our firmware is too heavyweight. We know the boundaries of the stack, so it's easy to validate addresses when traversing frames. If the stack trace sometimes ends early due to issues such as ARM/Thumb interworking, we don't mind - it's much better than no trace at all.

2) It would be really helpful to have random sampling profiling, by capturing stack traces from a randomly triggered timer interrupt handler. Full profiling would add excessive overhead.

I'm totally willing to take a slight performance hit to get the two features above. Judging from stackoverflow questions and such, there are others who would like predictable frame pointers:
https://stackoverflow.com/questions/19643047/arm-call-stack-generation-with-no-frame-pointer
http://cplusadd.blogspot.com/2008/11/frame-pointers-and-function-call.html
https://gcc-help.gcc.gnu.narkive.com/D8BDrQzp/stack-backtrace-for-arm-thumb
https://github.com/google/sanitizers/issues/640
Comment 1 Wilco 2019-10-22 00:47:13 UTC
Firstly it's important to be clear this is about adding support for a frame chain for unwinding.

A frame pointer is something different since it is used for addressing local variables. Historically Arm compilers only supported a frame pointer, not a frame chain for unwinding. So different Arm backends use different frame pointer registers and there is no defined layout since it is not designed for unwinding.

Why does this matter? Well as your examples show, if you want to emit a frame chain using standard push/pop, it typically ends up pointing to the top of the frame. That is the worst possible position for a frame pointer on Thumb - while Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support negative offsets at all, and Thumb-2 supports offsets up to -255 but only with 32-bit instructions. So the result of conflating the frame chain and frame pointer implies a terrible codesize hit for Thumb.

There is also an issue with using r7 as the frame chain in that you're now reserving a precious low register callee-save and just use it once in a typical function. So using r7 is a very bad idea for Thumb.

Your examples suggest LLVM suffers from both of these issues, and IIRC it still uses r11 on Arm but r7 on Thumb. That is way too inefficient/incorrect to consider a de facto standard.
Comment 2 Seth LaForge 2019-10-23 00:22:09 UTC
Good point on frame pointers vs a frame chain for unwinding. I'm looking for the unwindable frame chain.

Wilco:
> Why does this matter? Well as your examples show, if you want to emit a frame
> chain using standard push/pop, it typically ends up pointing to the top of the
> frame. That is the worst possible position for a frame pointer on Thumb - while
> Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support
> negative offsets at all, and Thumb-2 supports offsets up to -255 but only with
> 32-bit instructions. So the result of conflating the frame chain and frame
> pointer implies a terrible codesize hit for Thumb.

Well, there's really no need for a frame pointer for efficiency, since the stack frame can be efficiently accessed with positive immediate accesses relative to the stack pointer. There are even special encodings for Thumb-2 16-bit LDR/STR which allow an immediate offset of 0 to 1020 when relative to SP - much larger than other registers. You're saying using a frame pointer implies a terrible codesize hit for Thumb, but I don't see how that can be - stack access will continue to go through SP, and the only code size hit should be pushing/popping R7 (~2 cycles), computing R7 as a frame pointer (~1 cycle), and potential register spills due to one less register available. That's a pretty small amount of overhead for a non-leaf function.

> Your examples suggest LLVM suffers from both of these issues, and IIRC it still
> uses r11 on Arm but r7 on Thumb. That is way too inefficient/incorrect to consider
> a de facto standard.

Using R11 in ARM and R7 on Thumb is mandated by the AAPCS I believe. I don't think the overhead is likely to be particularly different in Thumb vs ARM.

Numbers talk, so I collected some benchmarks on some production firmware used in self-driving cars. This code is executing on a Cortex-R5F MCU, processing a large amount of data, with a wide variety of function sizes. Unfortunately precise benchmarking on this MCU is difficult - there seem to be swings of a few percent in performance due to changes in code alignment, but the rough results have been reliable. I'm collecting .text size and time spent in computation. Unfortunately we're using a pretty old version of gcc, but the frame pointer generation doesn't seem to have changed in newer releases.

Baseline: With gcc 4.7, -fomit-frame-pointer, -mthumb: 384016 bytes, 110.943 s.
With gcc 4.7, -fno-omit-frame-pointer, -mthumb: 396688 bytes, 113.539 s.
This shows a +3.3% size overhead and +2.3% time overhead for enabling frame pointers in Thumb-2 code.

With gcc 4.7, -fomit-frame-pointer, ARM mode: 487152 bytes, 113.874 s.
That's +26.9% size and +2.6% time over -mthumb.
With gcc 4.7, -fno-omit-frame-pointer, ARM mode: 498064 bytes, 116.936 s.
This shows a +2.2% size overhead and +2.7% time overhead for enabling frame pointers in ARM code.
Within margin of error, it appears the frame pointer overhead is comparable in Thumb-2 and ARM code.

With clang 7, -fomit-frame-pointer, -mthumb: 371008 bytes, 107.072 s.
That's -3.4% size and -3.5% time over gcc 4.7.
With clang 7, -fno-omit-frame-pointer, -mthumb: 377296 bytes, 110.868 s.
This shows a +1.7% size overhead and +3.5% time overhead for enabling frame pointers in Thumb-2 code for clang 7.
Within margin of error, it appears clang's frame pointer overhead is slightly higher than gcc's for Thumb-2, but not much.

With clang 7, -fomit-frame-pointer, ARM mode: 458592 bytes, 112.829 s.
That's +21.5% size +1.8% time over clang -mthumb.
With clang 7, -fno-omit-frame-pointer, ARM mode: 463440 bytes, 111.796 s.
That's +1.1% size -0.9% time over clang ARM without frame pointers. I'm a bit mystified by this result - I looked at the generated code and it does what I'd expect, so I think this is just benchmarking variation due to caches/alignment.

For my application, a ~2.5% performance hit is very worthwhile to gain the extra debugability of easy stack traces. I'll probably end up switching over to clang and frame pointers. It'd be nice if people using gcc for embedded ARM development had an easy option for generating stack traces.
Comment 3 Richard Earnshaw 2019-10-23 09:36:27 UTC
(In reply to Seth LaForge from comment #2)

> Using R11 in ARM and R7 on Thumb is mandated by the AAPCS I believe. I don't
> think the overhead is likely to be particularly different in Thumb vs ARM.

No it doesn't.  The AAPCS for AArch32 makes no reference to a frame pointer, so there is no portable way defined for walking a frame other than by using dwarf records or C++ unwinding descriptions.  The latter are preferred, but only support unwinding from 'synchronous' unwind points (after the prologue and before the epilogue).

Compilers are, of course, free to use frame pointers internally, within a frame, but there is no frame chain that can be walked.
Comment 4 Wilco 2019-10-23 12:56:22 UTC
(In reply to Seth LaForge from comment #2)
> Good point on frame pointers vs a frame chain for unwinding. I'm looking for
> the unwindable frame chain.
> 
> Wilco:
> > Why does this matter? Well as your examples show, if you want to emit a frame
> > chain using standard push/pop, it typically ends up pointing to the top of the
> > frame. That is the worst possible position for a frame pointer on Thumb - while
> > Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support
> > negative offsets at all, and Thumb-2 supports offsets up to -255 but only with
> > 32-bit instructions. So the result of conflating the frame chain and frame
> > pointer implies a terrible codesize hit for Thumb.
> 
> Well, there's really no need for a frame pointer for efficiency, since the
> stack frame can be efficiently accessed with positive immediate accesses
> relative to the stack pointer. There are even special encodings for Thumb-2
> 16-bit LDR/STR which allow an immediate offset of 0 to 1020 when relative to
> SP - much larger than other registers. You're saying using a frame pointer
> implies a terrible codesize hit for Thumb, but I don't see how that can be -
> stack access will continue to go through SP, and the only code size hit
> should be pushing/popping R7 (~2 cycles), computing R7 as a frame pointer
> (~1 cycle), and potential register spills due to one less register
> available. That's a pretty small amount of overhead for a non-leaf function.

On GCC10 the codesize overhead of -fno-omit-frame-pointer is 4.1% for Arm and 4.8% for Thumb-2 (measured on SPEC2006). That's already a large overhead, especially since this feature doesn't do anything useful besides adding overhead...

The key is that GCC uses the frame pointer for every stack access, and thus the placement of the frame pointer within a frame matters. Thumb compilers place the frame pointer at the bottom of the frame so they can efficiently access locals using positive offsets. Despite that the overhead is significant already.

If GCC would emit a frame chain like the LLVM sequence this means placing the frame pointer at the top of the stack. This forces negative frame offsets for all frame accesses. Getting a 10% overhead is being lucky, I've seen worse...

So this is something that needs to be properly designed and carefully implemented.

> Baseline: With gcc 4.7, -fomit-frame-pointer, -mthumb: 384016 bytes, 110.943
> s.

Thanks for posting actual numbers, but GCC 4.7?!? It might be time to try GCC9...
Comment 5 Seth LaForge 2019-10-23 16:26:18 UTC
Richard:
> No it doesn't.  The AAPCS for AArch32 makes no reference to a frame pointer,
> so there is no portable way defined for walking a frame other than by using
> dwarf records or C++ unwinding descriptions.  The latter are preferred, but
> only support unwinding from 'synchronous' unwind points (after the prologue
> and before the epilogue).

...in other words, neither is suitable for generating stack traces in an embedded context, which is a genuinely useful feature.

You're right, AAPCS does not mention frame pointers. ATPCS does - I'm not sure if it's still normative. However, it says the thumb frame pointer is any of r4-r7, and dictates a frame pointer even *higher* on the stack - just above the saved LR. That's not what any compiler I know of does.

At this point it's entirely an argument from consistency:
- GCC ARM and Clang ARM use R11 for frame pointer, pointing to the stacked R11. Useful.
- Clang Thumb uses R7 for frame pointer, pointing to the stacked R7. Useful.
- GCC Thumb uses R7 for the frame pointer, pointing to an arbitrary location. Useless for stack traces.

Stack traces are a genuinely useful thing. Many language runtimes do them automatically all the time (e.g. Python). Many C/C++ development environments do them automatically on a crash, either via a debugger or something like libunwind. Many embedded devices would like to do them on a crash - they often have very little storage to store debugging information and relay it to some server, and something like libunwind is just too much for them.

> Compilers are, of course, free to use frame pointers internally, within a frame,
> but there is no frame chain that can be walked.

With clang, there is. With GCC and ARM mode, there is. I'm promoting making thumb mode work the same as ARM mode, thus making stack tracing possible.

Wilco:
> On GCC10 the codesize overhead of -fno-omit-frame-pointer is 4.1% for Arm and 4.8%
> for Thumb-2 (measured on SPEC2006). That's already a large overhead, especially
> since this feature doesn't do anything useful besides adding overhead...

Well, that's basically my point: as implemented, gcc frame pointers are useless on Thumb. There's no reason to enable them. With a small adjustment to behave the same as clang they are quite useful: software can create stack traces easily. Adding a small amount of overhead to a useless feature in order to make it useful seems like a very worthwhile tradeoff to me.

> The key is that GCC uses the frame pointer for every stack access, and thus the
> placement of the frame pointer within a frame matters.

It does? Why?!? The SP register is a better register to offset from in every case I can think of that doesn't involve alloca() or variable-length arrays, which should be rare. Clang, when using frame pointers, uses SP to access local variables in most cases - compare the implementation of AccessLocal():

https://godbolt.org/z/3o4TlD

int AccessLocal(int a) {
    volatile int b = a;
    SimpleLeaf();
    return b;
}

GCC 8:
        push    {r7, lr}
        sub     sp, sp, #8
        add     r7, sp, #0
        str     r0, [r7, #4]
        ...

Clang 9:
        push    {r7, lr}
        mov     r7, sp
        sub     sp, #8
        str     r0, [sp, #4]
        ...

Same number of instructions, same code size, same performance, but the clang version has an unwindable/traceable frame pointer.

> Thanks for posting actual numbers, but GCC 4.7?!? It might be time to try GCC9...

There are, sadly, compelling historical reasons. We're putting our effort into moving to clang instead.
Comment 6 Wilco 2019-10-23 17:54:10 UTC
(In reply to Seth LaForge from comment #5)

> GCC 8:
>         push    {r7, lr}
>         sub     sp, sp, #8
>         add     r7, sp, #0
>         str     r0, [r7, #4]
>         ...
> 
> Clang 9:
>         push    {r7, lr}
>         mov     r7, sp
>         sub     sp, #8
>         str     r0, [sp, #4]
>         ...

Crazy yes, but it's due to historical reasons. Originally GCC could only emit code using a frame pointer. Later the frame pointer could be switched off (hence -fomit-frame-pointer), but you still needed it for debug tables. Then there was Dwarf which didn't need a frame pointer anymore. And today the frame pointer is off by default globally in GCC.

> - GCC ARM and Clang ARM use R11 for frame pointer, pointing to the stacked R11. Useful.

Well Clang does this:

       push    {r4, r10, r11, lr}
       add     r11, sp, #8

but GCC does something different:

        push    {r4, r5, fp, lr}
        add     fp, sp, #12

I.e. FP points to the saved LR with GCC but to the saved FP with Clang, so it's not possible for a generic unwinder to follow the chain, even ignoring Arm/Thumb interworking (which is a real issue when an application is Thumb-2 but various library functions use Arm assembly).