This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Frame pointer for arm with THUMB2 mode


Hi,
thanks for the answer.

> Switching on the frame pointer typically costs 1-2% performance, so it's a bad
> idea to use it. However changing the frame pointer like in the proposed patch
> will have a much larger cost - both in performance and codesize. You'd be
> lucky if it is less than 10%. This is due to placing the frame pointer at the top
> rather than the bottom of the frame, and that is very inefficient in Thumb-2.
>
> You would need to unwind ~100k times a second before you might see a
> performance gain. However you pay the performance cost all the time, even
> when no unwinding is required. So I don't see this as being a good idea.
>
> If unwind performance is an issue, it would make far more sense to solve that.
> Profiling typically hits the same functions again and again. Callgraph profiling to
> a fixed depth hits the same functions all the time. So caching is the obvious
> solution.

We are working on applying Address/LeakSanitizer to the full Tizen OS
distribution. It's about ~1000 packages; the ASan/LSan runtime is
installed via ld.so.preload. As we know, ASan/LSan has interceptors for
allocators/deallocators such as malloc/realloc/calloc/free and so on.
On every allocation from a user-space program, ASan calls
GET_STACK_TRACE_MALLOC;
which unwinds the stack and by default uses the frame-based stack
unwinder. So it requires building with "-fno-omit-frame-pointer";
switching to the default unwinder really hurts performance in our case.

> Doing real unwinding is also far more accurate than frame pointer based
> unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
> non-leaf functions and shrinkwrapped functions - and this breaks callgraph
> profiling).

I agree, but in our case all the allocator interceptors are
leaf functions, so the frame-based stack unwinder works well for us.

> So my question is, is there any point in making code run significantly slower
> all the time and in return get inaccurate unwinding?

By default we build packages with ("-marm" "-fno-omit-frame-pointer"),
because we need the frame-based stack unwinder for every allocation, as I
said before. As we know, with ("-fno-omit-frame-pointer" and "-marm") GCC
makes fp point at the saved lr on the stack, and I don't really know why.
But the binary size is bigger than for Thumb, so we cannot use the default
Thumb frame pointer and want to reduce the binary size of the full
sanitized image.

Besides, clang already works the way the patch proposes.
It had the same issue, but it was fixed at the end of 2017:
https://bugs.llvm.org/show_bug.cgi?id=18505 (the thread starts with a
discussion about APCS, but that is not the main point).

There is also a still-unresolved issue related to this:
https://github.com/google/sanitizers/issues/640

Thanks.


On 08/28/2018 03:48 AM, Wilco Dijkstra wrote:
Hi,

> But we still have an issue with performance when we are using the default
> unwinder, which uses unwind tables. It could be up to 10 times faster to
> use the frame-based stack unwinder instead of the default unwinder.

Switching on the frame pointer typically costs 1-2% performance, so it's a bad
idea to use it. However changing the frame pointer like in the proposed patch
will have a much larger cost - both in performance and codesize. You'd be
lucky if it is less than 10%. This is due to placing the frame pointer at the top
rather than the bottom of the frame, and that is very inefficient in Thumb-2.

You would need to unwind ~100k times a second before you might see a
performance gain. However you pay the performance cost all the time, even
when no unwinding is required. So I don't see this as being a good idea.

If unwind performance is an issue, it would make far more sense to solve that.
Profiling typically hits the same functions again and again. Callgraph profiling to
a fixed depth hits the same functions all the time. So caching is the obvious
solution.

Doing real unwinding is also far more accurate than frame pointer based
unwinding (the latter doesn't handle leaf functions correctly, entry/exit in
non-leaf functions and shrinkwrapped functions - and this breaks callgraph
profiling).

So my question is, is there any point in making code run significantly slower
all the time and in return get inaccurate unwinding?

Cheers,
Wilco


