Random Linux kernel functions get 16-byte stack alignment at function entry. This alignment happens before the "push %ebp; mov %esp, %ebp" sequence and breaks the kernel function graph tracer, which needs to manipulate the return address. When the alignment happens, 4(%ebp) still contains the return address, but that is only a copy of the real stack slot which the ret instruction uses. So the tracer modifies the copy and not the real return address stack entry.

There are two problems:

1) why is gcc doing 16-byte stack alignment at all?

2) why does the stack alignment happen _before_ the "push %ebp; mov %esp, %ebp" sequence?
Created attachment 19057 [details] source code (kernel/time/timer_stat.c)
Created attachment 19058 [details] intermediate file timer_stats.i
Created attachment 19059 [details] compiler command line
Is this really a bug since you have:

struct entry {
...
} __attribute__((__aligned__((1 << (4)))));

...

void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
			      void *timerf, char *comm,
			      unsigned int timer_flag)
{
	spinlock_t *lock;
	struct entry *entry, input;

Since input is required to be 16-byte aligned by the __aligned__ attribute on the struct.
(In reply to comment #4)
> Is this really a bug since you have:
> struct entry {
> ...
> } __attribute__((__aligned__((1 << (4)))));
> 
> ...
> 
> void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
> 			       void *timerf, char *comm,
> 			       unsigned int timer_flag)
> {
> 	spinlock_t *lock;
> 	struct entry *entry, input;
> 
> Since input is required to be 16byte aligned by the __aligned__ attribute on
> the struct.

Yes, Andrew pointed that out in the LKML thread as well.

This still does not explain why the mcount magic

	push %ebp
	mov  %esp, %ebp

happens _after_ the alignment, so that the stack layout assumption of mcount:

	return address
	saved ebp

is satisfied only via a copy of the return address, instead of just keeping the "push %ebp; mov %esp, %ebp" sequence right at the beginning of the function. GCC 4.4.x silently changed this and we now need to figure out how to _NOT_ trip over that.
I changed the summary to match the real problem.

Further info: while testing various kernel configs we found out that the problem comes and goes. Finally I started to compare the gcc command line options, and after some fiddling it turned out that the following minimal deltas change the code generator behaviour:

Bad:  -march=pentium-mmx                -Wa,-mtune=generic32
Good: -march=i686        -mtune=generic -Wa,-mtune=generic32
Good: -march=pentium-mmx -mtune=generic -Wa,-mtune=generic32

The good ones produce:

 650:	55                   	push   %ebp
 651:	89 e5                	mov    %esp,%ebp
 653:	83 e4 f0             	and    $0xfffffff0,%esp

The bad one:

000005f0 <timer_stats_update_stats>:
 5f0:	57                   	push   %edi
 5f1:	8d 7c 24 08          	lea    0x8(%esp),%edi
 5f5:	83 e4 f0             	and    $0xfffffff0,%esp
 5f8:	ff 77 fc             	pushl  -0x4(%edi)
 5fb:	55                   	push   %ebp
 5fc:	89 e5                	mov    %esp,%ebp

It's worse code for no reason and breaks the kernel assumption of ebp + 4 pointing to the real return address on the stack.
(In reply to comment #6)
> The good ones produce:
> 
>  650:	55                   	push   %ebp
>  651:	89 e5                	mov    %esp,%ebp
>  653:	83 e4 f0             	and    $0xfffffff0,%esp
> 
> The bad one:
> 
> 000005f0 <timer_stats_update_stats>:
>  5f0:	57                   	push   %edi
>  5f1:	8d 7c 24 08          	lea    0x8(%esp),%edi
>  5f5:	83 e4 f0             	and    $0xfffffff0,%esp
>  5f8:	ff 77 fc             	pushl  -0x4(%edi)
>  5fb:	55                   	push   %ebp
>  5fc:	89 e5                	mov    %esp,%ebp
> 
> It's worse code for no reason and breaks the kernel assumption of ebp + 4
> pointing to the real return address on the stack.

I think the difference comes from DRAP:

  /* Nonzero if function being compiled needs dynamic realigned
     argument pointer (drap) if stack needs realigning.  */
  bool need_drap;

It may be triggered by -mno-accumulate-outgoing-args, alloca, long jump, ...