This is performance regression from 4.3 (which was better). On i386, when -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2 is used, and function operates with long long values, stack frame is generated, although it doesn't have to be. Example: int f(long long x); int g(long long x) { f(x); return 0; } Generated code: .text .p2align 4,,15 .globl g .type g, @function g: pushl %ebp movl %esp, %ebp subl $8, %esp movl 8(%ebp), %eax movl 12(%ebp), %edx movl %eax, (%esp) movl %edx, 4(%esp) call f xorl %eax, %eax leave ret .size g, .-g Gcc 4.3 didn't generate stack frame in this case. On i386, spurious stack frame is especially bad because there are few registers and one register is lost for the stack frame. One my program doing heavy 64-bit math shows almost 1% code size increase because of these unneeded frames.
Why do you limit your stack boundary artificially?
I think it is related to PR 39137.
"Why do you limit your stack boundary artificially?" Because if I don't use FP code there is no point in aligning the stack. Aligning the stack wastes stack space and code size and doesn't improve performance of integer code in any way.
For some reason IRA reloads argp using ebp-relative address as: Reloads for insn # 22 Reload 0: reload_in (DI) = (mem/c/i:DI (plus:SI (reg/f:SI 6 bp) (const_int 8 [0x8])) [2 x+0 S8 A32]) GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1), optional reload_in_reg: (mem/c/i:DI (plus:SI (reg/f:SI 6 bp) (const_int 8 [0x8])) [2 x+0 S8 A32]) Without IRA (4.3.x), gcc reloads argp through esp-relative addr: Reloads for insn # 23 Reload 0: reload_in (DI) = (mem/c/i:DI (plus:SI (reg/f:SI 7 sp) (const_int 12 [0xc])) [2 x+0 S8 A32]) GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1), optional reload_in_reg: (mem/c/i:DI (plus:SI (reg/f:SI 7 sp) (const_int 12 [0xc])) [2 x+0 S8 A32]) Confirmed, regression from 4.3.
Why do you think it has anything to do with IRA? It has all to do with dynamic stack realignment IMNSHO. Just try -fno-ira on any of the fall snapshots, it makes absolutely no difference.
Indeed, on this testcase crtl->stack_realign_needed is 1, while when using say int instead of long long it is 0.
There are multiple places which bump crtl->stack_alignment_estimated to 64 during expansion in this case (thus forcing stack_realign_needed): if (SUPPORTS_STACK_ALIGNMENT) { unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode, data.passed_type); if (TYPE_ALIGN (data.nominal_type) > align) align = TYPE_ALIGN (data.passed_type); if (crtl->stack_alignment_estimated < align && 0) { gcc_assert (!crtl->stack_realign_processed); crtl->stack_alignment_estimated = align; } } in assign_parms and: /* If a virtual register with bigger mode alignment is generated, increase stack alignment estimation because it might be spilled to stack later. */ if (SUPPORTS_STACK_ALIGNMENT && crtl->stack_alignment_estimated < align && !crtl->stack_realign_processed) crtl->stack_alignment_estimated = align; in gen_reg_rtx (DImode) called from parmreg = gen_reg_rtx (promoted_nominal_mode); in assign_parm_setup_reg. We'd need to arrange for these bumps not to happen for long long incoming parameters somehow. Another testcase: int f (long long x); long long x; int g (void) { f (x); return 0; } which would need something in expand_one_var: 1166 if (TREE_STATIC (var) || DECL_EXTERNAL (var)) 1167 align = TYPE_ALIGN (TREE_TYPE (var)); 1168 else 1169 align = DECL_ALIGN (var); 1170 1171 if (crtl->stack_alignment_estimated < align) 1172 { 1173 /* stack_alignment_estimated shouldn't change after stack 1174 realign decision made */ 1175 gcc_assert(!crtl->stack_realign_processed); 1176 crtl->stack_alignment_estimated = align; 1177 }
Created attachment 18170 [details] gcc45-pr40667.patch Very ugly patch which just shows that changing these 3 spots helps both of these testcases. It uses minimum of the currently used alignment and LOCAL_ALIGNMENT/LOCAL_DECL_ALIGNMENT/STACK_SLOT_ALIGNMENT, so only when these functions return smaller alignment than expected they change anything. As only i386/x86_64 is SUPPORTS_STACK_ALIGNMENT supporting platform and only for DImode on -m32 we ever decrease the stack alignment, this shouldn't break anything. BTW, the patch also changes something that looks like a thinko to me: unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode, data.passed_type); if (TYPE_ALIGN (data.nominal_type) > align) align = TYPE_ALIGN (data.passed_type); data.passed_type has already been used in FUNCTION_BOUNDARY computation, so it really surprises me it doesn't use nominal_type's TYPE_ALIGN instead.
(In reply to comment #8) > on -m32 we ever decrease the stack alignment, this shouldn't break anything. > BTW, the patch also changes something that looks like a thinko to me: > unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode, > data.passed_type); > if (TYPE_ALIGN (data.nominal_type) > align) > align = TYPE_ALIGN (data.passed_type); > data.passed_type has already been used in FUNCTION_BOUNDARY computation, so it > really surprises me it doesn't use nominal_type's TYPE_ALIGN instead. You may have an unaligned parameter. If it is spilled onto stack, stack should be properly aligned. It isn't a problem for DImode. But it may be a problem for other types. There may be some testcases aleady to test it. If not, I can add a few. What happened is we find DImode and try to align stack. At the last minute, we find it isn't spilled and we skip the stack alignment. What we need it is MINIMUM_TYPE_ALIGN, which is the minimum alignment for a type. MINIMUM_TYPE_ALIGN should be the same as TYPE_ALIGN. But on IA32, MINIMUM_TYPE_ALIGN (DImode) == 4 and TYPE_ALIGN (DImode) == 8. We can use MINIMUM_TYPE_ALIGN for stack alignment.
I'm not sure what you mean MINIMUM_TYPE_ALIGN should be. A new type field? That would be IMHO an overkill, would enlarge types too much. If it is just a macro, it should be probably MINIMUM_ALIGNMENT, not MINIMUM_TYPE_ALIGN, and take a tree (TYPE or DECL), mode and initial alignment and just return a possibly lower alignment. So pretty much like ix86_local_alignment, except that it would only ever decrease alignment, rather than also increase it. On most targets the macro would just return the third argument.
(In reply to comment #10) > I'm not sure what you mean MINIMUM_TYPE_ALIGN should be. A new type field? > That would be IMHO an overkill, would enlarge types too much. > If it is just a macro, it should be probably MINIMUM_ALIGNMENT, not > MINIMUM_TYPE_ALIGN, and take a tree (TYPE or DECL), mode and initial alignment > and just return a possibly lower alignment. So pretty much like > ix86_local_alignment, except that it would only ever decrease alignment, rather > than also increase it. On most targets the macro would just return the third > argument. That should work.
Subject: Bug 40667 Author: jakub Date: Sat Jul 11 17:40:29 2009 New Revision: 149513 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=149513 Log: PR rtl-optimization/40667 * defaults.h (MINIMUM_ALIGNMENT): Define if not defined. * doc/tm.texi (MINIMUM_ALIGNMENT): Document it. * config/i386/i386.h (MINIMUM_ALIGNMENT): Define. * config/i386/i386.c (ix86_minimum_alignment): New function. * config/i386/i386-protos.h (ix86_minimum_alignment): New prototype. * cfgexpand.c (expand_one_var): Use MINIMIM_ALIGNMENT. * emit-rtl.c (gen_reg_rtx): Likewise. * function.c (assign_parms): Likewise. If nominal_type needs bigger alignment than FUNCTION_ARG_BOUNDARY, use its alignment rather than passed_type's alignment. Modified: trunk/gcc/ChangeLog trunk/gcc/cfgexpand.c trunk/gcc/config/i386/i386-protos.h trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/i386.h trunk/gcc/defaults.h trunk/gcc/doc/tm.texi trunk/gcc/emit-rtl.c trunk/gcc/function.c
Subject: Bug 40667 Author: jakub Date: Sat Jul 11 19:06:26 2009 New Revision: 149517 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=149517 Log: PR rtl-optimization/40667 * defaults.h (MINIMUM_ALIGNMENT): Define if not defined. * doc/tm.texi (MINIMUM_ALIGNMENT): Document it. * config/i386/i386.h (MINIMUM_ALIGNMENT): Define. * config/i386/i386.c (ix86_minimum_alignment): New function. * config/i386/i386-protos.h (ix86_minimum_alignment): New prototype. * cfgexpand.c (expand_one_var): Use MINIMIM_ALIGNMENT. * emit-rtl.c (gen_reg_rtx): Likewise. * function.c (assign_parms): Likewise. If nominal_type needs bigger alignment than FUNCTION_ARG_BOUNDARY, use its alignment rather than passed_type's alignment. Modified: branches/gcc-4_4-branch/gcc/ChangeLog branches/gcc-4_4-branch/gcc/cfgexpand.c branches/gcc-4_4-branch/gcc/config/i386/i386-protos.h branches/gcc-4_4-branch/gcc/config/i386/i386.c branches/gcc-4_4-branch/gcc/config/i386/i386.h branches/gcc-4_4-branch/gcc/defaults.h branches/gcc-4_4-branch/gcc/doc/tm.texi branches/gcc-4_4-branch/gcc/emit-rtl.c branches/gcc-4_4-branch/gcc/function.c
Fixed.
The patch just walks around the problem and doesn't really fix it. It is simply a hack that disables frame generation for long long, while for the other types the same problem persists: Example: void g(double); int f(double a) { g(a); return 0; } (compile with -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2) --- here it still generates stack frame although it doesn't have to. Gcc 4.3 is correct. Stack frame needs to be generated if * the user requests it (no -fomit-frame-pointer) * there is alloca * there is stack realignment needed (either because of attribute that requests it or SSE variables are on the stack) --- in the above example, none of these three conditions are true, frame is not realigned, yet the frame pointer is generated. The logic for deciding whether to generate the frame pointer is flawed. The current double regression is not as "critical" as the long long but it is still incorrect.
In the above example, the output of assembler is: f: pushl %ebp movl %esp, %ebp subl $8, %esp fldl 8(%ebp) fstpl (%esp) call g xorl %eax, %eax leave ret
Incorrect? Why do you think so? double on ix86 has 8 byte alignment, and unlike long long it has also performance effects when you misalign it (in theory, we could handle double the same as long long with -Os though). When gcc needs to decide whether to align the stack or not, it doesn't yet know whether any temporaries of that type will need to be spilled to stack. -mpreferred-stack-boundary=2 is typically only used in the kernel anyway and that one doesn't have floating point stuff, as this option affects the Linux ABI (you can't call libc functions from code compiled that way among other things).
The bug is this: you don't align the stack and you generate the frame. Why? Why don't you do one of these?: * generate the frame and align * don't generate the frame and don't align these two would be reasonable, but generating the frame and not aligning is not.
Because you need to decide whether to use a frame pointer or not before register allocation, and only during/after register allocation you can find out that something needs to be spilled to stack.
I see, if it gets spilled to the stack as a local variable, it realigns the stack, if it doesn't get spilled, it doesn't. But shouldn't "passing the variable as an argument on the stack" be treated equal to spilling? It is the same instruction.
Hmm, it still generates the stack frame (and the alignment itself) when there are structures containing long long and with -malign-double. Example, compile with -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2 -malign-double: struct s { long long a; int b; }; void f(struct s *x); void g(void) { struct s x; f(&x); } --- the stack is aligned although it doesn't have to be. Output: g: pushl %ebp movl %esp, %ebp andl $-8, %esp subl $20, %esp leal 4(%esp), %eax movl %eax, (%esp) call f leave ret I compile the whole project with -malign-double (so I must use it on all modules, even integer ones, as it's ABI thing) and I compile integer-only parts with -mpreferred-stack-boundary=2. I don't know if you can extend that hack for "structures containing long long" ... but the whole stack alignment thing really needs to be redesigned for gcc-4.5.
(In reply to comment #21) > Hmm, it still generates the stack frame (and the alignment itself) when there > are structures containing long long and with -malign-double. > > Example, compile with -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2 > -malign-double: > It is because -malign-double will align long long to 8 byte.
(In reply to comment #22) > It is because -malign-double will align long long to 8 byte. Yes, it aligns it in the structures ... but why on the stack? Those people who were writing it really didn't understand the difference between preferred alignment (long long, double, long double) that shouldn't trigger any stack realigns and enforced alignment (sse 16-byte) that should. So gcc aligns the stack when it's not needed and doesn't align it when it is (PR 40838). That's why I think it needs redesign, it can't be fixed with incremental hacks.
Another case when stack frame is spuriously generated: /* -O2 -fomit-frame-pointer -mno-accumulate-outgoing-args */ void __attribute__((__noreturn__)) crash(__const__ char *, ...); void F(int *q) { while (1) { if (*q < 0) crash("buuu"); if (!*q) break; q++; } } --- stack frame is generated for no apparent reason. The switch that actually does it is -ftree-ch (with -fno-tree-ch, the stack frame is not generated). This is not misgenerated code byt it may indicate that something bad is going on in the compiler. The conditions are very peculiar ...
>stack frame is generated for no apparent reason. It is generated because of noreturn which is done as a regular call instead of a sibcall. So this is expected. Closing as fixed as it was previously. Please open a new bug next time for a new testcase.
Comment #25: I don't understand your logic, what does attribute((noreturn)) have to do with a stack frame? The only valid reasons for generating a stack frame are alloca() or needed stack realignment. Other functions calls, either returning or noreturn don't need a frame. Note that attribute((noreturn)) functions normally don't trigger a stack frame. That example function was actually carefully minimized from a larger real-world function. If you change the content of the loop, the stack frame won't be generated. It looks like there is something rotten in gcc.
>-mno-accumulate-outgoing-args Well if I understand this option, this is the correct thing to do. Also note this was unrelated to the original problem so please file a new bug next time.