Bug 40667 - [4.4/4.5/4.6 Regression] stack frames are generated even with -fomit-frame-pointer
Summary: [4.4/4.5/4.6 Regression] stack frames are generated even with -fomit-frame-po...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.4.1
: P2 normal
Target Milestone: 4.4.1
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2009-07-07 02:25 UTC by mikulas
Modified: 2010-06-29 21:17 UTC (History)
3 users (show)

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work: 4.3.4
Known to fail:
Last reconfirmed: 2009-07-09 09:05:17


Attachments
gcc45-pr40667.patch (803 bytes, patch)
2009-07-09 13:25 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description mikulas 2009-07-07 02:25:25 UTC
This is performance regression from 4.3 (which was better).

On i386, when -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2 is used, and function operates with long long values, stack frame is generated, although it doesn't have to be.

Example:
int f(long long x);

int g(long long x)
{
        f(x);
        return 0;
}

Generated code:
        .text
        .p2align 4,,15
.globl g
        .type   g, @function
g:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    8(%ebp), %eax
        movl    12(%ebp), %edx
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        call    f
        xorl    %eax, %eax
        leave
        ret
        .size   g, .-g

Gcc 4.3 didn't generate stack frame in this case. On i386, spurious stack frame is especially bad because there are few registers and one register is lost for the stack frame. One my program doing heavy 64-bit math shows almost 1% code size increase because of these unneeded frames.
Comment 1 Richard Biener 2009-07-07 09:07:33 UTC
Why do you limit your stack boundary artificially?
Comment 2 H.J. Lu 2009-07-07 14:41:05 UTC
I think it is related to PR 39137.
Comment 3 mikulas 2009-07-07 16:48:18 UTC
"Why do you limit your stack boundary artificially?"

Because if I don't use FP code there is no point in aligning the stack. Aligning the stack wastes stack space and code size and doesn't improve performance of integer code in any way.
Comment 4 Uroš Bizjak 2009-07-09 09:05:17 UTC
For some reason IRA reloads argp using ebp-relative address as:

Reloads for insn # 22
Reload 0: reload_in (DI) = (mem/c/i:DI (plus:SI (reg/f:SI 6 bp)
                                                        (const_int 8 [0x8])) [2 x+0 S8 A32])
	GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1), optional
	reload_in_reg: (mem/c/i:DI (plus:SI (reg/f:SI 6 bp)
                                                        (const_int 8 [0x8])) [2 x+0 S8 A32])


Without IRA (4.3.x), gcc reloads argp through esp-relative addr:

Reloads for insn # 23
Reload 0: reload_in (DI) = (mem/c/i:DI (plus:SI (reg/f:SI 7 sp)
                                                        (const_int 12 [0xc])) [2 x+0 S8 A32])
	GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1), optional
	reload_in_reg: (mem/c/i:DI (plus:SI (reg/f:SI 7 sp)
                                                        (const_int 12 [0xc])) [2 x+0 S8 A32])


Confirmed, regression from 4.3.
Comment 5 Jakub Jelinek 2009-07-09 11:45:37 UTC
Why do you think it has anything to do with IRA?  It has all to do with dynamic stack realignment IMNSHO.
Just try -fno-ira on any of the fall snapshots, it makes absolutely no difference.
Comment 6 Jakub Jelinek 2009-07-09 12:04:53 UTC
Indeed, on this testcase crtl->stack_realign_needed is 1, while when using say int instead of long long it is 0.
Comment 7 Jakub Jelinek 2009-07-09 12:56:42 UTC
There are multiple places which bump crtl->stack_alignment_estimated to 64 during expansion in this case (thus forcing stack_realign_needed):

      if (SUPPORTS_STACK_ALIGNMENT)
        {
          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
                                                      data.passed_type);
          if (TYPE_ALIGN (data.nominal_type) > align)
            align = TYPE_ALIGN (data.passed_type);
          if (crtl->stack_alignment_estimated < align && 0)
            {
              gcc_assert (!crtl->stack_realign_processed);
              crtl->stack_alignment_estimated = align;
            }
        }

in assign_parms and:

  /* If a virtual register with bigger mode alignment is generated,
     increase stack alignment estimation because it might be spilled
     to stack later.  */
  if (SUPPORTS_STACK_ALIGNMENT 
      && crtl->stack_alignment_estimated < align
      && !crtl->stack_realign_processed)
    crtl->stack_alignment_estimated = align;

in gen_reg_rtx (DImode) called from parmreg = gen_reg_rtx (promoted_nominal_mode); in assign_parm_setup_reg.
We'd need to arrange for these bumps not to happen for long long incoming parameters somehow.  Another testcase:
int f (long long x);

long long x;

int
g (void)
{
  f (x);
  return 0;
}
which would need something in expand_one_var:
1166      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
1167        align = TYPE_ALIGN (TREE_TYPE (var));
1168      else
1169        align = DECL_ALIGN (var);
1170
1171      if (crtl->stack_alignment_estimated < align)
1172        {
1173          /* stack_alignment_estimated shouldn't change after stack
1174             realign decision made */
1175          gcc_assert(!crtl->stack_realign_processed);
1176          crtl->stack_alignment_estimated = align;
1177        }
Comment 8 Jakub Jelinek 2009-07-09 13:25:20 UTC
Created attachment 18170 [details]
gcc45-pr40667.patch

Very ugly patch which just shows that changing these 3 spots helps both of these testcases.  It uses minimum of the currently used alignment and LOCAL_ALIGNMENT/LOCAL_DECL_ALIGNMENT/STACK_SLOT_ALIGNMENT, so only when these functions return smaller alignment than expected they change anything.  As only i386/x86_64 is SUPPORTS_STACK_ALIGNMENT supporting platform and only for DImode
on -m32 we ever decrease the stack alignment, this shouldn't break anything.

BTW, the patch also changes something that looks like a thinko to me:
          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
                                                      data.passed_type);
          if (TYPE_ALIGN (data.nominal_type) > align)
            align = TYPE_ALIGN (data.passed_type);
data.passed_type has already been used in FUNCTION_BOUNDARY computation, so it really surprises me it doesn't use nominal_type's TYPE_ALIGN instead.
Comment 9 H.J. Lu 2009-07-09 13:57:47 UTC
(In reply to comment #8)
> on -m32 we ever decrease the stack alignment, this shouldn't break anything.
> BTW, the patch also changes something that looks like a thinko to me:
>           unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
>                                                       data.passed_type);
>           if (TYPE_ALIGN (data.nominal_type) > align)
>             align = TYPE_ALIGN (data.passed_type);
> data.passed_type has already been used in FUNCTION_BOUNDARY computation, so it
> really surprises me it doesn't use nominal_type's TYPE_ALIGN instead.

You may have an unaligned parameter. If it is spilled onto stack,
stack should be properly aligned. It isn't a problem for DImode.
But it may be a problem for other types. There may be some testcases
aleady to test it. If not, I can add a few.

What happened is we find DImode and try to align stack. At the
last minute, we find it isn't spilled and we skip the stack
alignment. What we need it is MINIMUM_TYPE_ALIGN, which is
the minimum alignment for a type. MINIMUM_TYPE_ALIGN should
be the same as TYPE_ALIGN. But on IA32, MINIMUM_TYPE_ALIGN (DImode)
== 4 and TYPE_ALIGN (DImode) == 8. We can use MINIMUM_TYPE_ALIGN
for stack alignment.

Comment 10 Jakub Jelinek 2009-07-09 16:24:04 UTC
I'm not sure what you mean MINIMUM_TYPE_ALIGN should be.  A new type field?  That would be IMHO an overkill, would enlarge types too much.
If it is just a macro, it should be probably MINIMUM_ALIGNMENT, not MINIMUM_TYPE_ALIGN, and take a tree (TYPE or DECL), mode and initial alignment
and just return a possibly lower alignment.  So pretty much like ix86_local_alignment, except that it would only ever decrease alignment, rather than also increase it.  On most targets the macro would just return the third argument.
Comment 11 H.J. Lu 2009-07-09 16:34:59 UTC
(In reply to comment #10)
> I'm not sure what you mean MINIMUM_TYPE_ALIGN should be.  A new type field? 
> That would be IMHO an overkill, would enlarge types too much.
> If it is just a macro, it should be probably MINIMUM_ALIGNMENT, not
> MINIMUM_TYPE_ALIGN, and take a tree (TYPE or DECL), mode and initial alignment
> and just return a possibly lower alignment.  So pretty much like
> ix86_local_alignment, except that it would only ever decrease alignment, rather
> than also increase it.  On most targets the macro would just return the third
> argument.

That should work.
Comment 12 Jakub Jelinek 2009-07-11 17:40:42 UTC
Subject: Bug 40667

Author: jakub
Date: Sat Jul 11 17:40:29 2009
New Revision: 149513

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=149513
Log:
	PR rtl-optimization/40667
	* defaults.h (MINIMUM_ALIGNMENT): Define if not defined.
	* doc/tm.texi (MINIMUM_ALIGNMENT): Document it.
	* config/i386/i386.h (MINIMUM_ALIGNMENT): Define.
	* config/i386/i386.c (ix86_minimum_alignment): New function.
	* config/i386/i386-protos.h (ix86_minimum_alignment): New prototype.
	* cfgexpand.c (expand_one_var): Use MINIMIM_ALIGNMENT.
	* emit-rtl.c (gen_reg_rtx): Likewise.
	* function.c (assign_parms): Likewise.  If nominal_type needs
	bigger alignment than FUNCTION_ARG_BOUNDARY, use its alignment
	rather than passed_type's alignment.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
    trunk/gcc/config/i386/i386-protos.h
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/defaults.h
    trunk/gcc/doc/tm.texi
    trunk/gcc/emit-rtl.c
    trunk/gcc/function.c

Comment 13 Jakub Jelinek 2009-07-11 19:06:38 UTC
Subject: Bug 40667

Author: jakub
Date: Sat Jul 11 19:06:26 2009
New Revision: 149517

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=149517
Log:
	PR rtl-optimization/40667
	* defaults.h (MINIMUM_ALIGNMENT): Define if not defined.
	* doc/tm.texi (MINIMUM_ALIGNMENT): Document it.
	* config/i386/i386.h (MINIMUM_ALIGNMENT): Define.
	* config/i386/i386.c (ix86_minimum_alignment): New function.
	* config/i386/i386-protos.h (ix86_minimum_alignment): New prototype.
	* cfgexpand.c (expand_one_var): Use MINIMIM_ALIGNMENT.
	* emit-rtl.c (gen_reg_rtx): Likewise.
	* function.c (assign_parms): Likewise.  If nominal_type needs
	bigger alignment than FUNCTION_ARG_BOUNDARY, use its alignment
	rather than passed_type's alignment.

Modified:
    branches/gcc-4_4-branch/gcc/ChangeLog
    branches/gcc-4_4-branch/gcc/cfgexpand.c
    branches/gcc-4_4-branch/gcc/config/i386/i386-protos.h
    branches/gcc-4_4-branch/gcc/config/i386/i386.c
    branches/gcc-4_4-branch/gcc/config/i386/i386.h
    branches/gcc-4_4-branch/gcc/defaults.h
    branches/gcc-4_4-branch/gcc/doc/tm.texi
    branches/gcc-4_4-branch/gcc/emit-rtl.c
    branches/gcc-4_4-branch/gcc/function.c

Comment 14 Jakub Jelinek 2009-07-11 19:07:41 UTC
Fixed.
Comment 15 mikulas 2009-07-23 11:12:47 UTC
The patch just walks around the problem and doesn't really fix it. It is simply a hack that disables frame generation for long long, while for the other types the same problem persists:

Example:
void g(double);
int f(double a)
{
        g(a);
        return 0;
}
(compile with -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2)
--- here it still generates stack frame although it doesn't have to. Gcc 4.3 is correct. Stack frame needs to be generated if
* the user requests it (no -fomit-frame-pointer)
* there is alloca
* there is stack realignment needed (either because of attribute that requests it or SSE variables are on the stack)

--- in the above example, none of these three conditions are true, frame is not realigned, yet the frame pointer is generated. The logic for deciding whether to generate the frame pointer is flawed.

The current double regression is not as "critical" as the long long but it is still incorrect.
Comment 16 mikulas 2009-07-23 11:18:25 UTC
In the above example, the output of assembler is:

f:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        fldl    8(%ebp)
        fstpl   (%esp)
        call    g
        xorl    %eax, %eax
        leave
        ret
Comment 17 Jakub Jelinek 2009-07-23 11:36:25 UTC
Incorrect?  Why do you think so?
double on ix86 has 8 byte alignment, and unlike long long it has also performance effects when you misalign it (in theory, we could handle double the same as long long with -Os though).  When gcc needs to decide whether to align the stack or not, it doesn't yet know whether any temporaries of that type will need to be spilled to stack.  -mpreferred-stack-boundary=2 is typically only used in the kernel anyway and that one doesn't have floating point stuff, as this option affects the Linux ABI (you can't call libc functions from code compiled that way among other things).
Comment 18 mikulas 2009-07-23 12:16:45 UTC
The bug is this: you don't align the stack and you generate the frame. Why?
Why don't you do one of these?:

* generate the frame and align
* don't generate the frame and don't align

these two would be reasonable, but generating the frame and not aligning is not.
Comment 19 Jakub Jelinek 2009-07-23 12:33:57 UTC
Because you need to decide whether to use a frame pointer or not before register allocation, and only during/after register allocation you can find out that something needs to be spilled to stack.
Comment 20 mikulas 2009-07-23 13:28:02 UTC
I see, if it gets spilled to the stack as a local variable, it realigns the stack, if it doesn't get spilled, it doesn't. But shouldn't "passing the variable as an argument on the stack" be treated equal to spilling? It is the same instruction.
Comment 21 mikulas 2009-08-08 10:41:11 UTC
Hmm, it still generates the stack frame (and the alignment itself) when there are structures containing long long and with -malign-double.

Example, compile with -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2 -malign-double:

struct s {
        long long a;
        int b;
};

void f(struct s *x);

void g(void)
{
        struct s x;
        f(&x);
}

--- the stack is aligned although it doesn't have to be. Output:

g:
        pushl   %ebp
        movl    %esp, %ebp
        andl    $-8, %esp
        subl    $20, %esp
        leal    4(%esp), %eax
        movl    %eax, (%esp)
        call    f
        leave
        ret

I compile the whole project with -malign-double (so I must use it on all modules, even integer ones, as it's ABI thing) and I compile integer-only parts with -mpreferred-stack-boundary=2.

I don't know if you can extend that hack for "structures containing long long" ... but the whole stack alignment thing really needs to be redesigned for gcc-4.5.
Comment 22 H.J. Lu 2009-08-08 11:45:13 UTC
(In reply to comment #21)
> Hmm, it still generates the stack frame (and the alignment itself) when there
> are structures containing long long and with -malign-double.
> 
> Example, compile with -O2 -fomit-frame-pointer -mpreferred-stack-boundary=2
> -malign-double:
> 

It is because -malign-double will align long long to 8 byte.
Comment 23 mikulas 2009-08-08 14:15:38 UTC
(In reply to comment #22)
> It is because -malign-double will align long long to 8 byte.

Yes, it aligns it in the structures ... but why on the stack? Those people who were writing it really didn't understand the difference between preferred alignment (long long, double, long double) that shouldn't trigger any stack realigns and enforced alignment (sse 16-byte) that should. So gcc aligns the stack when it's not needed and doesn't align it when it is (PR 40838). That's why I think it needs redesign, it can't be fixed with incremental hacks.
Comment 24 mikulas 2009-08-11 21:01:01 UTC
Another case when stack frame is spuriously generated:

/* -O2 -fomit-frame-pointer -mno-accumulate-outgoing-args */

void __attribute__((__noreturn__)) crash(__const__ char *, ...);

void F(int *q)
{
        while (1) {
                if (*q < 0)
                        crash("buuu");
                if (!*q)
                        break;
                q++;
        }
}

--- stack frame is generated for no apparent reason. The switch that actually does it is -ftree-ch (with -fno-tree-ch, the stack frame is not generated). This is not misgenerated code byt it may indicate that something bad is going on in the compiler. The conditions are very peculiar ...
Comment 25 Andrew Pinski 2010-02-15 04:22:47 UTC
>stack frame is generated for no apparent reason.

It is generated because of noreturn which is done as a regular call instead of a sibcall.  So this is expected.  Closing as fixed as it was previously.  Please open a new bug next time for a new testcase.
Comment 26 mikulas 2010-02-15 10:34:28 UTC
Comment #25: I don't understand your logic, what does attribute((noreturn)) have to do with a stack frame?

The only valid reasons for generating a stack frame are alloca() or needed stack realignment. Other functions calls, either returning or noreturn don't need a frame.

Note that attribute((noreturn)) functions normally don't trigger a stack frame. That example function was actually carefully minimized from a larger real-world function. If you change the content of the loop, the stack frame won't be generated. It looks like there is something rotten in gcc.
Comment 27 Andrew Pinski 2010-06-29 21:17:03 UTC
>-mno-accumulate-outgoing-args

Well if I understand this option, this is the correct thing to do.  Also note this was unrelated to the original problem so please file a new bug next time.