Bug 47735 - [4.7/4.8/4.9 Regression] Unnecessary adjustments to stack pointer
Summary: [4.7/4.8/4.9 Regression] Unnecessary adjustments to stack pointer
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.6.0
: P2 normal
Target Milestone: 4.7.4
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on:
Reported: 2011-02-14 17:35 UTC by H.J. Lu
Modified: 2014-01-10 21:44 UTC (History)
6 users (show)

See Also:
Known to work:
Known to fail:
Last reconfirmed: 2012-12-31 00:00:00


Note You need to log in before you can comment on or make changes to this bug.
Description H.J. Lu 2011-02-14 17:35:40 UTC
For this tescase, gcc 4.4 generates the most efficient code
while 4.5/4.6 become worse and worse:

[hjl@gnu-6 gcc]$ cat x.i 
unsigned mulh(unsigned a, unsigned b)
  unsigned long long l __attribute__ ((aligned(32)))
    =((unsigned long long)a * (unsigned long long)b) >> 32;
  return l;
[hjl@gnu-6 gcc]$ /usr/gcc-4.4/bin/gcc -O2 -S -fomit-frame-pointer x.i -m32
[hjl@gnu-6 gcc]$ cat x.s 
    .file    "x.i"
    .p2align 4,,15
.globl mulh
    .type    mulh, @function
    movl    8(%esp), %eax
    mull    4(%esp)
    movl    %edx, %eax
    .size    mulh, .-mulh
    .ident    "GCC: (GNU) 4.4.4"
    .section    .note.GNU-stack,"",@progbits
[hjl@gnu-6 gcc]$ /usr/gcc-4.5/bin/gcc -O2 -S -fomit-frame-pointer x.i -m32
[hjl@gnu-6 gcc]$ cat x.s
    .file    "x.i"
    .p2align 4,,15
.globl mulh
    .type    mulh, @function
    pushl    %ebp
    movl    %esp, %ebp
    movl    12(%ebp), %eax
    mull    8(%ebp)
    popl    %ebp
    movl    %edx, %eax
    .size    mulh, .-mulh
    .ident    "GCC: (GNU) 4.5.1 20100507 (prerelease) [gcc-4_5-branch revision
    .section    .note.GNU-stack,"",@progbits
[hjl@gnu-6 gcc]$ ./xgcc -B./  -O2 -S -fomit-frame-pointer x.i -m32
[hjl@gnu-6 gcc]$ cat x.s
    .file    "x.i"
    .p2align 4,,15
    .globl    mulh
    .type    mulh, @function
    pushl    %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-32, %esp
    movl    12(%ebp), %eax
    mull    8(%ebp)
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    movl    %edx, %eax
    .size    mulh, .-mulh
    .ident    "GCC: (GNU) 4.6.0 20110131 (experimental)"
    .section    .note.GNU-stack,"",@progbits
[hjl@gnu-6 gcc]$

This is caused by revision 146817:

Comment 1 Richard Biener 2011-04-28 14:51:27 UTC
GCC 4.5.3 is being released, adjusting target milestone.
Comment 2 Richard Biener 2012-07-02 11:22:04 UTC
The 4.5 branch is being closed, adjusting target milestone.
Comment 3 Jakub Jelinek 2013-04-12 15:16:49 UTC
GCC 4.6.4 has been released and the branch has been closed.
Comment 4 Jakub Jelinek 2014-01-02 09:32:29 UTC
I wonder what is the point of even looking at the alignment of VAR_DECLs that are SSA_NAME_VAR of SSA_NAMEs if we're not putting those into stack.

So perhaps something like:
--- gcc/cfgexpand.c.jj	2013-12-16 09:08:17.000000000 +0100
+++ gcc/cfgexpand.c	2014-01-02 10:04:39.525480578 +0100
@@ -1215,8 +1215,11 @@ expand_one_var (tree var, bool toplevel,
 	 we conservatively assume it will be on stack even if VAR is
 	 eventually put into register after RA pass.  For non-automatic
 	 variables, which won't be on stack, we collect alignment of
-	 type and ignore user specified alignment.  */
-      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+	 type and ignore user specified alignment.  Similarly for
+	 SSA_NAMEs for which use_register_for_decl returns true.  */
+      if (TREE_STATIC (var)
+	  || DECL_EXTERNAL (var)
+	  || (TREE_CODE (origvar) == SSA_NAME && use_register_for_decl (var)))
 				   TYPE_MODE (TREE_TYPE (var)),
 				   TYPE_ALIGN (TREE_TYPE (var)));

That said, I really wonder if we shouldn't besides estimated stack alignment track also what we really need, i.e. record stack alignment requirements without any pessimistic assumptions, only bump it when we actually allocate something on the stack that needs bigger alignment (when we create MEM DECL_RTL, when say assign_stack_temp* creates stack slot that needs bigger alignment, when RA spills something that needs bigger alignment etc.).  RA etc. would work as is, but ix86_finalize_stack_realign_flags would look at the actual value instead.

Consider say:
typedef double m256 __attribute__((vector_size (32)));
m256 bar (m256 x, m256 y);
m256 foo (m256 x, m256 y, m256 z)
  return bar (x + z, y - z) + (m256) { 1.0, 2.0, 3.0, 4.0 };

        vaddpd  %ymm2, %ymm0, %ymm0
        pushq   %rbp
        vsubpd  %ymm2, %ymm1, %ymm1
        movq    %rsp, %rbp
        andq    $-32, %rsp
        call    bar
        vaddpd  .LC0(%rip), %ymm0, %ymm0

pushq %rbp; movq %rsp, %rbp; andq $-32, %rsp; leave all seem to be completely unnecessary to me (well, some push/pop or rsp -=4/+=4 would be needed to maintain 128-bit stack alignment), bar doesn't take any argument on the stack, there is no V4DFmode spilling, etc.  For leaf functions ix86_finalize_stack_realign_flags already manages to avoid that if the stack pointer is never touched and frame pointer isn't needed.
I guess by adding another integer to x_rtl and tracking this carefully we could get rid of the dynamic stack realignment here, still likely frame_pointer_needed would be set.  Wonder if we couldn't optimize that away (unless user requested frame pointer) too in some cases if frame pointer register is unused or only used to look at arguments before stack is first decremented.
Comment 5 Jakub Jelinek 2014-01-09 20:50:19 UTC
Author: jakub
Date: Thu Jan  9 20:12:36 2014
New Revision: 206493

URL: http://gcc.gnu.org/viewcvs?rev=206493&root=gcc&view=rev
	PR middle-end/47735
	* cfgexpand.c (expand_one_var): For SSA_NAMEs, if the underlying
	var satisfies use_register_for_decl, just take into account type
	alignment, rather than decl alignment.

	* gcc.target/i386/pr47735.c: New test.

Comment 6 Jeffrey A. Law 2014-01-10 21:44:55 UTC
Fixed by Jakub's commit on the trunk.  With that patch I get the same code as GCC 4.4 for the included testcase.