Bug 53383 - Allow -mpreferred-stack-boundary=3 on x86-64
Summary: Allow -mpreferred-stack-boundary=3 on x86-64
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.7.0
: P3 enhancement
Target Milestone: 4.8.0
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-patches/201...
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-17 00:12 UTC by H. Peter Anvin
Modified: 2017-03-28 16:51 UTC (History)
8 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-05-18 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description H. Peter Anvin 2012-05-17 00:12:57 UTC
In the Linux kernel, we do not use SSE or floating-point of any kind (except in very particular highly controlled places); furthermore, stack space is at an extreme premium.  As such, it makes absolutely no sense to align the stack to 16 bytes in x86-64.

gcc, however, refuses to allow setting the stack alignment to 8 bytes:

: anacreon 105 ; gcc -mpreferred-stack-boundary=3 -mincoming-stack-boundary=3 -O2 -mno-sse -c /tmp/foo.c
/tmp/foo.c:1:0: error: -mpreferred-stack-boundary=3 is not between 4 and 12
/tmp/foo.c:1:0: error: -mincoming-stack-boundary=3 is not between 4 and 12

This is particularly ironic since we have found out that, in fact, ALL kernel stacks are actually 16-byte misaligned on x86-64 as the entry code pushes a total of 88 bytes onto the stack.  As such, gcc trying to keep a 16-byte alignment actually makes the code do worse, not better.
Comment 1 H.J. Lu 2012-05-17 14:00:14 UTC
I think we can relax it under -mcmodel=kernel.
Comment 2 H. Peter Anvin 2012-05-17 14:11:23 UTC
Why would -mcmodel=kernel matter for this?  (For the record, there is C code in the Linux kernel which doesn't use -mcmodel=kernel, too, and other embedded programs may very well have the same issue.)
Comment 3 H.J. Lu 2012-05-17 15:37:02 UTC
Following the x86-64 psABI, GCC assumes incoming stack is
16byte aligned and generates 16-byte aligned vector move
on stack.  If it isn't true, you will get segfault at
run-time.
Comment 4 H. Peter Anvin 2012-05-17 15:59:28 UTC
Only if the program in question is actually using SSE.  If SSE is disallowed (because it is kernel code, or some other embedded piece of code) it is irrelevant.
Comment 5 H. Peter Anvin 2012-05-17 16:09:19 UTC
The point is that the key is -mno-sse, not -mcmodel=kernel.
Comment 6 Jan Hubicka 2012-05-18 11:54:03 UTC
The reason why -mpreferred-stack-boundary=3 is not allowed is that the va-args sequences generated are not safe in the case frame is misaligned and values with higher intended alignment are passed.
In that case the caller would be required to do dynamic stack alignment. 

Otherwise the value might end up misaligned on the stack and va-arg sequence will do alignment using & operation resulting in fetching wrong value from memory.
Since we have dynamic stack alignment now, we could allow that and explicitely document that as ABI breaking (like we have the double alignment and other kludges on x86)

Honza
Comment 7 H. Peter Anvin 2012-05-18 15:23:05 UTC
We can't use the SSE parts of the ABI anyway in the kernel, and I sure hope that -mno-sse (or perhaps -mcmodel=kernel, but that would be ugly) prevents those from being generated.

Yes, it's an ABI violation, but it's a very necessary one...
Comment 8 Jan Hubicka 2012-05-19 19:14:18 UTC
__int128 do not require SSE and yet it is 128bit aligned....
I am not against allowing smaller alignments, we just need to document it breaks ABI and it would be nice to explain how
(and probably warn on 128bit aligned varargs or force the dynamc alignment)
Comment 9 H.J. Lu 2012-05-19 21:03:10 UTC
With this patch:

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index eca542c..3e4e768 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3660,9 +3660,14 @@ ix86_option_override_internal (bool main_args_p)
   ix86_preferred_stack_boundary = PREFERRED_STACK_BOUNDARY_DEFAULT;
   if (global_options_set.x_ix86_preferred_stack_boundary_arg)
     {
-      int min = (TARGET_64BIT ? 4 : 2);
+      int min;
       int max = (TARGET_SEH ? 4 : 12);
 
+      if (TARGET_64BIT)
+	min = TARGET_SSE ? 4 : 3;
+      else
+	min = 2;
+
       if (ix86_preferred_stack_boundary_arg < min
 	  || ix86_preferred_stack_boundary_arg > max)
 	{
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index ddb3645..f7f13d2 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -708,7 +708,7 @@ enum target_cpu_default
 #define MAIN_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
 
 /* Minimum stack boundary.  */
-#define MIN_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+#define MIN_STACK_BOUNDARY (TARGET_64BIT ? (TARGET_SSE ? 128 : 64) : 32)
 
 /* Boundary (in *bits*) on which the stack pointer prefers to be
    aligned; the compiler cannot rely on having this alignment.  */

I got

[hjl@gnu-mic-2 gcc]$ cat /tmp/x.c
extern __int128 x;

extern void bar (int, int, int, int, int, __int128);
void
foo (void)
{
  bar (1, 2, 3, 4, 5, x);
}
[hjl@gnu-mic-2 gcc]$ ./xgcc -B./ -S -O2 /tmp/x.c -mpreferred-stack-boundary=3  -mno-sse
[hjl@gnu-mic-2 gcc]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movl	$5, %r8d
	movl	$4, %ecx
	movl	$2, %esi
	movl	$1, %edi
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	andq	$-16, %rsp
	subq	$16, %rsp
	movq	x+8(%rip), %rdx
	movq	x(%rip), %rax
	movq	%rdx, 8(%rsp)
	movq	%rax, (%rsp)
	movl	$3, %edx
	call	bar
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 4.8.0 20120519 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-mic-2 gcc]$ 

It looks OK to me.
Comment 10 Jan Hubicka 2012-05-19 21:47:39 UTC
The problem is va_args doing alignment based on stack pointer, i.e. in:
int
test (int a, ...)
{
  va_list p;

  va_start (p, a);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
    va_arg (p, int);
  if (a) 
    va_arg (p, int);
  return va_arg (p, __int128);
}

test:
.LFB0:
        .cfi_startproc
        leaq    8(%rsp), %rax
        movq    %rdx, -40(%rsp)
        movq    %rsi, -48(%rsp)
        movq    %rcx, -32(%rsp)
        movq    %r8, -24(%rsp)
        movq    %rax, -72(%rsp)
        leaq    -56(%rsp), %rax
        addq    $8, -72(%rsp)
        testl   %edi, %edi
        movq    %r9, -16(%rsp)
        movl    $48, -80(%rsp)
        movq    %rax, -64(%rsp)
        movq    -72(%rsp), %rax
        leaq    32(%rax), %rdx
        movq    %rdx, -72(%rsp)
        je      .L14
        addq    $40, %rax
        movq    %rax, -72(%rsp)
.L14:
        movq    -72(%rsp), %rax
        addq    $15, %rax
        andq    $-16, %rax
        leaq    16(%rax), %rdx
        movq    %rdx, -72(%rsp)
        movq    (%rax), %rax
        ret
        .cfi_endproc
.LFE0:

this will get out of sync with hard coded offsets if rsp hapepns to be misaligned.
Comment 11 H.J. Lu 2012-05-20 02:04:40 UTC
(In reply to comment #10)
> The problem is va_args doing alignment based on stack pointer, i.e. in:
> int
>
>   return va_arg (p, __int128);
> }
> 

>         addq    $15, %rax
>         andq    $-16, %rax

This isn't necessary.  If __int128 is put on stack by caller,
the stack must be aligned at 16 bytes.

>         leaq    16(%rax), %rdx
>         movq    %rdx, -72(%rsp)
>         movq    (%rax), %rax
>         ret
>         .cfi_endproc
> .LFE0:
> 
> this will get out of sync with hard coded offsets if rsp hapepns to be
> misaligned.

I don't think it will happen.
Comment 12 Jan Hubicka 2012-05-20 10:15:06 UTC
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383
> 
> --- Comment #11 from H.J. Lu <hjl.tools at gmail dot com> 2012-05-20 02:04:40 UTC ---
> (In reply to comment #10)
> > The problem is va_args doing alignment based on stack pointer, i.e. in:
> > int
> >
> >   return va_arg (p, __int128);
> > }
> > 
> 
> >         addq    $15, %rax
> >         andq    $-16, %rax
> 
> This isn't necessary.  If __int128 is put on stack by caller,
> the stack must be aligned at 16 bytes.

Not when you call function with -fpreferred-stack-boundary=3
and it is itself compiled with -fpreferred-stack-boudnary=4
and calls another functions passing __int128.
Thus the ABI incompatibility.
Honza
Comment 13 Jan Hubicka 2012-05-20 11:34:27 UTC
> > This isn't necessary.  If __int128 is put on stack by caller,
> > the stack must be aligned at 16 bytes.
> 
> Not when you call function with -fpreferred-stack-boundary=3
> and it is itself compiled with -fpreferred-stack-boudnary=4
> and calls another functions passing __int128.
> Thus the ABI incompatibility.
... and the alignment code is there not to align stack frame, but to
align the __int128bit position within the argument area.  The
va_arg calls beforehand are of unknon count.
> Honza
Comment 14 H.J. Lu 2012-05-20 14:28:18 UTC
(In reply to comment #12)
> 
> Not when you call function with -fpreferred-stack-boundary=3
> and it is itself compiled with -fpreferred-stack-boudnary=4
> and calls another functions passing __int128.
> Thus the ABI incompatibility.

When everything is compiled with -fpreferred-stack-boundary=3,
it works fine.
Comment 15 H.J. Lu 2012-05-20 14:48:44 UTC
A patch is posted at

http://gcc.gnu.org/ml/gcc-patches/2012-05/msg01334.html
Comment 16 hjl@gcc.gnu.org 2012-06-22 17:11:10 UTC
Author: hjl
Date: Fri Jun 22 17:10:58 2012
New Revision: 188893

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188893
Log:
Allow -mpreferred-stack-boundary=3 on x86-64

	PR target/53383
	* doc/invoke.texi: Add a warning for -mpreferred-stack-boundary=3.

	* config/i386/i386.c (ix86_option_override_internal): Allow
	-mpreferred-stack-boundary=3 for 64-bit if SSE is disabled.

	* config/i386/i386.h (MIN_STACK_BOUNDARY): Set to 64 for 64-bit
	if SSE is disabled.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/doc/invoke.texi
Comment 17 H.J. Lu 2012-06-22 17:12:14 UTC
Fixed for 4.8.0.
Comment 18 Paolo Carlini 2013-02-27 10:21:37 UTC
*** Bug 56464 has been marked as a duplicate of this bug. ***
Comment 19 Andy Lutomirski 2015-07-05 19:56:35 UTC
I don't think the fix is correct.

This works:

gcc -mno-sse -mpreferred-stack-boundary=3 ...

This does not:

gcc -mno-sse -mpreferred-stack-boundary=3 -mincoming-stack-boundary=3 ...

This makes no sense, since they should be equivalent.

Also, I find the docs to be unclear as to what different values of the incoming and preferred stack boundaries mean.

Finally, why is -mno-sse required in order to set a low stack boundary?  Couldn't gcc figure out that the existence of a stack variable (SSE, alignas, __attribute__((aligned(32))), etc) should force dynamic stack alignment?  In fact, that should be necessary, and already appears to work, for __attribute__((aligned(32))), as gcc generates correct code for this:

typedef int __attribute__((aligned(32))) int_aligned_32;

extern void func2(int_aligned_32 *x);

void func(void)
{
	int_aligned_32 x;
	func2(&x);
}
Comment 20 H.J. Lu 2015-07-05 20:36:24 UTC
(In reply to Andy Lutomirski from comment #19)
> I don't think the fix is correct.
> 
> This works:
> 
> gcc -mno-sse -mpreferred-stack-boundary=3 ...
> 
> This does not:
> 
> gcc -mno-sse -mpreferred-stack-boundary=3 -mincoming-stack-boundary=3 ...
>

Please provide a testcase.

> This makes no sense, since they should be equivalent.
> 
> Also, I find the docs to be unclear as to what different values of the
> incoming and preferred stack boundaries mean.
> 
> Finally, why is -mno-sse required in order to set a low stack boundary? 
> Couldn't gcc figure out that the existence of a stack variable (SSE,
> alignas, __attribute__((aligned(32))), etc) should force dynamic stack
> alignment? 

Since the x86-86 psABI says that stack must be 16 byte aligned, if the stack
isn't 16-byte aligned,  the code with SSE insn, which follows the psABI,
will crash when called with 8-byte aligned stack.
Comment 21 Andy Lutomirski 2015-07-05 20:49:22 UTC
(In reply to H.J. Lu from comment #20)
> (In reply to Andy Lutomirski from comment #19)
> > I don't think the fix is correct.
> > 
> > This works:
> > 
> > gcc -mno-sse -mpreferred-stack-boundary=3 ...
> > 
> > This does not:
> > 
> > gcc -mno-sse -mpreferred-stack-boundary=3 -mincoming-stack-boundary=3 ...
> >
> 
> Please provide a testcase.

No code needed:

$ touch foo.c
$ gcc -c -mno-sse -mpreferred-stack-boundary=3 -mincoming-stack-boundary=3 foo.c
foo.c:1:0: error: -mincoming-stack-boundary=3 is not between 4 and 12
$ gcc -c -mno-sse -mpreferred-stack-boundary=3 foo.c

> 
> > This makes no sense, since they should be equivalent.
> > 
> > Also, I find the docs to be unclear as to what different values of the
> > incoming and preferred stack boundaries mean.
> > 
> > Finally, why is -mno-sse required in order to set a low stack boundary? 
> > Couldn't gcc figure out that the existence of a stack variable (SSE,
> > alignas, __attribute__((aligned(32))), etc) should force dynamic stack
> > alignment? 
> 
> Since the x86-86 psABI says that stack must be 16 byte aligned, if the stack
> isn't 16-byte aligned,  the code with SSE insn, which follows the psABI,
> will crash when called with 8-byte aligned stack.

I'm confused here.  I agree in principle, but I don't actually think that gcc works this way, or, if it does, it shouldn't.

If I compile with -mpreferred-stack-boundary=3 and create an aligned(32) local variable, then gcc will dynamically align the stack and the variable will have correct alignment even if the incoming stack was not 16-byte aligned.

Shouldn't an SSE variable work exactly the same way?  That is, if gcc is generating an SSE instruction with a memory reference to an on-stack variable that requires 16-byte alignment (movdqa, for example), wouldn't that variable be effectively aligned(16) or greater and thus trigger dynamic stack alignment.

Sure, the generated SSE code will be less efficient with -mpreferred-stack-boundary=3 (because neither "and $-16,%rsp" nor the required frame pointer is free), but it should still work, right?
Comment 22 H.J. Lu 2015-07-05 21:04:36 UTC
(In reply to Andy Lutomirski from comment #21)

> $ touch foo.c
> $ gcc -c -mno-sse -mpreferred-stack-boundary=3 -mincoming-stack-boundary=3
> foo.c
> foo.c:1:0: error: -mincoming-stack-boundary=3 is not between 4 and 12
> $ gcc -c -mno-sse -mpreferred-stack-boundary=3 foo.c

I will fix it.

> > 
> > > This makes no sense, since they should be equivalent.
> > > 
> > > Also, I find the docs to be unclear as to what different values of the
> > > incoming and preferred stack boundaries mean.
> > > 
> > > Finally, why is -mno-sse required in order to set a low stack boundary? 
> > > Couldn't gcc figure out that the existence of a stack variable (SSE,
> > > alignas, __attribute__((aligned(32))), etc) should force dynamic stack
> > > alignment? 
> > 
> > Since the x86-86 psABI says that stack must be 16 byte aligned, if the stack
> > isn't 16-byte aligned,  the code with SSE insn, which follows the psABI,
> > will crash when called with 8-byte aligned stack.
> 
> I'm confused here.  I agree in principle, but I don't actually think that
> gcc works this way, or, if it does, it shouldn't.
> 
> If I compile with -mpreferred-stack-boundary=3 and create an aligned(32)
> local variable, then gcc will dynamically align the stack and the variable
> will have correct alignment even if the incoming stack was not 16-byte
> aligned.

That is correct.

> Sure, the generated SSE code will be less efficient with
> -mpreferred-stack-boundary=3 (because neither "and $-16,%rsp" nor the
> required frame pointer is free), but it should still work, right?

It works only if ALL codes are compiled with -mpreferred-stack-boundary=3.
Comment 23 hjl@gcc.gnu.org 2015-07-06 11:51:19 UTC
Author: hjl
Date: Mon Jul  6 11:50:47 2015
New Revision: 225452

URL: https://gcc.gnu.org/viewcvs?rev=225452&root=gcc&view=rev
Log:
Allow -mincoming-stack-boundary=3 with -mno-sse

Similar to -mpreferred-stack-boundary=3, -mincoming-stack-boundary=3 is
allowed with -mno-sse in 64-bit mode.

gcc/

	PR target/53383
	* config/i386/i386.c (ix86_option_override_internal): Allow
	-mincoming-stack-boundary=3 for 64-bit if SSE is disabled.

gcc/testsuite/

	PR target/53383
	* gcc.target/i386/pr53383-1.c: New file.
	* gcc.target/i386/pr53383-2.c: Likewise.
	* gcc.target/i386/pr53383-3.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr53383-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr53383-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr53383-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
Comment 24 Katsunori Kumatani 2017-03-28 10:52:48 UTC
I think this bug should be reopened because the current behavior is restricting. Andy is right. The compiler should be a tool and the developer using it should be allowed to shoot himself in the foot if that's what he wants. In this particular case it restricts freedom and the compiler should be the developer's tool not the other way around.

Simply remove the TARGET_SSE check and let the actual developer decide if he wants the stack realigned instead of the compiler? There's even an attribute for that (to force stack realignment). I don't understand why such a check is even there, it servers NO purpose except to imply that the compiler knows better than the person using it, which is just wrong -- if someone wants to use SSE *and* 8-byte aligned stack, then let him do it.

This will simply force GCC to realign the stack if and *only if* it needs to use an aligned SSE variable on the stack, just as it does with AVX already.

So if 99% of your functions *do not* use SSE, they won't use 16-byte alignment.

For that odd function that does use aligned SSE (or AVX), it will simply realign the stack to 16 or 32-bytes in that function's prolog. Nothing special and it won't crash. (as long as you don't call into library functions obviously)

Just let the developer decide please, without having to recompile a custom modified GCC to remove that check that screams "I'm the compiler and I know better than you do what your code needs"...
Comment 25 Uroš Bizjak 2017-03-28 12:24:20 UTC
(In reply to Katsunori Kumatani from comment #24)
> I think this bug should be reopened because the current behavior is
> restricting.

[...]

I agree. Let's remove this artificial dependency on -msse.

Patch in testing:

--cut here--
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 246531)
+++ config/i386/i386.c  (working copy)
@@ -5927,9 +5927,8 @@ ix86_option_override_internal (bool main_args_p,
   ix86_preferred_stack_boundary = PREFERRED_STACK_BOUNDARY_DEFAULT;
   if (opts_set->x_ix86_preferred_stack_boundary_arg)
     {
-      int min = (TARGET_64BIT_P (opts->x_ix86_isa_flags)
-                ? (TARGET_SSE_P (opts->x_ix86_isa_flags) ? 4 : 3) : 2);
-      int max = (TARGET_SEH ? 4 : 12);
+      int min = TARGET_64BIT_P (opts->x_ix86_isa_flags)? 3 : 2;
+      int max = TARGET_SEH ? 4 : 12;
 
       if (opts->x_ix86_preferred_stack_boundary_arg < min
          || opts->x_ix86_preferred_stack_boundary_arg > max)
--cut here--
Comment 26 uros 2017-03-28 16:51:32 UTC
Author: uros
Date: Tue Mar 28 16:51:00 2017
New Revision: 246543

URL: https://gcc.gnu.org/viewcvs?rev=246543&root=gcc&view=rev
Log:
	PR target/53383
	* config/i386/i386.c (ix86_option_override_internal): Always
	allow -mincoming-stack-boundary=3 for 64-bit targets.

testsuite/ChangeLog:

	PR target/53383
	* gcc.target/i386/pr53383-1.c (dg-options): Remove -mno-sse.
	* gcc.target/i386/pr53383-2.c (dg-options): Ditto.
	* gcc.target/i386/pr53383-3.c (dg-options): Ditto.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr53383-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr53383-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr53383-3.c