Bug 27537

Summary: XMM alignment fault when compiling for i386 with -Os
Product: gcc Reporter: Agner Fog <agner>
Component: target Assignee: Not yet assigned to anyone <unassigned>
Severity: normal CC: agner, belyshev, gcc-bugs, hjl.tools, hubicka, joey.ye, kpfleming, matz, mueller, rguenth, thiago, ubizjak, zuxy.meng
Priority: P3 Keywords: wrong-code
Version: 4.1.0   
Target Milestone: ---   
Host: x64 Target: ia32
Build: Known to work:
Known to fail: 3.4.0 4.0.0 4.1.0 4.2.0 Last reconfirmed: 2006-05-10 20:49:57
Bug Depends on: 33721    
Bug Blocks:    

Description Agner Fog 2006-05-10 19:54:37 UTC
The g++ compiler for i386 target assumes that the stack is aligned by 16 when storing xmm registers to the stack. However, the stack is not aligned when compiling with option -Os (or with the Intel compiler). A misaligned memory operand to an XMM instruction causes a general protection exception. An example to reproduce the error is included below.

Suggested remedies:
Either (1): Keep the stack pointer aligned by 16, even when compiling with option -Os and make this alignment an official requirement in the ABI (the ABI for IA32 Gnu is shamefully poorly documented!). This alignment is already a requirement in the Mac OS X IA32 ABI (http://developer.apple.com/documentation/DeveloperTools/Conceptual/LowLevelABI/index.html#//apple_ref/doc/uid/TP40002521). It is preferred to have the same ABI for IA32/i386 on MacOS and Linux, BSD, etc. since they all use the Gnu compiler.

Or (2): Make no assumptions about alignment of the stack. Align the stack pointer with an AND-instruction whenever an alignment higher than 4 is needed. (This is in accordance with the Windows IA32 ABI).
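
Remedy (2) amounts to rounding the stack pointer down to the next multiple of 16 in the function prologue (on x86 something like `andl $-16, %esp`). Because the stack grows downward, rounding down only wastes a few bytes and never overwrites the caller's data. A minimal sketch of that arithmetic, using a hypothetical helper `align_down` that works on plain integers rather than on the real stack pointer:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round `addr` down to a multiple of `alignment`,
   which must be a power of two. This is the same arithmetic as an AND
   of the stack pointer with -alignment. */
static uintptr_t align_down(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);
}
```

For a 4-byte-aligned input the result differs from the input by 0, 4, 8 or 12 bytes.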

Steps to reproduce:
---------- begin file e1.cpp -------------
#include <stdio.h>
#include <emmintrin.h>
// Example showing stack alignment error
// compile with g++ v. 4.1.0:
// g++ -m32 -Os -msse2 e1.cpp
// ./a.out

char * e1() { // This function assumes that the stack is aligned by 16
   // This puts 0 in an XMM register and stores it on the stack:
   volatile __m128 dummy = _mm_set_ps1(0.f);
   return "OK";
}

int main() { // This function calls e1() without proper alignment
   printf("%s %s \n", e1(), e1());
   return 0;
}
---------- end file e1.cpp -------------

compile this file with:

g++ -m32 -Os -msse2 e1.cpp

Expected output:
OK OK

Actual output:
Segmentation fault

Cause of error:
The instruction that generates the error is:
movaps %xmm0,-16(%ebp)
This instruction requires that the memory operand is aligned by 16, but there is only a 25% chance that %ebp is divisible by 16 when compiling with option -Os. The program works correctly when compiled with any other optimization option. It also works correctly when compiling for the x64 platform.
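
The 25% figure follows from the fact that a 4-byte-aligned %ebp can only take the residues 0, 4, 8 and 12 modulo 16, and only one of them makes `-16(%ebp)` a legal `movaps` operand. A small sketch that counts this, assuming all four residues are equally likely (the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if the operand of `movaps %xmm0,-16(%ebp)` would be
   16-byte aligned for the given frame pointer value. */
static int movaps_operand_aligned(uintptr_t ebp)
{
    return ((ebp - 16) & 15) == 0;
}

/* Count the acceptable values among `n` consecutive 4-byte-aligned
   frame pointers starting at `first_ebp`. */
static unsigned count_aligned(uintptr_t first_ebp, unsigned n)
{
    assert(first_ebp % 4 == 0);
    unsigned ok = 0;
    for (unsigned i = 0; i < n; ++i)
        ok += (unsigned)movaps_operand_aligned(first_ebp + 4u * i);
    return ok;
}
```

Calling `count_aligned(0x1000, 16)` returns 4, that is, one in four of the candidate frame pointers.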

I have submitted a bug report about the same issue to the Intel compiler developers. Issue # 369990 at premier.intel.com
Whether you choose remedy (1) or (2) above should preferably be coordinated with the Intel compiler developers and with Apple, because this is an ABI issue and the solution must be standardized.

[root@localhost t]# g++ -v -save-temps -m32 -Os -msse2 e1.cpp
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.4.2-gcj- --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.0 20060304 (Red Hat 4.1.0-3)
 /usr/libexec/gcc/x86_64-redhat-linux/4.1.0/cc1plus -E -quiet -v -D_GNU_SOURCE e1.cpp -m32 -msse2 -mtune=generic -Os -fpch-preprocess -o e1.ii
ignoring nonexistent directory "/usr/lib/gcc/x86_64-redhat-linux/4.1.0/../../../../x86_64-redhat-linux/include"
#include "..." search starts here:
#include <...> search starts here:
End of search list.
 /usr/libexec/gcc/x86_64-redhat-linux/4.1.0/cc1plus -fpreprocessed e1.ii -quiet -dumpbase e1.cpp -m32 -msse2 -mtune=generic -auxbase e1 -Os -version -o e1.s
GNU C++ version 4.1.0 20060304 (Red Hat 4.1.0-3) (x86_64-redhat-linux)
        compiled by GNU C version 4.1.0 20060304 (Red Hat 4.1.0-3).
GGC heuristics: --param ggc-min-expand=64 --param ggc-min-heapsize=63766
Compiler executable checksum: 145c15e175e7b5491c5723c9bdb452f1
 as -V -Qy --32 -o e1.o e1.s
GNU assembler version (x86_64-redhat-linux) using BFD version 20060212
 /usr/libexec/gcc/x86_64-redhat-linux/4.1.0/collect2 --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 /usr/lib/gcc/x86_64-redhat-linux/4.1.0/../../../../lib/crt1.o /usr/lib/gcc/x86_64-redhat-linux/4.1.0/../../../../lib/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.1.0/32/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/4.1.0/32 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.0/32 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.0/../../../../lib -L/lib/../lib -L/usr/lib/../lib e1.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-redhat-linux/4.1.0/32/crtend.o /usr/lib/gcc/x86_64-redhat-linux/4.1.0/../../../../lib/crtn.o
[root@localhost t]# ./a.out
Segmentation fault
[root@localhost t]#
Comment 1 Andrew Pinski 2006-05-10 20:39:24 UTC
First, so what if Darwin aligns the stack to 16 bytes by default? That is demanded by their ABI, which is newer than GNU/Linux's. GNU/Linux follows the SYSV x86 ABI, which is documented; maybe you cannot find it, but it does exist.
Comment 2 Andrew Pinski 2006-05-10 20:44:41 UTC
The SYSV x86 ABI says the stack is 4-byte aligned.  Remember that the SYSV x86 ABI was written before MMX or SSE were around or even thought about, back in the 486 days (and maybe even before then).
Comment 3 Andrew Pinski 2006-05-10 20:49:57 UTC
This is confirmed. It is an interaction between stack slots and -mpreferred-stack-boundary=, which is what -Os sets.  This is not a regression.  Maybe the real question is why you are using -Os for code with SSE in it.
Comment 4 Agner Fog 2006-05-11 07:11:28 UTC
Thanks for confirming this bug. If Gcc relies on the stack being aligned then it has to be an official ABI requirement. 

It makes perfect sense to compile the whole program, or some of it, with -Os and also use XMM. -Os can be the optimal option if code cache or data cache is a critical resource. It is also a perfectly justifiable solution to compile the part of the program that contains the innermost loop with -O3 and the rest of the program with -Os. The error also occurs if part of the program is compiled with the Intel C++ compiler, because the Intel people follow the official ABI, which hasn't been updated for many years. The Intel compiler is intended to be 100% binary compatible with Gnu.

Gcc is no longer a hobby project for a limited group of nerds. It is one of the most used compilers in the world and it is used for critical applications. Therefore, you have to be strict about ABI standards. Either the ABI must be changed and made public, or the compiler must be changed so that it doesn't rely on the stack being aligned by 16.

I can find the "SYSTEM V. APPLICATION BINARY INTERFACE. Intel386 Architecture Processor Supplement" at www.caldera.com. It says "DRAFT COPY, March 19, 1997". Nothing indicates that this is the official or the latest version. It says nothing about MMX or XMM. I have documented the things that are not clear from the ABI in http://www.agner.org/assem/calling_conventions.pdf as well as I can. I am going to change this document when this issue is resolved.
Comment 5 H.J. Lu 2006-06-07 15:51:45 UTC
This testcase doesn't use -Os on the code that uses SSE registers:

[hjl@gnu-10 stack]$ cat m.c
#include <stdio.h>
extern char *e1 (void);
int main ()
{
  printf ("%s\n", e1 ());
  return 0;
}
[hjl@gnu-10 stack]$ cat x.c
#include <emmintrin.h>
extern char *e1 (void);
char *e1 (void)
{
  volatile __m128 dummy = _mm_set_ps1(0.f);
  return "OK";
}
[hjl@gnu-10 stack]$ make
gcc -Os   -c -o m.o m.c
gcc -O -msse2   -c -o x.o x.c
gcc -o m m.o x.o
make: *** [all] Segmentation fault
[hjl@gnu-10 stack]$

It calls a function which uses SSE registers.
Comment 6 Agner Fog 2006-06-08 06:27:14 UTC
Comment #5 From hjl confirms my point: The error can occur in an optimized part of the program that uses XMM registers when some other, noncritical, part of the program is compiled with -Os

We need a comment from the ABI people about which solution to choose because the Intel compiler people are working on a fix to make the two compilers compatible.
Comment 7 H.J. Lu 2006-08-03 16:58:13 UTC
Apparently, it was done on purpose:

Comment 8 Agner Fog 2006-08-03 20:20:36 UTC
hjl wrote:
>Apparently, it was done on purpose

Yes, the -Os non-alignment was obviously done on purpose. The problem is that other modules that may be called from the -Os module rely on the stack being aligned by 16. The wrong alignment makes the program crash when xmm registers are used. The alignment must be strictly enforced in the ABI if any function relies on it.
Comment 9 Mark Mitchell 2006-08-20 22:29:59 UTC
It is definitely a bug to change the ABI with -Os.  Since GCC relies on the stack being 16-byte aligned, -Os must in fact honor that requirement.
Comment 10 H.J. Lu 2006-08-21 17:42:32 UTC
I have mixed feelings about this. On one hand, gcc does assume 16-byte stack
alignment. On the other hand, the original ia32 psABI only calls for 4-byte
stack alignment. Requiring 16-byte alignment will make sure the outputs from gcc
will be compatible with each other. But it means that the outputs from gcc
aren't 100% compatible with the outputs from other compilers, which only
enforce 4-byte stack alignment following the original ia32 psABI.

I guess that it may not be feasible to enforce 4-byte stack alignment in gcc
without code degradation.
Comment 11 Agner Fog 2006-08-23 08:04:28 UTC
This problem wouldn't have happened if the ABI had been better maintained. Somebody decides to change the calling convention without properly documenting the change, and somebody else makes another change that is incompatible because the alignment requirement has never made it into the ABI documents.

Let me help you make a decision on this issue by summarizing the pros and cons of 16-byte stack alignment in 32-bit x86 Linux/BSD.

Advantages of enforcing 16-byte stack alignment:
1. The use of XMM code is becoming more common now that all new computers support the SSE2 or higher instruction set. The necessary alignment of XMM variables can be implemented more efficiently when the stack is aligned.

2. Variables of type double (double precision floating point) are accessed more efficiently when aligned by 8. This is easily achieved when the stack is aligned.

3. Function parameters of type double will automatically get the optimal alignment, unless the parameter is preceded by an odd number of smaller parameters (including any 'this' pointer and return pointer). This means that more than 50% of function parameters of type double will be optimally aligned, versus 50% without stack alignment. The C/C++ programmer will be able to ensure optimal alignment by manipulating the order of function parameters.

4. Functions that need to align local variables can do so without using EBP as a stack frame pointer. This frees EBP for other purposes. General purpose registers are a scarce resource in 32-bit mode.

5. 16-byte stack alignment is officially enforced in Intel-based Mac OS X. It is desirable to have identical ABIs for Linux, BSD and Mac. This makes it possible to use the same compilers and the same function libraries for all three platforms (except for the object file format, which can be converted).

6. The stack alignment requires no extra instructions in leaf functions, which are more likely to contain the critical innermost loop than non-leaf functions.

7. The stack alignment requires no extra instructions in a non-leaf function if the function adjusts the stack pointer anyway for the sake of allocating local storage.

8. Stack alignment is already implemented in Gcc and existing code relies on it.
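
The point about function parameters of type double can be checked with a little offset arithmetic. The sketch below assumes that the argument area starts at a 16-byte boundary and that every argument occupies a multiple of 4 bytes, with the first argument at the lowest address; the helper name `arg_is_8_aligned` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* `sizes` holds the size in bytes of each argument, first argument first.
   Returns nonzero if argument `k` starts at an 8-byte-aligned address,
   ASSUMING the argument area itself begins on a 16-byte boundary. */
static int arg_is_8_aligned(const size_t *sizes, size_t k)
{
    size_t offset = 0;
    for (size_t i = 0; i < k; ++i) {
        assert(sizes[i] % 4 == 0);  /* arguments occupy 4-byte slots */
        offset += sizes[i];
    }
    return offset % 8 == 0;
}
```

With sizes {4, 8} (an int followed by a double) the double lands at offset 4 and is misaligned; with {4, 4, 8} it lands at offset 8 and is aligned, which is why reordering parameters can help.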

Disadvantages of enforcing 16-byte stack alignment:
1. A non-leaf function without any stack space allocated for local storage needs one or two extra instructions to conform to the stack alignment requirement.

2. The alignment requirement results in unused space in the stack. This takes up to 12 bytes of extra space in the data cache for each function calling level except the innermost. Assuming that only the innermost three function levels matter in terms of speed, and that the number of unused bytes is 8 on average for all but the innermost function, the total amount of space wasted in the data cache is 16 bytes.

3. The Intel compiler does not enforce stack alignment. However, the Intel people are ready to change this as soon as you Gnu people make a decision on this issue. I am in contact with the Intel people about this issue.

4. Stack alignment is not enforced in 32-bit Windows. Compatibility with the Windows ABI might be desirable.
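
The cache estimate in the list above is simple arithmetic: three calling levels, of which two carry padding, at an assumed 8 bytes each. A sketch with hypothetical helper names, for anyone who wants to plug in other assumptions:

```c
#include <assert.h>

/* Padding needed to round a frame of `frame_bytes` up to a multiple of 16.
   For frames that are multiples of 4 this is 0, 4, 8 or 12. */
static unsigned padding_to_16(unsigned frame_bytes)
{
    return (16u - frame_bytes % 16u) % 16u;
}

/* Total padding over `levels` nested calls when every level except the
   innermost carries `avg_padding` unused bytes (an assumed average). */
static unsigned wasted_bytes(unsigned levels, unsigned avg_padding)
{
    assert(levels >= 1);
    return (levels - 1u) * avg_padding;
}
```

`wasted_bytes(3, 8)` gives the 16 bytes quoted above, and `padding_to_16(20)` gives the 12-byte worst case.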
Comment 12 H.J. Lu 2006-09-08 00:45:54 UTC

*** This bug has been marked as a duplicate of 13685 ***
Comment 13 hjl@gcc.gnu.org 2006-09-11 21:34:17 UTC
Subject: Bug 27537

Author: hjl
Date: Mon Sep 11 21:34:06 2006
New Revision: 116860

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116860

2006-09-11  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/13685
	PR target/27537
	PR target/28621
	* config/i386/i386.c (override_options): Always default to 16
	byte stack boundary.


2006-09-11  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/13685
	* gcc.target/i386/pr13685.c: New test.


Comment 14 hjl@gcc.gnu.org 2006-09-12 02:54:59 UTC
Subject: Bug 27537

Author: hjl
Date: Tue Sep 12 02:54:42 2006
New Revision: 116870

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116870

2006-09-11  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/13685
	PR target/27537
	PR target/28621
	* config/i386/i386.c (override_options): Always default to 16
	byte stack boundary.


2006-09-11  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/13685
	* gcc.target/i386/pr13685.c: New test.


Comment 15 Denis Vlasenko 2007-07-23 00:03:58 UTC
Disadvantages of enforcing 16-byte stack alignment, continued:

* Code to align the stack is generated when we call a function, even when this function isn't going to use SSE, which is the case for ~90% of all functions out there.

* gcc-generated SSE code will still crash if called from code compiled by an older gcc or by a non-gcc compiler which chose not to align the stack. Note that the alternative approach (to align the stack _in the function which needs it_) will not crash if called by code generated by old or new gcc. I like it.
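
At the source level, the "align it in the function which needs it" idea can be approximated by over-allocating a local buffer and aligning a pointer inside it, so the function no longer cares how its caller aligned the stack. This is only a portable sketch with made-up helper names, not what the compiler would emit:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: first 16-byte-aligned address at or after `raw`.
   The buffer behind `raw` needs 15 bytes of slack beyond the payload. */
static void *align_up_16(void *raw)
{
    uintptr_t p = (uintptr_t)raw;
    return (void *)((p + 15) & ~(uintptr_t)15);
}

/* Stores a float in a 16-byte-aligned slot carved out of an over-sized
   local buffer and reads it back. The slot is aligned regardless of the
   stack alignment the caller provided. */
static float aligned_slot_roundtrip(float x)
{
    unsigned char raw[16 + 15];          /* 16-byte payload + slack */
    unsigned char *slot = align_up_16(raw);
    float out;

    assert(((uintptr_t)slot & 15) == 0);
    assert(slot >= raw && slot + 16 <= raw + sizeof raw);
    memcpy(slot, &x, sizeof x);          /* memcpy avoids aliasing issues */
    memcpy(&out, slot, sizeof out);
    return out;
}
```

The cost is the 15 bytes of slack and a couple of extra instructions, paid only by the functions that actually need the alignment.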
Comment 16 Denis Vlasenko 2007-07-23 00:48:14 UTC
You have it reversed here:

"8. Stack alignment is already implemented in Gcc and existing code relies on it."

No, stack alignment is _not_ in the current de-facto i386 Linux ABI, and there is tons of existing object code which will call shiny new 4.2.x-compiled SSE code without aligning the stack first. Kaboom.
Comment 17 Zuxy 2007-08-26 07:58:18 UTC
*** Bug 28069 has been marked as a duplicate of this bug. ***