Bug 40838 - gcc shouldn't assume that the stack is aligned
Summary: gcc shouldn't assume that the stack is aligned
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.4.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords:
: 40985 42513 81646 (view as bug list)
Depends on:
Blocks: 32893 40985 41156
  Show dependency treegraph
 
Reported: 2009-07-23 12:53 UTC by mikulas
Modified: 2021-11-28 00:22 UTC (History)
19 users (show)

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2009-08-06 21:43:34


Attachments
Crash because gcc assumes false stack alignment (465 bytes, text/plain)
2009-07-31 01:04 UTC, mikulas
Details
A patch (1.08 KB, patch)
2009-08-06 21:05 UTC, H.J. Lu
Details | Diff
A patch for gcc 4.4 (1.83 KB, patch)
2009-08-18 04:43 UTC, H.J. Lu
Details | Diff
A bug example for 4.4 patch (154 bytes, text/plain)
2009-09-13 13:58 UTC, mikulas
Details
Another bug in 4.4 patch (133 bytes, text/plain)
2009-09-13 13:59 UTC, mikulas
Details
An updated patch for gcc 4.4 (2.11 KB, patch)
2009-09-19 21:37 UTC, H.J. Lu
Details | Diff
An updated patch for gcc trunk (2.09 KB, patch)
2009-09-19 21:38 UTC, H.J. Lu
Details | Diff
An updated patch for gcc trunk (2.17 KB, patch)
2009-09-20 18:43 UTC, H.J. Lu
Details | Diff
An updated patch for gcc 4.4 (2.19 KB, patch)
2009-09-20 18:44 UTC, H.J. Lu
Details | Diff
A patch for gcc 4.4.1 (2.34 KB, patch)
2009-09-25 00:56 UTC, mikulas
Details | Diff
An updated patch for gcc 4.4 (2.20 KB, patch)
2009-09-26 16:55 UTC, H.J. Lu
Details | Diff
An updated patch for gcc 4.4 (2.36 KB, patch)
2009-09-26 16:58 UTC, H.J. Lu
Details | Diff
An updated patch for gcc 4.4 (2.24 KB, patch)
2009-10-16 00:56 UTC, H.J. Lu
Details | Diff
additional aligning on demand <10.0 (619 bytes, patch)
2020-01-31 15:03 UTC, Dzianis Kahanovich
Details | Diff
additional aligning on demand 10.0 (unsure) (875 bytes, patch)
2020-01-31 15:07 UTC, Dzianis Kahanovich
Details | Diff
additional aligning on demand <10.0 (fixed) (650 bytes, patch)
2020-02-19 02:25 UTC, Dzianis Kahanovich
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description mikulas 2009-07-23 12:53:32 UTC
typedef int v4si __attribute__ ((vector_size (16)));

v4si y(v4si *s3)
{
        return *s3;
}

extern v4si s1, s2;

v4si x(void)
{
        v4si s3 = s1 + s2;
        return y(&s3);
}

And compile it with -O2 -fno-inline -msse2 -fomit-frame-pointer

The variable s3 is stored using unaligned store (movdqu) and loaded using aligned load (movdqa). -mpreferred-stack-boundary=4 doesn't guarantee stack alignment, it only advises that there is stack alignment (the function may be called from OS callback, signal, another library compiled with lesser alignment, etc... --- and i386 mandates only 4-byte stack alignment), so use of movdqa is incorrect. (does GCC ABI mandate that all vector types must be aligned? If so, then movdqa is correct, but storing it on the stack, relying on alignment -mpreferred-stack-boundary=4 is not correct).

Now, if you compile it with -mpreferred-stack-boundary=2, function "x" aligns the stack but uses movdqu to store on the aligned stack, so it generates suboptimal code.
Comment 1 Jakub Jelinek 2009-07-23 12:56:21 UTC
Linux/ix86 ABI says that stack must be 16 byte aligned.  So GCC can rely on it.
Comment 2 Jakub Jelinek 2009-07-23 13:13:37 UTC

*** This bug has been marked as a duplicate of 38496 ***
Comment 3 mikulas 2009-07-23 13:15:22 UTC
What I would propose to fix this and bug #40667:

Each type has required alignment and preferred alignment. Enforced alignment is what is needed to not crash and not violate the ABI, preferred alignment is the alignment that has the best performance. For i386, all the types have enforced alignment 4-byte, except 128-bit SSE type having enforced alignment 16-bytes (the movdqa instruction crashes otherwise). Preferred alignment is 8 for double and 8-byte vector types and 16 for long double and 16-byte vector types.

Now, if the function has some variable with the enforced alignment greater than the ABI standard (4), the stack must be realigned. The ABI mandates 4-byte alignment and the function may be called from anywhere. As an optimization, the realign may be skipped if the function is static and it is proved to be called only from functions with greater or equal enforced alignment and having stack size aligned.

Each function aligns its stack size to -mpreferred-stack-boundary, which basically means "if the stack was aligned before (the most common case), performance will be good". But you can't rely on this for correctness, as in the pathological cases, the stack doesn't have to be aligned.

As an optimization, if you can prove that the function will call only functions manipulating types with preferred alignment at most X and X is lower than -mpreferred-stack-boundary, you can lower stack alignment to X (so that if there's a call graph of functions using only "double", you don't have to align the stack on the default 16 bytes, 8 bytes is sufficient).
Comment 4 mikulas 2009-07-23 13:19:11 UTC
"Linux/ix86 ABI says that stack must be 16 byte aligned."

No it doesn't. There is a plenty of Linux code that doesn't have the stack aligned on 16-byte boundary. (at least anything that was compiled with the old gcc that didn't have -mpreferred-stack-boundary switch).

Please don't change i386 ABI.

AFAIK only MacOSX/x86 enforced aligned stack.
Comment 5 Jakub Jelinek 2009-07-23 13:24:37 UTC
The ABI has changed 8+ years ago, you are coming too late.
Comment 6 H.J. Lu 2009-07-23 13:43:06 UTC

*** This bug has been marked as a duplicate of 39315 ***
Comment 7 mikulas 2009-07-23 13:49:09 UTC
See bug #27537, quoting "GNU/Linux follows the SYSV x86 ABI which is documented, maybe you cannot find it but it does exist. The SYSV x86 ABI says the stack is aligned 4 byte aligned."

That bug seems to reappear.

As Agner noted, 16-byte stack alignment requirement also break compatibility with Intel CC. I found even some part of current glibc that violates this 16-byte alignment (calling push %eax; call exit from the assembler without aligning the stack size).

Another point: if gcc realigns the stack, why then use movdqu to store the values on the stack? That is suboptimal.
Comment 8 Jakub Jelinek 2009-07-23 13:54:23 UTC
Please read Joseph's responses in PR38496.

If you are aware of places in glibc that don't maintain 16 byte stack alignment, please report them.  Certainly calling glibc (or any other default compiler flags compiled) library with non-16 byte aligned stack is a bug.

You are free to come up with a different ABI for the OS, but it will not be the ABI everybody is using.
Comment 9 H.J. Lu 2009-07-23 13:56:21 UTC
(In reply to comment #7)
> 
> Another point: if gcc realigns the stack, why then use movdqu to store the
> values on the stack? That is suboptimal.
> 

This is a dup for PR 39315.
Comment 10 mikulas 2009-07-23 14:36:18 UTC
Jakub: so try that "test $15, %esp; jnz abort" at every function, as I proposed in bug #38496. There are much more places that will trigger this. Just go catch them.
Comment 11 mikulas 2009-07-31 01:00:03 UTC
So I did this experiment whether the stack is aligned in current Linux binaries.
I applied this patch for gcc, so that it crashes on function entry if the function has stack not aligned on 16 bytes.

diff -urp gcc-4.4.1/gcc/varasm.c gcc-4.4.1-test-align/gcc/varasm.c
--- gcc-4.4.1/gcc/varasm.c      2009-03-17 21:18:21.000000000 +0100
+++ gcc-4.4.1-test-align/gcc/varasm.c   2009-07-25 16:18:11.000000000 +0200
@@ -1760,6 +1760,8 @@ assemble_start_function (tree decl, cons
   /* Standard thing is just output label for the function.  */
   ASM_OUTPUT_LABEL (asm_out_file, fnname);
 #endif /* ASM_DECLARE_FUNCTION_NAME */
+  if (!crtl->stack_realign_needed)
+         fputs("\tsubl\t$12, %esp\n\ttestl\t$15, %esp\n\tjz\t99999f\n\tud2a\n99999:\taddl\t$12, %esp\n", asm_out_file);
 }

 /* Output assembler code associated with defining the size of the

--- and the results are terrifying:

Gcc didn't even bootstrap itself. It failed because it calls glibc function obstack_init and it calls back to xmalloc - with misaligned stack. So I compiled gcc without bootstrap and tried to compile glibc-2.7 with it. Glibc compiles its integer-only code with -mpreferred-stack-boundary=2, so I changed it to -mpreferred-stack-boundary=4.

Glibc didn't finish its build either (failed when running some self-compiled scripts), but it at least produced libc.so.

So I tried to preload this libc.so with stack-alignment-checking to various Linux binaries (with LD_PRELOAD) and see what happens.

Out of 95 binaries in /bin/, only 23 succeeded! The remaining crashed because of glibc was called with unaligned stack. (the distribution is up-to-date Debian Lenny).

The non-crashing binaries are:

bzip2recover, cpio, dmesg, fgconsole, fuser, kill, loadkeys, lsmod, lvnet, mktemp, more (displays help only, crashes when attempting to display any file), mount, mountpoint, mt, mt-gnu, nbd-server, pidof, ping, ping6, run-parts, sed, su, tailf, umount

So anyone, who is saying that the stack is aligned to 16 bytes has his mind disconnected from reality. It isn't.

I find it very unreasonable that GCC developers try to declare their own ABI with aligned stack --- and that conflicts with what is being used by the majority of Linux applications. GCC developers are trying to say that 3/4 of programs in /bin/ are wrong because they don't align the stack.

I think you should really align the stack in the functions that do SSE math and don't rely on the fact that the stack is already aligned. It is definitelly easier to use the code for stack reallign than declaring that majority of Linux binaries are BAD and need to be recompiled.

If some scientists needed extreme performance and can't take the penalty of realigning the stack, you can add an option -massume-aligned-stack form them and it is the responsibility of a given scientist that the code compiled with this option is never called back from libc or anything else else. But don't assume stack alignment for general code. It just isn't true.
Comment 12 mikulas 2009-07-31 01:04:39 UTC
Created attachment 18276 [details]
Crash because gcc assumes false stack alignment

Here I'm submitting an example code that, when compiled with gcc 4.4.1 with -O3 -march=pentium3, crashes on Debian Lenny.

Don't close this bug unless this code is working! You can get it working either by modifying gcc to align the stack (IMHO the easier way) or forcing all the distributions to recompile all their binaries because you want to declare new ABI (IMHO harder or impossible).
Comment 13 Jakub Jelinek 2009-07-31 07:12:40 UTC
So, you found a glibc bug, which can be easily fixed by:
--- libc/malloc/Makefile 2009-05-16 19:23:36.000000000 +0200
+++ libc/malloc/Makefile 2009-07-31 09:09:51.760080072 +0200
@@ -104,6 +104,7 @@ $(objpfx)memusagestat: $(memusagestat-mo
 include ../Rules
 
 CFLAGS-mcheck-init.c = $(PIC-ccflag)
+CFLAGS-obstack.c = $(uses-callbacks)
 
 $(objpfx)libmcheck.a: $(objpfx)mcheck-init.o
 -rm -f $@

The deductions you make from it are wrong.
Comment 14 mikulas 2009-07-31 13:54:42 UTC
Jakub: And how many other "bugs" like this are there? 75% of binaries in /bin are "buggy". Do you think it is really sensible to declare that majority of Linux software is buggy?
Comment 15 H.J. Lu 2009-07-31 14:27:22 UTC
It is too late to change now. Even if we align the incoming
stack properly, we still have to align the outgoing stack
to 16byte since the existing binaries which use SSE won't
align the stack. That means we have to align the incoming
stack for all non-leaf functions.  Otherwise, we can't align
the outgoing stack. It is quite expensive.

The good news is we won't make the same mistake for AVX.
We now align the incoming stack as needed for AVX.
Comment 16 mikulas 2009-07-31 15:22:03 UTC
H.J. Lu: No, you only have to align the stack in functions that do 16-byte SSE.

I mean this:

there are two possible reasons for aligning the stack
1) improved performance (double, long double, MMX, 8-byte SSE)
2) avoid crashes (16-byte SSE)

To solve the case 1), it is perfectly reasonable to not align the stack and simply make sure that every function subtracts the stack by multiple of 16-bytes. If, by chance, the code is called from some other code with unaligned stack (like in that obstack example), the user will only suffer lower performance, not crashes.

It is legal to make floating point calculations from obstack allocation callback --- but it is not common. So we don't have to care about performance in this case.

To solve the case 2), you need to manually realign the stack at function prologue. But there are few functions that use SSE in typical desktop/server environment, that's why I'm saying that it won't have big impact on performance.

Anyway, if some scientist needs high 16-bit SSE performance, we can make a flag for him that avoids stack realign --- but don't compile typical desktop/server distribution with this flag, because there ARE cases where the stack is misaligned.

That's what I'm proposing in comment #3: that every type has two alignments, preferred alignment and enforced alignment. Only "enforced alignment" will force stack realign.
Comment 17 mikulas 2009-07-31 15:31:23 UTC
"Even if we align the incoming stack properly, we still have to align the outgoing stack to 16byte"

I'm not opposing it. What I mean is: every function will have stack frame size that is multiple of mpreferred-stack-boundary (16 bytes) --- it is what GCC is doing now. And additionally, there will be stack realign for functions that do 16-byte SSE math.
Comment 18 H.J. Lu 2009-07-31 15:51:43 UTC
(In reply to comment #17)
> "Even if we align the incoming stack properly, we still have to align the
> outgoing stack to 16byte"
> 
> I'm not opposing it. What I mean is: every function will have stack frame size
> that is multiple of mpreferred-stack-boundary (16 bytes) --- it is what GCC is
> doing now. And additionally, there will be stack realign for functions that do
> 16-byte SSE math.
> 

What you are looking for is an option to turn on force_align_arg_pointer
attribute for all functions which use 16byte SSE memory.
Comment 19 mikulas 2009-07-31 16:17:11 UTC
(In reply to comment #18)

Yes. But not an option. Make it default and make it optional to disable the alignment. Make it default, because such option would be useless if all libraries didn't use it.
Comment 20 H.J. Lu 2009-08-06 16:48:41 UTC
(In reply to comment #19)
> (In reply to comment #18)
> 
> Yes. But not an option. Make it default and make it optional to disable the
> alignment. Make it default, because such option would be useless if all
> libraries didn't use it.
> 

It isn't a bad idea. See PR 32893/40985.
Comment 21 H.J. Lu 2009-08-06 21:05:11 UTC
Created attachment 18314 [details]
A patch

Here is a patch to automatically realign stack if any SSE variable
is put on stack.
Comment 22 H.J. Lu 2009-08-06 21:43:33 UTC
A patch is posted at

http://gcc.gnu.org/ml/gcc-patches/2009-08/msg00392.html
Comment 23 mikulas 2009-08-08 17:30:55 UTC
(In reply to comment #21)

Unfortunatelly, that patch is wrong. It aligns when there is some vector type in the function but it doesn't align if the autovectorizer creates SSE instructions. Try that obstack example in comment #12 and you see that the function my_alloc uses 16-byte sse instructions on stack and it doesn't have aligned stack with the patch.
Comment 24 H.J. Lu 2009-08-16 17:37:39 UTC
(In reply to comment #23)
> (In reply to comment #21)
> 
> Unfortunatelly, that patch is wrong. It aligns when there is some vector type
> in the function but it doesn't align if the autovectorizer creates SSE
> instructions. Try that obstack example in comment #12 and you see that the
> function my_alloc uses 16-byte sse instructions on stack and it doesn't have
> aligned stack with the patch.
>

I got

my_alloc:
        pushl   %ebp
        movl    $.LC0, %edx
        movl    %esp, %ebp
        andl    $-16, %esp
        pushl   %ebx

It looks OK to me.

Comment 25 H.J. Lu 2009-08-18 04:43:21 UTC
Created attachment 18393 [details]
A patch for gcc 4.4
Comment 26 H.J. Lu 2009-08-18 04:50:25 UTC
*** Bug 40985 has been marked as a duplicate of this bug. ***
Comment 27 Dzianis Kahanovich 2009-08-18 11:28:44 UTC
(In reply to comment #26)
> *** Bug 40985 has been marked as a duplicate of this bug. ***

(In reply to comment #25)
> Created an attachment (id=18393) [edit]
> A patch for gcc 4.4

Are this patch may conflict with your patch?
http://gcc.gnu.org/bugzilla/attachment.cgi?id=18340&action=view
- equal to 32893 solution.

I found, once you are broke (make regression) 4.4 branch against 32893 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32893#c23 - I found there are your, sorry). May be this is way to suggest more correct sse solution vs. local bugfix for 32893, but while I not understand all - I trying to use both patches. May it be wrong?

PS One more sorry, but I want to know ALL ;)
Comment 28 H.J. Lu 2009-08-18 14:01:28 UTC
(In reply to comment #27)
> (In reply to comment #26)
> > *** Bug 40985 has been marked as a duplicate of this bug. ***
> 
> (In reply to comment #25)
> > Created an attachment (id=18393) [edit]
> > A patch for gcc 4.4
> 
> Are this patch may conflict with your patch?
> http://gcc.gnu.org/bugzilla/attachment.cgi?id=18340&action=vie

This is not mine and isn't needed.
Comment 29 Dzianis Kahanovich 2009-08-19 19:08:36 UTC
(In reply to comment #28)
...
> This is not mine and isn't needed.

OK. New patch working. While only so (tested in seamonkey with all included libs).

Are realigning needed for both states of "TREE_STATIC (decl)"? Now in tree-vectorizer.c for i386 IMHO for both states are "return (alignment <= MAX_OFILE_ALIGNMENT);" (by fact, IMHO). But old Bug 32893 solution was only for false.


Comment 30 mikulas 2009-08-23 19:28:01 UTC
I tested the 4.4 patch and it works fine.
Comment 31 Dzianis Kahanovich 2009-08-27 19:17:54 UTC
Seamonkey still more unstable then with 4.3.3. With system libs, -O3 & sse - ruuning only in "safe-mode". All system rebuilt with 4.4.1 & this patch. There are looks like "seamonkey problem" (and I will add "-mno-sse" into ebuild for 4.4), but 4.3.3 or -mno-sse (all for seamonkey only) solving problem ("4.3.3 regression patch" no result too). IMHO in some situation(s) stack still unaligned. If I will find some more precise - I will say...

PS Also IMHO must be checked with -msseregparm.
Comment 32 mikulas 2009-09-13 13:58:38 UTC
Created attachment 18578 [details]
A bug example for 4.4 patch

Shows a bug in 4.4 patch
Comment 33 mikulas 2009-09-13 13:59:36 UTC
Created attachment 18579 [details]
Another bug in 4.4 patch

Another bug in 4.4 patch.
Comment 34 mikulas 2009-09-13 14:07:03 UTC
So I posted these two examples that show that the patch is insufficient:

1) if the array is embedded in a structure and that structure is on the stack, the stack is not aligned. (if the array were out of structure, it would be)

2) gcc sometimes spills xmm registers over function calls (spilling zero this way is suboptimal, but that's only a performance issue) and the stack is not aligned in this case.

Note, in align-counterexample2.c, there is another bug! Gcc shouldn't assume that arrays "x" and "y" are 16-byte aligned. Arrays "x" and "y" are in common section, that means that they can be declared in any other module and the linker joins these declarations. So, for example, if in another module someone declared int x[100] = { 1, 2, 3, 4 } and the module were compiled with different compiler that doesn't align the array (the ABI allows it) , the linker is forced to use the declaration with the initialization and the array would be misaligned.

We can only assume alignment on the variables that are not in common section, i.e. with -fno-common or explicitly initialized.
Comment 35 H.J. Lu 2009-09-19 21:37:31 UTC
Created attachment 18610 [details]
An updated patch for gcc 4.4
Comment 36 H.J. Lu 2009-09-19 21:38:15 UTC
Created attachment 18611 [details]
An updated patch for gcc trunk
Comment 37 H.J. Lu 2009-09-19 21:38:59 UTC
(In reply to comment #33)
> Created an attachment (id=18579) [edit]
> Another bug in 4.4 patch
> 
> Another bug in 4.4 patch.
> 

This one doesn't need stack alignment.
Comment 38 H.J. Lu 2009-09-19 21:40:27 UTC
(In reply to comment #32)
> Created an attachment (id=18578) [edit]
> A bug example for 4.4 patch
> 
> Shows a bug in 4.4 patch
> 

Please try the updated patch in comment #35
Comment 39 mikulas 2009-09-20 06:30:49 UTC
The updated patch fixes align-counterexample1.c, but not align-counterexample2.c. Note that you must align the stack for all functions that have some SSE operations, because you never know if the registers will be spilled. The generated code is:

.globl f
        .type   f, @function
f:
        pxor    %xmm0, %xmm0
        subl    $28, %esp
        xorl    %eax, %eax
        .p2align 5,,24
.L2:
        movaps  %xmm0, p(%eax)
        addl    $16, %eax
        cmpl    $400, %eax
        jne     .L2
        movaps  %xmm0, (%esp)
        call    g
        movdqa  (%esp), %xmm0
        xorl    %eax, %eax
        .p2align 5,,24
.L3:
        movaps  %xmm0, q(%eax)
        addl    $16, %eax
        cmpl    $400, %eax
        jne     .L3
        addl    $28, %esp
        ret
Comment 40 H.J. Lu 2009-09-20 18:43:49 UTC
Created attachment 18617 [details]
An updated patch for gcc trunk
Comment 41 H.J. Lu 2009-09-20 18:44:12 UTC
Created attachment 18618 [details]
An updated patch for gcc 4.4
Comment 42 H.J. Lu 2009-09-20 18:44:54 UTC
(In reply to comment #39)
> The updated patch fixes align-counterexample1.c, but not
> align-counterexample2.c. Note that you must align the stack for all functions
> that have some SSE operations, because you never know if the registers will be
> spilled. The generated code is:
> 

Please try the new patch in comment #41.
Comment 43 mikulas 2009-09-23 16:28:00 UTC
With the patch from comment #41, my test examples pass but seamonkey is still miscompiled, the function pow5mult still doesn't align the stack and spills xmm0 on it.
Comment 44 H.J. Lu 2009-09-23 16:34:39 UTC
(In reply to comment #43)
> With the patch from comment #41, my test examples pass but seamonkey is still
> miscompiled, the function pow5mult still doesn't align the stack and spills
> xmm0 on it.
> 

Please find a testcase. Thanks.
Comment 45 Dzianis Kahanovich 2009-09-23 18:37:35 UTC
(In reply to comment #41)
> Created an attachment (id=18618) [edit]
> An updated patch for gcc 4.4
> 

Seamonkey still segfault. Still required -mstackrealign.
Comment 46 mikulas 2009-09-25 00:56:53 UTC
Created attachment 18646 [details]
A patch for gcc 4.4.1

I decided to make a patch on my own. Seamonkey works with it. It sometimes aligns unnecessarily but it should never miss an alignment. (someone with more knowledge about gcc than me could refine the patch to make fewer unneeded alignments) I searching the generated seamonkey code for 'movaps.*esp' and it shows that all the functions that store xmm on the stack use stack alignment.
Comment 47 H.J. Lu 2009-09-25 02:31:30 UTC
(In reply to comment #46)
> Created an attachment (id=18646) [edit]
> A patch for gcc 4.4.1
> 
> I decided to make a patch on my own. Seamonkey works with it. It sometimes
> aligns unnecessarily but it should never miss an alignment. (someone with more
> knowledge about gcc than me could refine the patch to make fewer unneeded
> alignments) I searching the generated seamonkey code for 'movaps.*esp' and it
> shows that all the functions that store xmm on the stack use stack alignment.
> 

You can find a testcase as the first step.
Comment 48 mikulas 2009-09-26 04:25:19 UTC
It can be seen from the patch. I don't know how to detect that a structure contains an array with required SSE align, so I realign the stack for all types BLKmode with alignment >= 16. That may also catch structures containing long double with -m128bit-long-double and do unneeded align for them.

Another point where it aligns and doesn't have to is: there are some SSE variables used --- that trigger frame generation to prepare for a possible align --- after register allocation, they are not spilled, so the alignment is not needed --- but there are some other aligned types on the stack (for example floating point; they do not need enforced alignment). Then, my patch simply realigns the stack.

I think the patch is a good hack that may be added to 4.4 so that Gentoo people stop whining that gcc -O3 is unstable. But for 4.5, the stack realign needs to be redesigned. There are other cases ( PR/40667 ) where it triggers stack alignment that is not needed. As I said in comment #3: introduce preferred and enforced alignment for all types will do the right thing.
Comment 49 H.J. Lu 2009-09-26 14:06:30 UTC
(In reply to comment #48)
> It can be seen from the patch. I don't know how to detect that a structure
> contains an array with required SSE align, so I realign the stack for all types

Please find a testcase first. Otherwise, nothing will be done. Thanks.
Comment 50 mikulas 2009-09-26 15:44:40 UTC
> Please find a testcase first. Otherwise, nothing will be done. Thanks.

I don't want anything to be done (unless the patch causes generation of wrong code --- I am not aware of such case).

I am saying that the patch could be included in 4.4 as a quick fix and that 4.5 needs stack alignment redesign. You can't redesign it by incrementally testing against a set of examples. You (or someone else) need to sit down, draft the rules for propagation of preferred and enforced alignment across types down to the stack frame and code it. If you understand what you're doing, you can create test examples on your own. If you don't understand, then don't hack it at all.
Comment 51 H.J. Lu 2009-09-26 16:15:32 UTC
(In reply to comment #50)
> 
> I am saying that the patch could be included in 4.4 as a quick fix and that 4.5
> needs stack alignment redesign. You can't redesign it by incrementally testing
> against a set of examples. You (or someone else) need to sit down, draft the
> rules for propagation of preferred and enforced alignment across types down to
> the stack frame and code it. If you understand what you're doing, you can
> create test examples on your own. If you don't understand, then don't hack it
> at all.
>

The stack realignment works correctly as designed. The problems come from

1. Gcc generates code for 16byte-aligned incoming stack.
2. Gcc generates 16byte-aligned outgoing stack.
3. SSE variables segfault if they aren't 16byte aligned.
4. Some incoming stacks aren't aligned at 16byte, which violates #1 and
may cause #3.

We have to do #2 since many existing binaries need 16byte-aligned
incoming stack. If we assume incoming stack is 4byte aligned, we
have to realign stack for every function due to #2, which isn't
acceptable. To accommodate #4, which isn't strictly required, and
avoid #3 with minimum stack realignment, we need to list all #4
cases which lead to #3. Then we can investigate how to address them.

Comment 52 H.J. Lu 2009-09-26 16:55:37 UTC
Created attachment 18655 [details]
An updated patch for gcc 4.4

Please try this patch for gcc 4.4.
Comment 53 H.J. Lu 2009-09-26 16:58:06 UTC
Created attachment 18656 [details]
An updated patch for gcc 4.4

Oops. Wrong patch.  Trry this one.
Comment 54 mikulas 2009-09-27 09:03:16 UTC
(In reply to comment #51)

For 4.4, the designers held two wrong assumptions:

1) the incoming stack is always aligned on -mincoming-stack-boundary (wrong for functions called from assembler or code generated by other compilers).

2) all the variables must be aligned on their alignment (wrong for double, long double, long long, mmx: the processor may accept them misaligned).

The assumption 1) generates crashing code (for example Seamonkey). The assumption 2) generates suboptimal code (bug #40667).

The assumption 1) can be trivially quashed with parameter -mincoming-stack-boundary=2, then the code will be non-crashing, but you will be hitting problem 2) and the code will be ugly and slow: any function containing double or long double variable will generate code for stack realigning and a frame pointer. (for long long this inefficiency was partially fixed in 4.4.1, but only partially, single long long variables don't trigger the alignment after 4.4.1 but structures with long long do, see bug #40667).

So: to fix problem 1), gcc should assume that the incoming stack is 4-byte aligned. To fix problem 2), instead of single alignment, types and variables should have two alignments: preferred alignment and enforced alignment (so that you don't realign the stack if there is double on it, but you realign it if there is 16-byte SSE).
Comment 55 mikulas 2009-09-27 09:07:36 UTC
"If we assume incoming stack is 4byte aligned, we have to realign stack for every function due to #2, which isn't acceptable."

No, you don't. All you have to do is to allocate the stack frame that is multiple of 16 bytes (gcc does that already).
Comment 56 mikulas 2009-09-27 09:36:45 UTC
As for this "old code that assumes 16-byte alignment": this is wrong code generated by old versions of gcc. It only works as long as it is called from other gcc >= 3 code (call it from gcc < 3, icc or assembler and it crashes). So you don't have to realign the stack to make sure that the code works always (it doesn't, anyway). All you have to do with this old code is to make sure that you don't make things worse --- i.e. if it worked before, it should continue work after redesign. So all you need is to make a stack frame having a multiple of 16-bytes. No realign needed.
Comment 57 Dzianis Kahanovich 2009-10-15 14:29:29 UTC
(In reply to comment #53)
> Created an attachment (id=18656) [edit]
> An updated patch for gcc 4.4
> 
> Oops. Wrong patch.  Trry this one.
> 

Looks good for me. No segfaults while.
Comment 58 mikulas 2009-10-15 20:24:10 UTC
(In reply to comment #53)
> Created an attachment (id=18656) [edit]
> An updated patch for gcc 4.4

Seamonkey is correct with that patch. But it fails with structures containing sse vector, i.e. this. My patch (in comment #46) handles this case correctly (it reallign the stack if any structure on it has alignment at least 16), so should we submit that patch instead?

I think it's wrong to add alignment test to the autovectorizer because vector data types can be created explicitly without autovectorizer.

Mikulas


typedef int v4si __attribute__ ((vector_size (16)));

struct x {
        v4si v;
        v4si w;
};

void y(void *);

v4si x(void)
{
        struct x x;
        y(&x);
}

--- the function "x" should realign the stack but doesn't.
Comment 59 H.J. Lu 2009-10-15 20:54:11 UTC
(In reply to comment #58)
> (In reply to comment #53)
> > Created an attachment (id=18656) [edit]
> > An updated patch for gcc 4.4
> 
> Seamonkey is correct with that patch. But it fails with structures containing
> sse vector, i.e. this. My patch (in comment #46) handles this case correctly
> (it reallign the stack if any structure on it has alignment at least 16), so
> should we submit that patch instead?
> 
> I think it's wrong to add alignment test to the autovectorizer because vector
> data types can be created explicitly without autovectorizer.
> 
> Mikulas
> 
> 
> typedef int v4si __attribute__ ((vector_size (16)));
> 
> struct x {
>         v4si v;
>         v4si w;
> };
> 
> void y(void *);
> 
> v4si x(void)
> {
>         struct x x;
>         y(&x);
> }
> 

Why should gcc align the stack when SSE registers aren't used
at all?
Comment 60 H.J. Lu 2009-10-16 00:56:36 UTC
Created attachment 18805 [details]
An updated patch for gcc 4.4

This patch aligns stack if there is a stack variable with 128bit alignment.
Comment 61 mikulas 2009-10-16 02:10:58 UTC
> Why should gcc align the stack when SSE registers aren't used
> at all?

Because it passes pointer to the structure containing vector entries to someone else who expects it to be aligned.

As for the updated patch --- why does it modify the autovectorizer? Anything that the autovectorizer does can be done manually without the autovectorizer. So, if there is a case where patching the autovectorizer is required to avoid a bug, there is definitely another case, where the bug still persists if the programmer vectorizes the code explicitly.
Comment 62 H.J. Lu 2009-10-16 02:48:40 UTC
(In reply to comment #61)
> 
> As for the updated patch --- why does it modify the autovectorizer? Anything
> that the autovectorizer does can be done manually without the autovectorizer.
> So, if there is a case where patching the autovectorizer is required to avoid a
> bug, there is definitely another case, where the bug still persists if the
> programmer vectorizes the code explicitly.

Show me the testcase. When you create a vector variable on stack,
the hard stack alignment will be updated. Vectorizer may not alwats
put vector variable on stack. That is why I have to track it separately.
Comment 63 hjl@gcc.gnu.org 2009-10-30 14:32:44 UTC
Subject: Bug 40838

Author: hjl
Date: Fri Oct 30 14:32:26 2009
New Revision: 153750

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=153750
Log:
Optimize -mstackrealign.

gcc/

2009-10-30  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/40836
	* cfgexpand.c (expand_stack_alignment): Call update_stack_boundary
	first.  Move assert on stack_alignment_estimated just before
	setting stack_realign_needed.
	(gimple_expand_cfg): Initialize stack_alignment_estimated to 0.
	Don't call update_stack_boundary.

	* config/i386/i386.c (ix86_minimum_incoming_stack_boundary): New.
	(verride_options): Don't check ix86_force_align_arg_pointer here.
	(ix86_function_ok_for_sibcall): Use it.
	(ix86_update_stack_boundary): Likewise.

	* config/i386/i386.h (STACK_REALIGN_DEFAULT): Update comments.

gcc/testsuite/

2009-10-30  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/40838
	* gcc.target/i386/incoming-6.c: New.
	* gcc.target/i386/incoming-7.c: Likewise.
	* gcc.target/i386/incoming-8.c: Likewise.
	* gcc.target/i386/incoming-9.c: Likewise.
	* gcc.target/i386/incoming-10.c: Likewise.
	* gcc.target/i386/incoming-11.c: Likewise.
	* gcc.target/i386/incoming-12.c: Likewise.
	* gcc.target/i386/incoming-13.c: Likewise.
	* gcc.target/i386/incoming-14.c: Likewise.
	* gcc.target/i386/incoming-15.c: Likewise.
	* gcc.target/i386/pr37843-4.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/incoming-10.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-11.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-12.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-13.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-14.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-15.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-6.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-7.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-8.c
    trunk/gcc/testsuite/gcc.target/i386/incoming-9.c
    trunk/gcc/testsuite/gcc.target/i386/pr37843-4.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/testsuite/ChangeLog

Comment 64 hjl@gcc.gnu.org 2009-10-30 15:45:41 UTC
Subject: Bug 40838

Author: hjl
Date: Fri Oct 30 15:45:23 2009
New Revision: 153757

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=153757
Log:
gcc/

2009-10-30  H.J. Lu  <hongjiu.lu@intel.com>

	Backport from mainline:
	2009-10-30  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/40838
	* cfgexpand.c (expand_stack_alignment): Call update_stack_boundary
	first.  Move assert on stack_alignment_estimated just before
	setting stack_realign_needed.
	(gimple_expand_cfg): Initialize stack_alignment_estimated to 0.
	Don't call update_stack_boundary.

	* config/i386/i386.c (ix86_minimum_incoming_stack_boundary): New.
	(verride_options): Don't check ix86_force_align_arg_pointer here.
	(ix86_function_ok_for_sibcall): Use it.
	(ix86_update_stack_boundary): Likewise.

	* config/i386/i386.h (STACK_REALIGN_DEFAULT): Update comments.

gcc/testsuite/

2009-10-30  H.J. Lu  <hongjiu.lu@intel.com>

	Backport from mainline:
	2009-10-30  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/40838
	* gcc.target/i386/incoming-6.c: New.
	* gcc.target/i386/incoming-7.c: Likewise.
	* gcc.target/i386/incoming-8.c: Likewise.
	* gcc.target/i386/incoming-9.c: Likewise.
	* gcc.target/i386/incoming-10.c: Likewise.
	* gcc.target/i386/incoming-11.c: Likewise.
	* gcc.target/i386/incoming-12.c: Likewise.
	* gcc.target/i386/incoming-13.c: Likewise.
	* gcc.target/i386/incoming-14.c: Likewise.
	* gcc.target/i386/incoming-15.c: Likewise.
	* gcc.target/i386/pr37843-4.c: Likewise.

Added:
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-10.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-11.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-12.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-13.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-14.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-15.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-6.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-7.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-8.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/incoming-9.c
    branches/ix86/gcc-4_4-branch/gcc/testsuite/gcc.target/i386/pr37843-4.c
Modified:
    branches/ix86/gcc-4_4-branch/gcc/ChangeLog.ix86
    branches/ix86/gcc-4_4-branch/gcc/cfgexpand.c
    branches/ix86/gcc-4_4-branch/gcc/config/i386/i386.c
    branches/ix86/gcc-4_4-branch/gcc/config/i386/i386.h
    branches/ix86/gcc-4_4-branch/gcc/testsuite/ChangeLog.ix86

Comment 65 H.J. Lu 2009-10-31 16:47:30 UTC
Here are the differences of "-m32 -O3 -msse2 -mfpmath=sse -ffast-math
-funroll-loops" vs. "-m32 -O3 -msse2 -mfpmath=sse -ffast-math -funroll-loops
-mstackrealign" using ix86/gcc-4_4-branch on Intel Core i7:

164.gzip                         -0.604595%
175.vpr                          -0.0831255%
176.gcc                          -0.567698%
181.mcf                          0.142653%
186.crafty                       0.0783699%
197.parser                       -0.355714%
252.eon                          -2.32775%
253.perlbmk                      0.943693%
254.gap                          0.553825%
255.vortex                       0.226978%
256.bzip2                        0.291971%
300.twolf                        0.183936%
SPECint_base2000                 -0.126734%
168.wupwise                      -0.345871%
171.swim                         -1.42753%
172.mgrid                        0.496166%
173.applu                        0.467758%
177.mesa                         0.151791%
178.galgel                       -0.227652%
179.art                          0.338073%
183.equake                       -0.308569%
187.facerec                      0.189798%
188.ammp                         1.00536%
189.lucas                        -2.18097%
191.fma3d                        -1.0162%
200.sixtrack                     0.237906%
301.apsi                         1.00138%
SPECfp_base2000                 -0.121114%
400.perlbench                    0.4%
401.bzip2                        0%
403.gcc                          0%
429.mcf                          0%
445.gobmk                        -0.478469%
456.hmmer                        0%
458.sjeng                        0.452489%
462.libquantum                   -0.689655%
464.h264ref                      -0.331126%
471.omnetpp                      0.483092%
473.astar                        0.666667%
483.xalancbmk                    -0.358423%
SPECint(R)_base2006              0%
410.bwaves                       11.2108%
416.gamess                       -0.543478%
433.milc                         1.37615%
434.zeusmp                       -0.497512%
435.gromacs                      0%
436.cactusADM                    -0.763359%
437.leslie3d                     0.625%
444.namd                         0%
447.dealII                       -0.689655%
450.soplex                       -1.2987%
453.povray                       -0.904977%
454.calculix                     0.621118%
459.GemsFDTD                     -0.625%
465.tonto                        0%
470.lbm                          0.925926%
481.wrf                          -0.588235%
482.sphinx3                      -0.96463%
SPECfp(R)_base2006               0%
Comment 66 H.J. Lu 2009-12-26 22:19:24 UTC
*** Bug 42513 has been marked as a duplicate of this bug. ***
Comment 67 Artem S. Tashkinov 2010-04-17 14:28:51 UTC
Am I right assuming that GCC 4.5 is also affected by this bug? Is this bug going to be resolved?
Comment 68 mikulas 2010-04-20 07:48:04 UTC
gcc 4.5 is affected too. It would be nice if they fixed it.
Comment 69 Artem S. Tashkinov 2010-04-29 02:12:36 UTC
(In reply to comment #64)
> Subject: Bug 40838
> 

This patch is not sufficient, some applications still crash after I've applied it to GCC 4.4 branch (to be more precise gcc-4.4-20100427.tar.bz2).

I still wonder how GCC 4.4.x and 4.5.0 were ever released, to me it's a zero priority bug.
Comment 70 Uroš Bizjak 2010-04-29 06:29:02 UTC
(In reply to comment #69)
> (In reply to comment #64)
> > Subject: Bug 40838
> > 
> 
> This patch is not sufficient, some applications still crash after I've applied
> it to GCC 4.4 branch (to be more precise gcc-4.4-20100427.tar.bz2).

You did use -mstackrealign?

> I still wonder how GCC 4.4.x and 4.5.0 were ever released, to me it's a zero
> priority bug.

So, export CFLAGS="-mstackrealign"

Comment 71 Artem S. Tashkinov 2010-04-29 07:24:11 UTC
(In reply to comment #70)

No, I haven't used -mstackrealign as I presumed that the patch is sufficient - and since you make me sound like I'm wrong, then the patch is also wrong, since GCC must be producing a working code with no extra compilation flags.
Comment 72 Jakub Jelinek 2010-04-29 07:33:20 UTC
There is no agreement on this being actually a bug, -mpreferred-stack-boundary
is actually an ABI changing option and if you use it you are supposed to deal with the things it is causing (such as using -mstackrealign when calling code between the different ABIs).
Comment 73 Uroš Bizjak 2010-04-29 07:50:35 UTC
From the manual:

`-mstackrealign'
     Realign the stack at entry.  On the Intel x86, the `-mstackrealign'
     option will generate an alternate prologue and epilogue that
     realigns the runtime stack if necessary.  This supports mixing
     legacy codes that keep a 4-byte aligned stack with modern codes
     that keep a 16-byte stack for SSE compatibility.  See also the
     attribute `force_align_arg_pointer', applicable to individual
     functions.
Comment 74 Artem S. Tashkinov 2010-04-29 08:29:36 UTC
Guys, you are talking in riddles.

There's a fact: with -msse2 -O2 -m32 flags GCC generate bad code for some properly coded applications, so I wonder what users are supposed to do.

There are already six duplicates of this bug in Wine's bugzilla and several duplicates in Gentoo bugzilla.
Comment 75 Ryan Hill 2010-04-29 22:58:34 UTC
if some libraries, (zlib and fontconfig i've had personal experience with, i've also heard libgcrypt) are compiled with -ftree-vectorize (ie. -O3) on x86 systems supporting SSE2, it causes segfaults in certain packages, usually mozilla-based or wine, when SSE2 instructions requiring 16bit alignment are used on unaligned data.  nothing is being built with -mpreferred-stack-boundary in these cases. this is PR41156, https://bugzilla.redhat.com/489290, https://bugs.gentoo.org/270120.  i'm not convinced this is GCC's problem.  it usually gets traced back as far as something in the mozilla codebase misaligning the stack at which point everyone seems to give up.  i've yet to see an actual testcase, though I've encountered it several times in the wild.
Comment 76 Jesus Cea 2010-05-12 13:00:35 UTC
[Zlib-devel] HEADS UP: Apparent bad compilation under (just released) GCC 4.5.0

http://mail.madler.net/pipermail/zlib-devel_madler.net/2010-May/002267.html
Comment 77 Jesus Cea 2010-05-12 13:14:51 UTC
Discussión in progress:
http://mail.python.org/pipermail/python-dev/2010-May/100044.html
Comment 78 bugzilla-pierre_jasmin 2010-08-09 01:56:01 UTC
I am not exactly sure how to report a bug here - but it seems highly related to this thread (I am pierre@revisionfx.com, since I am not sure if I am auto-subscribed to this thread, and so will get email back about this)

It appears (whatever the version) that this construct in 32b linux (below) does not work after -shared (ld...) - does not work means is not aligned anymore to 16 bytes/128 bits boundary. If I compile a library with such code, and link it as in an executable, it seems to work but if I do it as shared object the alignment is not guaranteed anymore. Anyone has an idea about what to do, if this will be addressed... or GNU just assumes 32b will go away? Or the problem is different.  All this works fine with Mac OSX gcc -m32 compile by the way.

Also, if I compile with Intel CC does that fix the problem for an host that is Intel CC compiled? (As in is it possible that mixed compiler context - host versus dynamic library further cause complications).

#define MAX_CONV_KERNEL_SIZE			4096
class ConvKernel4f  
{
	public:
		
		ConvKernel4f ( );
		~ConvKernel4f( );
 		DATA_ALIGN(float4	mFilterPtr[MAX_CONV_KERNEL_SIZE]);
		
};
Comment 79 bugzilla-pierre_jasmin 2010-08-11 21:26:39 UTC
 > I am not exactly sure how to report a bug here  

Find the answer here:

https://bugs.launchpad.net/sbcl/+bug/539632

" Compile with -mpreferred-stack-boundary=2. This will force GCC to compile code that adheres to the ABI and therefore properly adjusts its own stack pointer."

that works
Comment 80 mikulas 2010-08-17 20:17:36 UTC
Comment #79:

-mpreferred-stack-boundary=2 adheres to the sysv ABI but it doesn't adhere to the Linux ABI (that requires 16-byte alignment), so if you compile anything with -mpreferred-stack-boundary=2, it may crash when entering other dynamic libraries.

-mstackrealign does the right thing, it realigns the stack when needed, but keeps it 16-byte aligned on function output. It should be used.

I would be nice if gcc developers made -mstackrealign default (with an option to turn it off for scientists who need maximum performance and don't care about ABI) so that this ABI madness will finally end when distributions get recompiled with it.
Comment 81 bugzilla-pierre_jasmin 2010-08-17 21:03:08 UTC
(In reply to comment #80)
> Comment #79:
> 
 
> -mstackrealign does the right thing, it realigns the stack when needed, but
> keeps it 16-byte aligned on function output. It should be used.
>  

I don't have that option (-mstackrealign)
It's available in gcc linux from what released version?

Comment 82 mikulas 2010-08-17 21:17:45 UTC
-mstackrealign is available from gcc 4.5.0. For gcc 4.4 you can use my patch GCC-4.4.1-ALIGN-PATCH from this bugzilla or H.J.Lu's last patch. It basically does the same as -mstackrealign (but it doesn't add a command line option, it sets this behavior as default).
Comment 83 bugzilla-pierre_jasmin 2010-08-24 22:09:47 UTC
(In reply to comment #82)
> -mstackrealign is available from gcc 4.5.0. 

So 1) you are right that somehow - mpreferred-stack-boundary=2 does not always work - I found a case where I crash 
2) so I compiled gcc 4.5.1 and recompiled and same (with or without -mstackrealign) 
I am noting there might be a new bug. When (I am a shared library) I call back the host(an exec I run in as a plugin) to check abort I now have a problem with throw.  My throw is in a try/catch but somehow seems going into host space by calling it back makes the throw pass through and the host terminates / abort as it gets a throw it does not know anything about. 
-- I am not equiped right now to test more 32b linux gcc as I don't have a proper host side way of testing this right now. Just raising a flag.
Comment 84 mikulas 2010-08-25 21:27:19 UTC
(In reply to comment #83)

If the bug is not related to stack alignment (i.e. it crashes not on unaligned SSE access), simplify it and file another bugzilla entry.
Comment 85 Artem S. Tashkinov 2011-01-18 21:02:43 UTC
Am I the only one who thinks this bug should be nominated as the first priority GCC 4.6.0 bug?

I don't really care if the fix would be backported to 4.5.x or 4.4.x releases, but at least it would be better to have it fixed for GCC 4.6.0.
Comment 86 H.J. Lu 2011-01-18 21:07:26 UTC
I am in the process of updating i386 psABI to specify 16byte stack
alignment.
Comment 87 Dzianis Kahanovich 2011-03-13 16:56:05 UTC
(In reply to comment #85)
> Am I the only one who thinks this bug should be nominated as the first priority
> GCC 4.6.0 bug?

Some lazy people ;) may use global mstackrealign [patch] instead and remind only if some new package breaks in asm code on mstackrealign. But it is too old known problem to doubts about it priority.

Really I found first package with mstackrealign problem, exclude gcc/libunwind - gst-ffmpeg-0.10.11/libavcodec (in Gentoo, may be other similar ebuilds just filtering sse/cpu options).
Comment 88 Jackie Rosen 2014-02-16 10:01:40 UTC Comment hidden (spam)
Comment 89 H.J. Lu 2017-08-01 17:16:15 UTC
*** Bug 81646 has been marked as a duplicate of this bug. ***
Comment 90 Harald van Dijk 2018-05-05 18:58:55 UTC
Given how many years have passed I expect it's too late for anything to change, but for completeness:

Assuming the stack is aligned breaks existing binaries compiled with old GCC versions, and with GCC 8 this becomes far more visible. Minimal test case: compile

  void exit(int);
  int main(void) { exit(0); }

with GCC 2.8, compile current glibc with GCC 8, and there will be a segfault in glibc's __run_exit_handlers because GCC 2.8 never kept the stack 16-byte-aligned, but GCC 8 does now generate code which assumes it.

Yes, I've actually been running GCC 2.8-compiled binaries that have only now become broken as a result of this. This is not hypothetical. For the moment, I've rebuilt glibc with -mincoming-stack-boundary=2 to handle the problem well enough for my current needs, but it's not a complete solution.
Comment 91 Peter Cordes 2019-10-31 06:00:22 UTC
This bug should be closed as "resolved fixed".  The "fix" was to change the ABI doc and break existing hand-written asm, and old binaries.  This was intentional and resulted in some pain, but at this point it's a done deal.

----

My attempt at a summary of the current state of affairs for 32-bit x86 calling conventions (on Linux and elsewhere):

Yes, the version of the i386 System V ABI used on Linux really did change between gcc2.8 and gcc8.  Those compilers are not ABI-compatible with each other.  This is a known fact.  Hand-written asm that makes function calls with misaligned stack pointers is violating the (updated) ABI, and was also knowingly broken by this change.


(Perhaps unintentionally at first, with stack alignment intended to just provide a performance benefit, not a correctness issue.  But the resolution ended up being to standardize on 16-byte alignment matching x86-64 System V.   Instead of reverting to the old ABI and breaking compat with new binaries that had started to rely on 16-byte incoming alignment, or to add significant overhead to every function that didn't know how both its caller and callee were compiled, i.e. most functions.  Using MOVUPS instead of MOVAPS everywhere wouldn't work well because it would mean no folding of memory operands into ALU instructions: without AVX's VEX encoding,  paddd xmm0, [mem] requires aligned mem.  And existing binaries that rely on incoming 16-byte alignment weren't doing that.)


An earlier comment also mentioned common arrays: the ABI also requires arrays larger than 16 bytes to have 16-byte alignment.

----

Perhaps unnecessary pain for little real benefit: i386 on Linux has been mostly obsolete for a long time, and the inefficient stack-args calling convention was never changed.  It's ironic that Linux broke ABI compat for i386 in the name of more efficient SSE-usage despite not caring to introduce anything like Windows fastcall or vectorcall (efficient register-args calling conventions).

(GCC does have ABI-changing -mregparm=3 and -msseregparm to pass integers in regs, and pass/return FP values in XMM registers (instead of passing on the stack / returning in x87 st0).  But no distros have switched over to using that calling convention for i386 binaries, AFAIK.  The Linux kernel does use regparm for 32-bit kernel builds.)

Even more ironic, probably a lot of 32-bit code is compiled without -msse2 (because one of the main reasons for using 32-bit code is CPUs too old for x86-64, which is about the same vintage as SSE2).  SSE usage can still happen with runtime dispatching in binaries that are compatible with old machines while still being able to take advantage of new ones.


But in most cases, if you want performance you use x86-64 kernel + user-space, or maybe x32 user-space (ILP32 in 64-bit mode) to get modern calling conventions and the benefit of twice as many registers.  x86-64 System V has mandated 16-byte stack alignment from the start.  (I don't know the history, but perhaps i386 code-gen started assuming / depending on it for correctness, not just performance, by accident because of devs being used to x86-64?)

The 32-bit ABI on some other OSes, including i386 *BSD and 32-bit Windows, has *not* changed; presumably gcc there doesn't rely on incoming stack alignment.  (It might try to propagate 16-byte alignment for performance benefits, though.)

My understanding is that i386 MacOS still uses a version of i386 System V that doesn't include the 16-byte stack alignment update, like other *BSDs.


(In reply to Harald van Dijk from comment #90)
> compile
> 
>   void exit(int);
>   int main(void) { exit(0); }
> 
> with GCC 2.8, compile current glibc with GCC 8, and there will be a segfault
> in glibc's __run_exit_handlers because GCC 2.8 never kept the stack
> 16-byte-aligned, but GCC 8 does now generate code which assumes it.
>
> For the moment, I've rebuilt glibc with -mincoming-stack-boundary=2 to handle the problem well enough for my current needs, but it's not a complete solution.

Yes, you need workarounds like this to change modern GCC's ABI back to legacy 4-byte.

Note that you might break atomicity of C11 _Atomic 8-byte objects even outside structs by doing this, if they split across a cache line (Intel) or possibly narrower (AMD) boundary.  But only if they were stack allocated.
Comment 92 Viktor Ostashevskyi 2020-01-13 11:22:44 UTC
I've tried to run some old binaries yesterday (StarOffice 5.1, get it from archive.org) and hit this bug.

What are possible workarounds?
Comment 93 Florian Weimer 2020-01-13 11:59:03 UTC
(In reply to Viktor Ostashevskyi from comment #92)
> I've tried to run some old binaries yesterday (StarOffice 5.1, get it from
> archive.org) and hit this bug.
> 
> What are possible workarounds?

You need to use an operating system which was build with -mstackrealign, such as Fedora.
Comment 94 Viktor Ostashevskyi 2020-01-14 07:36:37 UTC
(In reply to Florian Weimer from comment #93)
> (In reply to Viktor Ostashevskyi from comment #92)
> > I've tried to run some old binaries yesterday (StarOffice 5.1, get it from
> > archive.org) and hit this bug.
> > 
> > What are possible workarounds?
> 
> You need to use an operating system which was build with -mstackrealign,
> such as Fedora.

Indeed, I can confirm that rebuilding 32-bit libraries with '-mstackrealign' on Gentoo helps.

Bug probably can be closed as WONTFIX. Additionally, it would be nice to have this ABI breakage properly documented somewhere (GCC FAQ?).
Comment 95 Dzianis Kahanovich 2020-01-15 13:31:42 UTC
Just FYI. Novadays, on my Thinkpad tablet with Atom (32 bit userspace Gentoo), I globally replace patch/-mstackrealign to "-fvect-cost-model=cheap -fsimd-cost-model=cheap -malign-data=cacheline" and all works fine for -O3 +. (This is dirty example, as cacheline for some old SSE CPUs are different, etc).
Comment 96 Viktor Ostashevskyi 2020-01-16 10:41:30 UTC
Honestly, I don't see how your compiler flags could help. cost-model=cheap
is default, data-alignment doesn't change incoming stack alignment.

ср, 15 січ. 2020, 14:31 користувач mahatma at eu dot by <
gcc-bugzilla@gcc.gnu.org> пише:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838
>
> --- Comment #95 from Dzianis Kahanovich <mahatma at eu dot by> ---
> Just FYI. Novadays, on my Thinkpad tablet with Atom (32 bit userspace
> Gentoo),
> I globally replace patch/-mstackrealign to "-fvect-cost-model=cheap
> -fsimd-cost-model=cheap -malign-data=cacheline" and all works fine for -O3
> +.
> (This is dirty example, as cacheline for some old SSE CPUs are different,
> etc).
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
Comment 97 Dzianis Kahanovich 2020-01-16 15:41:20 UTC
No. Looking into gcc/opts.c - "-O3 optimizations" section - line:
{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC },

- so, for -O3 it's "dynamic". Then, RTFM, "cheap" more cares about aligning. But anymore, I not try to rebuild 32bit "world" without ANY workaround, so all still dirty ;)

PS For some options configuration behaviour still non-linear, so queryng "gcc -Q ..." still unsafe to check some defaults...

(In reply to Viktor Ostashevskyi from comment #96)
> Honestly, I don't see how your compiler flags could help. cost-model=cheap
> is default, data-alignment doesn't change incoming stack alignment.
> 
> ср, 15 січ. 2020, 14:31 користувач mahatma at eu dot by <
> gcc-bugzilla@gcc.gnu.org> пише:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838
> >
> > --- Comment #95 from Dzianis Kahanovich <mahatma at eu dot by> ---
> > Just FYI. Novadays, on my Thinkpad tablet with Atom (32 bit userspace
> > Gentoo),
> > I globally replace patch/-mstackrealign to "-fvect-cost-model=cheap
> > -fsimd-cost-model=cheap -malign-data=cacheline" and all works fine for -O3
> > +.
> > (This is dirty example, as cacheline for some old SSE CPUs are different,
> > etc).
> >
> > --
> > You are receiving this mail because:
> > You are on the CC list for the bug.
Comment 98 Dzianis Kahanovich 2020-01-16 15:55:06 UTC
fix: "I not try to rebuild 32bit "world" without ANY workaround" - on modern gcc (now all under 9.2). Previous experiments was times & versions ago, so many other new factors/fixes can solve most issues.
Comment 99 Dzianis Kahanovich 2020-01-16 16:17:29 UTC
PPS About some hidden thinks/things. In pure theory. "*cost-model=cheap" can reduce SSE usage, -mstackrealign - can increase function prolog/epilog overhead. In my case - x7-Z8700 CPU have 2 FPU cores for 4 CPU cores (silvermont-1 have even less FPUs), so solution looks sure better then "-mstackrealign". But on some other CPUs something may be else and need to be tested about performance.
Comment 100 Dzianis Kahanovich 2020-01-31 15:03:31 UTC
Created attachment 47753 [details]
additional aligning on demand <10.0

Finally (for me), if somebody think to patch by  H.J. Lu is not enough, there are my patch to force "-fvect-cost-model=cheap -fsimd-cost-model=cheap" or "-mstackrealign" on demand. Default - first, as no abi violation, but if defined ENABLE_STACKREALIGN_ABI_VIOLATION=1 - first choice will be "-mstackrealign". This is for gcc <10.0 and verifyed.
Comment 101 Dzianis Kahanovich 2020-01-31 15:07:21 UTC
Created attachment 47754 [details]
additional aligning on demand 10.0 (unsure)

This is same for gcc 10.0 and not fully verifyed.

It MUST work in gcc 10.0, but in current git options helps show nothing changed:
gcc -Q -O3 -m32 -march=core2 --help=target --help=optimizers |grep 'stackrealign\|cost-model'

Looks like deep options behaviour rework in progress.
Comment 102 Dzianis Kahanovich 2020-02-19 02:25:02 UTC
Created attachment 47874 [details]
additional aligning on demand <10.0 (fixed)

Sorry (choked), playing with 10.0 I lost mind... 9.2 patch was broken. There are  fixed.