Bug 54412 - minimal 32-byte stack alignment with -mavx on 64-bit Windows
Summary: minimal 32-byte stack alignment with -mavx on 64-bit Windows
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.7.1
: P3 normal
Target Milestone: ---
Assignee: Eric Botcazou
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-30 01:10 UTC by R Copley
Modified: 2019-09-08 09:44 UTC (History)
11 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-09-22 00:00:00


Attachments
Self-contained C source, with AVX alignment bug on Windows (170 bytes, text/x-csrc)
2012-08-30 01:10 UTC, R Copley
Details
As before, but with explicitly 32-byte aligned variables (177 bytes, text/plain)
2013-09-10 17:26 UTC, R Copley
Details
Assembly-language code compiled from attachment 1 (425 bytes, text/plain)
2013-09-10 17:38 UTC, R Copley
Details
Slightly modified testcase (137 bytes, text/plain)
2014-09-20 06:08 UTC, Roland Schulz
Details
Test source for unaligned pass-by-value crash (227 bytes, text/plain)
2019-04-10 14:49 UTC, Dmitry Kazakov
Details

Note You need to log in before you can comment on or make changes to this bug.
Description R Copley 2012-08-30 01:10:07 UTC
Created attachment 28103 [details]
Self-contained C source, with AVX alignment bug on Windows

Code generated by GCC 4.7.1 for the Windows x86_64-w64-mingw32 target, with "-mavx", can segfault due to alignment errors when the 32-byte ymm registers are spilled onto the stack. May I please submit a feature request for 32-byte stack alignment on this target where necessary?

Compiled for Windows with "gcc -O0 -m64 -mavx bug.c" using GCC 4.7.1 with the MingGW W64 toolchain, the attached program segfaults. Specifically, it uses vmovapd to copy the value of %ymm0 to a location on the stack before calling f(), but doesn't align the location to 32 bytes as required by that instruction. In contrast, the generated code for Linux (using GCC 4.6.3 from Ubuntu) does explicitly align the stack to 32 bytes.

The lack of stack alignment on Windows has been noted before; see for
example http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49001 and
http://stackoverflow.com/questions/5983389.
Comment 1 Kai Tietz 2013-09-10 10:29:57 UTC
MS' abi doesn't allow this.  So I doubt we will be able to implement that for this target.  If we want to re-align stack on function-base we will run into troubles with SEH-information.
Doesn't it work to align explicit the variable itself?
Comment 2 R Copley 2013-09-10 17:26:43 UTC
Created attachment 30793 [details]
As before, but with explicitly 32-byte aligned variables
Comment 3 R Copley 2013-09-10 17:38:21 UTC
Created attachment 30794 [details]
Assembly-language code compiled from attachment 1

Compiled with GCC 4.7.2 from the MinGW-w64 toolchain.
Compile command: "gcc -O0 -m64 -mavx -S bug1.c -o bug1.s".
gcc -v output:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=c:/mingw64/bin/../libexec/gcc/x86_64-w64-mingw32/4.7.2/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: /home/ruben/mingw-w64/src/gcc/configure --host=x86_64-w64-mingw32 --build=x86_64-linux-gnu --target=x86_64-w64-mingw32 --with-sysroot=/home/ruben/mingw-w64/mingw64mingw64/mingw64 --prefix=/home/ruben/mingw-w64/mingw64mingw64/mingw64 --with-gmp=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-mpfr=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-mpc=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-ppl=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-cloog=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --disable-ppl-version-check --disable-cloog-version-check --enable-cloog-backend=isl --with-host-libstdcxx='-static -lstdc++ -lm' --enable-shared --enable-static --enable-threads=win32 --enable-plugins --disable-multilib --enable-languages=c,lto,c++,objc,obj-c++,fortran,java --enable-libgomp --enable-fully-dynamic-string --enable-libstdcxx-time --disable-nls --disable-werror --enable-checking=release --with-gnu-as --with-gnu-ld --disable-win32-registry --disable-rpath --disable-werror --with-libiconv-prefix=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-pkgversion=rubenvb-4.7.2-release --with-bugurl=mingw-w64-public@lists.sourceforge.net CC= CFLAGS='-O2 -march=nocona -mtune=core2 -fomit-frame-pointer -momit-leaf-frame-pointer' LDFLAGS=
Thread model: win32
gcc version 4.7.2 (rubenvb-4.7.2-release)
Comment 4 R Copley 2013-09-10 17:49:29 UTC
(In reply to Kai Tietz from comment #1)
> MS' abi doesn't allow this.  So I doubt we will be able to implement that
> for this target.  If we want to re-align stack on function-base we will run
> into troubles with SEH-information.

You might be right, I'm not sure. Are you aware that on 64-bit Windows, SEH is table-based, not frame-based (see, e.g., http://www.osronline.com/article.cfm?article=469)?

> Doesn't it work to align explicit the variable itself?

No (see attachments 2 and 3). If I understand correctly, the alignment specification is redundant anyway, because the variables are supposed to be naturally aligned, on their size.

Assembling attachment 3 [details] with "-g" and running it in gdb gives:

Program received signal SIGSEGV, Segmentation fault.
main () at bug1.s:46
46              vmovapd %ymm0, -96(%rbp)

Thanks.
Comment 5 Roland Schulz 2014-09-03 21:17:33 UTC
This seems to me to be a duplicate of 49001.
Comment 6 R Copley 2014-09-04 22:49:36 UTC
As I mentioned in the description, this request was indeed related to that bug.
The test-case no longer crashes with recent MinGW-W64 toolchains (GCC 4.9.1).
Comment 7 Roland Schulz 2014-09-05 01:42:03 UTC
For me the problem isn't fixed with gcc 4.9.1. I tried two build a) http://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win32/Personal%20Builds/mingw-builds/installer/mingw-w64-install.exe/download and b) http://nuwen.net/mingw.html. Did you use a special distribution or special flags if you compiled gcc yourself?
Comment 8 R Copley 2014-09-05 18:40:22 UTC
No, I use the mingw-builds distro too.

gcc --version
gcc (x86_64-win32-seh-rev0, Built by MinGW-W64 project) 4.9.1

Bizarrely, the attached program exits with a random error code unless I add a "return 0;" statement to the end of the main function.
But it doesn't segfault.
Comment 9 R Copley 2014-09-05 18:44:19 UTC
Heh, sorry. I don't really know C, I assumed it had an implicit "return 0;" like C++. Apparently C99 has this but earlier C standards do not. So, not bizarre at all.
Comment 10 Roland Schulz 2014-09-20 06:08:30 UTC
Created attachment 33520 [details]
Slightly modified testcase

This slightly modified testcase in which the return value isn't stored, still segfaults for me. With the 32bit mingw64 binary ((i686-win32-dwarf-rev1, Built by MinGW-W64 project) 4.9.1) it is OK, but with the 64bit binary ((x86_64-win32-seh-rev1, Built by MinGW-W64 project) 4.9.1) it segfaults.
Comment 11 R Copley 2014-09-21 01:08:15 UTC
On 20 September 2014 07:08, roland at rschulz dot eu
<gcc-bugzilla@gcc.gnu.org> wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412
>
> --- Comment #10 from Roland Schulz <roland at rschulz dot eu> ---
> Created attachment 33520 [details]
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33520&action=edit
> Slightly modified testcase
>
> This slightly modified testcase in which the return value isn't stored, still
> segfaults for me. With the 32bit mingw64 binary ((i686-win32-dwarf-rev1, Built
> by MinGW-W64 project) 4.9.1) it is OK, but with the 64bit binary
> ((x86_64-win32-seh-rev1, Built by MinGW-W64 project) 4.9.1) it segfaults.

Confirmed (with the same compiler, in the mingw-builds toolchain). I
compiled your testcase with command "gcc -O0 -g -ggdb -m64 -mavx
bug.c". It segfaults on the instruction marked "=>" below.

(gdb) disassemble /m
Dump of assembler code for function f:
6       {
   0x00000000004014f0 <+0>:     push   %rbp
   0x00000000004014f1 <+1>:     mov    %rsp,%rbp
   0x00000000004014f4 <+4>:     mov    %rcx,0x10(%rbp)
   0x00000000004014f8 <+8>:     sub    $0x40,%rsp
   0x00000000004014fc <+12>:    mov    %rsp,%rax
   0x00000000004014ff <+15>:    add    $0x1f,%rax
   0x0000000000401503 <+19>:    shr    $0x5,%rax
   0x0000000000401507 <+23>:    shl    $0x5,%rax

7         v4d x __attribute__ ((aligned (32))) = { 1.0, 2.0, 3.0, 4.0, };
   0x000000000040150b <+27>:    vmovapd 0x2aed(%rip),%ymm0        # 0x404000
   0x0000000000401513 <+35>:    vmovapd %ymm0,(%rax)

8         return x;
   0x0000000000401517 <+39>:    mov    0x10(%rbp),%rdx
   0x000000000040151b <+43>:    vmovapd (%rax),%ymm0
=> 0x000000000040151f <+47>:    vmovapd %ymm0,(%rdx)

9       }
   0x0000000000401523 <+51>:    mov    0x10(%rbp),%rax
   0x0000000000401527 <+55>:    mov    %rbp,%rsp
   0x000000000040152a <+58>:    pop    %rbp
   0x000000000040152b <+59>:    retq

End of assembler dump.
(gdb) print $rdx % 32
$1 = 16
Comment 12 Kai Tietz 2015-09-22 10:51:28 UTC
It is good to hear that issue is fixed for 32-bit.  But for 64-bit - as I already explained in comment above - this issue isn't fixable for stack-variables.

The problem is that for x64 ABI we are tighten bound to SEH-prologue information, and this can't express alignment-operations.  The x64 ABI guarantee 16 byte alignment on function entry, therefore sse 128-bit operations are possible to be placed fully aligned on stack, but higher alignment is simply not expressible.

Therefore I will need to set this bug to suspended.  If this information gets in future extended to allow such prologue-information we need for alignment, then we will be able to fix that.  So long it is suspended.
Comment 13 Roland Schulz 2015-09-23 20:15:02 UTC
But this problem is limited to GCC. ICC, Clang and MSVC don't have the problem with compiling 64bit AVX code. Thus they must have some kind of work-around for ABI and GCC should be able to use a work-around too (at least in theory).
Comment 14 chi 2016-01-13 22:11:06 UTC
A solution would be to use unaligned loads and stores to stack variables for 256-bit variables and spilled registers. Likely the other compilers are doing this to make it work. 

I would really appreciate such a solution.
Comment 15 Joseph Coffland 2016-06-21 23:00:06 UTC
After compiling and running the test case, I can confirm that this bug still exists in ``gcc (Debian 5.3.1-11) 5.3.1 20160307``.  It crashes both under Wine and 64-bit Windows 7.

I would love to see this fixed.  It's the only thing keeping me from building all of the Folding@home software for Windows under Linux.  We need AVX for our protein folding simulations.  The extra performance gained by using AVX is significant.  My other options are clang or building on Windows using MSVC or Intel compilers.

```
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 5.3.1-11' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --with-arch-32=i586 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.3.1 20160307 (Debian 5.3.1-11) 
```
Comment 16 Royi 2018-04-09 06:42:59 UTC
Hello,
More details about this bug can be seen at (With simple test case):

https://github.com/Alexpux/MSYS2-packages/issues/1209#issuecomment-379576367

Also on MinGW64 discussion:

https://sourceforge.net/p/mingw-w64/mailman/message/36287627/

Regarding Kai Tietz's comment:

https://stackoverflow.com/a/30929086/195787

Any chance this gets fixed?

Thank You.
Comment 17 Eric Botcazou 2018-04-09 09:01:03 UTC
Moving again to NEW.  Clang does indeed realign the stack a la GCC, which is OK in simple cases like this with SEH, i.e. when no DRAP is required (MSVC does things totally backwards, since it realigns the frame instead of the stack but we just cannot do that in GCC).
Comment 18 Eric Botcazou 2018-04-09 09:06:44 UTC
Investigating.  In particular, it would be good to know what Clang does when there is also a call to alloca in the problematic function.
Comment 19 Royi 2018-04-10 04:17:12 UTC
This comment could be important:

https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64?noredirect=1#comment86499640_30928265.

Hopefully you'll find a way to bring AVX to Windows 64 using GCC.

Thank You.
Comment 20 Eric Botcazou 2018-04-10 06:23:50 UTC
> This comment could be important:
> 
> https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-
> stack-alignment-required-for-avx-on-windows-
> x64?noredirect=1#comment86499640_30928265.

As already said, MSVC does something completely different (it realigns the frame instead of the stack) and we cannot do that; the model must be Clang instead.
Comment 21 Liu Hao 2018-04-18 13:00:24 UTC
This comment could be important:

<https://github.com/Alexpux/MSYS2-packages/issues/1209#issuecomment-379576367>

> mstorsjo commented 10 days ago
> However, this only seems to be an issue when passing such variables by value. Local variables seem to be properly aligned even with GCC:

If the `__m256` in question in the original post was made to pass by reference, the crash would go away. From the assembly code following that reply we can also conclude that, it is not the impossibility of realigning the stack during run time that is the issue (because RSP was aligned in that snippet of code and I believe that code was correct). It is GCC does not realign the stack at all that is the issue.
Comment 22 Royi 2018-08-28 08:09:15 UTC
Hello,

Any progress on this on GCC 8.x?

We really want GCC + AVX on Windows.
Comment 23 Yichao Yu 2019-02-28 04:48:56 UTC
> It is GCC does not realign the stack at all that is the issue.

I hit another related issue that might confirm this as well.

I noticed this when I tried to manually align the stack with inline assembly.

C++ code reduced from my test case,

```
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f(__m256d x, uint32_t a, const double *p)
{
    __m256d res;
    asm volatile ("vxorpd %0, %0, %0" :
                  "=x"(res), "+x"(x), "+r"(a), "+r"(p) ::
                  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
                  "r11", "rbp");
    return res;
}

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f2(__m256d x, uint32_t a, const double *p)
{
    __m256d res;
    asm volatile ("vxorpd %0, %0, %0" :
                  "=x"(res), "+x"(x), "+r"(a), "+r"(p) ::
                  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
                  "r11", "rbp");
    return res;
}

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f(__m256d x, __m256d y, __m256d z,
                                    uint32_t a, const double *p)
{
    __m256d res;
    asm volatile ("vxorpd %0, %0, %0" :
                  "=x"(res), "+x"(x), "+x"(y), "+x"(z), "+r"(a), "+r"(p) ::
                  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
                  "r11", "rbp");
    return res;
}

const double points[] = {0, 0.1, 0.2, 0.6};

__attribute__((target("avx"))) void test_avx()
{
    f(__m256d{0, 0, 0, 0}, __m256d{0, 0, 0, 0},
                           __m256d{0, 0, 0, 0}, 4, points);
    f(__m256d{0, 0, 0, 0}, 4, points);
}

__attribute__((target("avx"))) void test_avx2()
{
    f2(__m256d{0, 0, 0, 0}, 4, points);
}

static void call_aligned_stack(void (*p)(void))
{
    asm volatile ("movq %%rsp, %%rbp\n"
                  "andq $-64, %%rsp\n"
                  "subq $64, %%rsp\n"
                  "callq *%0\n"
                  "movq %%rbp, %%rsp\n"
                  :: "r"(p)
                  : "memory", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rbp");
}

int main()
{
    call_aligned_stack(test_avx);
    fprintf(stderr, "aaaa\n");
    fflush(stderr);
    call_aligned_stack(test_avx2);
    return 0;
}
```

(The `fprintf` is there only to make it easier to see when the crash happens.) The stack alignment code makes sure that the stack is aligned to 64bytes before making the `call`, which is verified in the debugger, however, when compiled with GCC 8.2.1 on msys2 (using the mingw-w64-x86_64-gcc package) the `test_avx` function is happy while `test_avx2` function is not.

Looking at the generated code, for the crashing function:

```
00000000004015c0 <_Z9test_avx2v>:
  4015c0:	48 83 ec 68          	sub    $0x68,%rsp
  4015c4:	c5 f9 57 c0          	vxorpd %xmm0,%xmm0,%xmm0
  4015c8:	4c 8d 0d 51 7a 00 00 	lea    0x7a51(%rip),%r9        # 409020 <_ZL6points>
  4015cf:	41 b8 04 00 00 00    	mov    $0x4,%r8d
  4015d5:	48 8d 4c 24 40       	lea    0x40(%rsp),%rcx
  4015da:	48 8d 54 24 20       	lea    0x20(%rsp),%rdx
  4015df:	c5 fd 29 44 24 20    	vmovapd %ymm0,0x20(%rsp)
  4015e5:	c5 f8 77             	vzeroupper 
  4015e8:	e8 a3 ff ff ff       	callq  401590 <_Z2f2Dv4_djPKd>
  4015ed:	90                   	nop
  4015ee:	48 83 c4 68          	add    $0x68,%rsp
  4015f2:	c3                   	retq   
```

which tries to write with 32byte alignment with a stack offset from the initial call instruction: -8 - 0x68 + 0x20 = -80.

OTOH, for the "good" function,

```
0000000000401640 <_Z8test_avxv>:
  401640:	57                   	push   %rdi
  401641:	56                   	push   %rsi
  401642:	53                   	push   %rbx
  401643:	48 81 ec b0 00 00 00 	sub    $0xb0,%rsp
  40164a:	c5 d9 57 e4          	vxorpd %xmm4,%xmm4,%xmm4
  40164e:	48 8d 3d cb 79 00 00 	lea    0x79cb(%rip),%rdi        # 409020 <_ZL6points>
  401655:	48 8d 74 24 70       	lea    0x70(%rsp),%rsi
  40165a:	4c 8d 4c 24 30       	lea    0x30(%rsp),%r9
  40165f:	48 89 7c 24 28       	mov    %rdi,0x28(%rsp)
  401664:	48 8d 9c 24 90 00 00 	lea    0x90(%rsp),%rbx
  40166b:	00 
  40166c:	4c 8d 44 24 50       	lea    0x50(%rsp),%r8
  401671:	48 89 f2             	mov    %rsi,%rdx
  401674:	c5 fd 29 64 24 70    	vmovapd %ymm4,0x70(%rsp)
  40167a:	48 89 d9             	mov    %rbx,%rcx
  40167d:	c5 fd 29 64 24 50    	vmovapd %ymm4,0x50(%rsp)
  401683:	c5 fd 29 64 24 30    	vmovapd %ymm4,0x30(%rsp)
  401689:	c7 44 24 20 04 00 00 	movl   $0x4,0x20(%rsp)
  401690:	00 
  401691:	c5 f8 77             	vzeroupper 
  401694:	e8 67 ff ff ff       	callq  401600 <_Z1fDv4_dS_S_jPKd>
  401699:	c5 d9 57 e4          	vxorpd %xmm4,%xmm4,%xmm4
  40169d:	49 89 f9             	mov    %rdi,%r9
  4016a0:	48 89 f2             	mov    %rsi,%rdx
  4016a3:	41 b8 04 00 00 00    	mov    $0x4,%r8d
  4016a9:	48 89 d9             	mov    %rbx,%rcx
  4016ac:	c5 fd 29 64 24 70    	vmovapd %ymm4,0x70(%rsp)
  4016b2:	c5 f8 77             	vzeroupper 
  4016b5:	e8 a6 fe ff ff       	callq  401560 <_Z1fDv4_djPKd>
  4016ba:	90                   	nop
  4016bb:	48 81 c4 b0 00 00 00 	add    $0xb0,%rsp
  4016c2:	5b                   	pop    %rbx
  4016c3:	5e                   	pop    %rsi
  4016c4:	5f                   	pop    %rdi
  4016c5:	c3                   	retq   
```

The stack offset for the 32bytes aligned access is, -8 - 8 * 3 - 0xb0 + 0x70 = -96, which is different from the previous one by 16 so it seems that GCC isn't even consistent with itself on what stack alignment it expect.
Comment 24 Yichao Yu 2019-02-28 04:50:26 UTC
Oh, and the test case above was compiled with -O3 (and -g -Wall -Wextra).
Comment 25 Dmitry Kazakov 2019-04-10 14:47:42 UTC
Hi, all!

I would like to add one more test file, related to the problem. If GCC tries to call a function, that accepts a __m256 register as a parameter, it unloads this parameter into the stack using an **aligned** move (vmovaps), but the alignment guarantee on Windows is only 16-byte. It means that the application will crash because of unaligned memory access.

Affected versions: GCC 7.3.0 (MinGW64), GCC 8.1.0 (MinGW64)

Here is the testing source (see also in an attachment):

#include <intrin.h>

struct X { 
alignas(32) __m256 d;
};

void g1(X);
void g2(const X&);
void g3(const void *);

void f(float *ptr) {
    X x = {_mm256_load_ps(ptr)};
    g1(x);  // BUG: passes via unaligned (whatever rsp alignment is) stack
    g2(x);  // OK: passes via aligned stack location
    g3(&x); // OK: passes via aligned stack location
}


Compiled result (-O2 -march=skylake):

_Z1fPf:
.LFB5135:
	pushq	%rbx
	.seh_pushreg	%rbx
	addq	$-128, %rsp
	.seh_stackalloc	128
	.seh_endprologue
	vmovaps	(%rcx), %ymm0
	leaq	95(%rsp), %rbx
	leaq	32(%rsp), %rcx
	andq	$-32, %rbx
	vmovaps	%ymm0, (%rbx)    # %rbx is properly aligned 
	vmovaps	%ymm0, 32(%rsp)  # %rsp may be unaligned
	vzeroupper
	call	_Z2g11X
	movq	%rbx, %rcx
	call	_Z2g2RK1X
	movq	%rbx, %rcx
	call	_Z2g3PKv
	nop
	subq	$-128, %rsp
	popq	%rbx
        ret

Related bug in Vc library: https://github.com/VcDevel/Vc/issues/241
Related bug in Krita: https://bugs.kde.org/show_bug.cgi?id=406209
Comment 26 Dmitry Kazakov 2019-04-10 14:49:07 UTC
Created attachment 46133 [details]
Test source for unaligned pass-by-value crash

Test file for the comment above
Comment 27 Dmitry Kazakov 2019-04-10 14:52:23 UTC
As a workaround, one can either use __attribute__((always_inline)) for *all* the functions accepting __m256 or pass *all* arguments by const-ref. Const-ref arguments are passed correctly.
Comment 28 John Platts 2019-08-25 15:47:47 UTC
The correct way to align the stack to a 32-byte or 64-byte boundary on 64-bit Windows is to use a frame pointer in a function that requires stack realignment and then realign the stack to the required alignment once the frame pointer is set and all of the non-volatile registers used in the function are saved.

class Avx2VectorGenerator {
public:
    virtual __m256i NextVector() = 0;
};

__m256i Example_AVX2_Func(Avx2VectorGenerator* generator, size_t iterations);

Example_AVX2_Func:
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rbx
    .seh_pushreg %rbx
    pushq %rdi
    .seh_pushreg %rdi
    movq %rsp, %rbp
    .seh_setframe %rbp, 0
    .seh_endprologue
    
    /* Set rbx to generator and rdi to iterations */
    movq %rcx, %rbx
    movq %rdx, %rdi

    /* It is okay to allocate additional stack memory */
    /* and re-align the stack pointer outside of the */
    /* SEH prologue as there is a frame pointer in this */
    /* function */
    subq $64, %rsp
    andq $-32, %rsp
    
    /* Zero out the result vector */
    vpxor %ymm0, %ymm0, %ymm0

    test %rdi, %rdi
    jz .loop_complete
.loop_iteration_start:
    /* Save the result vector to 32(%rsp) */
    vmovdqa 32(%rsp), ymm0

    /* Move generator into rcx */
    movq %rbx, %rcx
    /* Move the pointer to the NextVector() virtual member func */
    /* into rax */
    movq (%rbx), %rax
    /* Call generator->NextVector() */
    call *(%rax)

    /* Add the result of generator->NextVector() to the result vector */
    vpaddb 32(%rsp), %ymm0, %ymm0

    /* Decrement iterations by 1 */
    sub $1, %rdi

    /* Jump back to the beginning of the loop if iterations is non-zero */
    jnz .loop_iteration_start
.loop_complete:
    lea (%rbp), %rsp
    pop %rdi
    pop %rbx
    pop %rbp
    ret
    .seh_endproc
Comment 29 Yichao Yu 2019-08-25 16:00:25 UTC
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412#c25

GCC is fully capable of aligning the stack. It just seems that different part of it disagrees on what the current stack alignment is and whether a realignment is needed.