Summary: | minimal 32-byte stack alignment with -mavx on 64-bit Windows | ||
---|---|---|---|
Product: | gcc | Reporter: | R Copley <rcopley> |
Component: | target | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | arthur200126, avraham.adler, CoelacanthusHex, dimula73, ebotcazou, idhameed, jakub, ktietz, lists, luke-jr+gccbugs, mehdi.chinoune, mika.fischer, rcopley, roland, ssbssa, steve |
Priority: | P3 | Keywords: | wrong-code |
Version: | 4.7.1 | ||
Target Milestone: | --- | ||
See Also: | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49001 | ||
Host: | Target: | ||
Build: | Known to work: | ||
Known to fail: | Last reconfirmed: | 2015-09-22 00:00:00 | |
Attachments: |
Self-contained C source, with AVX alignment bug on Windows
As before, but with explicitly 32-byte aligned variables
Assembly-language code compiled from attachment 1
Slightly modified testcase
Test source for unaligned pass-by-value crash
Use unaligned VMOV instructions (for Windows targets) |
Description
R Copley
2012-08-30 01:10:07 UTC
MS' abi doesn't allow this. So I doubt we will be able to implement that for this target. If we want to re-align stack on function-base we will run into troubles with SEH-information. Doesn't it work to align explicit the variable itself?

Created attachment 30793 [details]
As before, but with explicitly 32-byte aligned variables
Created attachment 30794 [details]
Assembly-language code compiled from attachment 1

Compiled with GCC 4.7.2 from the MinGW-w64 toolchain. Compile command: "gcc -O0 -m64 -mavx -S bug1.c -o bug1.s".

gcc -v output:

    Using built-in specs.
    COLLECT_GCC=gcc
    COLLECT_LTO_WRAPPER=c:/mingw64/bin/../libexec/gcc/x86_64-w64-mingw32/4.7.2/lto-wrapper.exe
    Target: x86_64-w64-mingw32
    Configured with: /home/ruben/mingw-w64/src/gcc/configure --host=x86_64-w64-mingw32 --build=x86_64-linux-gnu --target=x86_64-w64-mingw32 --with-sysroot=/home/ruben/mingw-w64/mingw64mingw64/mingw64 --prefix=/home/ruben/mingw-w64/mingw64mingw64/mingw64 --with-gmp=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-mpfr=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-mpc=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-ppl=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-cloog=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --disable-ppl-version-check --disable-cloog-version-check --enable-cloog-backend=isl --with-host-libstdcxx='-static -lstdc++ -lm' --enable-shared --enable-static --enable-threads=win32 --enable-plugins --disable-multilib --enable-languages=c,lto,c++,objc,obj-c++,fortran,java --enable-libgomp --enable-fully-dynamic-string --enable-libstdcxx-time --disable-nls --disable-werror --enable-checking=release --with-gnu-as --with-gnu-ld --disable-win32-registry --disable-rpath --disable-werror --with-libiconv-prefix=/home/ruben/mingw-w64/prereq/x86_64-w64-mingw32/install --with-pkgversion=rubenvb-4.7.2-release --with-bugurl=mingw-w64-public@lists.sourceforge.net CC= CFLAGS='-O2 -march=nocona -mtune=core2 -fomit-frame-pointer -momit-leaf-frame-pointer' LDFLAGS=
    Thread model: win32
    gcc version 4.7.2 (rubenvb-4.7.2-release)

(In reply to Kai Tietz from comment #1)
> MS' abi doesn't allow this. So I doubt we will be able to implement that
> for this target.
> If we want to re-align stack on function-base we will run
> into troubles with SEH-information.

You might be right, I'm not sure. Are you aware that on 64-bit Windows, SEH is table-based, not frame-based (see, e.g., http://www.osronline.com/article.cfm?article=469)?

> Doesn't it work to align explicit the variable itself?

No (see attachments 2 and 3). If I understand correctly, the alignment specification is redundant anyway, because the variables are supposed to be naturally aligned, on their size.

Assembling attachment 3 [details] with "-g" and running it in gdb gives:

    Program received signal SIGSEGV, Segmentation fault.
    main () at bug1.s:46
    46 vmovapd %ymm0, -96(%rbp)

Thanks.

This seems to me to be a duplicate of 49001.

As I mentioned in the description, this request was indeed related to that bug.

The test-case no longer crashes with recent MinGW-W64 toolchains (GCC 4.9.1).

For me the problem isn't fixed with gcc 4.9.1. I tried two builds: a) http://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win32/Personal%20Builds/mingw-builds/installer/mingw-w64-install.exe/download and b) http://nuwen.net/mingw.html. Did you use a special distribution or special flags if you compiled gcc yourself?

No, I use the mingw-builds distro too.

    gcc --version
    gcc (x86_64-win32-seh-rev0, Built by MinGW-W64 project) 4.9.1

Bizarrely, the attached program exits with a random error code unless I add a "return 0;" statement to the end of the main function. But it doesn't segfault.

Heh, sorry. I don't really know C, I assumed it had an implicit "return 0;" like C++. Apparently C99 has this but earlier C standards do not. So, not bizarre at all.

Created attachment 33520 [details]
Slightly modified testcase
This slightly modified testcase in which the return value isn't stored, still segfaults for me. With the 32bit mingw64 binary ((i686-win32-dwarf-rev1, Built by MinGW-W64 project) 4.9.1) it is OK, but with the 64bit binary ((x86_64-win32-seh-rev1, Built by MinGW-W64 project) 4.9.1) it segfaults.
On 20 September 2014 07:08, roland at rschulz dot eu <gcc-bugzilla@gcc.gnu.org> wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412
>
> --- Comment #10 from Roland Schulz <roland at rschulz dot eu> ---
> Created attachment 33520 [details]
> --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33520&action=edit
> Slightly modified testcase
>
> This slightly modified testcase in which the return value isn't stored, still
> segfaults for me. With the 32bit mingw64 binary ((i686-win32-dwarf-rev1, Built
> by MinGW-W64 project) 4.9.1) it is OK, but with the 64bit binary
> ((x86_64-win32-seh-rev1, Built by MinGW-W64 project) 4.9.1) it segfaults.

Confirmed (with the same compiler, in the mingw-builds toolchain). I compiled your testcase with command "gcc -O0 -g -ggdb -m64 -mavx bug.c". It segfaults on the instruction marked "=>" below.

    (gdb) disassemble /m
    Dump of assembler code for function f:
    6 {
       0x00000000004014f0 <+0>: push %rbp
       0x00000000004014f1 <+1>: mov %rsp,%rbp
       0x00000000004014f4 <+4>: mov %rcx,0x10(%rbp)
       0x00000000004014f8 <+8>: sub $0x40,%rsp
       0x00000000004014fc <+12>: mov %rsp,%rax
       0x00000000004014ff <+15>: add $0x1f,%rax
       0x0000000000401503 <+19>: shr $0x5,%rax
       0x0000000000401507 <+23>: shl $0x5,%rax
    7 v4d x __attribute__ ((aligned (32))) = { 1.0, 2.0, 3.0, 4.0, };
       0x000000000040150b <+27>: vmovapd 0x2aed(%rip),%ymm0 # 0x404000
       0x0000000000401513 <+35>: vmovapd %ymm0,(%rax)
    8 return x;
       0x0000000000401517 <+39>: mov 0x10(%rbp),%rdx
       0x000000000040151b <+43>: vmovapd (%rax),%ymm0
    => 0x000000000040151f <+47>: vmovapd %ymm0,(%rdx)
    9 }
       0x0000000000401523 <+51>: mov 0x10(%rbp),%rax
       0x0000000000401527 <+55>: mov %rbp,%rsp
       0x000000000040152a <+58>: pop %rbp
       0x000000000040152b <+59>: retq
    End of assembler dump.
    (gdb) print $rdx % 32
    $1 = 16

It is good to hear that issue is fixed for 32-bit. But for 64-bit - as I already explained in comment above - this issue isn't fixable for stack-variables.
The problem is that for the x64 ABI we are tightly bound to SEH-prologue information, and this can't express alignment operations. The x64 ABI guarantees 16-byte alignment on function entry, therefore SSE 128-bit operations can be placed fully aligned on the stack, but higher alignment is simply not expressible. Therefore I will need to set this bug to suspended. If this information gets extended in the future to allow the prologue information we need for alignment, then we will be able to fix that. Until then it is suspended.

But this problem is limited to GCC. ICC, Clang and MSVC don't have the problem with compiling 64bit AVX code. Thus they must have some kind of work-around for the ABI, and GCC should be able to use a work-around too (at least in theory).

A solution would be to use unaligned loads and stores to stack variables for 256-bit variables and spilled registers. Likely the other compilers are doing this to make it work. I would really appreciate such a solution.

After compiling and running the test case, I can confirm that this bug still exists in ``gcc (Debian 5.3.1-11) 5.3.1 20160307``. It crashes both under Wine and 64-bit Windows 7. I would love to see this fixed. It's the only thing keeping me from building all of the Folding@home software for Windows under Linux. We need AVX for our protein folding simulations. The extra performance gained by using AVX is significant. My other options are clang or building on Windows using MSVC or Intel compilers.

```
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 5.3.1-11' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --with-arch-32=i586 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.3.1 20160307 (Debian 5.3.1-11)
```

Hello,

More details about this bug can be seen at (With simple test case): https://github.com/Alexpux/MSYS2-packages/issues/1209#issuecomment-379576367

Also on MinGW64 discussion: https://sourceforge.net/p/mingw-w64/mailman/message/36287627/

Regarding Kai Tietz's comment: https://stackoverflow.com/a/30929086/195787

Any chance this gets fixed? Thank You.

Moving again to NEW. Clang does indeed realign the stack a la GCC, which is OK in simple cases like this with SEH, i.e. when no DRAP is required (MSVC does things totally backwards, since it realigns the frame instead of the stack but we just cannot do that in GCC).

Investigating.
In particular, it would be good to know what Clang does when there is also a call to alloca in the problematic function.

This comment could be important: https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-stack-alignment-required-for-avx-on-windows-x64?noredirect=1#comment86499640_30928265.

Hopefully you'll find a way to bring AVX to Windows 64 using GCC. Thank You.

> This comment could be important:
>
> https://stackoverflow.com/questions/30928265/mingw64-is-incapable-of-32-byte-
> stack-alignment-required-for-avx-on-windows-
> x64?noredirect=1#comment86499640_30928265.
As already said, MSVC does something completely different (it realigns the frame instead of the stack) and we cannot do that; the model must be Clang instead.
This comment could be important: <https://github.com/Alexpux/MSYS2-packages/issues/1209#issuecomment-379576367>

> mstorsjo commented 10 days ago
> However, this only seems to be an issue when passing such variables by value. Local variables seem to be properly aligned even with GCC: If the `__m256` in question in the original post was made to pass by reference, the crash would go away.

From the assembly code following that reply we can also conclude that, it is not the impossibility of realigning the stack during run time that is the issue (because RSP was aligned in that snippet of code and I believe that code was correct). It is GCC does not realign the stack at all that is the issue.

Hello, Any progress on this on GCC 8.x? We really want GCC + AVX on Windows.

> It is GCC does not realign the stack at all that is the issue.
I hit another related issue that might confirm this as well.
I noticed this when I tried to manually align the stack with inline assembly.
C++ code reduced from my test case,
```
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
__attribute__((target("avx")))
__attribute__((noinline)) __m256d f(__m256d x, uint32_t a, const double *p)
{
    __m256d res;
    asm volatile ("vxorpd %0, %0, %0" :
                  "=x"(res), "+x"(x), "+r"(a), "+r"(p) ::
                  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
                  "r11", "rbp");
    return res;
}

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f2(__m256d x, uint32_t a, const double *p)
{
    __m256d res;
    asm volatile ("vxorpd %0, %0, %0" :
                  "=x"(res), "+x"(x), "+r"(a), "+r"(p) ::
                  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
                  "r11", "rbp");
    return res;
}

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f(__m256d x, __m256d y, __m256d z,
                                    uint32_t a, const double *p)
{
    __m256d res;
    asm volatile ("vxorpd %0, %0, %0" :
                  "=x"(res), "+x"(x), "+x"(y), "+x"(z), "+r"(a), "+r"(p) ::
                  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
                  "r11", "rbp");
    return res;
}

const double points[] = {0, 0.1, 0.2, 0.6};

__attribute__((target("avx"))) void test_avx()
{
    f(__m256d{0, 0, 0, 0}, __m256d{0, 0, 0, 0},
      __m256d{0, 0, 0, 0}, 4, points);
    f(__m256d{0, 0, 0, 0}, 4, points);
}

__attribute__((target("avx"))) void test_avx2()
{
    f2(__m256d{0, 0, 0, 0}, 4, points);
}

static void call_aligned_stack(void (*p)(void))
{
    asm volatile ("movq %%rsp, %%rbp\n"
                  "andq $-64, %%rsp\n"
                  "subq $64, %%rsp\n"
                  "callq *%0\n"
                  "movq %%rbp, %%rsp\n"
                  :: "r"(p)
                  : "memory", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rbp");
}

int main()
{
    call_aligned_stack(test_avx);
    fprintf(stderr, "aaaa\n");
    fflush(stderr);
    call_aligned_stack(test_avx2);
    return 0;
}
```
(The `fprintf` is there only to make it easier to see when the crash happens.) The stack alignment code makes sure that the stack is aligned to 64bytes before making the `call`, which is verified in the debugger, however, when compiled with GCC 8.2.1 on msys2 (using the mingw-w64-x86_64-gcc package) the `test_avx` function is happy while `test_avx2` function is not.
Looking at the generated code, for the crashing function:
```
00000000004015c0 <_Z9test_avx2v>:
4015c0: 48 83 ec 68 sub $0x68,%rsp
4015c4: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
4015c8: 4c 8d 0d 51 7a 00 00 lea 0x7a51(%rip),%r9 # 409020 <_ZL6points>
4015cf: 41 b8 04 00 00 00 mov $0x4,%r8d
4015d5: 48 8d 4c 24 40 lea 0x40(%rsp),%rcx
4015da: 48 8d 54 24 20 lea 0x20(%rsp),%rdx
4015df: c5 fd 29 44 24 20 vmovapd %ymm0,0x20(%rsp)
4015e5: c5 f8 77 vzeroupper
4015e8: e8 a3 ff ff ff callq 401590 <_Z2f2Dv4_djPKd>
4015ed: 90 nop
4015ee: 48 83 c4 68 add $0x68,%rsp
4015f2: c3 retq
```
which tries to write with 32byte alignment with a stack offset from the initial call instruction: -8 - 0x68 + 0x20 = -80.
OTOH, for the "good" function,
```
0000000000401640 <_Z8test_avxv>:
401640: 57 push %rdi
401641: 56 push %rsi
401642: 53 push %rbx
401643: 48 81 ec b0 00 00 00 sub $0xb0,%rsp
40164a: c5 d9 57 e4 vxorpd %xmm4,%xmm4,%xmm4
40164e: 48 8d 3d cb 79 00 00 lea 0x79cb(%rip),%rdi # 409020 <_ZL6points>
401655: 48 8d 74 24 70 lea 0x70(%rsp),%rsi
40165a: 4c 8d 4c 24 30 lea 0x30(%rsp),%r9
40165f: 48 89 7c 24 28 mov %rdi,0x28(%rsp)
401664: 48 8d 9c 24 90 00 00 lea 0x90(%rsp),%rbx
40166b: 00
40166c: 4c 8d 44 24 50 lea 0x50(%rsp),%r8
401671: 48 89 f2 mov %rsi,%rdx
401674: c5 fd 29 64 24 70 vmovapd %ymm4,0x70(%rsp)
40167a: 48 89 d9 mov %rbx,%rcx
40167d: c5 fd 29 64 24 50 vmovapd %ymm4,0x50(%rsp)
401683: c5 fd 29 64 24 30 vmovapd %ymm4,0x30(%rsp)
401689: c7 44 24 20 04 00 00 movl $0x4,0x20(%rsp)
401690: 00
401691: c5 f8 77 vzeroupper
401694: e8 67 ff ff ff callq 401600 <_Z1fDv4_dS_S_jPKd>
401699: c5 d9 57 e4 vxorpd %xmm4,%xmm4,%xmm4
40169d: 49 89 f9 mov %rdi,%r9
4016a0: 48 89 f2 mov %rsi,%rdx
4016a3: 41 b8 04 00 00 00 mov $0x4,%r8d
4016a9: 48 89 d9 mov %rbx,%rcx
4016ac: c5 fd 29 64 24 70 vmovapd %ymm4,0x70(%rsp)
4016b2: c5 f8 77 vzeroupper
4016b5: e8 a6 fe ff ff callq 401560 <_Z1fDv4_djPKd>
4016ba: 90 nop
4016bb: 48 81 c4 b0 00 00 00 add $0xb0,%rsp
4016c2: 5b pop %rbx
4016c3: 5e pop %rsi
4016c4: 5f pop %rdi
4016c5: c3 retq
```
The stack offset for the 32-byte aligned access is -8 - 8 * 3 - 0xb0 + 0x70 = -96, which differs from the previous one by 16, so it seems that GCC isn't even consistent with itself about what stack alignment it expects.
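The two offsets above can be checked mechanically. What follows is a minimal sketch, not part of the original test case: the helper names `mod32` and `store_offset` are hypothetical, and it assumes (as `call_aligned_stack` arranges) that `rsp` is a multiple of 64 immediately before the `call`, so only the offset's remainder modulo 32 matters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: remainder in [0, 32) of a possibly negative offset.
constexpr std::int64_t mod32(std::int64_t x) {
    return ((x % 32) + 32) % 32;
}

// Hypothetical helper: offset of the vmovapd destination relative to rsp
// just before the call. The leading -8 is the return address pushed by
// `call`; `pushed` is the bytes pushed by the prologue, `sub` the amount
// subtracted from rsp, `disp` the displacement used by the store.
constexpr std::int64_t store_offset(std::int64_t pushed, std::int64_t sub,
                                    std::int64_t disp) {
    return -8 - pushed - sub + disp;
}

// test_avx2: no pushes, sub $0x68, store to 0x20(%rsp).
constexpr std::int64_t crashing = store_offset(0, 0x68, 0x20);
// test_avx: three pushes, sub $0xb0, store to 0x70(%rsp).
constexpr std::int64_t working = store_offset(3 * 8, 0xb0, 0x70);

static_assert(crashing == -80, "matches the figure quoted in the comment");
static_assert(working == -96, "matches the figure quoted in the comment");
static_assert(mod32(crashing) == 16, "16 bytes off a 32-byte boundary");
static_assert(mod32(working) == 0, "lands on a 32-byte boundary");
```

Under that assumption, the store in `test_avx2` targets an address that is 16 modulo 32 while the one in `test_avx` targets a multiple of 32, which is consistent with only the former faulting.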
Oh, and the test case above was compiled with -O3 (and -g -Wall -Wextra).

Hi, all! I would like to add one more test file, related to the problem. If GCC tries to call a function, that accepts a __m256 register as a parameter, it unloads this parameter into the stack using an **aligned** move (vmovaps), but the alignment guarantee on Windows is only 16-byte. It means that the application will crash because of unaligned memory access.

Affected versions: GCC 7.3.0 (MinGW64), GCC 8.1.0 (MinGW64)

Here is the testing source (see also in an attachment):

    #include <intrin.h>

    struct X {
        alignas(32) __m256 d;
    };

    void g1(X);
    void g2(const X&);
    void g3(const void *);

    void f(float *ptr) {
        X x = {_mm256_load_ps(ptr)};
        g1(x);  // BUG: passes via unaligned (whatever rsp alignment is) stack
        g2(x);  // OK: passes via aligned stack location
        g3(&x); // OK: passes via aligned stack location
    }

Compiled result (-O2 -march=skylake):

    _Z1fPf:
    .LFB5135:
        pushq %rbx
        .seh_pushreg %rbx
        addq $-128, %rsp
        .seh_stackalloc 128
        .seh_endprologue
        vmovaps (%rcx), %ymm0
        leaq 95(%rsp), %rbx
        leaq 32(%rsp), %rcx
        andq $-32, %rbx
        vmovaps %ymm0, (%rbx) # %rbx is properly aligned
        vmovaps %ymm0, 32(%rsp) # %rsp may be unaligned
        vzeroupper
        call _Z2g11X
        movq %rbx, %rcx
        call _Z2g2RK1X
        movq %rbx, %rcx
        call _Z2g3PKv
        nop
        subq $-128, %rsp
        popq %rbx
        ret

Related bug in Vc library: https://github.com/VcDevel/Vc/issues/241
Related bug in Krita: https://bugs.kde.org/show_bug.cgi?id=406209

Created attachment 46133 [details]
Test source for unaligned pass-by-value crash
Test file for the comment above
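The `leaq 95(%rsp), %rbx` / `andq $-32, %rbx` pair in the compiled result of that comment reads as the usual round-up idiom: the local `x` is placed at the first 32-byte boundary at or above `rsp + 64`, while the by-value copy for `g1` is stored at the fixed slot `rsp + 32`. A minimal sketch of that address arithmetic follows; `align_up`, `local_slot` and `by_value_slot` are hypothetical names, and the example `rsp` values are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round addr up to a multiple of align (a power of two).
constexpr std::uint64_t align_up(std::uint64_t addr, std::uint64_t align) {
    return (addr + align - 1) & ~(align - 1);
}

// What `leaq 95(%rsp), %rbx; andq $-32, %rbx` computes for the local.
constexpr std::uint64_t local_slot(std::uint64_t rsp) {
    return (rsp + 95) & ~std::uint64_t{31};
}

// Where `vmovaps %ymm0, 32(%rsp)` stores the by-value argument copy.
constexpr std::uint64_t by_value_slot(std::uint64_t rsp) {
    return rsp + 32;
}

// The two formulations of the local's address agree.
static_assert(local_slot(0x1010) == align_up(0x1010 + 64, 32), "");

// With rsp only 16-byte aligned (illustrative value), the local is still
// 32-byte aligned, but the by-value slot is not.
static_assert(local_slot(0x1010) % 32 == 0, "");
static_assert(by_value_slot(0x1010) % 32 == 16, "");

// With rsp 32-byte aligned, both slots happen to be aligned.
static_assert(local_slot(0x1000) % 32 == 0, "");
static_assert(by_value_slot(0x1000) % 32 == 0, "");
```

This is why, in that listing, the store through `%rbx` is safe for any 16-byte-aligned `rsp`, whereas the aligned store to `32(%rsp)` depends on `rsp` happening to be a multiple of 32.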
As a workaround, one can either use __attribute__((always_inline)) for *all* the functions accepting __m256 or pass *all* arguments by const-ref. Const-ref arguments are passed correctly.

The correct way to align the stack to a 32-byte or 64-byte boundary on 64-bit Windows is to use a frame pointer in a function that requires stack realignment and then realign the stack to the required alignment once the frame pointer is set and all of the non-volatile registers used in the function are saved.

    class Avx2VectorGenerator {
    public:
        virtual __m256i NextVector() = 0;
    };

    __m256i Example_AVX2_Func(Avx2VectorGenerator* generator, size_t iterations);

    Example_AVX2_Func:
        pushq %rbp
        .seh_pushreg %rbp
        pushq %rbx
        .seh_pushreg %rbx
        pushq %rdi
        .seh_pushreg %rdi
        movq %rsp, %rbp
        .seh_setframe %rbp, 0
        .seh_endprologue
        /* Set rbx to generator and rdi to iterations */
        movq %rcx, %rbx
        movq %rdx, %rdi
        /* It is okay to allocate additional stack memory */
        /* and re-align the stack pointer outside of the */
        /* SEH prologue as there is a frame pointer in this */
        /* function */
        subq $64, %rsp
        andq $-32, %rsp
        /* Zero out the result vector */
        vpxor %ymm0, %ymm0, %ymm0
        test %rdi, %rdi
        jz .loop_complete
    .loop_iteration_start:
        /* Save the result vector to 32(%rsp) */
        vmovdqa %ymm0, 32(%rsp)
        /* Move generator into rcx */
        movq %rbx, %rcx
        /* Move the pointer to the NextVector() virtual member func */
        /* into rax */
        movq (%rbx), %rax
        /* Call generator->NextVector() */
        call *(%rax)
        /* Add the result of generator->NextVector() to the result vector */
        vpaddb 32(%rsp), %ymm0, %ymm0
        /* Decrement iterations by 1 */
        sub $1, %rdi
        /* Jump back to the beginning of the loop if iterations is non-zero */
        jnz .loop_iteration_start
    .loop_complete:
        lea (%rbp), %rsp
        pop %rdi
        pop %rbx
        pop %rbp
        ret
        .seh_endproc

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412#c25 GCC is fully capable of aligning the stack.
It just seems that different parts of it disagree on what the current stack alignment is and whether a realignment is needed.

One of the weird, probably SEH-related things is that the lack-of-alignment behavior of comment 28 and attachment 1 is not reproduced on a "normal" Linux GCC with __attribute__((ms_abi)) sprinkled all over to get the right calling convention. The code takes the same shape, uses mostly the same registers, but the `and rsp, -32` is just either not there or placed wrong.

Hi, all!
Just wanted to note that the bug is still present in GCC 10.3.0 on Windows (from MSYS-MinGW64 packages).
> gcc (Rev5, Built by MSYS2 project) 10.3.0
Still present in GCC 11.2.0

Adding another +1. Still present in 10.3.0. Bitcoin Core's sha2 code uses avx2 when possible. We ran into this bug when bumping our toolchain: https://github.com/bitcoin/bitcoin/pull/24736 and opted to take Debian's patch: https://salsa.debian.org/mingw-w64-team/gcc-mingw-w64/-/blob/master/debian/patches/vmov-alignment.patch

It's unfortunate that the best and most common advice for using avx2 with gcc/mingw is to use a patched compiler. Might it be possible to accept Debian's patch upstream?

> It's unfortunate that the best and most common advice for using avx2 with
> gcc/mingw is to use a patched compiler. Might it be possible to accept
> Debian's patch upstream?
Sure, but they need to submit it first, we cannot do it for them.
Created attachment 52737 [details]
Use unaligned VMOV instructions (for Windows targets)

The reason I didn't submit the Debian patch is that it unconditionally replaces V...{U,A} with V...U instructions. That's fine when we know the target needs something like that, which is the case when we're building a Windows cross-compiler; but I don't think it's suitable for general use as-is. It would need a build-time conditional at the very least. Anyway, I'll add it as an attachment here; I'll try to find time to make it generally applicable. I haven't filed copyright assignment paperwork for me personally; if the patch needs it, consider it submitted by skitt@redhat.com under the corporate copyright assignment agreement.

That patch is certainly unacceptable, not only because it affects non-Windows too, but even on Windows it will unnecessarily pessimize e.g. accesses to data sections or heap that can be aligned. If the Windows ABI doesn't align stack or not as much as gcc assumes, then a fix would ensure only automatic vars on Windows are accessed always using unaligned vector instructions provided dynamic stack realignment is not an option.

> If the Windows ABI doesn't align stack or not as much as gcc assumes, then a
> fix would ensure only automatic vars on Windows are accessed always using
> unaligned vector instructions provided dynamic stack realignment is not an
> option.
It's classical double-word alignment, i.e. 16 bytes, and AVX requires 32 bytes.
The implementation of dynamic stack realignment is too much of a kludge to be safely used on Windows IMO so, yes, the way out is probably unaligned vector instructions.
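To illustrate the distinction between aligned and unaligned vector moves at source level, here is a minimal sketch, assuming an x86-64 compiler; `copy4d` and `demo_unaligned_slot` are hypothetical names. It copies four doubles through a slot that is 16-byte but not 32-byte aligned, using `_mm256_loadu_pd` / `_mm256_storeu_pd` when AVX is enabled and `memcpy` otherwise. It does not work around the bug itself: the spills and argument copies discussed in this report are emitted by the compiler, not written by the user.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Copy four doubles between addresses that may be only 16-byte aligned.
inline void copy4d(double *dst, const double *src) {
#if defined(__AVX__)
    // Unaligned forms (vmovupd): no 32-byte alignment requirement.
    _mm256_storeu_pd(dst, _mm256_loadu_pd(src));
#else
    // Portable fallback when AVX is not enabled at compile time.
    std::memcpy(dst, src, 4 * sizeof(double));
#endif
}

// Returns true if data round-trips through a deliberately misaligned slot.
inline bool demo_unaligned_slot() {
    alignas(32) unsigned char raw[96] = {};
    // 16 bytes past a 32-byte boundary: the alignment the thread says
    // Windows x64 guarantees for the stack, but not what vmovapd needs.
    double *slot = reinterpret_cast<double *>(raw + 16);
    const double in[4] = {1.0, 2.0, 3.0, 4.0};
    double out[4] = {};
    copy4d(slot, in);
    copy4d(out, slot);
    return reinterpret_cast<std::uintptr_t>(slot) % 32 == 16 &&
           std::memcmp(in, out, sizeof in) == 0;
}
```

An aligned store (`_mm256_store_pd`, i.e. `vmovapd`) to the same `slot` would fault, which is the failure mode reported throughout this thread.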
(A patch to emit unaligned instructions should probably resolve bug 49001 instead of this one, 54412.)

Could dynamic alignment be achieved, not for automatic variables and function parameters, but for registers spilled to the stack (due to register exhaustion, or because they may be clobbered)? So that users can write code that stores over-aligned objects on the heap only. If SEH is the problem, can alignment be accounted for in cases where SEH is not in use (if there are any such cases)? I'm thinking of -fno-exceptions, and dwarf (on x86) or setjump/longjump exceptions. Sorry if those are stupid questions.

> If SEH is the problem, can alignment be accounted for in cases where SEH is
> not in use (if there are any such cases)? I'm thinking of -fno-exceptions,
> and dwarf (on x86) or setjump/longjump exceptions.
The hitch is that Setjmp/Longjmp is implemented on top of SEH on 64-bit Windows, which means that SEH information must always be generated, even in plain C.
(In reply to Eric Botcazou from comment #37)
> > If the Windows ABI doesn't align stack or not as much as gcc assumes, then a
> > fix would ensure only automatic vars on Windows are accessed always using
> > unaligned vector instructions provided dynamic stack realignment is not an
> > option.
>
> It's classical double-word alignment, i.e. 16 bytes, and AVX requires 32
> bytes.
> The implementation of dynamic stack realignment is too much of a kludge to
> be safely used on Windows IMO so, yes, the way out is probably unaligned
> vector instructions.

Assembler in binutils 2.38 supports:

    -muse-unaligned-vector-move
        encode aligned vector move as unaligned vector move

It has been a few years since the last comment. I recently got hit by this bug for the first time in about a decade and a half of compiling R for Windows 64 using GCC 13.2.0 as packaged in Rtools44 [1]. Does it remain true that the only option to get around this bug without killing all AVX2 is to pass "-Wa,-muse-unaligned-vector-move" when compiling using GCC on Windows 64? Thank you.

[1] https://stat.ethz.ch/pipermail/r-sig-windows/2024q1/000113.html

Hi, Avraham!
> Does it remain true that the only option to get around this bug without killing all AVX2 is to pass "-Wa,-muse-unaligned-vector-move" when compiling using GCC on Windows 64? Thank you
I'm not sure about your particular issue, but in our case we managed to work around this issue by passing AVX2-related structures by reference (or const-reference, when possible).
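The shape of that by-reference workaround can be sketched as follows. This is only an illustration: `Vec4d` is a hypothetical stand-in for a structure wrapping `__m256d` (plain doubles are used so the sketch builds without AVX), and `is_aligned32`, `sum_by_ref` and `demo_by_ref` are likewise made-up names. The idea, as reported earlier in this thread, is that the callee receives the address of the caller's own over-aligned local rather than a by-value copy placed on the outgoing-argument stack.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a struct wrapping a 256-bit vector.
struct alignas(32) Vec4d {
    double v[4];
};

// True if p sits on a 32-byte boundary.
inline bool is_aligned32(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Takes the vector by const reference: no by-value stack copy is made,
// the callee works on the caller's aligned object.
inline double sum_by_ref(const Vec4d &x) {
    assert(is_aligned32(&x)); // the caller's local honours alignas(32)
    return x.v[0] + x.v[1] + x.v[2] + x.v[3];
}

// Creates an over-aligned local and passes it by reference.
inline double demo_by_ref() {
    Vec4d x = {{1.0, 2.0, 3.0, 4.0}};
    return sum_by_ref(x);
}
```

Whether this is applicable depends on being able to change every affected signature, which is the limitation raised in the reply that follows.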
Thank you, Dmitry, but that particular solution may not be possible for me. When I try compiling with -mstackrealign -mpreferred-stack-boundary=5 -mincoming-stack-boundary=5 instead of forcing unaligned moves I get "cc1.exe: error: '-mpreferred-stack-boundary=5' is not between 3 and 4". Is that this bug in a different form, something that should be filed separately, or known and intended behavior?

> Thank you, Dmitry, but that particular solution may not be possible for me.
> When I try compiling with -mstackrealign -mpreferred-stack-boundary=5
> -mincoming-stack-boundary=5 instead of forcing unaligned moves I get
> "cc1.exe: error: '-mpreferred-stack-boundary=5' is not between 3 and 4". Is
> that this bug in a different form, something that should be filed
> separately, or known and intended behavior?
No, it's the same issue: 32-byte stack alignment is not supported with SEH.
|