Created attachment 55023 [details] C source code The attached C code seems to work fine with -O2: $ ../results.20230507.asan.ubsan/bin/gcc -w -O2 bug916.c $ ./a.out checksum = 44DCF65B $ But adding -march=znver1 seems to cause trouble: $ ../results.20230507.asan.ubsan/bin/gcc -w -O2 -march=znver1 bug916.c $ ./a.out Segmentation fault (core dumped) $ Adding the -fno-strict-aliasing flag doesn't seem to help: $ ../results.20230507.asan.ubsan/bin/gcc -w -O2 -march=znver1 -fno-strict-aliasing bug916.c $ ./a.out Segmentation fault (core dumped) $ The bug seems to have existed for a while: $ ../results.20230417/bin/gcc -w -O2 -march=znver1 -fno-strict-aliasing bug916.c $ ./a.out Segmentation fault (core dumped) $
=> 0x0000000000402378 <func_36+136>: vmovdqa %xmm3,(%r8) 0x000000000040237d <func_36+141>: vmovdqa %xmm2,0x10(%r8) 0x0000000000402383 <func_36+147>: neg %r9 0x0000000000402386 <func_36+150>: vmovdqa %xmm2,0x20(%r8) End of assembler dump. (gdb) p $r8 $1 = 140737488346408 (gdb) p/x $r8 $2 = 0x7fffffffdd28
I should mention there is no difference in the gimple dump between with and without -march=znver1 .
Works with -fstack-reuse=none. Somehow GCC is confused about stack layout in func_36. It leaves rsp misaligned by 8 mod 16, but computes the base of l_814 for __builtin_memset (&l_814, 0, 80) as if rsp is aligned mod 16.
Seems good with g:18547874ee205d83 dated 20220515 and bad with g:73f7109ffb159302, dated yesterday.
Seems good at date 20221106, so the date range is [20221106..20230417]. Trying 20230205.
Broken at 20230205, so range is now [20221106.. 20230205]. Trying snapshot 20221218
(In reply to David Binderman from comment #6) > Broken at 20230205, so range is now [20221106.. 20230205]. > > Trying snapshot 20221218 That was good, so range is 20221218..20230108. Trying snapshot 20230101.
As far as the snapshots go, 20221218..20221225 seems to be the range. In git, this is g:fd69977febf399d1992bbf8d66ae9170e0a4dc9f .. g:febb58d28bfa4b544ec7ffec2d61f46d25205ff0, which is 123 commits. Trying g:89ba8366fe12fd2d04535c99ba67f33d7e305132.
Started with zen tuning revision r13-4839-geef81eefcdc2a5.
(In reply to Martin Liška from comment #9) > Started with zen tuning revision r13-4839-geef81eefcdc2a5. The issue is also reproducible with -march=haswell or -march=skylake, so you can use those for further bisection.
(In reply to Alexander Monakov from comment #10) > (In reply to Martin Liška from comment #9) > > Started with zen tuning revision r13-4839-geef81eefcdc2a5. > > The issue is also reproducible with -march=haswell or -march=skylake, so you > can use those for further bisection. With -march=skylake it began with r13-4124-g156f523f9582f1 and with -march=haswell it started with the very same revision.
Eh, that commit sneakily changed avx2 tuning without explaining that in the Changelog. Anyway, it should possible to "workaround" that by compiling with -O2 -mavx2 -mtune=skylake-avx512 instead, in which case the bisect will likely point to commit r12-2666-g29f0e955c97 ("x86: Update piecewise move and store") (before that, expansion of memcpy via store-by-pieces wouldn't use avx2). Also, PR 109093 looks very related.
PR 109087 might also be solved by this. Marking as 12/13/14 regression since this is reproducible with -O2 plus basic -m flags, while earlier PRs also needed -ftrivial-auto-var-init and lost the regression markers in the meanwhile.
Any clue about how to fix this?
attachment 54666 [details] from PR109093 seems able to fix this. Could we make it into trunk and the release branches?
Created attachment 55409 [details] A patch I am stilling trying to find a small testcase.
(In reply to H.J. Lu from comment #16) > Created attachment 55409 [details] > A patch > > I am stilling trying to find a small testcase. The patch triggers an ICE building Spidermonkey 115b9 (it segfaults with GCC trunk because of some unaligned vmovdqa): 0x93297b ix86_finalize_stack_frame_flags ../../gcc/gcc/config/i386/i386.cc:8224 0x162064c ix86_expand_epilogue(int) ../../gcc/gcc/config/i386/i386.cc:9405 0x1b2e27f gen_epilogue() ../../gcc/gcc/config/i386/i386.md:17517 0x160a815 target_gen_epilogue ../../gcc/gcc/config/i386/i386.md:17013 0xf15e86 make_epilogue_seq ../../gcc/gcc/function.cc:5964 0xf15f8b thread_prologue_and_epilogue_insns() ../../gcc/gcc/function.cc:6046 0xf166c2 rest_of_handle_thread_prologue_and_epilogue ../../gcc/gcc/function.cc:6544 0xf166c2 execute ../../gcc/gcc/function.cc:6625 The code at i386.cc:8224 reads: if (crtl->stack_realign_finalized) { /* After stack_realign_needed is finalized, we can't no longer change it. */ gcc_assert (crtl->stack_realign_needed == stack_realign); return; } I'm not sure if the assert should be dropped or it's more difficult. Or can we just force to use unaligned vector moves for block operations until we can find a better solution? It's at least better than leaving the vectorized block moving broken and forcing people trying to disable the feature.
(In reply to Xi Ruoyao from comment #17) > (In reply to H.J. Lu from comment #16) > > Created attachment 55409 [details] > > A patch > > > > I am stilling trying to find a small testcase. > > The patch triggers an ICE building Spidermonkey 115b9 (it segfaults with GCC > trunk because of some unaligned vmovdqa): I mean, "GCC trunk and -O3 -march=tigerlake -mtune=tigerlake".
(In reply to Xi Ruoyao from comment #17) > (In reply to H.J. Lu from comment #16) > > Created attachment 55409 [details] > > A patch > > > > I am stilling trying to find a small testcase. > > The patch triggers an ICE building Spidermonkey 115b9 (it segfaults with GCC > trunk because of some unaligned vmovdqa): > > 0x93297b ix86_finalize_stack_frame_flags > ../../gcc/gcc/config/i386/i386.cc:8224 > 0x162064c ix86_expand_epilogue(int) > ../../gcc/gcc/config/i386/i386.cc:9405 > 0x1b2e27f gen_epilogue() > ../../gcc/gcc/config/i386/i386.md:17517 > 0x160a815 target_gen_epilogue > ../../gcc/gcc/config/i386/i386.md:17013 > 0xf15e86 make_epilogue_seq > ../../gcc/gcc/function.cc:5964 > 0xf15f8b thread_prologue_and_epilogue_insns() > ../../gcc/gcc/function.cc:6046 > 0xf166c2 rest_of_handle_thread_prologue_and_epilogue > ../../gcc/gcc/function.cc:6544 > 0xf166c2 execute > ../../gcc/gcc/function.cc:6625 > > The code at i386.cc:8224 reads: > > if (crtl->stack_realign_finalized) > { > /* After stack_realign_needed is finalized, we can't no longer > change it. */ > gcc_assert (crtl->stack_realign_needed == stack_realign); > return; > } > > I'm not sure if the assert should be dropped or it's more difficult. > > Or can we just force to use unaligned vector moves for block operations > until we can find a better solution? It's at least better than leaving the > vectorized block moving broken and forcing people trying to disable the > feature. Do you have a testcase?
(In reply to H.J. Lu from comment #19) > (In reply to Xi Ruoyao from comment #17) > > (In reply to H.J. Lu from comment #16) > > > Created attachment 55409 [details] > > > A patch > > > > > > I am stilling trying to find a small testcase. > > > > The patch triggers an ICE building Spidermonkey 115b9 (it segfaults with GCC > > trunk because of some unaligned vmovdqa): > > > > 0x93297b ix86_finalize_stack_frame_flags > > ../../gcc/gcc/config/i386/i386.cc:8224 > > 0x162064c ix86_expand_epilogue(int) > > ../../gcc/gcc/config/i386/i386.cc:9405 > > 0x1b2e27f gen_epilogue() > > ../../gcc/gcc/config/i386/i386.md:17517 > > 0x160a815 target_gen_epilogue > > ../../gcc/gcc/config/i386/i386.md:17013 > > 0xf15e86 make_epilogue_seq > > ../../gcc/gcc/function.cc:5964 > > 0xf15f8b thread_prologue_and_epilogue_insns() > > ../../gcc/gcc/function.cc:6046 > > 0xf166c2 rest_of_handle_thread_prologue_and_epilogue > > ../../gcc/gcc/function.cc:6544 > > 0xf166c2 execute > > ../../gcc/gcc/function.cc:6625 > > > > The code at i386.cc:8224 reads: > > > > if (crtl->stack_realign_finalized) > > { > > /* After stack_realign_needed is finalized, we can't no longer > > change it. */ > > gcc_assert (crtl->stack_realign_needed == stack_realign); > > return; > > } > > > > I'm not sure if the assert should be dropped or it's more difficult. > > > > Or can we just force to use unaligned vector moves for block operations > > until we can find a better solution? It's at least better than leaving the > > vectorized block moving broken and forcing people trying to disable the > > feature. > > Do you have a testcase? It's too large and I'm running cvise on it.
Created attachment 55421 [details] test case broken by draft patch (at -O2 -mavx2 -mtune=haswell)
(In reply to H.J. Lu from comment #19) > Do you have a testcase? Attached.
Created attachment 55424 [details] An updated patch
(In reply to H.J. Lu from comment #23) > Created attachment 55424 [details] > An updated patch Unfortunately Spidermonkey 115 still crashes even with the patch (and -O3 -march=tigerlike -mtune=tigerlake -fno-exceptions). The problem seems an unaligned stack slot is assigned for an object of a 512-bit aligned class. Then the pointer to the object is passed to another function which stores into the object with 256-bit vmovdqu but the stack slot is only aligned to 128-bit in fact. I'll try to reduce.
(In reply to Xi Ruoyao from comment #24) > (In reply to H.J. Lu from comment #23) > > Created attachment 55424 [details] > > An updated patch > > Unfortunately Spidermonkey 115 still crashes even with the patch (and -O3 > -march=tigerlike -mtune=tigerlake -fno-exceptions). The problem seems an > unaligned stack slot is assigned for an object of a 512-bit aligned class. > Then the pointer to the object is passed to another function which stores > into the object with 256-bit vmovdqu but the stack slot is only aligned to > 128-bit in fact. I'll try to reduce. Nope, it seems a bug in mozilla code.
Seems like there is nothing to bisect any more, please re-add the keyword is I am wrong.
GCC 12.4 is being released, retargeting bugs to GCC 12.5.
trunk works for me, but 14 doesn't. (Needed -fno-stack-protector -fno-stack-clash-protection to override some defaults.) I don't think https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109780#c23 was committed though.
``` char a; static int b, c, f; char *d = &a; static char *e = &a; void g(int h, int i) { int j = 1; for (; c != -3; c = c - 1) { int k[10]; f = 0; for (; f < 10; f++) k[f] = 0; *d = k[1]; if (i < *d) { *e = h; for (; j < 9; j++) { b = 1; for (; b < 7; b++) ; } } } } int main() { g(1, 1); } ``` ``` $ gcc-14 p.c -o p -O2 -march=znver1 -fno-stack-protector -fno-stack-clash-protection && ./p Segmentation fault (core dumped) ```
Needs bisection to see why trunk works. Note that the original at least worked on trunk back in November so it may just be latent rather than fixed by some of the recent (kind of similar looking) changes for other bugs.
#c29 works since r15-571-g1e0ae1f52741f7e0133661659ed2d210f939a398
(In reply to Jakub Jelinek from comment #31) > #c29 works since r15-571-g1e0ae1f52741f7e0133661659ed2d210f939a398 That means it most likely went latent.
FTR, ix86_find_max_used_stack_alignment increases alignment only when stack pointer or frame pointer are explicitly mentioned in : /* Find the maximum stack alignment. */ subrtx_iterator::array_type array; FOR_EACH_SUBRTX (iter, array, PATTERN (insn), ALL) if (MEM_P (*iter) && (reg_mentioned_p (stack_pointer_rtx, *iter) || reg_mentioned_p (frame_pointer_rtx, *iter))) { unsigned int alignment = MEM_ALIGN (*iter); if (alignment > stack_alignment) stack_alignment = alignment; This RTX does *not* trigger stack alignment increase in the above code: (insn 94 20 95 5 (set (mem/c:V16QI (reg/f:DI 1 dx [110]) [0 MEM <char[1:40]> [(void *)&k]+0 S16 A128]) (reg:V16QI 21 xmm1 [orig:122 MEM <char[1:40]> [(void *)&k] ] [122])) "pr109780.c":11:12 2015 {movv16qi_internal} (nil))
The problematic code is expanded from: ;; Generating RTL for gimple basic block 5 ;; __builtin_memset (&k, 0, 40); (insn 21 20 22 (parallel [ (set (reg:DI 107) (plus:DI (reg/f:DI 93 virtual-stack-vars) (const_int -48 [0xffffffffffffffd0]))) (clobber (reg:CC 17 flags)) ]) "pr109780.c":11:12 -1 (nil)) (insn 22 21 23 (set (reg:V32QI 108) (const_vector:V32QI [ (const_int 0 [0]) repeated x32 ])) "pr109780.c":11:12 -1 (nil)) (insn 23 22 24 (set (mem/c:V16QI (reg:DI 107) [0 MEM <char[1:40]> [(void *)&k]+0 S16 A128]) (vec_select:V16QI (reg:V32QI 108) (parallel [ (const_int 0 [0]) (const_int 1 [0x1]) (const_int 2 [0x2]) (const_int 3 [0x3]) (const_int 4 [0x4]) (const_int 5 [0x5]) (const_int 6 [0x6]) (const_int 7 [0x7]) (const_int 8 [0x8]) (const_int 9 [0x9]) (const_int 10 [0xa]) (const_int 11 [0xb]) (const_int 12 [0xc]) (const_int 13 [0xd]) (const_int 14 [0xe]) (const_int 15 [0xf]) ]))) "pr109780.c":11:12 -1 (nil)) (insn 24 23 25 (set (mem/c:V16QI (plus:DI (reg:DI 107) (const_int 16 [0x10])) [0 MEM <char[1:40]> [(void *)&k]+16 S16 A128]) (vec_select:V16QI (reg:V32QI 108) (parallel [ (const_int 16 [0x10]) (const_int 17 [0x11]) (const_int 18 [0x12]) (const_int 19 [0x13]) (const_int 20 [0x14]) (const_int 21 [0x15]) (const_int 22 [0x16]) (const_int 23 [0x17]) (const_int 24 [0x18]) (const_int 25 [0x19]) (const_int 26 [0x1a]) (const_int 27 [0x1b]) (const_int 28 [0x1c]) (const_int 29 [0x1d]) (const_int 30 [0x1e]) (const_int 31 [0x1f]) ]))) "pr109780.c":11:12 -1 (nil)) (insn 25 24 0 (set (mem/c:DI (plus:DI (reg:DI 107) (const_int 32 [0x20])) [0 MEM <char[1:40]> [(void *)&k]+32 S8 A128]) (subreg:DI (reg:V32QI 108) 0)) "pr109780.c":11:12 -1 (nil))
(In reply to Uroš Bizjak from comment #34) > The problematic code is expanded from: > > ;; Generating RTL for gimple basic block 5 > > ;; __builtin_memset (&k, 0, 40); > > (insn 21 20 22 (parallel [ > (set (reg:DI 107) > (plus:DI (reg/f:DI 93 virtual-stack-vars) > (const_int -48 [0xffffffffffffffd0]))) > (clobber (reg:CC 17 flags)) > ]) "pr109780.c":11:12 -1 > (nil)) It is not a good idea to CSE address that refers to virtual stack vars to a temporary. This defeats stack/frame pointer detection, mentioned in Comment #33, and consequently fails to increase stack alignment.
(In reply to Uroš Bizjak from comment #35) > It is not a good idea to CSE address that refers to virtual stack vars to a > temporary. This defeats stack/frame pointer detection, mentioned in Comment > #33, and consequently fails to increase stack alignment. This is expanded via builtins.cc/expand_builtin_memset_args.
(In reply to H.J. Lu from comment #23) > Created attachment 55424 [details] > An updated patch Is this patch similar to the one in PR109093#c17 ? As argued in PR109093#c35, it looks that the current detection of instructions that need increase in stack alignment is insufficient.
(In reply to Sam James from comment #29) > $ gcc-14 p.c -o p -O2 -march=znver1 -fno-stack-protector > -fno-stack-clash-protection && ./p > Segmentation fault (core dumped) Adding -mpreferred-stack-boundary=3 (as expected) "fixes" the testcase. The difference of "g" is in the realignment AND insn: vmovdqa %xmm1, %xmm0 .cfi_offset 3, -24 movl $1, %ebx - leaq -48(%rsp), %rdx + andq $-16, %rsp movl $9, %r11d + leaq -48(%rsp), %rdx movq d(%rip), %rcx jmp .L7
(In reply to Uroš Bizjak from comment #37) > (In reply to H.J. Lu from comment #23) > > Created attachment 55424 [details] > > An updated patch > > Is this patch similar to the one in PR109093#c17 ? As argued in > PR109093#c35, it looks that the current detection of instructions that need > increase in stack alignment is insufficient. This one is a newer patch. I will take a look.
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>: https://gcc.gnu.org/g:11902be7a57c0ccf03786aa0255fffaf0f54dbf9 commit r15-7573-g11902be7a57c0ccf03786aa0255fffaf0f54dbf9 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 14 11:41:51 2023 -0700 x86: Properly find the maximum stack slot alignment Don't assume that stack slots can only be accessed by stack or frame registers. We first find all registers defined by stack or frame registers. Then check memory accesses by such registers, including stack and frame registers. gcc/ PR target/109780 PR target/109093 * config/i386/i386.cc (ix86_update_stack_alignment): New. (ix86_find_all_reg_use_1): Likewise. (ix86_find_all_reg_use): Likewise. (ix86_find_max_used_stack_alignment): Also check memory accesses from registers defined by stack or frame registers. gcc/testsuite/ PR target/109780 PR target/109093 * g++.target/i386/pr109780-1.C: New test. * gcc.target/i386/pr109093-1.c: Likewise. * gcc.target/i386/pr109780-1.c: Likewise. * gcc.target/i386/pr109780-2.c: Likewise. * gcc.target/i386/pr109780-3.c: Likewise. Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
So is this fixed on the trunk now? If so, let's remove the "15" from the regression marker set.
Updated.
Patch was reverted in r15-7634-g0312d11be3f666 and r15-7635-g6921c93d205203 to try again in GCC 15.
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>: https://gcc.gnu.org/g:b9ea3b2ef98048f93b02fcd6ff51777bce1676c2 commit r16-194-gb9ea3b2ef98048f93b02fcd6ff51777bce1676c2 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 14 11:41:51 2023 -0700 x86: Properly find the maximum stack slot alignment Don't assume that stack slots can only be accessed by stack or frame registers. We first find all registers defined by stack or frame registers. Then check memory accesses by such registers, including stack and frame registers. gcc/ PR target/109780 PR target/109093 * config/i386/i386.cc (stack_access_data): New. (ix86_update_stack_alignment): Likewise. (ix86_find_all_reg_use_1): Likewise. (ix86_find_all_reg_use): Likewise. (ix86_find_max_used_stack_alignment): Also check memory accesses from registers defined by stack or frame registers. gcc/testsuite/ PR target/109780 PR target/109093 * g++.target/i386/pr109780-1.C: New test. * gcc.target/i386/pr109093-1.c: Likewise. * gcc.target/i386/pr109780-1.c: Likewise. * gcc.target/i386/pr109780-2.c: Likewise. * gcc.target/i386/pr109780-3.c: Likewise. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Co-Authored-By: Uros Bizjak <ubizjak@gmail.com>
GCC 12 branch is being closed.
On the 15 branch gcc.target/i386/pr109093-1.c still fails, at ix86_compute_frame_layout time preferred_alignment is 8 and stack_alignment_needed is 16, the frame during LRA is computed to {nsseregs = 0, nregs = 1, va_arg_size = 0, red_zone_size = 80, outgoing_arguments_size = 0, frame_pointer_offset = 32, hard_frame_pointer_offset = 16, stack_pointer_offset = 32, hfp_save_offset = 16, reg_save_offset = 24, stack_realign_allocate = 0, stack_realign_offset = 32, sse_reg_save_offset = 32, save_regs_using_mov = false, expensive_p = false, expensive_count = 0} but at ix86_compute_frame_layout time, under same constraints, {nsseregs = 0, nregs = 1, va_arg_size = 0, red_zone_size = 88, outgoing_arguments_size = 0, frame_pointer_offset = 32, hard_frame_pointer_offset = 16, stack_pointer_offset = 24, hfp_save_offset = 16, reg_save_offset = 24, stack_realign_allocate = 0, stack_realign_offset = 24, sse_reg_save_offset = 24, save_regs_using_mov = false, expensive_p = false, expensive_count = 0} because somehow cfun->machine->max_used_stack_alignment is now 8. It seems the 15 branch has ix86_find_max_used_stack_alignment as well. As I thought this was introduced with r8-2912-gf174328efedff7 with the attempt to avoid unnecessary stack-realignment given at RTL expansion time we conservatively estimate that requirement. The rev in question just extended some code trying to do the same (revise earlier computed stack alignment req