I see 526.blender_r from SPEC2017 ICEing when building with -Ofast -mcpu=neoverse-v2 -msve-vector-bits=128 -flto=auto. Unfortunately the -flto is necessary to reproduce it.
I've reduced it to 3 small files that should be enough to reproduce it:

$ cat main.c
void a();
void main() { a(); }

$ cat foo.c
void a();
typedef struct {
  char b, c;
} d;
typedef struct {
  d bezt;
} e;
typedef struct {
  int f;
} g;
e *h, *j;
void i(g *k, e **l) { *l = &h[k->f]; }
void BKE_mask_calc_handle_point_auto(g *k, e *l) {
  float m, n = m / 2.0f;
  char o = l->bezt.c, p = l->bezt.b;
  i(k, &j);
  d *q = &j->bezt;
  if (q)
    a();
  l->bezt.b = p;
  l->bezt.c = o;
  a(n);
}

$ cat bar.c
void BKE_mask_calc_handle_point_auto();
int a() {
  int b;
  BKE_mask_calc_handle_point_auto(b);
}

Compile for aarch64 with:
gcc -Ofast -msve-vector-bits=128 -mcpu=neoverse-v2 main.c foo.c bar.c -flto

and the crash is:
during RTL pass: reload
bar.c: In function 'a.isra':
bar.c:5:1: internal compiler error: maximum number of generated reload insns per insn achieved (90)
    5 | }
      | ^
0x1fe090b internal_error(char const*, ...)
        $SRC/gcc/diagnostic-global-context.cc:491
0xbff0c3 lra_constraints(bool)
        $SRC/gcc/lra-constraints.cc:5402
0xbe52ff lra(_IO_FILE*, int)
        $SRC/gcc/lra.cc:2442
0xb9919b do_reload
        $SRC/gcc/ira.cc:5973
0xb9919b execute
        $SRC/gcc/ira.cc:6161
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
lto-wrapper: fatal error: /home/ktkachov/builds/gcc-trunk/bin/gcc returned 1 exit status
compilation terminated.
/home/ktkachov/builds/binutils-trunk/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

I've bisected it to the commit g:3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b but I don't know if that is the cause or uncovered a latent problem.
Created attachment 58863
Just this file

with `-Ofast -msve-vector-bits=128 -march=armv9-a`

[apinski@xeond2 t]$ ../xgcc -B.. t2.c -Ofast -msve-vector-bits=128 -mcpu=neoverse-v2
during RTL pass: reload
t2.c: In function ‘BKE_mask_calc_handle_point_auto’:
t2.c:24:1: internal compiler error: maximum number of generated reload insns per insn achieved (90)
   24 | }
      | ^
0x31e8dd8 internal_error(char const*, ...)
        ../../gcc/diagnostic-global-context.cc:491
0x14c6f8f lra_constraints(bool)
        ../../gcc/lra-constraints.cc:5402
0x14af895 lra(_IO_FILE*, int)
        ../../gcc/lra.cc:2442
0x145779f do_reload
        ../../gcc/ira.cc:5976
0x1457c3c execute
        ../../gcc/ira.cc:6164
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
Created attachment 58864
Reduced testcase

`-Ofast -msve-vector-bits=128 -march=armv9-a -fno-vect-cost-model`
Confirmed.

> I've bisected it to the commit g:3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
> but I don't know if that is the cause or uncovered a latent problem

I think it was a latent bug.
We are trying to spill `(reg:VNx2QI 102 [ vect_p_6.5 ])` to the stack, but there is no instruction pattern that matches the spill, so LRA goes into an infinite loop of generating reloads.
CCing Richard about the partial mode reloads as he added many of those patterns
Yeah, seems to be a latent bug in aarch64_hard_regno_caller_save_mode. A brute-force reproducer is:

void foo();
typedef unsigned char v2qi __attribute__((vector_size(2)));
void f(v2qi *ptr)
{
  v2qi x = *ptr;
  asm volatile ("" :: "w" (x));
  asm volatile ("" ::: "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15");
  foo();
  asm volatile ("" :: "w" (x));
  *ptr = x;
}
(In reply to Richard Sandiford from comment #5)
> Yeah, seems to be a latent bug in aarch64_hard_regno_caller_save_mode. A
> brute-force reproducer is:
>
> void foo();
> typedef unsigned char v2qi __attribute__((vector_size(2)));
> void f(v2qi *ptr)
> {
>   v2qi x = *ptr;
>   asm volatile ("" :: "w" (x));
>   asm volatile ("" ::: "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15");
>   foo();
>   asm volatile ("" :: "w" (x));
>   *ptr = x;
> }

Interesting. With this reproducer we get the ICE from GCC 10 onwards when compiled with -O3 -march=armv8.2-a+sve -msve-vector-bits=128.
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:ec9d6d45191f639482344362d048294e74587ca3

commit r15-3073-gec9d6d45191f639482344362d048294e74587ca3
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Wed Aug 21 17:35:47 2024 +0100

    aarch64: Fix caller saves of VNx2QI [PR116238]

    The testcase contains a VNx2QImode pseudo that is live across a call
    and that cannot be allocated a call-preserved register.  LRA quite
    reasonably tried to save it before the call and restore it afterwards.
    Unfortunately, the target told it to do that in SImode, even though
    punning between SImode and VNx2QImode is disallowed by both
    TARGET_CAN_CHANGE_MODE_CLASS and TARGET_MODES_TIEABLE_P.

    The natural class to use for SImode is GENERAL_REGS, so this led
    to an unsalvageable situation in which we had:

      (set (subreg:VNx2QI (reg:SI A) 0) (reg:VNx2QI B))

    where A needed GENERAL_REGS and B needed FP_REGS.  We therefore ended
    up in a reload loop.

    The hooks above should ensure that this situation can never occur
    for incoming subregs.  It only happened here because the target
    explicitly forced it.

    The decision to use SImode for modes smaller than 4 bytes dates
    back to the beginning of the port, before 16-bit floating-point
    modes existed.  I'm not sure whether promoting to SImode really
    makes sense for any FPR, but that's a separate performance/QoI
    discussion.  For now, this patch just disallows using SImode
    when it is wrong for correctness reasons, since that should be
    safer to backport.

    gcc/
            PR testsuite/116238
            * config/aarch64/aarch64.cc (aarch64_hard_regno_caller_save_mode):
            Only return SImode if we can convert to and from it.

    gcc/testsuite/
            PR testsuite/116238
            * gcc.target/aarch64/sve/pr116238.c: New test.
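
For anyone skimming the thread, the decision described in the commit message can be sketched as a toy model. This is not the code from aarch64.cc: the `toy_*` names, the mode enum, and the punning predicate are all hypothetical stand-ins for GCC's machine modes and target hooks, and the 2-byte size for VNx2QI is an assumption that matches the `vector_size(2)` reproducer at -msve-vector-bits=128.

```c
/* Toy model only.  Every toy_* name here is hypothetical and stands in
   for GCC internals; this is not the code in config/aarch64/aarch64.cc.  */
#include <assert.h>
#include <stdbool.h>

enum toy_mode { TOY_QI, TOY_HI, TOY_SI, TOY_DI, TOY_VNx2QI };

/* Size in bytes of each toy mode.  VNx2QI is assumed to hold two bytes
   of data, as in the vector_size(2) reproducer.  */
unsigned toy_mode_size(enum toy_mode mode)
{
    switch (mode) {
    case TOY_QI:     return 1;
    case TOY_HI:     return 2;
    case TOY_SI:     return 4;
    case TOY_DI:     return 8;
    case TOY_VNx2QI: return 2;
    }
    return 0;
}

/* Stand-in for the "can we pun between this mode and SImode" checks.
   Assumption: only the partial SVE vector mode is refused.  */
bool toy_can_pun_with_si(enum toy_mode mode)
{
    return mode != TOY_VNx2QI;
}

/* Behaviour described as the old one: anything under 4 bytes is saved
   in SImode, whether or not the punning is allowed.  */
enum toy_mode toy_caller_save_mode_old(enum toy_mode mode)
{
    return toy_mode_size(mode) >= 4 ? mode : TOY_SI;
}

/* Behaviour described by the ChangeLog entry: only return SImode if we
   can convert to and from it; otherwise keep the original mode.  */
enum toy_mode toy_caller_save_mode_new(enum toy_mode mode)
{
    if (toy_mode_size(mode) >= 4 || !toy_can_pun_with_si(mode))
        return mode;
    return TOY_SI;
}

/* Checks the one case that matters for this PR plus two unchanged ones.  */
void toy_self_check(void)
{
    assert(toy_caller_save_mode_old(TOY_VNx2QI) == TOY_SI);     /* the bad choice */
    assert(toy_caller_save_mode_new(TOY_VNx2QI) == TOY_VNx2QI); /* the fixed choice */
    assert(toy_caller_save_mode_new(TOY_QI) == TOY_SI);         /* still promoted */
    assert(toy_caller_save_mode_new(TOY_DI) == TOY_DI);         /* never promoted */
}
```

In this model the old function picks SImode for the VNx2QI value, which corresponds to the `(subreg:VNx2QI (reg:SI A) 0)` situation quoted above, while the new one keeps VNx2QI and leaves the other small modes promoted as before.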
Fixed on trunk. Will backport after a while if there is no fallout.