116238 – [12/13/14 Regression] ICE building 526.blender_r on aarch64 SVE after r15-1619-g3b9b8d6cfdf593

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	15.0

Importance:	P3 normal
Target Milestone:	15.0
Assignee:	Richard Sandiford

URL:
Keywords:	aarch64-sve, ice-on-valid-code, ra

Depends on:
Blocks:	spec

Reported:	2024-08-05 13:14 UTC by ktkachov
Modified:	2024-08-21 16:37 UTC
CC List:	2 users

See Also:
Host:
Target:	aarch64
Build:
Known to work:	14.1.0, 15.0, 9.5.0
Known to fail:	10.5.0, 14.2.1
Last reconfirmed:	2024-08-07 00:00:00

Attachments
Just this file with `-Ofast -msve-vector-bits=128 -march=armv9-a` (239 bytes, text/plain) 2024-08-07 21:36 UTC, Andrew Pinski
Reduced testcase (124 bytes, text/plain) 2024-08-07 21:46 UTC, Andrew Pinski

Description ktkachov 2024-08-05 13:14:28 UTC

I see 526.blender_r from SPEC2017 ICEing when building with -Ofast -mcpu=neoverse-v2 -msve-vector-bits=128 -flto=auto.
Unfortunately, -flto is necessary to reproduce it.
I've reduced it to 3 small files that should be enough to reproduce it:
$ cat main.c
void a();
void main() { a(); }

$ cat foo.c
void a();
typedef struct {
  char b, c;
} d;
typedef struct {
  d bezt;
} e;
typedef struct {
  int f;
} g;
e *h, *j;
void i(g *k, e **l) { *l = &h[k->f]; }
void BKE_mask_calc_handle_point_auto(g *k, e *l) {
  float m, n = m / 2.0f;
  char o = l->bezt.c, p = l->bezt.b;
  i(k, &j);
  d *q = &j->bezt;
  if (q)
    a();
  l->bezt.b = p;
  l->bezt.c = o;
  a(n);
}

$ cat bar.c
void BKE_mask_calc_handle_point_auto();
int a() {
  int b;
  BKE_mask_calc_handle_point_auto(b);
}

Compile for aarch64 with:
gcc -Ofast -msve-vector-bits=128 -mcpu=neoverse-v2 main.c foo.c bar.c -flto
and the crash is:
during RTL pass: reload
bar.c: In function 'a.isra':
bar.c:5:1: internal compiler error: maximum number of generated reload insns per insn achieved (90)
    5 | }
      | ^
0x1fe090b internal_error(char const*, ...)
        $SRC/gcc/diagnostic-global-context.cc:491
0xbff0c3 lra_constraints(bool)
        $SRC/gcc/lra-constraints.cc:5402
0xbe52ff lra(_IO_FILE*, int)
        $SRC/gcc/lra.cc:2442
0xb9919b do_reload
        $SRC/gcc/ira.cc:5973
0xb9919b execute
        $SRC/gcc/ira.cc:6161
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
lto-wrapper: fatal error: /home/ktkachov/builds/gcc-trunk/bin/gcc returned 1 exit status
compilation terminated.
/home/ktkachov/builds/binutils-trunk/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

I've bisected it to commit g:3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b, but I don't know whether that commit is the cause or just uncovered a latent problem.

Comment 1 Andrew Pinski 2024-08-07 21:36:26 UTC

Created attachment 58863
Just this file with `-Ofast -msve-vector-bits=128 -march=armv9-a`

[apinski@xeond2 t]$ ../xgcc -B.. t2.c -Ofast -msve-vector-bits=128 -mcpu=neoverse-v2
during RTL pass: reload
t2.c: In function ‘BKE_mask_calc_handle_point_auto’:
t2.c:24:1: internal compiler error: maximum number of generated reload insns per insn achieved (90)
   24 | }
      | ^
0x31e8dd8 internal_error(char const*, ...)
        ../../gcc/diagnostic-global-context.cc:491
0x14c6f8f lra_constraints(bool)
        ../../gcc/lra-constraints.cc:5402
0x14af895 lra(_IO_FILE*, int)
        ../../gcc/lra.cc:2442
0x145779f do_reload
        ../../gcc/ira.cc:5976
0x1457c3c execute
        ../../gcc/ira.cc:6164
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Comment 2 Andrew Pinski 2024-08-07 21:46:41 UTC

Created attachment 58864
Reduced testcase

`-Ofast -msve-vector-bits=128 -march=armv9-a -fno-vect-cost-model`

Comment 3 Andrew Pinski 2024-08-07 21:50:17 UTC

Confirmed.

>I've bisected it to the commit g:3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b but I don't know if that is the cause or uncovered a latent problem


I think it was a latent bug. 

We are trying to spill `(reg:VNx2QI 102 [ vect_p_6.5 ])` to the stack, but there is no matching instruction to do it, so LRA goes into an infinite loop.
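Why the loop never terminates on its own can be pictured with a toy model. This is not LRA's algorithm: it assumes (hypothetically) that each attempt to reload the offending move yields another move with the same unsatisfiable register-class requirement, as in the case where one side needs GENERAL_REGS and the other FP_REGS. All `toy_*` names are invented for illustration; only the limit of 90 comes from the ICE message.

```c
#include <assert.h>

/* Limit quoted in the ICE message ("... per insn achieved (90)"). */
#define TOY_MAX_RELOAD_INSNS 90

/* Hypothetical register classes, modelled as bit sets. */
enum { TOY_GENERAL_REGS = 1 << 0, TOY_FP_REGS = 1 << 1 };

/* Toy move insn: the destination register must be in dest_class (the
   class implied by its outer mode, e.g. SImode -> GENERAL_REGS), while
   the move pattern only accepts pattern_class for that operand
   (e.g. a VNx2QI move -> FP_REGS). */
typedef struct {
  unsigned dest_class;
  unsigned pattern_class;
} toy_move;

/* Returns the number of reload insns generated, or -1 once the toy
   limit is reached (the analogue of the ICE). */
static int toy_reload(toy_move insn)
{
  int generated = 0;
  assert(insn.dest_class != 0 && insn.pattern_class != 0);
  while ((insn.dest_class & insn.pattern_class) == 0) {
    /* Reloading through a fresh register of the same outer mode yields
       another move with identical constraints, so insn never changes
       and the loop makes no progress. */
    if (++generated >= TOY_MAX_RELOAD_INSNS)
      return -1;
  }
  return generated;
}
```

With `{TOY_GENERAL_REGS, TOY_FP_REGS}` the toy loop makes no progress and gives up at the limit, mirroring the "maximum number of generated reload insns" ICE; a move whose two classes overlap needs no reloads at all in this model.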

Comment 4 ktkachov 2024-08-20 08:45:51 UTC

CCing Richard about the partial mode reloads, as he added many of those patterns.

Comment 5 Richard Sandiford 2024-08-20 15:19:55 UTC

Yeah, seems to be a latent bug in aarch64_hard_regno_caller_save_mode.  A brute-force reproducer is:

void foo();
typedef unsigned char v2qi __attribute__((vector_size(2)));
void f(v2qi *ptr)
{
  v2qi x = *ptr;
  asm volatile ("" :: "w" (x));
  asm volatile ("" ::: "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15");
  foo();
  asm volatile ("" :: "w" (x));
  *ptr = x;
}

Comment 6 ktkachov 2024-08-20 15:50:21 UTC

(In reply to Richard Sandiford from comment #5)
> Yeah, seems to be a latent bug in aarch64_hard_regno_caller_save_mode.  A
> brute-force reproducer is:
> 
> void foo();
> typedef unsigned char v2qi __attribute__((vector_size(2)));
> void f(v2qi *ptr)
> {
>   v2qi x = *ptr;
>   asm volatile ("" :: "w" (x));
>   asm volatile ("" ::: "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15");
>   foo();
>   asm volatile ("" :: "w" (x));
>   *ptr = x;
> }

Interesting. With this reproducer we get the ICE from GCC 10 onwards when compiled with -O3 -march=armv8.2-a+sve -msve-vector-bits=128

Comment 7 GCC Commits 2024-08-21 16:36:00 UTC

The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:ec9d6d45191f639482344362d048294e74587ca3

commit r15-3073-gec9d6d45191f639482344362d048294e74587ca3
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Wed Aug 21 17:35:47 2024 +0100

    aarch64: Fix caller saves of VNx2QI [PR116238]
    
    The testcase contains a VNx2QImode pseudo that is live across a call
    and that cannot be allocated a call-preserved register.  LRA quite
    reasonably tried to save it before the call and restore it afterwards.
    Unfortunately, the target told it to do that in SImode, even though
    punning between SImode and VNx2QImode is disallowed by both
    TARGET_CAN_CHANGE_MODE_CLASS and TARGET_MODES_TIEABLE_P.
    
    The natural class to use for SImode is GENERAL_REGS, so this led
    to an unsalvageable situation in which we had:
    
      (set (subreg:VNx2QI (reg:SI A) 0) (reg:VNx2QI B))
    
    where A needed GENERAL_REGS and B needed FP_REGS.  We therefore ended
    up in a reload loop.
    
    The hooks above should ensure that this situation can never occur
    for incoming subregs.  It only happened here because the target
    explicitly forced it.
    
    The decision to use SImode for modes smaller than 4 bytes dates
    back to the beginning of the port, before 16-bit floating-point
    modes existed.  I'm not sure whether promoting to SImode really
    makes sense for any FPR, but that's a separate performance/QoI
    discussion.  For now, this patch just disallows using SImode
    when it is wrong for correctness reasons, since that should be
    safer to backport.
    
    gcc/
            PR testsuite/116238
            * config/aarch64/aarch64.cc (aarch64_hard_regno_caller_save_mode):
            Only return SImode if we can convert to and from it.
    
    gcc/testsuite/
            PR testsuite/116238
            * gcc.target/aarch64/sve/pr116238.c: New test.
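The rule change described in the commit message can be summarised with a simplified, standalone C model. This is not GCC's code: the `toy_*` modes and helpers are hypothetical, `toy_can_change_mode` is a crude stand-in for the TARGET_CAN_CHANGE_MODE_CLASS check that simply forbids punning between SVE and non-SVE modes, and VNx2QI is assumed to occupy 2 bytes under `-msve-vector-bits=128`. The sketch contrasts the old rule (save anything smaller than 4 bytes in SImode) with the patched rule (only use SImode if the value can be converted to it and back).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a few GCC machine modes. */
typedef enum { TOY_QImode, TOY_HImode, TOY_SImode, TOY_VNx2QImode } toy_mode;

/* Sizes in bytes, assuming -msve-vector-bits=128 so that VNx2QI
   holds two byte-sized lanes. */
static unsigned toy_mode_size(toy_mode m)
{
  switch (m) {
  case TOY_QImode: return 1;
  case TOY_HImode: return 2;
  case TOY_SImode: return 4;
  case TOY_VNx2QImode: return 2;
  }
  assert(!"unhandled mode");
  return 0;
}

static bool toy_is_sve_mode(toy_mode m) { return m == TOY_VNx2QImode; }

/* Toy stand-in for TARGET_CAN_CHANGE_MODE_CLASS: assume punning is only
   allowed between two SVE modes or two non-SVE modes. */
static bool toy_can_change_mode(toy_mode from, toy_mode to)
{
  return toy_is_sve_mode(from) == toy_is_sve_mode(to);
}

/* Old rule: anything smaller than 4 bytes is saved in SImode. */
static toy_mode toy_caller_save_mode_old(toy_mode m)
{
  return toy_mode_size(m) >= 4 ? m : TOY_SImode;
}

/* Patched rule: only promote to SImode if the value can be converted
   to SImode (for the save) and back again (for the restore). */
static toy_mode toy_caller_save_mode_new(toy_mode m)
{
  if (toy_mode_size(m) < 4
      && toy_can_change_mode(m, TOY_SImode)
      && toy_can_change_mode(TOY_SImode, m))
    return TOY_SImode;
  return m;
}
```

In this model the old rule picks SImode for VNx2QI, which is exactly the disallowed pun, while the new rule keeps VNx2QI and still promotes scalar QImode and HImode values as before, matching the commit's goal of changing behaviour only where SImode is wrong for correctness.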

Comment 8 Richard Sandiford 2024-08-21 16:37:45 UTC

Fixed on trunk.  Will backport after a while if there is no fallout.