Start from r14-8450 "tree-optimization/113602 - datarefs of non-addressables", various LoongArch target tests fail. Reduced test case: typedef double __attribute__ ((vector_size (32))) vec; register vec a asm("f25"), b asm("f26"), c asm("f27"); void test (void) { for (int i = 0; i < 4; i++) c[i] = a[i] < b[i] ? 0.1 : 0.2; } $ ./gcc/cc1 t.c -O2 -msimd=lasx -fno-ident -o xvfcmp-f.s -nostdinc t.c: In function ‘test’: t.c:8:10: internal compiler error: in expand_assignment, at expr.cc:6073 8 | c[i] = a[i] < b[i] ? 0.1 : 0.2; | ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~ 0x102029b expand_assignment(tree_node*, tree_node*, bool) ../../gcc/gcc/expr.cc:6073 0xe7a76b expand_gimple_stmt_1 ../../gcc/gcc/cfgexpand.cc:3992 0xe7aaf9 expand_gimple_stmt ../../gcc/gcc/cfgexpand.cc:4071 0xe8355b expand_gimple_basic_block ../../gcc/gcc/cfgexpand.cc:6127 0xe85b0d execute ../../gcc/gcc/cfgexpand.cc:6866
On x86_64 it's reproducible likewise: typedef double __attribute__ ((vector_size (32))) vec; register vec a asm("ymm2"), b asm("ymm0"), c asm("ymm1"); void test (void) { for (int i = 0; i < 4; i++) c[i] = a[i] < b[i] ? 0.1 : 0.2; } $ gcc/cc1 t.c -mavx2 t.c:8:10: internal compiler error: in expand_assignment, at expr.cc:6073 8 | c[i] = a[i] < b[i] ? 0.1 : 0.2; | ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~ 0x10d9759 expand_assignment(tree_node*, tree_node*, bool) ../../gcc/gcc/expr.cc:6073 0xf2e9c0 expand_gimple_stmt_1 ../../gcc/gcc/cfgexpand.cc:3992 0xf2ed4e expand_gimple_stmt ../../gcc/gcc/cfgexpand.cc:4071 0xf377be expand_gimple_basic_block ../../gcc/gcc/cfgexpand.cc:6127 0xf39f72 execute ../../gcc/gcc/cfgexpand.cc:6866 Self-confirming.
x86_64 Testcase (which invokes undefined behavior) which has been failing since at least 4.9.1 even: ``` typedef double __attribute__ ((vector_size (16))) vec; register vec a asm("ymm2"), b asm("ymm0"), c asm("ymm1"); void test (void) { for (int i = 0; i < 4; i++) c[i] = a[i] < b[i] ? 0.1 : 0.2; } ```
(In reply to Andrew Pinski from comment #2) > x86_64 Testcase (which invokes undefined behavior) which has been failing > since at least 4.9.1 even: > ``` > typedef double __attribute__ ((vector_size (16))) vec; > register vec a asm("ymm2"), b asm("ymm0"), c asm("ymm1"); > > void > test (void) > { > for (int i = 0; i < 4; i++) > c[i] = a[i] < b[i] ? 0.1 : 0.2; > } > ``` Phew. Then maybe I should refactor the LoongArch test cases to avoid using named registers...
typedef double __attribute__ ((vector_size (16))) vec; register vec a asm("xmm12"), b asm("xmm13"), c asm("xmm14"); void test (void) { for (int i = 0; i < 4; i++) c[i] = a[i] < b[i] ? 0.1 : 0.2; } ICEs with -O2 -mavx -ffixed-xmm{12,13,14} -std=gnu99 starting with r0-104000-g30cd1c5d04c18770e8688d7199c20c2d528df1cd when the vector indexing support has been added.
Here is a "valid" x86_64 testcase: ``` typedef float __attribute__ ((vector_size (64))) vec; register vec a asm("zmm2"), b asm("zmm0"), c asm("zmm1"); void test (void) { for (int i = 0; i < 8; i++) c[i] = a[i] < b[i] ? 0.1 : 0.2; } ``` Which fails with `-O2 -g0 -mavx512f` which started in GCC 4.9 when -mavx512f support was added.
The assert by the way: ``` if (!MEM_P (to_rtx)) { /* We can get constant negative offsets into arrays with broken user code. Translate this to a trap instead of ICEing. */ gcc_assert (TREE_CODE (offset) == INTEGER_CST); expand_builtin_trap (); to_rtx = gen_rtx_MEM (BLKmode, const0_rtx); } ``` Yes this is obvious is wrong when used with global registers ...
*** Bug 113626 has been marked as a duplicate of this bug. ***
Guess for an rvalue (if even that crashes) we want to expand it to some permutation or whole vector shift which moves the indexed elements first and then extract it, for lvalue we need to insert it similarly.
I will have a look.
(In reply to Jakub Jelinek from comment #8) > Guess for an rvalue (if even that crashes) we want to expand it to some > permutation or whole vector shift which moves the indexed elements first and > then extract it, for lvalue we need to insert it similarly. If we can we should match this up with .VEC_SET / .VEC_EXTRACT, otherwise we should go "simple" and spill. diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc index 7e2392ecd38..e94f292dd38 100644 --- a/gcc/gimple-isel.cc +++ b/gcc/gimple-isel.cc @@ -104,7 +104,8 @@ gimple_expand_vec_set_extract_expr (struct function *fun, machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0)); machine_mode extract_mode = TYPE_MODE (TREE_TYPE (ref)); - if (auto_var_in_fn_p (view_op0, fun->decl) + if ((auto_var_in_fn_p (view_op0, fun->decl) + || DECL_HARD_REGISTER (view_op0)) && !TREE_ADDRESSABLE (view_op0) && ((!is_extract && can_vec_set_var_idx_p (outermode)) || (is_extract ensures the former and fixes the ICE on x86_64 on trunk. The comment#5 testcase then results in the following loop: .L3: movslq %eax, %rdx vmovaps %zmm2, -56(%rsp) vmovaps %zmm0, -120(%rsp) vmovss -120(%rsp,%rdx,4), %xmm4 vmovss -56(%rsp,%rdx,4), %xmm3 vcmpltss %xmm4, %xmm3, %xmm3 vpbroadcastd %eax, %zmm4 addl $1, %eax vpcmpd $0, %zmm7, %zmm4, %k1 vblendvps %xmm3, %xmm5, %xmm6, %xmm3 vbroadcastss %xmm3, %zmm1{%k1} cmpl $8, %eax jne .L3 this isn't optimal of course, for optimality we need vectorization. But we still need to avoid the ICEs since vectorization can be disabled. That said, I'm quite sure in code using hard registers people are not doing such stupid things so I wonder how important it is to avoid "regressing" the vectorization here.
I think it is most important we don't ICE and generate correct code. I doubt this is used too much in real-world code, otherwise it would have been reported years ago, so how efficient it will be is less important.
(In reply to Jakub Jelinek from comment #11) > I think it is most important we don't ICE and generate correct code. I > doubt this is used too much in real-world code, otherwise it would have been > reported years ago, so how efficient it will be is less important. Hmm, but for another test case (LoongArch): typedef double __attribute__ ((vector_size (32))) vec; register vec a asm("f25"), b asm("f26"), c asm("f27"); void test (void) { for (int i = 0; i < 4; i++) c[i] = __builtin_isless (a[i], b[i]) ? 0.1 : 0.2; } I'll have to write a loop (because __builtin_isless does not work on vectors). Or is there a vector built-in I'm missing?
(In reply to Xi Ruoyao from comment #12) > (In reply to Jakub Jelinek from comment #11) > > I think it is most important we don't ICE and generate correct code. I > > doubt this is used too much in real-world code, otherwise it would have been > > reported years ago, so how efficient it will be is less important. > > Hmm, but for another test case (LoongArch): > > typedef double __attribute__ ((vector_size (32))) vec; > register vec a asm("f25"), b asm("f26"), c asm("f27"); > > void > test (void) > { > for (int i = 0; i < 4; i++) > c[i] = __builtin_isless (a[i], b[i]) ? 0.1 : 0.2; > } > > I'll have to write a loop (because __builtin_isless does not work on > vectors). Or is there a vector built-in I'm missing? Why are you doing that? Normally tests would do vec test (vec a, vec b) { vec c = {}; for (int i = 0; i < 4; i++) c[i] = __builtin_isless (a[i], b[i]) ? 0.1 : 0.2; return c; } or something similar.
(In reply to Jakub Jelinek from comment #13) > (In reply to Xi Ruoyao from comment #12) > > (In reply to Jakub Jelinek from comment #11) > > > I think it is most important we don't ICE and generate correct code. I > > > doubt this is used too much in real-world code, otherwise it would have been > > > reported years ago, so how efficient it will be is less important. > > > > Hmm, but for another test case (LoongArch): > > > > typedef double __attribute__ ((vector_size (32))) vec; > > register vec a asm("f25"), b asm("f26"), c asm("f27"); > > > > void > > test (void) > > { > > for (int i = 0; i < 4; i++) > > c[i] = __builtin_isless (a[i], b[i]) ? 0.1 : 0.2; > > } > > > > I'll have to write a loop (because __builtin_isless does not work on > > vectors). Or is there a vector built-in I'm missing? > > Why are you doing that? > Normally tests would do > vec > test (vec a, vec b) > { > vec c = {}; > for (int i = 0; i < 4; i++) > c[i] = __builtin_isless (a[i], b[i]) ? 0.1 : 0.2; > return c; > } > or something similar. Because we are lacking a calling convention passing vectors in vector registers (it will be added in the future but not before GCC 14 release), thus I cannot test if the register operands are showing up in a correct order in the generated asm.
(In reply to Jakub Jelinek from comment #11) > I think it is most important we don't ICE and generate correct code. I > doubt this is used too much in real-world code, otherwise it would have been > reported years ago, so how efficient it will be is less important. We do spill on the read side already. On the write side the ICE is because of r0-71337-g1e188d1e130034. Note we're spilling parts of bitpos to offset: /* Otherwise, split it up. */ if (offset) { /* Avoid returning a negative bitpos as this may wreak havoc later. */ if (!bit_offset.to_shwi (pbitpos) || maybe_lt (*pbitpos, 0)) { *pbitpos = num_trailing_bits (bit_offset.force_shwi ()); poly_offset_int bytes = bits_to_bytes_round_down (bit_offset); offset = size_binop (PLUS_EXPR, offset, build_int_cst (sizetype, bytes.force_shwi ())); } *poffset = offset; but it can also be large positive when the bit amount doesn't fit a HWI. The flow of 'to' expansion is a bit awkward, but the following properly spills in case of variable offset and non-MEM_P: diff --git a/gcc/expr.cc b/gcc/expr.cc index ee822c11dce..f54d0b1474e 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -6061,6 +6061,7 @@ expand_assignment (tree to, tree from, bool nontemporal) to_rtx = adjust_address (to_rtx, BLKmode, 0); } + rtx stemp = NULL_RTX, old_to_rtx = NULL_RTX; if (offset != 0) { machine_mode address_mode; @@ -6070,9 +6071,24 @@ expand_assignment (tree to, tree from, bool nontemporal) { /* We can get constant negative offsets into arrays with broken user code. Translate this to a trap instead of ICEing. */ - gcc_assert (TREE_CODE (offset) == INTEGER_CST); - expand_builtin_trap (); - to_rtx = gen_rtx_MEM (BLKmode, const0_rtx); + if (TREE_CODE (offset) == INTEGER_CST) + { + expand_builtin_trap (); + to_rtx = gen_rtx_MEM (BLKmode, const0_rtx); + } + /* Else spill for variable offset to the destination. */ + else + { + gcc_assert (!TREE_CODE (from) == CALL_EXPR + && COMPLETE_TYPE_P (TREE_TYPE (from)) + && (TREE_CODE (TYPE_SIZE (TREE_TYPE (from))) + != INTEGER_CST)); + stemp = assign_stack_temp (GET_MODE (to_rtx), + GET_MODE_SIZE (GET_MODE (to_rtx))); + emit_move_insn (stemp, to_rtx); + old_to_rtx = to_rtx; + to_rtx = stemp; + } } offset_rtx = expand_expr (offset, NULL_RTX, VOIDmode, EXPAND_SUM); @@ -6305,6 +6321,9 @@ expand_assignment (tree to, tree from, bool nontemporal) bitregion_start, bitregion_end, mode1, from, get_alias_set (to), nontemporal, reversep); + /* Move the temporary storage back to the non-MEM_P. */ + if (stemp) + emit_move_insn (old_to_rtx, stemp); } if (result)
typedef double __attribute__ ((vector_size (16))) vec; void test (void) { register vec a asm("xmm1"), b asm("xmm2"), c asm("xmm3"); for (int i = 0; i < 2; i++) c[i] = a[i] < b[i] ? 0.1 : 0.2; } also ICEs with -O0 -msse.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:96bc048d78f804bac0fa7b2ca3b6dd3a04c68217 commit r14-8497-g96bc048d78f804bac0fa7b2ca3b6dd3a04c68217 Author: Richard Biener <rguenther@suse.de> Date: Mon Jan 29 09:47:31 2024 +0100 middle-end/113622 - allow .VEC_SET and .VEC_EXTRACT for global hard regs The following expands .VEC_SET and .VEC_EXTRACT instruction selection to global hard registers, not only automatic variables (possibly) promoted to registers. This can avoid some ICEs later and create better code. PR middle-end/113622 * gimple-isel.cc (gimple_expand_vec_set_extract_expr): Also allow DECL_HARD_REGISTER variables. * gcc.target/i386/pr113622-1.c: New testcase.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:0f7945417f913c85bd556904c0c4e7bf77793488 commit r14-8498-g0f7945417f913c85bd556904c0c4e7bf77793488 Author: Richard Biener <rguenther@suse.de> Date: Mon Jan 29 10:24:39 2024 +0100 middle-end/113622 - handle store with variable index to register The following implements storing to a non-MEM_P with a variable offset. We usually avoid this by forcing expansion to memory but this doesn't work for hard register variables. The solution is to spill and operate on the stack. PR middle-end/113622 * expr.cc (expand_assignment): Spill hard registers if we index them with a variable offset. * gcc.target/i386/pr113622-2.c: New testcase. * gcc.target/i386/pr113622-3.c: Likewise.
Should be fixed on trunk, not sure to what extent backporting is suitable.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:47b81161c98cf2ff5495d4aa6386cc3c87f9d27b commit r14-8515-g47b81161c98cf2ff5495d4aa6386cc3c87f9d27b Author: Jakub Jelinek <jakub@redhat.com> Date: Tue Jan 30 09:31:22 2024 +0100 testsuite: Fix up pr113622-{2,3}.c for i686-linux [PR113622] The 2 new tests FAIL for me on i686-linux: .../gcc/testsuite/gcc.target/i386/pr113622-2.c:5:14: error: data type of 'a' isn't suitable for a register .../gcc/testsuite/gcc.target/i386/pr113622-2.c:5:29: error: data type of 'b' isn't suitable for a register .../gcc/testsuite/gcc.target/i386/pr113622-2.c:5:44: error: data type of 'c' isn't suitable for a register The problem is that the tests use vectors of double, something added only in SSE2, while the testcases ask for just -msse which only provides vectors of floats. So, either it should be using floats instead of doubles, or we need to add -msse2 to dg-options. I've done the latter. 2024-01-30 Jakub Jelinek <jakub@redhat.com> PR middle-end/113622 * gcc.target/i386/pr113622-2.c: Use -msse2 instead of -msse in dg-options. * gcc.target/i386/pr113622-3.c: Likewise.
This still blows up on LoongArch even after r14-8498: typedef float __attribute__ ((vector_size (16))) vec; typedef int __attribute__ ((vector_size (16))) ivec; register vec a asm("f25"), b asm("f26"); register ivec c asm("f27"); void test (void) { for (int i = 0; i < 4; i++) c[i] = a[i] < b[i] ? -1 : 1; } $ gcc/cc1 -msimd=lsx t.c -O2 -fno-vect-cost-model -nostdinc t.c: In function ‘test’: t.c:7:1: internal compiler error: in expand_expr_addr_expr_1, at expr.cc:9139 7 | test (void) | ^~~~ 0x102d7cb expand_expr_addr_expr_1 ../../gcc/gcc/expr.cc:9139 0x102e13e expand_expr_addr_expr ../../gcc/gcc/expr.cc:9252 0x103df07 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) ../../gcc/gcc/expr.cc:12585 0x102e824 expand_expr_real(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) ../../gcc/gcc/expr.cc:9440 0xe45096 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier) ../../gcc/gcc/expr.h:316 0x102fc56 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) ../../gcc/gcc/expr.cc:9762 0x10361c9 expand_expr_real_gassign(gassign*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) ../../gcc/gcc/expr.cc:11096 0xe798de expand_gimple_stmt_1 ../../gcc/gcc/cfgexpand.cc:4010 0xe79b81 expand_gimple_stmt ../../gcc/gcc/cfgexpand.cc:4071 0xe825e3 expand_gimple_basic_block ../../gcc/gcc/cfgexpand.cc:6127 0xe84b95 execute ../../gcc/gcc/cfgexpand.cc:6866 Interestingly -fno-vect-cost-model is needed to trigger the ICE.
On x86_64: $ cat t.c typedef float __attribute__ ((vector_size (16))) vec; typedef int __attribute__ ((vector_size (16))) ivec; register vec a asm("xmm0"), b asm("xmm1"); register ivec c asm("xmm2"); void test (void) { for (int i = 0; i < 4; i++) c[i] = a[i] < b[i] ? -1 : 1; } $ gcc/cc1 -msse2 t.c -O2 -fno-vect-cost-model -nostdinc -ffixed-xmm{0,1,2} t.c: In function 'test': t.c:7:1: internal compiler error: in expand_expr_addr_expr_1, at expr.cc:9139 7 | test (void) | ^~~~ 0x10e6d6e expand_expr_addr_expr_1 ../../gcc/gcc/expr.cc:9139 0x10e76e2 expand_expr_addr_expr ../../gcc/gcc/expr.cc:9252 0x10f73a7 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) ../../gcc/gcc/expr.cc:12585 0x10e7dc8 expand_expr_real(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) ../../gcc/gcc/expr.cc:9440 0xef7346 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier) ../../gcc/gcc/expr.h:316 0x10e91fa expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) ../../gcc/gcc/expr.cc:9762 0x10ef77d expand_expr_real_gassign(gassign*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) ../../gcc/gcc/expr.cc:11096 0xf2db31 expand_gimple_stmt_1 ../../gcc/gcc/cfgexpand.cc:4010 0xf2ddd4 expand_gimple_stmt ../../gcc/gcc/cfgexpand.cc:4071 0xf36844 expand_gimple_basic_block ../../gcc/gcc/cfgexpand.cc:6127 0xf38ff8 execute ../../gcc/gcc/cfgexpand.cc:6866 Should I open a new ticket or add back 14 Regression to the subject?
On Tue, 30 Jan 2024, xry111 at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113622 > > --- Comment #22 from Xi Ruoyao <xry111 at gcc dot gnu.org> --- > On x86_64: > > $ cat t.c > typedef float __attribute__ ((vector_size (16))) vec; > typedef int __attribute__ ((vector_size (16))) ivec; > register vec a asm("xmm0"), b asm("xmm1"); > register ivec c asm("xmm2"); > > void > test (void) > { > for (int i = 0; i < 4; i++) > c[i] = a[i] < b[i] ? -1 : 1; > } > $ gcc/cc1 -msse2 t.c -O2 -fno-vect-cost-model -nostdinc -ffixed-xmm{0,1,2} > t.c: In function 'test': > t.c:7:1: internal compiler error: in expand_expr_addr_expr_1, at expr.cc:9139 > 7 | test (void) > | ^~~~ > 0x10e6d6e expand_expr_addr_expr_1 > ../../gcc/gcc/expr.cc:9139 > 0x10e76e2 expand_expr_addr_expr > ../../gcc/gcc/expr.cc:9252 > 0x10f73a7 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > ../../gcc/gcc/expr.cc:12585 > 0x10e7dc8 expand_expr_real(tree_node*, rtx_def*, machine_mode, expand_modifier, > rtx_def**, bool) > ../../gcc/gcc/expr.cc:9440 > 0xef7346 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier) > ../../gcc/gcc/expr.h:316 > 0x10e91fa expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, > expand_modifier) > ../../gcc/gcc/expr.cc:9762 > 0x10ef77d expand_expr_real_gassign(gassign*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > ../../gcc/gcc/expr.cc:11096 > 0xf2db31 expand_gimple_stmt_1 > ../../gcc/gcc/cfgexpand.cc:4010 > 0xf2ddd4 expand_gimple_stmt > ../../gcc/gcc/cfgexpand.cc:4071 > 0xf36844 expand_gimple_basic_block > ../../gcc/gcc/cfgexpand.cc:6127 > 0xf38ff8 execute > ../../gcc/gcc/cfgexpand.cc:6866 > > Should I open a new ticket or add back 14 Regression to the subject? Please open a new ticked - this seems to be another vectorizer issue. We end up with the invalid _28 = (sizetype) &a;
It looks I can rewrite the LoongArch test case (still broken though ICE is stopped) using check-function-bodies. Will try later...
The releases/gcc-13 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:d4c0800aab864bb95260e12342d18695c6ebbec8 commit r13-8483-gd4c0800aab864bb95260e12342d18695c6ebbec8 Author: Richard Biener <rguenther@suse.de> Date: Mon Jan 29 09:47:31 2024 +0100 middle-end/113622 - allow .VEC_SET and .VEC_EXTRACT for global hard regs The following expands .VEC_SET and .VEC_EXTRACT instruction selection to global hard registers, not only automatic variables (possibly) promoted to registers. This can avoid some ICEs later and create better code. PR middle-end/113622 * gimple-isel.cc (gimple_expand_vec_set_extract_expr): Also allow DECL_HARD_REGISTER variables. * gcc.target/i386/pr113622-1.c: New testcase. (cherry picked from commit 96bc048d78f804bac0fa7b2ca3b6dd3a04c68217)
I've enabled vec_set for hard-regs on the branch and plugged the vectorizer hole for GCC 13. I'm not sure to what extent we need the expansion change.