Since a new register allocator (LRA) has been added in 4.8 it might be worth trying it out for SH.
Author: olegendo Date: Sun Aug 3 09:47:15 2014 New Revision: 213524 URL: https://gcc.gnu.org/viewcvs?rev=213524&root=gcc&view=rev Log: PR target/55212 Create branch. Added: branches/sh-lra/ - copied from r213515, trunk/
I've committed a small change as r215140. In my case building libgcc failed in the function __divsc3 and the change in sh_secondary_reload made it go away. I've briefly compared the resulting code against non-LRA and it looked OK. However, building __divsc3 still fails for -m2 -mb, now with the following: beh 0 0 0 (insn 1159 1008 1120 27 (set (reg:QI 625) (mem/c:QI (plus:SI (reg/f:SI 153 sfp) (const_int 19 [0x13])) [2 %sfp+-1 S1 A8])) sh_tmp.cpp:45 -1 (nil)) sh_tmp.cpp: In function '__divsc3': sh_tmp.cpp:57:1: internal compiler error: in lra_set_insn_recog_data, at lra.c:947 } ^ 0x8524c29 lra_set_insn_recog_data(rtx_def*) ../../gcc-sh-lra/gcc/lra.c:947 0x85258cb lra_get_insn_recog_data ../../gcc-sh-lra/gcc/lra-int.h:468 0x85258cb lra_update_insn_regno_info ../../gcc-sh-lra/gcc/lra.c:1607 0x85258cb lra_update_insn_regno_info ../../gcc-sh-lra/gcc/lra.c:1598 0x8525b1b lra_push_insn_1 ../../gcc-sh-lra/gcc/lra.c:1660 0x8525b1b lra_push_insn(rtx_def*) ../../gcc-sh-lra/gcc/lra.c:1668 0x8525d12 push_insns ../../gcc-sh-lra/gcc/lra.c:1711 0x852618f lra_process_new_insns(rtx_def*, rtx_def*, rtx_def*, char const*) ../../gcc-sh-lra/gcc/lra.c:1756 0x8533758 simplify_operand_subreg ../../gcc-sh-lra/gcc/lra-constraints.c:1523 0x8533758 curr_insn_transform ../../gcc-sh-lra/gcc/lra-constraints.c:3258 0x8535e2d lra_constraints(bool) ../../gcc-sh-lra/gcc/lra-constraints.c:4212 0x8526af8 lra(_IO_FILE*) ../../gcc-sh-lra/gcc/lra.c:2198 0x84e6823 do_reload ../../gcc-sh-lra/gcc/ira.c:5306 0x84e6823 execute ../../gcc-sh-lra/gcc/ira.c:5465 The generated insn 1159 isn't recognized, because the displacement value is out of range. It seems that LRA doesn't try to transform the out of range displacement QImode move into something else so that the <disp04> constraint is satisfied. Maybe this is due to the way I did the QI/HImode displacement patterns. AFAIR, reload uses the macro LEGITIMIZE_RELOAD_ADDRESS (sh_legitimize_reload_address) to handle such cases. LRA can decompose addresses. However, it looks like it's not done in some cases?
Created attachment 33465 [details] problematic libgcc divsc3 function
(In reply to Oleg Endo from comment #2) > However, building __divsc3 still fails for -m2 -mb, now with the > following: > > beh 0 0 0 > (insn 1159 1008 1120 27 (set (reg:QI 625) > (mem/c:QI (plus:SI (reg/f:SI 153 sfp) > (const_int 19 [0x13])) [2 %sfp+-1 S1 A8])) sh_tmp.cpp:45 -1 > (nil)) > > sh_tmp.cpp: In function '__divsc3': > sh_tmp.cpp:57:1: internal compiler error: in lra_set_insn_recog_data, at > lra.c:947 > } > ^ > 0x8524c29 lra_set_insn_recog_data(rtx_def*) > ../../gcc-sh-lra/gcc/lra.c:947 > 0x85258cb lra_get_insn_recog_data > ../../gcc-sh-lra/gcc/lra-int.h:468 > 0x85258cb lra_update_insn_regno_info > ../../gcc-sh-lra/gcc/lra.c:1607 > 0x85258cb lra_update_insn_regno_info > ../../gcc-sh-lra/gcc/lra.c:1598 > 0x8525b1b lra_push_insn_1 > ../../gcc-sh-lra/gcc/lra.c:1660 > 0x8525b1b lra_push_insn(rtx_def*) > ../../gcc-sh-lra/gcc/lra.c:1668 > 0x8525d12 push_insns > ../../gcc-sh-lra/gcc/lra.c:1711 > 0x852618f lra_process_new_insns(rtx_def*, rtx_def*, rtx_def*, char const*) > ../../gcc-sh-lra/gcc/lra.c:1756 > 0x8533758 simplify_operand_subreg > ../../gcc-sh-lra/gcc/lra-constraints.c:1523 > 0x8533758 curr_insn_transform > ../../gcc-sh-lra/gcc/lra-constraints.c:3258 > 0x8535e2d lra_constraints(bool) > ../../gcc-sh-lra/gcc/lra-constraints.c:4212 > 0x8526af8 lra(_IO_FILE*) > ../../gcc-sh-lra/gcc/lra.c:2198 > 0x84e6823 do_reload > ../../gcc-sh-lra/gcc/ira.c:5306 > 0x84e6823 execute > ../../gcc-sh-lra/gcc/ira.c:5465 > > The generated insn 1159 isn't recognized, because the displacement value is > out of range. When compiling for -m2a-nofpu -mb the problem goes away because SH2A can handle the displacement. However, there are no QImode accesses in the resulting code after LRA.
Compiling the gcc.dg/atomic/c11-atomic-exec-4.c test case with '-O2 -m4 -ml -matomic-model=soft-gusa' results in the following: beh (insn 207 206 208 6 (set (reg/f:SI 14 r14 [275]) (plus:SI (reg/f:SI 15 r15) (const_int 36 [0x24]))) 76 {*addsi3_compact} (expr_list:REG_EQUIV (plus:SI (reg/f:SI 153 sfp) (const_int -4 [0xfffffffffffffffc])) (nil))) internal compiler error: in check_rtl, at lra.c:1936 TEST_FUNCS (complex_float_add, _Complex float, , += 1, 0, 20000) ^ 0x85416bf check_rtl ../../gcc-sh-lra/gcc/lra.c:1936 0x8544a9a lra(_IO_FILE*) ../../gcc-sh-lra/gcc/lra.c:2327 0x8500e43 do_reload ../../gcc-sh-lra/gcc/ira.c:5312 0x8500e43 execute ../../gcc-sh-lra/gcc/ira.c:5471 the addsi3_compact insn should have been split into two insns: 1x reg-reg-move + 1x add-insn.
Author: olegendo Date: Sat Sep 13 16:23:55 2014 New Revision: 215240 URL: https://gcc.gnu.org/viewcvs?rev=215240&root=gcc&view=rev Log: PR target/55212 * lra.c: fix comments. print insn before lra_assert. Modified: branches/sh-lra/gcc/lra.c
An issue similar to that in c#5 has been reported here: https://gcc.gnu.org/ml/gcc/2014-04/msg00273.html I've tried applying that patch, but it doesn't fix the issue in c#2 nor c#5.
Author: olegendo Date: Sat Sep 13 18:53:54 2014 New Revision: 215241 URL: https://gcc.gnu.org/viewcvs?rev=215241&root=gcc&view=rev Log: PR target/55212 * predicates.md (general_movsrc_operand, general_movdst_operand): Don't legitimate QI/HImode mem displacement while LRA is running. Modified: branches/sh-lra/gcc/config/sh/predicates.md
(In reply to Oleg Endo from comment #8) > Author: olegendo > Date: Sat Sep 13 18:53:54 2014 > New Revision: 215241 > > URL: https://gcc.gnu.org/viewcvs?rev=215241&root=gcc&view=rev > Log: > PR target/55212 > * predicates.md (general_movsrc_operand, general_movdst_operand): > Don't legitimate QI/HImode mem displacement while LRA is running. > > Modified: > branches/sh-lra/gcc/config/sh/predicates.md libgcc now seems to compile for all default multilibs defined for sh-elf. The next failure is: gcc-sh-lra-build-sh-elf/sh-elf/libstdc++-v3/include/bits/locale_facets_nonio.h:1564:66: internal compiler error: in set_address_disp, at rtlanal.c:5790 { return this->do_put(__s, __intl, __io, __fill, __units); } ^ 0x8830497 set_address_disp ../../gcc-sh-lra/gcc/rtlanal.c:5790 0x8830497 set_address_disp ../../gcc-sh-lra/gcc/rtlanal.c:6080 0x8830497 decompose_normal_address ../../gcc-sh-lra/gcc/rtlanal.c:5912 0x8830497 decompose_address ../../gcc-sh-lra/gcc/rtlanal.c:5997 0x88305b1 decompose_mem_address(address_info*, rtx_def*) ../../gcc-sh-lra/gcc/rtlanal.c:6017 0x8745d7b process_address_1 ../../gcc-sh-lra/gcc/lra-constraints.c:2783 0x874aed9 process_address ../../gcc-sh-lra/gcc/lra-constraints.c:2990 0x874aed9 curr_insn_transform ../../gcc-sh-lra/gcc/lra-constraints.c:3274 0x874d905 lra_constraints(bool) ../../gcc-sh-lra/gcc/lra-constraints.c:4215 0x873ce6c lra(_IO_FILE*) ../../gcc-sh-lra/gcc/lra.c:2206 0x86f99a3 do_reload ../../gcc-sh-lra/gcc/ira.c:5312 0x86f99a3 execute ../../gcc-sh-lra/gcc/ira.c:5471
(In reply to Oleg Endo from comment #9) > > The next failure is: > > gcc-sh-lra-build-sh-elf/sh-elf/libstdc++-v3/include/bits/locale_facets_nonio. > h:1564:66: internal compiler error: in set_address_disp, at rtlanal.c:5790 > { return this->do_put(__s, __intl, __io, __fill, __units); } > ^ > 0x8830497 set_address_disp > ../../gcc-sh-lra/gcc/rtlanal.c:5790 > 0x8830497 set_address_disp > ../../gcc-sh-lra/gcc/rtlanal.c:6080 > 0x8830497 decompose_normal_address > ../../gcc-sh-lra/gcc/rtlanal.c:5912 > 0x8830497 decompose_address > ../../gcc-sh-lra/gcc/rtlanal.c:5997 > 0x88305b1 decompose_mem_address(address_info*, rtx_def*) > ../../gcc-sh-lra/gcc/rtlanal.c:6017 > 0x8745d7b process_address_1 > ../../gcc-sh-lra/gcc/lra-constraints.c:2783 > 0x874aed9 process_address > ../../gcc-sh-lra/gcc/lra-constraints.c:2990 > 0x874aed9 curr_insn_transform > ../../gcc-sh-lra/gcc/lra-constraints.c:3274 > 0x874d905 lra_constraints(bool) > ../../gcc-sh-lra/gcc/lra-constraints.c:4215 > 0x873ce6c lra(_IO_FILE*) > ../../gcc-sh-lra/gcc/lra.c:2206 > 0x86f99a3 do_reload > ../../gcc-sh-lra/gcc/ira.c:5312 > 0x86f99a3 execute > ../../gcc-sh-lra/gcc/ira.c:5471 This is due to the following invalid mem that ends up in the "*extend<mode>si2_compact_snd" pattern: (insn 8 5 10 2 (set (reg/v:SI 171 [ __fill ]) (sign_extend:SI (mem/c:QI (plus:SI (plus:SI (reg/f:SI 145 ap) (const_int 12 [0xc])) (const_int 4 [0x4])) [0 __fill+0 S1 A32]))) locale_facets_nonio.h:1562 241 {*extendqisi2_compact_snd} (expr_list:REG_EQUIV (mem:SI (plus:SI (reg/f:SI 15 r15) (const_int 4 [0x4])) [0 S4 A32]) (nil))) in function "decompose_address": inner = (plus:SI (plus:SI (reg/f:SI 145 ap) (const_int 12 [0xc])) (const_int 4 [0x4]))
Author: olegendo Date: Sat Sep 13 20:21:00 2014 New Revision: 215243 URL: https://gcc.gnu.org/viewcvs?rev=215243&root=gcc&view=rev Log: PR target/55212 * predicates.md (general_movsrc_operand, general_movdst_operand): Allow only valid plus address expressions. Modified: branches/sh-lra/gcc/config/sh/predicates.md
(In reply to Oleg Endo from comment #11) > Author: olegendo > Date: Sat Sep 13 20:21:00 2014 > New Revision: 215243 > > URL: https://gcc.gnu.org/viewcvs?rev=215243&root=gcc&view=rev > Log: > PR target/55212 > * predicates.md (general_movsrc_operand, general_movdst_operand): > Allow only valid plus address expressions. > > Modified: > branches/sh-lra/gcc/config/sh/predicates.md After that, we're back to: beh (insn 189 188 1583 22 (set (reg/f:SI 14 r14 [561]) (plus:SI (reg/f:SI 15 r15) (const_int 28 [0x1c]))) locale_facets_nonio.h:1195 76 {*addsi3_compact} (expr_list:REG_EQUAL (plus:SI (reg/f:SI 153 sfp) (const_int -36 [0xffffffffffffffdc])) (nil))) locale_facets_nonio.tcc:125:5: internal compiler error: in check_rtl, at lra.c:1935 0x873a21f check_rtl ../../gcc-sh-lra/gcc/lra.c:1935 0x873d5fa lra(_IO_FILE*) ../../gcc-sh-lra/gcc/lra.c:2326 0x86f99a3 do_reload ../../gcc-sh-lra/gcc/ira.c:5312 0x86f99a3 execute ../../gcc-sh-lra/gcc/ira.c:5471 ... which seems to be the same issue as in c#5
Author: olegendo Date: Sat Sep 13 20:40:21 2014 New Revision: 215244 URL: https://gcc.gnu.org/viewcvs?rev=215244&root=gcc&view=rev Log: PR target/55212 * Remove LEGITIMIZE_RELOAD_ADDRESS and sh_legitimize_reload_address. They are not used by LRA. Modified: branches/sh-lra/gcc/config/sh/sh-protos.h branches/sh-lra/gcc/config/sh/sh.c branches/sh-lra/gcc/config/sh/sh.h
Author: olegendo Date: Sat Sep 13 21:32:27 2014 New Revision: 215246 URL: https://gcc.gnu.org/viewcvs?rev=215246&root=gcc&view=rev Log: PR target/55212 * Apply patch from https://gcc.gnu.org/ml/gcc/2014-04/msg00273.html Modified: branches/sh-lra/gcc/lra-constraints.c
(In reply to Oleg Endo from comment #14) > Author: olegendo > Date: Sat Sep 13 21:32:27 2014 > New Revision: 215246 > > URL: https://gcc.gnu.org/viewcvs?rev=215246&root=gcc&view=rev > Log: > PR target/55212 > * Apply patch from https://gcc.gnu.org/ml/gcc/2014-04/msg00273.html > > Modified: > branches/sh-lra/gcc/lra-constraints.c Now the build 'make all' for sh-elf c,c++ completes without errors. New problem: Compiling newlib 1.20.0 with '-O2 -m2e': ./../../../../../newlib-1.20.0/newlib/libm/math/er_lgamma.c: In function '__ieee754_lgamma_r': ../../../../../../newlib-1.20.0/newlib/libm/math/er_lgamma.c:309:1: internal compiler error: Max. number of generated reload insns per insn is achieved (90) } ^ 0x85551c5 lra_constraints(bool) ../../gcc-sh-lra/gcc/lra-constraints.c:4145 0x85442bc lra(_IO_FILE*) ../../gcc-sh-lra/gcc/lra.c:2206 0x8500df3 do_reload ../../gcc-sh-lra/gcc/ira.c:5312 0x8500df3 execute ../../gcc-sh-lra/gcc/ira.c:5471 Please submit a full bug report,
(In reply to Oleg Endo from comment #5) > Compiling the gcc.dg/atomic/c11-atomic-exec-4.c test case with '-O2 -m4 -ml > -matomic-model=soft-gusa' results in the following: > This problem seems to be gone now, too.
Author: olegendo Date: Sat Sep 13 22:13:56 2014 New Revision: 215247 URL: https://gcc.gnu.org/viewcvs?rev=215247&root=gcc&view=rev Log: PR target/55212 * lra-constraints.c: Print insn when hitting MAX_RELOAD_INSNS_NUMBER. Modified: branches/sh-lra/gcc/lra-constraints.c
Created attachment 33489 [details] Reduced case for MAX_RELOAD_INSNS_NUMBER This fails when compiling with -O2 -m2e -m{l|b} with: curr_insn = (insn 713 115 712 16 (parallel [ (set (reg/v:SF 804 [orig:333 x ] [333]) (reg/v:SF 714 [orig:333 x ] [333])) (use (reg/v:PSI 151 )) (clobber (scratch:SI)) ]) er_lgamma.c:85 296 {movsf_ie} (nil)) In file included from sh_tmp.cpp:4:0: er_lgamma.c: In function '__ieee754_lgamma_r': er_lgamma.c:151:1: internal compiler error: Max. number of generated reload insns per insn is achieved (90) } ^ 0x8555215 lra_constraints(bool) ../../gcc-sh-lra/gcc/lra-constraints.c:4149 0x85442bc lra(_IO_FILE*) ../../gcc-sh-lra/gcc/lra.c:2206 0x8500df3 do_reload ../../gcc-sh-lra/gcc/ira.c:5312 0x8500df3 execute ../../gcc-sh-lra/gcc/ira.c:5471
(In reply to Oleg Endo from comment #18) > Created attachment 33489 [details] > Reduced case for MAX_RELOAD_INSNS_NUMBER > > This fails when compiling with -O2 -m2e -m{l|b} with: > > curr_insn = (insn 713 115 712 16 (parallel [ > (set (reg/v:SF 804 [orig:333 x ] [333]) > (reg/v:SF 714 [orig:333 x ] [333])) > (use (reg/v:PSI 151 )) > (clobber (scratch:SI)) > ]) er_lgamma.c:85 296 {movsf_ie} > (nil)) It's a reload loop and is maybe similar to PR 57032.
Vlad, do you have any idea/suggestion what could be going wrong in this case? Also, could you please have a look at the patch applied to the sh-lra branch as r215246 and see whether it makes sense to commit it to trunk?
I've run check-gcc for sh-lra branch on sh4-unknown-linux-gnu and got === gcc Summary === # of expected passes 83035 # of unexpected failures 169 # of unexpected successes 1 # of expected failures 119 # of unresolved testcases 45 # of unsupported tests 1514 It looks that 3 new kind of ICEs except the ICE in #c18 which causes ~100 failures. ~50 failures are internal compiler error: in assign_by_spills, at lra-assigns.c:1335 which may be similar with mips's PR61610. The typical example is c-c++-common/torture/complex-sign-add.c -O2. There are some unrecognizable insn errors for the rtl like (insn 11 10 12 2 (set (reg/f:SI 166) (mem/f/j:SI (plus:SI (subreg:SI (reg/v:DI 162 [ s1 ]) 0) (const_int 4 [0x4])) [0 _5->m+0 S4 A32])) It seems that the patch in #c11 should take account of subregs. The example of another type of new failures is gcc.c-torture/execute/20040703-1.c:127:1: error: could not split insn (insn 118 112 119 (parallel [ (set (reg:SI 9 r9 [orig:253 D.1616 ] [253]) (xor:SI (zero_extend:SI (reg:QI 147 t)) (const_int 1 [0x1]))) (clobber (reg:SI 1 r1 [287])) (clobber (reg:SI 147 t)) ]) gcc.c-torture/execute/20040703-1.c:54 401 {*movrt_negc} It looks that LRA substitutes (zero_extend:SI (subreg:QI (reg:SI 147 t) 0)) with (zero_extend:SI (reg:QI 147 t)) in this case.
Created attachment 33505 [details] A possible patch These last two errors could be fixed with the attached patch: * config/sh/predicates.md (general_movsrc_operand): Take subregs into account for plus address expression. (general_movdst_operand): Likewise. (t_reg_operand): Allow (zero_extend (reg t)).
(In reply to Kazumoto Kojima from comment #22) > Created attachment 33505 [details] > A possible patch > > These last two errors could be fixed with the attached patch: > > * config/sh/predicates.md (general_movsrc_operand): Take > subregs into account for plus address expression. > (general_movdst_operand): Likewise. > (t_reg_operand): Allow (zero_extend (reg t)). Makes sense.
Author: kkojima Date: Wed Sep 17 23:24:40 2014 New Revision: 215341 URL: https://gcc.gnu.org/viewcvs?rev=215341&root=gcc&view=rev Log: PR target/55212 * config/sh/predicates.md (general_movsrc_operand): Take subregs into account for plus address expression. (general_movdst_operand): Likewise. (t_reg_operand): Allow (zero_extend (reg t)). Modified: branches/sh-lra/gcc/config/sh/predicates.md
Created attachment 33524 [details] Reduced test case for ICE in assign_by_spill I've looked into what is going on for the ICE in assign_by_spill. The above test case ICEs on sh-elf with "-m4 -ml -O1 -ftree-parallelize-loops=4 -ftree-vectorize". .ira and .reload dumps show that the load and store insns 42: r208:SI=[r202:SI+r185:SI] 43: [r207:SI+r200:SI]=r218:HI // r218 = subreg:HI r208 are reloaded like as 73: r217:SI=r200:SI 79: r223:SI=r185:SI 74: r218:HI=[r202:SI+r223:SI] 43: [r207:SI+r217:SI]=r218:HI and R0_REGS is assigned to both r217 and r223 because the only register R0 can be used as the index register on SH. .reload says that the original insn 42 is deleted and the source operand r218 of insn 43 is substituted with its equiv [r202:SI+r185:SI]. It seems that it works also for SH if reload insn 73 was inserted at just before insn 43. I've attached .reload dump for this too.
Created attachment 33525 [details] .reload dump file
Created attachment 33526 [details] A trial patch It disables equiv substitution when the equiv includes some reg assigned to a small register class. I hope that it makes the issue a bit clearer, even if it doesn't the right thing.
Created attachment 33527 [details] Another reduced test case (with "-m4 -ml -O2 -std=gnu99") Here is a test case for the ICE in assign_by_spill which looks caused by another reason. .reload dump says that the problematic insn is something like 107: {r166:SF=[r942:SI+r943:SI];use :PSI;clobber r784:SI;} and Creating newreg=942 from oldreg=374, assigning class R0_REGS to address r942 Creating newreg=943, assigning class GENERAL_REGS to base + disp r943 Change to class R0_REGS for r943 which means that R0_REGS is assigned to both r942 and r943.
Created attachment 33528 [details] A trial patch The patch is to refrain from changing to class R0_REGS for r943. With the above 2 patches, the remained test case failing with the ICE in assign_by_spill is gcc.dg/pr38338.c which uses extra-ordinary __builtin_apply_args and __builtin_apply builtins. I can't get what is going on that case yet.
I thought I've CC'ed Vlad, but somehow I haven't. Trying again.
Created attachment 33556 [details] A bit reduced test case of pr38338.c (-O0 -m4 -ml) That case has only one basic block which looks like (prologue) (Save r4-r7,fr4-fr11,...) ;; Save argument mov.b r0,@(8,r1) ! 4 *movqi_store_mem_disp04/1 ;; insns don't use r0 ... ;; untyped_call mov.l .L3,r1 jsr @r1 nop ! 97 calli mov.l r0,@(56,r9) ! 98 movsi_ie/9 flds fr0,fpul ! 197 movsi_ie/22 sts fpul,r1 ! 198 movsi_ie/20 mov.l r1,@(60,r8) ! 99 movsi_ie/9 ;; ... if RA was done. It seems that LRA thought that r0 in insn 4 conflicts r0 in insn 98 and fails at reloading *movqi_store_mem_disp04 which permits only r0 as the source operand when reg+offset addressing mode is used. LRA looks right, because RA doesn't know that r0 is the function return value register set by function call. Target could notify it with clobbering function value registers before call.
Created attachment 33557 [details] Patch for SH untyped_call * config/sh/sh.md (untyped_call): Clobber function value registers before call.
(In reply to Kazumoto Kojima from comment #32) > Patch for SH untyped_call > config/sh/sh.md (untyped_call): Clobber function value registers before > call. (Sorry for butting in; I'm CC'ed with the purpose of observing issues likely to pop up for my own ports.) Will clobbering function-value registers before the call not cause problems on SH5, where function-value registers and parameters registers overlap (IIUC)? For reference, i386.md emits a blockage (i.e. *using* and clobbering all registers).
(In reply to Hans-Peter Nilsson from comment #33) > Will clobbering function-value registers before the call not cause problems > on SH5, where function-value registers and parameters registers overlap > (IIUC)? Yes. I've forgot about SH5. Thanks for pointing it out. The patch should be conditionalized for SH5 on which this issue won't happen. > For reference, i386.md emits a blockage (i.e. *using* and clobbering all > registers). It looks all ports including SH emit a blockage after storing the function return values to a memory block, not before the call.
(In reply to Kazumoto Kojima from comment #32) > Created attachment 33557 [details] > Patch for SH untyped_call > > * config/sh/sh.md (untyped_call): Clobber function value > registers before call. I'm just wondering ... how did/does that work without LRA (i.e. with IRA)?
(In reply to Oleg Endo from comment #35) > I'm just wondering ... how did/does that work without LRA (i.e. with IRA)? I'm not sure about the old reload. LRA makes only 3 uses of r0 and it's relatively easy to see what is going on there with .reload comments. The old reload makes many uses of r0 and it looks not easy to see how does that work.
(In reply to Kazumoto Kojima from comment #36) > I'm not sure about the old reload. LRA makes only 3 uses of r0 and it's > relatively easy to see what is going on there with .reload comments. > The old reload makes many uses of r0 and it looks not easy to see how > does that work. Hm, there's still this open PR 23868 ... maybe it's somehow related or even fixed with LRA + attachment 33557 [details] ...
(In reply to Kazumoto Kojima from comment #34) > For reference, > i386.md emits a blockage (i.e. *using* and clobbering all > registers). > It > looks all ports including SH emit a blockage after storing the function > return values to a memory block, not before the call. Hm, you're right, but to me that indicates the patch covering-up a bug elsewhere than in the sh port.
(In reply to Hans-Peter Nilsson from comment #38) > Hm, you're right, but to me that indicates the patch covering-up a bug > elsewhere than in the sh port. Would it be better to file an independent PR for this? It looks a latent problem of untyped_call implementations on some targets, though I guess that it's hard to hit because untyped_call is fairly exotic and the RA issue before that call can be happened only on arches with non-orthogonal instruction set like SH which heavily depends on R0.
I've tried to see what is going on with reload loop on movsf_ie for gcc.c-torture/compile/20050113-1.c -O1 -m4 -ml. It looks that the problem starts at reloading (insn 15 10 18 2 (parallel [ (set (subreg:SF (reg:SI 173 [ a ]) 0) (minus:SF (reg:SF 160 [ D.1385 ]) (reg:SF 160 [ D.1385 ]))) (use (reg/v:PSI 151 )) ]) yyy.i:10 443 {subsf3_i} (expr_list:REG_DEAD (reg:SF 160 [ D.1385 ]) (expr_list:REG_DEAD (reg/v:PSI 151 ) (nil)))) RA inserts 43: {r179:SF=r160:SF;use :PSI;clobber scratch;} before that insn and 44: {r173:SI#0=r179:SF;use :PSI;clobber scratch;} after that where #0 means subreg:SF 0. RA assigns register class FP_REGS to r179 here and tries to reload insn 44 which matches movsf_ie. SH has no direct move instruction from floating point reg to general reg and RA chooses the alt 1 in movesf_ie of which operands are (0) r (1) r (2) c (3) X. Then RA creates new reg r180 with assigning class GENERAL_REGS to r180 and inserts 45: {r180:SF=r179:SF;use :PSI;clobber scratch;} before insn 44 which is now 44: {r173:SI#0=r180:SF;use :PSI;clobber scratch;} RA tries to reload insn 45 which is in the same situation with the original 44, i.e. move from FP_REGS to GENERAL_REGS. Thus things are looping. It looks that the problematic cases are corresponding to the cases which were handled by the secondary reloads in the old reload. In the above example, if RA choosed the alt 13 in movsf_ie of which operands are (0) fr (1) rf (2) c (3) y, it didn't loop. It looks RA doesn't choose it because it requires more numbers of registers than the alt 1. The current movsf_ie uses (clobber (match_scratch:SI 3 "=X,X,Bsc,Bsc,&z,X,X,X,X,X,X,X,X,y,X,X,X,X,X"))] and this pattern might be not good for lra.
Created attachment 33601 [details] A trial patch for reload-loop problem My first trial is to define a special movsf_ie insn_and_split used when lra makes new reload insns. Here is the check-gcc result with the patch + c#29 + patch c#32: === gcc Summary === # of expected passes 83153 # of unexpected failures 21 # of unexpected successes 1 # of expected failures 119 # of unresolved testcases 3 # of unsupported tests 1514 /home/ldroot/dodes/xsh-gcc-lra/gcc/xgcc version 5.0.0 20140913 (experimental) (GCC) No "Max. number of generated reload insns per insn is achieved" errors. I'm backing out the patch in c#27 for this test because there are many execution errors with it.
(In reply to Kazumoto Kojima from comment #41) > Created attachment 33601 [details] > A trial patch for reload-loop problem > > My first trial is to define a special movsf_ie insn_and_split used > when lra makes new reload insns. That also fixes the case in comment #18. PR 54699 comes into my mind when seeing the movsf_ie patterns ... having another movsf_ie pattern is discomforting. But if it makes it work, so be it. Maybe we can try to combine the two movsf_ie insns later.
(In reply to Oleg Endo from comment #42) > PR 54699 comes into my mind when seeing the movsf_ie patterns ... having > another movsf_ie pattern is discomforting. But if it makes it work, so be > it. Maybe we can try to combine the two movsf_ie insns later. Sure. It's an experiment to clear the issue further. I'll apply it and the revised untyped_call patch on sh-lra branch.
Author: kkojima Date: Mon Sep 29 01:24:33 2014 New Revision: 215676 URL: https://gcc.gnu.org/viewcvs?rev=215676&root=gcc&view=rev Log: PR target/55212 * config/sh/sh-protos.h (sh_movsf_ie_ra_split_p): New prototype. * config/sh/sh.c (sh_movsf_ie_ra_split_p): New function. * config/sh/sh.md (movsi_ie): Use constraint "mr" instead of "m". (movsf_ie_ra): New insn_and_split. (movsf): Use movsf_ie_ra when lra_in_progress is set. Modified: branches/sh-lra/gcc/config/sh/sh-protos.h branches/sh-lra/gcc/config/sh/sh.c branches/sh-lra/gcc/config/sh/sh.md
Author: kkojima Date: Mon Sep 29 01:27:03 2014 New Revision: 215677 URL: https://gcc.gnu.org/viewcvs?rev=215677&root=gcc&view=rev Log: PR target/55212 * config/sh/sh.md (untyped_call): Clobber the function value registers before the call. Modified: branches/sh-lra/gcc/config/sh/sh.md
newlib 1.2.0 now builds without errors here.
Created attachment 33615 [details] reduced CSiBE /libpng-1.2.5 test I've tried compiling CSiBE (-m4 -ml). This is a stripped down pngrutil.c which crashes in lra-spills.c (remove_pseudos). It's a bit strange, because if the function 'test' (top of the file) is compiled before the actual problematic function 'png_handle_cHRM', there's a segfault. If 'test' is left out, it ends in: internal compiler error: in copy_rtx, at rtl.c:356 0x862e4d2 copy_rtx(rtx_def*) ../../gcc-sh-lra/gcc/rtl.c:356 In the segfault case, the insn is: insn = (call_insn 617 775 618 42 (parallel [ (call (mem:SI (reg/f:SI 701) [0 png_set_cHRM S4 A32]) (const_int 0 [0])) (use (reg:PSI 151 )) (clobber (reg:SI 146 pr)) ]) sh_tmp.cpp:395 316 {calli} (expr_list:REG_DEAD (reg:SI 69 fr5) (expr_list:REG_DEAD (reg:SI 75 fr11) (expr_list:REG_DEAD (reg:SI 71 fr7) (expr_list:REG_DEAD (reg:SI 73 fr9) (expr_list:REG_DEAD (reg/f:SI 701) (expr_list:REG_DEAD (reg:SI 5 r5) (expr_list:REG_DEAD (reg:SI 4 r4) (expr_list:REG_DEAD (reg:DF 74 fr10) (expr_list:REG_DEAD (reg:DF 72 fr8) (expr_list:REG_DEAD (reg:DF 70 fr6) (expr_list:REG_DEAD (reg:DF 68 fr4) (expr_list:REG_CALL_DECL (symbol_ref:SI ("png_set_cHRM") [flags 0x41] <function_decl 0xb7407c80 png_set_cHRM>) (nil))))))))))))) (expr_list:SI (use (reg:SI 4 r4)) (expr_list:SI (use (reg:SI 5 r5)) (expr_list:DF (use (reg:DF 68 fr4)) (expr_list:DF (use (reg:DF 70 fr6)) (expr_list:DF (use (reg:DF 72 fr8)) (expr_list:DF (use (reg:DF 74 fr10)) (expr_list:DF (use (mem:DF (reg/f:SI 15 r15) [0 S8 A32])) (expr_list:DF (use (mem:DF (reg/f:SI 699) [0 S8 A32])) (expr_list:DF (use (mem:DF (reg/f:SI 696) [0 S8 A32])) (expr_list:DF (use (mem:DF (reg/f:SI 693) [0 S8 A32])) (nil)))))))))))) and the crash is caused by loc = (reg/f:SI 699) In the non-segfault case the input arg 'orig' in copy_rtx prints "(UnKnown Unknown)"
(In reply to Oleg Endo from comment #47) > Created attachment 33615 [details] > reduced CSiBE /libpng-1.2.5 test > > I've tried compiling CSiBE (-m4 -ml). This is a stripped down pngrutil.c > which crashes in lra-spills.c (remove_pseudos). > It's a bit strange, because if the function 'test' (top of the file) is > compiled before the actual problematic function 'png_handle_cHRM', there's a > segfault. The segfault happens because of this lookup (remove_pseudos): if ((hard_reg = spill_hard_reg[i]) != NULL_RTX) The array at i = 699 doesn't seem to contain anything valid. Function 'assign_spill_hard_regs' sets those values: spill_hard_reg[regno] = gen_raw_REG (PSEUDO_REGNO_MODE (regno), hard_regno); However, in this case it never gets to it because of this: if (! lra_reg_spill_p) return n;
(In reply to Oleg Endo from comment #48) > The array at i = 699 doesn't seem to contain anything valid. It looks that (expr_list:DF (use (mem:DF (reg/f:SI 699) [0 S8 A32])) which shows an argument of call_insn 617 via memory, was missed by LRA as a usage of pseudo 699. Here is my trial: --- gcc/lra.c.orig 2014-09-14 09:09:57.223724308 +0900 +++ gcc/lra.c 2014-09-30 17:15:21.709508021 +0900 @@ -1615,6 +1615,20 @@ lra_update_insn_regno_info (rtx_insn *in if ((code = GET_CODE (PATTERN (insn))) == CLOBBER || code == USE) add_regs_to_insn_regno_info (data, XEXP (PATTERN (insn), 0), uid, code == USE ? OP_IN : OP_OUT, false); + if (CALL_P (insn)) + { + rtx link; + + /* Take account of arguments via memory which could be implicit + usage of pseudo regs. */ + for (link = CALL_INSN_FUNCTION_USAGE (insn); + link != NULL_RTX; + link = XEXP (link, 1)) + if (GET_CODE (XEXP (link, 0)) == USE + && MEM_P (XEXP (XEXP (link, 0), 0))) + add_regs_to_insn_regno_info (data, XEXP (XEXP (link, 0), 0), uid, + OP_IN, false); + } if (NONDEBUG_INSN_P (insn)) setup_insn_reg_info (data, freq); }
Author: olegendo Date: Tue Sep 30 22:12:42 2014 New Revision: 215744 URL: https://gcc.gnu.org/viewcvs?rev=215744&root=gcc&view=rev Log: PR target/55212 * lra.c (lra_update_insn_regno_info): Handle mem args in calls. Modified: branches/sh-lra/gcc/lra.c
(In reply to Oleg Endo from comment #50) > Author: olegendo > Date: Tue Sep 30 22:12:42 2014 > New Revision: 215744 > > URL: https://gcc.gnu.org/viewcvs?rev=215744&root=gcc&view=rev > Log: > PR target/55212 > * lra.c (lra_update_insn_regno_info): Handle mem args in calls. > > Modified: > branches/sh-lra/gcc/lra.c This fixes the issue with attachment 33615 [details].
Created attachment 33632 [details] Reduced case of error: in assign_by_spills, at lra-assigns.c:1335 with -m4 -ml -O2
(In reply to Oleg Endo from comment #52) > Created attachment 33632 [details] > Reduced case of error: in assign_by_spills, at lra-assigns.c:1335 with -m4 > -ml -O2 .ira dump of the test case has the lines (insn 145 144 148 8 (set (reg/v:SI 245 [ qval ]) (mem:SI (plus:SI (reg/v/f:SI 345 [orig:213 divisors ] [213]) (reg:SI 349 [orig:259 ivtmp.27 ] [259])) [7 MEM[base: divisors_17, index: ivtmp.27_119, offset: 0B]+0 S4 A32])) jcdctmgr.i:87 258 {movsi_ie} (nil)) (insn 148 145 149 8 (set (reg/v:SI 246 [ temp ]) (mem:SI (plus:SI (reg:SI 349 [orig:259 ivtmp.27 ] [259]) (reg:SI 334)) [7 MEM[symbol: workspace, index: ivtmp.27_119, offset: 0B]+0 S4 A32])) jcdctmgr.i:88 258 {movsi_ie} (nil)) which correspond to the lines qval = divisors[i]; temp = workspace[i]; of the test case. r349 is the index variable i and r334 is the pointer variable workspace. LRA assigns R0_REGS to r349 at insn 145 because rtlanal.c:decompose_mem_address guesses r345 would be base and r349 index. Unfortunately, that function also guesses r349 would be base and r334 index at insn 148. LRA tries to assign R0_REGS to r334 according to this "guess" result. .reload dump says: Changing address in insn 148 r349:SI+r334:SI on equiv r349:SI+sfp:SI Creating newreg=384 from oldreg=153, assigning class R0_REGS to address r384 Creating newreg=385 from oldreg=349, assigning class R0_REGS to address r385 Thus regclass of r384 and r385 corrides at insn 148: r246:SI=[r385:SI+r384:SI]
Created attachment 33657 [details] A possible workaround The patch is trying to fix the result of decompose_mem_address so as not to assign INDEX_REG_CLASS to both base and index regs when INDEX_REG_CLASS has one register only.
Created attachment 33662 [details] revised patch The last patch should check if base rtx and index rtx of the address are registers.
There are some LRA changes coming ... https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00948.html
Created attachment 33686 [details] revised patch for the problem in c#25
(In reply to Kazumoto Kojima from comment #57) > revised patch for the problem in c#25 I've looked into the remained ICEs for usual sh4-unknown-linux-gnu test and got that they are similar to the ICE in c#25. The original insn is a move insn from/to a memory with base+index addressing which requires r0 for index reg on SH. The other operand of that insn has an equivalence with a memory of which load/store uses r0 for some reason. The tipical example is a memory of which addressing is base+index and base+a_small_display_constant which can fit to mov.[bwl] r0,@(disp,rm)/@(disp,rm),r0 instructions. The patch introduces a new tergetm fuction and hides equivalences for such cases. The default tergetm fuction simply do nothing and shouldn't impact the other targets. The patch might be a bit overkill for SH. We need to see the code quality. With c#29 + c#55 + c#57 patches, all ICEs went away on sh4-linux. compare_tests against trunk rev.215912 says -- New tests that FAIL: g++.dg/cpp1y/feat-cxx14.C -std=gnu++1y (test for excess errors) gcc.target/sh/pr50749-sf-postinc-1.c scan-assembler-times fmov.s\\t@r[0-9]+\\\\+,fr[0-9]+ 1 gcc.target/sh/pr50749-sf-postinc-3.c scan-assembler-times fmov.s\\t@r[0-9]+\\\\+,fr[0-9]+ 1 Old tests that failed, that have disappeared: (Eeek!) gcc.dg/atomic/c11-atomic-exec-4.c -Os (internal compiler error) gcc.dg/atomic/c11-atomic-exec-4.c -Os (test for excess errors) libgomp.fortran/udr14.f90 -O3 -g (internal compiler error) libgomp.fortran/udr14.f90 -O3 -g (test for excess errors) -- The first new FAIL looks a test glitch and it turned out that pr50749-sf-postinc-1.c FAILs can be easily fixed. It looks that the LRA change suggested in c#56 is related about the optimization based with equivalences. I'd like to try it on sh-lra.
Created attachment 33688 [details] reduced CSiBE teem-1.6.0 test (-O2 sh4-linux) Yet another lra-assigns.c:1335 ICE when compiling nrrd/kernel.c of CSiBE/ teem-1.6.0 test. It looks that RA fails for FPUL_REGS class.
(In reply to Kazumoto Kojima from comment #59) > Yet another lra-assigns.c:1335 ICE when compiling nrrd/kernel.c of CSiBE/ > teem-1.6.0 test. It looks that RA fails for FPUL_REGS class. It seems that the scenario starts at insn 58 in .reload. 58: {r199:DF=float_extend(r190:SF);use :PSI;} LRA inserts a reload insn 138 before it. 138: {r261:SF=r190:SF;use :PSI;clobber scratch;0;} 58: {r199:DF=float_extend(r261:SF);use :PSI;} During inheritance/split process, LRA inserts a move insn 151 after 138 to save r190. Add save<-reg after: 151: r280:SI=r190:SF#0 where LRA assigns the hard reg fr8 to r190 and #0 means subreg:SI here. We need fpul to save the SFmode value on fr8 to SImode object and insn 58 is actually fcnvds instruction of which destination is fpul. This causes the ICE. LRA chooses SImode at insn 151 because fr8 is a caller save register and HARD_REGNO_CALLER_SAVE_MODE returns SImode for fr8 even with SFmode. SH doesn't define its specific HARD_REGNO_CALLER_SAVE_MODE and uses the default choose_hard_reg_mode (regno, nregs, false) implementation. choose_hard_reg_mode chooses mode based on HARD_REGNO_MODE_OK and returns ?Imode for float registers when sh_hard_regno_mode_ok permits integer modes on them. I guess that HARD_REGNO_CALLER_SAVE_MODE should return SFmode in this case.
Created attachment 33692 [details] a possible patch I'm testing a patch to define a SH specific HARD_REGNO_CALLER_SAVE_MODE macro.
Just a note... I've briefly checked whether PR 54429 gets any better with LRA. It doesn't seem to be the case.
Created attachment 33705 [details] CSiBE result-size.cvs sh-lra+c#29+c#55+c#57+c#61
Created attachment 33706 [details] CSiBE result-size.cvs trunk rev.215912
We can see ~2% code size regression on average with CSiBE. There are some cases which register elimination doesn't work well. An example is int foo (int x) { int y[100]; bar (&y[4], x); return y[4]; } which is compiled to a lengthy code: foo: sts.l pr,@-r15 mov.w .L6,r1 mov r4,r5 mov.w .L3,r2 sub r1,r15 mov.l .L4,r0 mov r15,r3 mov.w .L5,r1 add #4,r3 add r3,r2 add r2,r1 mov.l r1,@r15 mov r1,r4 jsr @r0 add #16,r4 mov.l @r15,r1 mov.l @(16,r1),r0 mov.w .L6,r7 add r7,r15 lds.l @r15+,pr rts nop .align 1 .L6: .short 404 .L3: .short 400 .L5: .short -400 with -O2. It seems that SH's very limited addsi3 insn doesn't match with the LRA's register elimination. The canonical insn like (set reg_a (plus reg_b const_int)) are scaned at register elimination phase. On SH, it's expanded to the 2 insns (set reg_a const_int) and (set reg_a (plus reg_a reg_b)) when const_int doesn't fit in 8-bit. This makes the elimination process unhappy.
Created attachment 33707 [details] a possible patch With it, foo is compiled to foo: sts.l pr,@-r15 mov.w .L4,r1 mov r4,r5 mov.l .L3,r0 mov #16,r4 sub r1,r15 jsr @r0 add r15,r4 mov.l @(16,r15),r0 mov.w .L4,r7 add r7, r15 lds.l @r15+,pr rts nop .align 1 .L4: .short 400 and the code size regression for CSiBE is reduced to 0.74%.
Created attachment 33708 [details] CSiBE result-runtime.cvs sh-lra+c#29+c#55+c#57+c#61+c#66
Created attachment 33709 [details] CSiBE result-runtime.cvs trunk rev.215912 0.54% runtime regression.
The patch in c#57 disables memory equiv substitution for the memory with base+index and base+display addressing. static bool sh_cannot_substitute_equiv_p (rtx subst) { if (TARGET_SHMEDIA) return false; if (GET_CODE (subst) == SUBREG) subst = SUBREG_REG (subst); if (MEM_P (subst) && GET_CODE (XEXP (subst, 0)) == PLUS) return true; return false; } A bit surprisingly, disabling all memory equiv substitution wins CSiBE tests. With static bool sh_cannot_substitute_equiv_p (rtx subst) { if (TARGET_SHMEDIA) return false; return true; } the code size regression for CSiBE from non LRA is reduced to 0.59%. Looking at the improved cases, the extra save/restore insns to/from stack disappear. I guess that SH has not enough numbers of the hard registers to make the equiv substitution win in the size and the speed on average working sets. It looks the pseudos produced to hold the equiv values can't get hard registers for bad cases and end up memory save/restore insns which make the code worse.
I'd like to apply the revised patches below to sh-lra branch for looking at the problems easily. Oleg, is it OK for you? c#55: Fixup the result of decompose_mem_address when INDEX_REG_CLASS is a single register class. c#59: Introduce target hook which disables the substitution of pseudos with equivalent memory values. c#61: Define SH specific HARD_REGNO_CALLER_SAVE_MODE target macro. c#66: Make *addsi3_compact insn_and_split and add an alternative which will split to (set a c) and (set (plus a b)) after reload. I omit c#29 patch which touches lra-constraints.c and may impact on other targets. That patch would make it hard to compare with other targets. c#55 and c#59 also touch lra-constraints.c but those changes unlikely impact on other targets. Fortunately the test case c#28 doesn't ICE anymore without c#29.
(In reply to Kazumoto Kojima from comment #69) > the code size regression for CSiBE from non LRA is reduced to 0.59%. > Looking at the improved cases, the extra save/restore insns to/from > stack disappear. I guess that SH has not enough numbers of the hard > registers to make the equiv substitution win in the size and the speed > on average working sets. It looks the pseudos produced to hold > the equiv values can't get hard registers for bad cases and end up > memory save/restore insns which make the code worse. I don't know the details and maybe I'm totally off here ... LRA is being used for ARM and there are almost the same amount of GP registers available on ARM than on SH. So either on ARM nobody has checked for such regressions, or there's something else going wrong or missing on SH? Or is it maybe really that SH R0-ness thing that makes everything work (or not work) totally different?
(In reply to Kazumoto Kojima from comment #70) > I'd like to apply the revised patches below to sh-lra branch for > looking at the problems easily. Oleg, is it OK for you? Sure. However, maybe it's better to do a merge from trunk first. I know there might be some conflicts w.r.t. my recent changes on trunk, in particular for PR 53513. On the other hand, I've seen some LRA changes/fixes on trunk recently.
(In reply to Oleg Endo from comment #71) > I don't know the details and maybe I'm totally off here ... LRA is being > used for ARM and there are almost the same amount of GP registers available > on ARM than on SH. So either on ARM nobody has checked for such > regressions, or there's something else going wrong or missing on SH? Or is > it maybe really that SH R0-ness thing that makes everything work (or not > work) totally different? I'm not sure about ARM. The problematic cases I've looked at are high R0 pressure cases and IRA decided to allocate equiv value registers to memory as most profitable ones. It looks that the remained code size inflation came from R0-ness, very limited base+display addressing, only one index register and so on. I'll attach the test cases for them to this PR after merge from trunk and commit current patches.
(In reply to Kazumoto Kojima from comment #73) > I'm not sure about ARM. The problematic cases I've looked at are > high R0 pressure cases and IRA decided to allocate equiv value > registers to memory as most profitable ones. > It looks that the remained code size inflation came from R0-ness, > very limited base+display addressing, only one index register and > so on. OK, makes sense. > I'll attach the test cases for them to this PR after merge > from trunk and commit current patches. Thanks!
FYI, merge from trunk revision 216447 as r216529. I've fixed c#55, c#59, c#61 and c#66 so to match this merge and committed them on sh-lra as r216532, r216533, r216533 and r216535, respectively.
Created attachment 33787 [details] a reduced test case of SCiBE compiler/vam test compiler/vam is a test which is an example of code size regression from non LRA compiler. With -O2, non LRA compiler generates code like mov.l r0,@(56,r15) mov #64,r1 add #2,r0 mov r13,r2 add r15,r1 mov.l r0,@(60,r15) add #18,r2 mov r13,r14 add #2,r0 mov.l r2,@(0,r1) mov.l r0,@(4,r1) add #2,r2 add #2,r0 mov.l r2,@(8,r1) mov.l r0,@(12,r1) and the corresponding code generated by sh-lra compiler looks like mov.l r1,@(56,r15) mov #17,r1 add r2,r1 mov.l r1,@(60,r15) mov #2,r14 mov #16,r10 mov #18,r1 add r2,r1 add r2,r14 add r2,r10 mov #64,r2 add r15,r2 mov.l r1,@r2 mov #19,r1 mov #68,r2 add r13,r1 add r15,r2 mov.l r1,@r2 mov #20,r1 mov #72,r2 add r13,r1 add r15,r2 mov.l r1,@r2 mov #21,r1 mov #76,r2 add r13,r1 add r15,r2 mov.l r1,@r2 which shows that base+disp addressing mode wasn't used for the insn (set reg (mem (plus sp N))) when N >= 64. It looks that non LRA compiler uses targetm.legitimize_address to handle that case and splits it to (set new_reg (plus sp M)) and (set reg (mem (plus new_reg N-M))) so as to N-M is a legitimate display constant for base+disp addressing.
Created attachment 33788 [details] another reduced test case of compiler/vam This is an another test case got from compiler/vam test. The generated code for the second foo call with non LRA compiler at -O2 looks like mov.b @(7,r8),r0 extu.b r0,r7 mov.b @(6,r8),r0 extu.b r0,r6 mov.b @(5,r8),r0 extu.b r0,r5 mov.b @(4,r8),r0 jsr @r9 extu.b r0,r4 and the corresponding code with sh-lra is here mov.b @(7,r8),r0 mov.b r0,@(1,r15) mov.b @(6,r8),r0 mov.b r0,@(2,r15) mov.b @(5,r8),r0 mov.b r0,@(3,r15) mov.b @(4,r8),r0 mov.b r0,@r15 mov.b @(1,r15),r0 extu.b r0,r7 mov.b @(2,r15),r0 extu.b r0,r6 mov.b @(3,r15),r0 extu.b r0,r5 mov.b @r15,r0 jsr @r9 extu.b r0,r4
Created attachment 33813 [details] a trial patch for the issue c#76 With it, the generated code for c#76 test case looks similar with the one with non LRA compiler. The total code size regression for CSiBE is reduced to 0.43%. The patch introduces a new targetm function legitimize_address_display. I've tested it on arm thumb with adding a similar new targetm function to arm.c. There is only one CSiBE test of which code size is changed from 400 to 396 at -Os and there are no size changes at -O2 on thumb. Although arm thumb is better than SH about those base+display addressing because of 8-bit display for stack pointer register and no R0-ness, it's a bit surprising to me.
(In reply to Kazumoto Kojima from comment #78) > Created attachment 33813 [details] > a trial patch for the issue c#76 > > With it, the generated code for c#76 test case looks similar with > the one with non LRA compiler. The total code size regression for > CSiBE is reduced to 0.43%. > The patch introduces a new targetm function legitimize_address_display. Hm, maybe it's better to name this "legitimize_address_displacement" or "legitimize_address_offset"?
(In reply to Oleg Endo from comment #79) > Hm, maybe it's better to name this "legitimize_address_displacement" or > "legitimize_address_offset"? Sure. I'm not sure whether that targetm function is a good idea or not in the first place, though.
(In reply to Kazumoto Kojima from comment #80) > (In reply to Oleg Endo from comment #79) > > Hm, maybe it's better to name this "legitimize_address_displacement" or > > "legitimize_address_offset"? > > Sure. I'm not sure whether that targetm function is a good idea > or not in the first place, though. I haven't studied the LRA code, but I guess there must be some logic that replaces the target hook function sh_legitimize_reload_address. Either it's using sh_legitimize_address and interpreting/expecting something different from what it gets, or it's doing it's own "guessing" of how to legitimize the address. My opinion is that it's better to have an explicit transformation in the back end (i.e. something like legitimize_address_offset) rather than guess-and-check-if-valid. But that's just my personal opinion.
(In reply to Kazumoto Kojima from comment #77) > Created attachment 33788 [details] > another reduced test case of compiler/vam It seems that unsigned char memory accesses make this bad code with LRA. We have no (zero_extend:SI (reg) (mem)) instruction for QI/HImode. When displacement addressing is used, (set rA (mem (plus rX disp))) and a zero extension between registers are generated for it. The combine phase may combine that extension and another move insn and deletes the original extension: (set rA (mem (plus rX disp))) (set rA (mem (plus rX disp))) (zero_extend:SI rB rA) deleted insn ... => (set rC rB) (zero_extend:SI rC rA) RA will assign r0 to rA for QI/HImode and make a long live range for R0 unfortunately. This may cause anomalous register spills in some case with LRA. I guess that the old reload has something to handle such exceptional case. I think that the long live R0 is problematic in the first place, though. The above combine doesn't happen if rA is a hard reg of which reg class is likely spilled to avoid such problem. We can split (set rA (mem (plus rX disp))) to two move insns via r0 which is a likely spilled register. It will make some codes worse but it wins on average. CSiBE shows 0.036% total code size improvement when the above split is done in prepare_move_operands for load only and 0.049% when done for both load and store. I'll attach the patch for it. With it, the generated code for the 2nd call of the test case looks like mov.b @(7,r8),r0 mov r0,r7 mov.b @(6,r8),r0 extu.b r7,r7 mov r0,r6 mov.b @(5,r8),r0 extu.b r6,r6 mov r0,r5 mov.b @(4,r8),r0 extu.b r5,r5 jsr @r9 extu.b r0,r4 Spills to memory went away, though it's still worse than non LRA case. Before that patch, I've tried several splitters for zero_extend of which operands are register and memory with displacement addressing. They gave similar codes against the test case but can't win CSiBE on average. Even with some peepholes to fix regressions, the total code size of CSiBE increases 0.05-0.1%.
Created attachment 33992 [details] a patch for the issue c#77 Interestingly, this reduces the total text size of CSiBE test ~0.04% at -O2 even for the trunk i.e. with the old reload.
FYI, merge from trunk revision 217978 as sh-lra revision 217980 to apply the lra remat changes on trunk.
Created attachment 34135 [details] patch to add -mlra option I'd like to apply the patch to add a transitional option -mlra like ARM and the revised patches in c#78 and c#83 on sh-lra branch after merge from trunk. It helps me a lot when building, comparing and testing sh-lra compilers.
(In reply to Kazumoto Kojima from comment #85) > Created attachment 34135 [details] > patch to add -mlra option > > I'd like to apply the patch to add a transitional option -mlra > like ARM and the revised patches in c#78 and c#83 on sh-lra branch > after merge from trunk. It helps me a lot when building, comparing > and testing sh-lra compilers. Sure, makes sense to me. It's probably better to revert my commit r215244 to reinstate reload on the sh-lra branch.
(In reply to Oleg Endo from comment #86) > (In reply to Kazumoto Kojima from comment #85) > > Created attachment 34135 [details] > > patch to add -mlra option > > > > I'd like to apply the patch to add a transitional option -mlra > > like ARM and the revised patches in c#78 and c#83 on sh-lra branch > > after merge from trunk. It helps me a lot when building, comparing > > and testing sh-lra compilers. > > Sure, makes sense to me. It's probably better to revert my commit r215244 > to reinstate reload on the sh-lra branch. Sorry, your patch in attachment 34135 [details] does that already of course. We could also try to merge the sh-lra branch into trunk for the gcc 5 release. If all the LRA changes are conditional and the code paths are not changed if -mlra is not specified, it should be OK to do it. Some of the changes could probably also be applied unconditionally, such as r215241 and r215243.
For the record, here is the sh-lra revisions. 218191: Merge from trunk revision 218173. 218192: Add legitimize_address_displacement target macto. 218193: Split QI/HImode displacement addressing load/store via R0. Use std::swap in the sh-lra change. 218194: Add -mlra option. BTW, there are new ICEs with internal compiler error: in split_reg, at lra-constraints.c:4909 It looks that DCmode move causes them on sh-lra.
Created attachment 34159 [details] a reduced c++ test case (-O2 -std=gnu++11) Here is related lines of lra-constraints.c: 4906 save = emit_spill_move (true, new_reg, original_reg); 4907 if (NEXT_INSN (save) != NULL_RTX) 4908 { 4909 lra_assert (! call_save_p); where (gdb) p call_save_p $2 = true (gdb) call debug_rtx(new_reg) (reg:DC 270) (gdb) call debug_rtx(original_reg) (reg:DC 220 [+8 ]) and emit_spill_move emits 3 instructions (insn 106 0 107 (clobber (reg:DC 270)) -1 (nil)) (insn 107 106 108 (parallel [ (set (subreg:DF (reg:DC 270) 0) (subreg:DF (reg:DC 220 [+8 ]) 0)) (use (reg:SI 154 fpscr0)) (clobber (scratch:SI)) ]) -1 (nil)) (insn 108 107 0 (parallel [ (set (subreg:DF (reg:DC 270) 8) (subreg:DF (reg:DC 220 [+8 ]) 8)) (use (reg:SI 154 fpscr0)) (clobber (scratch:SI)) ]) -1 (nil)) with emit_move_complex function in expr.c.
For the record, the patches to enable LRA on SH have been committed. The last revision of the committed patch series is r218892.
Re: [RFC PATCH 9/9] [SH] Split QI/HImode load/store via r0 (In reply to Kazumoto Kojima from comment #83) > Created attachment 33992 [details] > a patch for the issue c#77 > > Interestingly, this reduces the total text size of CSiBE test ~0.04% > at -O2 even for the trunk i.e. with the old reload. I've checked this change to prepare_move_operands without LRA with trunk r218988, to see whether it should be enabled for non-LRA. I can confirm the -1140 bytes / -0.04% on the CSiBE set. However, as mentioned in comment #82, it results in unnecessary zero extensions before other logic/arithmetic because combine doesn't (want to) see through the R0 hardreg. Unnecessary sign/zero extensions are actually a separate topic (e.g. PR 53987). If there was a good sign/zero extension elimination in place, this wouldn't be an issue here. I've tried disabling the prepare_move_operands change and instead adding the following splitters, which are done after combine and before RA: (define_split [(set (match_operand:SI 0 "arith_reg_dest") (sign_extend:SI (match_operand:QIHI 1 "displacement_mem_operand")))] "TARGET_SH1 && can_create_pseudo_p () && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)" [(set (match_dup 2) (reg:SI R0_REG)) (set (reg:SI R0_REG) (sign_extend:SI (match_dup 1))) (set (match_dup 0) (reg:SI R0_REG)) (set (reg:SI R0_REG) (match_dup 2))] { operands[2] = gen_reg_rtx (SImode); }) (define_split [(set (match_operand:QIHI 0 "arith_reg_dest") (match_operand:QIHI 1 "displacement_mem_operand"))] "TARGET_SH1 && can_create_pseudo_p () && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)" [(set (match_dup 2) (reg:SI R0_REG)) (set (reg:QIHI R0_REG) (match_dup 1)) (set (match_dup 0) (reg:QIHI R0_REG)) (set (reg:SI R0_REG) (match_dup 2))] { operands[2] = gen_reg_rtx (SImode); }) With these two splitters for mem loads I get exactly the same -1140 bytes / -0.04% on the CSiBE set. The simple test case int test_tst (unsigned char* x, int y, int z) { return x[4] ? y : z; } does not contain the extu.b insn anymore, but instead we get this: mov.b @(4,r4),r0 mov r0,r1 tst r1,r1 << should be: tst r0,r0 bf .L4 mov r6,r5 .L4: rts mov r5,r0 Other cases of new unnessecary zero-extension insns are in e.g. in jpeg-6b/jdcoefct.s. In linux-2.4.23-pre3-testplatform/arch/testplatform/kernel/signal.s some mov reg,reg insns end up as: extu.b r0,r0 mov.b r0,@(1,r8) mov r9,r0 shlr16 r0 extu.b r0,r0 mov.b r0,@(2,r8) mov r9,r0 shlr16 r0 shlr8 r0 mov.b r0,@(3,r8) extu.b r1,r0 mov.b r0,@(4,r8) mov r1,r0 .... those can be wallpapered with peepholes though. I've also tried using the splitters instead of the prepare_move_operands hunk with LRA. But then I get spill errors on QI/HImode stores with displacement addressing. I've also tried removing the prepare_move_operands hunk with LRA. Compared with trunk no-lra I get: sum: 3368431 -> 3378515 +10084 / +0.299368 % And LRA + prepare_move_operands hunk vs. trunk no-lra is: sum: 3368431 -> 3376507 +8076 / +0.239756 % Doing this kind of pre-RA R0 pre-allocation seems to result in better RA choices w.r.t. commutative insns such as addition. After all, maybe it's worth trying out an SH specific R0 pre-alloc pass that offloads some of the RA choices. Of course it will not be able to solve issues that arise when spilling code is generated that uses QI/HImode mem accesses during the RA/spilling process. R0 is the most difficult candidate, but I've also seen reports about FPU code ICEing due FR0 spill failures when there are lots of (interdependent?) FMAC insns at -O3 (e.g. FP matrix multiplication). Another register (class), of which there is only one on SH, would be the MACH/MACL pair. Currently MACH/MACL are fixed hardregs. Early experiments to allow MACH/MACL to be used by RA and adding the MAC insn showed some problems (see PR 53949 #c3).
Author: olegendo Date: Sun Dec 21 23:37:42 2014 New Revision: 218999 URL: https://gcc.gnu.org/viewcvs?rev=218999&root=gcc&view=rev Log: gcc/ PR target/55212 * config/sh/sh.md (*addsi3_compact): Add parentheses around && condition. Add comments. Modified: trunk/gcc/ChangeLog trunk/gcc/config/sh/sh.md
Author: olegendo Date: Thu Jan 8 11:07:45 2015 New Revision: 219341 URL: https://gcc.gnu.org/viewcvs?rev=219341&root=gcc&view=rev Log: gcc/ PR target/55212 * config/sh/sh.md (*addsi3_compact): Emit reg-reg copy instead of constant load if constant operand fits into I08. Modified: trunk/gcc/ChangeLog trunk/gcc/config/sh/sh.md
Kaz, do you think we can enable LRA by default for GCC 6? Some early results from the AMS optimization show new R0 spill failures. For example, AMS uses @(R0,Rn) address modes more often. Although I still would like to try this R0 prealloc pass, I don't know when I will have time for that. Enabling LRA might be easier.
(In reply to Oleg Endo from comment #94) > Kaz, do you think we can enable LRA by default for GCC 6? I think that it's OK to make LRA default on trunk, even if it's better with your R0 pre-allocating pass. The last time I tested -mlra, there are some regressions, though. I'd like to test it again. Could you see any serious problems on sh-elf with -mlra?
(In reply to Kazumoto Kojima from comment #95) > > I think that it's OK to make LRA default on trunk, even if it's > better with your R0 pre-allocating pass. The R0 pre-alloc pass could be a further improvement, even with LRA. > The last time I tested -mlra, there are some regressions, though. > I'd like to test it again. Could you see any serious problems on > sh-elf with -mlra? I haven't tried yet. I will do a full toolchain rebuild and test with -mlra enabled by default.
(In reply to Oleg Endo from comment #96) > > I haven't tried yet. I will do a full toolchain rebuild and test with -mlra > enabled by default. I've compared the results of r227512 without LRA and r227682 with LRA. Below are the new failures. Running target sh-sim/-m2/-mb FAIL: gcc.c-torture/execute/20040709-2.c -Os execution test FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36 Running target sh-sim/-m2/-ml FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36 Running target sh-sim/-m2a/-mb FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36 Running target sh-sim/-m4/-mb FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2 AIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36 Running target sh-sim/-m4/-ml FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36 Running target sh-sim/-m4a/-mb FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36 Running target sh-sim/-m4a/-ml FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28 FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36
(In reply to Oleg Endo from comment #97) > I've compared the results of r227512 without LRA and r227682 with LRA. > Below are the new failures. A typical example of pr64661-x.c regressions is: [LRA] test_37_1: stc gbr,r3 ! 22 store_gbr [length = 2] mova 1f,r0 ! 6 atomic_add_fetchqi_soft_tcb [length = 20] mov #(0f-1f),r1 .align 2 mov.l r0,@(128,gbr) 0: mov.b @(4,r3),r0 ... [no-LRA] test_37_1: mova 1f,r0 ! 6 atomic_add_fetchqi_soft_tcb [length = 20] mov #(0f-1f),r1 .align 2 mov.l r0,@(128,gbr) 0: mov.b @(4,gbr),r0 ... where (define_insn_and_split "atomic_fetch_<fetchop_name><mode>_soft_tcb" [(set (match_operand:QIHISI 0 "arith_reg_dest" "=&r") (match_operand:QIHISI 1 "atomic_mem_operand_1" "=SraSdd")) ... (clobber (reg:SI R0_REG)) (clobber (reg:SI R1_REG))] "TARGET_ATOMIC_SOFT_TCB" { return "\r mova 1f,r0" "\n" " .align 2" "\n" " mov #(0f-1f),r1" "\n" " mov.l r0,@(%O3,gbr)" "\n" "0: mov.<bwl> %1,r0" "\n" ... The .ira dump shows (insn 6 5 7 2 (parallel [ (set (reg:QI 167) (plus:QI (mem/v:QI (plus:SI (reg:SI 144 gbr) (const_int 4 [0x4])) [-1 S1 A8]) (const_int 1 [0x1]))) (set (mem/v:QI (plus:SI (reg:SI 144 gbr) (const_int 4 [0x4])) [-1 S1 A8]) (unspec:QI [ (plus:QI (mem/v:QI (plus:SI (reg:SI 144 gbr) (const_int 4 [0x4])) [-1 S1 A8]) (const_int 1 [0x1])) ] UNSPEC_ATOMIC)) ... which looks the code generated with the old RA. It seems that old RA passes gbr+4 addressing as is but LRA spills gbr with r3 according to the constraint SraSdd. Perhaps LRA is more strict in this regard. I guess that these pr64661-x.c regressions are not so problem, though. I'm not sure whether hiconst.c regression is serious or not. The serious one would be Running target sh-sim/-m2/-mb FAIL: gcc.c-torture/execute/20040709-2.c -Os execution test My sh4-unknown-linux-gnu test also shows another execution errors for libstdc++-v3. New tests that FAIL: 22_locale/money_get/get/wchar_t/14.cc execution test 22_locale/money_get/get/wchar_t/19.cc execution test 22_locale/money_get/get/wchar_t/22131.cc execution test 22_locale/money_get/get/wchar_t/38399.cc execution test 22_locale/money_get/get/wchar_t/39168.cc execution test 22_locale/money_get/get/wchar_t/6.cc execution test 22_locale/money_get/get/wchar_t/8.cc execution test 22_locale/money_get/get/wchar_t/9.cc execution test 22_locale/money_put/put/wchar_t/39168.cc execution test 22_locale/money_put/put/wchar_t/5.cc execution test 22_locale/money_put/put/wchar_t/6.cc execution test These tests compiled with -mlra don't fail with libstdc++-v3 library compiled with -mno-lra, i.e. libstdc++-v3 is miscompiled with -mlra. These wrong code problems should be resolved before witch to LRA.
(In reply to Kazumoto Kojima from comment #98) > I guess that these pr64661-x.c regressions are not so problem, though. I agree. > I'm not sure whether hiconst.c regression is serious or not. I think we can ignore this for now. It's about loading/sharing constants. There are more constant related inefficiencies which can be addressed later. > The serious one would be > > Running target sh-sim/-m2/-mb > FAIL: gcc.c-torture/execute/20040709-2.c -Os execution test I will have a look at it. > My sh4-unknown-linux-gnu test also shows another execution > errors for libstdc++-v3. > > New tests that FAIL: > > 22_locale/money_get/get/wchar_t/14.cc execution test > 22_locale/money_get/get/wchar_t/19.cc execution test > 22_locale/money_get/get/wchar_t/22131.cc execution test > 22_locale/money_get/get/wchar_t/38399.cc execution test > 22_locale/money_get/get/wchar_t/39168.cc execution test > 22_locale/money_get/get/wchar_t/6.cc execution test > 22_locale/money_get/get/wchar_t/8.cc execution test > 22_locale/money_get/get/wchar_t/9.cc execution test > 22_locale/money_put/put/wchar_t/39168.cc execution test > 22_locale/money_put/put/wchar_t/5.cc execution test > 22_locale/money_put/put/wchar_t/6.cc execution test > > These tests compiled with -mlra don't fail with libstdc++-v3 > library compiled with -mno-lra, i.e. libstdc++-v3 is miscompiled > with -mlra. These wrong code problems should be resolved before > witch to LRA. Hm .. those don't fail on sh-elf ... maybe something related to the atomics? Atomics are off by default for sh-elf, but on for sh-linux.
(In reply to Oleg Endo from comment #99) > Hm .. those don't fail on sh-elf ... maybe something related to the atomics? > Atomics are off by default for sh-elf, but on for sh-linux. Maybe. It could be something atomic related. For 14.cc case, replacing locale.o and cxx11-shim_facets.o with ones from non-LRA build fixes the failure. For example, std::locale::locale(std::locale const&) is compiled very differently in LRA/non-LRA builds. Strangely, the problem can't be reproduced with recompiling these objects with -mlra in non-LRA builds. The above constructor is compiled to [LRA build] 0: c6 2f mov.l r12,@-r15 2: 0b c7 mova 30 <_ZNSt6localeC1ERKS_+0x30>,r0 4: 0a dc mov.l 30 <_ZNSt6localeC1ERKS_+0x30>,r12 ! 0 <_ZNSt6localeC1ERKS_> 6: 0c 3c add r0,r12 8: 52 62 mov.l @r5,r2 a: 0a d0 mov.l 34 <_ZNSt6localeC1ERKS_+0x34>,r0 ! 0 <_ZNSt6localeC1ERKS_> c: ce 01 mov.l @(r0,r12),r1 e: 18 21 tst r1,r1 10: 09 8d bt.s 26 <_ZNSt6localeC1ERKS_+0x26> 12: 22 24 mov.l r2,@r4 14: 02 c7 mova 20 <_ZNSt6localeC1ERKS_+0x20>,r0 16: f3 61 mov r15,r1 18: fa ef mov #-6,r15 1a: 22 63 mov.l @r2,r3 1c: 01 73 add #1,r3 1e: 32 22 mov.l r3,@r2 20: 13 6f mov r1,r15 22: 0b 00 rts 24: f6 6c mov.l @r15+,r12 26: 22 61 mov.l @r2,r1 28: 01 71 add #1,r1 2a: 12 22 mov.l r1,@r2 2c: 0b 00 rts 2e: f6 6c mov.l @r15+,r12 ... 30: R_SH_GOTPC _GLOBAL_OFFSET_TABLE_ 34: R_SH_GOT32 __pthread_key_create [non-LRA build] 0: c6 2f mov.l r12,@-r15 2: 0b c7 mova 30 <_ZNSt6localeC1ERKS_+0x30>,r0 4: 0a dc mov.l 30 <_ZNSt6localeC1ERKS_+0x30>,r12 ! 0 <_ZNSt6localeC1ERKS_> 6: 22 4f sts.l pr,@-r15 8: 0c 3c add r0,r12 a: 52 61 mov.l @r5,r1 c: 09 d0 mov.l 34 <_ZNSt6localeC1ERKS_+0x34>,r0 ! 0 <_ZNSt6localeC1ERKS_> e: ce 02 mov.l @(r0,r12),r2 10: 28 22 tst r2,r2 12: 07 8d bt.s 24 <_ZNSt6localeC1ERKS_+0x24> 14: 12 24 mov.l r1,@r4 16: 13 64 mov r1,r4 18: 07 d1 mov.l 38 <_ZNSt6localeC1ERKS_+0x38>,r1 ! 1a 1a: 03 01 bsrf r1 1c: 01 e5 mov #1,r5 1e: 26 4f lds.l @r15+,pr 20: 0b 00 rts 22: f6 6c mov.l @r15+,r12 24: 12 62 mov.l @r1,r2 26: 01 72 add #1,r2 28: 22 21 mov.l r2,@r1 2a: 26 4f lds.l @r15+,pr 2c: 0b 00 rts 2e: f6 6c mov.l @r15+,r12 ... 30: R_SH_GOTPC _GLOBAL_OFFSET_TABLE_ 34: R_SH_GOT32 __pthread_key_create 38: 1a 00 sts macl,r0 38: R_SH_PLT32 _ZN9__gnu_cxx12__atomic_addEPVii ... Perhaps the both codes are ok, though something odd happens on atomic things anyway.
> Perhaps the both codes are ok, though something odd happens > on atomic things anyway. I've forgotten that some atomic builtins fail with spill_failure in class 'R0_REGS' for old RA. So libstdc++-v3 was configured so to undefine _GLIBCXX_ATOMIC_BUILTINS for old RA and we see the above difference. I've tried to disable it on LRA build and rebuild libstdc++-v3 libs. 14.exe still fails and it looks very sensitive with code/option changes. Weird.
(In reply to Kazumoto Kojima from comment #101) > > I've forgotten that some atomic builtins fail with spill_failure > in class 'R0_REGS' for old RA. So libstdc++-v3 was configured > so to undefine _GLIBCXX_ATOMIC_BUILTINS for old RA and we see > the above difference. Ah! I always wondered why those failures disappeared from your nightly tests. Did you do this locally in your setup, or has this been committed at some time?
(In reply to Oleg Endo from comment #102) > Ah! I always wondered why those failures disappeared from your nightly > tests. Did you do this locally in your setup, or has this been committed at > some time? I did nothing. Looks simply fragile like as other reload ICEs. libstdc++-v3 conftest catches it, (un)fortunately.
(In reply to Kazumoto Kojima from comment #103) > I did nothing. Looks simply fragile like as other reload ICEs. > libstdc++-v3 conftest catches it, (un)fortunately. Ah, I understand. Thanks.
Created attachment 36328 [details] 20040709-2.s without LRA (PASS)
Created attachment 36329 [details] 20040709-2.s with LRA (FAIL)
> > The serious one would be > > > > Running target sh-sim/-m2/-mb > > FAIL: gcc.c-torture/execute/20040709-2.c -Os execution test > I have tried compiling that test case outside of the testsuite with -m2 -mb -Os with and without LRA. There are lots of differences in the asm output, but the LRA version executable runs fine and doesn't abort. Running the test inside the testsuite fails reliably. I could reduce it to the testZ (); subtest. It passes with with make -k check RUNTESTFLAGS="execute.exp=20040709-2.c --target_board=sh-sim\{-m2/-mb/-save-temps/-mno-lra}" It fails with (-mlra enabled by default) make -k check RUNTESTFLAGS="execute.exp=20040709-2.c --target_board=sh-sim\{-m2/-mb/-save-temps}"
(In reply to Oleg Endo from comment #107) The problem is the define_split at sh.md line 893 (the part at 945). Or actually, it's sh_find_set_of_reg. Adding this diff --git a/gcc/config/sh/sh-protos.h b/gcc/config/sh/sh-protos.h index 3e4211b..b4866c1 100644 --- a/gcc/config/sh/sh-protos.h +++ b/gcc/config/sh/sh-protos.h @@ -199,8 +199,13 @@ sh_find_set_of_reg (rtx reg, rtx_insn* insn, F stepfunc, { if (BARRIER_P (result.insn)) break; + + if (CALL_P (result.insn)) + break; + if (!NONJUMP_INSN_P (result.insn)) continue; + if (reg_set_p (reg, result.insn)) { result.set_rtx = set_of (reg, result.insn); Fixes the test case for me. It seems I should really fix PR 67061, which looks like the same issue (and the other issue with sh_find_set_of_reg which was fixed with a modified_between_p).
(In reply to Oleg Endo from comment #108) > > It seems I should really fix PR 67061, which > looks like the same issue (and the other issue with sh_find_set_of_reg which > was fixed with a modified_between_p). I have posted a patch for PR 67061 attachment 36331 [details] With that patch make -k check RUNTESTFLAGS="execute.exp=20040709-2.c --target_board=sh-sim\{-m2/-mb/-mlra}" passes the tests. Maybe it also fixes the std::locale issues. Kaz, could you please have a look?
(In reply to Oleg Endo from comment #109) > Maybe it also fixes the std::locale issues. Kaz, could > you please have a look? It doesn't fix those failures. I'll try to bisect that issue.
(In reply to Kazumoto Kojima from comment #110) It seems that those failures are the latent wrong code problem triggered with -mlra accidentally. I've filed it as PR67573.
I had a quick look at the gcc.target/sh/hiconst.c test case: int rab (char *pt, int *pti) { pt[2] = 0; pti[3] = 0; return 0; } without LRA: mov #0,r0 mov.b r0,@(2,r4) rts mov.l r0,@(12,r5) with LRA: mov #0,r0 mov.b r0,@(2,r4) mov #0,r1 mov.l r1,@(12,r5) rts mov #0,r0 Without LRA it seems some of the constant loads are CSE'ed, but the return value constant load remains and reload-something figures it out. With LRA it seems that for the SImode store it wants to use R1, but for the QImode store it needs to use R0. It then rematerializes the constant load. Increasing the costs of constants: @@ -3574,7 +3574,7 @@ return true; } if (CONST_OK_FOR_I08 (INTVAL (x))) - *total = 0; + *total = 4; else if ((outer_code == AND || outer_code == IOR || outer_code == XOR) && CONST_OK_FOR_K08 (INTVAL (x))) *total = 1; results in: mov #0,r1 mov r1,r0 mov.b r0,@(2,r4) mov.l r1,@(12,r5) rts mov r1,r0 That would make the hiconst test pass again of course, but is not much better. Probably a combination of constant optimization (PR 65069, PR 63390, PR 51708, PR 54682) and an R0 pre-alloc would give optimal results.
Created attachment 36363 [details] CSiBE I08 const cost = 0 vs. cost = 1, no LRA
Created attachment 36364 [details] CSiBE I08 const cost = 0 vs. cost = 1, LRA
It seems that it's already enough to set the cost for I08 from 0 to 1: Index: gcc/config/sh/sh.c =================================================================== --- gcc/config/sh/sh.c (revision 227958) +++ gcc/config/sh/sh.c (working copy) @@ -3574,7 +3574,7 @@ return true; } if (CONST_OK_FOR_I08 (INTVAL (x))) - *total = 0; + *total = 1; else if ((outer_code == AND || outer_code == IOR || outer_code == XOR) && CONST_OK_FOR_K08 (INTVAL (x))) *total = 1; This will reduce the rematerializations and thus increase reg live ranges in other places. attachment 36363 [details] and attachment 36364 [details] are CSiBE comparistions with and without LRA. Without LRA: sum: 3347407 -> 3345667 -1740 / -0.051981 % With LRA: sum: 3348933 -> 3347433 -1500 / -0.044790 % Although there are quite some increases, overall it seems beneficial...
(In reply to Oleg Endo from comment #91) > > I can confirm the -1140 bytes / -0.04% on the CSiBE set. Since r228176 this is no longer true. Now the gap of no-LRA and LRA is a bit wider: 3345527 -> 3334351 -11176 / -0.334058 % See also PR 67732. I think now the difference is too big to enable LRA by default for GCC 6. We should wait until the number gets better again.
(In reply to Oleg Endo from comment #94) > Kaz, do you think we can enable LRA by default for GCC 6? > > Some early results from the AMS optimization show new R0 spill failures. > For example, AMS uses @(R0,Rn) address modes more often. Although I still > would like to try this R0 prealloc pass, I don't know when I will have time > for that. Enabling LRA might be easier. So I've tried compiling the CSiBE set with AMS and LRA. The AMS branch still has some problems, but without LRA CSiBE at least compiles without errors. With LRA I get a lot of: ll_rw_blk.i: In function 'generic_plug_device': ll_rw_blk.i:8464:1: error: unable to find a register to spill } ^ ll_rw_blk.i:8464:1: error: this is the insn: (insn 32 74 61 5 (set (mem/j:QI (plus:SI (reg/v/f:SI 170 [ q ]) (reg:SI 283 [272])) [0 q_2(D)->plugged+0 S1 A32]) (reg:QI 0 r0)) ll_rw_blk.i:8462 276 {*movqi} (expr_list:REG_DEAD (reg:SI 283 [272]) (expr_list:REG_DEAD (reg:QI 0 r0) (nil)))) ll_rw_blk.i:8464:1: internal compiler error: in assign_by_spills, at lra-assigns.c:1431 It seems it has trouble rearranging this: mov.b r0,@(Rm,Rn) into mov.b Rx,@(R0,Rn)
Is there anything that currently speaks against switching to LRA by default now? I think, it would be a good idea for gcc-11 or even gcc-10 if possible. I will report all issues I'm running into since I am constantly monitoring package builds on Debian/sh4.
(In reply to John Paul Adrian Glaubitz from comment #118) > Is there anything that currently speaks against switching to LRA by default > now? There were a couple of code quality issues, which I would need to dig up and re-check if it's still an issue. > > I think, it would be a good idea for gcc-11 or even gcc-10 if possible. I > will report all issues I'm running into since I am constantly monitoring > package builds on Debian/sh4. Last time this issue came up, I asked you to re-build all debian with -mlra enabled https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00977.html I'm still waiting for the results. I've been telling you to turn on LRA for specific cases, but that doesn't mean it's not gonna have regressions. So please try using -mlra for *all* packages, including kernel and let the system boostrap. Please report your findings in this PR.
(In reply to Oleg Endo from comment #119) > Last time this issue came up, I asked you to re-build all debian with -mlra > enabled > > https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00977.html > > I'm still waiting for the results. > > I've been telling you to turn on LRA for specific cases, but that doesn't > mean it's not gonna have regressions. So please try using -mlra for *all* > packages, including kernel and let the system boostrap. That's a huge task which is why I prefer fixing issues on the fly. We are going to switch over anyway with GCC-12 or GCC-13, so I'm not sure what we are gaining if we continue to wait.
(In reply to John Paul Adrian Glaubitz from comment #120) > > That's a huge task which is why I prefer fixing issues on the fly. I thought this is almost fully automated? You can apply this patch to GCC to enable LRA by default for SH --- sh.opt.orig 2019-03-04 10:09:09.244521000 +0900 +++ sh.opt 2020-02-26 10:19:55.414340269 +0900 @@ -299,5 +299,5 @@ Enable the use of the fsrra instruction. mlra -Target Report Var(sh_lra_flag) Init(0) Save +Target Report Var(sh_lra_flag) Init(1) Save Use LRA instead of reload (transitional). Then let it rebuild everything. > > We are going to switch over anyway with GCC-12 or GCC-13, so I'm not sure > what we are gaining if we continue to wait. I don't get it. You want to enable LRA by default for your distro, but you don't want to rebuild all the packages with that modification? It's like .. shipping a distro with a new compiler, which potentially can't compile the distro packages (correctly), so instead we secretly use an older compiler to build the packages .... ? Is that normal practice? Sorry, sounds like a mess to me.
(In reply to Oleg Endo from comment #121) > (In reply to John Paul Adrian Glaubitz from comment #120) > > > > That's a huge task which is why I prefer fixing issues on the fly. > > I thought this is almost fully automated? The build process is. Fixing broken packages isn't. > You can apply this patch to GCC to enable LRA by default for SH > > --- sh.opt.orig 2019-03-04 10:09:09.244521000 +0900 > +++ sh.opt 2020-02-26 10:19:55.414340269 +0900 > @@ -299,5 +299,5 @@ > Enable the use of the fsrra instruction. > > mlra > -Target Report Var(sh_lra_flag) Init(0) Save > +Target Report Var(sh_lra_flag) Init(1) Save > Use LRA instead of reload (transitional). > > Then let it rebuild everything. Everything is around 13.000 source packages. > > We are going to switch over anyway with GCC-12 or GCC-13, so I'm not sure > > what we are gaining if we continue to wait. > > I don't get it. You want to enable LRA by default for your distro, but you > don't want to rebuild all the packages with that modification? It's like .. > shipping a distro with a new compiler, which potentially can't compile the > distro packages (correctly), so instead we secretly use an older compiler to > build the packages .... ? Is that normal practice? Sorry, sounds like a > mess to me. Debian/sh4 is not a release target, so it's not being shipped and there is no warranty anyway. I also don't have the possibility to rebuild the whole archive in a separate project. There is only unstable. If am doing a complete rebuild now, I will be busy for the next weeks or months (there are new packages coming in every day) looking at issues because there might be a lot of failing packages, also for other reasons. It's simply not practical from my perspective. I'm doing this as a hobby, not as a full time job, so I can't take care of all issues all-day long. And, finally, the buildd capacity is limited on sh4. If this was POWER or SPARC, we would have plenty of resources for a complete rebuild.
(In reply to Oleg Endo from comment #121) > (In reply to John Paul Adrian Glaubitz from comment #120) > > > > That's a huge task which is why I prefer fixing issues on the fly. > > I thought this is almost fully automated? > > You can apply this patch to GCC to enable LRA by default for SH > > --- sh.opt.orig 2019-03-04 10:09:09.244521000 +0900 > +++ sh.opt 2020-02-26 10:19:55.414340269 +0900 > @@ -299,5 +299,5 @@ > Enable the use of the fsrra instruction. > > mlra > -Target Report Var(sh_lra_flag) Init(0) Save > +Target Report Var(sh_lra_flag) Init(1) Save > Use LRA instead of reload (transitional). I will just do that for the default kernel now. But I won't trigger a full archive rebuild. Just report issues once I see them.
(In reply to John Paul Adrian Glaubitz from comment #123) > I will just do that for the default kernel now. But I won't trigger a full > archive rebuild. Just report issues once I see them. Compiler, of course. Not kernel.
(In reply to John Paul Adrian Glaubitz from comment #122) > > The build process is. Fixing broken packages isn't. > > Everything is around 13.000 source packages. > > And, finally, the buildd capacity is limited on sh4. If this was POWER or > SPARC, we would have plenty of resources for a complete rebuild. Can you at least please try to check if the kernel builds & runs OK? As for the other packages, is there a way to let it just try to rebuild everything and get a list of the packages that failed to build. We don't need to sit down and fix all issues one by one for now. At least we can compare the list against the current list of non-building packages. If the difference is not too big, we can consider turning on LRA by default from GCC 10 on.