55212 – [SH] Switch to LRA

Bug 55212 - [SH] Switch to LRA

Summary: [SH] Switch to LRA

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	unknown

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:	67573
Blocks:	34777 54699 113932 50751 59400 61142
	Show dependency tree / graph

Reported:	2012-11-05 08:49 UTC by Oleg Endo
Modified:	2024-02-22 14:15 UTC (History)
CC List:	7 users (show)

See Also:	83464 112115 112116 100782 81466 67732 69765 66591
Host:
Target:	sh--*
Build:
Known to work:
Known to fail:
Last reconfirmed:	2023-07-21 00:00:00

Attachments
problematic libgcc divsc3 function (570 bytes, text/x-csrc) 2014-09-10 22:43 UTC, Oleg Endo	Details
Reduced case for MAX_RELOAD_INSNS_NUMBER (2.07 KB, text/x-csrc) 2014-09-13 22:15 UTC, Oleg Endo	Details
A possible patch (480 bytes, patch) 2014-09-17 08:03 UTC, Kazumoto Kojima	Details \| Diff
Reduced test case for ICE in assign_by_spill (150 bytes, text/x-csrc) 2014-09-21 22:52 UTC, Kazumoto Kojima	Details
.reload dump file (2.58 KB, text/plain) 2014-09-21 22:54 UTC, Kazumoto Kojima	Details
A trial patch (774 bytes, patch) 2014-09-21 23:01 UTC, Kazumoto Kojima	Details \| Diff
Another reduced test case (with "-m4 -ml -O2 -std=gnu99") (361 bytes, text/x-csrc) 2014-09-21 23:05 UTC, Kazumoto Kojima	Details
A trial patch (284 bytes, patch) 2014-09-21 23:13 UTC, Kazumoto Kojima	Details \| Diff
A bit reduced test case of pr38338.c (-O0 -m4 -ml) (112 bytes, text/x-csrc) 2014-09-25 02:20 UTC, Kazumoto Kojima	Details
Patch for SH untyped_call (410 bytes, patch) 2014-09-25 02:22 UTC, Kazumoto Kojima	Details \| Diff
A trial patch for reload-loop problem (1.78 KB, patch) 2014-09-28 07:10 UTC, Kazumoto Kojima	Details \| Diff
reduced CSiBE /libpng-1.2.5 test (25.80 KB, text/plain) 2014-09-29 23:55 UTC, Oleg Endo	Details
Reduced case of error: in assign_by_spills, at lra-assigns.c:1335 with -m4 -ml -O2 (775 bytes, text/plain) 2014-09-30 22:17 UTC, Oleg Endo	Details
A possible workaround (498 bytes, patch) 2014-10-07 02:24 UTC, Kazumoto Kojima	Details \| Diff
revised patch (502 bytes, patch) 2014-10-08 02:01 UTC, Kazumoto Kojima	Details \| Diff
revised patch for the problem in c#25 (1.28 KB, patch) 2014-10-11 03:33 UTC, Kazumoto Kojima	Details \| Diff
reduced CSiBE teem-1.6.0 test (-O2 sh4-linux) (231 bytes, text/x-csrc) 2014-10-11 09:16 UTC, Kazumoto Kojima	Details
a possible patch (1018 bytes, patch) 2014-10-12 07:08 UTC, Kazumoto Kojima	Details \| Diff
CSiBE result-size.cvs sh-lra+c#29+c#55+c#57+c#61 (7.53 KB, text/plain) 2014-10-14 01:12 UTC, Kazumoto Kojima	Details
CSiBE result-size.cvs trunk rev.215912 (7.50 KB, text/plain) 2014-10-14 01:13 UTC, Kazumoto Kojima	Details
a possible patch (936 bytes, patch) 2014-10-14 01:16 UTC, Kazumoto Kojima	Details \| Diff
CSiBE result-runtime.cvs sh-lra+c#29+c#55+c#57+c#61+c#66 (418 bytes, text/plain) 2014-10-14 01:18 UTC, Kazumoto Kojima	Details
CSiBE result-runtime.cvs trunk rev.215912 (424 bytes, text/plain) 2014-10-14 01:20 UTC, Kazumoto Kojima	Details
a reduced test case of SCiBE compiler/vam test (434 bytes, text/plain) 2014-10-23 00:31 UTC, Kazumoto Kojima	Details
another reduced test case of compiler/vam (103 bytes, text/plain) 2014-10-23 00:34 UTC, Kazumoto Kojima	Details
a trial patch for the issue c#76 (1.94 KB, patch) 2014-10-27 00:40 UTC, Kazumoto Kojima	Details \| Diff
a patch for the issue c#77 (782 bytes, patch) 2014-11-16 13:19 UTC, Kazumoto Kojima	Details \| Diff
patch to add -mlra option (2.83 KB, patch) 2014-11-29 02:23 UTC, Kazumoto Kojima	Details \| Diff
a reduced c++ test case (-O2 -std=gnu++11) (524 bytes, text/plain) 2014-12-02 01:16 UTC, Kazumoto Kojima	Details
20040709-2.s without LRA (PASS) (1.66 KB, text/plain) 2015-09-13 09:11 UTC, Oleg Endo	Details
20040709-2.s with LRA (FAIL) (1.51 KB, text/plain) 2015-09-13 09:12 UTC, Oleg Endo	Details
CSiBE I08 const cost = 0 vs. cost = 1, no LRA (15.08 KB, text/plain) 2015-09-21 12:11 UTC, Oleg Endo	Details
CSiBE I08 const cost = 0 vs. cost = 1, LRA (15.18 KB, text/plain) 2015-09-21 12:11 UTC, Oleg Endo	Details
Build log including preprocessed source for building gcc-14 (20240221) natively with LRA (411.99 KB, application/gzip) 2024-02-22 14:15 UTC, John Paul Adrian Glaubitz	Details
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Oleg Endo 2012-11-05 08:49:59 UTC

Since a new register allocator (LRA) has been added in 4.8 it might be worth trying it out for SH.

Comment 1 Oleg Endo 2014-08-03 09:47:47 UTC

Author: olegendo
Date: Sun Aug  3 09:47:15 2014
New Revision: 213524

URL: https://gcc.gnu.org/viewcvs?rev=213524&root=gcc&view=rev
Log:
PR target/55212
Create branch.

Added:
    branches/sh-lra/
      - copied from r213515, trunk/

Comment 2 Oleg Endo 2014-09-10 22:41:01 UTC

I've committed a small change as r215140.  In my case building libgcc failed in the function __divsc3 and the change in sh_secondary_reload made it go away.  I've briefly compared the resulting code against non-LRA and it looked OK.  However, building __divsc3 still fails for -m2 -mb, now with the following:

beh 0 0 0
(insn 1159 1008 1120 27 (set (reg:QI 625)
        (mem/c:QI (plus:SI (reg/f:SI 153 sfp)
                (const_int 19 [0x13])) [2 %sfp+-1 S1 A8])) sh_tmp.cpp:45 -1
     (nil))

sh_tmp.cpp: In function '__divsc3':
sh_tmp.cpp:57:1: internal compiler error: in lra_set_insn_recog_data, at lra.c:947
 }
 ^
0x8524c29 lra_set_insn_recog_data(rtx_def*)
	../../gcc-sh-lra/gcc/lra.c:947
0x85258cb lra_get_insn_recog_data
	../../gcc-sh-lra/gcc/lra-int.h:468
0x85258cb lra_update_insn_regno_info
	../../gcc-sh-lra/gcc/lra.c:1607
0x85258cb lra_update_insn_regno_info
	../../gcc-sh-lra/gcc/lra.c:1598
0x8525b1b lra_push_insn_1
	../../gcc-sh-lra/gcc/lra.c:1660
0x8525b1b lra_push_insn(rtx_def*)
	../../gcc-sh-lra/gcc/lra.c:1668
0x8525d12 push_insns
	../../gcc-sh-lra/gcc/lra.c:1711
0x852618f lra_process_new_insns(rtx_def*, rtx_def*, rtx_def*, char const*)
	../../gcc-sh-lra/gcc/lra.c:1756
0x8533758 simplify_operand_subreg
	../../gcc-sh-lra/gcc/lra-constraints.c:1523
0x8533758 curr_insn_transform
	../../gcc-sh-lra/gcc/lra-constraints.c:3258
0x8535e2d lra_constraints(bool)
	../../gcc-sh-lra/gcc/lra-constraints.c:4212
0x8526af8 lra(_IO_FILE*)
	../../gcc-sh-lra/gcc/lra.c:2198
0x84e6823 do_reload
	../../gcc-sh-lra/gcc/ira.c:5306
0x84e6823 execute
	../../gcc-sh-lra/gcc/ira.c:5465

The generated insn 1159 isn't recognized, because the displacement value is out of range.

It seems that LRA doesn't try to transform the out of range displacement QImode move into something else so that the <disp04> constraint is satisfied.

Maybe this is due to the way I did the QI/HImode displacement patterns.  AFAIR, reload uses the macro LEGITIMIZE_RELOAD_ADDRESS (sh_legitimize_reload_address) to handle such cases.  LRA can decompose addresses.  However, it looks like it's not done in some cases?

Comment 3 Oleg Endo 2014-09-10 22:43:52 UTC

Created attachment 33465 [details]
problematic libgcc divsc3 function

Comment 4 Oleg Endo 2014-09-13 14:50:35 UTC

(In reply to Oleg Endo from comment #2)
> However, building __divsc3 still fails for -m2 -mb, now with the
> following:
> 
> beh 0 0 0
> (insn 1159 1008 1120 27 (set (reg:QI 625)
>         (mem/c:QI (plus:SI (reg/f:SI 153 sfp)
>                 (const_int 19 [0x13])) [2 %sfp+-1 S1 A8])) sh_tmp.cpp:45 -1
>      (nil))
> 
> sh_tmp.cpp: In function '__divsc3':
> sh_tmp.cpp:57:1: internal compiler error: in lra_set_insn_recog_data, at
> lra.c:947
>  }
>  ^
> 0x8524c29 lra_set_insn_recog_data(rtx_def*)
> 	../../gcc-sh-lra/gcc/lra.c:947
> 0x85258cb lra_get_insn_recog_data
> 	../../gcc-sh-lra/gcc/lra-int.h:468
> 0x85258cb lra_update_insn_regno_info
> 	../../gcc-sh-lra/gcc/lra.c:1607
> 0x85258cb lra_update_insn_regno_info
> 	../../gcc-sh-lra/gcc/lra.c:1598
> 0x8525b1b lra_push_insn_1
> 	../../gcc-sh-lra/gcc/lra.c:1660
> 0x8525b1b lra_push_insn(rtx_def*)
> 	../../gcc-sh-lra/gcc/lra.c:1668
> 0x8525d12 push_insns
> 	../../gcc-sh-lra/gcc/lra.c:1711
> 0x852618f lra_process_new_insns(rtx_def*, rtx_def*, rtx_def*, char const*)
> 	../../gcc-sh-lra/gcc/lra.c:1756
> 0x8533758 simplify_operand_subreg
> 	../../gcc-sh-lra/gcc/lra-constraints.c:1523
> 0x8533758 curr_insn_transform
> 	../../gcc-sh-lra/gcc/lra-constraints.c:3258
> 0x8535e2d lra_constraints(bool)
> 	../../gcc-sh-lra/gcc/lra-constraints.c:4212
> 0x8526af8 lra(_IO_FILE*)
> 	../../gcc-sh-lra/gcc/lra.c:2198
> 0x84e6823 do_reload
> 	../../gcc-sh-lra/gcc/ira.c:5306
> 0x84e6823 execute
> 	../../gcc-sh-lra/gcc/ira.c:5465
> 
> The generated insn 1159 isn't recognized, because the displacement value is
> out of range.

When compiling for -m2a-nofpu -mb the problem goes away because SH2A can handle the displacement.  However, there are no QImode accesses in the resulting code after LRA.

Comment 5 Oleg Endo 2014-09-13 15:04:57 UTC

Compiling the gcc.dg/atomic/c11-atomic-exec-4.c test case with '-O2 -m4 -ml -matomic-model=soft-gusa' results in the following:

beh
(insn 207 206 208 6 (set (reg/f:SI 14 r14 [275])
        (plus:SI (reg/f:SI 15 r15)
            (const_int 36 [0x24]))) 76 {*addsi3_compact}
     (expr_list:REG_EQUIV (plus:SI (reg/f:SI 153 sfp)
            (const_int -4 [0xfffffffffffffffc]))
        (nil)))

internal compiler error: in check_rtl, at lra.c:1936
 TEST_FUNCS (complex_float_add, _Complex float, , += 1, 0, 20000)
 ^
0x85416bf check_rtl
	../../gcc-sh-lra/gcc/lra.c:1936
0x8544a9a lra(_IO_FILE*)
	../../gcc-sh-lra/gcc/lra.c:2327
0x8500e43 do_reload
	../../gcc-sh-lra/gcc/ira.c:5312
0x8500e43 execute
	../../gcc-sh-lra/gcc/ira.c:5471


the addsi3_compact insn should have been split into two insns: 1x reg-reg-move + 1x add-insn.

Comment 6 Oleg Endo 2014-09-13 16:24:27 UTC

Author: olegendo
Date: Sat Sep 13 16:23:55 2014
New Revision: 215240

URL: https://gcc.gnu.org/viewcvs?rev=215240&root=gcc&view=rev
Log:
PR target/55212
* lra.c: fix comments.  print insn before lra_assert.

Modified:
    branches/sh-lra/gcc/lra.c

Comment 7 Oleg Endo 2014-09-13 16:40:19 UTC

An issue similar to that in c#5 has been reported here:  https://gcc.gnu.org/ml/gcc/2014-04/msg00273.html

I've tried applying that patch, but it doesn't fix the issue in c#2 nor c#5.

Comment 8 Oleg Endo 2014-09-13 18:54:27 UTC

Author: olegendo
Date: Sat Sep 13 18:53:54 2014
New Revision: 215241

URL: https://gcc.gnu.org/viewcvs?rev=215241&root=gcc&view=rev
Log:
PR target/55212
* predicates.md (general_movsrc_operand,  general_movdst_operand):
  Don't legitimate QI/HImode  mem displacement while LRA is running.

Modified:
    branches/sh-lra/gcc/config/sh/predicates.md

Comment 9 Oleg Endo 2014-09-13 19:07:48 UTC

(In reply to Oleg Endo from comment #8)
> Author: olegendo
> Date: Sat Sep 13 18:53:54 2014
> New Revision: 215241
> 
> URL: https://gcc.gnu.org/viewcvs?rev=215241&root=gcc&view=rev
> Log:
> PR target/55212
> * predicates.md (general_movsrc_operand,  general_movdst_operand):
>   Don't legitimate QI/HImode  mem displacement while LRA is running.
> 
> Modified:
>     branches/sh-lra/gcc/config/sh/predicates.md

libgcc now seems to compile for all default multilibs defined for sh-elf.

The next failure is:

gcc-sh-lra-build-sh-elf/sh-elf/libstdc++-v3/include/bits/locale_facets_nonio.h:1564:66: internal compiler error: in set_address_disp, at rtlanal.c:5790
       { return this->do_put(__s, __intl, __io, __fill, __units); }
                                                                  ^
0x8830497 set_address_disp
	../../gcc-sh-lra/gcc/rtlanal.c:5790
0x8830497 set_address_disp
	../../gcc-sh-lra/gcc/rtlanal.c:6080
0x8830497 decompose_normal_address
	../../gcc-sh-lra/gcc/rtlanal.c:5912
0x8830497 decompose_address
	../../gcc-sh-lra/gcc/rtlanal.c:5997
0x88305b1 decompose_mem_address(address_info*, rtx_def*)
	../../gcc-sh-lra/gcc/rtlanal.c:6017
0x8745d7b process_address_1
	../../gcc-sh-lra/gcc/lra-constraints.c:2783
0x874aed9 process_address
	../../gcc-sh-lra/gcc/lra-constraints.c:2990
0x874aed9 curr_insn_transform
	../../gcc-sh-lra/gcc/lra-constraints.c:3274
0x874d905 lra_constraints(bool)
	../../gcc-sh-lra/gcc/lra-constraints.c:4215
0x873ce6c lra(_IO_FILE*)
	../../gcc-sh-lra/gcc/lra.c:2206
0x86f99a3 do_reload
	../../gcc-sh-lra/gcc/ira.c:5312
0x86f99a3 execute
	../../gcc-sh-lra/gcc/ira.c:5471

Comment 10 Oleg Endo 2014-09-13 19:48:57 UTC

(In reply to Oleg Endo from comment #9)
>  
> The next failure is:
> 
> gcc-sh-lra-build-sh-elf/sh-elf/libstdc++-v3/include/bits/locale_facets_nonio.
> h:1564:66: internal compiler error: in set_address_disp, at rtlanal.c:5790
>        { return this->do_put(__s, __intl, __io, __fill, __units); }
>                                                                   ^
> 0x8830497 set_address_disp
> 	../../gcc-sh-lra/gcc/rtlanal.c:5790
> 0x8830497 set_address_disp
> 	../../gcc-sh-lra/gcc/rtlanal.c:6080
> 0x8830497 decompose_normal_address
> 	../../gcc-sh-lra/gcc/rtlanal.c:5912
> 0x8830497 decompose_address
> 	../../gcc-sh-lra/gcc/rtlanal.c:5997
> 0x88305b1 decompose_mem_address(address_info*, rtx_def*)
> 	../../gcc-sh-lra/gcc/rtlanal.c:6017
> 0x8745d7b process_address_1
> 	../../gcc-sh-lra/gcc/lra-constraints.c:2783
> 0x874aed9 process_address
> 	../../gcc-sh-lra/gcc/lra-constraints.c:2990
> 0x874aed9 curr_insn_transform
> 	../../gcc-sh-lra/gcc/lra-constraints.c:3274
> 0x874d905 lra_constraints(bool)
> 	../../gcc-sh-lra/gcc/lra-constraints.c:4215
> 0x873ce6c lra(_IO_FILE*)
> 	../../gcc-sh-lra/gcc/lra.c:2206
> 0x86f99a3 do_reload
> 	../../gcc-sh-lra/gcc/ira.c:5312
> 0x86f99a3 execute
> 	../../gcc-sh-lra/gcc/ira.c:5471


This is due to the following invalid mem that ends up in the "*extend<mode>si2_compact_snd" pattern:

(insn 8 5 10 2 (set (reg/v:SI 171 [ __fill ])
        (sign_extend:SI (mem/c:QI (plus:SI (plus:SI (reg/f:SI 145 ap)
                        (const_int 12 [0xc]))
                    (const_int 4 [0x4])) [0 __fill+0 S1 A32]))) locale_facets_nonio.h:1562 241 {*extendqisi2_compact_snd}
     (expr_list:REG_EQUIV (mem:SI (plus:SI (reg/f:SI 15 r15)
                (const_int 4 [0x4])) [0  S4 A32])
        (nil)))


in function "decompose_address":
inner = (plus:SI (plus:SI (reg/f:SI 145 ap)
        (const_int 12 [0xc]))
    (const_int 4 [0x4]))

Comment 11 Oleg Endo 2014-09-13 20:21:32 UTC

Author: olegendo
Date: Sat Sep 13 20:21:00 2014
New Revision: 215243

URL: https://gcc.gnu.org/viewcvs?rev=215243&root=gcc&view=rev
Log:
PR target/55212
* predicates.md (general_movsrc_operand,  general_movdst_operand):
  Allow only valid plus address expressions.

Modified:
    branches/sh-lra/gcc/config/sh/predicates.md

Comment 12 Oleg Endo 2014-09-13 20:25:52 UTC

(In reply to Oleg Endo from comment #11)
> Author: olegendo
> Date: Sat Sep 13 20:21:00 2014
> New Revision: 215243
> 
> URL: https://gcc.gnu.org/viewcvs?rev=215243&root=gcc&view=rev
> Log:
> PR target/55212
> * predicates.md (general_movsrc_operand,  general_movdst_operand):
>   Allow only valid plus address expressions.
> 
> Modified:
>     branches/sh-lra/gcc/config/sh/predicates.md

After that, we're back to:

beh
(insn 189 188 1583 22 (set (reg/f:SI 14 r14 [561])
        (plus:SI (reg/f:SI 15 r15)
            (const_int 28 [0x1c]))) locale_facets_nonio.h:1195 76 {*addsi3_compact}
     (expr_list:REG_EQUAL (plus:SI (reg/f:SI 153 sfp)
            (const_int -36 [0xffffffffffffffdc]))
        (nil)))

locale_facets_nonio.tcc:125:5: internal compiler error: in check_rtl, at lra.c:1935

0x873a21f check_rtl
	../../gcc-sh-lra/gcc/lra.c:1935
0x873d5fa lra(_IO_FILE*)
	../../gcc-sh-lra/gcc/lra.c:2326
0x86f99a3 do_reload
	../../gcc-sh-lra/gcc/ira.c:5312
0x86f99a3 execute
	../../gcc-sh-lra/gcc/ira.c:5471

... which seems to be the same issue as in c#5

Comment 13 Oleg Endo 2014-09-13 20:40:53 UTC

Author: olegendo
Date: Sat Sep 13 20:40:21 2014
New Revision: 215244

URL: https://gcc.gnu.org/viewcvs?rev=215244&root=gcc&view=rev
Log:
PR target/55212
* Remove LEGITIMIZE_RELOAD_ADDRESS and sh_legitimize_reload_address.
  They are not used by LRA.

Modified:
    branches/sh-lra/gcc/config/sh/sh-protos.h
    branches/sh-lra/gcc/config/sh/sh.c
    branches/sh-lra/gcc/config/sh/sh.h

Comment 14 Oleg Endo 2014-09-13 21:32:58 UTC

Author: olegendo
Date: Sat Sep 13 21:32:27 2014
New Revision: 215246

URL: https://gcc.gnu.org/viewcvs?rev=215246&root=gcc&view=rev
Log:
PR target/55212
* Apply patch from https://gcc.gnu.org/ml/gcc/2014-04/msg00273.html

Modified:
    branches/sh-lra/gcc/lra-constraints.c

Comment 15 Oleg Endo 2014-09-13 21:35:34 UTC

(In reply to Oleg Endo from comment #14)
> Author: olegendo
> Date: Sat Sep 13 21:32:27 2014
> New Revision: 215246
> 
> URL: https://gcc.gnu.org/viewcvs?rev=215246&root=gcc&view=rev
> Log:
> PR target/55212
> * Apply patch from https://gcc.gnu.org/ml/gcc/2014-04/msg00273.html
> 
> Modified:
>     branches/sh-lra/gcc/lra-constraints.c

Now the build 'make all' for sh-elf c,c++ completes without errors.

New problem:
Compiling newlib 1.20.0 with '-O2 -m2e':

./../../../../../newlib-1.20.0/newlib/libm/math/er_lgamma.c: In function '__ieee754_lgamma_r':
../../../../../../newlib-1.20.0/newlib/libm/math/er_lgamma.c:309:1: internal compiler error: Max. number of generated reload insns per insn is achieved (90)

 }
 ^
0x85551c5 lra_constraints(bool)
	../../gcc-sh-lra/gcc/lra-constraints.c:4145
0x85442bc lra(_IO_FILE*)
	../../gcc-sh-lra/gcc/lra.c:2206
0x8500df3 do_reload
	../../gcc-sh-lra/gcc/ira.c:5312
0x8500df3 execute
	../../gcc-sh-lra/gcc/ira.c:5471
Please submit a full bug report,

Comment 16 Oleg Endo 2014-09-13 21:45:42 UTC

(In reply to Oleg Endo from comment #5)
> Compiling the gcc.dg/atomic/c11-atomic-exec-4.c test case with '-O2 -m4 -ml
> -matomic-model=soft-gusa' results in the following:
> 

This problem seems to be gone now, too.

Comment 17 Oleg Endo 2014-09-13 22:14:28 UTC

Author: olegendo
Date: Sat Sep 13 22:13:56 2014
New Revision: 215247

URL: https://gcc.gnu.org/viewcvs?rev=215247&root=gcc&view=rev
Log:
PR target/55212
* lra-constraints.c: Print insn when hitting MAX_RELOAD_INSNS_NUMBER.

Modified:
    branches/sh-lra/gcc/lra-constraints.c

Comment 18 Oleg Endo 2014-09-13 22:15:52 UTC

Created attachment 33489 [details]
Reduced case for MAX_RELOAD_INSNS_NUMBER

This fails when compiling with -O2 -m2e -m{l|b} with:

curr_insn = (insn 713 115 712 16 (parallel [
            (set (reg/v:SF 804 [orig:333 x ] [333])
                (reg/v:SF 714 [orig:333 x ] [333]))
            (use (reg/v:PSI 151 ))
            (clobber (scratch:SI))
        ]) er_lgamma.c:85 296 {movsf_ie}
     (nil))

In file included from sh_tmp.cpp:4:0:
er_lgamma.c: In function '__ieee754_lgamma_r':
er_lgamma.c:151:1: internal compiler error: Max. number of generated reload insns per insn is achieved (90)

 }
 ^
0x8555215 lra_constraints(bool)
	../../gcc-sh-lra/gcc/lra-constraints.c:4149
0x85442bc lra(_IO_FILE*)
	../../gcc-sh-lra/gcc/lra.c:2206
0x8500df3 do_reload
	../../gcc-sh-lra/gcc/ira.c:5312
0x8500df3 execute
	../../gcc-sh-lra/gcc/ira.c:5471

Comment 19 Oleg Endo 2014-09-13 23:19:09 UTC

(In reply to Oleg Endo from comment #18)
> Created attachment 33489 [details]
> Reduced case for MAX_RELOAD_INSNS_NUMBER
> 
> This fails when compiling with -O2 -m2e -m{l|b} with:
> 
> curr_insn = (insn 713 115 712 16 (parallel [
>             (set (reg/v:SF 804 [orig:333 x ] [333])
>                 (reg/v:SF 714 [orig:333 x ] [333]))
>             (use (reg/v:PSI 151 ))
>             (clobber (scratch:SI))
>         ]) er_lgamma.c:85 296 {movsf_ie}
>      (nil))

It's a reload loop and is maybe similar to PR 57032.

Comment 20 Oleg Endo 2014-09-16 07:54:33 UTC

Vlad, do you have any idea/suggestion what could be going wrong in this case?
Also, could you please have a look at the patch applied to the sh-lra branch as r215246  and see whether it makes sense to commit it to trunk?

Comment 21 Kazumoto Kojima 2014-09-17 08:00:50 UTC

I've run check-gcc for sh-lra branch on sh4-unknown-linux-gnu
and got
                === gcc Summary ===

# of expected passes            83035
# of unexpected failures        169
# of unexpected successes       1
# of expected failures          119
# of unresolved testcases       45
# of unsupported tests          1514

It looks that 3 new kind of ICEs except the ICE in #c18 which
causes ~100 failures.  ~50 failures are

internal compiler error: in assign_by_spills, at lra-assigns.c:1335

which may be similar with mips's PR61610.  The typical example is
c-c++-common/torture/complex-sign-add.c -O2.
There are some unrecognizable insn errors for the rtl like

(insn 11 10 12 2 (set (reg/f:SI 166)
        (mem/f/j:SI (plus:SI (subreg:SI (reg/v:DI 162 [ s1 ]) 0)
                (const_int 4 [0x4])) [0 _5->m+0 S4 A32]))

It seems that the patch in #c11 should take account of subregs.
The example of another type of new failures is 

gcc.c-torture/execute/20040703-1.c:127:1: error: could not split insn
(insn 118 112 119 (parallel [
            (set (reg:SI 9 r9 [orig:253 D.1616 ] [253])
                (xor:SI (zero_extend:SI (reg:QI 147 t))
                    (const_int 1 [0x1])))
            (clobber (reg:SI 1 r1 [287]))
            (clobber (reg:SI 147 t))
        ]) gcc.c-torture/execute/20040703-1.c:54 401 {*movrt_negc}

It looks that LRA substitutes
  (zero_extend:SI (subreg:QI (reg:SI 147 t) 0))
with
  (zero_extend:SI (reg:QI 147 t))
in this case.

Comment 22 Kazumoto Kojima 2014-09-17 08:03:27 UTC

Created attachment 33505 [details]
A possible patch

These last two errors could be fixed with the attached patch:

	* config/sh/predicates.md (general_movsrc_operand): Take
	subregs into account for plus address expression.
	(general_movdst_operand): Likewise.
	(t_reg_operand): Allow (zero_extend (reg t)).

Comment 23 Oleg Endo 2014-09-17 10:01:13 UTC

(In reply to Kazumoto Kojima from comment #22)
> Created attachment 33505 [details]
> A possible patch
> 
> These last two errors could be fixed with the attached patch:
> 
> 	* config/sh/predicates.md (general_movsrc_operand): Take
> 	subregs into account for plus address expression.
> 	(general_movdst_operand): Likewise.
> 	(t_reg_operand): Allow (zero_extend (reg t)).

Makes sense.

Comment 24 Kazumoto Kojima 2014-09-17 23:25:13 UTC

Author: kkojima
Date: Wed Sep 17 23:24:40 2014
New Revision: 215341

URL: https://gcc.gnu.org/viewcvs?rev=215341&root=gcc&view=rev
Log:
PR target/55212
* config/sh/predicates.md (general_movsrc_operand): Take
  subregs into account for plus address expression.
  (general_movdst_operand): Likewise.
  (t_reg_operand): Allow (zero_extend (reg t)).


Modified:
    branches/sh-lra/gcc/config/sh/predicates.md

Comment 25 Kazumoto Kojima 2014-09-21 22:52:43 UTC

Created attachment 33524 [details]
Reduced test case for ICE in assign_by_spill

I've looked into what is going on for the ICE in assign_by_spill.
The above test case ICEs on sh-elf with
"-m4 -ml -O1 -ftree-parallelize-loops=4 -ftree-vectorize".
.ira and .reload dumps show that the load and store insns

   42: r208:SI=[r202:SI+r185:SI]
   43: [r207:SI+r200:SI]=r218:HI // r218 = subreg:HI r208

are reloaded like as

   73: r217:SI=r200:SI
   79: r223:SI=r185:SI
   74: r218:HI=[r202:SI+r223:SI]
   43: [r207:SI+r217:SI]=r218:HI

and R0_REGS is assigned to both r217 and r223 because the only
register R0 can be used as the index register on SH.  .reload says
that the original insn 42 is deleted and the source operand r218
of insn 43 is substituted with its equiv [r202:SI+r185:SI].
It seems that it works also for SH if reload insn 73 was inserted
at just before insn 43.  I've attached .reload dump for this too.

Comment 26 Kazumoto Kojima 2014-09-21 22:54:18 UTC

Created attachment 33525 [details]
.reload dump file

Comment 27 Kazumoto Kojima 2014-09-21 23:01:02 UTC

Created attachment 33526 [details]
A trial patch

It disables equiv substitution when the equiv includes some reg
assigned to a small register class.  I hope that it makes the issue
a bit clearer, even if it doesn't the right thing.

Comment 28 Kazumoto Kojima 2014-09-21 23:05:06 UTC

Created attachment 33527 [details]
Another reduced test case (with "-m4 -ml -O2 -std=gnu99")

Here is a test case for the ICE in assign_by_spill which looks
caused by another reason.
.reload dump says that the problematic insn is something like

  107: {r166:SF=[r942:SI+r943:SI];use :PSI;clobber r784:SI;}

and

      Creating newreg=942 from oldreg=374, assigning class R0_REGS to address r942
      Creating newreg=943, assigning class GENERAL_REGS to base + disp r943
	   Change to class R0_REGS for r943

which means that R0_REGS is assigned to both r942 and r943.

Comment 29 Kazumoto Kojima 2014-09-21 23:13:19 UTC

Created attachment 33528 [details]
A trial patch

The patch is to refrain from changing to class R0_REGS for r943.

With the above 2 patches, the remained test case failing with the ICE
in assign_by_spill is gcc.dg/pr38338.c which uses extra-ordinary
__builtin_apply_args and __builtin_apply builtins.  I can't get
what is going on that case yet.

Comment 30 Oleg Endo 2014-09-23 20:35:33 UTC

I thought I've CC'ed Vlad, but somehow I haven't.  Trying again.

Comment 31 Kazumoto Kojima 2014-09-25 02:20:37 UTC

Created attachment 33556 [details]
A bit reduced test case of pr38338.c (-O0 -m4 -ml)

That case has only one basic block which looks like

	(prologue)
	(Save r4-r7,fr4-fr11,...)
  ;; Save argument
	mov.b	r0,@(8,r1)	! 4	*movqi_store_mem_disp04/1
  ;; insns don't use r0
	...
  ;; untyped_call
	mov.l	.L3,r1
	jsr	@r1
	nop	! 97	calli
	mov.l	r0,@(56,r9)	! 98	movsi_ie/9
	flds	fr0,fpul	! 197	movsi_ie/22
	sts	fpul,r1	! 198	movsi_ie/20
	mov.l	r1,@(60,r8)	! 99	movsi_ie/9
  ;;
	...

if RA was done.
It seems that LRA thought that r0 in insn 4 conflicts r0 in insn 98 and
fails at reloading *movqi_store_mem_disp04 which permits only r0 as
the source operand when reg+offset addressing mode is used.  LRA looks
right, because RA doesn't know that r0 is the function return value
register set by function call.
Target could notify it with clobbering function value registers before
call.

Comment 32 Kazumoto Kojima 2014-09-25 02:22:53 UTC

Created attachment 33557 [details]
Patch for SH untyped_call

	* config/sh/sh.md (untyped_call): Clobber function value
	registers before call.

Comment 33 Hans-Peter Nilsson 2014-09-25 06:26:46 UTC

(In reply to Kazumoto Kojima from comment #32)
> Patch for SH untyped_call
> config/sh/sh.md (untyped_call): Clobber function value registers before
> call.

(Sorry for butting in; I'm CC'ed with the purpose of observing issues likely to pop up for my own ports.)

Will clobbering function-value registers before the call not cause problems on SH5, where function-value registers and parameters registers overlap (IIUC)?

For reference, i386.md emits a blockage (i.e. *using* and clobbering all registers).

Comment 34 Kazumoto Kojima 2014-09-25 06:59:09 UTC

(In reply to Hans-Peter Nilsson from comment #33)
> Will clobbering function-value registers before the call not cause problems
> on SH5, where function-value registers and parameters registers overlap
> (IIUC)?

Yes.  I've forgot about SH5.  Thanks for pointing it out.  The patch
should be conditionalized for SH5 on which this issue won't happen.

> For reference, i386.md emits a blockage (i.e. *using* and clobbering all
> registers).

It looks all ports including SH emit a blockage after storing
the function return values to a memory block, not before the call.

Comment 35 Oleg Endo 2014-09-25 09:12:25 UTC

(In reply to Kazumoto Kojima from comment #32)
> Created attachment 33557 [details]
> Patch for SH untyped_call
> 
> 	* config/sh/sh.md (untyped_call): Clobber function value
> 	registers before call.

I'm just wondering ... how did/does that work without LRA (i.e. with IRA)?

Comment 36 Kazumoto Kojima 2014-09-25 10:32:35 UTC

(In reply to Oleg Endo from comment #35)
> I'm just wondering ... how did/does that work without LRA (i.e. with IRA)?

I'm not sure about the old reload.  LRA makes only 3 uses of r0 and it's
relatively easy to see what is going on there with .reload comments.
The old reload makes many uses of r0 and it looks not easy to see how
does that work.

Comment 37 Oleg Endo 2014-09-25 16:17:55 UTC

(In reply to Kazumoto Kojima from comment #36)
> I'm not sure about the old reload.  LRA makes only 3 uses of r0 and it's
> relatively easy to see what is going on there with .reload comments.
> The old reload makes many uses of r0 and it looks not easy to see how
> does that work.

Hm, there's still this open PR 23868 ... maybe it's somehow related or even fixed with LRA + attachment 33557 [details] ...

Comment 38 Hans-Peter Nilsson 2014-09-25 16:56:03 UTC

(In reply to Kazumoto Kojima from comment #34)
> For reference,
> i386.md emits a blockage (i.e. *using* and clobbering all
> registers).

> It
> looks all ports including SH emit a blockage after storing
the function
> return values to a memory block, not before the call.

Hm, you're right, but to me that indicates the patch covering-up a bug elsewhere than in the sh port.

Comment 39 Kazumoto Kojima 2014-09-25 23:34:19 UTC

(In reply to Hans-Peter Nilsson from comment #38)
> Hm, you're right, but to me that indicates the patch covering-up a bug
> elsewhere than in the sh port.

Would it be better to file an independent PR for this?
It looks a latent problem of untyped_call implementations on some
targets, though I guess that it's hard to hit because untyped_call
is fairly exotic and the RA issue before that call can be happened
only on arches with non-orthogonal instruction set like SH which
heavily depends on R0.

Comment 40 Kazumoto Kojima 2014-09-28 07:05:37 UTC

I've tried to see what is going on with reload loop on movsf_ie
for gcc.c-torture/compile/20050113-1.c -O1 -m4 -ml.
It looks that the problem starts at reloading

(insn 15 10 18 2 (parallel [
            (set (subreg:SF (reg:SI 173 [ a ]) 0)
                (minus:SF (reg:SF 160 [ D.1385 ])
                    (reg:SF 160 [ D.1385 ])))
            (use (reg/v:PSI 151 ))
        ]) yyy.i:10 443 {subsf3_i}
     (expr_list:REG_DEAD (reg:SF 160 [ D.1385 ])
        (expr_list:REG_DEAD (reg/v:PSI 151 )
            (nil))))

RA inserts

   43: {r179:SF=r160:SF;use :PSI;clobber scratch;}

before that insn and

   44: {r173:SI#0=r179:SF;use :PSI;clobber scratch;}

after that where #0 means subreg:SF 0.
RA assigns register class FP_REGS to r179 here and tries to
reload insn 44 which matches movsf_ie.
SH has no direct move instruction from floating point reg to
general reg and RA chooses the alt 1 in movesf_ie of which operands
are (0) r  (1) r  (2) c  (3) X.  Then RA creates new reg r180
with assigning class GENERAL_REGS to r180 and inserts

   45: {r180:SF=r179:SF;use :PSI;clobber scratch;}

before insn 44 which is now

   44: {r173:SI#0=r180:SF;use :PSI;clobber scratch;}

RA tries to reload insn 45 which is in the same situation with
the original 44, i.e. move from FP_REGS to GENERAL_REGS.
Thus things are looping.

It looks that the problematic cases are corresponding to the cases
which were handled by the secondary reloads in the old reload.
In the above example, if RA choosed the alt 13 in movsf_ie of
which operands are (0) fr  (1) rf  (2) c  (3) y, it didn't loop.
It looks RA doesn't choose it because it requires more numbers
of registers than the alt 1.  The current movsf_ie uses

   (clobber (match_scratch:SI 3 "=X,X,Bsc,Bsc,&z,X,X,X,X,X,X,X,X,y,X,X,X,X,X"))]

and this pattern might be not good for lra.

Comment 41 Kazumoto Kojima 2014-09-28 07:10:24 UTC

Created attachment 33601 [details]
A trial patch for reload-loop problem

My first trial is to define a special movsf_ie insn_and_split used
when lra makes new reload insns.
Here is the check-gcc result with the patch + c#29 + patch c#32:

                === gcc Summary ===

# of expected passes            83153
# of unexpected failures        21
# of unexpected successes       1
# of expected failures          119
# of unresolved testcases       3
# of unsupported tests          1514
/home/ldroot/dodes/xsh-gcc-lra/gcc/xgcc  version 5.0.0 20140913 (experimental) (GCC) 

No "Max. number of generated reload insns per insn is achieved" errors.
I'm backing out the patch in c#27 for this test because there are
many execution errors with it.

Comment 42 Oleg Endo 2014-09-28 21:43:56 UTC

(In reply to Kazumoto Kojima from comment #41)
> Created attachment 33601 [details]
> A trial patch for reload-loop problem
> 
> My first trial is to define a special movsf_ie insn_and_split used
> when lra makes new reload insns.

That also fixes the case in comment #18.

PR 54699 comes into my mind when seeing the movsf_ie patterns ... having another movsf_ie pattern is discomforting.  But if it makes it work, so be it.  Maybe we can try to combine the two movsf_ie insns later.

Comment 43 Kazumoto Kojima 2014-09-29 01:19:43 UTC

(In reply to Oleg Endo from comment #42)
> PR 54699 comes into my mind when seeing the movsf_ie patterns ... having
> another movsf_ie pattern is discomforting.  But if it makes it work, so be
> it.  Maybe we can try to combine the two movsf_ie insns later.

Sure.  It's an experiment to clear the issue further.  I'll
apply it and the revised untyped_call patch on sh-lra branch.

Comment 44 Kazumoto Kojima 2014-09-29 01:25:05 UTC

Author: kkojima
Date: Mon Sep 29 01:24:33 2014
New Revision: 215676

URL: https://gcc.gnu.org/viewcvs?rev=215676&root=gcc&view=rev
Log:
PR target/55212
* config/sh/sh-protos.h (sh_movsf_ie_ra_split_p): New prototype.
* config/sh/sh.c (sh_movsf_ie_ra_split_p): New function.
* config/sh/sh.md (movsi_ie): Use constraint "mr" instead of "m".
  (movsf_ie_ra): New insn_and_split.
  (movsf): Use movsf_ie_ra when lra_in_progress is set.


Modified:
    branches/sh-lra/gcc/config/sh/sh-protos.h
    branches/sh-lra/gcc/config/sh/sh.c
    branches/sh-lra/gcc/config/sh/sh.md

Comment 45 Kazumoto Kojima 2014-09-29 01:27:35 UTC

Author: kkojima
Date: Mon Sep 29 01:27:03 2014
New Revision: 215677

URL: https://gcc.gnu.org/viewcvs?rev=215677&root=gcc&view=rev
Log:
PR target/55212
* config/sh/sh.md (untyped_call): Clobber the function value registers
  before the call.


Modified:
    branches/sh-lra/gcc/config/sh/sh.md

Comment 46 Oleg Endo 2014-09-29 22:42:55 UTC

newlib 1.2.0 now builds without errors here.

Comment 47 Oleg Endo 2014-09-29 23:55:32 UTC

Created attachment 33615 [details]
reduced CSiBE /libpng-1.2.5 test

I've tried compiling CSiBE (-m4 -ml).  This is a stripped down pngrutil.c which crashes in lra-spills.c (remove_pseudos).
It's a bit strange, because if the function 'test' (top of the file) is compiled before the actual problematic function 'png_handle_cHRM', there's a segfault.  If 'test' is left out, it ends in:

internal compiler error: in copy_rtx, at rtl.c:356

0x862e4d2 copy_rtx(rtx_def*)
	../../gcc-sh-lra/gcc/rtl.c:356

In the segfault case, the insn is:

insn = 
(call_insn 617 775 618 42 (parallel [
            (call (mem:SI (reg/f:SI 701) [0 png_set_cHRM S4 A32])
                (const_int 0 [0]))
            (use (reg:PSI 151 ))
            (clobber (reg:SI 146 pr))
        ]) sh_tmp.cpp:395 316 {calli}
     (expr_list:REG_DEAD (reg:SI 69 fr5)
        (expr_list:REG_DEAD (reg:SI 75 fr11)
            (expr_list:REG_DEAD (reg:SI 71 fr7)
                (expr_list:REG_DEAD (reg:SI 73 fr9)
                    (expr_list:REG_DEAD (reg/f:SI 701)
                        (expr_list:REG_DEAD (reg:SI 5 r5)
                            (expr_list:REG_DEAD (reg:SI 4 r4)
                                (expr_list:REG_DEAD (reg:DF 74 fr10)
                                    (expr_list:REG_DEAD (reg:DF 72 fr8)
                                        (expr_list:REG_DEAD (reg:DF 70 fr6)
                                            (expr_list:REG_DEAD (reg:DF 68 fr4)
                                                (expr_list:REG_CALL_DECL (symbol_ref:SI ("png_set_cHRM") [flags 0x41]  <function_decl 0xb7407c80 png_set_cHRM>)
                                                    (nil)))))))))))))
    (expr_list:SI (use (reg:SI 4 r4))
        (expr_list:SI (use (reg:SI 5 r5))
            (expr_list:DF (use (reg:DF 68 fr4))
                (expr_list:DF (use (reg:DF 70 fr6))
                    (expr_list:DF (use (reg:DF 72 fr8))
                        (expr_list:DF (use (reg:DF 74 fr10))
                            (expr_list:DF (use (mem:DF (reg/f:SI 15 r15) [0  S8 A32]))
                                (expr_list:DF (use (mem:DF (reg/f:SI 699) [0  S8 A32]))
                                    (expr_list:DF (use (mem:DF (reg/f:SI 696) [0  S8 A32]))
                                        (expr_list:DF (use (mem:DF (reg/f:SI 693) [0  S8 A32]))
                                            (nil))))))))))))

and the crash is caused by
loc = (reg/f:SI 699)

In the non-segfault case the input arg 'orig' in copy_rtx prints "(UnKnown Unknown)"

Comment 48 Oleg Endo 2014-09-30 00:16:17 UTC

(In reply to Oleg Endo from comment #47)
> Created attachment 33615 [details]
> reduced CSiBE /libpng-1.2.5 test
> 
> I've tried compiling CSiBE (-m4 -ml).  This is a stripped down pngrutil.c
> which crashes in lra-spills.c (remove_pseudos).
> It's a bit strange, because if the function 'test' (top of the file) is
> compiled before the actual problematic function 'png_handle_cHRM', there's a
> segfault. 

The segfault happens because of this lookup (remove_pseudos):

if ((hard_reg = spill_hard_reg[i]) != NULL_RTX)

The array at i = 699 doesn't seem to contain anything valid.

Function 'assign_spill_hard_regs' sets those values:

      spill_hard_reg[regno]
	= gen_raw_REG (PSEUDO_REGNO_MODE (regno), hard_regno);

However, in this case it never gets to it because of this:

  if (! lra_reg_spill_p)
    return n;

Comment 49 Kazumoto Kojima 2014-09-30 10:08:08 UTC

(In reply to Oleg Endo from comment #48)
> The array at i = 699 doesn't seem to contain anything valid.

It looks that
  (expr_list:DF (use (mem:DF (reg/f:SI 699) [0  S8 A32]))
which shows an argument of call_insn 617 via memory, was missed
by LRA as a usage of pseudo 699.  Here is my trial:

--- gcc/lra.c.orig	2014-09-14 09:09:57.223724308 +0900
+++ gcc/lra.c	2014-09-30 17:15:21.709508021 +0900
@@ -1615,6 +1615,20 @@ lra_update_insn_regno_info (rtx_insn *in
   if ((code = GET_CODE (PATTERN (insn))) == CLOBBER || code == USE)
     add_regs_to_insn_regno_info (data, XEXP (PATTERN (insn), 0), uid,
 				 code == USE ? OP_IN : OP_OUT, false);
+  if (CALL_P (insn))
+    {
+      rtx link;
+
+      /* Take account of arguments via memory which could be implicit
+	 usage of pseudo regs.  */
+      for (link = CALL_INSN_FUNCTION_USAGE (insn);
+	   link != NULL_RTX;
+	   link = XEXP (link, 1))
+	if (GET_CODE (XEXP (link, 0)) == USE
+	    && MEM_P (XEXP (XEXP (link, 0), 0)))
+	  add_regs_to_insn_regno_info (data, XEXP (XEXP (link, 0), 0), uid,
+				       OP_IN, false);
+    }
   if (NONDEBUG_INSN_P (insn))
     setup_insn_reg_info (data, freq);
 }

Comment 50 Oleg Endo 2014-09-30 22:13:14 UTC

Author: olegendo
Date: Tue Sep 30 22:12:42 2014
New Revision: 215744

URL: https://gcc.gnu.org/viewcvs?rev=215744&root=gcc&view=rev
Log:
PR target/55212
* lra.c (lra_update_insn_regno_info): Handle mem args in calls.

Modified:
    branches/sh-lra/gcc/lra.c

Comment 51 Oleg Endo 2014-09-30 22:15:14 UTC

(In reply to Oleg Endo from comment #50)
> Author: olegendo
> Date: Tue Sep 30 22:12:42 2014
> New Revision: 215744
> 
> URL: https://gcc.gnu.org/viewcvs?rev=215744&root=gcc&view=rev
> Log:
> PR target/55212
> * lra.c (lra_update_insn_regno_info): Handle mem args in calls.
> 
> Modified:
>     branches/sh-lra/gcc/lra.c

This fixes the issue with attachment 33615 [details].

Comment 52 Oleg Endo 2014-09-30 22:17:07 UTC

Created attachment 33632 [details]
Reduced case of error: in assign_by_spills, at lra-assigns.c:1335 with -m4 -ml -O2

Comment 53 Kazumoto Kojima 2014-10-07 02:21:56 UTC

(In reply to Oleg Endo from comment #52)
> Created attachment 33632 [details]
> Reduced case of error: in assign_by_spills, at lra-assigns.c:1335 with -m4
> -ml -O2

.ira dump of the test case has the lines

(insn 145 144 148 8 (set (reg/v:SI 245 [ qval ])
        (mem:SI (plus:SI (reg/v/f:SI 345 [orig:213 divisors ] [213])
                (reg:SI 349 [orig:259 ivtmp.27 ] [259])) [7 MEM[base: divisors_17, index: ivtmp.27_119, offset: 0B]+0 S4 A32])) jcdctmgr.i:87 258 {movsi_ie}
     (nil))
(insn 148 145 149 8 (set (reg/v:SI 246 [ temp ])
        (mem:SI (plus:SI (reg:SI 349 [orig:259 ivtmp.27 ] [259])
                (reg:SI 334)) [7 MEM[symbol: workspace, index: ivtmp.27_119, offset: 0B]+0 S4 A32])) jcdctmgr.i:88 258 {movsi_ie}
     (nil))

which correspond to the lines

 qval = divisors[i];
 temp = workspace[i];

of the test case.  r349 is the index variable i and r334 is the pointer
variable workspace.  LRA assigns R0_REGS to r349 at insn 145 because
rtlanal.c:decompose_mem_address guesses r345 would be base and r349 index.
Unfortunately, that function also guesses r349 would be base and r334 index
at insn 148.  LRA tries to assign R0_REGS to r334 according to this "guess"
result.  .reload dump says:

Changing address in insn 148 r349:SI+r334:SI on equiv r349:SI+sfp:SI
      Creating newreg=384 from oldreg=153, assigning class R0_REGS to address r384
      Creating newreg=385 from oldreg=349, assigning class R0_REGS to address r385

Thus regclass of r384 and r385 corrides at insn

  148: r246:SI=[r385:SI+r384:SI]

Comment 54 Kazumoto Kojima 2014-10-07 02:24:39 UTC

Created attachment 33657 [details]
A possible workaround

The patch is trying to fix the result of decompose_mem_address
so as not to assign INDEX_REG_CLASS to both base and index regs
when INDEX_REG_CLASS has one register only.

Comment 55 Kazumoto Kojima 2014-10-08 02:01:33 UTC

Created attachment 33662 [details]
revised patch

The last patch should check if base rtx and index rtx of the address are registers.

Comment 56 Oleg Endo 2014-10-10 18:07:27 UTC

There are some LRA changes coming ...
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00948.html

Comment 57 Kazumoto Kojima 2014-10-11 03:33:21 UTC

Created attachment 33686 [details]
revised patch for the problem in c#25

Comment 58 Kazumoto Kojima 2014-10-11 03:37:40 UTC

(In reply to Kazumoto Kojima from comment #57)
> revised patch for the problem in c#25

I've looked into the remained ICEs for usual sh4-unknown-linux-gnu
test and got that they are similar to the ICE in c#25.
The original insn is a move insn from/to a memory with base+index
addressing which requires r0 for index reg on SH.  The other operand
of that insn has an equivalence with a memory of which load/store
uses r0 for some reason.  The tipical example is a memory of which
addressing is base+index and base+a_small_display_constant which
can fit to mov.[bwl] r0,@(disp,rm)/@(disp,rm),r0 instructions.
The patch introduces a new tergetm fuction and hides equivalences
for such cases.  The default tergetm fuction simply do nothing and
shouldn't impact the other targets.
The patch might be a bit overkill for SH.  We need to see the code
quality.

With c#29 + c#55 + c#57 patches, all ICEs went away on sh4-linux.
compare_tests against trunk rev.215912 says

--
New tests that FAIL:

g++.dg/cpp1y/feat-cxx14.C  -std=gnu++1y (test for excess errors)
gcc.target/sh/pr50749-sf-postinc-1.c scan-assembler-times fmov.s\\t@r[0-9]+\\\\+,fr[0-9]+ 1
gcc.target/sh/pr50749-sf-postinc-3.c scan-assembler-times fmov.s\\t@r[0-9]+\\\\+,fr[0-9]+ 1

Old tests that failed, that have disappeared: (Eeek!)

gcc.dg/atomic/c11-atomic-exec-4.c   -Os  (internal compiler error)
gcc.dg/atomic/c11-atomic-exec-4.c   -Os  (test for excess errors)
libgomp.fortran/udr14.f90   -O3 -g  (internal compiler error)
libgomp.fortran/udr14.f90   -O3 -g  (test for excess errors)
--

The first new FAIL looks a test glitch and it turned out that
pr50749-sf-postinc-1.c FAILs can be easily fixed.

It looks that the LRA change suggested in c#56 is related about
the optimization based with equivalences.  I'd like to try it
on sh-lra.

Comment 59 Kazumoto Kojima 2014-10-11 09:16:44 UTC

Created attachment 33688 [details]
reduced CSiBE teem-1.6.0 test (-O2 sh4-linux)

Yet another lra-assigns.c:1335 ICE when compiling nrrd/kernel.c of CSiBE/
teem-1.6.0 test.  It looks that RA fails for FPUL_REGS class.

Comment 60 Kazumoto Kojima 2014-10-12 07:05:01 UTC

(In reply to Kazumoto Kojima from comment #59)
> Yet another lra-assigns.c:1335 ICE when compiling nrrd/kernel.c of CSiBE/
> teem-1.6.0 test.  It looks that RA fails for FPUL_REGS class.

It seems that the scenario starts at insn 58 in .reload.

   58: {r199:DF=float_extend(r190:SF);use :PSI;}

LRA inserts a reload insn 138 before it.

  138: {r261:SF=r190:SF;use :PSI;clobber scratch;0;}
   58: {r199:DF=float_extend(r261:SF);use :PSI;}

During inheritance/split process, LRA inserts a move insn 151
after 138 to save r190.

    Add save<-reg after:
  151: r280:SI=r190:SF#0

where LRA assigns the hard reg fr8 to r190 and #0 means subreg:SI
here.  We need fpul to save the SFmode value on fr8 to SImode
object and insn 58 is actually fcnvds instruction of which destination
is fpul.  This causes the ICE.
LRA chooses SImode at insn 151 because fr8 is a caller save register
and HARD_REGNO_CALLER_SAVE_MODE returns SImode for fr8 even with
SFmode.  SH doesn't define its specific HARD_REGNO_CALLER_SAVE_MODE
and uses the default choose_hard_reg_mode (regno, nregs, false)
implementation.  choose_hard_reg_mode chooses mode based on
HARD_REGNO_MODE_OK and returns ?Imode for float registers when
sh_hard_regno_mode_ok permits integer modes on them. 
I guess that HARD_REGNO_CALLER_SAVE_MODE should return SFmode in
this case.

Comment 61 Kazumoto Kojima 2014-10-12 07:08:42 UTC

Created attachment 33692 [details]
a possible patch

I'm testing a patch to define a SH specific HARD_REGNO_CALLER_SAVE_MODE macro.

Comment 62 Oleg Endo 2014-10-12 17:28:45 UTC

Just a note... I've briefly checked whether PR 54429 gets any better with LRA.  It doesn't seem to be the case.

Comment 63 Kazumoto Kojima 2014-10-14 01:12:45 UTC

Created attachment 33705 [details]
CSiBE result-size.cvs sh-lra+c#29+c#55+c#57+c#61

Comment 64 Kazumoto Kojima 2014-10-14 01:13:32 UTC

Created attachment 33706 [details]
CSiBE result-size.cvs trunk rev.215912

Comment 65 Kazumoto Kojima 2014-10-14 01:14:49 UTC

We can see ~2% code size regression on average with CSiBE.
There are some cases which register elimination doesn't
work well.  An example is

int foo (int x)
{
  int y[100];

  bar (&y[4], x);
  return y[4];
}

which is compiled to a lengthy code:

foo:
        sts.l   pr,@-r15
        mov.w   .L6,r1
        mov     r4,r5
        mov.w   .L3,r2
        sub     r1,r15
        mov.l   .L4,r0
        mov     r15,r3
        mov.w   .L5,r1
        add     #4,r3
        add     r3,r2
        add     r2,r1
        mov.l   r1,@r15
        mov     r1,r4
        jsr     @r0
        add     #16,r4
        mov.l   @r15,r1
        mov.l   @(16,r1),r0
        mov.w   .L6,r7
        add     r7,r15
        lds.l   @r15+,pr
        rts
        nop
        .align 1
.L6:
        .short  404
.L3:
        .short  400
.L5:
        .short  -400

with -O2.  It seems that SH's very limited addsi3 insn doesn't
match with the LRA's register elimination.  The canonical insn
like (set reg_a (plus reg_b const_int)) are scaned at register
elimination phase.
On SH, it's expanded to the 2 insns (set reg_a const_int) and
(set reg_a (plus reg_a reg_b)) when const_int doesn't fit in 8-bit.
This makes the elimination process unhappy.

Comment 66 Kazumoto Kojima 2014-10-14 01:16:54 UTC

Created attachment 33707 [details]
a possible patch

With it, foo is compiled to

foo:
        sts.l   pr,@-r15
        mov.w   .L4,r1
        mov     r4,r5
        mov.l   .L3,r0
        mov     #16,r4
        sub     r1,r15
        jsr     @r0
        add     r15,r4
        mov.l   @(16,r15),r0
        mov.w   .L4,r7
        add     r7, r15
        lds.l   @r15+,pr
        rts
        nop
        .align 1
.L4:
        .short  400

and the code size regression for CSiBE is reduced to 0.74%.

Comment 67 Kazumoto Kojima 2014-10-14 01:18:53 UTC

Created attachment 33708 [details]
CSiBE result-runtime.cvs sh-lra+c#29+c#55+c#57+c#61+c#66

Comment 68 Kazumoto Kojima 2014-10-14 01:20:24 UTC

Created attachment 33709 [details]
CSiBE result-runtime.cvs trunk rev.215912

0.54% runtime regression.

Comment 69 Kazumoto Kojima 2014-10-19 02:33:58 UTC

The patch in c#57 disables memory equiv substitution for the memory
with base+index and base+display addressing.

static bool
sh_cannot_substitute_equiv_p (rtx subst)
{
  if (TARGET_SHMEDIA)
    return false;

  if (GET_CODE (subst) == SUBREG)
    subst = SUBREG_REG (subst);
  if (MEM_P (subst) && GET_CODE (XEXP (subst, 0)) == PLUS)
    return true;

  return false;
}

A bit surprisingly, disabling all memory equiv substitution wins
CSiBE tests.  With

static bool
sh_cannot_substitute_equiv_p (rtx subst)
{
  if (TARGET_SHMEDIA)
    return false;

  return true;
}

the code size regression for CSiBE from non LRA is reduced to 0.59%.
Looking at the improved cases, the extra save/restore insns to/from
stack disappear.  I guess that SH has not enough numbers of the hard
registers to make the equiv substitution win in the size and the speed
on average working sets.  It looks the pseudos produced to hold
the equiv values can't get hard registers for bad cases and end up
memory save/restore insns which make the code worse.

Comment 70 Kazumoto Kojima 2014-10-20 02:28:25 UTC

I'd like to apply the revised patches below to sh-lra branch for
looking at the problems easily.  Oleg, is it OK for you?

c#55: Fixup the result of decompose_mem_address when INDEX_REG_CLASS
  is a single register class.
c#59: Introduce target hook which disables the substitution of pseudos
  with equivalent memory values.
c#61: Define SH specific HARD_REGNO_CALLER_SAVE_MODE target macro.
c#66: Make *addsi3_compact insn_and_split and add an alternative
  which will split to (set a c) and (set (plus a b)) after reload.

I omit c#29 patch which touches lra-constraints.c and may impact on
other targets.  That patch would make it hard to compare with other
targets.  c#55 and c#59 also touch lra-constraints.c but those changes
unlikely impact on other targets.  Fortunately the test case c#28
doesn't ICE anymore without c#29.

Comment 71 Oleg Endo 2014-10-21 09:02:38 UTC

(In reply to Kazumoto Kojima from comment #69)
> the code size regression for CSiBE from non LRA is reduced to 0.59%.
> Looking at the improved cases, the extra save/restore insns to/from
> stack disappear.  I guess that SH has not enough numbers of the hard
> registers to make the equiv substitution win in the size and the speed
> on average working sets.  It looks the pseudos produced to hold
> the equiv values can't get hard registers for bad cases and end up
> memory save/restore insns which make the code worse.

I don't know the details and maybe I'm totally off here ... LRA is being used for ARM and there are almost the same amount of GP registers available on ARM than on SH.  So either on ARM nobody has checked for such regressions, or there's something else going wrong or missing on SH?  Or is it maybe really that SH R0-ness thing that makes everything work (or not work) totally different?

Comment 72 Oleg Endo 2014-10-21 09:04:52 UTC

(In reply to Kazumoto Kojima from comment #70)
> I'd like to apply the revised patches below to sh-lra branch for
> looking at the problems easily.  Oleg, is it OK for you?

Sure.  However, maybe it's better to do a merge from trunk first.  I know there might be some conflicts w.r.t. my recent changes on trunk, in particular for PR 53513.  On the other hand, I've seen some LRA changes/fixes on trunk recently.

Comment 73 Kazumoto Kojima 2014-10-21 11:19:52 UTC

(In reply to Oleg Endo from comment #71)
> I don't know the details and maybe I'm totally off here ... LRA is being
> used for ARM and there are almost the same amount of GP registers available
> on ARM than on SH.  So either on ARM nobody has checked for such
> regressions, or there's something else going wrong or missing on SH?  Or is
> it maybe really that SH R0-ness thing that makes everything work (or not
> work) totally different?

I'm not sure about ARM.  The problematic cases I've looked at are
high R0 pressure cases and IRA decided to allocate equiv value
registers to memory as most profitable ones.
It looks that the remained code size inflation came from R0-ness,
very limited base+display addressing, only one index register and
so on.  I'll attach the test cases for them to this PR after merge
from trunk and commit current patches.

Comment 74 Oleg Endo 2014-10-21 11:23:21 UTC

(In reply to Kazumoto Kojima from comment #73)
> I'm not sure about ARM.  The problematic cases I've looked at are
> high R0 pressure cases and IRA decided to allocate equiv value
> registers to memory as most profitable ones.
> It looks that the remained code size inflation came from R0-ness,
> very limited base+display addressing, only one index register and
> so on.

OK, makes sense.

> I'll attach the test cases for them to this PR after merge
> from trunk and commit current patches.

Thanks!

Comment 75 Kazumoto Kojima 2014-10-21 23:55:33 UTC

FYI, merge from trunk revision 216447 as r216529.
I've fixed c#55, c#59, c#61 and c#66 so to match this merge and committed
them on sh-lra as r216532, r216533, r216533 and r216535, respectively.

Comment 76 Kazumoto Kojima 2014-10-23 00:31:43 UTC

Created attachment 33787 [details]
a reduced test case of SCiBE compiler/vam test

compiler/vam is a test which is an example of code size regression
from non LRA compiler.  With -O2, non LRA compiler generates code like

	mov.l	r0,@(56,r15)
	mov	#64,r1
	add	#2,r0
	mov	r13,r2
	add	r15,r1
	mov.l	r0,@(60,r15)
	add	#18,r2
	mov	r13,r14
	add	#2,r0
	mov.l	r2,@(0,r1)
	mov.l	r0,@(4,r1)
	add	#2,r2
	add	#2,r0
	mov.l	r2,@(8,r1)
	mov.l	r0,@(12,r1)

and the corresponding code generated by sh-lra compiler looks like

	mov.l	r1,@(56,r15)
	mov	#17,r1
	add	r2,r1
	mov.l	r1,@(60,r15)
	mov	#2,r14
	mov	#16,r10
	mov	#18,r1
	add	r2,r1
	add	r2,r14
	add	r2,r10
	mov	#64,r2
	add	r15,r2
	mov.l	r1,@r2
	mov	#19,r1
	mov	#68,r2
	add	r13,r1
	add	r15,r2
	mov.l	r1,@r2
	mov	#20,r1
	mov	#72,r2
	add	r13,r1
	add	r15,r2
	mov.l	r1,@r2
	mov	#21,r1
	mov	#76,r2
	add	r13,r1
	add	r15,r2
	mov.l	r1,@r2

which shows that base+disp addressing mode wasn't used for the insn
(set reg (mem (plus sp N))) when N >= 64.
It looks that non LRA compiler uses targetm.legitimize_address to
handle that case and splits it to (set new_reg (plus sp M)) and
(set reg (mem (plus new_reg N-M))) so as to N-M is a legitimate
display constant for base+disp addressing.

Comment 77 Kazumoto Kojima 2014-10-23 00:34:00 UTC

Created attachment 33788 [details]
another reduced test case of compiler/vam

This is an another test case got from compiler/vam test.
The generated code for the second foo call with non LRA compiler at
-O2 looks like

	mov.b	@(7,r8),r0
	extu.b	r0,r7
	mov.b	@(6,r8),r0
	extu.b	r0,r6
	mov.b	@(5,r8),r0
	extu.b	r0,r5
	mov.b	@(4,r8),r0
	jsr	@r9
	extu.b	r0,r4

and the corresponding code with sh-lra is here

	mov.b	@(7,r8),r0
	mov.b	r0,@(1,r15)
	mov.b	@(6,r8),r0
	mov.b	r0,@(2,r15)
	mov.b	@(5,r8),r0
	mov.b	r0,@(3,r15)
	mov.b	@(4,r8),r0
	mov.b	r0,@r15
	mov.b	@(1,r15),r0
	extu.b	r0,r7
	mov.b	@(2,r15),r0
	extu.b	r0,r6
	mov.b	@(3,r15),r0
	extu.b	r0,r5
	mov.b	@r15,r0
	jsr	@r9
	extu.b	r0,r4

Comment 78 Kazumoto Kojima 2014-10-27 00:40:05 UTC

Created attachment 33813 [details]
a trial patch for the issue c#76

With it, the generated code for c#76 test case looks similar with
the one with non LRA compiler. The total code size regression for
CSiBE is reduced to 0.43%.
The patch introduces a new targetm function legitimize_address_display.
I've tested it on arm thumb with adding a similar new targetm function
to arm.c.  There is only one CSiBE test of which code size is changed
from 400 to 396 at -Os and there are no size changes at -O2 on thumb.
Although arm thumb is better than SH about those base+display addressing
because of 8-bit display for stack pointer register and no R0-ness,
it's a bit surprising to me.

Comment 79 Oleg Endo 2014-10-27 00:47:33 UTC

(In reply to Kazumoto Kojima from comment #78)
> Created attachment 33813 [details]
> a trial patch for the issue c#76
> 
> With it, the generated code for c#76 test case looks similar with
> the one with non LRA compiler. The total code size regression for
> CSiBE is reduced to 0.43%.
> The patch introduces a new targetm function legitimize_address_display.

Hm, maybe it's better to name this "legitimize_address_displacement" or "legitimize_address_offset"?

Comment 80 Kazumoto Kojima 2014-10-27 01:01:13 UTC

(In reply to Oleg Endo from comment #79)
> Hm, maybe it's better to name this "legitimize_address_displacement" or
> "legitimize_address_offset"?

Sure.  I'm not sure whether that targetm function is a good idea
or not in the first place, though.

Comment 81 Oleg Endo 2014-10-27 01:09:56 UTC

(In reply to Kazumoto Kojima from comment #80)
> (In reply to Oleg Endo from comment #79)
> > Hm, maybe it's better to name this "legitimize_address_displacement" or
> > "legitimize_address_offset"?
> 
> Sure.  I'm not sure whether that targetm function is a good idea
> or not in the first place, though.

I haven't studied the LRA code, but I guess there must be some logic that replaces the target hook function sh_legitimize_reload_address.  Either it's using sh_legitimize_address and interpreting/expecting something different from what it gets, or it's doing it's own "guessing" of how to legitimize the address.  My opinion is that it's better to have an explicit transformation in the back end (i.e. something like legitimize_address_offset) rather than guess-and-check-if-valid.  But that's just my personal opinion.

Comment 82 Kazumoto Kojima 2014-11-16 13:15:53 UTC

(In reply to Kazumoto Kojima from comment #77)
> Created attachment 33788 [details]
> another reduced test case of compiler/vam

It seems that unsigned char memory accesses make this bad code
with LRA.
We have no (zero_extend:SI (reg) (mem)) instruction for QI/HImode.
When displacement addressing is used, (set rA (mem (plus rX disp)))
and a zero extension between registers are generated for it.
The combine phase may combine that extension and another move insn
and deletes the original extension:

	  (set rA (mem (plus rX disp)))		(set rA (mem (plus rX disp)))
	  (zero_extend:SI rB rA)		deleted insn
	  ...			 	=>  
	  (set rC rB)				(zero_extend:SI rC rA)

RA will assign r0 to rA for QI/HImode and make a long live range
for R0 unfortunately.  This may cause anomalous register spills in
some case with LRA.  I guess that the old reload has something
to handle such exceptional case.  I think that the long live R0 is
problematic in the first place, though.
The above combine doesn't happen if rA is a hard reg of which reg
class is likely spilled to avoid such problem.  We can split
(set rA (mem (plus rX disp))) to two move insns via r0 which is
a likely spilled register.  It will make some codes worse but it
wins on average.
CSiBE shows 0.036% total code size improvement when the above split
is done in prepare_move_operands for load only and 0.049% when
done for both load and store.  I'll attach the patch for it.  With
it, the generated code for the 2nd call of the test case looks like

	mov.b	@(7,r8),r0
	mov	r0,r7
	mov.b	@(6,r8),r0
	extu.b	r7,r7
	mov	r0,r6
	mov.b	@(5,r8),r0
	extu.b	r6,r6
	mov	r0,r5
	mov.b	@(4,r8),r0
	extu.b	r5,r5
	jsr	@r9
	extu.b	r0,r4

Spills to memory went away, though it's still worse than non LRA case.
Before that patch, I've tried several splitters for zero_extend of
which operands are register and memory with displacement addressing.
They gave similar codes against the test case but can't win CSiBE on
average.  Even with some peepholes to fix regressions, the total code
size of CSiBE increases 0.05-0.1%.

Comment 83 Kazumoto Kojima 2014-11-16 13:19:59 UTC

Created attachment 33992 [details]
a patch for the issue c#77

Interestingly, this reduces the total text size of CSiBE test ~0.04%
at -O2 even for the trunk i.e. with the old reload.

Comment 84 Kazumoto Kojima 2014-11-23 07:02:59 UTC

FYI, merge from trunk revision 217978 as sh-lra revision 217980 to
apply the lra remat changes on trunk.

Comment 85 Kazumoto Kojima 2014-11-29 02:23:54 UTC

Created attachment 34135 [details]
patch to add -mlra option

I'd like to apply the patch to add a transitional option -mlra
like ARM and the revised patches in c#78 and c#83 on sh-lra branch
after merge from trunk.  It helps me a lot when building, comparing
and testing sh-lra compilers.

Comment 86 Oleg Endo 2014-11-30 16:13:09 UTC

(In reply to Kazumoto Kojima from comment #85)
> Created attachment 34135 [details]
> patch to add -mlra option
> 
> I'd like to apply the patch to add a transitional option -mlra
> like ARM and the revised patches in c#78 and c#83 on sh-lra branch
> after merge from trunk.  It helps me a lot when building, comparing
> and testing sh-lra compilers.

Sure, makes sense to me.  It's probably better to revert my commit r215244 to reinstate reload on the sh-lra branch.

Comment 87 Oleg Endo 2014-11-30 16:31:28 UTC

(In reply to Oleg Endo from comment #86)
> (In reply to Kazumoto Kojima from comment #85)
> > Created attachment 34135 [details]
> > patch to add -mlra option
> > 
> > I'd like to apply the patch to add a transitional option -mlra
> > like ARM and the revised patches in c#78 and c#83 on sh-lra branch
> > after merge from trunk.  It helps me a lot when building, comparing
> > and testing sh-lra compilers.
> 
> Sure, makes sense to me.  It's probably better to revert my commit r215244
> to reinstate reload on the sh-lra branch.

Sorry, your patch in attachment 34135 [details] does that already of course.

We could also  try to merge the sh-lra branch into trunk for the gcc 5 release.  If all the LRA changes are conditional and the code paths are not changed if -mlra is not specified, it should be OK to do it.

Some of the changes could probably also be applied unconditionally, such as r215241 and r215243.

Comment 88 Kazumoto Kojima 2014-12-01 00:14:11 UTC

For the record, here is the sh-lra revisions.
218191: Merge from trunk revision 218173.
218192: Add legitimize_address_displacement target macto.
218193: Split QI/HImode displacement addressing load/store via R0.
        Use std::swap in the sh-lra change.
218194: Add -mlra option.

BTW, there are new ICEs with
  internal compiler error: in split_reg, at lra-constraints.c:4909
It looks that DCmode move causes them on sh-lra.

Comment 89 Kazumoto Kojima 2014-12-02 01:16:42 UTC

Created attachment 34159 [details]
a reduced c++ test case (-O2 -std=gnu++11)

Here is related lines of lra-constraints.c:

4906      save = emit_spill_move (true, new_reg, original_reg);
4907      if (NEXT_INSN (save) != NULL_RTX)
4908        {
4909          lra_assert (! call_save_p);

where

(gdb) p call_save_p
$2 = true
(gdb) call debug_rtx(new_reg)    
(reg:DC 270)
(gdb) call debug_rtx(original_reg)
(reg:DC 220 [+8 ])

and emit_spill_move emits 3 instructions

(insn 106 0 107 (clobber (reg:DC 270)) -1
     (nil))

(insn 107 106 108 (parallel [
            (set (subreg:DF (reg:DC 270) 0)
                (subreg:DF (reg:DC 220 [+8 ]) 0))
            (use (reg:SI 154 fpscr0))
            (clobber (scratch:SI))
        ]) -1
     (nil))

(insn 108 107 0 (parallel [
            (set (subreg:DF (reg:DC 270) 8)
                (subreg:DF (reg:DC 220 [+8 ]) 8))
            (use (reg:SI 154 fpscr0))
            (clobber (scratch:SI))
        ]) -1
     (nil))

with emit_move_complex function in expr.c.

Comment 90 Oleg Endo 2014-12-20 14:57:24 UTC

For the record, the patches to enable LRA on SH have been committed.  The last revision of the committed patch series is r218892.

Comment 91 Oleg Endo 2014-12-21 12:35:55 UTC

Re: [RFC PATCH 9/9] [SH] Split QI/HImode load/store via r0

(In reply to Kazumoto Kojima from comment #83)
> Created attachment 33992 [details]
> a patch for the issue c#77
> 
> Interestingly, this reduces the total text size of CSiBE test ~0.04%
> at -O2 even for the trunk i.e. with the old reload.

I've checked this change to prepare_move_operands without LRA with trunk r218988, to see whether it should be enabled for non-LRA.

I can confirm the -1140 bytes / -0.04% on the CSiBE set.  However, as mentioned in comment #82, it results in unnecessary zero extensions before other logic/arithmetic because combine doesn't (want to) see through the R0 hardreg.  

Unnecessary sign/zero extensions are actually a separate topic (e.g. PR 53987).  If there was a good sign/zero extension elimination in place, this wouldn't be an issue here.

I've tried disabling the prepare_move_operands change and instead adding the following splitters, which are done after combine and before RA:

(define_split
  [(set (match_operand:SI 0 "arith_reg_dest")
	(sign_extend:SI (match_operand:QIHI 1 "displacement_mem_operand")))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (sign_extend:SI (match_dup 1)))
   (set (match_dup 0) (reg:SI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

(define_split
  [(set (match_operand:QIHI 0 "arith_reg_dest")
	(match_operand:QIHI 1 "displacement_mem_operand"))]
  "TARGET_SH1 && can_create_pseudo_p ()
   && !refers_to_regno_p (R0_REG, R0_REG + 1, operands[1], NULL)"
  [(set (match_dup 2) (reg:SI R0_REG))
   (set (reg:QIHI R0_REG) (match_dup 1))
   (set (match_dup 0) (reg:QIHI R0_REG))
   (set (reg:SI R0_REG) (match_dup 2))]
{
  operands[2] = gen_reg_rtx (SImode);
})

With these two splitters for mem loads I get exactly the same -1140 bytes / -0.04% on the CSiBE set.

The simple test case

int test_tst (unsigned char* x, int y, int z)
{
  return x[4] ? y : z;
}

does not contain the extu.b insn anymore, but instead we get this:
        mov.b	@(4,r4),r0
        mov	r0,r1
        tst	r1,r1        << should be: tst r0,r0
        bf	.L4
	mov	r6,r5
.L4:
	rts
	mov	r5,r0


Other cases of new unnessecary zero-extension insns are in e.g. in jpeg-6b/jdcoefct.s.

In linux-2.4.23-pre3-testplatform/arch/testplatform/kernel/signal.s some mov reg,reg insns end up as:
	extu.b	r0,r0
	mov.b	r0,@(1,r8)
	mov	r9,r0
	shlr16	r0
	extu.b	r0,r0
	mov.b	r0,@(2,r8)
	mov	r9,r0
	shlr16	r0
	shlr8	r0
	mov.b	r0,@(3,r8)
	extu.b	r1,r0
	mov.b	r0,@(4,r8)
	mov	r1,r0
        ....
those can be wallpapered with peepholes though.

I've also tried using the splitters instead of the prepare_move_operands hunk with LRA.  But then I get spill errors on QI/HImode stores with displacement addressing.

I've also tried removing the prepare_move_operands hunk with LRA.  Compared with trunk no-lra I get:
sum:  3368431 -> 3378515    +10084 / +0.299368 %

And LRA + prepare_move_operands hunk vs. trunk no-lra is:
sum:  3368431 -> 3376507    +8076 / +0.239756 %

Doing this kind of pre-RA R0 pre-allocation seems to result in better RA choices w.r.t. commutative insns such as addition.

After all, maybe it's worth trying out an SH specific R0 pre-alloc pass that offloads some of the RA choices.  Of course it will not be able to solve issues that arise when spilling code is generated that uses QI/HImode mem accesses during the RA/spilling process.

R0 is the most difficult candidate, but I've also seen reports about FPU code ICEing due FR0 spill failures when there are lots of (interdependent?) FMAC insns at -O3 (e.g. FP matrix multiplication).  Another register (class), of which there is only one on SH, would be the MACH/MACL pair.  Currently MACH/MACL are fixed hardregs.  Early experiments to allow MACH/MACL to be used by RA and adding the MAC insn showed some problems (see PR 53949 #c3).

Comment 92 Oleg Endo 2014-12-21 23:38:14 UTC

Author: olegendo
Date: Sun Dec 21 23:37:42 2014
New Revision: 218999

URL: https://gcc.gnu.org/viewcvs?rev=218999&root=gcc&view=rev
Log:
gcc/
	PR target/55212
	* config/sh/sh.md (*addsi3_compact): Add parentheses around &&
	condition.  Add comments.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.md

Comment 93 Oleg Endo 2015-01-08 11:08:17 UTC

Author: olegendo
Date: Thu Jan  8 11:07:45 2015
New Revision: 219341

URL: https://gcc.gnu.org/viewcvs?rev=219341&root=gcc&view=rev
Log:
gcc/
	PR target/55212
	* config/sh/sh.md (*addsi3_compact): Emit reg-reg copy instead of
	constant load if constant operand fits into I08.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.md

Comment 94 Oleg Endo 2015-09-11 11:31:19 UTC

Kaz, do you think we can enable LRA by default for GCC 6?

Some early results from the AMS optimization show new R0 spill failures.  For example, AMS uses @(R0,Rn) address modes more often.  Although I still would like to try this R0 prealloc pass, I don't know when I will have time for that.  Enabling LRA might be easier.

Comment 95 Kazumoto Kojima 2015-09-11 11:58:31 UTC

(In reply to Oleg Endo from comment #94)
> Kaz, do you think we can enable LRA by default for GCC 6?

I think that it's OK to make LRA default on trunk, even if it's
better with your R0 pre-allocating pass.
The last time I tested -mlra, there are some regressions, though.
I'd like to test it again.  Could you see any serious problems on
sh-elf with -mlra?

Comment 96 Oleg Endo 2015-09-11 12:02:20 UTC

(In reply to Kazumoto Kojima from comment #95)
> 
> I think that it's OK to make LRA default on trunk, even if it's
> better with your R0 pre-allocating pass.

The R0 pre-alloc pass could be a further improvement, even with LRA.

> The last time I tested -mlra, there are some regressions, though.
> I'd like to test it again.  Could you see any serious problems on
> sh-elf with -mlra?

I haven't tried yet.  I will do a full toolchain rebuild and test with -mlra enabled by default.

Comment 97 Oleg Endo 2015-09-11 15:57:03 UTC

(In reply to Oleg Endo from comment #96)
> 
> I haven't tried yet.  I will do a full toolchain rebuild and test with -mlra
> enabled by default.

I've compared the results of r227512 without LRA and r227682 with LRA.  Below are the new failures.


Running target sh-sim/-m2/-mb
FAIL: gcc.c-torture/execute/20040709-2.c   -Os  execution test
FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36

Running target sh-sim/-m2/-ml
FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36

Running target sh-sim/-m2a/-mb
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36

Running target sh-sim/-m4/-mb
FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2
AIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36

Running target sh-sim/-m4/-ml
FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36

Running target sh-sim/-m4a/-mb
FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36

Running target sh-sim/-m4a/-ml
FAIL: gcc.target/sh/hiconst.c scan-assembler-times mov\t#0 2
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-2.c scan-assembler-times @\\(8,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(16,r[0-9]\\) 44
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(4,r[0-9]\\) 36
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,gbr\\) 28
FAIL: gcc.target/sh/pr64661-3.c scan-assembler-times @\\(8,r[0-9]\\) 36

Comment 98 Kazumoto Kojima 2015-09-11 23:31:00 UTC

(In reply to Oleg Endo from comment #97)
> I've compared the results of r227512 without LRA and r227682 with LRA. 
> Below are the new failures.

A typical example of pr64661-x.c regressions is:
[LRA]
test_37_1:
	stc	gbr,r3	! 22	store_gbr	[length = 2]
	mova	1f,r0	! 6	atomic_add_fetchqi_soft_tcb	[length = 20]
	mov	#(0f-1f),r1
	.align 2
	mov.l	r0,@(128,gbr)
0:	mov.b	@(4,r3),r0
	...

[no-LRA]
test_37_1:
	mova	1f,r0	! 6	atomic_add_fetchqi_soft_tcb	[length = 20]
	mov	#(0f-1f),r1
	.align 2
	mov.l	r0,@(128,gbr)
0:	mov.b	@(4,gbr),r0
	...

where

(define_insn_and_split "atomic_fetch_<fetchop_name><mode>_soft_tcb"
  [(set (match_operand:QIHISI 0 "arith_reg_dest" "=&r")
	(match_operand:QIHISI 1 "atomic_mem_operand_1" "=SraSdd"))
   ...
   (clobber (reg:SI R0_REG))
   (clobber (reg:SI R1_REG))]
  "TARGET_ATOMIC_SOFT_TCB"
{
  return "\r	mova	1f,r0"			"\n"
	 "	.align 2"			"\n"
	 "	mov	#(0f-1f),r1"		"\n"
	 "	mov.l	r0,@(%O3,gbr)"		"\n"
	 "0:	mov.<bwl>	%1,r0"		"\n"
	 ...

The .ira dump shows

(insn 6 5 7 2 (parallel [
            (set (reg:QI 167)
                (plus:QI (mem/v:QI (plus:SI (reg:SI 144 gbr)
                            (const_int 4 [0x4])) [-1  S1 A8])
                    (const_int 1 [0x1])))
            (set (mem/v:QI (plus:SI (reg:SI 144 gbr)
                        (const_int 4 [0x4])) [-1  S1 A8])
                (unspec:QI [
                        (plus:QI (mem/v:QI (plus:SI (reg:SI 144 gbr)
                                    (const_int 4 [0x4])) [-1  S1 A8])
                            (const_int 1 [0x1]))
                    ] UNSPEC_ATOMIC))
...

which looks the code generated with the old RA.  It seems that
old RA passes gbr+4 addressing as is but LRA spills gbr with r3
according to the constraint SraSdd.  Perhaps LRA is more strict
in this regard.
I guess that these pr64661-x.c regressions are not so problem, though.
I'm not sure whether hiconst.c regression is serious or not.
The serious one would be

Running target sh-sim/-m2/-mb
FAIL: gcc.c-torture/execute/20040709-2.c   -Os  execution test

My sh4-unknown-linux-gnu test also shows another execution
errors for libstdc++-v3.

New tests that FAIL:

22_locale/money_get/get/wchar_t/14.cc execution test
22_locale/money_get/get/wchar_t/19.cc execution test
22_locale/money_get/get/wchar_t/22131.cc execution test
22_locale/money_get/get/wchar_t/38399.cc execution test
22_locale/money_get/get/wchar_t/39168.cc execution test
22_locale/money_get/get/wchar_t/6.cc execution test
22_locale/money_get/get/wchar_t/8.cc execution test
22_locale/money_get/get/wchar_t/9.cc execution test
22_locale/money_put/put/wchar_t/39168.cc execution test
22_locale/money_put/put/wchar_t/5.cc execution test
22_locale/money_put/put/wchar_t/6.cc execution test

These tests compiled with -mlra don't fail with libstdc++-v3
library compiled with -mno-lra, i.e. libstdc++-v3 is miscompiled
with -mlra.  These wrong code problems should be resolved before
witch to LRA.

Comment 99 Oleg Endo 2015-09-12 01:18:40 UTC

(In reply to Kazumoto Kojima from comment #98)
> I guess that these pr64661-x.c regressions are not so problem, though.

I agree.

> I'm not sure whether hiconst.c regression is serious or not.

I think we can ignore this for now.  It's about loading/sharing constants.  There are more constant related inefficiencies which can be addressed later.

> The serious one would be
> 
> Running target sh-sim/-m2/-mb
> FAIL: gcc.c-torture/execute/20040709-2.c   -Os  execution test

I will have a look at it.


> My sh4-unknown-linux-gnu test also shows another execution
> errors for libstdc++-v3.
> 
> New tests that FAIL:
> 
> 22_locale/money_get/get/wchar_t/14.cc execution test
> 22_locale/money_get/get/wchar_t/19.cc execution test
> 22_locale/money_get/get/wchar_t/22131.cc execution test
> 22_locale/money_get/get/wchar_t/38399.cc execution test
> 22_locale/money_get/get/wchar_t/39168.cc execution test
> 22_locale/money_get/get/wchar_t/6.cc execution test
> 22_locale/money_get/get/wchar_t/8.cc execution test
> 22_locale/money_get/get/wchar_t/9.cc execution test
> 22_locale/money_put/put/wchar_t/39168.cc execution test
> 22_locale/money_put/put/wchar_t/5.cc execution test
> 22_locale/money_put/put/wchar_t/6.cc execution test
> 
> These tests compiled with -mlra don't fail with libstdc++-v3
> library compiled with -mno-lra, i.e. libstdc++-v3 is miscompiled
> with -mlra.  These wrong code problems should be resolved before
> witch to LRA.

Hm .. those don't fail on sh-elf ... maybe something related to the atomics?  Atomics are off by default for sh-elf, but on for sh-linux.

Comment 100 Kazumoto Kojima 2015-09-12 03:23:25 UTC

(In reply to Oleg Endo from comment #99)
> Hm .. those don't fail on sh-elf ... maybe something related to the atomics?
> Atomics are off by default for sh-elf, but on for sh-linux.

Maybe.  It could be something atomic related.
For 14.cc case, replacing locale.o and cxx11-shim_facets.o with
ones from non-LRA build fixes the failure.
For example, std::locale::locale(std::locale const&) is compiled
very differently in LRA/non-LRA builds.  Strangely, the problem
can't be reproduced with recompiling these objects with -mlra in
non-LRA builds.  The above constructor is compiled to

[LRA build]
   0:	c6 2f       	mov.l	r12,@-r15
   2:	0b c7       	mova	30 <_ZNSt6localeC1ERKS_+0x30>,r0
   4:	0a dc       	mov.l	30 <_ZNSt6localeC1ERKS_+0x30>,r12	! 0 <_ZNSt6localeC1ERKS_>
   6:	0c 3c       	add	r0,r12
   8:	52 62       	mov.l	@r5,r2
   a:	0a d0       	mov.l	34 <_ZNSt6localeC1ERKS_+0x34>,r0	! 0 <_ZNSt6localeC1ERKS_>
   c:	ce 01       	mov.l	@(r0,r12),r1
   e:	18 21       	tst	r1,r1
  10:	09 8d       	bt.s	26 <_ZNSt6localeC1ERKS_+0x26>
  12:	22 24       	mov.l	r2,@r4
  14:	02 c7       	mova	20 <_ZNSt6localeC1ERKS_+0x20>,r0
  16:	f3 61       	mov	r15,r1
  18:	fa ef       	mov	#-6,r15
  1a:	22 63       	mov.l	@r2,r3
  1c:	01 73       	add	#1,r3
  1e:	32 22       	mov.l	r3,@r2
  20:	13 6f       	mov	r1,r15
  22:	0b 00       	rts	
  24:	f6 6c       	mov.l	@r15+,r12
  26:	22 61       	mov.l	@r2,r1
  28:	01 71       	add	#1,r1
  2a:	12 22       	mov.l	r1,@r2
  2c:	0b 00       	rts	
  2e:	f6 6c       	mov.l	@r15+,r12
	...
			30: R_SH_GOTPC	_GLOBAL_OFFSET_TABLE_
			34: R_SH_GOT32	__pthread_key_create
[non-LRA build]
   0:	c6 2f       	mov.l	r12,@-r15
   2:	0b c7       	mova	30 <_ZNSt6localeC1ERKS_+0x30>,r0
   4:	0a dc       	mov.l	30 <_ZNSt6localeC1ERKS_+0x30>,r12	! 0 <_ZNSt6localeC1ERKS_>
   6:	22 4f       	sts.l	pr,@-r15
   8:	0c 3c       	add	r0,r12
   a:	52 61       	mov.l	@r5,r1
   c:	09 d0       	mov.l	34 <_ZNSt6localeC1ERKS_+0x34>,r0	! 0 <_ZNSt6localeC1ERKS_>
   e:	ce 02       	mov.l	@(r0,r12),r2
  10:	28 22       	tst	r2,r2
  12:	07 8d       	bt.s	24 <_ZNSt6localeC1ERKS_+0x24>
  14:	12 24       	mov.l	r1,@r4
  16:	13 64       	mov	r1,r4
  18:	07 d1       	mov.l	38 <_ZNSt6localeC1ERKS_+0x38>,r1	! 1a
  1a:	03 01       	bsrf	r1
  1c:	01 e5       	mov	#1,r5
  1e:	26 4f       	lds.l	@r15+,pr
  20:	0b 00       	rts	
  22:	f6 6c       	mov.l	@r15+,r12
  24:	12 62       	mov.l	@r1,r2
  26:	01 72       	add	#1,r2
  28:	22 21       	mov.l	r2,@r1
  2a:	26 4f       	lds.l	@r15+,pr
  2c:	0b 00       	rts	
  2e:	f6 6c       	mov.l	@r15+,r12
	...
			30: R_SH_GOTPC	_GLOBAL_OFFSET_TABLE_
			34: R_SH_GOT32	__pthread_key_create
  38:	1a 00       	sts	macl,r0
			38: R_SH_PLT32	_ZN9__gnu_cxx12__atomic_addEPVii
	...

Perhaps the both codes are ok, though something odd happens
on atomic things anyway.

Comment 101 Kazumoto Kojima 2015-09-12 10:47:33 UTC

> Perhaps the both codes are ok, though something odd happens
> on atomic things anyway.

I've forgotten that some atomic builtins fail with spill_failure
in class 'R0_REGS' for old RA.  So libstdc++-v3 was configured
so to undefine _GLIBCXX_ATOMIC_BUILTINS for old RA and we see
the above difference.  I've tried to disable it on LRA build
and rebuild libstdc++-v3 libs.  14.exe still fails and it looks
very sensitive with code/option changes.  Weird.

Comment 102 Oleg Endo 2015-09-13 05:32:52 UTC

(In reply to Kazumoto Kojima from comment #101)
> 
> I've forgotten that some atomic builtins fail with spill_failure
> in class 'R0_REGS' for old RA.  So libstdc++-v3 was configured
> so to undefine _GLIBCXX_ATOMIC_BUILTINS for old RA and we see
> the above difference.

Ah!  I always wondered why those failures disappeared from your nightly tests.  Did you do this locally in your setup, or has this been committed at some time?

Comment 103 Kazumoto Kojima 2015-09-13 06:10:41 UTC

(In reply to Oleg Endo from comment #102)
> Ah!  I always wondered why those failures disappeared from your nightly
> tests.  Did you do this locally in your setup, or has this been committed at
> some time?

I did nothing.  Looks simply fragile like as other reload ICEs.
libstdc++-v3 conftest catches it, (un)fortunately.

Comment 104 Oleg Endo 2015-09-13 06:24:22 UTC

(In reply to Kazumoto Kojima from comment #103)
> I did nothing.  Looks simply fragile like as other reload ICEs.
> libstdc++-v3 conftest catches it, (un)fortunately.

Ah, I understand.  Thanks.

Comment 105 Oleg Endo 2015-09-13 09:11:33 UTC

Created attachment 36328 [details]
20040709-2.s without LRA (PASS)

Comment 106 Oleg Endo 2015-09-13 09:12:11 UTC

Created attachment 36329 [details]
20040709-2.s with LRA (FAIL)

Comment 107 Oleg Endo 2015-09-13 09:28:40 UTC

> > The serious one would be
> > 
> > Running target sh-sim/-m2/-mb
> > FAIL: gcc.c-torture/execute/20040709-2.c   -Os  execution test
> 

I have tried compiling that test case outside of the testsuite with -m2 -mb -Os with and without LRA.  There are lots of differences in the asm output, but the LRA version executable runs fine and doesn't abort.

Running the test inside the testsuite fails reliably.  I could reduce it to the 

testZ ();

subtest.

It passes with with
make -k check RUNTESTFLAGS="execute.exp=20040709-2.c --target_board=sh-sim\{-m2/-mb/-save-temps/-mno-lra}"

It fails with (-mlra enabled by default)
make -k check RUNTESTFLAGS="execute.exp=20040709-2.c --target_board=sh-sim\{-m2/-mb/-save-temps}"

Comment 108 Oleg Endo 2015-09-13 09:55:13 UTC

(In reply to Oleg Endo from comment #107)

The problem is the define_split at sh.md line 893 (the part at 945).  Or actually, it's sh_find_set_of_reg.  Adding this

diff --git a/gcc/config/sh/sh-protos.h b/gcc/config/sh/sh-protos.h
index 3e4211b..b4866c1 100644
--- a/gcc/config/sh/sh-protos.h
+++ b/gcc/config/sh/sh-protos.h
@@ -199,8 +199,13 @@ sh_find_set_of_reg (rtx reg, rtx_insn* insn, F stepfunc,
     {
       if (BARRIER_P (result.insn))
        break;
+
+      if (CALL_P (result.insn))
+       break;
+
       if (!NONJUMP_INSN_P (result.insn))
        continue;
+
       if (reg_set_p (reg, result.insn))
        {
          result.set_rtx = set_of (reg, result.insn);

Fixes the test case for me.  It seems I should really fix PR 67061, which looks like the same issue (and the other issue with sh_find_set_of_reg which was fixed with a modified_between_p).

Comment 109 Oleg Endo 2015-09-13 13:52:13 UTC

(In reply to Oleg Endo from comment #108)
> 
> It seems I should really fix PR 67061, which
> looks like the same issue (and the other issue with sh_find_set_of_reg which
> was fixed with a modified_between_p).

I have posted a patch for PR 67061 attachment 36331 [details]
With that patch

make -k check RUNTESTFLAGS="execute.exp=20040709-2.c --target_board=sh-sim\{-m2/-mb/-mlra}"

passes the tests.  Maybe it also fixes the std::locale issues.  Kaz, could you please have a look?

Comment 110 Kazumoto Kojima 2015-09-13 22:44:10 UTC

(In reply to Oleg Endo from comment #109)
> Maybe it also fixes the std::locale issues.  Kaz, could
> you please have a look?

It doesn't fix those failures.  I'll try to bisect that issue.

Comment 111 Kazumoto Kojima 2015-09-14 13:05:49 UTC

(In reply to Kazumoto Kojima from comment #110)

It seems that those failures are the latent wrong code problem
triggered with -mlra accidentally.  I've filed it as PR67573.

Comment 112 Oleg Endo 2015-09-21 10:02:52 UTC

I had a quick look at the gcc.target/sh/hiconst.c test case:

int rab (char *pt, int *pti)
{
  pt[2] = 0;
  pti[3] = 0;

  return 0;
}

without LRA:
	mov	#0,r0
	mov.b	r0,@(2,r4)
	rts	
	mov.l	r0,@(12,r5)


with LRA:
	mov	#0,r0
	mov.b	r0,@(2,r4)
	mov	#0,r1
	mov.l	r1,@(12,r5)
	rts	
	mov	#0,r0

Without LRA it seems some of the constant loads are CSE'ed, but the return value constant load remains and reload-something figures it out.  With LRA it seems that for the SImode store it wants to use R1, but for the QImode store it needs to use R0.  It then rematerializes the constant load.  Increasing the costs of constants:

@@ -3574,7 +3574,7 @@
 	  return true;
 	}
       if (CONST_OK_FOR_I08 (INTVAL (x)))
-        *total = 0;
+        *total = 4;
       else if ((outer_code == AND || outer_code == IOR || outer_code == XOR)
 	       && CONST_OK_FOR_K08 (INTVAL (x)))
         *total = 1;

results in:
	mov	#0,r1
	mov	r1,r0
	mov.b	r0,@(2,r4)
	mov.l	r1,@(12,r5)
	rts	
	mov	r1,r0

That would make the hiconst test pass again of course, but is not much better.  Probably a combination of constant optimization (PR 65069, PR 63390, PR 51708, PR 54682) and an R0 pre-alloc would give optimal results.

Comment 113 Oleg Endo 2015-09-21 12:11:16 UTC

Created attachment 36363 [details]
CSiBE I08 const cost = 0 vs. cost = 1, no LRA

Comment 114 Oleg Endo 2015-09-21 12:11:54 UTC

Created attachment 36364 [details]
CSiBE I08 const cost = 0 vs. cost = 1, LRA

Comment 115 Oleg Endo 2015-09-21 12:23:39 UTC

It seems that it's already enough to set the cost for I08 from 0 to 1:

Index: gcc/config/sh/sh.c
===================================================================
--- gcc/config/sh/sh.c	(revision 227958)
+++ gcc/config/sh/sh.c	(working copy)
@@ -3574,7 +3574,7 @@
 	  return true;
 	}
       if (CONST_OK_FOR_I08 (INTVAL (x)))
-        *total = 0;
+        *total = 1;
       else if ((outer_code == AND || outer_code == IOR || outer_code == XOR)
 	       && CONST_OK_FOR_K08 (INTVAL (x)))
         *total = 1;

This will reduce the rematerializations and thus increase reg live ranges in other places.

attachment 36363 [details] and attachment 36364 [details] are CSiBE comparistions with and without LRA.

Without LRA: sum:  3347407 -> 3345667    -1740 / -0.051981 %
With LRA:    sum:  3348933 -> 3347433    -1500 / -0.044790 %

Although there are quite some increases, overall it seems beneficial...

Comment 116 Oleg Endo 2015-09-27 12:24:10 UTC

(In reply to Oleg Endo from comment #91)
> 
> I can confirm the -1140 bytes / -0.04% on the CSiBE set.

Since r228176 this is no longer true.  Now the gap of no-LRA and LRA is a bit wider:

3345527 -> 3334351    -11176 / -0.334058 %

See also PR 67732.

I think now the difference is too big to enable LRA by default for GCC 6.  We should wait until the number gets better again.

Comment 117 Oleg Endo 2015-09-29 14:21:43 UTC

(In reply to Oleg Endo from comment #94)
> Kaz, do you think we can enable LRA by default for GCC 6?
> 
> Some early results from the AMS optimization show new R0 spill failures. 
> For example, AMS uses @(R0,Rn) address modes more often.  Although I still
> would like to try this R0 prealloc pass, I don't know when I will have time
> for that.  Enabling LRA might be easier.

So I've tried compiling the CSiBE set with AMS and LRA.  The AMS branch still has some problems, but without LRA CSiBE at least compiles without errors.  With LRA I get a lot of:

ll_rw_blk.i: In function 'generic_plug_device':
ll_rw_blk.i:8464:1: error: unable to find a register to spill
 }
 ^
ll_rw_blk.i:8464:1: error: this is the insn:
(insn 32 74 61 5 (set (mem/j:QI (plus:SI (reg/v/f:SI 170 [ q ])
                (reg:SI 283 [272])) [0 q_2(D)->plugged+0 S1 A32])
        (reg:QI 0 r0)) ll_rw_blk.i:8462 276 {*movqi}
     (expr_list:REG_DEAD (reg:SI 283 [272])
        (expr_list:REG_DEAD (reg:QI 0 r0)
            (nil))))
ll_rw_blk.i:8464:1: internal compiler error: in assign_by_spills, at lra-assigns.c:1431

It seems it has trouble rearranging this:

       mov.b  r0,@(Rm,Rn)

into
       mov.b  Rx,@(R0,Rn)

Comment 118 John Paul Adrian Glaubitz 2020-02-25 13:12:14 UTC

Is there anything that currently speaks against switching to LRA by default now?

I think, it would be a good idea for gcc-11 or even gcc-10 if possible. I will report all issues I'm running into since I am constantly monitoring package builds on Debian/sh4.

Comment 119 Oleg Endo 2020-02-25 13:57:15 UTC

(In reply to John Paul Adrian Glaubitz from comment #118)
> Is there anything that currently speaks against switching to LRA by default
> now?

There were a couple of code quality issues, which I would need to dig up and re-check if it's still an issue.

> 
> I think, it would be a good idea for gcc-11 or even gcc-10 if possible. I
> will report all issues I'm running into since I am constantly monitoring
> package builds on Debian/sh4.

Last time this issue came up, I asked you to re-build all debian with -mlra enabled

https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00977.html

I'm still waiting for the results.

I've been telling you to turn on LRA for specific cases, but that doesn't mean it's not gonna have regressions.  So please try using -mlra for *all* packages, including kernel and let the system boostrap.

Please report your findings in this PR.

Comment 120 John Paul Adrian Glaubitz 2020-02-25 14:04:18 UTC

(In reply to Oleg Endo from comment #119)
> Last time this issue came up, I asked you to re-build all debian with -mlra
> enabled
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00977.html
> 
> I'm still waiting for the results.
> 
> I've been telling you to turn on LRA for specific cases, but that doesn't
> mean it's not gonna have regressions.  So please try using -mlra for *all*
> packages, including kernel and let the system boostrap.

That's a huge task which is why I prefer fixing issues on the fly.

We are going to switch over anyway with GCC-12 or GCC-13, so I'm not sure what we are gaining if we continue to wait.

Comment 121 Oleg Endo 2020-02-26 01:26:02 UTC

(In reply to John Paul Adrian Glaubitz from comment #120)
> 
> That's a huge task which is why I prefer fixing issues on the fly.

I thought this is almost fully automated?

You can apply this patch to GCC to enable LRA by default for SH

--- sh.opt.orig	2019-03-04 10:09:09.244521000 +0900
+++ sh.opt	2020-02-26 10:19:55.414340269 +0900
@@ -299,5 +299,5 @@
 Enable the use of the fsrra instruction.
 
 mlra
-Target Report Var(sh_lra_flag) Init(0) Save
+Target Report Var(sh_lra_flag) Init(1) Save
 Use LRA instead of reload (transitional).

Then let it rebuild everything.


> 
> We are going to switch over anyway with GCC-12 or GCC-13, so I'm not sure
> what we are gaining if we continue to wait.

I don't get it.  You want to enable LRA by default for your distro, but you don't want to rebuild all the packages with that modification?  It's like .. shipping a distro with a new compiler, which potentially can't compile the distro packages (correctly), so instead we secretly use an older compiler to build the packages .... ?  Is that normal practice?  Sorry, sounds like a mess to me.

Comment 122 John Paul Adrian Glaubitz 2020-02-26 06:46:32 UTC

(In reply to Oleg Endo from comment #121)
> (In reply to John Paul Adrian Glaubitz from comment #120)
> > 
> > That's a huge task which is why I prefer fixing issues on the fly.
> 
> I thought this is almost fully automated?

The build process is. Fixing broken packages isn't.

> You can apply this patch to GCC to enable LRA by default for SH
> 
> --- sh.opt.orig	2019-03-04 10:09:09.244521000 +0900
> +++ sh.opt	2020-02-26 10:19:55.414340269 +0900
> @@ -299,5 +299,5 @@
>  Enable the use of the fsrra instruction.
>  
>  mlra
> -Target Report Var(sh_lra_flag) Init(0) Save
> +Target Report Var(sh_lra_flag) Init(1) Save
>  Use LRA instead of reload (transitional).
> 
> Then let it rebuild everything.

Everything is around 13.000 source packages.
 
> > We are going to switch over anyway with GCC-12 or GCC-13, so I'm not sure
> > what we are gaining if we continue to wait.
> 
> I don't get it.  You want to enable LRA by default for your distro, but you
> don't want to rebuild all the packages with that modification?  It's like ..
> shipping a distro with a new compiler, which potentially can't compile the
> distro packages (correctly), so instead we secretly use an older compiler to
> build the packages .... ?  Is that normal practice?  Sorry, sounds like a
> mess to me.

Debian/sh4 is not a release target, so it's not being shipped and there is no warranty anyway.

I also don't have the possibility to rebuild the whole archive in a separate project. There is only unstable. If am doing a complete rebuild now, I will be busy for the next weeks or months (there are new packages coming in every day) looking at issues because there might be a lot of failing packages, also for other reasons.

It's simply not practical from my perspective. I'm doing this as a hobby, not as a full time job, so I can't take care of all issues all-day long.

And, finally, the buildd capacity is limited on sh4. If this was POWER or SPARC, we would have plenty of resources for a complete rebuild.

Comment 123 John Paul Adrian Glaubitz 2020-02-26 10:31:03 UTC

(In reply to Oleg Endo from comment #121)
> (In reply to John Paul Adrian Glaubitz from comment #120)
> > 
> > That's a huge task which is why I prefer fixing issues on the fly.
> 
> I thought this is almost fully automated?
> 
> You can apply this patch to GCC to enable LRA by default for SH
> 
> --- sh.opt.orig	2019-03-04 10:09:09.244521000 +0900
> +++ sh.opt	2020-02-26 10:19:55.414340269 +0900
> @@ -299,5 +299,5 @@
>  Enable the use of the fsrra instruction.
>  
>  mlra
> -Target Report Var(sh_lra_flag) Init(0) Save
> +Target Report Var(sh_lra_flag) Init(1) Save
>  Use LRA instead of reload (transitional).

I will just do that for the default kernel now. But I won't trigger a full archive rebuild. Just report issues once I see them.

Comment 124 John Paul Adrian Glaubitz 2020-02-26 10:31:40 UTC

(In reply to John Paul Adrian Glaubitz from comment #123)
> I will just do that for the default kernel now. But I won't trigger a full
> archive rebuild. Just report issues once I see them.

Compiler, of course. Not kernel.

Comment 125 Oleg Endo 2020-02-26 14:36:01 UTC

(In reply to John Paul Adrian Glaubitz from comment #122)
> 
> The build process is. Fixing broken packages isn't.
> 
> Everything is around 13.000 source packages.
>

> And, finally, the buildd capacity is limited on sh4. If this was POWER or
> SPARC, we would have plenty of resources for a complete rebuild.

Can you at least please try to check if the kernel builds & runs OK?

As for the other packages, is there a way to let it just try to rebuild everything and get a list of the packages that failed to build.  We don't need to sit down and fix all issues one by one for now.  At least we can compare the list against the current list of non-building packages.  If the difference is not too big, we can consider turning on LRA by default from GCC 10 on.

Comment 126 John Paul Adrian Glaubitz 2024-02-22 12:57:10 UTC

I'm retesting a native build with LRA enabled by default now and report back.

If that works, I will try to build and boot a kernel.

Comment 127 John Paul Adrian Glaubitz 2024-02-22 14:12:45 UTC

(In reply to John Paul Adrian Glaubitz from comment #126)
> I'm retesting a native build with LRA enabled by default now and report back.
> 
> If that works, I will try to build and boot a kernel.

OK, that fails with:

../../../src/libgcc/libgcc2.c: In function '__addvsi3':
../../../src/libgcc/libgcc2.c:84:1: error: unable to generate reloads for:
   84 | }
      | ^
(call_insn 29 28 30 4 (parallel [
            (call (mem:SI (symbol_ref:SI ("abort") [flags 0x41] <function_decl 0x3f800580 abort>) [0 __builtin_abort S4 A32])
                (const_int 0 [0]))
            (use (reg:SI 154 fpscr0))
            (use (reg:SI 12 r12))
            (clobber (reg:SI 146 pr))
            (clobber (reg:SI 176))
        ]) "../../../src/libgcc/libgcc2.c":81:5 228 {call_pcrel}
     (expr_list:REG_DEAD (reg:SI 154 fpscr0)
        (expr_list:REG_CALL_DECL (symbol_ref:SI ("abort") [flags 0x41] <function_decl 0x3f800580 abort>)
            (expr_list:REG_NORETURN (const_int 0 [0])
                (expr_list:REG_EH_REGION (const_int 0 [0])
                    (nil)))))
    (nil))
during RTL pass: reload

Attaching the full build log which also contains the intermediate source.

Comment 128 John Paul Adrian Glaubitz 2024-02-22 14:15:09 UTC

Created attachment 57498 [details]
Build log including preprocessed source for building gcc-14 (20240221) natively with LRA