Currently FPU mode switches are done by always loading values from the global __fpscr_values array into FPSCR. On SH4A the fpchg insn should be used to do SFmode / DFmode switches. On non-SH4A it would probably be better to toggle the PR bit by other means in order to preserve the other FPSCR bits. One example use case where the FPSCR.FR bit needs to be preserved is matrix multiplication, where user code could swap the register bank using the frchg insn. Even if the user code updated the global __fpscr_values to reflect the FPSCR.FR change, it would break when there are multiple threads that want to use different FPSCR.FR/PR/SZ settings, as there is only one global __fpscr_values array. To provide some backwards compatibility with existing binaries the global __fpscr_values should be kept around. Old code would then still be able to reference them and new code would not touch them except when __set_fpscr is used. Since the FPSCR.PR/SZ settings at function entry and function exit are defined by the selected ABI there should be no interoperability problems.
Some related notes: According to the public documentation, the 'fschg' insn is only valid when FPSCR.PR = 0 on all FPU enabled cores (SH2A, SH4, SH4A). On SH4 and SH4A the 'frchg' insn is only valid when FPSCR.PR = 0. The setting FPSCR.PR = 1, FPSCR.SZ = 1 is only valid for SH4A and SH2A, but not SH4.
See also PR 29349.
*** Bug 60138 has been marked as a duplicate of this bug. ***
As mentioned in PR 60138, this issue also prevents a working implementation of fenv.h & friends on SH. The idea would be to get rid of the __fpscr_values first and set the FPSCR.PR bit with insn sequences such like.. set pr = 1: sts fpscr,r2 mov.l #(1 << 19),r1 or r1,r2 lds r2,fpscr set pr = 0: sts fpscr,r2 mov.l #~(1 << 19),r1 and r1,r2 lds r2,fpscr This would obviously result in a performance regression but would work with all SH FPUs. On SH4A this can then be improved by adding support for fpchg. Although this would require changes/extensions to the mode switching machinery, as mentioned in PR 29349. The problem is that the mode switching pass emits only mode changes to a particular mode, not from mode 'x' to mode 'y'. In PR 29349 an extension of the pre_edge_lcm function is suggested which would make the necessary information available. Here are a few more 'requirements' for SH specific mode change issues: 1) The following function: double test (const float* a, const float* b, const double* c, float x) { float aa = a[0] * b[0]; double cc = c[0] + c[1]; aa += b[1] * b[2]; cc += c[2] + c[3]; aa += b[3] * b[4]; cc += c[4] + c[5]; aa += b[5] * b[6]; return aa / cc; } compiled with -m4 -O2 (default PR mode = double) results in 4 mode switches. Rewriting it as: double test (const float* a, const float* b, const double* c, float x) { float aa = a[0] * b[0] + b[1] * b[2] + b[3] * b[4] + b[5] * b[6]; double cc = c[0] + c[1]; cc += c[2] + c[3]; cc += c[4] + c[5]; return aa / cc; } results only in 2 mode switches (as expected). FP operations which are independent should be reordered in order to minimize mode switches. This could go as far as ... 2) ... doing loop distribution, for cases such as: double test (const float* x, const double* y, unsigned int c) { float r0 = 0; double r1 = 0; while (c--) { float xx = x[0] * x[1] + x[2] + 123.0f; x += 3; double yy = y[0] + y[1]; y += 2; r0 += xx; r1 += yy; } return r0 + r1; } which currently produces a loop with 4 mode switches in it. Reordering the FP operations would bring this down to 2 mode switches in the loop. Since r0 and r1 are calculated independently, 2 loops can be used, having the mode switches outside the loops. 3) FPSCR.SZ mode changes might interfere with FPSCR.PR mode changes. For example, using fschg to flip FPSCR.SZ might require changing FPSCR.PR first (and potentially changing it back). If fpchg is not available (SH4A only), it's better to set both bits directly. In order to minimize mode switches it might be necessary to reorder instructions and doing loop distribution while looking at PR and SZ bits simultaneously. 4) rounding mode settings also mean FPSCR mode changes. http://gcc.gnu.org/ml/gcc-patches/2014-02/msg01378.html 5) in some cases preserving FPSCR bits across mode changes is not required (if I'm not mistaken): double func (float a, float b, double c, double d) { #pragma STDC FENV_ACCESS ON // function entry, PR = double // mode switch PR = single float ab = a + b; // mode switch PR = double double x = ab + c + d; // read back FP status bits and do something with it return x; // function exit, PR = double } In this case the mode switch double -> float -> double can be done more efficiently by pushing the PR = double FPSCR state onto the stack, switch to PR = single and then switch back to PR = double by popping FPSCR from the stack. However, this must not happen if other FPSCR settings are changed after the first switch to PR = single, such as invoking a fenv modifying standard function or changing the FPSCR.FR bit on SH4.
(In reply to Oleg Endo from comment #4) > 5) > in some cases preserving FPSCR bits across mode changes is not required (if > I'm not mistaken): > > double func (float a, float b, double c, double d) > { > #pragma STDC FENV_ACCESS ON > > // function entry, PR = double > > // mode switch PR = single > float ab = a + b; > > // mode switch PR = double > double x = ab + c + d; > > // read back FP status bits and do something with it > return x; > > // function exit, PR = double > } > > In this case the mode switch double -> float -> double can be done more > efficiently by pushing the PR = double FPSCR state onto the stack, switch to > PR = single and then switch back to PR = double by popping FPSCR from the > stack. > > However, this must not happen if other FPSCR settings are changed after the > first switch to PR = single, such as invoking a fenv modifying standard > function or changing the FPSCR.FR bit on SH4. If it's OK for a temporary mode switch to clobber other FPSCR bits (such as in the PR = single mode above), it should also be OK to load the FPSCR value from a thread local variable inside the thread-control-block: mov.l @(<disp>, gbr),r0 // r0 = address to fpscr value for a // particular mode setting lds.l @r0+,fpscr // mode switch This would require that any FPSCR setting change is also propagated to the TLS variables. E.g. setting the rounding mode would have to update FPSCR mode values in all such TLS variables. I guess that it would be useful to be able to select an FPSCR value for at least all combinations of FR and SZ bit settings, in other words having a TLS __fpscr_values array with 4 entries. However, it would make things such as toggling FPSCR.FR via frchg inefficient due to the required updates of the TLS variables. Other setting changes such as denorm or rounding mode are probably not so critical.
On Sun, Mar 16, 2014 at 11:32:21PM +0000, olegendo at gcc dot gnu.org wrote: > If it's OK for a temporary mode switch to clobber other FPSCR bits (such as in > the PR = single mode above), it should also be OK to load the FPSCR value from > a thread local variable inside the thread-control-block: > > mov.l @(<disp>, gbr),r0 // r0 = address to fpscr value for a > // particular mode setting > lds.l @r0+,fpscr // mode switch IMO this is an ugly hack that shouldn't be taken. It has lots of complex interactions with other things: signal handlers, the ucontext.h functions, fenv, pthread, etc. that could probably be achieved correctly if somebody wanted to spend the effort on it, but it would be ugly and SH-specific and honestly we already have a shortage of people willing to spend time fixing SH problems without introducing even more work. > This would require that any FPSCR setting change is also propagated to the TLS > variables. E.g. setting the rounding mode would have to update FPSCR mode > values in all such TLS variables. > I guess that it would be useful to be able to select an FPSCR value for at > least all combinations of FR and SZ bit settings, in other words having a TLS > __fpscr_values array with 4 entries. However, it would make things such as > toggling FPSCR.FR via frchg inefficient due to the required updates of the TLS > variables. Other setting changes such as denorm or rounding mode are probably > not so critical. If I'm not mistaken, the toggle approach will be efficient without this TLS hack once it's implemented, right? I don't think it makes sense to introduce a hack for just a short-term mitigation of the performance regression. If you think the short-term fix for this issue is too costly, the proper solution is probably to add a -m option to turn it back off (using the old __fpscr_values approach).
(In reply to Rich Felker from comment #6) > On Sun, Mar 16, 2014 at 11:32:21PM +0000, olegendo at gcc dot gnu.org wrote: > > If it's OK for a temporary mode switch to clobber other FPSCR bits (such as in > > the PR = single mode above), it should also be OK to load the FPSCR value from > > a thread local variable inside the thread-control-block: > > > > mov.l @(<disp>, gbr),r0 // r0 = address to fpscr value for a > > // particular mode setting > > lds.l @r0+,fpscr // mode switch > > IMO this is an ugly hack that shouldn't be taken. It has lots of > complex interactions with other things: signal handlers, the > ucontext.h functions, fenv, pthread, etc. Could you please elaborate on the problems you see there? > that could probably be > achieved correctly if somebody wanted to spend the effort on it, but > it would be ugly and SH-specific and honestly we already have a > shortage of people willing to spend time fixing SH problems without > introducing even more work. I failed to mention, that the idea with TLS variables was meant to be a separate option. If implemented in the compiler it would need to be turned on explicitly, similar to the option '-matomic-model=soft-tcb,gbr-offset=<your offset here>'. Thus, if a libc/kernel/whatever doesn't want to make use of it, it won't have to. > > > This would require that any FPSCR setting change is also propagated to > > the TLS variables. E.g. setting the rounding mode would have to update > > FPSCR mode values in all such TLS variables. > > I guess that it would be useful to be able to select an FPSCR value for at > > least all combinations of FR and SZ bit settings, in other words having > > a TLS __fpscr_values array with 4 entries. However, it would make things > > such as toggling FPSCR.FR via frchg inefficient due to the required updates > > of the TLS variables. Other setting changes such as denorm or rounding > > mode are probably not so critical. > > If I'm not mistaken, the toggle approach will be efficient without > this TLS hack once it's implemented, right? No. The toggle approach is only efficient on SH4A. Other SH FPUs don't implement the fpchg instruction and require sts-lds fpscr sequences, regardless of toggling or explicitly setting the PR mode. > I don't think it makes > sense to introduce a hack for just a short-term mitigation of the > performance regression. If you think the short-term fix for this issue > is too costly, the proper solution is probably to add a -m option to > turn it back off (using the old __fpscr_values approach). This was not meant to be a short-term mitigation. In fact figuring out whether FPSCR bits can be clobbered during a PR mode switch or not is already not so simple. If at all, it would be an additional optimization opportunity after everything else is in place. Keeping the 'global __fpscr_values' approach as an option could be useful for thread model = single, or for bare-metal applications where the rounding mode etc is never changed from the default and FP status bits are never read back by application code.
Hi Oleg, I posted a patch in 2006 (http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01562.html) to support mode flipping fpchg instruction for the sh4a. it is in used in our current (4.8) production compiler without issue since then. I didn't insist a lot pushing it since too sh4a specific, but I'd be happy to port it to the 4.10 branch and repost if there is enough interest. Cheers a code like float foo(float a) { return a+2; } compiles into: mova .L2,r0 fmov.s @r0+,fr0 fpchg fadd fr5,fr0 rts fpchg instead of: mov.l .L2,r1 mova .L3,r0 fmov.s @r0+,fr0 lds.l @r1+,fpscr add #-4,r1 add #4,r1 fadd fr5,fr0 rts lds.l @r1+,fpscr .L4: .align 2 .L2: .long ___fpscr_values .L3: .long 1073741824
Although it seems that (1)-(5) in #3 are interesting points, they are almost optimizations. I guess that programs with frequent FP mode switchings are simply rare in real world and would be a bit special or even pathological in the first place. I like the idea of mode switching without __fpscr_values even if it requires more instructions on SH4. Now SH4 is a fairy old core and is not for heavy FP computations anyway. I think that it won't impact the real working set. It looks to me that Christian's patch is the way to go.
(In reply to Kazumoto Kojima from comment #9) > Although it seems that (1)-(5) in #3 are interesting points, they > are almost optimizations. Yep. Those are not necessary to get the functionality (of not using __fpscr_values). > I guess that programs with frequent FP > mode switchings are simply rare in real world and would be a bit > special or even pathological in the first place. > I like the idea of mode switching without __fpscr_values even if it > requires more instructions on SH4. Now SH4 is a fairy old core and > is not for heavy FP computations anyway. I think that it won't impact > the real working set. > It looks to me that Christian's patch is the way to go. Yep. However, when the patch was proposed there were some objections regarding the modifications in lcm.c (if I'm not mistaken). We could try again. The reason why I brought up (1)-(5) in #3 was that if one of them is eventually implemented (e.g. reorder calculation insns) the changes to lcm.c might not be required and could be avoided. Depending on the implementation of such optimization, the mode switch information might have to be obtained/maintained in a different way. However, I at the moment I have no concrete ideas/plans. If Christian's patch is accepted, that's cool, too.
(In reply to Oleg Endo from comment #10) > (In reply to Kazumoto Kojima from comment #9) > > Although it seems that (1)-(5) in #3 are interesting points, they > > are almost optimizations. > > Yep. Those are not necessary to get the functionality (of not using > __fpscr_values). > > > I guess that programs with frequent FP > > mode switchings are simply rare in real world and would be a bit > > special or even pathological in the first place. > > I like the idea of mode switching without __fpscr_values even if it > > requires more instructions on SH4. Now SH4 is a fairy old core and > > is not for heavy FP computations anyway. I think that it won't impact > > the real working set. > > It looks to me that Christian's patch is the way to go. > > Yep. However, when the patch was proposed there were some objections > regarding the modifications in lcm.c (if I'm not mistaken). We could try > again. there were neither followup nor objections to the last version. I'll post again, the time to cross-check with epiphany or x96. > > The reason why I brought up (1)-(5) in #3 was that if one of them is > eventually implemented (e.g. reorder calculation insns) the changes to lcm.c > might not be required and could be avoided. Depending on the implementation > of such optimization, the mode switch information might have to be > obtained/maintained in a different way. However, I at the moment I have no > concrete ideas/plans. If Christian's patch is accepted, that's cool, too. Implementing in LCM can also be a base to expose the architectural mode flipping extension with other ports (MMX was suggested I think)
(In reply to chrbr from comment #11) > > there were neither followup nor objections to the last version. I'll post > again, the time to cross-check with epiphany or x96. > Epiphany has some additional mode switching passes. I haven't checked the details, but maybe it could be of some use on SH, too.
for the record, last patch here: http://gcc.gnu.org/ml/gcc-patches/2014-05/msg00738.html
Created attachment 33690 [details] a possible patch This is a simple patch that does sts-lds fpscr mode switching (not fully tested). With the patch applied, the following example float test (float x, float y) { return x + y; } compiled with -O2 -m4 (default fpu mode = double): sts fpscr,r1 ! 20 fpu_switch/8 mov.l .L2,r2 ! 22 movsi_ie/1 xor r2,r1 ! 23 *xorsi3_compact/2 lds r1,fpscr ! 25 fpu_switch/5 xor r2,r1 ! 29 *xorsi3_compact/2 fmov fr5,fr0 ! 34 movsf_ie/1 fadd fr4,fr0 ! 8 addsf3_i rts ! 37 *return_i lds r1,fpscr ! 31 fpu_switch/5 .L3: .align 2 .L2: .long 524288 The switch is done by 3 (+2 artificial) individual instructions (load - modify - store). In this case the RA / optimizers figure out that there's no need to store fpscr twice and reorder the operations. This is because all the fp insn patterns in the machine description only "use" the fpscr, but actually they also modify it. This means that the fenv is reset after the 'fadd', i.e. it potentially clears exception flags etc. I think this is wrong. It also seems impossible to get the fpscr value immediately after the fp insn, as it always gets reordered in some way. As far as I understand, all the fp insns that update bits in fpscr should actually do so (clobber it or set it in someway) and a builtin "get_fpscr" is required so that optimizers see the dependencies on fpscr.
Created attachment 33691 [details] a possible patch The previous patch was buggy, it always generated a PR toggle sequence, even if a mode should be set directly.
I've tried a modified example from PR 5360, using floats instead of doubles: void loop_p (int np, int non0, float coeff[][2048], float tmp1) { int j, k; for (j = non0; j < np; j++) for (k = 0; k < j; k++) coeff[j][j] -= tmp1 * coeff[j][k]; } with -O2 -m4a (double mode default) and the patch from comment #15 applied: (loop setup code omitted) ... .L6: cmp/pl r5 ! outer loop, set to single bf/s .L7 sts fpscr,r7 mov.l .L16,r4 mov r0,r2 fmov.s @r3,fr1 mov r5,r1 and r4,r7 lds r7,fpscr .align 2 .L5: fmov.s @r2+,fr0 ! inner loop, no switch dt r1 fneg fr0 fmac fr0,fr5,fr1 bf/s .L5 fmov.s fr1,@r3 .L7: dt r6 add #1,r5 add r9,r0 bf/s .L6 add r8,r3 sts fpscr,r1 ! function return, set to double mov.l .L17,r2 mov.l @r15+,r9 or r2,r1 mov.l @r15+,r8 rts lds r1,fpscr Obviously, if the inner loop count is small the mode set in the outer loop will dominate. Something seems to be missing in the mode-switch optimization. The mode switch should be just hoisted above all loops, which then can use the fpchg insn on SH4A.
(In reply to Oleg Endo from comment #14) > > The switch is done by 3 (+2 artificial) individual instructions (load - > modify - store). In this case the RA / optimizers figure out that there's > no need to store fpscr twice and reorder the operations. This is because > all the fp insn patterns in the machine description only "use" the fpscr, > but actually they also modify it. This means that the fenv is reset after > the 'fadd', i.e. it potentially clears exception flags etc. > > I think this is wrong. It also seems impossible to get the fpscr value > immediately after the fp insn, as it always gets reordered in some way. As > far as I understand, all the fp insns that update bits in fpscr should > actually do so (clobber it or set it in someway) and a builtin "get_fpscr" > is required so that optimizers see the dependencies on fpscr. In the 'addsf3_i' pattern, I've tried replacing the (use (match_operand:PSI 3 "fpscr_operand" "c")) with (set (match_operand:PSI 3 "fpscr_operand" "=&c") (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))] and after that the asm output looks OK: sts fpscr,r1 mov.l .L2,r2 xor r2,r1 lds r1,fpscr fmov fr5,fr0 fadd fr4,fr0 sts fpscr,r1 xor r2,r1 rts lds r1,fpscr I haven't checked all the other side effects it could have, but at least the FMA combine patterns still seem work after that change.
(In reply to Oleg Endo from comment #15) > Created attachment 33691 [details] > a possible patch > > The previous patch was buggy, it always generated a PR toggle sequence, even > if a mode should be set directly. I've tested the patch, there are some new failures: -m4 -mb: FAIL: g++.dg/torture/type-generic-1.C -O0 execution test FAIL: gcc.c-torture/execute/builtins/complex-1.c execution, -O0 FAIL: gcc.c-torture/execute/builtins/complex-1.c execution, -Og -g FAIL: gcc.c-torture/execute/complex-6.c -O0 execution test FAIL: gcc.c-torture/execute/gofast.c -O0 execution test FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution, -O0 FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution, -O1 FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution, -Og -g FAIL: gcc.c-torture/execute/ieee/20030331-1.c execution, -O0 FAIL: gcc.c-torture/execute/ieee/copysign1.c execution, -O0 FAIL: gcc.c-torture/execute/ieee/mzero3.c execution, -O0 FAIL: gcc.dg/torture/type-generic-1.c -O0 execution test -m4 -ml: FAIL: g++.dg/torture/type-generic-1.C -O0 execution test FAIL: g++.dg/torture/type-generic-1.C -O1 execution test FAIL: g++.dg/torture/type-generic-1.C -O2 execution test FAIL: g++.dg/torture/type-generic-1.C -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test FAIL: g++.dg/torture/type-generic-1.C -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects execution test FAIL: g++.dg/torture/type-generic-1.C -O3 -fomit-frame-pointer execution test FAIL: g++.dg/torture/type-generic-1.C -O3 -g execution test FAIL: g++.dg/torture/type-generic-1.C -Os execution test FAIL: gcc.c-torture/execute/builtins/complex-1.c execution, -O0 FAIL: gcc.c-torture/execute/builtins/complex-1.c execution, -Og -g FAIL: gcc.c-torture/execute/complex-6.c -O0 execution test FAIL: gcc.c-torture/execute/conversion.c -O0 execution test FAIL: gcc.c-torture/execute/gofast.c -O0 execution test FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution, -O0 FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution, -O1 FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution, -Og -g FAIL: gcc.c-torture/execute/ieee/20030331-1.c execution, -O0 FAIL: gcc.c-torture/execute/ieee/copysign1.c execution, -O0 FAIL: gcc.c-torture/execute/ieee/mzero3.c execution, -O0 FAIL: gcc.dg/pr28796-2.c execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O0 execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O1 execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O2 execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O3 -fomit-frame-pointer execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O3 -g execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -Os execution test FAIL: gcc.dg/torture/type-generic-1.c -O0 execution test FAIL: gcc.dg/torture/type-generic-1.c -O1 execution test FAIL: gcc.dg/torture/type-generic-1.c -O2 execution test FAIL: gcc.dg/torture/type-generic-1.c -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test FAIL: gcc.dg/torture/type-generic-1.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects execution test FAIL: gcc.dg/torture/type-generic-1.c -O3 -fomit-frame-pointer execution test FAIL: gcc.dg/torture/type-generic-1.c -O3 -g execution test FAIL: gcc.dg/torture/type-generic-1.c -Os execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O0 execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O1 execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O2 execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O3 -fomit-frame-pointer execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O3 -fomit-frame-pointer -funroll-loops execution test FAIL: gcc.dg/torture/vec-cvt-1.c -O3 -g execution test FAIL: gcc.dg/torture/vec-cvt-1.c -Os execution test
(In reply to Oleg Endo from comment #18) > (In reply to Oleg Endo from comment #15) > > Created attachment 33691 [details] > > a possible patch > > > > The previous patch was buggy, it always generated a PR toggle sequence, even > > if a mode should be set directly. > > I've tested the patch, there are some new failures: > ... No I didn't. That was a patch for PR 63260. Sorry for the noise.
(In reply to Oleg Endo from comment #19) > > No I didn't. That was a patch for PR 63260. Sorry for the noise. Now I have. For both '-m4 -ml' and '-m4 -mb' there are a few new failures: FAIL: gcc.dg/attr-isr.c scan-assembler-times [^f]r[0-9][ \t]*, 8 (SH specific test 'gcc.dg/attr-isr.c' should be moved to 'gcc.target/sh') FAIL: gcc.target/sh/attr-isr-nosave_low_regs.c scan-assembler-not [^f]r1[,0-3] FAIL: gcc.target/sh/attr-isr-nosave_low_regs.c scan-assembler-not [^f]r[0-9][ \t]*, FAIL: gcc.target/sh/attr-isr-trapa.c scan-assembler-not r1[,0-3] FAIL: gcc.target/sh/pr54680.c scan-assembler-times lds\t 6 FAIL: gcc.target/sh/pragma-isr-nosave_low_regs.c scan-assembler-not [^f]r1[,0-3] FAIL: gcc.target/sh/pragma-isr-nosave_low_regs.c scan-assembler-not [^f]r[0-9][ \t]*, FAIL: gcc.target/sh/pragma-isr-trapa.c scan-assembler-not r1[,0-3] FAIL: gcc.target/sh/pragma-isr-trapa2.c scan-assembler-not r1[,0-3] FAIL: gcc.target/sh/pragma-isr-trapa2.c scan-assembler-times [^_]fpscr 3 FAIL: gcc.target/sh/pragma-isr-trapa2.c scan-assembler-times r[0-7]\n 3 The failures seem to be related to ISR-like function handling or the tests need adjustments because they assume that fpscr is accessed in a particular way, which is now about to change. I'd like to ignore ISR function problems for now, and fix it as part of PR 57339.
(In reply to Oleg Endo from comment #17) > > In the 'addsf3_i' pattern, I've tried replacing the > > (use (match_operand:PSI 3 "fpscr_operand" "c")) > > with > > (set (match_operand:PSI 3 "fpscr_operand" "=&c") > (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))] > > > I haven't checked all the other side effects it could have, but at least the > FMA combine patterns still seem work after that change. With those changes above applied to all the other FP insns, the FMA tests in gcc.target/sh fail. One problem is that at -O1, FMA patterns won't be expanded initially, but are rather formed during combine. With the more complex FPSCA set/use combine won't match the patterns at -O1 but only at -O2. This is regression is probably acceptable. The other issue is that the fix for PR 56547 doesn't work anymore, because for some reason combine can't figure out what to do with the 1.0 constant. Probably because now it'd need to deal with multiple-set insns, which it doesn't do well. Adding a combine bridge pattern doesn't make it go further, as it won't try to combine 'fpreg = fpreg + 1.0' and 'fpreg = fpreg * fpreg'. In this case it's probably better to do a manual fma combine in the split1 pass. This is more effort, but would also fix the first problem.
(In reply to Oleg Endo from comment #21) > (In reply to Oleg Endo from comment #17) > > > > In the 'addsf3_i' pattern, I've tried replacing the > > > > (use (match_operand:PSI 3 "fpscr_operand" "c")) > > > > with > > > > (set (match_operand:PSI 3 "fpscr_operand" "=&c") > > (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))] > > > > > > I haven't checked all the other side effects it could have, but at least the > > FMA combine patterns still seem work after that change. > > With those changes above applied to all the other FP insns, the FMA tests in > gcc.target/sh fail. As well as some of the FSCA tests. E.g. 'return sinf (x) + cosf (x)' would expand into two sincos patterns and normally combine would fold them into one. However, it doesn't understand the unspec:PSI stuff and tries to nest it such as: Failed to match this instruction: (parallel [ (set (reg:SI 174) (fix:SI (reg:SF 167))) (set (reg/v:PSI 151 ) (unspec:PSI [ (unspec:PSI [ (reg/v:PSI 151 ) ] UNSPEC_FPSCR) ] UNSPEC_FPSCR)) ]) I've tried adding a predicate that recursively matches those nested unspec:PSI things, which makes combine happy, but then reload would bail out for some reason. Moreover, fp insns that do this (set (match_operand:PSI 3 "fpscr_operand" "=&c") (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))] won't get eliminated even if their results are unused/dead. I'm trying to come up with some alternative approach.
Created attachment 33716 [details] Using virtual FPSCR registers to model insn dependencies This is a somewhat larger patch which does a couple of things to address the deficits of the previous approach. 1) The hard reg set is extended by two more registers: - FPSCR_MODES_REG represents any mode bits in FPSCR, which are used by fp insns. - FPSCR_STAT_REG represents the status bits in FPSCR, which are written by fp insns. All fp insns that previously had a (use (match_operand:PSI 2 "fpscr_operand" "") now get a (use (reg:SI FPSCR_MODES_REG)) 2) The 'fpu_switch' insn, which is just a FPSCR load/store, now uses and sets the virtual regs FPSCR_MODES_REG and FPSCR_STAT_REG. This creates a sort of reordering barrier due to the reg uses/sets. 3) Two new dummy insns extend_psi_si and truncate_si_psi are added to convert PSIMode <-> SImode, because we can do only logic on SImode, but FPSCR is described as PSImode. This hack could be removed later maybe. 4) The insns toggle_sz and toggle_pr are now also setting the virtual reg FPSCR_MODES_REG. 5) The fsca insn needed some adjustments for the operand matching, as combine started going different paths. Some of the fsca tests in gcc.target/sh were failing and are fixed by this. 6) sh_emit_set_t_insn is adjusted to emit the correct comparison insns with use/set of virtual FPSCR regs. 7) calc_live_regs is adjusted to ignore virtual regs FPSCR_MODES_REG and FPSCR_STAT_REG, or else it will try to generate push/pop insns for those regs. 8) sh_emit_mode_set now emits the FPSCR store-modify-load insn sequence instead of loading FPSCR from __fpscr_values. 9) Some redundant fp insns and related functions are deleted. I've tried to keep the patch as short as possible and not do too many unrelated changes. I haven't fully tested the patch, so there are probably a couple of fallouts. There are two regressions with this patch. The FPSCR loads are not put into delay slots anymore. Probably because the 'fpu_switch' pattern now has too many sets and the DBR rejects that as a slot candidate. The other thing are the failures in gcc.target/sh w.r.t. interrupt functions, which can be addressed later. Kaz, could you please have an early look at it?
Created attachment 33717 [details] Using virtual FPSCR registers to model insn dependencies sh_emit_set_t_insn was producing nested parallel patterns. When compiling libgcc (for -m4 -mb) memory consumption goes up a lot. Either there is a bug in the patch, or it's triggering a bug somewhere else which seems to cause an infinite alloc cycle somewhere.
(In reply to Oleg Endo from comment #23) > Kaz, could you please have an early look at it? The idea looks OK to me. Build fails on sh4-linux with the patch, though. Maybe a wrong version? There is a typo divdf3 expand. s/gen_divdf3 /gen_divdf3_i / - expand_df_binop (&gen_divdf3_i, operands); + emit_insn (gen_divdf3 (operands[0], operands[1], operands[2])); With fixing it, yet segfaults during compiling libgcc2.c __powidf2 at #0 0x089706b4 in sh_adjust_cost (insn=0xb7f767e0, link=0xb7f79380, dep_insn=0xb7f762d0, cost=25) at trunk/gcc/config/sh/sh.c:10908 10908 SET_SRC (use_pat))) (gdb) l 10903 cycle earlier. */ 10904 else if (reload_completed 10905 && get_attr_dfp_comp (dep_insn) == DFP_COMP_YES 10906 && (use_pat = single_set (insn)) 10907 && ! regno_use_in (REGNO (SET_DEST (single_set (dep_insn))), 10908 SET_SRC (use_pat))) It looks - (use (match_operand:PSI 3 "fpscr_operand" "c"))] + (set (reg:SI FPSCR_STAT_REG) (const_int 0)) + (use (reg:SI FPSCR_MODES_REG))] breaks the assumption of sh_adjust_cost which divdf3_i has only one set insn.
(In reply to Kazumoto Kojima from comment #25) > (In reply to Oleg Endo from comment #23) > > Kaz, could you please have an early look at it? > > The idea looks OK to me. Build fails on sh4-linux with the patch, though. > Maybe a wrong version? > There is a typo divdf3 expand. s/gen_divdf3 /gen_divdf3_i / > > - expand_df_binop (&gen_divdf3_i, operands); > + emit_insn (gen_divdf3 (operands[0], operands[1], operands[2])); > > With fixing it, yet segfaults during compiling libgcc2.c __powidf2 at No, that was the correct version unfortunately. And that was my infinite-memory problem. Thanks for spotting it. > > #0 0x089706b4 in sh_adjust_cost (insn=0xb7f767e0, link=0xb7f79380, > dep_insn=0xb7f762d0, cost=25) at trunk/gcc/config/sh/sh.c:10908 > 10908 SET_SRC (use_pat))) > (gdb) l > 10903 cycle earlier. */ > 10904 else if (reload_completed > 10905 && get_attr_dfp_comp (dep_insn) == DFP_COMP_YES > 10906 && (use_pat = single_set (insn)) > 10907 && ! regno_use_in (REGNO (SET_DEST (single_set > (dep_insn))), > 10908 SET_SRC (use_pat))) > > It looks > > - (use (match_operand:PSI 3 "fpscr_operand" "c"))] > + (set (reg:SI FPSCR_STAT_REG) (const_int 0)) > + (use (reg:SI FPSCR_MODES_REG))] > > breaks the assumption of sh_adjust_cost which divdf3_i has only one > set insn. Yep, single_set now returns null for basically all fp insns and it's not checked there. The default implementation of single_set is also the reason why delay slot stuffing is not working on fp insns anymore.
Created attachment 33721 [details] Using virtual FPSCR registers to model insn dependencies Updated patch that avoids the single_set problems by using (clobber (reg:SI FPSCR_STAT_REG)) instead of a set. This also eliminates the fsca pattern changes in the previous patch. Since the 'fpu_switch' insn is still a multiple set insn, it won't be used for delay slot stuffing, but this is a minor issue that can be addressed later. I'm testing the patch now on sh-sim. At least 'make all' works.
Created attachment 33727 [details] Using virtual FPSCR registers to model insn dependencies (In reply to Oleg Endo from comment #27) > Created attachment 33721 [details] > Using virtual FPSCR registers to model insn dependencies > > Updated patch that avoids the single_set problems by using (clobber (reg:SI > FPSCR_STAT_REG)) instead of a set. This also eliminates the fsca pattern > changes in the previous patch. Since the 'fpu_switch' insn is still a > multiple set insn, it won't be used for delay slot stuffing, but this is a > minor issue that can be addressed later. I'm testing the patch now on > sh-sim. At least 'make all' works. Testing for '-m4 -ml' and '-m4 -mb' shows one new failure (ignoring the ISR failures): FAIL: gcc.c-torture/execute/pr28982a.c -O1 (internal compiler error) FAIL: gcc.c-torture/execute/pr28982a.c -O1 (test for excess errors) The problem is the define_split and the peephole2 patterns below the "fpu_switch" insn. I don't know how/if that was working before. I've removed the peephole2 pattern and rewrote the split pattern, which fixes the failure above. I'll re-test the whole thing again.
(In reply to Oleg Endo from comment #28) > Created attachment 33727 [details] > Using virtual FPSCR registers to model insn dependencies > > The problem is the define_split and the peephole2 patterns below the > "fpu_switch" insn. I don't know how/if that was working before. I've > removed the peephole2 pattern and rewrote the split pattern, which fixes the > failure above. I'll re-test the whole thing again. With this patch there are no new failures for -m4 -ml and -m4 -mb here, except the ISR failures. I'll also test it for the other combinations later. I'd like to get the current sh4 working version first. Then fix other niche problems with ISRs or SH2E. Kaz, what do you think?
(In reply to Oleg Endo from comment #29) > With this patch there are no new failures for -m4 -ml and -m4 -mb here, > except the ISR failures. I'll also test it for the other combinations > later. I'd like to get the current sh4 working version first. Then fix > other niche problems with ISRs or SH2E. Kaz, what do you think? Sounds reasonable. Please go ahead.
Author: olegendo Date: Thu Oct 16 10:58:36 2014 New Revision: 216307 URL: https://gcc.gnu.org/viewcvs?rev=216307&root=gcc&view=rev Log: gcc/ PR target/53513 * config/sh/sh-protos.h (emit_sf_insn, emit_df_insn, expand_sf_unop, expand_sf_binop, expand_df_unop, expand_df_binop): Remove. * config/sh/sh.c (sh_emit_set_t_insn): Adjust generated insn pattern to match fp insn patterns. (calc_live_regs): Add FPSCR_MODES_REG and FPSCR_STAT_REG to the ignore list. (emit_sf_insn, emit_df_insn, expand_sf_unop, expand_sf_binop, expand_df_unop, expand_df_binop): Remove. (sh_conditional_register_usage): Mark FPSCR_MODES_REG and FPSCR_STAT_REG as not call clobbered. (sh_emit_mode_set): Emit fpscr store-modify-load sequence instead of invoking fpscr_set_from_mem. * config/sh/sh.h (MAX_REGISTER_NAME_LENGTH): Increase to 6. (SH_REGISTER_NAMES_INITIALIZER): Add names for FPSCR_MODES_REG and FPSCR_STAT_REG. (REGISTER_NAMES): Adjust. (SPECIAL_REGISTER_P): Add FPSCR_MODES_REG and FPSCR_STAT_REG. (FIRST_PSEUDO_REGISTER): Increase to 156. (DWARF_FRAME_REGISTERS): Define as 153 to keep the original value. (FIXED_REGISTERS, CALL_USED_REGISTERS): Add FPSCR_MODES_REG and FPSCR_STAT_REG. (REG_CLASS_CONTENTS): Adjust ALL_REGS bit mask to include FPSCR_MODES_REG and FPSCR_STAT_REG. (REG_ALLOC_ORDER): Add FPSCR_MODES_REG and FPSCR_STAT_REG. * config/sh/sh.md (FPSCR_MODES_REG, FPSCR_STAT_REG, FPSCR_PR, FPSCR_SZ): Add new constants. (UNSPECV_FPSCR_MODES, UNSPECV_FPSCR_STAT): Add new unspecv constants. (movpsi): Use TARGET_FPU_ANY condition, invoke gen_fpu_switch. (fpu_switch): Add use and set of FPSCR_STAT_REG and FPSCR_MODES_REG. Use TARGET_FPU_ANY condition. (fpu_switch peephole2): Remove. (fpu_switch split): Use simple_mem_operand to capture the mem and adjust split implementation. (extend_psi_si, truncate_si_psi): New insns. (toggle_sz, toggle_pr): Use FPSCR_SZ, FPSCR_PR constants. Add set of FPSCR_MODES_REG. (push_e, push_4, pop_e, pop_4, movdf_i4, reload_indf__frn, movsf_ie, reload_insf__frn, force_mode_for_call, calli, calli_tbr_rel, calli_pcrel, call_pcrel, call_compact, call_compact_rettramp, call_valuei, call_valuei_tbr_rel, call_valuei_pcrel, call_value_pcrel, call_value_compact, call_value_compact_rettramp, call, call_pop_compact, call_pop_compact_rettramp, call_value, sibcalli, sibcalli_pcrel, sibcalli_thunk, sibcall_pcrel, sibcall_compact, sibcall, sibcall_valuei, sibcall_valuei_pcrel, sibcall_value_pcrel, sibcall_value_compact, sibcall_value, call_value_pop_compact, call_value_pop_compact_rettramp, various unnamed splits): Replace use of FPSCR_REG with use of FPSCR_MODES_REG. Adjust gen_* function uses. (floatsisf2_i4, *floatsisf2_ie): Merge into floatsisf2_i4. (fix_truncsfsi2_i4, *fixsfsi): Merge into fix_truncsfsi2_i4. (cmpgtsf_t, cmpgtsf_t_i4): Merge into cmpgtsf_t. (cmpeqsf_t, cmpeqsf_t_i4): Merge into cmpeqsf_t. (ieee_ccmpeqsf_t, *ieee_ccmpeqsf_t_4): Merge into ieee_ccmpeqsf_t. (udivsi3_i4, divsi3_i4, addsf3_i, subsf3_i, mulsf3_i, fmasf4_i, *fmasf4, divsf3_i, floatsisf2_i4, fix_truncsfsi2_i4, cmpgtsf_t, cmpeqsf_t, ieee_ccmpeqsf_t, sqrtsf2_i, rsqrtsf2, fsca, adddf3_i, subdf3_i, muldf3_i, divdf3_i, floatsidf2_i, fix_truncdfsi2_i, cmpgtdf_t, cmpeqdf_t, *ieee_ccmpeqdf_t, sqrtdf2_i, extendsfdf2_i4, truncdfsf2_i4): Replace use of FPSCR_REG with clobber of FPSCR_STAT_REG and use of FPSCR_MODES_REG. Adjust gen_* function uses. gcc/testsuite/ PR target/53513 * gcc.target/sh/pr54680.c: Adjust matching of lds insn. Modified: trunk/gcc/ChangeLog trunk/gcc/config/sh/sh-protos.h trunk/gcc/config/sh/sh.c trunk/gcc/config/sh/sh.h trunk/gcc/config/sh/sh.md trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/sh/pr54680.c
As a next step, I'd like to add built-in functions to get/set the FPSCR on SH. We have the old __get_fpscr and __set_fpscr functions in libgcc which modify the __fpscr_values array and the FPSCR. One idea would be to make __get_fpscr and __set_fpscr built-in functions, so that new code never generates the respective library calls. I don't know how expected/unexpected that would be for users and their existing code. For example, the current glibc does this in sysdeps/sh/sh4/fpu/fpu_control.h: #if defined __GNUC__ __BEGIN_DECLS /* GCC provides this function. */ extern void __set_fpscr (unsigned long); #define _FPU_SETCW(cw) __set_fpscr ((cw)) #else #define _FPU_SETCW(cw) __asm__ ("lds %0,fpscr" : : "r" (cw)) #endif By making __set_fpscr a builtin it would automagically work, but would never update the __fpscr_values. I'm not sure what kind of other fpscr related work arounds build on top of that. Alternatively, we could name the built-in functions differently. I've briefly checked the docs of other targets, this is what I've found: aarch64: unsigned int __builtin_aarch64_get_fpcr () void __builtin_aarch64_set_fpcr (unsigned int) unsigned int __builtin_aarch64_get_fpsr () void __builtin_aarch64_set_fpsr (unsigned int) ARM: unsigned int __builtin_arm_get_fpscr () void __builtin_arm_set_fpscr (unsigned int) MIPS: unsigned int __builtin_mips_get_fcsr (void) void __builtin_mips_set_fcsr (unsigned int value) I'd propose the following for SH: unsigned int __builtin_sh_get_fpscr () void __builtin_sh_set_fpscr (unsigned int) Any opinions or feedback regarding the matter?
(In reply to Oleg Endo from comment #32) > I'd propose the following for SH: > > unsigned int __builtin_sh_get_fpscr () > void __builtin_sh_set_fpscr (unsigned int) > > Any opinions or feedback regarding the matter? Looks fine to me.
Test run for -m2e -ml, -m2a -mb, -m2a-single -mb, -m4-single-ml has finished and shows no new failures (except the ISR stuff).
Created attachment 33745 [details] Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr So I ended up removing the usage of PSImode for FPSCR and using SImode instead. I've tested this patch on r216173, together with the already applied attachment 33727 [details] patch, with -m4 -ml and -m4 -mb. There is one new failure: FAIL: gcc.c-torture/execute/20021120-1.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects (internal compiler error) error: insn does not satisfy its constraints: (insn 1491 1489 1176 5 (set (reg:SI 12 r12) (plus:SI (reg:SI 13 r13) (const_int 24 [0x18]))) 20021120-1.c:39 76 {*addsi3_compact} (nil)) 20021120-1.c:58:1: internal compiler error: in reload_cse_simplify_operands, at postreload.c:415 0x8565017 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) ../../gcc-trunk2/gcc/rtl-error.c:110 0x8565055 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*) ../../gcc-trunk2/gcc/rtl-error.c:121 0x850c0cc reload_cse_simplify_operands ../../gcc-trunk2/gcc/postreload.c:415 0x850e5ea reload_cse_simplify ../../gcc-trunk2/gcc/postreload.c:127 0x850e5ea reload_cse_regs_1 ../../gcc-trunk2/gcc/postreload.c:224 0x850e6a2 reload_cse_regs ../../gcc-trunk2/gcc/postreload.c:72 0x850e6a2 execute ../../gcc-trunk2/gcc/postreload.c:2348 At that place, reload it's trying to stuff with the fpu_switch insn: (insn 293 1480 1173 5 (parallel [ (set (mem/c:DF (reg:SI 12 r12) [4 %sfp+-344 S8 A32]) (reg:DF 70 fr6)) (use (reg:SI 154 fpscr0)) (clobber (scratch:SI)) ]) 20021120-1.c:39 289 {movdf_i4} (nil)) (insn 1173 293 1483 5 (parallel [ (set (reg:SI 12 r12) (reg/v:SI 151 )) (use (reg:SI 155 fpscr1)) (use (reg:SI 154 fpscr0)) (set (reg:SI 155 fpscr1) (unspec_volatile:SI [ (const_int 0 [0]) ] UNSPECV_FPSCR_STAT)) (set (reg:SI 154 fpscr0) (unspec_volatile:SI [ (const_int 0 [0]) ] UNSPECV_FPSCR_MODES)) ]) 20021120-1.c:39 434 {fpu_switch} (nil)) (insn 1483 1173 1484 5 (set (reg:SI 13 r13) (const_int 252 [0xfc])) 20021120-1.c:39 258 {movsi_ie} (nil)) (insn 1484 1483 1485 5 (set (reg:SI 13 r13) (plus:SI (reg:SI 13 r13) (reg/f:SI 15 r15))) 20021120-1.c:39 76 {*addsi3_compact} (expr_list:REG_EQUIV (plus:SI (reg/f:SI 15 r15) (const_int 252 [0xfc])) (nil))) (insn 1485 1484 1174 5 (set (mem/c:SI (plus:SI (reg:SI 13 r13) (const_int 24 [0x18])) [4 %sfp+-76 S4 A32]) (reg:SI 12 r12)) 20021120-1.c:39 258 {movsi_ie} (nil)) (note 1174 1485 1490 5 NOTE_INSN_DELETED) (insn 1490 1174 1175 5 (set (reg:SI 13 r13) (const_int 524288 [0x80000])) 20021120-1.c:39 258 {movsi_ie} (nil)) (insn 1175 1490 1487 5 (set (reg:SI 12 r12) (xor:SI (reg:SI 12 r12) (reg:SI 13 r13))) 20021120-1.c:39 138 {*xorsi3_compact} (nil)) (insn 1487 1175 1488 5 (set (reg:SI 13 r13) (const_int 252 [0xfc])) 20021120-1.c:39 258 {movsi_ie} (nil)) (insn 1488 1487 1489 5 (set (reg:SI 13 r13) (plus:SI (reg:SI 13 r13) (reg/f:SI 15 r15))) 20021120-1.c:39 76 {*addsi3_compact} (expr_list:REG_EQUIV (plus:SI (reg/f:SI 15 r15) (const_int 252 [0xfc])) (nil))) (insn 1489 1488 1491 5 (set (mem/c:SI (plus:SI (reg:SI 13 r13) (const_int 24 [0x18])) [4 %sfp+-76 S4 A32]) (reg:SI 12 r12)) 20021120-1.c:39 258 {movsi_ie} (nil)) (insn 1491 1489 1176 5 (set (reg:SI 12 r12) (plus:SI (reg:SI 13 r13) (const_int 24 [0x18]))) 20021120-1.c:39 76 {*addsi3_compact} (nil)) (insn 1176 1491 1496 5 (parallel [ (set (reg/v:SI 151 ) (mem/c:SI (reg:SI 12 r12) [4 %sfp+-76 S4 A32])) (use (reg:SI 155 fpscr1)) (use (reg:SI 154 fpscr0)) (set (reg:SI 155 fpscr1) (unspec_volatile:SI [ (const_int 0 [0]) ] UNSPECV_FPSCR_STAT)) (set (reg:SI 154 fpscr0) (unspec_volatile:SI [ (const_int 0 [0]) ] UNSPECV_FPSCR_MODES)) ]) 20021120-1.c:39 434 {fpu_switch} (nil)) (insn 1496 1176 1497 5 (set (reg:SI 12 r12) (const_int 16 [0x10])) 20021120-1.c:39 258 {movsi_ie} (nil)) Summary: insn 1173 stores the fpscr to r12 the mode switch (xor op) is done on r12 insn 1489 writes r12 (new fpscr value) on the stack insn 1491 tries to recalculate the stack address and put it into r12, to satisfy the memory constraints of fpu_switch (which now accepts post-inc load and simple register address). insn 1776 tries to load it from the stack Reload generates a wrong insn addsi3 insn. Something similar was happening in PR 55212, too.
(In reply to Oleg Endo from comment #35) > Created attachment 33745 [details] > Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr > > So I ended up removing the usage of PSImode for FPSCR and using SImode > instead. I've tested this patch on r216173, together with the already > applied attachment 33727 [details] patch, with -m4 -ml and -m4 -mb. There > is one new failure: > > FAIL: gcc.c-torture/execute/20021120-1.c -O2 -flto -fuse-linker-plugin > -fno-fat-lto-objects (internal compiler error) > > error: insn does not satisfy its constraints: > (insn 1491 1489 1176 5 (set (reg:SI 12 r12) > (plus:SI (reg:SI 13 r13) > (const_int 24 [0x18]))) 20021120-1.c:39 76 {*addsi3_compact} > (nil)) > > > Summary: > > insn 1173 stores the fpscr to r12 > > the mode switch (xor op) is done on r12 > > insn 1489 writes r12 (new fpscr value) on the stack > > insn 1491 tries to recalculate the stack address and put it into r12, to > satisfy the memory constraints of fpu_switch (which now accepts post-inc > load and simple register address). > > insn 1776 tries to load it from the stack > > > Reload generates a wrong insn addsi3 insn. Something similar was happening > in PR 55212, too. Disallowing memory operands (except pre-dec store, post-inc load for push,pop) in the fpu_switch pattern fixes that failure. Although it might result in less optimal code in some cases where user code wants to store the current fpscr in memory. On the other hand, the generated memory address code in such cases is not very good anyway. Thus it's probably better to drop that for now.
Created attachment 33751 [details] Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr I'm now testing this patch. It removes The PSImode for FPSCR completely and splits the fpu_switch insn into two separate lds_fpscr and sts_fpscr insns. This allows relaxing the insn dependencies. Also now that sts_fpscr becomes a single-set insn, it can be stuffed into delay slots. I've also removed the alternatives for FPSCR <-> memory, except the pre-dec and post-inc ones, which are needed for push and pop insns. As a consequence, a user initiated __builtin_sh_get_fpscr () to a memory will always be ferried though a general register. There is some room for improvement, but it goes into the direction of address-mode-selection optimization, which can be done later. When doing a __builtin_sh_set_fpscr (value) the compiler will always insert code to preserve the current FPSCR FR, SZ, PR mode bits. This always involves getting the current FPSCR into a general register first and then loading FPSCR from a general register. Thus we can omit FPSCR loads from memory for now.
(In reply to Oleg Endo from comment #37) > Created attachment 33751 [details] > Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr > > I'm now testing this patch. ... there are no new failures for -m4 -ml and -m4 -mb. I'm tempted to apply it. Kaz, do you have any objections?
(In reply to Oleg Endo from comment #38) > ... there are no new failures for -m4 -ml and -m4 -mb. I'm tempted to apply > it. Kaz, do you have any objections? I have no objection.
Author: olegendo Date: Sat Oct 18 10:51:08 2014 New Revision: 216424 URL: https://gcc.gnu.org/viewcvs?rev=216424&root=gcc&view=rev Log: gcc/ PR target/53513 * config/sh/sh-modes.def (PSI): Remove. * config/sh/sh-protos.h (get_fpscr_rtx): Remove. * config/sh/sh.c (fpscr_rtx, get_fpscr_rtx): Remove. (sh_reorg): Remove commented out FPSCR code. (fpscr_set_from_mem): Use SImode instead of PSImode. Emit lds_fpscr insn instead of move insn. (sh_hard_regno_mode_ok): Return SImode for FPSCR. (sh_legitimate_address_p, sh_legitimize_reload_address): Remove PSImode handling. (sh_emit_mode_set): Emit lds_fpscr and sts_fpscr insns. (sh1_builtin_p): Uncomment. (SH_BLTIN_UV 25, SH_BLTIN_VU 26): New macros. (bdesc): Add __builtin_sh_get_fpscr and __builtin_sh_set_fpscr. * config/sh/sh/predicates.md (fpscr_operand): Simplify. (fpscr_movsrc_operand, fpscr_movdst_operand): New predicates. (general_movsrc_operand, general_movdst_operand): Disallow fpscr_operand. * config/sh/sh.md (FPSCR_FR): New constant. (push_fpscr): Emit sts_fpscr insn. (pop_fpscr): Emit lds_fpscr_insn. (movsi_ie): Disallow FPSCR operands. (fpu_switch, unnamed related split, extend_psi_si, truncate_si_psi): Remove insns. (lds_fpscr, sts_fpscr): New insns. (toggle_sz, toggle_pr): Use SImode for FPSCR_REG instead of PSImode. Modified: trunk/gcc/ChangeLog trunk/gcc/config/sh/predicates.md trunk/gcc/config/sh/sh-modes.def trunk/gcc/config/sh/sh-protos.h trunk/gcc/config/sh/sh.c trunk/gcc/config/sh/sh.md
Author: olegendo Date: Sun Dec 7 23:19:59 2014 New Revision: 218472 URL: https://gcc.gnu.org/viewcvs?rev=218472&root=gcc&view=rev Log: gcc/testsuite/ PR target/53513 * gcc.target/sh/pr54602-4.c: Fix matching of rte-nop sequence. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/sh/pr54602-4.c
Author: olegendo Date: Wed Dec 10 00:21:36 2014 New Revision: 218551 URL: https://gcc.gnu.org/viewcvs?rev=218551&root=gcc&view=rev Log: gcc/ PR target/53513 * doc/extend.texi (__builtin_sh_get_fpscr, __builtin_sh_get_fpscr): Document it. Modified: trunk/gcc/ChangeLog trunk/gcc/doc/extend.texi
Author: olegendo Date: Wed Dec 10 08:31:32 2014 New Revision: 218563 URL: https://gcc.gnu.org/viewcvs?rev=218563&root=gcc&view=rev Log: gcc/ PR target/53513 * doc/extend.texi (__builtin_sh_set_fpscr): Fix typo. Modified: trunk/gcc/ChangeLog trunk/gcc/doc/extend.texi
Author: olegendo Date: Sat Dec 13 13:17:55 2014 New Revision: 218707 URL: https://gcc.gnu.org/viewcvs?rev=218707&root=gcc&view=rev Log: gcc/testsuite/ PR target/53513 * gcc.target/sh/attr-isr-nosave_low_regs.c: Fix matching of expected register push/pop sequences. * gcc.target/sh/attr-isr.c: Likewise. * gcc.target/sh/attr-isr-trapa.c: Likewise. * gcc.target/sh/pragma-isr-nosave_low_regs.c: Likewise. * gcc.target/sh/pragma-isr-trapa.c: Likewise. * gcc.target/sh/pragma-isr-trapa2.c: Likewise. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/sh/attr-isr-nosave_low_regs.c trunk/gcc/testsuite/gcc.target/sh/attr-isr-trapa.c trunk/gcc/testsuite/gcc.target/sh/attr-isr.c trunk/gcc/testsuite/gcc.target/sh/pragma-isr-nosave_low_regs.c trunk/gcc/testsuite/gcc.target/sh/pragma-isr-trapa.c trunk/gcc/testsuite/gcc.target/sh/pragma-isr-trapa2.c
From my point of view, this PR can be closed as fixed. The ISR related failures were just issues w.r.t. how the the expected code sequences are matched in the test cases. I've created followup PR 64305 and PR 64299, which are issues that can be addressed separately.
Author: olegendo Date: Tue Dec 16 21:28:59 2014 New Revision: 218793 URL: https://gcc.gnu.org/viewcvs?rev=218793&root=gcc&view=rev Log: gcc/testsuite/ PR target/53513 * gcc.target/sh/fpchg.c: Rename to ... * gcc.target/sh/pr53513-1.c: ... this. Adjust test case to work for -m4a and -m4a-single. Added: trunk/gcc/testsuite/gcc.target/sh/pr53513-1.c Removed: trunk/gcc/testsuite/gcc.target/sh/fpchg.c Modified: trunk/gcc/testsuite/ChangeLog
Fixed/implemented for GCC 5.