Bug 53513 - [SH] Add support for fpchg insn and improve fenv support
Summary: [SH] Add support for fpchg insn and improve fenv support
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.8.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
: 60138 (view as bug list)
Depends on: 29349
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-28 23:57 UTC by Oleg Endo
Modified: 2014-12-21 17:53 UTC (History)
3 users (show)

See Also:
Host:
Target: sh*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-03-16 00:00:00


Attachments
a possible patch (1.19 KB, patch)
2014-10-11 21:28 UTC, Oleg Endo
Details | Diff
a possible patch (1.19 KB, patch)
2014-10-11 22:02 UTC, Oleg Endo
Details | Diff
Using virtual FPSCR registers to model insn dependencies (9.82 KB, patch)
2014-10-15 00:06 UTC, Oleg Endo
Details | Diff
Using virtual FPSCR registers to model insn dependencies (9.86 KB, patch)
2014-10-15 01:00 UTC, Oleg Endo
Details | Diff
Using virtual FPSCR registers to model insn dependencies (9.42 KB, patch)
2014-10-15 11:28 UTC, Oleg Endo
Details | Diff
Using virtual FPSCR registers to model insn dependencies (9.86 KB, patch)
2014-10-15 17:57 UTC, Oleg Endo
Details | Diff
Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr (4.34 KB, patch)
2014-10-17 09:22 UTC, Oleg Endo
Details | Diff
Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr (5.35 KB, patch)
2014-10-17 17:41 UTC, Oleg Endo
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Oleg Endo 2012-05-28 23:57:12 UTC
Currently FPU mode switches are done by always loading values from the global __fpscr_values array into FPSCR.
On SH4A the fpchg insn should be used to do SFmode / DFmode switches.
On non-SH4A it would probably be better to toggle the PR bit by other means in order to preserve the other FPSCR bits.  One example use case where the FPSCR.FR bit needs to be preserved is matrix multiplication, where user code could swap the register bank using the frchg insn.  Even if the user code updated the global __fpscr_values to reflect the FPSCR.FR change, it would break when there are multiple threads that want to use different FPSCR.FR/PR/SZ settings, as there is only one global __fpscr_values array.

To provide some backwards compatibility with existing binaries the global __fpscr_values should be kept around.  Old code would then still be able to reference them and new code would not touch them except when __set_fpscr is used.  Since the FPSCR.PR/SZ settings at function entry and function exit are defined by the selected ABI there should be no interoperability problems.
Comment 1 Oleg Endo 2013-03-10 19:53:56 UTC
Some related notes:

According to the public documentation, the 'fschg' insn is only valid when FPSCR.PR = 0 on all FPU enabled cores (SH2A, SH4, SH4A).

On SH4 and SH4A the 'frchg' insn is only valid when FPSCR.PR = 0.

The setting FPSCR.PR = 1, FPSCR.SZ = 1 is only valid for SH4A and SH2A, but not SH4.
Comment 2 Oleg Endo 2013-07-31 17:12:41 UTC
See also PR 29349.
Comment 3 Oleg Endo 2014-03-13 20:48:12 UTC
*** Bug 60138 has been marked as a duplicate of this bug. ***
Comment 4 Oleg Endo 2014-03-16 20:47:36 UTC
As mentioned in PR 60138, this issue also prevents a working implementation of fenv.h & friends on SH.

The idea would be to get rid of the __fpscr_values first and set the FPSCR.PR bit with insn sequences such like..

set pr = 1:
  sts	  fpscr,r2
  mov.l   #(1 << 19),r1
  or      r1,r2
  lds	  r2,fpscr

set pr = 0:
  sts     fpscr,r2
  mov.l   #~(1 << 19),r1
  and      r1,r2
  lds     r2,fpscr

This would obviously result in a performance regression but would work with all SH FPUs.

On SH4A this can then be improved by adding support for fpchg.  Although this would require changes/extensions to the mode switching machinery, as mentioned in PR 29349.  The problem is that the mode switching pass emits only mode changes to a particular mode, not from mode 'x' to mode 'y'.  In PR 29349 an extension of the pre_edge_lcm function is suggested which would make the necessary information available.

Here are a few more 'requirements' for SH specific mode change issues:

1)
The following function:

double test (const float* a, const float* b, const double* c, float x)
{
  float aa = a[0] * b[0];
  double cc = c[0] + c[1];

  aa += b[1] * b[2];
  cc += c[2] + c[3];
  aa += b[3] * b[4];
  cc += c[4] + c[5];
  aa += b[5] * b[6];

  return aa / cc;
}

compiled with -m4 -O2 (default PR mode = double) results in 4 mode switches.  Rewriting it as:

double test (const float* a, const float* b, const double* c, float x)
{
  float aa = a[0] * b[0] + b[1] * b[2] + b[3] * b[4] + b[5] * b[6];

  double cc = c[0] + c[1];
  cc += c[2] + c[3];
  cc += c[4] + c[5];

  return aa / cc;
}

results only in 2 mode switches (as expected).  FP operations which are independent should be reordered in order to minimize mode switches.
This could go as far as ...


2)
... doing loop distribution, for cases such as:

double test (const float* x, const double* y, unsigned int c)
{
  float r0 = 0;
  double r1 = 0;

  while (c--)
  {
    float xx = x[0] * x[1] + x[2] + 123.0f;
    x += 3;

    double yy = y[0] + y[1];
    y += 2;

    r0 += xx;
    r1 += yy;
  }

  return r0 + r1;
}

which currently produces a loop with 4 mode switches in it.  Reordering the FP operations would bring this down to 2 mode switches in the loop.  Since r0 and r1 are calculated independently, 2 loops can be used, having the mode switches outside the loops.


3)
FPSCR.SZ mode changes might interfere with FPSCR.PR mode changes.  For example, using fschg to flip FPSCR.SZ might require changing FPSCR.PR first (and potentially changing it back).  If fpchg is not available (SH4A only), it's better to set both bits directly.  In order to minimize mode switches it might be necessary to reorder instructions and doing loop distribution while looking at PR and SZ bits simultaneously.


4)
rounding mode settings also mean FPSCR mode changes.
http://gcc.gnu.org/ml/gcc-patches/2014-02/msg01378.html


5)
in some cases preserving FPSCR bits across mode changes is not required (if I'm not mistaken):

double func (float a, float b, double c, double d)
{
  #pragma STDC FENV_ACCESS ON

  // function entry, PR = double

  // mode switch PR = single
  float ab = a + b;

  // mode switch PR = double
  double x = ab + c + d;

  // read back FP status bits and do something with it
  return x;

  // function exit, PR = double
}

In this case the mode switch double -> float -> double can be done more efficiently by pushing the PR = double FPSCR state onto the stack, switch to PR = single and then switch back to PR = double by popping FPSCR from the stack.

However, this must not happen if other FPSCR settings are changed after the first switch to PR = single, such as invoking a fenv modifying standard function or changing the FPSCR.FR bit on SH4.
Comment 5 Oleg Endo 2014-03-16 23:32:21 UTC
(In reply to Oleg Endo from comment #4)
> 5)
> in some cases preserving FPSCR bits across mode changes is not required (if
> I'm not mistaken):
> 
> double func (float a, float b, double c, double d)
> {
>   #pragma STDC FENV_ACCESS ON
> 
>   // function entry, PR = double
> 
>   // mode switch PR = single
>   float ab = a + b;
> 
>   // mode switch PR = double
>   double x = ab + c + d;
> 
>   // read back FP status bits and do something with it
>   return x;
> 
>   // function exit, PR = double
> }
> 
> In this case the mode switch double -> float -> double can be done more
> efficiently by pushing the PR = double FPSCR state onto the stack, switch to
> PR = single and then switch back to PR = double by popping FPSCR from the
> stack.
> 
> However, this must not happen if other FPSCR settings are changed after the
> first switch to PR = single, such as invoking a fenv modifying standard
> function or changing the FPSCR.FR bit on SH4.

If it's OK for a temporary mode switch to clobber other FPSCR bits (such as in the PR = single mode above), it should also be OK to load the FPSCR value from a thread local variable inside the thread-control-block:

     mov.l  @(<disp>, gbr),r0  // r0 = address to fpscr value for a
                               // particular mode setting
     lds.l  @r0+,fpscr         // mode switch

This would require that any FPSCR setting change is also propagated to the TLS variables.  E.g. setting the rounding mode would have to update FPSCR mode values in all such TLS variables.
I guess that it would be useful to be able to select an FPSCR value for at least all combinations of FR and SZ bit settings, in other words having a TLS __fpscr_values array with 4 entries.  However, it would make things such as toggling FPSCR.FR via frchg inefficient due to the required updates of the TLS variables.  Other setting changes such as denorm or rounding mode are probably not so critical.
Comment 6 Rich Felker 2014-03-17 01:30:41 UTC
On Sun, Mar 16, 2014 at 11:32:21PM +0000, olegendo at gcc dot gnu.org wrote:
> If it's OK for a temporary mode switch to clobber other FPSCR bits (such as in
> the PR = single mode above), it should also be OK to load the FPSCR value from
> a thread local variable inside the thread-control-block:
> 
>      mov.l  @(<disp>, gbr),r0  // r0 = address to fpscr value for a
>                                // particular mode setting
>      lds.l  @r0+,fpscr         // mode switch

IMO this is an ugly hack that shouldn't be taken. It has lots of
complex interactions with other things: signal handlers, the
ucontext.h functions, fenv, pthread, etc. that could probably be
achieved correctly if somebody wanted to spend the effort on it, but
it would be ugly and SH-specific and honestly we already have a
shortage of people willing to spend time fixing SH problems without
introducing even more work.

> This would require that any FPSCR setting change is also propagated to the TLS
> variables.  E.g. setting the rounding mode would have to update FPSCR mode
> values in all such TLS variables.
> I guess that it would be useful to be able to select an FPSCR value for at
> least all combinations of FR and SZ bit settings, in other words having a TLS
> __fpscr_values array with 4 entries.  However, it would make things such as
> toggling FPSCR.FR via frchg inefficient due to the required updates of the TLS
> variables.  Other setting changes such as denorm or rounding mode are probably
> not so critical.

If I'm not mistaken, the toggle approach will be efficient without
this TLS hack once it's implemented, right? I don't think it makes
sense to introduce a hack for just a short-term mitigation of the
performance regression. If you think the short-term fix for this issue
is too costly, the proper solution is probably to add a -m option to
turn it back off (using the old __fpscr_values approach).
Comment 7 Oleg Endo 2014-03-17 09:09:12 UTC
(In reply to Rich Felker from comment #6)
> On Sun, Mar 16, 2014 at 11:32:21PM +0000, olegendo at gcc dot gnu.org wrote:
> > If it's OK for a temporary mode switch to clobber other FPSCR bits (such as in
> > the PR = single mode above), it should also be OK to load the FPSCR value from
> > a thread local variable inside the thread-control-block:
> > 
> >      mov.l  @(<disp>, gbr),r0  // r0 = address to fpscr value for a
> >                                // particular mode setting
> >      lds.l  @r0+,fpscr         // mode switch
> 
> IMO this is an ugly hack that shouldn't be taken. It has lots of
> complex interactions with other things: signal handlers, the
> ucontext.h functions, fenv, pthread, etc.

Could you please elaborate on the problems you see there?

> that could probably be
> achieved correctly if somebody wanted to spend the effort on it, but
> it would be ugly and SH-specific and honestly we already have a
> shortage of people willing to spend time fixing SH problems without
> introducing even more work.

I failed to mention, that the idea with TLS variables was meant to be a separate option.  If implemented in the compiler it would need to be turned on explicitly, similar to the option '-matomic-model=soft-tcb,gbr-offset=<your offset here>'.  Thus, if a libc/kernel/whatever doesn't want to make use of it, it won't have to.

> 
> > This would require that any FPSCR setting change is also propagated to
> > the TLS variables.  E.g. setting the rounding mode would have to update
> > FPSCR mode values in all such TLS variables.
> > I guess that it would be useful to be able to select an FPSCR value for at
> > least all combinations of FR and SZ bit settings, in other words having
> > a TLS __fpscr_values array with 4 entries.  However, it would make things
> > such as toggling FPSCR.FR via frchg inefficient due to the required updates
> > of the TLS variables.  Other setting changes such as denorm or rounding
> > mode are probably not so critical.
> 
> If I'm not mistaken, the toggle approach will be efficient without
> this TLS hack once it's implemented, right?

No.  The toggle approach is only efficient on SH4A.  Other SH FPUs don't implement the fpchg instruction and require sts-lds fpscr sequences, regardless of toggling or explicitly setting the PR mode. 

> I don't think it makes
> sense to introduce a hack for just a short-term mitigation of the
> performance regression. If you think the short-term fix for this issue
> is too costly, the proper solution is probably to add a -m option to
> turn it back off (using the old __fpscr_values approach).

This was not meant to be a short-term mitigation.  In fact figuring out whether FPSCR bits can be clobbered during a PR mode switch or not is already not so simple.  If at all, it would be an additional optimization opportunity after everything else is in place.

Keeping the 'global __fpscr_values' approach as an option could be useful for thread model = single, or for bare-metal applications where the rounding mode etc is never changed from the default and FP status bits are never read back by application code.
Comment 8 chrbr 2014-03-17 11:53:51 UTC
Hi Oleg, 

I posted a patch in 2006 (http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01562.html) to support mode flipping fpchg instruction for the sh4a. it is in used in our current (4.8) production compiler without issue since then.

I didn't insist a lot pushing it since too sh4a specific, but I'd be happy to port it to the 4.10 branch and repost if there is enough interest.

Cheers

a code like 
float
foo(float a)
{
return a+2;
}

compiles into:

        mova    .L2,r0
        fmov.s  @r0+,fr0
        fpchg
        fadd    fr5,fr0
        rts
        fpchg

instead of:

      mov.l   .L2,r1
        mova    .L3,r0
        fmov.s  @r0+,fr0
        lds.l   @r1+,fpscr
        add     #-4,r1
        add     #4,r1
        fadd    fr5,fr0
        rts
        lds.l   @r1+,fpscr
.L4:
        .align 2
.L2:
        .long   ___fpscr_values
.L3:
        .long   1073741824
Comment 9 Kazumoto Kojima 2014-03-17 13:51:39 UTC
Although it seems that (1)-(5) in #3 are interesting points, they
are almost optimizations.  I guess that programs with frequent FP
mode switchings are simply rare in real world and would be a bit
special or even pathological in the first place.
I like the idea of mode switching without __fpscr_values even if it
requires more instructions on SH4.  Now SH4 is a fairy old core and
is not for heavy FP computations anyway.  I think that it won't impact
the real working set.
It looks to me that Christian's patch is the way to go.
Comment 10 Oleg Endo 2014-03-17 14:23:44 UTC
(In reply to Kazumoto Kojima from comment #9)
> Although it seems that (1)-(5) in #3 are interesting points, they
> are almost optimizations.  

Yep.  Those are not necessary to get the functionality (of not using __fpscr_values).

> I guess that programs with frequent FP
> mode switchings are simply rare in real world and would be a bit
> special or even pathological in the first place.
> I like the idea of mode switching without __fpscr_values even if it
> requires more instructions on SH4.  Now SH4 is a fairy old core and
> is not for heavy FP computations anyway.  I think that it won't impact
> the real working set.
> It looks to me that Christian's patch is the way to go.

Yep.  However, when the patch was proposed there were some objections regarding the modifications in lcm.c (if I'm not mistaken).  We could try again.

The reason why I brought up (1)-(5) in #3 was that if one of them is eventually implemented (e.g. reorder calculation insns) the changes to lcm.c might not be required and could be avoided.  Depending on the implementation of such optimization, the mode switch information might have to be obtained/maintained in a different way.  However, I at the moment I have no concrete ideas/plans.  If Christian's patch is accepted, that's cool, too.
Comment 11 chrbr 2014-03-17 15:18:28 UTC
(In reply to Oleg Endo from comment #10)
> (In reply to Kazumoto Kojima from comment #9)
> > Although it seems that (1)-(5) in #3 are interesting points, they
> > are almost optimizations.  
> 
> Yep.  Those are not necessary to get the functionality (of not using
> __fpscr_values).
> 
> > I guess that programs with frequent FP
> > mode switchings are simply rare in real world and would be a bit
> > special or even pathological in the first place.
> > I like the idea of mode switching without __fpscr_values even if it
> > requires more instructions on SH4.  Now SH4 is a fairy old core and
> > is not for heavy FP computations anyway.  I think that it won't impact
> > the real working set.
> > It looks to me that Christian's patch is the way to go.
> 
> Yep.  However, when the patch was proposed there were some objections
> regarding the modifications in lcm.c (if I'm not mistaken).  We could try
> again.

there were neither followup nor objections to the last version. I'll post again, the time to cross-check with epiphany or x96. 

> 
> The reason why I brought up (1)-(5) in #3 was that if one of them is
> eventually implemented (e.g. reorder calculation insns) the changes to lcm.c
> might not be required and could be avoided.  Depending on the implementation
> of such optimization, the mode switch information might have to be
> obtained/maintained in a different way.  However, I at the moment I have no
> concrete ideas/plans.  If Christian's patch is accepted, that's cool, too.

Implementing in LCM can also be a base to expose the architectural mode flipping extension with other ports (MMX was suggested I think)
Comment 12 Oleg Endo 2014-03-17 15:41:43 UTC
(In reply to chrbr from comment #11)
> 
> there were neither followup nor objections to the last version. I'll post
> again, the time to cross-check with epiphany or x96. 
> 

Epiphany has some additional mode switching passes.  I haven't checked the details, but maybe it could be of some use on SH, too.
Comment 13 chrbr 2014-05-12 08:46:54 UTC
for the record, last patch here: 
http://gcc.gnu.org/ml/gcc-patches/2014-05/msg00738.html
Comment 14 Oleg Endo 2014-10-11 21:28:41 UTC
Created attachment 33690 [details]
a possible patch

This is a simple patch that does sts-lds fpscr mode switching (not fully tested).

With the patch applied, the following example

float test (float x, float y)
{
  return x + y;
}

compiled with -O2 -m4 (default fpu mode = double):

        sts     fpscr,r1        ! 20    fpu_switch/8
        mov.l   .L2,r2          ! 22    movsi_ie/1
        xor     r2,r1           ! 23    *xorsi3_compact/2
        lds     r1,fpscr        ! 25    fpu_switch/5
        xor     r2,r1           ! 29    *xorsi3_compact/2
        fmov    fr5,fr0         ! 34    movsf_ie/1
        fadd    fr4,fr0         ! 8     addsf3_i
        rts                     ! 37    *return_i
        lds     r1,fpscr        ! 31    fpu_switch/5
.L3:
        .align 2
.L2:
        .long   524288

The switch is done by 3 (+2 artificial) individual instructions (load - modify - store).  In this case the RA / optimizers figure out that there's no need to store fpscr twice and reorder the operations.  This is because all the fp insn patterns in the machine description only "use" the fpscr, but actually they also modify it.  This means that the fenv is reset after the 'fadd', i.e. it potentially clears exception flags etc.

I think this is wrong.  It also seems impossible to get the fpscr value immediately after the fp insn, as it always gets reordered in some way.  As far as I understand, all the fp insns that update bits in fpscr should actually do so (clobber it or set it in someway) and a builtin "get_fpscr" is required so that optimizers see the dependencies on fpscr.
Comment 15 Oleg Endo 2014-10-11 22:02:20 UTC
Created attachment 33691 [details]
a possible patch

The previous patch was buggy, it always generated a PR toggle sequence, even if a mode should be set directly.
Comment 16 Oleg Endo 2014-10-11 22:09:37 UTC
I've tried a modified example from PR 5360, using floats instead of doubles:

void loop_p (int np, int non0, float coeff[][2048], float tmp1)
{
  int j, k;

  for (j = non0; j < np; j++)
    for (k = 0; k < j; k++)
      coeff[j][j] -= tmp1 * coeff[j][k];
}

with -O2 -m4a (double mode default) and the patch from comment #15 applied:

  (loop setup code omitted)
        ...
.L6:
        cmp/pl  r5           ! outer loop, set to single
        bf/s    .L7
        sts     fpscr,r7
        mov.l   .L16,r4
        mov     r0,r2
        fmov.s  @r3,fr1
        mov     r5,r1
        and     r4,r7
        lds     r7,fpscr
        .align 2
.L5:
        fmov.s  @r2+,fr0     ! inner loop, no switch
        dt      r1
        fneg    fr0
        fmac    fr0,fr5,fr1
        bf/s    .L5
        fmov.s  fr1,@r3
.L7:
        dt      r6
        add     #1,r5
        add     r9,r0
        bf/s    .L6
        add     r8,r3

        sts     fpscr,r1     ! function return, set to double
        mov.l   .L17,r2
        mov.l   @r15+,r9
        or      r2,r1
        mov.l   @r15+,r8
        rts
        lds     r1,fpscr

Obviously, if the inner loop count is small the mode set in the outer loop will dominate.  Something seems to be missing in the mode-switch optimization.  The mode switch should be just hoisted above all loops, which then can use the fpchg insn on SH4A.
Comment 17 Oleg Endo 2014-10-12 14:13:25 UTC
(In reply to Oleg Endo from comment #14)
> 
> The switch is done by 3 (+2 artificial) individual instructions (load -
> modify - store).  In this case the RA / optimizers figure out that there's
> no need to store fpscr twice and reorder the operations.  This is because
> all the fp insn patterns in the machine description only "use" the fpscr,
> but actually they also modify it.  This means that the fenv is reset after
> the 'fadd', i.e. it potentially clears exception flags etc.
> 
> I think this is wrong.  It also seems impossible to get the fpscr value
> immediately after the fp insn, as it always gets reordered in some way.  As
> far as I understand, all the fp insns that update bits in fpscr should
> actually do so (clobber it or set it in someway) and a builtin "get_fpscr"
> is required so that optimizers see the dependencies on fpscr.

In the 'addsf3_i' pattern, I've tried replacing the

    (use (match_operand:PSI 3 "fpscr_operand" "c"))

with

    (set (match_operand:PSI 3 "fpscr_operand" "=&c")
         (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))]

and after that the asm output looks OK:
        sts     fpscr,r1
        mov.l   .L2,r2
        xor     r2,r1
        lds     r1,fpscr
        fmov    fr5,fr0
        fadd    fr4,fr0
        sts     fpscr,r1
        xor     r2,r1
        rts
        lds     r1,fpscr

I haven't checked all the other side effects it could have, but at least the FMA combine patterns still seem work after that change.
Comment 18 Oleg Endo 2014-10-13 06:54:10 UTC
(In reply to Oleg Endo from comment #15)
> Created attachment 33691 [details]
> a possible patch
> 
> The previous patch was buggy, it always generated a PR toggle sequence, even
> if a mode should be set directly.

I've tested the patch, there are some new failures:

-m4 -mb:
FAIL: g++.dg/torture/type-generic-1.C   -O0  execution test
FAIL: gcc.c-torture/execute/builtins/complex-1.c execution,  -O0 
FAIL: gcc.c-torture/execute/builtins/complex-1.c execution,  -Og -g 
FAIL: gcc.c-torture/execute/complex-6.c   -O0  execution test
FAIL: gcc.c-torture/execute/gofast.c   -O0  execution test
FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution,  -O0 
FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution,  -O1 
FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution,  -Og -g 
FAIL: gcc.c-torture/execute/ieee/20030331-1.c execution,  -O0 
FAIL: gcc.c-torture/execute/ieee/copysign1.c execution,  -O0 
FAIL: gcc.c-torture/execute/ieee/mzero3.c execution,  -O0 
FAIL: gcc.dg/torture/type-generic-1.c   -O0  execution test

-m4 -ml:
FAIL: g++.dg/torture/type-generic-1.C   -O0  execution test
FAIL: g++.dg/torture/type-generic-1.C   -O1  execution test
FAIL: g++.dg/torture/type-generic-1.C   -O2  execution test
FAIL: g++.dg/torture/type-generic-1.C   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test
FAIL: g++.dg/torture/type-generic-1.C   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  execution test
FAIL: g++.dg/torture/type-generic-1.C   -O3 -fomit-frame-pointer  execution test
FAIL: g++.dg/torture/type-generic-1.C   -O3 -g  execution test
FAIL: g++.dg/torture/type-generic-1.C   -Os  execution test
FAIL: gcc.c-torture/execute/builtins/complex-1.c execution,  -O0 
FAIL: gcc.c-torture/execute/builtins/complex-1.c execution,  -Og -g 
FAIL: gcc.c-torture/execute/complex-6.c   -O0  execution test
FAIL: gcc.c-torture/execute/conversion.c   -O0  execution test
FAIL: gcc.c-torture/execute/gofast.c   -O0  execution test
FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution,  -O0 
FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution,  -O1 
FAIL: gcc.c-torture/execute/ieee/20010114-2.c execution,  -Og -g 
FAIL: gcc.c-torture/execute/ieee/20030331-1.c execution,  -O0 
FAIL: gcc.c-torture/execute/ieee/copysign1.c execution,  -O0 
FAIL: gcc.c-torture/execute/ieee/mzero3.c execution,  -O0 
FAIL: gcc.dg/pr28796-2.c execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -O0  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -O1  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -O2  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -O3 -fomit-frame-pointer  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -O3 -g  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c   -Os  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -O0  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -O1  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -O2  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -O3 -fomit-frame-pointer  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -O3 -g  execution test
FAIL: gcc.dg/torture/type-generic-1.c   -Os  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O0  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O1  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O2  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O3 -fomit-frame-pointer  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -O3 -g  execution test
FAIL: gcc.dg/torture/vec-cvt-1.c   -Os  execution test
Comment 19 Oleg Endo 2014-10-13 06:57:27 UTC
(In reply to Oleg Endo from comment #18)
> (In reply to Oleg Endo from comment #15)
> > Created attachment 33691 [details]
> > a possible patch
> > 
> > The previous patch was buggy, it always generated a PR toggle sequence, even
> > if a mode should be set directly.
> 
> I've tested the patch, there are some new failures:
> ...

No I didn't.  That was a patch for PR 63260.  Sorry for the noise.
Comment 20 Oleg Endo 2014-10-13 14:40:41 UTC
(In reply to Oleg Endo from comment #19)
> 
> No I didn't.  That was a patch for PR 63260.  Sorry for the noise.


Now I have.  For both '-m4 -ml' and '-m4 -mb' there are a few new failures:

FAIL: gcc.dg/attr-isr.c scan-assembler-times [^f]r[0-9][ \t]*, 8
(SH specific test 'gcc.dg/attr-isr.c' should be moved to 'gcc.target/sh')

FAIL: gcc.target/sh/attr-isr-nosave_low_regs.c scan-assembler-not [^f]r1[,0-3]
FAIL: gcc.target/sh/attr-isr-nosave_low_regs.c scan-assembler-not [^f]r[0-9][ \t]*,
FAIL: gcc.target/sh/attr-isr-trapa.c scan-assembler-not r1[,0-3]
FAIL: gcc.target/sh/pr54680.c scan-assembler-times lds\t 6
FAIL: gcc.target/sh/pragma-isr-nosave_low_regs.c scan-assembler-not [^f]r1[,0-3]
FAIL: gcc.target/sh/pragma-isr-nosave_low_regs.c scan-assembler-not [^f]r[0-9][ \t]*,
FAIL: gcc.target/sh/pragma-isr-trapa.c scan-assembler-not r1[,0-3]
FAIL: gcc.target/sh/pragma-isr-trapa2.c scan-assembler-not r1[,0-3]
FAIL: gcc.target/sh/pragma-isr-trapa2.c scan-assembler-times [^_]fpscr 3
FAIL: gcc.target/sh/pragma-isr-trapa2.c scan-assembler-times r[0-7]\n 3


The failures seem to be related to ISR-like function handling or the tests need adjustments because they assume that fpscr is accessed in a particular way, which is now about to change.

I'd like to ignore ISR function problems for now, and fix it as part of PR 57339.
Comment 21 Oleg Endo 2014-10-14 03:46:19 UTC
(In reply to Oleg Endo from comment #17)
> 
> In the 'addsf3_i' pattern, I've tried replacing the
> 
>     (use (match_operand:PSI 3 "fpscr_operand" "c"))
> 
> with
> 
>     (set (match_operand:PSI 3 "fpscr_operand" "=&c")
>          (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))]
> 
> 
> I haven't checked all the other side effects it could have, but at least the
> FMA combine patterns still seem work after that change.

With those changes above applied to all the other FP insns, the FMA tests in gcc.target/sh fail.

One problem is that at -O1, FMA patterns won't be expanded initially, but are rather formed during combine.  With the more complex FPSCA set/use combine won't match the patterns at -O1 but only at -O2.  This is regression is probably acceptable.

The other issue is that the fix for PR 56547 doesn't work anymore, because for some reason combine can't figure out what to do with the 1.0 constant.  Probably because now it'd need to deal with multiple-set insns, which it doesn't do well.  Adding a combine bridge pattern doesn't make it go further, as it won't try to combine 'fpreg = fpreg + 1.0' and 'fpreg = fpreg * fpreg'.  In this case it's probably better to do a manual fma combine in the split1 pass.  This is more effort, but would also fix the first problem.
Comment 22 Oleg Endo 2014-10-14 12:25:51 UTC
(In reply to Oleg Endo from comment #21)
> (In reply to Oleg Endo from comment #17)
> > 
> > In the 'addsf3_i' pattern, I've tried replacing the
> > 
> >     (use (match_operand:PSI 3 "fpscr_operand" "c"))
> > 
> > with
> > 
> >     (set (match_operand:PSI 3 "fpscr_operand" "=&c")
> >          (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))]
> > 
> > 
> > I haven't checked all the other side effects it could have, but at least the
> > FMA combine patterns still seem work after that change.
> 
> With those changes above applied to all the other FP insns, the FMA tests in
> gcc.target/sh fail.

As well as some of the FSCA tests.  E.g. 'return sinf (x) + cosf (x)' would expand into two sincos patterns and normally combine would fold them into one.  However, it doesn't understand the unspec:PSI stuff and tries to nest it such as:
Failed to match this instruction:
(parallel [
        (set (reg:SI 174)
            (fix:SI (reg:SF 167)))
        (set (reg/v:PSI 151 )
            (unspec:PSI [
                    (unspec:PSI [
                            (reg/v:PSI 151 )
                        ] UNSPEC_FPSCR)
                ] UNSPEC_FPSCR))
    ])

I've tried adding a predicate that recursively matches those nested unspec:PSI things, which makes combine happy, but then reload would bail out for some reason.  Moreover, fp insns that do this

 (set (match_operand:PSI 3 "fpscr_operand" "=&c")
      (unspec:PSI [(match_dup 3)] UNSPEC_FPSCR_SET))]

won't get eliminated even if their results are unused/dead.
I'm trying to come up with some alternative approach.
Comment 23 Oleg Endo 2014-10-15 00:06:05 UTC
Created attachment 33716 [details]
Using virtual FPSCR registers to model insn dependencies

This is a somewhat larger patch which does a couple of things to address the deficits of the previous approach.

1) The hard reg set is extended by two more registers:
- FPSCR_MODES_REG represents any mode bits in FPSCR, which are used by fp insns.

- FPSCR_STAT_REG represents the status bits in FPSCR, which are written by fp insns.

All fp insns that previously had a
  (use (match_operand:PSI 2 "fpscr_operand" "")

now get a
  (use (reg:SI FPSCR_MODES_REG))


2) The 'fpu_switch' insn, which is just a FPSCR load/store, now uses and sets the virtual regs FPSCR_MODES_REG and FPSCR_STAT_REG.  This creates a sort of reordering barrier due to the reg uses/sets.


3) Two new dummy insns extend_psi_si and truncate_si_psi are added to convert PSIMode <-> SImode, because we can do only logic on SImode, but FPSCR is described as PSImode.  This hack could be removed later maybe.


4) The insns toggle_sz and toggle_pr are now also setting the
virtual reg FPSCR_MODES_REG.


5) The fsca insn needed some adjustments for the operand matching, as combine started going different paths.  Some of the fsca tests in gcc.target/sh were failing and are fixed by this.


6) sh_emit_set_t_insn is adjusted to emit the correct comparison insns with use/set of virtual FPSCR regs.


7) calc_live_regs is adjusted to ignore virtual regs FPSCR_MODES_REG and FPSCR_STAT_REG, or else it will try to generate push/pop insns for those regs.


8) sh_emit_mode_set now emits the FPSCR store-modify-load insn sequence instead of loading FPSCR from __fpscr_values.


9) Some redundant fp insns and related functions are deleted.


I've tried to keep the patch as short as possible and not do too many unrelated changes.  I haven't fully tested the patch, so there are probably a couple of fallouts.

There are two regressions with this patch.  The FPSCR loads are not put into delay slots anymore.  Probably because the 'fpu_switch' pattern now has too many sets and the DBR rejects that as a slot candidate.  The other thing are the failures in gcc.target/sh w.r.t. interrupt functions, which can be addressed later.

Kaz, could you please have an early look at it?
Comment 24 Oleg Endo 2014-10-15 01:00:00 UTC
Created attachment 33717 [details]
Using virtual FPSCR registers to model insn dependencies

sh_emit_set_t_insn was producing nested parallel patterns.

When compiling libgcc (for -m4 -mb) memory consumption goes up a lot.  Either there is a bug in the patch, or it's triggering a bug somewhere else which seems to cause an infinite alloc cycle somewhere.
Comment 25 Kazumoto Kojima 2014-10-15 04:02:35 UTC
(In reply to Oleg Endo from comment #23)
> Kaz, could you please have an early look at it?

The idea looks OK to me.  Build fails on sh4-linux with the patch, though.
Maybe a wrong version?
There is a typo divdf3 expand.  s/gen_divdf3 /gen_divdf3_i /

-      expand_df_binop (&gen_divdf3_i, operands);
+      emit_insn (gen_divdf3 (operands[0], operands[1], operands[2]));

With fixing it, yet segfaults during compiling libgcc2.c __powidf2 at

#0  0x089706b4 in sh_adjust_cost (insn=0xb7f767e0, link=0xb7f79380, 
    dep_insn=0xb7f762d0, cost=25) at trunk/gcc/config/sh/sh.c:10908
10908                                         SET_SRC (use_pat)))
(gdb) l
10903                cycle earlier.  */
10904             else if (reload_completed
10905                      && get_attr_dfp_comp (dep_insn) == DFP_COMP_YES
10906                      && (use_pat = single_set (insn))
10907                      && ! regno_use_in (REGNO (SET_DEST (single_set (dep_insn))),
10908                                         SET_SRC (use_pat)))

It looks

-   (use (match_operand:PSI 3 "fpscr_operand" "c"))]
+   (set (reg:SI FPSCR_STAT_REG) (const_int 0))
+   (use (reg:SI FPSCR_MODES_REG))]

breaks the assumption of sh_adjust_cost which divdf3_i has only one
set insn.
Comment 26 Oleg Endo 2014-10-15 08:41:05 UTC
(In reply to Kazumoto Kojima from comment #25)
> (In reply to Oleg Endo from comment #23)
> > Kaz, could you please have an early look at it?
> 
> The idea looks OK to me.  Build fails on sh4-linux with the patch, though.
> Maybe a wrong version?
> There is a typo divdf3 expand.  s/gen_divdf3 /gen_divdf3_i /
> 
> -      expand_df_binop (&gen_divdf3_i, operands);
> +      emit_insn (gen_divdf3 (operands[0], operands[1], operands[2]));
> 
> With fixing it, yet segfaults during compiling libgcc2.c __powidf2 at

No, that was the correct version unfortunately.  And that was my infinite-memory problem.  Thanks for spotting it.

> 
> #0  0x089706b4 in sh_adjust_cost (insn=0xb7f767e0, link=0xb7f79380, 
>     dep_insn=0xb7f762d0, cost=25) at trunk/gcc/config/sh/sh.c:10908
> 10908                                         SET_SRC (use_pat)))
> (gdb) l
> 10903                cycle earlier.  */
> 10904             else if (reload_completed
> 10905                      && get_attr_dfp_comp (dep_insn) == DFP_COMP_YES
> 10906                      && (use_pat = single_set (insn))
> 10907                      && ! regno_use_in (REGNO (SET_DEST (single_set
> (dep_insn))),
> 10908                                         SET_SRC (use_pat)))
> 
> It looks
> 
> -   (use (match_operand:PSI 3 "fpscr_operand" "c"))]
> +   (set (reg:SI FPSCR_STAT_REG) (const_int 0))
> +   (use (reg:SI FPSCR_MODES_REG))]
> 
> breaks the assumption of sh_adjust_cost which divdf3_i has only one
> set insn.

Yep, single_set now returns null for basically all fp insns and it's not checked there.  The default implementation of single_set is also the reason why delay slot stuffing is not working on fp insns anymore.
Comment 27 Oleg Endo 2014-10-15 11:28:44 UTC
Created attachment 33721 [details]
Using virtual FPSCR registers to model insn dependencies

Updated patch that avoids the single_set problems by using (clobber (reg:SI FPSCR_STAT_REG)) instead of a set.  This also eliminates the fsca pattern changes in the previous patch.  Since the 'fpu_switch' insn is still a multiple set insn, it won't be used for delay slot stuffing, but this is a minor issue that can be addressed later.  I'm testing the patch now on sh-sim.  At least 'make all' works.
Comment 28 Oleg Endo 2014-10-15 17:57:07 UTC
Created attachment 33727 [details]
Using virtual FPSCR registers to model insn dependencies

(In reply to Oleg Endo from comment #27)
> Created attachment 33721 [details]
> Using virtual FPSCR registers to model insn dependencies
> 
> Updated patch that avoids the single_set problems by using (clobber (reg:SI
> FPSCR_STAT_REG)) instead of a set.  This also eliminates the fsca pattern
> changes in the previous patch.  Since the 'fpu_switch' insn is still a
> multiple set insn, it won't be used for delay slot stuffing, but this is a
> minor issue that can be addressed later.  I'm testing the patch now on
> sh-sim.  At least 'make all' works.

Testing for '-m4 -ml' and '-m4 -mb' shows one new failure (ignoring the ISR failures):

FAIL: gcc.c-torture/execute/pr28982a.c   -O1  (internal compiler error)
FAIL: gcc.c-torture/execute/pr28982a.c   -O1  (test for excess errors)

The problem is the define_split and the peephole2 patterns below the "fpu_switch" insn.  I don't know how/if that was working before.  I've removed the peephole2 pattern and rewrote the split pattern, which fixes the failure above.  I'll re-test the whole thing again.
Comment 29 Oleg Endo 2014-10-15 22:07:57 UTC
(In reply to Oleg Endo from comment #28)
> Created attachment 33727 [details]
> Using virtual FPSCR registers to model insn dependencies
> 
> The problem is the define_split and the peephole2 patterns below the
> "fpu_switch" insn.  I don't know how/if that was working before.  I've
> removed the peephole2 pattern and rewrote the split pattern, which fixes the
> failure above.  I'll re-test the whole thing again.

With this patch there are no new failures for -m4 -ml and -m4 -mb here, except the ISR failures.  I'll also test it for the other combinations later.  I'd like to get the current sh4 working version first.  Then fix other niche problems with ISRs or SH2E.  Kaz, what do you think?
Comment 30 Kazumoto Kojima 2014-10-16 01:13:19 UTC
(In reply to Oleg Endo from comment #29)
> With this patch there are no new failures for -m4 -ml and -m4 -mb here,
> except the ISR failures.  I'll also test it for the other combinations
> later.  I'd like to get the current sh4 working version first.  Then fix
> other niche problems with ISRs or SH2E.  Kaz, what do you think?

Sounds reasonable.  Please go ahead.
Comment 31 Oleg Endo 2014-10-16 10:59:07 UTC
Author: olegendo
Date: Thu Oct 16 10:58:36 2014
New Revision: 216307

URL: https://gcc.gnu.org/viewcvs?rev=216307&root=gcc&view=rev
Log:
gcc/
	PR target/53513
	* config/sh/sh-protos.h (emit_sf_insn, emit_df_insn, expand_sf_unop,
	expand_sf_binop, expand_df_unop, expand_df_binop): Remove.

	* config/sh/sh.c (sh_emit_set_t_insn): Adjust generated insn pattern
	to match fp insn patterns.
	(calc_live_regs): Add FPSCR_MODES_REG and FPSCR_STAT_REG to the ignore
	list.
	(emit_sf_insn, emit_df_insn, expand_sf_unop, expand_sf_binop,
	expand_df_unop, expand_df_binop): Remove.
	(sh_conditional_register_usage): Mark FPSCR_MODES_REG and
	FPSCR_STAT_REG as not call clobbered.
	(sh_emit_mode_set): Emit fpscr store-modify-load sequence instead of
	invoking fpscr_set_from_mem.

	* config/sh/sh.h (MAX_REGISTER_NAME_LENGTH): Increase to 6.
	(SH_REGISTER_NAMES_INITIALIZER): Add names for FPSCR_MODES_REG and
	FPSCR_STAT_REG.
	(REGISTER_NAMES): Adjust.
	(SPECIAL_REGISTER_P): Add FPSCR_MODES_REG and FPSCR_STAT_REG.
	(FIRST_PSEUDO_REGISTER): Increase to 156.
	(DWARF_FRAME_REGISTERS): Define as 153 to keep the original value.
	(FIXED_REGISTERS, CALL_USED_REGISTERS): Add FPSCR_MODES_REG and
	FPSCR_STAT_REG.
	(REG_CLASS_CONTENTS): Adjust ALL_REGS bit mask to include
	FPSCR_MODES_REG and FPSCR_STAT_REG.
	(REG_ALLOC_ORDER): Add FPSCR_MODES_REG and FPSCR_STAT_REG.

	* config/sh/sh.md (FPSCR_MODES_REG, FPSCR_STAT_REG, FPSCR_PR,
	FPSCR_SZ): Add new constants.
	(UNSPECV_FPSCR_MODES, UNSPECV_FPSCR_STAT): Add new unspecv constants.

	(movpsi): Use TARGET_FPU_ANY condition, invoke gen_fpu_switch.
	(fpu_switch): Add use and set of FPSCR_STAT_REG and FPSCR_MODES_REG.
	Use TARGET_FPU_ANY condition.
	(fpu_switch peephole2): Remove.
	(fpu_switch split): Use simple_mem_operand to capture the mem and
	adjust split implementation.
	(extend_psi_si, truncate_si_psi): New insns.
	(toggle_sz, toggle_pr): Use FPSCR_SZ, FPSCR_PR constants.  Add
	set of FPSCR_MODES_REG.

	(push_e, push_4, pop_e, pop_4, movdf_i4, reload_indf__frn, movsf_ie,
	reload_insf__frn, force_mode_for_call, calli, calli_tbr_rel,
	calli_pcrel, call_pcrel, call_compact, call_compact_rettramp,
	call_valuei, call_valuei_tbr_rel, call_valuei_pcrel, call_value_pcrel,
	call_value_compact, call_value_compact_rettramp, call,
	call_pop_compact, call_pop_compact_rettramp, call_value, sibcalli,
	sibcalli_pcrel, sibcalli_thunk, sibcall_pcrel, sibcall_compact,
	sibcall, sibcall_valuei, sibcall_valuei_pcrel, sibcall_value_pcrel,
	sibcall_value_compact, sibcall_value, call_value_pop_compact,
	call_value_pop_compact_rettramp, various unnamed splits):
	Replace use of FPSCR_REG with use of FPSCR_MODES_REG.  Adjust gen_*
	function uses.

	(floatsisf2_i4, *floatsisf2_ie): Merge into floatsisf2_i4.
	(fix_truncsfsi2_i4, *fixsfsi): Merge into fix_truncsfsi2_i4.
	(cmpgtsf_t, cmpgtsf_t_i4): Merge into cmpgtsf_t.
	(cmpeqsf_t, cmpeqsf_t_i4): Merge into cmpeqsf_t.
	(ieee_ccmpeqsf_t, *ieee_ccmpeqsf_t_4): Merge into ieee_ccmpeqsf_t.

	(udivsi3_i4, divsi3_i4, addsf3_i, subsf3_i, mulsf3_i, fmasf4_i,
	*fmasf4, divsf3_i, floatsisf2_i4, fix_truncsfsi2_i4, cmpgtsf_t,
	cmpeqsf_t, ieee_ccmpeqsf_t, sqrtsf2_i, rsqrtsf2, fsca, adddf3_i,
	subdf3_i, muldf3_i, divdf3_i, floatsidf2_i, fix_truncdfsi2_i,
	cmpgtdf_t, cmpeqdf_t, *ieee_ccmpeqdf_t, sqrtdf2_i, extendsfdf2_i4,
	truncdfsf2_i4): Replace use of FPSCR_REG with clobber of FPSCR_STAT_REG
	and use of FPSCR_MODES_REG.  Adjust gen_* function uses.

gcc/testsuite/
	PR target/53513
	* gcc.target/sh/pr54680.c: Adjust matching of lds insn.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh-protos.h
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.h
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/pr54680.c
Comment 32 Oleg Endo 2014-10-16 12:12:14 UTC
As a next step, I'd like to add built-in functions to get/set the FPSCR on SH.

We have the old __get_fpscr and __set_fpscr functions in libgcc which modify the __fpscr_values array and the FPSCR.

One idea would be to make __get_fpscr and __set_fpscr built-in functions, so that new code never generates the respective library calls.  I don't know how expected/unexpected that would be for users and their existing code.
For example, the current glibc does this in sysdeps/sh/sh4/fpu/fpu_control.h:

#if defined __GNUC__
__BEGIN_DECLS

/* GCC provides this function.  */
extern void __set_fpscr (unsigned long);
#define _FPU_SETCW(cw) __set_fpscr ((cw))
#else
#define _FPU_SETCW(cw) __asm__ ("lds %0,fpscr" : : "r" (cw))
#endif

By making __set_fpscr a builtin it would automagically work, but would never update the __fpscr_values.  I'm not sure what kind of other fpscr related work arounds build on top of that.

Alternatively, we could name the built-in functions differently.  I've briefly checked the docs of other targets, this is what I've found:

aarch64:
     unsigned int __builtin_aarch64_get_fpcr ()
     void __builtin_aarch64_set_fpcr (unsigned int)
     unsigned int __builtin_aarch64_get_fpsr ()
     void __builtin_aarch64_set_fpsr (unsigned int)

ARM:
    unsigned int __builtin_arm_get_fpscr ()
     void __builtin_arm_set_fpscr (unsigned int)

MIPS:
    unsigned int __builtin_mips_get_fcsr (void)
    void __builtin_mips_set_fcsr (unsigned int value)

I'd propose the following for SH:

    unsigned int __builtin_sh_get_fpscr ()
    void __builtin_sh_set_fpscr (unsigned int)

Any opinions or feedback regarding the matter?
Comment 33 Kazumoto Kojima 2014-10-16 13:28:19 UTC
(In reply to Oleg Endo from comment #32)
> I'd propose the following for SH:
> 
>     unsigned int __builtin_sh_get_fpscr ()
>     void __builtin_sh_set_fpscr (unsigned int)
> 
> Any opinions or feedback regarding the matter?

Looks fine to me.
Comment 34 Oleg Endo 2014-10-16 18:57:34 UTC
Test run for 

  -m2e -ml, -m2a -mb, -m2a-single -mb, -m4-single-ml

has finished and shows no new failures (except the ISR stuff).
Comment 35 Oleg Endo 2014-10-17 09:22:45 UTC
Created attachment 33745 [details]
Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr

So I ended up removing the usage of PSImode for FPSCR and using SImode instead.  I've tested this patch on r216173, together with the already applied attachment 33727 [details] patch, with -m4 -ml and -m4 -mb.  There is one new failure:

FAIL: gcc.c-torture/execute/20021120-1.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  (internal compiler error)

error: insn does not satisfy its constraints:
(insn 1491 1489 1176 5 (set (reg:SI 12 r12)
        (plus:SI (reg:SI 13 r13)
            (const_int 24 [0x18]))) 20021120-1.c:39 76 {*addsi3_compact}
     (nil))


20021120-1.c:58:1: internal compiler error: in reload_cse_simplify_operands, at postreload.c:415
0x8565017 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	../../gcc-trunk2/gcc/rtl-error.c:110
0x8565055 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	../../gcc-trunk2/gcc/rtl-error.c:121
0x850c0cc reload_cse_simplify_operands
	../../gcc-trunk2/gcc/postreload.c:415
0x850e5ea reload_cse_simplify
	../../gcc-trunk2/gcc/postreload.c:127
0x850e5ea reload_cse_regs_1
	../../gcc-trunk2/gcc/postreload.c:224
0x850e6a2 reload_cse_regs
	../../gcc-trunk2/gcc/postreload.c:72
0x850e6a2 execute
	../../gcc-trunk2/gcc/postreload.c:2348


At that place, reload it's trying to stuff with the fpu_switch insn:

(insn 293 1480 1173 5 (parallel [
            (set (mem/c:DF (reg:SI 12 r12) [4 %sfp+-344 S8 A32])
                (reg:DF 70 fr6))
            (use (reg:SI 154 fpscr0))
            (clobber (scratch:SI))
        ]) 20021120-1.c:39 289 {movdf_i4}
     (nil))
(insn 1173 293 1483 5 (parallel [
            (set (reg:SI 12 r12)
                (reg/v:SI 151 ))
            (use (reg:SI 155 fpscr1))
            (use (reg:SI 154 fpscr0))
            (set (reg:SI 155 fpscr1)
                (unspec_volatile:SI [
                        (const_int 0 [0])
                    ] UNSPECV_FPSCR_STAT))
            (set (reg:SI 154 fpscr0)
                (unspec_volatile:SI [
                        (const_int 0 [0])
                    ] UNSPECV_FPSCR_MODES))
        ]) 20021120-1.c:39 434 {fpu_switch}
     (nil))
(insn 1483 1173 1484 5 (set (reg:SI 13 r13)
        (const_int 252 [0xfc])) 20021120-1.c:39 258 {movsi_ie}
     (nil))
(insn 1484 1483 1485 5 (set (reg:SI 13 r13)
        (plus:SI (reg:SI 13 r13)
            (reg/f:SI 15 r15))) 20021120-1.c:39 76 {*addsi3_compact}
     (expr_list:REG_EQUIV (plus:SI (reg/f:SI 15 r15)
            (const_int 252 [0xfc]))
        (nil)))
(insn 1485 1484 1174 5 (set (mem/c:SI (plus:SI (reg:SI 13 r13)
                (const_int 24 [0x18])) [4 %sfp+-76 S4 A32])
        (reg:SI 12 r12)) 20021120-1.c:39 258 {movsi_ie}
     (nil))
(note 1174 1485 1490 5 NOTE_INSN_DELETED)
(insn 1490 1174 1175 5 (set (reg:SI 13 r13)
        (const_int 524288 [0x80000])) 20021120-1.c:39 258 {movsi_ie}
     (nil))
(insn 1175 1490 1487 5 (set (reg:SI 12 r12)
        (xor:SI (reg:SI 12 r12)
            (reg:SI 13 r13))) 20021120-1.c:39 138 {*xorsi3_compact}
     (nil))
(insn 1487 1175 1488 5 (set (reg:SI 13 r13)
        (const_int 252 [0xfc])) 20021120-1.c:39 258 {movsi_ie}
     (nil))
(insn 1488 1487 1489 5 (set (reg:SI 13 r13)
        (plus:SI (reg:SI 13 r13)
            (reg/f:SI 15 r15))) 20021120-1.c:39 76 {*addsi3_compact}
     (expr_list:REG_EQUIV (plus:SI (reg/f:SI 15 r15)
            (const_int 252 [0xfc]))
        (nil)))
(insn 1489 1488 1491 5 (set (mem/c:SI (plus:SI (reg:SI 13 r13)
                (const_int 24 [0x18])) [4 %sfp+-76 S4 A32])
        (reg:SI 12 r12)) 20021120-1.c:39 258 {movsi_ie}
     (nil))
(insn 1491 1489 1176 5 (set (reg:SI 12 r12)
        (plus:SI (reg:SI 13 r13)
            (const_int 24 [0x18]))) 20021120-1.c:39 76 {*addsi3_compact}
     (nil))
(insn 1176 1491 1496 5 (parallel [
            (set (reg/v:SI 151 )
                (mem/c:SI (reg:SI 12 r12) [4 %sfp+-76 S4 A32]))
            (use (reg:SI 155 fpscr1))
            (use (reg:SI 154 fpscr0))
            (set (reg:SI 155 fpscr1)
                (unspec_volatile:SI [
                        (const_int 0 [0])
                    ] UNSPECV_FPSCR_STAT))
            (set (reg:SI 154 fpscr0)
                (unspec_volatile:SI [
                        (const_int 0 [0])
                    ] UNSPECV_FPSCR_MODES))
        ]) 20021120-1.c:39 434 {fpu_switch}
     (nil))
(insn 1496 1176 1497 5 (set (reg:SI 12 r12)
        (const_int 16 [0x10])) 20021120-1.c:39 258 {movsi_ie}
     (nil))


Summary:

insn 1173 stores the fpscr to r12

the mode switch (xor op) is done on r12

insn 1489 writes r12 (new fpscr value) on the stack

insn 1491 tries to recalculate the stack address and put it into r12, to satisfy the memory constraints of fpu_switch (which now accepts post-inc load and simple register address).

insn 1776 tries to load it from the stack


Reload generates a wrong insn addsi3 insn.  Something similar was happening in PR 55212, too.
Comment 36 Oleg Endo 2014-10-17 13:30:30 UTC
(In reply to Oleg Endo from comment #35)
> Created attachment 33745 [details]
> Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr
> 
> So I ended up removing the usage of PSImode for FPSCR and using SImode
> instead.  I've tested this patch on r216173, together with the already
> applied attachment 33727 [details] patch, with -m4 -ml and -m4 -mb.  There
> is one new failure:
> 
> FAIL: gcc.c-torture/execute/20021120-1.c   -O2 -flto -fuse-linker-plugin
> -fno-fat-lto-objects  (internal compiler error)
> 
> error: insn does not satisfy its constraints:
> (insn 1491 1489 1176 5 (set (reg:SI 12 r12)
>         (plus:SI (reg:SI 13 r13)
>             (const_int 24 [0x18]))) 20021120-1.c:39 76 {*addsi3_compact}
>      (nil))
> 
> 
> Summary:
> 
> insn 1173 stores the fpscr to r12
> 
> the mode switch (xor op) is done on r12
> 
> insn 1489 writes r12 (new fpscr value) on the stack
> 
> insn 1491 tries to recalculate the stack address and put it into r12, to
> satisfy the memory constraints of fpu_switch (which now accepts post-inc
> load and simple register address).
> 
> insn 1776 tries to load it from the stack
> 
> 
> Reload generates a wrong insn addsi3 insn.  Something similar was happening
> in PR 55212, too.

Disallowing memory operands (except pre-dec store, post-inc load for push,pop) in the fpu_switch pattern fixes that failure.  Although it might result in less optimal code in some cases where user code wants to store the current fpscr in memory.  On the other hand, the generated memory address code in such cases is not very good anyway.  Thus it's probably better to drop that for now.
Comment 37 Oleg Endo 2014-10-17 17:41:47 UTC
Created attachment 33751 [details]
Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr

I'm now testing this patch.  It removes The PSImode for FPSCR completely and splits the fpu_switch insn into two separate lds_fpscr and sts_fpscr insns.  This allows relaxing the insn dependencies.  Also now that sts_fpscr becomes a single-set insn, it can be stuffed into delay slots.
I've also removed the alternatives for FPSCR <-> memory, except the pre-dec and post-inc ones, which are needed for push and pop insns.  As a consequence, a user initiated __builtin_sh_get_fpscr () to a memory will always be ferried though a general register.  There is some room for improvement, but it goes into the direction of address-mode-selection optimization, which can be done later.

When doing a __builtin_sh_set_fpscr (value) the compiler will always insert code to preserve the current FPSCR FR, SZ, PR mode bits.  This always involves getting the current FPSCR into a general register first and then loading FPSCR from a general register.  Thus we can omit FPSCR loads from memory for now.
Comment 38 Oleg Endo 2014-10-17 22:20:51 UTC
(In reply to Oleg Endo from comment #37)
> Created attachment 33751 [details]
> Use SImode for FPSCR, add __builtin_sh_get_fpscr, __builtin_sh_set_fpscr
> 
> I'm now testing this patch.

... there are no new failures for -m4 -ml and -m4 -mb.  I'm tempted to apply it.  Kaz, do you have any objections?
Comment 39 Kazumoto Kojima 2014-10-17 23:16:30 UTC
(In reply to Oleg Endo from comment #38)
> ... there are no new failures for -m4 -ml and -m4 -mb.  I'm tempted to apply
> it.  Kaz, do you have any objections?

I have no objection.
Comment 40 Oleg Endo 2014-10-18 10:51:40 UTC
Author: olegendo
Date: Sat Oct 18 10:51:08 2014
New Revision: 216424

URL: https://gcc.gnu.org/viewcvs?rev=216424&root=gcc&view=rev
Log:
gcc/
	PR target/53513
	* config/sh/sh-modes.def (PSI): Remove.
	* config/sh/sh-protos.h (get_fpscr_rtx): Remove.
	* config/sh/sh.c (fpscr_rtx, get_fpscr_rtx): Remove.
	(sh_reorg): Remove commented out FPSCR code.
	(fpscr_set_from_mem): Use SImode instead of PSImode.  Emit lds_fpscr
	insn instead of move insn.
	(sh_hard_regno_mode_ok): Return SImode for FPSCR.
	(sh_legitimate_address_p, sh_legitimize_reload_address): Remove PSImode
	handling.
	(sh_emit_mode_set): Emit lds_fpscr and sts_fpscr insns.
	(sh1_builtin_p): Uncomment.
	(SH_BLTIN_UV 25, SH_BLTIN_VU 26): New macros.
	(bdesc): Add __builtin_sh_get_fpscr and __builtin_sh_set_fpscr.
	* config/sh/sh/predicates.md (fpscr_operand): Simplify.
	(fpscr_movsrc_operand, fpscr_movdst_operand): New predicates.
	(general_movsrc_operand, general_movdst_operand): Disallow
	fpscr_operand.
	* config/sh/sh.md (FPSCR_FR): New constant.
	(push_fpscr): Emit sts_fpscr insn.
	(pop_fpscr): Emit lds_fpscr_insn.
	(movsi_ie): Disallow FPSCR operands.
	(fpu_switch, unnamed related split, extend_psi_si,
	truncate_si_psi): Remove insns.
	(lds_fpscr, sts_fpscr): New insns.
	(toggle_sz, toggle_pr): Use SImode for FPSCR_REG instead of PSImode.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/predicates.md
    trunk/gcc/config/sh/sh-modes.def
    trunk/gcc/config/sh/sh-protos.h
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.md
Comment 41 Oleg Endo 2014-12-07 23:20:31 UTC
Author: olegendo
Date: Sun Dec  7 23:19:59 2014
New Revision: 218472

URL: https://gcc.gnu.org/viewcvs?rev=218472&root=gcc&view=rev
Log:
gcc/testsuite/
	PR target/53513
	* gcc.target/sh/pr54602-4.c: Fix matching of rte-nop sequence.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/pr54602-4.c
Comment 42 Oleg Endo 2014-12-10 00:22:08 UTC
Author: olegendo
Date: Wed Dec 10 00:21:36 2014
New Revision: 218551

URL: https://gcc.gnu.org/viewcvs?rev=218551&root=gcc&view=rev
Log:
gcc/
	PR target/53513
	* doc/extend.texi (__builtin_sh_get_fpscr, __builtin_sh_get_fpscr):
	Document it.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/extend.texi
Comment 43 Oleg Endo 2014-12-10 08:32:03 UTC
Author: olegendo
Date: Wed Dec 10 08:31:32 2014
New Revision: 218563

URL: https://gcc.gnu.org/viewcvs?rev=218563&root=gcc&view=rev
Log:
gcc/
	PR target/53513
	* doc/extend.texi (__builtin_sh_set_fpscr): Fix typo.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/extend.texi
Comment 44 Oleg Endo 2014-12-13 13:18:27 UTC
Author: olegendo
Date: Sat Dec 13 13:17:55 2014
New Revision: 218707

URL: https://gcc.gnu.org/viewcvs?rev=218707&root=gcc&view=rev
Log:
gcc/testsuite/
	PR target/53513
	* gcc.target/sh/attr-isr-nosave_low_regs.c: Fix matching of expected
	register push/pop sequences.
	* gcc.target/sh/attr-isr.c: Likewise.
	* gcc.target/sh/attr-isr-trapa.c: Likewise.
	* gcc.target/sh/pragma-isr-nosave_low_regs.c: Likewise.
	* gcc.target/sh/pragma-isr-trapa.c: Likewise.
	* gcc.target/sh/pragma-isr-trapa2.c: Likewise.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/attr-isr-nosave_low_regs.c
    trunk/gcc/testsuite/gcc.target/sh/attr-isr-trapa.c
    trunk/gcc/testsuite/gcc.target/sh/attr-isr.c
    trunk/gcc/testsuite/gcc.target/sh/pragma-isr-nosave_low_regs.c
    trunk/gcc/testsuite/gcc.target/sh/pragma-isr-trapa.c
    trunk/gcc/testsuite/gcc.target/sh/pragma-isr-trapa2.c
Comment 45 Oleg Endo 2014-12-14 14:00:03 UTC
From my point of view, this PR can be closed as fixed.  The ISR related failures were just issues w.r.t. how the the expected code sequences are matched in the test cases.
I've created followup PR 64305 and PR 64299, which are issues that can be addressed separately.
Comment 46 Oleg Endo 2014-12-16 21:29:32 UTC
Author: olegendo
Date: Tue Dec 16 21:28:59 2014
New Revision: 218793

URL: https://gcc.gnu.org/viewcvs?rev=218793&root=gcc&view=rev
Log:
gcc/testsuite/
	PR target/53513
	* gcc.target/sh/fpchg.c: Rename to ...
	* gcc.target/sh/pr53513-1.c: ... this.  Adjust test case to work for
	-m4a and -m4a-single.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr53513-1.c
Removed:
    trunk/gcc/testsuite/gcc.target/sh/fpchg.c
Modified:
    trunk/gcc/testsuite/ChangeLog
Comment 47 Oleg Endo 2014-12-21 17:53:50 UTC
Fixed/implemented for GCC 5.