[Bug target/56592] New: [SH] Add vector ABI

Sun Mar 10 23:43:00 GMT 2013

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56592

             Bug #: 56592
           Summary: [SH] Add vector ABI
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: olegendo@gcc.gnu.org
                CC: kkojima@gcc.gnu.org
            Target: sh*-*-*

On SH there are a couple of ABI related issues which unfortunately cannot all
be fixed without breaking binary compatibility.  Hence the idea to add a new
ABI which can be selected by a target option -mabi=vector.  The already
existing ABIs could also be selected through this option:
-mrenesas -> -mabi=renesas
-mnorenesas -> -mabi=gnu

Some of the primary issues that the vector ABI is supposed to improve are:

----------------------
PR 13423
sh-elf: V4SFmode passed in integer registers

Float vectors, float arrays (of fixed size) or structs of floats, when passed
by value, should be passed in FP regs entirely.  The current ABI allows passing
of up to 8 FP regs (FR4..FR11), so there would be space to pass two 4D float
vectors.  It should also be possible to return a 4D float vector in registers.
Since FR0..FR11 are call clobbered, they can as well be used to return multiple
vectors.
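
For illustration, a function of the kind this PR is about could look like the
sketch below.  It relies on the GCC vector_size extension; the names 'v4sf' and
'dot4' are made up for this example and are not taken from the PR.

```c
/* 4D float vector using the GCC vector extension (16 bytes = 4 floats).  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Both vectors are passed by value.  Under the proposed ABI 'a' would
   arrive in FR4..FR7 and 'b' in FR8..FR11 instead of integer registers.  */
float
dot4 (v4sf a, v4sf b)
{
  v4sf p = a * b;   /* element-wise multiply */
  return p[0] + p[1] + p[2] + p[3];
}
```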

----------------------
PR 53513
SH Target: Add support for fschg and fpchg insns

Although this PR could be solved without breaking the ABI too much, there are
some issues which could be fixed in a new ABI.
The current approach is to use two global variables (__fpscr_values) in order
to perform FPU single/double mode switching.  The default FPU precision setting
is defined by an -m option.  Currently there are three such FPU default modes:
- double mode default
- single mode default
- single mode only

When changing the FPU mode the current FPSCR setting is overwritten with one of
the global values from __fpscr_values.  This is the fastest way (on non-SH4A)
to perform a mode switch, but it has some disadvantages.  One of them is PR
6526.  In general all information in FPSCR is lost after performing a mode
switch this way, e.g. it is not possible to read FPU exception causes after a
series of operations.  Moreover, in multi-threaded environments it is not
possible to set the default FPSCR setting (e.g. rounding mode or denormal
handling) for threads independently.

In order to minimize mode switches, the function signature can be taken into
account when deciding the default FPU precision for a particular function.
E.g. when a function has any double precision arguments, it can be assumed
that the function will use the double values in some way.  Thus the default
entry mode for such a function should be 'double'.  Similarly, for functions
that return double values it can be better to leave the function in 'double'
mode.

Because of this, '-m4 -mabi=vector' and '-m4-single -mabi=vector' would
actually result in the same ABI.
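
The information loss caused by overwriting FPSCR can be shown with a small
host-side model.  This is only a sketch: the table 'fpscr_values' is an
illustrative stand-in for __fpscr_values (its real contents are not given
here), the function names are made up, and the bit positions assume the SH4
FPSCR layout (PR = bit 19, rounding mode = bits 0..1).

```c
#include <stdint.h>

#define FPSCR_PR      (1u << 19)   /* precision bit (assumed SH4 layout) */
#define FPSCR_RM_MASK 0x3u         /* rounding mode field, bits 0..1 */

/* Illustrative stand-in for __fpscr_values: one value per precision mode.  */
static const uint32_t fpscr_values[2] = { 0u, FPSCR_PR };

/* Current approach: the whole register is replaced, so the rounding mode,
   exception flags etc. that were set before the switch are lost.  */
static uint32_t
switch_by_overwrite (uint32_t fpscr, int want_double)
{
  (void) fpscr;   /* the old contents are not used at all */
  return fpscr_values[want_double ? 1 : 0];
}

/* Alternative: change only the PR bit and keep all other bits.  */
static uint32_t
switch_by_pr_bit (uint32_t fpscr, int want_double)
{
  return want_double ? (fpscr | FPSCR_PR) : (fpscr & ~FPSCR_PR);
}
```

Starting from an FPSCR value with a non-default rounding mode, the first
function drops the rounding mode bits while the second one keeps them.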

It should also be possible to override the FPSCR.PR settings for function entry
and function leave via function attributes.  This can be useful e.g. when hand
written asm FPU routines that expect certain settings are invoked from C/C++
code.  E.g. code that uses the 'frchg' insn to flip the FPSCR.FR bit on SH4
must be executed with FPSCR.PR = 0.

----------------------
PR 52441
Target: Double sign/zero extensions for function arguments

Values that are < 32 bit in size usually have undefined high bits when passed
in registers.  The standard GNU calling convention thus performs sign/zero
extension of such values both before the function call and inside the function
itself.  The Renesas calling convention (-mrenesas) however only
extends values inside the function.  Whether an extension is actually required
at all depends on how the value is used.  This is known only inside of a
function.  Thus adopting the Renesas calling convention in this case is more
efficient.
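
A rough host-side model of the callee-only extension might look as follows.
The uint32_t parameter stands for the raw contents of R4 with undefined upper
bits; all function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low 8 bits of a register image.  The upper 24 bits of
   'reg' are treated as undefined and are ignored.  */
static int32_t
sext8 (uint32_t reg)
{
  uint32_t low = reg & 0xFFu;
  return (int32_t) (low ^ 0x80u) - 0x80;
}

/* Callee that needs the full value: it extends once, inside the function.  */
static int32_t
callee_add_one (uint32_t reg_r4)
{
  return sext8 (reg_r4) + 1;
}

/* Callee that only stores the low byte: no extension is needed at all.  */
static void
callee_store_byte (uint32_t reg_r4, uint8_t *dst)
{
  *dst = (uint8_t) reg_r4;
}
```

In the second case an extension done by the caller would be wasted work, which
is the point made above.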

----------------------
Register ordering for arguments.

I don't remember in which PR this was mentioned, but the current GNU calling
convention allocates FR registers on big endian like this:
FR4 = arg0
FR5 = arg1
FR6 = arg2
FR7 = arg3
...

and on little endian:
FR4 = arg1
FR5 = arg0
FR6 = arg3
FR7 = arg2
...

This can make writing endian neutral asm code more complicated.  The ordering
for little endian should be the same as for big endian (which is also
equivalent to the -mrenesas ABI).
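
The two tables above can be condensed into a small mapping function.  This is
a sketch only; the function names are made up, and it assumes that the pairwise
swap seen for arg0..arg3 continues in the same way for the remaining registers.

```c
/* FR register number that receives the i-th single precision argument
   under the current GNU convention, following the tables above.  */
static int
gnu_fr_for_arg (int i, int little_endian)
{
  /* On little endian the two registers of each pair are swapped.  */
  return 4 + (little_endian ? (i ^ 1) : i);
}

/* Proposed ordering: identical for both endian settings.  */
static int
proposed_fr_for_arg (int i)
{
  return 4 + i;
}
```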

----------------------
Alignment of double precision FP values.

Currently the default alignment for those is 32 bit and can be changed to 64
bit by the option -mdalign.  In order to be able to maximize the utilization of
64 bit fmov insns, 64 bit double alignment should be the default.
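
The effect of 64 bit double alignment on data layout can be sketched on any
host by forcing the alignment with C11 _Alignas.  The struct and function names
are made up, and the example assumes a 4 byte 'int' and an 8 byte 'double'.

```c
#include <stddef.h>

/* Layout with 64 bit double alignment forced via C11 _Alignas, roughly
   what the proposed default would give on SH.  */
struct aligned64
{
  int i;
  _Alignas (8) double d;
};

/* Byte offset of the double member.  A multiple of 8 means that a single
   64 bit access can be used when the struct itself is 8 byte aligned.  */
static size_t
double_offset (void)
{
  return offsetof (struct aligned64, d);
}
```

With 32 bit alignment the member 'd' could end up at offset 4, in which case a
64 bit fmov could not be used for it.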

----------------------
Boolean function return values

A boolean return value of a function tends to be produced inside the function
by using some sort of comparison insns which store the comparison result in the
T bit.  The T bit is then transferred to a GP reg before returning from the
function.  On the caller side, the value in the GP reg is then often tested for
!= 0 followed by a conditional branch.  The redundant != 0 test can be
eliminated by returning boolean values in the T bit directly.  However, there
might be compatibility problems with C code that typedefs its own bool type as
signed/unsigned char or something else.
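
The typical code pattern looks like the sketch below (function names made up,
and assuming that the callee is not inlined into the caller):

```c
#include <stdbool.h>

/* The comparison result is naturally produced in the T bit.  */
static bool
is_less (int a, int b)
{
  return a < b;
}

/* The caller immediately branches on the result.  With the bool returned in
   a GP reg this needs an extra '!= 0' test; with the bool returned in the
   T bit the conditional branch could use T directly.  */
static int
min_of (int a, int b)
{
  if (is_less (a, b))
    return a;
  return b;
}
```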

----------------------
Variadic functions

Passing a variable number of arguments ('...') over the stack, as it is
currently done with -mrenesas, tends to produce more efficient code, especially
when traversing the va_list.
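
As a reminder of what va_list traversal means on the C side, here is a minimal
variadic function (the name 'sum_ints' is made up for this example):

```c
#include <stdarg.h>

/* Sums 'count' int arguments.  If all variadic arguments live on the stack,
   each va_arg can amount to a load through a pointer plus an increment.  */
static int
sum_ints (int count, ...)
{
  va_list ap;
  int total = 0;

  va_start (ap, count);
  for (int i = 0; i < count; i++)
    total += va_arg (ap, int);
  va_end (ap);
  return total;
}
```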

----------------------
ABI summary I've got so far

R0..R3:      Call-clobbered.
             Function return values / scratch registers.
             High bits of values < 32 bit are undefined.

R4..R7:      Call-clobbered.
             Function arguments / scratch registers.
             High bits of values < 32 bit are undefined.

R8..R15:     Call-saved.

             R15: stack pointer
             R14: frame pointer (optional)
             R12: GOT pointer (optional, for PIC code)

PR:          Call-saved.
             Function return address.

SR.S:        '0' (MAC saturation disabled) at function entry and function
             leave.

SR.T:        Call-clobbered.
             Boolean return value.

SR.M, SR.Q:  Call-clobbered.

Other SR bits: Ignored by the compiler.

GBR:         Call-saved.
             Pointer to current execution context (thread).

MACL,MACH:   Call-clobbered.
             Scratch registers.

FPUL:        Call-clobbered.
             Scratch register.

FR0..FR3:    Call-clobbered.
             Function return values / scratch registers.

FR4..FR7:    Call-clobbered.
             Function arguments / return values / scratch registers.

FR8..FR11:   Call-clobbered.
             Function arguments / scratch registers.

FR12..FR15:  Call-saved.
             Local variables.

XF0..XF15:   Undefined, not modified by compiler generated code.

FPSCR.FR:    Undefined, not modified by compiler generated code.

FPSCR.SZ:    '0' (32 bit fmov) on function entry / leave by default.

FPSCR.PR:    Function entry:
             '0' (single precision) if the function takes no floating point
             arguments, or if the number of 'float' arguments is greater than
             the number of 'double' arguments, '1' otherwise.

             Function leave:
             Unmodified if the function returns 'void' or integral values or
             aggregates.
             '0' if the function returns more 'float' values than 'double'
             values, '1' otherwise.

             '0' on exception handler entry.

Other FPSCR bits: Undefined, not modified by compiler generated code.

When counting the number of 'float' and 'double' values, elements of vectors
are counted as individual values.  I.e. a 4D 'float' vector has more 'float'
values than a 2D 'double' vector has 'double' values.  va_args are ignored.
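
The function entry rule for FPSCR.PR can be written down as a small helper.
This is only a sketch of the rule as stated above; the function name and the
idea of passing pre-computed counts are assumptions made for this example.

```c
/* FPSCR.PR value expected on function entry, given the number of 'float'
   and 'double' argument values (vector elements counted individually,
   va_args not counted).  0 = single precision, 1 = double precision.  */
static int
pr_on_entry (int n_float, int n_double)
{
  if (n_float == 0 && n_double == 0)
    return 0;                        /* no floating point arguments */
  return n_float > n_double ? 0 : 1;
}
```

E.g. a function taking one 4D 'float' vector and one 2D 'double' vector has
n_float = 4 and n_double = 2 and is thus entered in single precision mode.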

Function argument/return value aggregates are decomposed so that the individual
members can be passed in different register classes, based on the data type.
E.g.

struct FuncArg
{
  int a;     // -> r4
  int b;     // -> r5
  float c;   // -> fr4
};

struct FuncArg
{
  int a;     // -> r4
  int b;     // -> r5
  float c;   // -> fr4
  double d;  // -> dr6 (fr6:fr7)
  bool e;    // -> T
  float f;   // -> fr5
};

struct FuncArg
{
  int a;     // -> r4
  int b;     // -> r5
  int c;     // -> r6
  int d;     // -> r7
};

struct FuncArg
{
  int a;        // -> r4
  int b;        // -> r5
  int c;        // -> r6
  long long d;  // -> stack
  short e;      // -> r7
};

struct FuncArg
{
  float a;      // -> fr4
  float b;      // -> fr5
  float c;      // -> fr6
  float d;      // -> fr7
};

Return values/aggregates that don't fit into registers are returned partially
in registers and partially on the caller's stack.  In this case R2 is used to
pass the hidden pointer to the remaining return values.

Argument aggregates that don't fit into registers are passed partially in
registers and the remaining pieces are pushed onto the stack.
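
The member-wise assignment shown in the struct examples can be modelled with a
small allocator.  This is a sketch under assumptions that are not part of the
proposal text: all names are made up, 'long long' is assumed to need an aligned
GP reg pair (r4:r5 or r6:r7), and a second 'bool' is simply sent to the stack
once T is taken.  Members < 32 bit such as 'short' are treated like 'int'.

```c
#include <stdio.h>

enum member_kind { M_INT, M_LONG_LONG, M_FLOAT, M_DOUBLE, M_BOOL };

struct arg_state
{
  unsigned gp_used;   /* bit n set: R(4+n) is taken, n = 0..3 */
  unsigned fr_used;   /* bit n set: FR(4+n) is taken, n = 0..7 */
  int t_used;         /* non-zero: T bit is taken */
};

/* Take the lowest free, 'len'-aligned run of 'len' regs out of 'count'
   regs.  Returns the index of the first reg or -1 if nothing is free.  */
static int
take (unsigned *used, int count, int len)
{
  unsigned mask = (1u << len) - 1;

  for (int i = 0; i + len <= count; i += len)
    if ((*used & (mask << i)) == 0)
      {
        *used |= mask << i;
        return i;
      }
  return -1;
}

/* Write the location of the next member into 'out' (size 'n').  */
static void
assign (struct arg_state *s, enum member_kind k, char *out, size_t n)
{
  int i;

  switch (k)
    {
    case M_INT:
      if ((i = take (&s->gp_used, 4, 1)) >= 0)
        { snprintf (out, n, "r%d", 4 + i); return; }
      break;
    case M_LONG_LONG:   /* assumption: aligned GP reg pair */
      if ((i = take (&s->gp_used, 4, 2)) >= 0)
        { snprintf (out, n, "r%d:r%d", 4 + i, 5 + i); return; }
      break;
    case M_FLOAT:
      if ((i = take (&s->fr_used, 8, 1)) >= 0)
        { snprintf (out, n, "fr%d", 4 + i); return; }
      break;
    case M_DOUBLE:      /* even numbered FR pair, i.e. a DR reg */
      if ((i = take (&s->fr_used, 8, 2)) >= 0)
        { snprintf (out, n, "dr%d", 4 + i); return; }
      break;
    case M_BOOL:        /* assumption: only one bool can live in T */
      if (!s->t_used)
        { s->t_used = 1; snprintf (out, n, "T"); return; }
      break;
    }
  snprintf (out, n, "stack");   /* no register left for this member */
}
```

Feeding the members of the second and the fourth struct example through
'assign' in declaration order reproduces the register assignments given in the
comments of those examples, including the 'float f' that is back-filled into
fr5 and the 'short e' that still gets r7 after 'long long d' went to the stack.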

va_args are passed on the stack entirely (simpler traversal of va_list).

'double' values are passed in DR registers, where the high 32 bits are passed
in FR(n*2) and the low 32 bits in FR(n*2+1) regardless of the endian setting.
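
How a 'double' is split across the two FR regs can be sketched on the host as
follows.  The function name is made up, and the example assumes a 64 bit
IEEE 754 'double' with the same byte order as 64 bit integers.

```c
#include <stdint.h>
#include <string.h>

/* Split a double into the two 32 bit halves that would be placed in
   FR(n*2) (high word) and FR(n*2+1) (low word).  */
static void
split_double (double v, uint32_t *fr_even, uint32_t *fr_odd)
{
  uint64_t bits;

  memcpy (&bits, &v, sizeof bits);   /* well-defined type punning */
  *fr_even = (uint32_t) (bits >> 32);
  *fr_odd = (uint32_t) bits;
}
```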

4D 'float' vectors are passed in FV registers, i.e. FR(n*4), in order to avoid
reg copies before vector insns (fipr, ftrv).

SH targets that don't support double precision floating-point in hardware
handle the operations in software, but should accept the same ABI otherwise. 
This would fix e.g. PR 36939.

I'm not sure how to integrate untyped calls and whether this kind of ABI would
require additional extensions to GDB.  Probably there are also lots of other
details missing for this to be a complete ABI definition.  Any suggestions and
feedback are highly appreciated.