Bug 56592 - [SH] Add vector ABI
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 52441
Reported: 2013-03-10 23:43 UTC by Oleg Endo
Modified: 2015-11-11 14:04 UTC (History)

See Also:
Host:
Target: sh*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Oleg Endo 2013-03-10 23:43:29 UTC
On SH there are a couple of ABI-related issues which unfortunately cannot all be fixed without breaking binary compatibility.  Hence the idea to add a new ABI which can be selected by a target option -mabi=vector.  The already existing ABIs could also be selected through this option:
-mrenesas -> -mabi=renesas
-mnorenesas -> -mabi=gnu


Some of the primary issues that the vector ABI is supposed to improve are:

----------------------
PR 13423
sh-elf: V4SFmode passed in integer registers

Float vectors, float arrays (of fixed size) or structs of floats, when passed by value, should be passed entirely in FP regs.  The current ABI allows passing of up to 8 FP regs (FR4..FR11), so there would be space to pass two 4D float vectors.  It should also be possible to return a 4D float vector in registers.
Since FR0..FR11 are call clobbered, they can as well be used to return multiple vectors.
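
As an illustration of the kind of code this is about, here is a minimal sketch using GCC's generic vector extension (vector_size).  The type name 'v4sf' and the function 'dot4' are made up for this example, and the register assignments in the comments describe what the proposed ABI would aim for, not what any existing compiler does.

```c
/* 4D float vector via the GCC vector extension (16 bytes = 4 x float).  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Under the proposed ABI 'a' would ideally arrive in FR4..FR7 and 'b' in
   FR8..FR11 instead of integer registers (assumption, see text above).  */
float
dot4 (v4sf a, v4sf b)
{
  v4sf p = a * b;                       /* element-wise multiply */
  return p[0] + p[1] + p[2] + p[3];     /* horizontal sum */
}
```

With both operands already sitting in FV-aligned register groups, a dot product like this would map naturally onto a single fipr insn.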

----------------------
PR 53513
SH Target: Add support for fschg and fpchg insns

Although this PR could be solved without breaking the ABI too much, there are some issues which could be fixed in a new ABI.
The current approach is to use two global variables (__fpscr_values) in order to perform FPU single/double mode switching.  The default FPU precision setting is defined by an -m option.  Currently there are three such FPU default modes:
- double mode default
- single mode default
- single mode only

When changing the FPU mode the current FPSCR setting is overwritten with one of the global values from __fpscr_values.  This is the fastest way (on non-SH4A) to perform a mode switch, but it has some disadvantages.  One of them is PR 6526.  In general, all information in FPSCR is lost after performing a mode switch this way, e.g. it is not possible to read FPU exception causes after a series of operations.  Moreover, in multi-threaded environments it is not possible to set the default FPSCR setting (e.g. rounding mode or denormal handling) for threads independently.

In order to minimize mode switches the function signature can be taken into account when deciding the default FPU precision for a particular function.  E.g. when a function has any double precision arguments, it can be assumed that the function will use the double values in some way.  Thus the default entry mode for such a function should be 'double'.  Similarly, for functions that return double values it can be better to leave the function with 'double' mode.

Because of this, '-m4 -mabi=vector' and '-m4-single -mabi=vector' would actually result in the same ABI.

It should also be possible to override the FPSCR.PR settings for function entry and function leave via function attributes.  This can be useful e.g. in cases where hand written asm FPU routines that expect certain settings are invoked from C/C++ code.  E.g. code that uses the 'frchg' insn to flip the FPSCR.FR bit on SH4 must be executed with FPSCR.PR = 0.
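
The information loss caused by overwriting FPSCR can be modelled in plain C.  The following is only a sketch that treats FPSCR as an ordinary 32 bit value.  The bit positions (PR = bit 19, cause field = bits 17..12, rounding mode = bits 1..0) are taken from the SH4 FPSCR layout, while the table 'model_fpscr_values', its element order and both function names are assumptions made for this example and are not the actual libgcc definitions.

```c
#include <stdint.h>

#define FPSCR_RM_MASK     0x00000003u   /* rounding mode, bits 1..0 */
#define FPSCR_CAUSE_MASK  0x0003f000u   /* exception cause, bits 17..12 */
#define FPSCR_PR          (1u << 19)    /* precision: 1 = double */

/* Model of the current scheme: two precomputed FPSCR images
   (hypothetical contents and ordering).  */
static const uint32_t model_fpscr_values[2] = {
  0,            /* single precision */
  FPSCR_PR,     /* double precision */
};

/* Mode switch by overwriting the whole register.  The previous contents
   (cause bits, rounding mode, ...) are discarded.  */
uint32_t
switch_by_overwrite (uint32_t fpscr, int want_double)
{
  (void) fpscr;
  return model_fpscr_values[want_double != 0];
}

/* Mode switch that only touches the PR bit and keeps everything else.  */
uint32_t
switch_pr_only (uint32_t fpscr, int want_double)
{
  return want_double ? (fpscr | FPSCR_PR) : (fpscr & ~FPSCR_PR);
}
```

The second variant corresponds to what a read-modify-write sequence (or fpchg on SH4A) would preserve, at the cost of a slower switch on targets without such an insn.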


----------------------
PR 52441
Target: Double sign/zero extensions for function arguments

Values smaller than 32 bits that are passed in registers usually have undefined high bits.  The standard GNU calling convention thus performs sign/zero extension of such values before the function call and again inside the function itself.  The Renesas calling convention (-mrenesas), however, only extends values inside the function.  Whether an extension is actually required at all depends on how the value is used, which is known only inside the function.  Thus adopting the Renesas calling convention in this case is more efficient.
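
Two small functions show why only the callee can tell whether an extension is needed.  This is an illustrative sketch and the function names are made up.

```c
/* No extension needed: a byte store only looks at the low 8 bits of 'v',
   so undefined high bits in the argument register do no harm.  */
void
store_byte (unsigned char *p, unsigned char v)
{
  *p = v;
}

/* Extension needed: the value takes part in 32 bit arithmetic, so the
   callee has to sign extend 'v' before using it.  */
int
widen_plus_one (signed char v)
{
  return v + 1;
}
```

If the caller always extends as well, the extension in the first case is pure overhead and in the second case it is done twice.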


----------------------
Register ordering for arguments.

I don't remember in which PR this was mentioned but the current GNU calling convention allocates FR registers on big endian like:
FR4 = arg0
FR5 = arg1
FR6 = arg2
FR7 = arg3
...

and on little endian:
FR4 = arg1
FR5 = arg0
FR6 = arg3
FR7 = arg2
...

This can make writing endian neutral asm code more complicated.  The ordering for little endian should be the same as for big endian (which is also equivalent to the -mrenesas ABI).
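
The two orderings listed above can be written down as a small mapping.  This sketch only covers single precision arguments that fit into FR4..FR11 (n = 0..7).  The function names are made up for illustration.

```c
/* FR register number holding the n-th 'float' argument under the current
   GNU calling convention, as listed above.  On little endian each pair
   of registers is swapped.  */
int
gnu_fr_for_arg (int n, int little_endian)
{
  return 4 + (little_endian ? (n ^ 1) : n);
}

/* Proposed ordering: the same for both endian settings.  */
int
proposed_fr_for_arg (int n)
{
  return 4 + n;
}
```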


----------------------
Alignment of double precision FP values.

Currently the default alignment for those is 32 bit and can be changed to 64 bit by the option -mdalign.  In order to be able to maximize the utilization of 64 bit fmov insns, 64 bit double alignment should be the default.
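
The effect on data layout can be sketched with a bit of offset arithmetic.  The helper names are made up, and the example assumes a 4 byte 'int' followed by an 8 byte 'double' as in 'struct { int i; double d; }'.

```c
#include <stddef.h>

/* Round 'offset' up to the next multiple of 'align' (a power of two).  */
size_t
align_up (size_t offset, size_t align)
{
  return (offset + align - 1) & ~(align - 1);
}

/* Offset of 'd' in struct { int i; double d; } for a given alignment of
   'double' in bytes, assuming a 4 byte 'int'.  */
size_t
double_member_offset (size_t double_align)
{
  return align_up (4, double_align);
}

/* Total size of the same struct, padded to the struct alignment, which
   is the alignment of 'double' in this model.  */
size_t
struct_size (size_t double_align)
{
  return align_up (double_member_offset (double_align) + 8, double_align);
}
```

With 32 bit alignment the struct occupies 12 bytes and 'd' may straddle a 64 bit boundary.  With 64 bit alignment it grows to 16 bytes but 'd' can always be accessed with a 64 bit fmov.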


----------------------
Boolean function return values

A boolean return value of a function tends to be produced inside the function by using some sort of comparison insns which store the comparison result in the T bit.  The T bit is then transferred to a GP reg before returning from the function.  On the caller side, the value in the GP reg is then often tested for != 0 followed by a conditional branch.  The redundant != 0 test can be eliminated by returning boolean values in the T bit directly.  However, there might be compatibility problems with C code that typedefs its own bool type as signed/unsigned char or something else.
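
A short sketch of both sides of this.  The typedef 'my_bool' and both functions are hypothetical examples, not taken from any particular code base.

```c
#include <stdbool.h>

typedef unsigned char my_bool;   /* hypothetical home-grown bool type */

/* The result comes straight from a comparison, so it is already in the
   T bit on SH and could be returned there directly.  */
bool
is_less (int a, int b)
{
  return a < b;
}

/* Declared as a "bool", but actually returns a bit mask.  A value like
   this cannot be returned in a single bit without losing information,
   so it has to stay in a GP reg.  */
my_bool
get_flags (int x)
{
  return (my_bool) (x & 0x6);
}
```

Only functions whose return type is the real 'bool' / '_Bool' type could safely use the T bit.  Code that mixes such typedefs with real 'bool' across translation units is where the compatibility problems would show up.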


----------------------
Variadic functions

Passing a variable number of arguments ('...') over the stack as it is currently done with -mrenesas tends to produce more efficient code, especially when traversing the va_list.
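
For reference, a typical va_list traversal looks like the following standard C sketch ('sum_ints' is just an example name).  When all variadic arguments live on the stack, each va_arg step can boil down to a load and a pointer increment, with no need to first spill argument registers or to track how many of them have been consumed.

```c
#include <stdarg.h>

/* Sum up 'count' int arguments passed through '...'.  */
int
sum_ints (int count, ...)
{
  va_list ap;
  int total = 0;

  va_start (ap, count);
  for (int i = 0; i < count; i++)
    total += va_arg (ap, int);
  va_end (ap);

  return total;
}
```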


----------------------
ABI summary I've got so far


R0..R3:      Call-clobbered.
             Function return values / scratch registers.
             High bits of values < 32 bit are undefined.


R4..R7:      Call-clobbered.
             Function arguments / scratch registers.
             High bits of values < 32 bit are undefined.


R8..R15:     Call-saved.

             R15: stack pointer
             R14: frame pointer (optional)
             R12: GOT pointer (optional, for PIC code)


PR:          Call-saved.
             Function return address.


SR.S:        '0' (MAC saturation disabled) at function entry and function leave.

SR.T:        Call-clobbered.
             Boolean return value.

SR.M, SR.Q:  Call-clobbered.

Other SR bits: Ignored by the compiler.

GBR:         Call-saved.
             Pointer to current execution context (thread).

MACL,MACH:   Call-clobbered.
             Scratch registers.

FPUL:        Call-clobbered.
             Scratch register.

FR0..FR3:    Call-clobbered.
             Function return values / scratch registers.

FR4..FR7:    Call-clobbered.
             Function arguments / return values / scratch registers.

FR8..FR11:   Call-clobbered.
             Function arguments / scratch registers.

FR12..FR15:  Call-saved.
             Local variables.

XF0..XF15:   Undefined, not modified by compiler generated code.

FPSCR.FR:    Undefined, not modified by compiler generated code.
              
FPSCR.SZ:    '0' (32 bit fmov) on function entry / leave by default.

FPSCR.PR:    Function entry:
             '0' (single precision) if the function takes no floating point
             arguments, or if the number of 'float' arguments is greater than
             the number of 'double' arguments, '1' otherwise.

             Function leave:
             Unmodified if the function returns 'void' or integral values or
             aggregates.
             '0' if the function returns more 'float' values than 'double'
             values, '1' otherwise.

             '0' on exception handler entry.

Other FPSCR bits: Undefined, not modified by compiler generated code.


When counting the number of 'float' and 'double' values, elements of vectors are counted as individual values.  I.e. a 4D 'float' vector has more 'float' values than a 2D 'double' vector has 'double' values.  va_args are ignored.
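
The function entry rule for FPSCR.PR can be written as a small decision function.  This is a sketch of the rule as stated above.  The function name is made up, and the counts are assumed to already include vector elements and to exclude va_args.

```c
/* FPSCR.PR value expected at function entry: 0 = single, 1 = double.
   'n_float' and 'n_double' are the numbers of 'float' and 'double'
   values among the named arguments.  */
int
entry_pr (int n_float, int n_double)
{
  if (n_float == 0 && n_double == 0)
    return 0;                           /* no FP arguments at all */

  return n_float > n_double ? 0 : 1;    /* a tie selects double */
}
```

E.g. a function taking one 4D 'float' vector and one 2D 'double' vector gets entry_pr (4, 2), i.e. single precision on entry.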


Function argument/return value aggregates are decomposed so that the individual members can be passed in different register classes, based on the data type.  E.g. 

struct FuncArg
{
  int a;     // -> r4
  int b;     // -> r5
  float c;   // -> fr4
};

struct FuncArg
{
  int a;     // -> r4
  int b;     // -> r5
  float c;   // -> fr4
  double d;  // -> dr6 (fr6:fr7)
  bool e;    // -> T
  float f;   // -> fr5
};

struct FuncArg
{
  int a;     // -> r4
  int b;     // -> r5
  int c;     // -> r6
  int d;     // -> r7
};

struct FuncArg
{
  int a;        // -> r4
  int b;        // -> r5
  int c;        // -> r6
  long long d;  // -> stack
  short e;      // -> r7
};

struct FuncArg
{
  float a;      // -> fr4
  float b;      // -> fr5
  float c;      // -> fr6
  float d;      // -> fr7
};
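
The assignments in the examples above can be reproduced with a small allocation model.  This is only a sketch under several assumptions that are not spelled out in the proposal: members are processed in declaration order, single registers take the lowest free one (which gives the backfilling of fr5), 64 bit values need an aligned register pair, values smaller than 32 bit are treated like 'int', and only one 'bool' can go into the T bit.  All names are made up.

```c
enum member_kind { M_INT, M_LONGLONG, M_FLOAT, M_DOUBLE, M_BOOL };
enum loc_class { LOC_STACK, LOC_GPR, LOC_FR, LOC_DR, LOC_T };

struct loc { enum loc_class cls; int reg; };

struct alloc_state
{
  unsigned gpr_used;    /* bit n set: Rn is taken */
  unsigned fr_used;     /* bit n set: FRn is taken */
  int t_used;           /* nonzero: the T bit is taken */
};

/* Claim 'count' consecutive free registers in [first, last], stepping in
   units of 'count' so that pairs stay aligned.  Returns the first
   register number, or -1 if nothing fits.  */
static int
claim (unsigned *used, int first, int last, int count)
{
  unsigned mask = (1u << count) - 1;

  for (int r = first; r + count - 1 <= last; r += count)
    if ((*used & (mask << r)) == 0)
      {
        *used |= mask << r;
        return r;
      }
  return -1;
}

/* Assign a location to the next aggregate member.  */
struct loc
assign_member (struct alloc_state *s, enum member_kind kind)
{
  struct loc l = { LOC_STACK, -1 };

  switch (kind)
    {
    case M_INT:
      l.reg = claim (&s->gpr_used, 4, 7, 1);
      if (l.reg >= 0) l.cls = LOC_GPR;
      break;
    case M_LONGLONG:
      l.reg = claim (&s->gpr_used, 4, 7, 2);
      if (l.reg >= 0) l.cls = LOC_GPR;
      break;
    case M_FLOAT:
      l.reg = claim (&s->fr_used, 4, 11, 1);
      if (l.reg >= 0) l.cls = LOC_FR;
      break;
    case M_DOUBLE:
      l.reg = claim (&s->fr_used, 4, 11, 2);
      if (l.reg >= 0) l.cls = LOC_DR;
      break;
    case M_BOOL:
      if (!s->t_used)
        {
          s->t_used = 1;
          l.cls = LOC_T;
        }
      break;
    }
  return l;
}
```

Feeding the members of the second example (int, int, float, double, bool, float) through assign_member yields r4, r5, fr4, dr6, T and fr5.  The fourth example (int, int, int, long long, short) yields r4, r5, r6, stack and r7.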


Return values/aggregates that don't fit into registers are returned partially in registers and partially on the caller's stack.  In this case R2 is used to pass the hidden pointer to the remaining return values.

Argument aggregates that don't fit into registers are passed partially in registers and the remaining pieces are pushed onto the stack.

va_args are passed on the stack entirely (simpler traversal of va_list).

'double' values are passed in DR registers, where the high 32 bits are passed in FR(n*2) and the low 32 bits in FR(n*2+1) regardless of the endian setting.
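
In terms of bit patterns the split looks as follows.  This sketch assumes that 'double' is IEEE 754 binary64 and the function name is made up.

```c
#include <stdint.h>
#include <string.h>

/* Split a 'double' into the two 32 bit words that would be placed in
   FR(n*2) (high word) and FR(n*2+1) (low word).  */
void
split_double (double d, uint32_t *fr_even, uint32_t *fr_odd)
{
  uint64_t bits;

  memcpy (&bits, &d, sizeof bits);      /* bit pattern of the value */
  *fr_even = (uint32_t) (bits >> 32);   /* sign, exponent, high mantissa */
  *fr_odd = (uint32_t) bits;            /* low mantissa */
}
```

Since the split is defined on the value and not on the memory layout, asm code can rely on the sign and exponent always being in the even numbered register.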

4D 'float' vectors are passed in FV registers, i.e. FR(n*4), in order to avoid reg copies before vector insns (fipr, ftrv).

SH targets that don't support double precision floating-point in hardware handle the operations in software, but should accept the same ABI otherwise.  This would fix e.g. PR 36939.


I'm not sure how to integrate untyped calls and whether this kind of ABI would require additional extensions to GDB.  Probably there are also lots of other details missing for this to be a complete ABI definition.  Any suggestions and feedback are highly appreciated.
Comment 1 Manu Evans 2013-03-14 09:48:17 UTC
I watch with keen anticipation! :)
Comment 2 Oleg Endo 2013-03-17 14:19:55 UTC
Regarding multi-word arguments:

> 'double' values are passed in DR registers, where the high 32 bits are passed
> in FR(n*2) and the low 32 bits in FR(n*2+1) regardless of the endian setting.
> 
> 4D 'float' vectors are passed in FV registers, i.e. FR(n*4), in order to avoid
> reg copies before vector insns (fipr, ftrv).

Multi-word integer values should be passed in little endian word order.  E.g. 'long long' (64 bit) would be passed in r1:r0, where r1 holds the high 32 bits and r0 holds the low 32 bits.  This would make it easier to write endian neutral asm code.
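
A sketch of the intended word order for a 64 bit integer, independent of the endian setting (the function name is made up):

```c
#include <stdint.h>

/* Split a 64 bit integer into the two register words: the lower numbered
   register gets the low 32 bits, the higher numbered one the high 32
   bits.  */
void
split_int64 (int64_t v, uint32_t *low_reg, uint32_t *high_reg)
{
  uint64_t u = (uint64_t) v;

  *low_reg = (uint32_t) u;
  *high_reg = (uint32_t) (u >> 32);
}
```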
Comment 3 Oleg Endo 2015-10-08 11:15:04 UTC
Maybe also interesting: __attribute__((vector)) (function attribute)
Comment 4 Oleg Endo 2015-11-11 14:04:11 UTC
(In reply to Oleg Endo from comment #0)
> 
> Function argument/return value aggregates are decomposed so that the
> individual members can be passed in different register classes, based on the
> data type.  E.g. 
> 
> ...
> 
> struct FuncArg
> {
>   float a;      // -> fr4
>   float b;      // -> fr5
>   float c;      // -> fr6
>   float d;      // -> fr7
> };
> 

Maybe such simple cases can be handled by implementing TARGET_ARRAY_MODE_SUPPORTED_P.