This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 9/17][ARM] Add NEON FP16 arithmetic instructions.


On Thu, May 19, 2016 at 05:29:16PM +0000, Joseph Myers wrote:
> On Thu, 19 May 2016, Jiong Wang wrote:
> 
> > Then,
> > 
> >   * if we add scalar HF mode to standard patterns, vector HF modes operation
> > will be
> >     turned into scalar HF operations instead of scalar SF operations.
> > 
> >   * if we add vector HF mode to standard patterns, vector HF modes operations
> > will
> >     generate vector HF instructions directly.
> > 
> >   Will this still cause precision inconsistence with old gcc when there are
> > cascade
> >   vector float operations?
> 
> I'm not sure inconsistency with old GCC is what's relevant here.
> 
> Standard-named RTL patterns have particular semantics.  Those semantics do 
> not depend on the target architecture (except where there are target 
> macros / hooks to define such dependence).  If you have an instruction 
> that matches those target-independent semantics, it should be available 
> for the standard-named pattern.  I believe that is the case here, for both 
> the scalar and the vector instructions - they have the standard semantics, 
> so should be available for the standard patterns.
> 
> It is the responsibility of the target-independent parts of the compiler 
> to ensure that the RTL generated matches the source code semantics, so 
> that providing a standard pattern for an instruction that matches the 
> pattern's semantics does not cause any problems regarding source code 
> semantics.
> 
> That said: if the expander in old GCC is converting a vector HF operation 
> into scalar SF operations, I'd expect it also to include a conversion from 
> SFmode back to HFmode after those operations, since it will be producing a 
> vector HF result.  And that would apply for each individual operation 
> expanded.  So I would not expect inconsistency to arise from making direct 
> HFmode operations available (given that the semantics of scalar + - * / 
> are the same whether you do them directly on HFmode or promote to SFmode, 
> do the operation there and then convert the result back to HFmode before 
> doing any further operations on it).

I think the confusion here is that these two functions:

  float16x8_t
  __attribute__ ((noinline)) 
  foo (float16x8_t a, float16x8_t b, float16x8_t c)
  {
    return a * b / c;
  }

  float16_t
  __attribute__ ((noinline)) 
  bar (float16_t a, float16_t b, float16_t c)
  {
    return a * b / c;
  }

Have different behaviours in terms of when they extend and truncate between
floating-point precisions.

A full testcase calling these functions is attached.

Compile with

  `gcc -O3`
     for AArch64 ARMv8-A
  `gcc -O3 -mfloat-abi=hard -mfpu=neon-fp16 -mfp16-format=ieee -march=armv7-a`
     for ARMv7-A 

This prints:

  Fail:
	Scalar Input	256.000000
	Scalar Output	256.000000
	Vector input	256.000000
	Vector output	inf
  Fail:
	Scalar Input	3.300781
	Scalar Output	3.300781
	Vector input	3.300781
	Vector output	3.302734
  Fail:
	Scalar Input	10000.000000
	Scalar Output	10000.000000
	Vector input	10000.000000
	Vector output	inf
  Fail:
	Scalar Input	0.000003
	Scalar Output	0.000003
	Vector input	0.000003
	Vector output	0.000000
  Fail:
	Scalar Input	0.000400
	Scalar Output	0.000400
	Vector input	0.000400
	Vector output	0.000447

foo, operating on vectors, remains in 16-bit precision throughout gimple,
will scalarise during veclower, and will add float_extend and float_truncate
around each operation during expand to preserve the 16-bit rounding
behaviour. For this testcase, that means two truncates per vector element.
One after the multiply, one after the divide.

bar, operating on scalars, adds promotions early due to TARGET_PROMOTED_TYPE.
In gimple we stay in 32-bit precision for the two operations, and we
truncate only after both operations. That means one truncate, taking place
after the divide.

However, I find this surprising at a language level, though I see
that Clang 3.8 has the same behaviour.  ACLE doesn't mention the GCC
vector extensions, so doesn't specify the behaviour of the arithmetic
operators on vector-of-float16_t types. GCC's vector extension documentation
gives this definition for arithmetic operations:

  The types defined in this manner can be used with a subset of normal
  C operations. Currently, GCC allows using the following operators on
  these types: +, -, *, /, unary minus, ^, |, &, ~, %.

  The operations behave like C++ valarrays. Addition is defined as
  the addition of the corresponding elements of the operands. For
  example, in the code below, each of the 4 elements in a is added to
  the corresponding 4 elements in b and the resulting vector is stored
  in c.

  Subtraction, multiplication, division, and the logical operations
  operate in a similar manner. Likewise, the result of using the unary
  minus or complement operators on a vector type is a vector whose
  elements are the negative or complemented values of the corresponding
  elements in the operand. 

Without digging in to the compiler code, I would have expected the vector
implementation to give equivalent results to the scalar one.

My question is whether you consider the different behaviour between scalar
float16_t and vector-of-float16_t types to be a bug? I can think of some
ways to fix the vector behaviour if it is buggy, but they would of course
be a change in behaviour from current releases (and from clang 3.8).

Clearly, this makes no difference to your comment that we should implement
these using standard pattern names. Either this is a bug, in which case
the front-end will arrange for the promotion to vector-of-float32_t
types, and implementing the vector standard pattern names would potentially
allow for some optimisation back to vector-of-float16_t type, or this
is not a bug, in which case the vector-of-float16_t standard pattern names
match the expected semantics perfectly.

Thanks,
James

#include "arm_neon.h"
#include "stdio.h"

float16x8_t
__attribute__ ((noinline))
foo (float16x8_t a, float16x8_t b, float16x8_t c)
{
  return a * b / c;
}

float16_t
__attribute__ ((noinline))
bar (float16_t a, float16_t b, float16_t c)
{
  return a * b / c;
}

#define VALS { 1.0f, 256.0f, 1.1f, 2.2f, \
		     3.3f, 10000.0f, 0.000003f, 0.0004f }

int
main (int argc, char **argv)
{
  float16_t x[8] = VALS;
  float16_t y[8];
  float16x8_t vx = VALS;

  for (int i = 0; i< 8; i++)
    y[i] = bar (x[i], x[i], x[i]);

  float16x8_t vy = foo (vx, vx, vx);

  for (int i = 0; i < 8; i++)
    if (y[i] != vy[i])
      printf ("Fail:\n\tScalar Input\t%f\n\tScalar Output\t%f\n\t"
	      "Vector input\t%f\n\tVector output\t%f\n",
	      x[i], y[i], vx[i], vy[i]);
  
}

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]