This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



MD representation of IA64 floating point operations


I've got two enhancement requests for ia64 on my radar which have
convergent requirements on the floating-point section of the machine
description.  One is the implementation of the HPUX __fpreg extension,
giving C-level access to more of the hardware functionality than is
exposed through the usual floating types.  This requires us to model
the hardware accurately in the machine description.

The other is a scheduling deficiency; division and square root
operations, when inlined, serialize against each other because they're
only split to their full instruction sequences after register
allocation, so the register allocator doesn't know to give each one
different scratch registers.  We can't expand to the full instruction
sequence during tree->RTL lowering, which would be the logical place
to do it, because the machine description doesn't model the hardware
accurately - in just the same way that's a problem for __fpreg.

What's wrong with the model of the hardware?  Consider the following
define_insns (the first is a cut-down version of *movsf_internal):

(define_insn "*loadsf"
  [(set (match_operand:SF 0 "fr_register_operand" "=f")
        (match_operand:SF 1 "general_operand" "Q"))]
  "ia64_move_ok (operands[0], operands[1])"
  "ldfs %0 = %1%P1"
  [(set_attr "itanium_class" "fld")])

(define_insn "addsf3"
  [(set (match_operand:SF 0 "fr_register_operand" "=f")
        (plus:SF (match_operand:SF 1 "fr_register_operand" "%f")
                 (match_operand:SF 2 "fr_reg_or_fp01_operand" "fG")))]
  ""
  "fadd.s %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])

This is a simplification of what the hardware actually does.  The ia64
floating-point registers hold every value extended to a special 82-bit
internal format.  Call that RFmode - "register float".
The arithmetic instructions always operate on all 82 input bits, but
can round to IEEE single or double after the operation - and then it
gets extended back to full width.  Accurate RTL patterns for ldfs and
fadd.s would be something like

  (set (match_operand:RF 0 "fr_register_operand" "=f")
       (float_extend:RF (match_operand:SF 1 "general_operand" "Q")))

and

  (set (match_operand:RF 0 "fr_register_operand" "=f")
       (float_extend:RF
         (float_truncate:SF
           (plus:RF (match_operand:RF 1 "fr_register_operand" "%f")
                    (match_operand:RF 2 "fr_reg_or_fp01_operand" "fG")))))

Here I assume that (float_extend:M (float_truncate:N (expr:M))) will
_not_ be simplified back to plain (expr:M).

Now, the first thing you're probably wondering is "why do we care?"
The answer is that highly optimized ia64 floating point calculation
sequences can take advantage of the extra precision internally.  One
such sequence is the inline expansion of a floating-point divide.
This is the suggested sequence for throughput-optimized
single-precision division, from the ia64 optimization manual¹:

        // calculate f6/f7
        // result left in f8
        // scratch registers required: f8, f9, p6

        frcpa        f8,p6 = f6,f7
   (p6) fnma.s1      f9    = f7,f8,f1
   (p6) fma.s1       f9    = f9,f9,f9
   (p6) fma.s1       f8    = f9,f8,f8
   (p6) fma.s.s1     f9    = f6,f8,f0
   (p6) fnma.s1      f6    = f7,f9,f6
   (p6) fma.s        f8    = f6,f8,f9

Notice the lack of an .s (not to be confused with .s1²) suffix on most
of these instructions.  That means the result is left in RFmode.  But
one of them, in the middle, has an .s suffix - so that intermediate
result, and only that intermediate result, gets rounded to SFmode -
and then the very next instruction picks it up for a calculation back
in RFmode.

We currently represent this in ia64.md (pattern "divsf3_internal_thr")
by using two REG rtxes with different modes and the same register
number.  That is allowed, but only for hard registers, and therefore we
can only expose the complete sequence after register allocation.  That
intermediate result could be put in a different scratch register, but
that doesn't help with the initial set of f8, which may be the
correctly-rounded single-precision result (if p6 is false) or an
approximation using the full register width (if p6 is true).  It
would, however, be possible to represent this whole calculation
accurately using the more complicated RTL patterns shown above.
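Schematically, the hard-register trick looks something like this (an
illustrative fragment, not the literal ia64.md text; the modes and
register numbers are only meant to show the shape):

```lisp
;; The SFmode set and the wider-mode use below name the same hard
;; register - legal only after register allocation has run.
(set (reg:SF f9) (mult:SF (reg:SF f6) (reg:SF f8)))        ;; fma.s.s1
(set (reg:XF f6) (minus:XF (reg:XF f6)
                           (mult:XF (reg:XF f7) (reg:XF f9)))) ;; fnma.s1
```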

The __fpreg feature is intended to allow writing highly optimized
elementary-function library routines (possibly to be inlined) in C,
and so it exposes this property at the C level - rounding operations,
when written, are delayed to be combined with a future arithmetic
operation.  You could write the above in extended-C as

float
divsf3 (float f6, float f7)
{
#pragma STDC FP_CONTRACT ON
  float f8, f9;
  __fpreg f6x, f8x, f9x;
  bool p6;

  _Asm_frcpa(f8x, p6, f6, f7);
  if (p6)
    {
#pragma _USE_SF 1
      f9x = 1.0 - f7 * f8x;
      f9x = f9x * f9x + f9x;
      f8x = f9x * f8x + f8x;
      f9  = f6 * (float)f8x;  // rounding happens AFTER multiplication
      f6x = f7 - f9 * (__fpreg)f6;
#pragma _USE_SF 0
      f8  = (float)f6x * (float)f8x + (float)f6x;  // ditto
    }
  else
    f8 = f8x;  // optimized out
  return f8;
}

[In this example it would be much more natural to write the casts to
occur after the arithmetic operations, but I didn't design this
feature, I'm just tasked with implementing it.]

So that's why we care.  Now, how do we describe that in MD-language?
My current thinking is that the right thing would be to change all the
floating-point instruction patterns so that they look like the
more-complicated examples above.  However, that may have undesirable
consequences for the early RTL optimizers.  (But do we care anymore?)
An alternative would be equivalents of LOAD_EXTEND_OP and the like,
for floating point modes.  That, however, would require changes in
machine-independent code, which I would prefer to avoid to the maximum
extent possible (since this is exclusively an ia64 weirdness as far as
I know).
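Under the first option, the addsf3 define_insn from above would take
something like the following shape (a sketch of the proposal, not a
tested pattern; a separate addsf3 expander would have to wrap the
operands in the extend/truncate pairs):

```lisp
;; Sketch: fadd.s with the RFmode register semantics made explicit.
(define_insn "*addsf3_rf"
  [(set (match_operand:RF 0 "fr_register_operand" "=f")
        (float_extend:RF
          (float_truncate:SF
            (plus:RF (match_operand:RF 1 "fr_register_operand" "%f")
                     (match_operand:RF 2 "fr_reg_or_fp01_operand" "fG")))))]
  ""
  "fadd.s %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])
```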

Thoughts?

zw

¹ "Divide, Square Root, and Remainder Algorithms for the Intel
Itanium Architecture".  Intel Application Note #248725-003, November 2000.

² For purposes of this discussion, the ".s1" suffixes can be
ignored.  They select an alternate floating-point control register
which is guaranteed [by the ABI] to have the correct settings for this
algorithm.  One of those settings does something wondrous strange with
the width of the exponent field, but let's not get into that.


