This is the mail archive of the
`gcc@gcc.gnu.org`
mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]

Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Other format: [Raw text]

*From*: Zack Weinberg <zack at codesourcery dot com>
*To*: gcc at gcc dot gnu dot org, Richard Henderson <rth at twiddle dot net>, Jim Wilson <wilson at specifixinc dot com>, sje at cup dot hp dot com, Gary dot Tracy2 at hp dot com, Sverre dot Jarp at cern dot ch, Michal dot Kapalka at cern dot ch
*Date*: Thu, 22 Jul 2004 15:06:24 -0700
*Subject*: MD representation of IA64 floating point operations
*References*: <200407222032.NAA20562@hpsje.cup.hp.com>

I've got two enhancement requests for ia64 on my radar which have convergent requirements on the floating-point section of the machine description. One is the implementation of the HP-UX __fpreg extension, which gives C-level access to more of the hardware functionality than is exposed through the usual floating types. This requires us to model the hardware accurately in the machine description.

The other is a scheduling deficiency: division and square-root operations, when inlined, serialize against each other, because they are only split into their full instruction sequences after register allocation, so the register allocator doesn't know to give each one different scratch registers. We can't expand to the full instruction sequence during tree->RTL lowering, which would be the logical place to do it, because the machine description doesn't model the hardware accurately - in just the same way that's a problem for __fpreg.

What's wrong with the model of the hardware? Consider the following define_insns (the first is a cut-down version of *movsf_internal):

```
(define_insn "*loadsf"
  [(set (match_operand:SF 0 "fr_register_operand" "=f")
        (match_operand:SF 1 "general_operand" "Q"))]
  "ia64_move_ok (operands[0], operands[1])"
  "ldfs %0 = %1%P1"
  [(set_attr "itanium_class" "fld")])

(define_insn "addsf3"
  [(set (match_operand:SF 0 "fr_register_operand" "=f")
        (plus:SF (match_operand:SF 1 "fr_register_operand" "%f")
                 (match_operand:SF 2 "fr_reg_or_fp01_operand" "fG")))]
  ""
  "fadd.s %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])
```

This is a simplification of what the hardware actually does. The ia64 floating-point registers always extend the value they contain to a special 82-bit internal format. Call that RFmode - "register float". The arithmetic instructions always operate on all 82 input bits, but can round the result to IEEE single or double precision after the operation - and then the rounded result gets extended back to full width.
Accurate RTL patterns for ldfs and fadd.s would be something like

```
(set (match_operand:RF 0 "fr_register_operand" "=f")
     (float_extend:RF (match_operand:SF 1 "general_operand" "Q")))
```

and

```
(set (match_operand:RF 0 "fr_register_operand" "=f")
     (float_extend:RF
       (float_truncate:SF
         (plus:RF (match_operand:RF 1 "fr_register_operand" "%f")
                  (match_operand:RF 2 "fr_reg_or_fp01_operand" "fG")))))
```

Here I assume that (float_extend:M (float_truncate:N (expr:M))) will _not_ be simplified to (expr:M).

Now, the first thing you're probably wondering is "why do we care?" The answer is that highly optimized ia64 floating-point calculation sequences can take advantage of the extra precision internally. One such sequence is the inline expansion of a floating-point divide. This is the suggested sequence for throughput-optimized single-precision division, from the ia64 optimization manual¹:

```
     // calculate f6/f7; result left in f8
     // scratch registers required: f8, f9, p6
     frcpa    f8,p6 = f6,f7
(p6) fnma.s1  f9 = f7,f8,f1
(p6) fma.s1   f9 = f9,f9,f9
(p6) fma.s1   f8 = f9,f8,f8
(p6) fma.s.s1 f9 = f6,f8,f0
(p6) fnma.s1  f6 = f7,f9,f6
(p6) fma.s    f8 = f6,f8,f9
```

Notice the lack of an .s suffix (not to be confused with .s1²) on most of these instructions. That means the result is left in RFmode. But one of them, in the middle, has an .s suffix - so that intermediate result, and only that intermediate result, gets rounded to SFmode - and then the very next instruction picks it up for a calculation back in RFmode.

We currently represent this in ia64.md (pattern "divsf3_internal_thr") by using two REG rtxes with different modes and the same register number. That is allowed, but only for hard registers, and therefore we can only expose the complete sequence after register allocation.
That intermediate result could be put in a different scratch register, but that doesn't help with the initial set of f8, which may be the correctly-rounded single-precision result (if p6 is false) or an approximation using the full register width (if p6 is true). It would, however, be possible to represent this whole calculation accurately using the more complicated RTL patterns shown above.

The __fpreg feature is intended to allow writing highly optimized elementary-function library routines (possibly to be inlined) in C, and so it exposes this property at the C level - rounding operations, when written, are delayed to be combined with a future arithmetic operation. You could write the above in extended-C as

```
float
divsf3 (float f6, float f7)
{
#pragma STDC FP_CONTRACT ON
  float f8, f9;
  __fpreg f6x, f8x, f9x;
  bool p6;

  _Asm_frcpa (f8x, p6, f6, f7);
  if (p6)
    {
#pragma _USE_SF 1
      f9x = 1.0 - f7 * f8x;
      f9x = f9x * f9x + f9x;
      f8x = f9x * f8x + f8x;
      f9  = f6 * (float) f8x;   // rounding happens AFTER multiplication
      f6x = f7 - f9 * (__fpreg) f6;
#pragma _USE_SF 0
      f8  = (float) f6x * (float) f8x + (float) f6x;  // ditto
    }
  else
    f8 = f8x;  // optimized out
  return f8;
}
```

[In this example it would be much more natural to write the casts to occur after the arithmetic operations, but I didn't design this feature, I'm just tasked with implementing it.]

So that's why we care. Now, how do we describe that in MD-language? My current thinking is that the right thing would be to change all the floating-point instruction patterns so that they look like the more complicated examples above. However, that may have undesirable consequences for the early RTL optimizers. (But do we care anymore?) An alternative would be equivalents of LOAD_EXTEND_OP and the like, for floating-point modes. That, however, would require changes in machine-independent code, which I would prefer to avoid to the maximum extent possible (since this is exclusively an ia64 weirdness as far as I know).

Thoughts?
zw

¹ "Divide, Square Root, and Remainder Algorithms for the Intel Itanium Architecture". Intel Application Note #248725-003, November 2000.

² For purposes of this discussion, the ".s1" suffixes can be ignored. They select an alternate floating-point control register which is guaranteed [by the ABI] to have the correct settings for this algorithm. One of those settings does something wondrous strange with the width of the exponent field, but let's not get into that.

**Follow-Ups**:

- **Re: MD representation of IA64 floating point operations**
  *From:* Jim Wilson
