This page discusses a number of related problems and desired features of the ia64 back end's floating-point handling, and a plan for solving all of them.

== Problems ==

At root, the problem is that the ia64 back end does not accurately model the hardware. The ia64 architecture's floating point has a number of exotic features:

 * Floating-point registers are 82 bits wide and use an (internal, subject-to-change) format which does not exactly correspond to any IEEE data type.
 * Values loaded from memory in IEEE single, double, or extended format are automatically extended (but ''not'' normalized) to the full register width. Arithmetic instructions all operate on the full register width, with an optional truncation to single or double precision afterward; the truncated value is then immediately extended back to register width. The upshot is that it is almost never necessary to execute a widening conversion instruction, and narrowing conversions can frequently be merged with a preceding arithmetic instruction.
 * There are four alternative IEEE control registers. Each instruction can specify which control register it uses. Most calculation uses control register 0; register 1 is reserved for use by arithmetic library routines (e.g. division and square root) and has contents defined by the ABI; registers 2 and 3 are for speculative floating-point calculation, which is out of scope for this discussion.

GCC's model is inaccurate in the following ways:

 * It is not aware that widening conversions are unnecessary in most cases. This directly causes suboptimal code to be generated. Also, hand-coded assembly sequences (again, such as division and square root) that avoid widening conversions must cheat in some way. The current inline divide is done with a post-reload splitter, at which point hard registers can be referenced in more than one mode. The problem with this is that register allocation is likely to allocate the same scratch registers to all divisions in the same function, which forces them to be serialized against each other even when the scheduler would prefer otherwise.
 * It is not aware of the 82-bit register format. This would be harmless if it were not desired to make operations in that format accessible to the user (primarily for ease of coding math libraries); see below.
 * Only a limited set of instruction patterns which use the alternate control registers are available. Again, this is only a problem because it is desired to make the alternate control registers accessible to the user. (It will also be a problem if and when we ever get round to implementing speculative calculation.)

There are also related maintenance headaches:

 * To model the fact that narrowing conversions can be merged with preceding arithmetic instructions, we must create extra insn patterns: one for each {{{DFmode}}} pattern and two for each {{{XFmode}}} pattern, differing only in the RTL template. This produces massive duplication of code.
 * Adding a complete set of alternate-control-register patterns will require yet more duplication: for every pattern (including the merged-narrowing-conversion patterns), a duplicate differing only in the RTL template and output format.

== Features ==

It is desired to add some features of the HP-UX system compiler for ia64 to GCC. They facilitate coding high-speed math libraries in C rather than in assembly.

=== __fpreg ===

{{{__fpreg}}} is an extended floating point type which provides user access to the full width of the floating point registers. It has the following properties:

 * It is copied between registers and memory with 'fill/spill' instructions that transfer the complete internal register content.
 * There is no way to write a constant with type {{{__fpreg}}}.
 * Only the following operations on {{{__fpreg}}} types are supported:
  1. Function calls (arguments and return values)
  1. Arithmetic operations of addition (+), subtraction (-), and multiplication (*)
  1. Unary minus (-) and address operator (&)
  1. Equality operators (==, !=)
  1. Relational operators (<, <=, >, >=)
  1. Simple assignment
  1. Initialization
  1. Casts to/from any other arithmetic type (except {{{__float128}}}; see below)
  1. Element of a struct/union
  1. Element of an array
  1. Pointers to this type
 * Mixed-mode calculations with {{{__fpreg}}} and a narrower type, in the usual C fashion, widen the narrower type to {{{__fpreg}}} before performing the calculation (as discussed above, this normally requires no explicit conversion instruction).
 * Widening conversion from {{{__fpreg}}} to {{{__float128}}} (IEEE quad, implemented only in software) is not allowed.
 * Narrowing conversion from {{{__fpreg}}} to {{{float}}}, {{{double}}}, or {{{__float80}}} (IEEE extended) is "passive": the value is not truncated. The next arithmetic operation in the narrower mode will see the full-width value. (Also, if the value is written to memory in the narrower mode it is obviously truncated. This has the potential to cause problems similar to those seen on the i386 architecture with its wide internal calculations; however, the vast number of registers available on the ia64 architecture makes the odds of users actually experiencing problems small.)
 * Conversions between {{{__fpreg}}} and integer types have the usual C semantics.
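As a rough illustration of the intended usage (a sketch only; {{{__fpreg}}} is the feature being proposed, so this does not compile with GCC today), a math-library routine might use the type like this:

{{{
/* Sketch: keep the intermediate sum at full 82-bit register
   precision.  Hypothetical until __fpreg is implemented in GCC.  */
double
dot3 (const double *a, const double *b)
{
  __fpreg acc;

  acc = (__fpreg) a[0] * b[0];      /* mixed-mode: b[0] widens for free */
  acc = acc + (__fpreg) a[1] * b[1];
  acc = acc + (__fpreg) a[2] * b[2];
  return (double) acc;              /* narrowing cast back to double */
}
}}}

Per the mixed-mode rule above, the widening casts should generate no conversion instructions, and under the passive-narrowing rule the final cast need not generate a truncation instruction either.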
=== #pragma _USE_SF ===

This {{{#pragma}}} gives control over the choice of floating-point control register. It has the syntax {{{#pragma _USE_SF n}}}, where ''n'' is 0, 1, 2, or 3. Its effect is to cause all subsequent floating-point operations to use the specified control register, until the end of the containing block. It is constrained to appear only once per lexical block, and only at the beginning thereof. HP's specification says that the effect of the {{{#pragma}}} applies only to assembly intrinsics, but acc applies it to all operations. The GCC implementation will be consistent with acc. Note that inline divide and square root will always use control register 1 for intermediate calculations.
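For illustration, a hedged sketch of how an arithmetic library routine might use the {{{#pragma}}} (the function and its algorithm are invented for this example; only the pragma syntax follows the description above):

{{{
/* Sketch: a math-library helper doing its intermediate work under
   control register 1, which the ABI reserves for arithmetic library
   routines such as division.  Hypothetical example code.  */
double
recip_refine (double x, double approx)
{
#pragma _USE_SF 1
  /* Every FP operation in this block uses control register 1.  */
  double e = 1.0 - x * approx;   /* Newton-Raphson residual */
  return approx + approx * e;    /* refined approximation */
}
}}}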
=== Assembly intrinsics ===

acc supports a large set of intrinsics (machine-specific builtins, in GCC terminology) which map directly to floating-point instructions that may not be readily accessible from C. It would be nice to support these. Once all of the above features and improvements are implemented, this will be easy, as GCC already has plenty of support for machine-specific builtins.

== Plan ==

The hard part is modeling the machine behavior without a combinatorial explosion in the size of {{{ia64.md}}}. It is also desirable, although less important, to avoid a combinatorial explosion in the size of the generated files.

We believe free extension of narrow to wide floating-point modes can best be modeled by creating new floating-point operand predicates which accept either {{{(reg:M1)}}} or {{{(float_extend:M1 (reg:M2))}}} (where ''M2'' is narrower than ''M1''). These would be used for the input operands of arithmetic instructions. Combine should then be able to merge explicit extension instructions with the arithmetic. However, Richard Henderson cautions that this may require changes to reload (specifically, to recognize that the thing that needs reloading is {{{(reg:M2)}}}).
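To make this concrete, here is a rough C sketch of such a predicate. The name {{{fr_reg_or_extended_operand}}} is invented for this example; {{{fr_register_operand}}} is the back end's existing FR-register predicate.

{{{
/* Sketch only.  Accept (reg:MODE), or (float_extend:MODE (reg:M2))
   where M2 is a floating-point mode narrower than MODE.  */
int
fr_reg_or_extended_operand (rtx op, enum machine_mode mode)
{
  if (GET_CODE (op) == FLOAT_EXTEND
      && GET_MODE (op) == mode)
    {
      rtx inner = XEXP (op, 0);
      if (GET_MODE_CLASS (GET_MODE (inner)) == MODE_FLOAT
	  && GET_MODE_SIZE (GET_MODE (inner)) < GET_MODE_SIZE (mode))
	return fr_register_operand (inner, GET_MODE (inner));
      return 0;
    }
  return fr_register_operand (op, mode);
}
}}}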
It is already possible to model free truncation after arithmetic, and the alternate control registers; the only issue is the combinatorial explosion in the size of {{{ia64.md}}}, and the concomitant maintenance problems. Let's look at an example set of arithmetic patterns.

{{{
(define_insn "adddf3"
  [(set (match_operand:DF 0 "fr_register_operand" "=f")
        (plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
                 (match_operand:DF 2 "fr_reg_or_fp01_operand" "fG")))]
  ""
  "fadd.d %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])

(define_insn "*adddf3_trunc"
  [(set (match_operand:SF 0 "fr_register_operand" "=f")
        (float_truncate:SF
          (plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
                   (match_operand:DF 2 "fr_reg_or_fp01_operand" "fG"))))]
  ""
  "fadd.s %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])

(define_insn "*adddf3_alts"
  [(set (match_operand:DF 0 "fr_register_operand" "=f")
        (plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
                 (match_operand:DF 2 "fr_reg_or_fp01_operand" "fG")))
   (use (match_operand:SI 3 "const_int_operand" ""))]
  ""
  "fadd.d.s%3 %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])

(define_insn "*adddf3_trunc_alts"
  [(set (match_operand:SF 0 "fr_register_operand" "=f")
        (float_truncate:SF
          (plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
                   (match_operand:DF 2 "fr_reg_or_fp01_operand" "fG"))))
   (use (match_operand:SI 3 "const_int_operand" ""))]
  ""
  "fadd.s.s%3 %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])
}}}

As you can see, this is highly repetitive. (The {{{*adddf3_alts}}} and {{{*adddf3_trunc_alts}}} patterns do not actually exist in the machine description, because ''_alts'' patterns have only been added when they are used directly by other parts of the machine description, in an effort to keep the repetition down.)

What we would like is a notation that allowed us to write just the first pattern, or something very like it, and have the other three patterns synthesized. Richard Sandiford's "mode macros" and "code macros" are the most obvious related feature, but they do not facilitate mutating the RTL template. A better analogy is {{{define_cond_exec}}}, which ''does'' create modified patterns with mutated RTL templates. Here is a half-baked suggestion for a construct that might work:

{{{
(define_pattern_macro "fp_insn"
  [(set (match_operand 1) (match_operand 2))]
  ["*_trunc" (parallel [(set (match_dup:narrower_float 1)
                             (float_truncate:narrower_float (match_dup 2)))])
   "*_alts" (parallel [(match_dup 0)
                       (use (match_operand:SI 3 "const_int_operand" ""))])
   "*_trunc_alts" (parallel [(set (match_dup:narrower_float 1)
                                  (float_truncate:narrower_float (match_dup 2)))
                             (use (match_operand:SI 3 "const_int_operand" ""))])])

(define_fp_insn "adddf3"
  [(set (match_operand:DF 0 "fr_register_operand" "=f")
        (plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
                 (match_operand:DF 2 "fr_reg_or_fp01_operand" "fG")))]
  ""
  "fadd%m0%s3 %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])
}}}

Here {{{narrower_float}}} is a mode macro, defined to expand to only those floating point modes that are narrower than the mode of the arithmetic. (This is not a capability that mode macros have at present, but it should be possible to add.) {{{(match_dup 0)}}} automatically refers to the entire pattern, and {{{(match_dup:M n)}}} means "operand ''n'', but with mode ''M''". The output template has to be aware of the selected operation mode, hence {{{%m0}}}, and of whether or not the third operand even ''exists'', which I paper over with {{{%s3}}}. (These may or may not correspond to actual ia64 {{{output_operand}}} modifier letters.) One could then go even farther, and use mode macros in the {{{define_fp_insn}}}, so that it would not need repeating for every floating-point mode.

----

Having done all that, it is then comparatively simple to implement {{{__fpreg}}} by adding a new mode, tentatively named {{{RFmode}}} ("register float"), to the back end. GCC will not be aware of the precise format of this mode, as the architecture manual indicates that it may change in the future. Passive conversions will be achieved by making the {{{truncrf?f}}} patterns be no-ops, and by modifying the floating-point operand predicates to accept {{{(float_truncate:MODE (reg:RF))}}} in an rvalue position where {{{(reg:MODE)}}} would have been acceptable.

Restricting the set of operations allowed for {{{__fpreg}}} may require front end changes to produce sensible error messages rather than an ICE or a bizarre link failure (e.g. {{{undefined symbol __divrf3}}}). However, it may be feasible to do that entirely in the back end, by writing stub expanders that call {{{error}}}.

Implementation of {{{#pragma _USE_SF}}} will definitely require front end changes, as GCC currently has no support for lexically scoped {{{#pragma}}}s. (This feature is desirable for C99 and OpenMP support as well.) Assuming some mechanism for propagating the information all the way through the tree optimizers, the back end need only check the state of the {{{#pragma}}} in expanders and choose the appropriate pattern.
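As a sketch of that last step: assuming an invented back-end variable {{{ia64_current_sf}}} that holds the pragma state at expansion time, and a named version of the ''_alts'' pattern behind an invented {{{gen_adddf3_alts}}}, the C body of an {{{adddf3}}} expander might dispatch roughly like this ({{{DONE}}} is the usual macro available in {{{define_expand}}} bodies):

{{{
/* Sketch only: ia64_current_sf and gen_adddf3_alts are invented
   names for this example.  */
if (ia64_current_sf != 0)
  {
    /* A _USE_SF block is active: emit the alternate-control-register
       form, passing the selected status field as operand 3.  */
    emit_insn (gen_adddf3_alts (operands[0], operands[1], operands[2],
                                GEN_INT (ia64_current_sf)));
    DONE;
  }
/* Otherwise fall through and emit the default pattern, which uses
   control register 0.  */
}}}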