This page discusses a number of related problems with, and desired features of, the ia64 back end's floating-point handling, and a plan for solving all of them.
Problems
At root, the problem is that the ia64 back end does not accurately model the hardware. The ia64 architecture's floating point has a number of exotic features:
- Floating-point registers are 82 bits wide and use an (internal, subject-to-change) format which does not exactly correspond to any IEEE data type.
- Values loaded from memory in IEEE single, double, or extended format are automatically extended (but not normalized) to the full register width. Arithmetic instructions all operate on the full register width, with an optional truncation to single or double precision afterward; the truncated value is then immediately extended back to register width. The upshot is that it is almost never necessary to execute a widening conversion instruction, and narrowing conversions can frequently be merged with a preceding arithmetic instruction.
- There are four alternative IEEE control registers. Each instruction can specify which control register is used. Most calculation uses control register 0; register 1 is reserved for use by arithmetic library routines (e.g. division and square root) and has contents defined by the ABI; registers 2 and 3 are for speculative floating-point calculation, which is out of scope for this discussion.
GCC's model is inaccurate in the following ways:
- It is not aware that widening conversions are unnecessary in most cases. This directly causes suboptimal code to be generated. Also, hand-coded assembly sequences (again, such as division and square root) that avoid widening conversions must cheat in some way. The current inline divide is done with a post-reload splitter, at which point hard registers can be referenced in more than one mode. The problem with this is that register allocation is likely to allocate the same scratch registers to all divisions in the same function, which forces them to be serialized against each other even when the scheduler would prefer otherwise.
- It is not aware of the 82-bit register format. This would be harmless if it were not desired to make operations in that format accessible to the user (primarily for ease of coding math libraries); see below.
- Only a limited set of instruction patterns which use the alternate control registers are available. Again, this is only a problem because it is desired to make the alternate control registers accessible to the user. (It will also be a problem if and when we ever get round to implementing speculative calculation.)
There are also related maintenance headaches:
- To model the fact that narrowing conversions can be merged with preceding arithmetic instructions, we must create extra insn patterns: one for each DFmode pattern and two for each XFmode pattern, differing only in the RTL template. This produces massive duplication of code.
- Adding a complete set of alternate-control-register patterns will require yet more duplication: for every pattern (including the merged-narrowing-conversion patterns) a duplicate differing only in the RTL template and output format.
Features
It is desired to add to GCC some features of the HP-UX system compiler for ia64. They facilitate coding high-speed math libraries in C rather than in assembly.
__fpreg
__fpreg is an extended floating-point type which provides user access to the full width of the floating-point registers. It has the following properties:
- There is no way to write a constant with type __fpreg.
- Only the following operations on __fpreg types are supported:
  - Unary minus (-) and the address operator (&)
  - Equality operators (==, !=)
  - Relationals (<, <=, >, >=)
  - Casts to/from any other arithmetic type (except __float128; see below)
- Mixed-mode calculations with __fpreg and a narrower type, in the usual C fashion, widen the narrower type to __fpreg before performing the calculation (as discussed above, this normally requires no explicit conversion instruction).
- Widening conversion from __fpreg to __float128 (IEEE quad, implemented only in software) is not allowed.
- Narrowing conversion from __fpreg to float, double, or __float80 (IEEE extended) is "passive": the value is not truncated. The next arithmetic operation in the narrower mode will see the full-width value. (If the value is written to memory in the narrower mode, it is of course truncated. This has the potential to cause problems similar to those seen on the i386 architecture with its wide internal calculations; however, the vast number of registers available on the ia64 architecture makes the odds of users actually experiencing problems small.)
- Conversions between __fpreg and integer types have the usual C semantics.
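A usage sketch under the rules above. This is proposed syntax that does not compile with current GCC; the function and variable names are illustrative only, and arithmetic on __fpreg values is assumed to work as the mixed-mode rule above implies:

```c
/* Usage sketch for the proposed __fpreg type (illustrative names).  */
double dot3 (const double *a, const double *b)
{
  __fpreg acc = (__fpreg) 0.0;  /* no __fpreg constants; cast instead */
  for (int i = 0; i < 3; i++)
    acc = acc + a[i] * b[i];    /* narrower operand widened to __fpreg
                                   for free, per the mixed-mode rule */
  return (double) acc;          /* passive narrowing: the value is only
                                   truncated if stored to memory */
}
```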
#pragma USE_SF

This #pragma gives control over the choice of floating-point control register. It has the syntax #pragma USE_SF n, where n is 0, 1, 2, or 3. Its effect is to cause all subsequent floating-point operations to use the specified control register, until the end of the containing block. It is constrained to appear only once per lexical block, and only at the beginning thereof. HP's specification says that the effect of the #pragma applies only to assembly intrinsics, but acc applies it to all operations. The GCC implementation will be consistent with acc. Note that inline divide and square root will always use control register 1 for intermediate calculations.
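A usage sketch of the block-scoped behavior described above (proposed syntax, not supported by current GCC; per the acc behavior, all floating-point operations in the block are affected, not only intrinsics):

```c
double scale (double x, double y)
{
  double r;
  {
#pragma USE_SF 2            /* first thing in the block: floating-point
                               operations below use control register 2 */
    r = x * y;
  }                         /* effect ends with the enclosing block */
  return r + 1.0;           /* back to control register 0 */
}
```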
Assembly intrinsics

acc supports a large set of intrinsics (machine-specific builtins, in GCC terminology) which map directly to floating-point instructions that may not be readily accessible from C. It would be nice to support these. Once all of the above features and improvements are implemented, this will be easy, as GCC already has plenty of support for machine-specific builtins.
Plan

The hard part is modeling the machine behavior without a combinatorial explosion in the size of ia64.md. It is also desirable, although less important, to avoid a combinatorial explosion in the size of the generated files.

We believe free extension of narrow to wide floating-point modes is best modeled by creating new floating-point operand predicates which accept either (reg:M1) or (float_extend:M1 (reg:M2)), where M2 is narrower than M1. These would be used for the input operands of arithmetic instructions. Combine should then be able to merge explicit extension instructions with arithmetic. However, Richard Henderson cautions that this may require changes to reload (specifically, to recognize that the thing that needs reloading is (reg:M2)).

It is already possible to model free truncation after arithmetic, and the alternate control registers; the only issue is the combinatorial explosion in the size of ia64.md and the concomitant maintenance problems. Consider the example set of arithmetic patterns below. As you can see, it is highly repetitive. (The *adddf3_alts and *adddf3_trunc_alts patterns do not actually exist in the machine description: _alts patterns have only been added when they are used directly by other parts of the machine description, in an effort to keep the repetition down.)

What we would like is a notation that lets us write just the first pattern, or something very like it, and have the other three patterns synthesized. Richard Sandiford's "mode macros" and "code macros" are the most obvious related feature, but they do not facilitate mutating the RTL template. A better analogy is define_cond_exec, which does create modified patterns with mutated RTL templates. The define_pattern_macro construct shown below, after the example patterns, is a half-baked suggestion for something that might work. There, narrower_float is a mode macro defined to expand to only those floating-point modes that are narrower than the mode of the arithmetic. (This is not a capability that mode macros have at present, but it should be possible to add.) (match_dup 0) automatically refers to the entire pattern, and (match_dup:M n) means "operand n, but with mode M". The output template has to be aware of the selected operation mode, hence %m0, and of whether or not the third operand even exists, which I paper over with %s3. (These may or may not correspond to actual ia64 output_operand modifier letters.) One could then go even further and use mode macros in the define_fp_insn itself, so that it would not need repeating for every floating-point mode.

Having done all that, it is comparatively simple to implement __fpreg by adding a new mode, tentatively named RFmode ("register float"), to the back end. GCC will not be aware of the precise format of this mode, as the architecture manual indicates that it may change in the future. Passive conversions will be achieved by making the truncrf?f patterns no-ops, and by modifying the floating-point operand predicates to accept (float_truncate:MODE (reg:RF)) in an rvalue position wherever (reg:MODE) would have been acceptable.

Restricting the set of operations allowed on __fpreg may require front-end changes to produce sensible error messages rather than an ICE or a bizarre link failure (e.g. an undefined symbol __divrf3). However, it may be feasible to do this entirely in the back end, by writing stub expanders that call error. Implementation of #pragma USE_SF will definitely require front-end changes, as GCC currently has no support for lexically scoped #pragmas. (Such support is desirable for C99 and OpenMP as well.) Assuming some mechanism for propagating the information all the way through the tree optimizers, the back end need only check the state of the #pragma in its expanders and choose the appropriate pattern.
(define_insn "adddf3"
[(set (match_operand:DF 0 "fr_register_operand" "=f")
(plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
(match_operand:DF 2 "fr_reg_or_fp01_operand" "fG")))]
""
"fadd.d %0 = %1, %F2"
[(set_attr "itanium_class" "fmac")])
(define_insn "*adddf3_trunc"
[(set (match_operand:SF 0 "fr_register_operand" "=f")
(float_truncate:SF
(plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
(match_operand:DF 2 "fr_reg_or_fp01_operand" "fG"))))]
""
"fadd.s %0 = %1, %F2"
[(set_attr "itanium_class" "fmac")])
(define_insn "*adddf3_alts"
[(set (match_operand:DF 0 "fr_register_operand" "=f")
(plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
(match_operand:DF 2 "fr_reg_or_fp01_operand" "fG")))
(use (match_operand:SI 3 "const_int_operand" ""))]
""
"fadd.d.s%3 %0 = %1, %F2"
[(set_attr "itanium_class" "fmac")])
(define_insn "*adddf3_trunc_alts"
[(set (match_operand:SF 0 "fr_register_operand" "=f")
(float_truncate:SF
(plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
(match_operand:DF 2 "fr_reg_or_fp01_operand" "fG"))))
(use (match_operand:SI 3 "const_int_operand" ""))]
""
"fadd.s.s%3 %0 = %1, %F2"
[(set_attr "itanium_class" "fmac")])
(define_pattern_macro "fp_insn"
[(set (match_operand 1) (match_operand 2))]
["*_trunc<narrower_float>"
(parallel [(set (match_dup:narrower_float 1)
(float_truncate:narrower_float (match_dup 2))) ])
"*_alts"
(parallel [(match_dup 0)
(use (match_operand:SI 3 "const_int_operand" "")) ])
"*_trunc<narrower_float>_alts"
(parallel [(set (match_dup:narrower_float 1)
(float_truncate:narrower_float (match_dup 2)))
(use (match_operand:SI 3 "const_int_operand" "")) ])
])
(define_fp_insn "adddf3"
[(set (match_operand:DF 0 "fr_register_operand" "=f")
(plus:DF (match_operand:DF 1 "fr_register_operand" "%f")
(match_operand:DF 2 "fr_reg_or_fp01_operand" "fG")))]
""
"fadd%m0%s3 %0 = %1, %F2"
[(set_attr "itanium_class" "fmac")])
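Going one step further, as suggested in the plan, mode macros could fold the per-mode copies of the define_fp_insn into one. A hypothetical sketch follows: define_fp_insn and the %m0/%s3 modifiers are the proposed constructs from above, and the FMODE macro name is made up for this illustration.

```
;; Hypothetical sketch only: FMODE is an assumed mode macro covering
;; the arithmetic floating-point modes, so one definition would stand
;; in for addsf3, adddf3, and addxf3 (plus all synthesized variants).
(define_mode_macro FMODE [SF DF XF])

(define_fp_insn "add<mode>3"
  [(set (match_operand:FMODE 0 "fr_register_operand" "=f")
        (plus:FMODE (match_operand:FMODE 1 "fr_register_operand" "%f")
                    (match_operand:FMODE 2 "fr_reg_or_fp01_operand" "fG")))]
  ""
  "fadd%m0%s3 %0 = %1, %F2"
  [(set_attr "itanium_class" "fmac")])
```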