[PATCH] Add floating point timings to rs6000_rtx_costs

Roger Sayle roger@eyesopen.com
Tue Jul 6 02:24:00 GMT 2004


On Mon, 5 Jul 2004, David Edelsohn wrote:
> Why does the patch use the FP instruction latency for the cost?
> The values will be used with the COSTS_N_INSNS() macro.

The values that GCC's middle-end cares about are the latencies.
In almost all cases (RTL expansion, if-conversion, RTL simplification,
combine, etc.) the assumption is that rtx_cost is the "cost" or
"time" for the result to become available for use by the next
instruction, relative to a fast integer instruction.  Almost all
CPUs, including PowerPC, express their cycle times in terms of the
pipeline delay through the integer ALU for a simple operation such
as an addition.  Hence COSTS_N_INSNS(1) is the time to perform an
integer addition, i.e. one cycle, and all other costs are relative
to this.

I'm not sure if you've misread my patch, but all of the values in the
struct processor_cost table are scaled by COSTS_N_INSNS, converting
them from raw hardware cycles into "relative cycles" (additions).
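
To make the scaling concrete, here's a minimal sketch of such a table;
the field names and cycle counts are purely illustrative, not the
actual rs6000.c definitions:

    /* COSTS_N_INSNS is GCC's scaling macro from rtl.h.  */
    #define COSTS_N_INSNS(N) ((N) * 4)

    /* Illustrative cost table: each entry is a hardware latency
       (in cycles) expressed relative to a 1-cycle integer add.  */
    struct example_processor_cost
    {
      int mulsi;   /* SImode multiply */
      int fp;      /* simple FP add/sub/mul */
      int ddiv;    /* DFmode divide */
    };

    static const struct example_processor_cost example_cost = {
      COSTS_N_INSNS (5),    /* hypothetical 5-cycle integer multiply */
      COSTS_N_INSNS (3),    /* hypothetical 3-cycle FP operation */
      COSTS_N_INSNS (33),   /* hypothetical 33-cycle double divide */
    };

so that a cost hook such as rs6000_rtx_costs can hand the appropriate
entry straight back to the middle-end.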


> I think the values should be the latency of the class of instruction
> divided by either the latency of a simple FP instruction or the latency
> of a simple FXP instructions.  In other words, it should be scaled with
> respect to the cost of a single instruction.

Whilst using FP-relative costs for FP operations is an interesting
idea, it prohibits the comparison of integer vs. floating point costs.
For example, the middle-end can efficiently implement abs, neg, signbit
and copysign using bit-wise integer operations.  Without the ability to
compare the "cost" of doing these operations in the ALU vs. the FPU,
GCC currently resorts to always using the native FP instructions if a
backend has them, even if the source and target of these operations are
in memory (i.e. no FP->int->FP move overhead).  Clearly using a single
reference for relative costs is preferable.
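
To illustrate the kind of expansion in question, here's a sketch
(plain C, not the actual expander code) of fabs and copysign done
entirely with integer bit operations on the IEEE representation:

    #include <stdint.h>
    #include <string.h>

    #define SIGN_BIT (UINT64_C (1) << 63)

    /* abs(x): clear the sign bit of the IEEE double representation.  */
    static double bitwise_fabs (double x)
    {
      uint64_t bits;
      memcpy (&bits, &x, sizeof bits);
      bits &= ~SIGN_BIT;
      memcpy (&x, &bits, sizeof x);
      return x;
    }

    /* copysign(mag,sgn): combine mag's magnitude with sgn's sign bit.  */
    static double bitwise_copysign (double mag, double sgn)
    {
      uint64_t m, s;
      memcpy (&m, &mag, sizeof m);
      memcpy (&s, &sgn, sizeof s);
      m = (m & ~SIGN_BIT) | (s & SIGN_BIT);
      memcpy (&mag, &m, sizeof m);
      return mag;
    }

Whether such a bit-twiddling sequence beats the native FP instructions
is exactly the question the rtx costs are meant to answer, which is why
integer and FP costs need a common scale.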


> This is what my colleagues and I did for the POWER4/POWER5 integer
> multiply.

I think here you're talking about adjusting the latency by the issue
rate, rather than by the latency of a simple integer instruction.  I won't
disagree that backend maintainers might prefer to represent costs
that way, playing the probability game with the availability of co-issuable
instructions.  No one can argue against benchmarks.  [long digression
entitled "Amdahl's law and the fallacy of super-linear speed-up" deleted]


> 	Also, the ChangeLog has a typo referring to "ppc640_cost".

Thanks.  Fixed locally.


> 	PowerPC processors always hold floating point values in FPRs as
> 64-bit quantities.  The value always can be used as an input to any
> floating point operation.  The operation is performed, the result rounded
> to the appropriate precision, and the value stored in the result register.
> If the operand has excess precision, it will be used in the operation.
> Some processors implement an early exit of the single precision FP
> multiply operation when the additional precision will not be visible.

Thanks.  It sounds as though "fmul ; frsp" will therefore always be
equivalent to "fmuls", modulo the effects of double rounding, i.e. the
results are guaranteed to be identical except for the last bit of the
mantissa (within 1 ulp).

This is probably best added as an rs6000-specific peephole2 guarded by
flag_unsafe_math_optimizations, as these semantics aren't guaranteed
across platforms.
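
For what it's worth, here is a small, self-contained C demonstration of
the double-rounding effect (assuming fmaf is a true fused multiply-add
and intermediate results really are kept in double precision, i.e. no
x87-style excess precision; the multiply-add is just a convenient way
to manufacture an exact value with enough trailing bits to expose the
difference):

    #include <math.h>
    #include <stdio.h>

    int main (void)
    {
      float x = 1.0f + 0x1p-12f;   /* exactly representable in float */
      float z = 0x1p-60f;          /* tiny perturbation */

      /* One rounding: x*x + z computed exactly, rounded once to float.  */
      float once = fmaf (x, x, z);

      /* Two roundings: round to double, then to float, analogous to
         "fmul ; frsp" when the inputs carry excess precision.  */
      float twice = (float) ((double) x * (double) x + (double) z);

      printf ("once  = %a\ntwice = %a\n", once, twice);
      return 0;
    }

The two results differ in the last bit of the significand, which is why
the transformation belongs under flag_unsafe_math_optimizations rather
than being applied unconditionally.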

Roger
--


