[PATCH] PR middle-end/15239: Artificial limit in expand_mult

Mon May 10 15:31:00 GMT 2004

On Mon, 10 May 2004, Alan Modra wrote:
> You seem to be ignoring Jim's comment in the PR that the limit was
> placed there to prevent code size increasing too much.  Memory effects
> are important in all but mickey mouse benchmark code.

Not ignoring at all.  Yes, memory effects can be important, but these
effects and their interactions with scheduling and cache sizes are
clearly very target dependent.  This is why such factors are normally
parameterized into the values returned by rtx_cost.  There is no
requirement that the values returned by rtx_costs correspond exactly
to CPU cycle counts, only their relative values are significant.
Hence most backends use values which encapsulate subtle effects such
as pairability, typical operand distributions and cache effects.

Much like -Os instructs the backend to return rtx_costs based upon
instruction sizes, at other times rtx_costs should encode "overall"
performance merit of an instruction or instruction sequence.

If code expansion is a significant factor that was previously capped by
the middle-end's 12*add_cost rule, the backend can restore that behaviour
by reducing the rtx_cost for a multiplication.  Indeed, if cache effects
are that severe one might argue that it should already be parameterizing
multiplication at less than 12*add_cost.

On many platforms, multiplication cost should be less than "manufacturers
recommended cycle count" to encode cache effects.  On the pentium4, the
rtx_costs indicate multiply/add being 15-to-1, when based on instruction
timings along the value is closer to 30-to-1.  However, the ability to
interleave the scheduling of consecutive multiplications, and break up
register dependencies conversely prefers long add/shift sequences over
atomic "imul" instructions.  Best for backend maintainers to decide...

Consider as an example, a typical 8-bit or 16-bit embedded CPU
without a multiplier.  The cost of multiplication via __mulsi3 might
easily be hundreds of cpu cycles including parameter passing and
function call overhead.  On these platforms, implementing simple
DSP-like algorithms would much prefer a twenty instruction sequence.
Unfortunately, the current CVS sources prohibit the backends from
making these determinations/trade-offs for themselves.

Roger
--