PING Re: [PATCH, MIPS] add new peephole for 74k dspr2
Maciej W. Rozycki
macro@codesourcery.com
Tue Sep 25 18:06:00 GMT 2012
On Tue, 25 Sep 2012, Richard Sandiford wrote:
> >> According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4
> >> cycles on that processor). An MTHI/MTLO pair will take 2 cycles;
> >> obviously the resulting larger code may adversely affect cache performance
> >> in some scenarios.
> >
> > That's not how the 4650 DFA models it though.
> >
> > (define_insn_reservation "generic_hilo" 1
> >   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
> >   "imuldiv*3")
> >
> > (define_insn_reservation "r4650_imul" 4
> >   (and (eq_attr "cpu" "r4650")
> >        (eq_attr "type" "imul,imul3,imadd"))
> >   "imuldiv*4")
> >
> > So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> > rather than 4 cycles. Any attempt to use the DFA would still favour MULT.
I can't track down a reference on R4650 MTHI/MTLO latency; I'd be happy to
learn of one, as otherwise I wonder where the delay is coming from. Also
a small update: apparently MULT takes only 3 clocks on the R4650 when the
operands are 16 bits wide (I am unsure whether it is enough if only one of
them is; for a zero-by-zero multiplication it surely does not matter
though). So I think using a MULT here is at least reasonable.
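To make the arithmetic behind the quoted DFA figures explicit, here is a rough Python tally (not GCC code). The occupancy numbers are copied from the two reservations quoted above; the table layout and the function name are purely illustrative, and the sketch assumes the reservations run back to back with no overlap.

```python
# Cycles each instruction reserves the imuldiv unit for, taken from the
# reservations quoted above. The dictionary itself is illustrative only.
OCCUPANCY = {
    "mthi": 3,  # generic_hilo: "imuldiv*3"
    "mtlo": 3,  # generic_hilo: "imuldiv*3"
    "mult": 4,  # r4650_imul:   "imuldiv*4"
}


def imuldiv_cycles(insns):
    """Total cycles the imuldiv unit is reserved by a sequence of
    instructions, assuming the reservations do not overlap."""
    return sum(OCCUPANCY[insn] for insn in insns)


pair_cycles = imuldiv_cycles(["mthi", "mtlo"])
mult_cycles = imuldiv_cycles(["mult"])
print("MTHI+MTLO:", pair_cycles, "MULT:", mult_cycles)
```

Under these figures the MTHI/MTLO pair holds the unit for 6 cycles against 4 for MULT, which is why the DFA as written would favour MULT irrespective of what the hardware really does.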
> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
> MULT $0,$0 would be a really bad choice. Sigh. The amount of effort
> required for this optimisation is getting a bit ridiculous.
I have double-checked some documentation, and in fact many MIPS cores,
including the current ones, have a configuration option to include either
a high-performance or an area-efficient MD unit. Take the M14Kc for
example -- its high-performance unit has a one-cycle latency/issue rate
for 16-bit multiplication (two-cycle for full 32 bits; here the width of
rt is explicitly named) and the area-efficient one has a 32-cycle
latency/issue rate regardless of the operand size (obviously
iterating over addition one bit at a time). The latency of MTHI/MTLO is 1
across both units.
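As a sketch of how the M14Kc figures above change the answer for clearing HI/LO, the following Python snippet picks the cheaper sequence per MD unit. The latencies are the ones given above; the key names and the function are hypothetical (they are not GCC options), and it assumes that MULT $0,$0 qualifies for the 16-bit case and that the two moves simply issue one after the other.

```python
# Latencies in cycles as described above for the M14Kc MD unit options.
# Key names are made up for this sketch.
MD_UNITS = {
    "high-performance": {"mult16": 1, "mult32": 2, "mthi": 1, "mtlo": 1},
    "area-efficient": {"mult16": 32, "mult32": 32, "mthi": 1, "mtlo": 1},
}


def cheapest_hilo_clear(md_unit):
    """Return (sequence, cycles) for the cheaper way to zero HI/LO."""
    lat = MD_UNITS[md_unit]
    mult = lat["mult16"]              # assumption: zero operands count as 16-bit
    pair = lat["mthi"] + lat["mtlo"]  # assumption: moves issue back to back
    if mult <= pair:
        return ("mult", mult)
    return ("mthi/mtlo", pair)


for unit in MD_UNITS:
    print(unit, cheapest_hilo_clear(unit))
```

With these numbers MULT wins on the high-performance unit (1 cycle against 2) and loses badly on the area-efficient one (32 against 2), so the same core can need either sequence depending on how it was configured.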
So I think this can't really be selected automatically for all cores;
some human-supplied knowledge about the MD unit used is required -- that
obviously affects other operations too, e.g. some multiplications
involving a constant may be cheaper to do either directly or with a
sequence of additions, depending on the MD unit present (unless optimising
for size, of course).
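The constant-multiplication trade-off can be illustrated the same way. This is a deliberately naive Python sketch, not what GCC does: it counts the shifts and additions of a plain binary decomposition of the constant, assumes each of them costs one cycle, ignores the cost of loading the constant and reading LO back, and does not model optimising for size. The function names and the `mult_latency` parameter are hypothetical.

```python
def shift_add_ops(c):
    """Number of shift and add operations needed to compute x * c using a
    plain binary decomposition of the positive constant c."""
    if c <= 0:
        raise ValueError("only positive constants are handled in this sketch")
    bits = [i for i in range(c.bit_length()) if (c >> i) & 1]
    shifts = sum(1 for i in bits if i != 0)  # x << 0 needs no instruction
    adds = len(bits) - 1                     # one add per extra term
    return shifts + adds


def prefer_multiply(c, mult_latency):
    """True if a direct multiply is no slower than the shift/add sequence,
    assuming every shift or add takes one cycle."""
    return mult_latency <= shift_add_ops(c)


# x * 10 = (x << 3) + (x << 1): two shifts and one add.
print(shift_add_ops(10), prefer_multiply(10, 2), prefer_multiply(10, 32))
```

For a constant of 10 the sequence needs 3 operations, so a 2-cycle multiplier makes the direct multiply preferable while a 32-cycle one does not; the decision flips purely on the MD unit latency supplied by whoever knows the core configuration.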
Maciej