PING Re: [PATCH, MIPS] add new peephole for 74k dspr2

Tue Sep 25 18:06:00 GMT 2012

On Tue, 25 Sep 2012, Richard Sandiford wrote:

> >>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
> >> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
> >> obviously the resulting larger code may adversely affect cache performance 
> >> in some scenarios.
> >
> > That's not how the 4650 DFA models it though.
> >
> > (define_insn_reservation "generic_hilo" 1
> >   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
> >   "imuldiv*3")
> >
> > (define_insn_reservation "r4650_imul" 4
> >   (and (eq_attr "cpu" "r4650")
> >        (eq_attr "type" "imul,imul3,imadd"))
> >   "imuldiv*4")
> >
> > So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> > rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

 I can't track down a reference on R4650 MTHI/MTLO latency; I'd be happy 
to learn of one, or otherwise I wonder where the delay is coming from.  
Also a small update: apparently MULT takes only 3 clocks on the R4650 when 
the operands are 16 bits wide (I am unsure whether it is enough for only 
one of them to be; for a zero by zero multiplication it surely does not 
matter though).  So I think using a MULT here is at least reasonable.
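
 To spell out the arithmetic behind the DFA argument quoted above, here is 
a minimal sketch in C.  The names are made up for illustration and are not 
GCC code; the two constants are simply the "imuldiv*N" counts from the 
quoted reservations.

```c
/* Hypothetical model: cycles each insn keeps the imuldiv unit busy,
   copied from the "imuldiv*N" strings in the quoted reservations.  */
enum { HILO_OCCUPANCY = 3, R4650_IMUL_OCCUPANCY = 4 };

/* Unit occupancy of clearing HI/LO with an MTLO + MTHI pair.  */
static int
mthi_mtlo_occupancy (void)
{
  return 2 * HILO_OCCUPANCY;
}

/* Unit occupancy of clearing HI/LO with a single MULT $0,$0.  */
static int
mult_occupancy (void)
{
  return R4650_IMUL_OCCUPANCY;
}
```

 Under this model the pair occupies the unit for 6 cycles against 4 for 
MULT, which is why the DFA as written would favour MULT whatever the real 
MTHI/MTLO latency is.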

> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
> MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
> required for this optimisation is getting a bit ridiculous.

 I have double-checked some documentation, and in fact many MIPS cores, 
including the current ones, have a configuration option to include either 
a high-performance or an area-efficient multiply/divide (MD) unit.  Take 
the M14Kc for example -- its high-performance unit has a one-cycle 
latency/issue rate for 16-bit multiplication (two-cycle for full 32 bits; 
here the width of rt is explicitly named), whereas the area-efficient unit 
has a 32-cycle latency/issue rate regardless of the operand size 
(obviously iterating over addition one bit at a time).  The latency of 
MTHI/MTLO is 1 with either unit.
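
 As an illustration of why an iterative unit costs the same for any 
operands, here is a minimal sketch in C of a shift-and-add multiplier.  It 
is an assumed model only, not the documented M14Kc implementation; the 
function name and the cycle counter are hypothetical, and it handles the 
unsigned case only for simplicity.

```c
#include <stdint.h>

/* Illustrative shift-and-add multiplier: one bit of rt is examined per
   iteration, and there are always 32 iterations whatever the operand
   values.  An assumed model, not actual hardware behaviour.  */
static uint64_t
iterative_mult (uint32_t rs, uint32_t rt, unsigned *cycles)
{
  uint64_t acc = 0;
  uint64_t addend = rs;

  *cycles = 0;
  for (int bit = 0; bit < 32; bit++)
    {
      if (rt & 1)
        acc += addend;     /* add the partial product for this bit */
      addend <<= 1;        /* weight of the next bit of rt */
      rt >>= 1;
      (*cycles)++;
    }
  return acc;
}
```

 Even a zero by zero multiplication goes through all 32 iterations here, 
which is what makes MULT $0,$0 so unattractive with such a unit.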

 So I think this can't really be selected automatically for all cores; 
some human-supplied knowledge about the MD unit used is required.  That 
obviously affects other operations too, e.g. a multiplication by a 
constant may be cheaper to do either directly or with a sequence of 
additions, depending on the MD unit present (unless optimising for size, 
of course).
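
 A minimal sketch in C of what such a decision could look like, given 
user-supplied MD unit figures.  The structure and function names are 
hypothetical and nothing like this exists in GCC as written; it also 
assumes single-cycle ALU operations and ignores the cost of reading the 
result back with MFLO.

```c
/* Hypothetical description of an MD unit; the values have to be
   supplied by someone who knows how the core was configured.  */
struct md_unit
{
  int mult_latency;   /* cycles for a full 32-bit MULT */
  int hilo_latency;   /* cycles for a single MTHI or MTLO */
};

/* Nonzero if MULT $0,$0 is no slower than an MTHI + MTLO pair for
   clearing HI/LO; a tie goes to MULT as it is the shorter sequence.  */
static int
prefer_mult_for_clear (const struct md_unit *md)
{
  return md->mult_latency <= 2 * md->hilo_latency;
}

/* Nonzero if a direct multiplication by a constant beats a sequence
   of n_alu_ops shift/add insns, each assumed to take one cycle.  */
static int
prefer_mult_for_constant (const struct md_unit *md, int n_alu_ops)
{
  return md->mult_latency < n_alu_ops;
}
```

 With the M14Kc figures above ({2, 1} for the high-performance unit, 
{32, 1} for the area-efficient one) the first function picks MULT for the 
former and the MTHI/MTLO pair for the latter.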

  Maciej