[PATCH]: GCC Scheduler support for R10000 on MIPS

Wed Aug 6 07:58:00 GMT 2008

Richard Sandiford wrote:
> 
> I think you want "foo, bar" (foo one cycle, then bar the next)
> rather than "foo + bar" (foo and bar simultaneously).  And you don't
> want to tie up the issue and completion units for more than one cycle.

That would make sense.  However, converting the 'foo + bar' to 'foo, bar' only 
seems to work as long as there aren't any repeat rates.  Down in the fdiv bits, 
as it start to calculate the repeat rates, it starts to send the state count out 
of control, to the point where my octane runs out of memory trying to process it 
all.


Here's what I converted one of the fdiv's into:

(define_insn_reservation "r10k_fdiv_single" 12
   (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
        (and (eq_attr "type" "fdiv,frdiv")
             (eq_attr "mode" "SF")))
   "r10k_fpmpy_issue, (r10k_fpdiv * 14), r10k_fpmpy_completion")

I figure that syntax reads as "issue, fpdiv is 14 cycles, completion".  But that 
repeat rate number at 14 makes the insn-automata.c build take a long time (an 
hour minimum).  At a repeat rate of 10, the NDA state count for r10k_a_fpmpy was 
in the 12,000 range (and took 4-5mins).  Plus, the mips.dfa output is 822MB.  So 
I think I'm missing something...


> Looks good.

Neat.  Not too hard :)

Here's a quick question, though.  Integer multiply and divides happen on ALU2. 
The manual makes a note that divides keep ALU2 busy for the duration of the 
divide.  I think this means that division isn't pipelined, and the GCC internals 
manual seems to describe something like this, though the example to me isn't 
easy to decipher.  If I'm interpreting it right, does this look correct?:

(define_insn_reservation "r10k_idiv_single" 34
   (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
        (and (eq_attr "type" "idiv")
             (eq_attr "mode" "SI")))
   "r10k_alu2 * 35, r10k_idiv_single")


It gives this as an example (formatted to look like mine):

(define_insn_reservation "div" 8
   (eq_attr "type" "div")
"i1_pipeline, div * 7, div + (port0 | port1)")

As the manual states, it's issued into the pipeline, divided for 7 cycles (I 
think), then issued to the finishing bit.

Come to think of it, This might apply the the problem above too...  Maybe I need 
to rethink my layout of my fp-multiplier cpu units?


> Again, I'm afraid it's really a case of trying and seeing what gives
> the best performance. ;)

Well, I dug around in mips.md, and I think I found the define_insn statement 
that sets up the "imul" type.  It looks like it only emits a "mult" asm 
instruction.  As far as I could tell, no "multu" or "dmultu" commands look like 
their emitted at all in mips.md.  I'm guess this isn't a widely used instructions?

If correct, I think it's safe then to just keep imul as-is, since it would 
appear that the majority of multiplications would be done via "mult".


> OK.  The time taken to compile something is certainly a valid benchmark.
> (It forms part of SPECINT, of course.)

Okay, I'll probably benchmark a lengthy program like glibc.  Even though the 
final output doesn't work yet (different problem there), it'll still compile and 
I can time it (usually, 3.5-5hrs).


> Nothing goes wrong if you fail to handle an instruction.  The compiler
> won't crash or anything.
> 
> At this stage we're dealing with things like "-march=r4130 -mtune=r10000".
> I don't think any particular handling of imadd is better than any other
> in that case.  So my personal perference would be to leave out the
> unnecessary insns (IMADD, SIGNEXT, FRDIV1, etc.).

Makes sense, so I went ahead and removed them.


> Right.

Done.


> Nope, that's mips.c:mips_cpu_table (which you're already handling
> correctly).  "cpu" is just an .md copy of enum processor_type.

Gotcha.  How about the "cpu" attr in the scheduler definition?  Do those need to 
be removed (to match the values in mips.md's "cpu" type), or is it checking on 
what's passed to -march?


> That'd be OK with me.

Done.


Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org


gcc/
     * config/mips/10000.md: Add R10000 scheduler
     * config/mips/mips.c: Add r10000 params & costs
     * config/mips/mips.h: Add R10k constant
     * config/mips/mips.md: Add r10000 params & incl 10000.md



diff -Naurp gcc.orig/gcc/config/mips/10000.md gcc/gcc/config/mips/10000.md

--- gcc.orig/gcc/config/mips/10000.md	1969-12-31 19:00:00.000000000 -0500
+++ gcc/gcc/config/mips/10000.md	2008-08-05 21:26:34.000000000 -0400
@@ -0,0 +1,252 @@
+;; DFA-based pipeline description for the VR1x000.
+;;   Copyright (C) 2005, 2006, 2008 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+
+;; GCC is free software; you can redistribute it and/or modify it
+;; under the terms of the GNU General Public License as published
+;; by the Free Software Foundation; either version 3, or (at your
+;; option) any later version.
+
+;; GCC is distributed in the hope that it will be useful, but WITHOUT
+;; ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+;; or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+;; License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; <http://www.gnu.org/licenses/>.
+
+
+;; R12K/R14K/R16K are derivatives of R10K, thus copy its description
+;; until specific tuning for each is added.
+
+;; R10000 has an int queue, fp queue, address queue.
+;; The int queue feeds ALU1 and ALU2.
+;; The fp queue feeds the fp-adder and fp-multiplier.
+;; The addr queue feeds the Load/Store unit.
+;;
+;; However, we define the fp-adder and fp-multiplier as
+;; separate automatons, because the fp-multiplier is
+;; divided into fp-multiplier, fp-division, and
+;; fp-squareroot units, all of which share the same
+;; issue and completion logic, yet can operate in
+;; parallel.
+;;
+;; This is based on the model described in the R10K Manual
+;; and it helps to reduce the size of the automata.
+(define_automaton "r10k_a_int, r10k_a_fpadder, r10k_a_addr,
+                   r10k_a_fpmpy, r10k_a_fpdiv, r10k_a_fpsqrt")
+
+(define_cpu_unit "r10k_alu1" "r10k_a_int")
+(define_cpu_unit "r10k_alu2" "r10k_a_int")
+(define_cpu_unit "r10k_fpadd" "r10k_a_fpadder")
+(define_cpu_unit "r10k_fpmpy" "r10k_a_fpmpy")
+(define_cpu_unit "r10k_fpmpy_issue" "r10k_a_fpmpy")
+(define_cpu_unit "r10k_fpmpy_completion" "r10k_a_fpmpy")
+(define_cpu_unit "r10k_fpdiv" "r10k_a_fpdiv")
+(define_cpu_unit "r10k_fpsqrt" "r10k_a_fpsqrt")
+(define_cpu_unit "r10k_loadstore" "r10k_a_addr")
+
+
+;; R10k Loads and Stores.
+(define_insn_reservation "r10k_load" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "load,prefetch,prefetchx"))
+  "r10k_loadstore")
+
+(define_insn_reservation "r10k_store" 0
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "store,fpstore,fpidxstore"))
+  "r10k_loadstore")
+
+(define_insn_reservation "r10k_fpload" 3
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fpload,fpidxload"))
+  "r10k_loadstore")
+
+
+;; Integer add/sub + logic ops, and mt hi/lo can be done by alu1 or alu2.
+;; Miscellaneous arith goes here too (this is a guess).
+(define_insn_reservation "r10k_arith" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "arith,mthilo,slt,clz,const,nop,trap,logical"))
+  "r10k_alu1 | r10k_alu2")
+
+;; We treat mfhilo differently, because we need to know when
+;; it's HI and when it's LO.
+(define_insn_reservation "r10k_mfhi" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "mfhilo")
+            (not (match_operand 1 "lo_operand"))))
+  "r10k_alu1 | r10k_alu2")
+
+(define_insn_reservation "r10k_mflo" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "mfhilo")
+            (match_operand 1 "lo_operand")))
+  "r10k_alu1 | r10k_alu2")
+
+
+;; ALU1 handles shifts, branch eval, and condmove.
+;;
+;; Brancher is separate, but part of ALU1, but can only
+;; do one branch per cycle (is this even implementable?).
+;;
+;; Unsure if the brancher handles jumps and calls as well, but since
+;; they're related, we'll add them here for now.
+(define_insn_reservation "r10k_brancher" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "shift,branch,jump,call"))
+  "r10k_alu1")
+
+(define_insn_reservation "r10k_int_cmove" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "condmove")
+            (eq_attr "mode" "SI,DI")))
+  "r10k_alu1")
+
+
+;; Coprocessor Moves.
+;; mtc1/dmtc1 are handled by ALU1.
+;; mfc1/dmfc1 are handled by the fp-multiplier.
+(define_insn_reservation "r10k_mt_xfer" 3
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "mtc"))
+  "r10k_alu1")
+
+(define_insn_reservation "r10k_mf_xfer" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "mfc"))
+  "r10k_fpmpy_issue, r10k_fpmpy, r10k_fpmpy_completion")
+
+
+;; Only ALU2 does int multiplications and divisions.
+;;
+;; According to the Vr10000 series user manual,
+;; integer mult and div insns can be issued one
+;; cycle earlier if using register Lo, but this is
+;; not modeled here.  We use the latency for the
+;; Lo register, however, as this is the common case.
+;;
+;; Divides also keep ALU2 busy, but this isn't modeled
+;; here.
+(define_insn_reservation "r10k_imul_single" 5
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "imul,imul3")
+            (eq_attr "mode" "SI")))
+  "r10k_alu2 * 6")
+
+(define_insn_reservation "r10k_imul_double" 9
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "imul,imul3")
+            (eq_attr "mode" "DI")))
+  "r10k_alu2 * 10")
+
+(define_insn_reservation "r10k_idiv_single" 34
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "idiv")
+            (eq_attr "mode" "SI")))
+  "r10k_alu2 * 35")
+
+(define_insn_reservation "r10k_idiv_double" 66
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "idiv")
+            (eq_attr "mode" "DI")))
+  "r10k_alu2 * 67")
+
+;; If on the HI register, latency goes up one cycle
+(define_bypass 6 "r10k_imul_single" "r10k_mfhi")
+(define_bypass 10 "r10k_imul_double" "r10k_mfhi")
+(define_bypass 35 "r10k_idiv_single" "r10k_mfhi")
+(define_bypass 67 "r10k_idiv_double" "r10k_mfhi")
+
+
+;; Floating point add/sub, mul, abs value, neg, comp, & moves.
+(define_insn_reservation "r10k_fp_miscadd" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fadd,fabs,fneg,fcmp"))
+  "r10k_fpadd")
+
+(define_insn_reservation "r10k_fp_miscmul" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fmul,fmove"))
+  "r10k_fpmpy_issue, r10k_fpmpy, r10k_fpmpy_completion")
+
+(define_insn_reservation "r10k_fp_cmove" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "condmove")
+            (eq_attr "mode" "SF,DF")))
+  "r10k_fpmpy_issue, r10k_fpmpy, r10k_fpmpy_completion")
+
+
+;; The fcvt.s.[wl] insn has latency 4, repeat 2.
+;; All other fcvt insns have latency 2, repeat 1.
+(define_insn_reservation "r10k_fcvt_single" 4
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fcvt")
+            (eq_attr "cnv_mode" "I2S")))
+  "r10k_fpadd * 2")
+
+(define_insn_reservation "r10k_fcvt_other" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fcvt")
+            (eq_attr "cnv_mode" "!I2S")))
+  "r10k_fpadd")
+
+
+;; Run the fmadd insn through fp-adder first, then fp-multiplier.
+;;
+;; The latency for fmadd is 2 cycles if the result is used
+;; by another fmadd instruction.
+(define_insn_reservation "r10k_fmadd" 4
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fmadd"))
+  "r10k_fpadd, r10k_fpmpy_issue, r10k_fpmpy, r10k_fpmpy_completion")
+
+(define_bypass 2 "r10k_fmadd" "r10k_fmadd")
+
+
+;; Floating point Divisions & square roots.
+(define_insn_reservation "r10k_fdiv_single" 12
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fdiv,frdiv")
+            (eq_attr "mode" "SF")))
+  "r10k_fpmpy_issue, (r10k_fpdiv * 14), r10k_fpmpy_completion")
+
+(define_insn_reservation "r10k_fdiv_double" 19
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fdiv,frdiv")
+            (eq_attr "mode" "DF")))
+  "r10k_fpmpy_issue, (r10k_fpdiv * 21), r10k_fpmpy_completion")
+
+(define_insn_reservation "r10k_fsqrt_single" 18
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fsqrt")
+            (eq_attr "mode" "SF")))
+  "r10k_fpmpy_issue, (r10k_fpsqrt * 20), r10k_fpmpy_completion")
+
+(define_insn_reservation "r10k_fsqrt_double" 33
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fsqrt")
+            (eq_attr "mode" "DF")))
+  "r10k_fpmpy_issue, (r10k_fpsqrt * 35), r10k_fpmpy_completion")
+
+(define_insn_reservation "r10k_frsqrt_single" 30
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "frsqrt")
+            (eq_attr "mode" "SF")))
+  "r10k_fpmpy_issue, (r10k_fpsqrt * 20), r10k_fpmpy_completion")
+
+(define_insn_reservation "r10k_frsqrt_double" 52
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "frsqrt")
+            (eq_attr "mode" "DF")))
+  "r10k_fpmpy_issue, (r10k_fpsqrt * 35), r10k_fpmpy_completion")
+
+
+;; Handle unknown/multi insns here (this is a guess).
+(define_insn_reservation "r10k_unknown" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "unknown,multi"))
+  "r10k_alu1 + r10k_alu2")
diff -Naurp gcc.orig/gcc/config/mips/mips.c gcc/gcc/config/mips/mips.c
--- gcc.orig/gcc/config/mips/mips.c	2008-08-01 21:55:41.000000000 -0400
+++ gcc/gcc/config/mips/mips.c	2008-08-04 01:26:49.000000000 -0400
@@ -593,6 +593,10 @@ static const struct mips_cpu_info mips_c

    /* MIPS IV processors. */
    { "r8000", PROCESSOR_R8000, 4, 0 },
+  { "r10000", PROCESSOR_R10000, 4, 0 },
+  { "r12000", PROCESSOR_R10000, 4, 0 },
+  { "r14000", PROCESSOR_R10000, 4, 0 },
+  { "r16000", PROCESSOR_R10000, 4, 0 },
    { "vr5000", PROCESSOR_R5000, 4, 0 },
    { "vr5400", PROCESSOR_R5400, 4, 0 },
    { "vr5500", PROCESSOR_R5500, 4, PTF_AVOID_BRANCHLIKELY },
@@ -988,6 +992,19 @@ static const struct mips_rtx_cost_data m
  		     1,           /* branch_cost */
  		     4            /* memory_latency */
    },
+  { /* R1x000 */
+    COSTS_N_INSNS (2),            /* fp_add */
+    COSTS_N_INSNS (2),            /* fp_mult_sf */
+    COSTS_N_INSNS (2),            /* fp_mult_df */
+    COSTS_N_INSNS (12),           /* fp_div_sf */
+    COSTS_N_INSNS (19),           /* fp_div_df */
+    COSTS_N_INSNS (5),            /* int_mult_si */
+    COSTS_N_INSNS (9),           /* int_mult_di */
+    COSTS_N_INSNS (34),           /* int_div_si */
+    COSTS_N_INSNS (66),           /* int_div_di */
+		     1,           /* branch_cost */
+		     4            /* memory_latency */
+  },
    { /* SB1 */
      /* These costs are the same as the SB-1A below.  */
      COSTS_N_INSNS (4),            /* fp_add */
@@ -9872,7 +9889,10 @@ mips_issue_rate (void)
  	 but in reality only a maximum of 3 insns can be issued as
  	 floating-point loads and stores also require a slot in the
  	 AGEN pipe.  */
-     return 4;
+    case PROCESSOR_R10000:
+      /* All R10K Processors are quad-issue (being the first MIPS
+         processors to support this feature). */
+      return 4;

      case PROCESSOR_20KC:
      case PROCESSOR_R4130:
diff -Naurp gcc.orig/gcc/config/mips/mips.h gcc/gcc/config/mips/mips.h
--- gcc.orig/gcc/config/mips/mips.h	2008-08-01 21:55:41.000000000 -0400
+++ gcc/gcc/config/mips/mips.h	2008-08-04 00:05:27.000000000 -0400
@@ -66,6 +66,7 @@ enum processor_type {
    PROCESSOR_R7000,
    PROCESSOR_R8000,
    PROCESSOR_R9000,
+  PROCESSOR_R10000,
    PROCESSOR_SB1,
    PROCESSOR_SB1A,
    PROCESSOR_SR71000,
@@ -241,6 +242,7 @@ enum mips_code_readable_setting {
  #define TARGET_MIPS5500             (mips_arch == PROCESSOR_R5500)
  #define TARGET_MIPS7000             (mips_arch == PROCESSOR_R7000)
  #define TARGET_MIPS9000             (mips_arch == PROCESSOR_R9000)
+#define TARGET_MIPS10000            (mips_arch == PROCESSOR_R10000)
  #define TARGET_SB1                  (mips_arch == PROCESSOR_SB1		\
  				     || mips_arch == PROCESSOR_SB1A)
  #define TARGET_SR71K                (mips_arch == PROCESSOR_SR71000)
@@ -267,6 +269,7 @@ enum mips_code_readable_setting {
  #define TUNE_MIPS6000               (mips_tune == PROCESSOR_R6000)
  #define TUNE_MIPS7000               (mips_tune == PROCESSOR_R7000)
  #define TUNE_MIPS9000               (mips_tune == PROCESSOR_R9000)
+#define TUNE_MIPS10000              (mips_tune == PROCESSOR_R10000)
  #define TUNE_SB1                    (mips_tune == PROCESSOR_SB1		\
  				     || mips_tune == PROCESSOR_SB1A)

diff -Naurp gcc.orig/gcc/config/mips/mips.md gcc/gcc/config/mips/mips.md
--- gcc.orig/gcc/config/mips/mips.md	2008-08-01 21:55:41.000000000 -0400
+++ gcc/gcc/config/mips/mips.md	2008-08-05 18:31:52.000000000 -0400
@@ -553,7 +553,7 @@
  ;; Attribute describing the processor.  This attribute must match exactly
  ;; with the processor_type enumeration in mips.h.
  (define_attr "cpu"
- 
"r3000,4kc,4kp,5kc,5kf,20kc,24kc,24kf2_1,24kf1_1,74kc,74kf2_1,74kf1_1,74kf3_2,loongson_2e,loongson_2f,m4k,r3900,r6000,r4000,r4100,r4111,r4120,r4130,r4300,r4600,r4650,r5000,r5400,r5500,r7000,r8000,r9000,sb1,sb1a,sr71000,xlr"
+ 
"r3000,4kc,4kp,5kc,5kf,20kc,24kc,24kf2_1,24kf1_1,74kc,74kf2_1,74kf1_1,74kf3_2,loongson_2e,loongson_2f,m4k,r3900,r6000,r4000,r4100,r4111,r4120,r4130,r4300,r4600,r4650,r5000,r5400,r5500,r7000,r8000,r9000,r10000,sb1,sb1a,sr71000,xlr"
    (const (symbol_ref "mips_tune")))

  ;; The type of hardware hazard associated with this instruction.
@@ -903,6 +903,7 @@
  (include "6000.md")
  (include "7000.md")
  (include "9000.md")
+(include "10000.md")
  (include "sb1.md")
  (include "sr71k.md")
  (include "xlr.md")
diff -Naurp gcc.orig/gcc/doc/invoke.texi gcc/gcc/doc/invoke.texi
--- gcc.orig/gcc/doc/invoke.texi	2008-08-01 21:51:46.000000000 -0400
+++ gcc/gcc/doc/invoke.texi	2008-08-04 00:09:12.000000000 -0400
@@ -11980,6 +11980,7 @@ The processor names are:
  @samp{r2000}, @samp{r3000}, @samp{r3900}, @samp{r4000}, @samp{r4400},
  @samp{r4600}, @samp{r4650}, @samp{r6000}, @samp{r8000},
  @samp{rm7000}, @samp{rm9000},
+@samp{r10000}, @samp{r12000}, @samp{r14000}, @samp{r16000},
  @samp{sb1},
  @samp{sr71000},
  @samp{vr4100}, @samp{vr4111}, @samp{vr4120}, @samp{vr4130}, @samp{vr4300},