This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[PATCH]: Add pipeline description for Mips R10K-series



Hi all,


Attached is my first shot at porting the R10K pipeline description from gcc-3.4.x over to gcc-4. The initial work was done on gcc-4.0.2, but after comparing 4.0, 4.1, and 4.2, the changes needed to match it to HEAD seem relatively minor (imul3 additions, etc.). I'm not sure how well this fits into gcc itself, as I haven't really tested whether it generates suitable code (MIPS asm isn't something I'm skilled in).

What I can say is that I have tested it by compiling gcc and building several userland packages with it, notably glibc, and it seems to have passed that, so I suspect that's a decent start.

Some other observations, noted below, point out things that probably need repair in this patch, so feedback is appreciated:

*  DFA statistics seem to indicate my r10k_fp automaton is *way* out
   of whack.  See:

     Automaton `r10k_fp'
          6052 NDFA states,          20004 NDFA arcs
          6052 DFA states,           20004 DFA arcs
          5714 minimal DFA states,   19098 minimal DFA arcs
           284 all insns         11 insn equivalence classes
     23453 transition comb vector els, 62854 trans table els: use comb vect
     23453 state alts comb vector els, 62854 state alts table els: use comb vect
     62854 min delay table els, compression factor 1

   The statistics for the other MIPS automata barely crack 1000 for any
   of the above numbers, so I think I've got something specified
   incorrectly in the use of that automaton and its associated cpu_units.


* As the comments in the patch indicate, there are some bits of the R10K I couldn't quite figure out how to model properly with the DFA description. If anyone has pointers on fixing this up or representing them in DFA format, I'm all ears:

	* R10K Branch unit is separate from, but a part of ALU1, and can do
	  one branch per cycle.  Unsure how to implement this, since I'm not
	  sure if it's only the branch unit being busy for 1 cycle, or all
	  of ALU1 as well.

	* Also unsure if the branch unit handles jump & calls (which
	  are unconditional, whereas branch is conditional according to
	  the mips.md description).  R10K manual specifies the brancher
	  handles conditional branches, but doesn't specify whether this
	  includes unconditional cousins, like jump or call.

	* mtc1/dmtc1 is handled by ALU1, while mfc1/dmfc1 is handled by
	  the fp-multiplier.  I followed the method used in sb1.md on
	  splitting the 'xfer' type into the two instructions, but I'm
	  unsure if this is done properly.

	* For mult/dmult/div/ddiv and their unsigned variants: these
	  instructions can execute one cycle earlier if using register Lo, but I'm
	  unsure on proper detection of this (using match_operand and
	  lo_operand).  It appears match_operand takes a specific
	  operand 'n' to pass to lo_operand to check, but I'm not sure
	  which operand offset to pass, and whether every instruction
	  has the operand representing Lo in the same offset (i.e., 0
	  or 1).

	* For integer division, ALU2 remains busy for the remainder of the
	  divide operation.  The define_insn_reservation example provided
	  in the Internals manual seems to describe a very similar scenario,
	  but I was unsure because both the insn reservation and the cpu_unit
	  were named "div", so the dual naming makes the example a little
	  less clear on how this is to be implemented.

	* Fp division and square root are technically handled by separate
	  execution units that can operate in parallel.  However, both of
	  these units share their issue and completion logic with the fp
	  multiplier.  My understanding of this and of DFA is thus that if
	  the multiplier is busy, it won't be able to issue or handle a
	  completion from either of these two units.  So for this section
	  I used the DFA setup which I *think* implies sending the insn
	  only if both units are free.  But this may explain the absurd DFA
	  statistics, so some sanity checking here would be welcome.

	* Unsure what to do with unknown/multi instructions.  I mimicked what
	  other pipeline descriptions did and just stated all cpu_units are
	  to process them when free.

	* Latency for mult/div (int or fp? the manual doesn't specify) is
	  dependent on the subsequent instruction.  My guess is this means
	  there's some calculation out there for determining the appropriate
	  latency based on the instruction that follows a mult/div, and that
	  these would be implemented via define_bypass.  I've yet to find such
	  calculations or even a chart of these dependencies, though.

	* Prefetch instructions aren't listed in the latency chart provided
	  in the R10K manual, so I took a guess and gave them the same
	  latency as a standard int load.

	* There are a lot of notes in the R10K manual about changes done to
	  improve on the design in the R12K.  Most of these appear to not
	  affect the pipeline description enough to be of concern, but there
	  was one that caught my attention.  When load/store, cacheop, or
	  prefetch insns are decoded, they are sent into the integer queue.
	  The IQ treats the address-calculate unit as an "ALU3" for these
	  calculations, and when finished, they get put into the Address queue
	  for further processing.  I'm not sure if this can be modeled in DFA
	  and whether it's even worth it, since this seems to be the only
	  difference that I can find in R12K material that affects the pipeline
	  description in any significant way.

	* No information is known on what the precise differences are in the
	  R14K and R16K.  All that is known is that they are derivatives of
	  the R10K, only having a faster core speed and smaller die size.  If
	  there is a newer version of the R10K manual that contains notes on
	  these processors and their enhancements, I sure haven't found it.

	* Not even close to knowing what to do for COSTS_N_INSNS or how to
	  calculate these values correctly.  Is there a known formula out
	  there that simply needs values plugged in to get an average value?
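
On the integer-division item above: here is a minimal sketch of the Internals manual example as I recall it, with the cpu unit renamed so it can't be confused with the reservation.  The name "divider" is mine, not the manual's; i1_pipeline, port0, port1 and the "div" type are from the manual's example, quoted from memory.

```
;; Units as in the manual's example, except that the divider unit is
;; renamed from "div" to "divider" (hypothetical name).
(define_cpu_unit "i1_pipeline")
(define_cpu_unit "port0, port1")
(define_cpu_unit "divider")

;; Issue on i1_pipeline, then hold the divider for the next 7 cycles,
;; then hold it one more cycle together with one of the two ports.
(define_insn_reservation "div" 8 (eq_attr "type" "div")
  "i1_pipeline, divider*7, divider + (port0 | port1)")
```

If I'm reading the "*" repetition correctly, then "r10k_alu2 * 35" in the attached 10000.md already keeps ALU2 reserved for the whole 32-bit divide, and my "isn't expressed here" comment in that file may simply be wrong.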

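On the r10k_fp statistics and the fp div/sqrt item above: one thing I may try is moving the two long-latency units into automata of their own, since a unit that stays reserved for 20 or 35 cycles multiplies the number of states of whatever automaton it lives in.  A minimal sketch, assuming that really is the cause; the automaton names r10k_fpdiv_a and r10k_fpsqrt_a are made up for illustration and are not in the attached patch:

```
;; Hypothetical split: each long-latency unit gets its own automaton
;; instead of sharing r10k_fp with the adder and the multiplier.
(define_automaton "r10k_fpdiv_a, r10k_fpsqrt_a")
(define_cpu_unit "r10k_fpdiv" "r10k_fpdiv_a")
(define_cpu_unit "r10k_fpsqrt" "r10k_fpsqrt_a")

;; Reservation string unchanged from the patch: the multiplier is taken
;; on the issue cycle only, the divider stays busy for 14 cycles.
(define_insn_reservation "r10k_fdiv_single" 12
  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
       (and (eq_attr "type" "fdiv,frdiv")
            (eq_attr "mode" "SF")))
  "r10k_fpmpy + (r10k_fpdiv * 14)")
```

I have not measured whether this actually brings the numbers down, so treat it as a guess rather than a fix.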

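On the branch-unit item above: if it turns out that a branch occupies both ALU1 and the brancher for its issue cycle, a separate cpu unit could describe the brancher.  This is a sketch only; r10k_br and r10k_branch are hypothetical names, and "branch" would presumably have to come out of the r10k_shift reservation in the patch so that the two conditions don't overlap:

```
;; Hypothetical brancher unit, kept in the integer automaton.
(define_cpu_unit "r10k_br" "r10k_int")

;; A conditional branch takes ALU1 and the brancher in the same cycle.
(define_insn_reservation "r10k_branch" 1
  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
       (eq_attr "type" "branch"))
  "r10k_alu1 + r10k_br")
```

If instead only the brancher is busy, the reservation string would be just "r10k_br", which leaves ALU1 free for a shift in the same cycle.  Which of the two is correct is exactly what I couldn't determine from the manual.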

I think that covers everything for now, for those who've read down this far. I can definitely say this processor is far more unusual than I imagined. Comments on the above notes (like tips on how to resolve them) and feedback on the overall patch are welcome. I can then go back and tweak things further once I have a better understanding, and hopefully deliver something worthy of inclusion.

Much thanks to the guys in #gcc for putting up with my questions too :)

References:
NEC VR10000 Manual:
http://www.ee.nec.de/_pdf/U10278EJ4V0UM00.PDF

SGI Version w/ R12K Errata:
http://techpubs.sgi.com/library/manuals/2000/007-2490-001/pdf/007-2490-001.pdf



--Kumba

--
Gentoo/MIPS Team Lead
Gentoo Foundation Board of Trustees

"Such is oft the course of deeds that move the wheels of the world: small hands do them because they must, while the eyes of the great are elsewhere." --Elrond
diff -Naurp gcc-4.0.2.orig/gcc/config/mips/10000.md gcc-4.0.2/gcc/config/mips/10000.md
--- gcc-4.0.2.orig/gcc/config/mips/10000.md	1969-12-31 19:00:00 -0500
+++ gcc-4.0.2/gcc/config/mips/10000.md	2005-12-18 19:27:05 -0500
@@ -0,0 +1,243 @@
+;; VR1x000 pipeline description.
+;;   Copyright (C) 2005, 2006 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+
+;; GCC is free software; you can redistribute it and/or modify it
+;; under the terms of the GNU General Public License as published
+;; by the Free Software Foundation; either version 2, or (at your
+;; option) any later version.
+
+;; GCC is distributed in the hope that it will be useful, but WITHOUT
+;; ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+;; or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+;; License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING.  If not, write to the
+;; Free Software Foundation, 51 Franklin Street, Fifth Floor, Boston,
+;; MA 02110-1301, USA.
+
+
+;; This file overrides parts of generic.md.  It is derived from the
+;; old define_function_unit description.
+
+
+
+;; R12K/R14K/R16K are derivatives of R10K, thus copy its description
+;; until specific tuning for each is added
+
+
+;; R10000 has int queue, fp queue, address queue
+(define_automaton "r10k_int, r10k_fp, r10k_addr")
+
+;; R10000 has 2 integer ALUs, load/store, fp-adder and fp-multiplier
+(define_cpu_unit "r10k_alu1" "r10k_int")
+(define_cpu_unit "r10k_alu2" "r10k_int")
+(define_cpu_unit "r10k_fpadd" "r10k_fp")
+(define_cpu_unit "r10k_fpmpy" "r10k_fp")
+(define_cpu_unit "r10k_loadstore" "r10k_addr")
+
+;; R10000 has separate fp-div and fp-sqrt units as well and these can
+;; execute in parallel, however their issue & completion logic is shared
+;; by the fp-multiplier
+(define_cpu_unit "r10k_fpdiv" "r10k_fp")
+(define_cpu_unit "r10k_fpsqrt" "r10k_fp")
+
+
+
+
+;; loader
+(define_insn_reservation "r10k_load" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "load,prefetch,prefetchx"))
+  "r10k_loadstore")
+
+(define_insn_reservation "r10k_store" 0
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "store,fpstore,fpidxstore"))
+  "r10k_loadstore")
+
+(define_insn_reservation "r10k_fpload" 3
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fpload,fpidxload"))
+  "r10k_loadstore")
+
+
+
+
+;; Integer add/sub + logic ops, and mf/mt hi/lo can be done by alu1 or alu2
+;; Miscellaneous arith goes here too (this is a guess)
+(define_insn_reservation "r10k_arith" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "arith,mfhilo,mthilo,slt,clz,const,nop,trap"))
+  "r10k_alu1 | r10k_alu2")
+
+
+
+
+;; ALU1 handles shifts, branch eval, and condmove
+;;
+;; Brancher is separate, but part of ALU1, but can only
+;; do one branch per cycle (but needs implementing)
+;;
+;; jump, call - unsure if brancher handles these too
+(define_insn_reservation "r10k_shift" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "shift,branch"))
+  "r10k_alu1")
+
+(define_insn_reservation "r10k_int_cmove" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "condmove")
+            (eq_attr "mode" "SI,DI")))
+  "r10k_alu1")
+
+
+
+
+;; Coprocessor Moves
+;; mtc1/dmtc1 are handled by ALU1
+;; mfc1/dmfc1 are handled by the fp-multiplier
+(define_insn_reservation "r10k_mt_xfer" 3
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "xfer")
+            (not (match_operand 0 "fpr_operand"))))
+  "r10k_alu1")
+
+(define_insn_reservation "r10k_mf_xfer" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "xfer")
+            (match_operand 0 "fpr_operand")))
+  "r10k_fpmpy")
+
+
+
+
+;; Only ALU2 does int multiplications and divisions
+;; R10K allows an int insn using register Lo to be issued
+;; one cycle earlier than an insn using register Hi for
+;; the insns below, however, we skip on doing this
+;; for now until usage of lo_operand() is figured out.
+;;
+;; Divides keep ALU2 busy, but this isn't expressed here
+(define_insn_reservation "r10k_imul_single" 6
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "imul,imadd")
+            (eq_attr "mode" "SI")))
+  "r10k_alu2 * 6")
+
+(define_insn_reservation "r10k_imul_double" 10
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "imul,imadd")
+            (eq_attr "mode" "DI")))
+  "r10k_alu2 * 10")
+
+(define_insn_reservation "r10k_idiv_single" 35
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "idiv")
+            (eq_attr "mode" "SI")))
+  "r10k_alu2 * 35")
+
+(define_insn_reservation "r10k_idiv_double" 67
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "idiv")
+            (eq_attr "mode" "DI")))
+  "r10k_alu2 * 67")
+
+
+
+
+;; FP add/sub, madd, mul, abs value, neg, comp, convert (single, other), & moves
+(define_insn_reservation "r10k_fp_miscadd" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fadd,fabs,fneg,fcmp"))
+  "r10k_fpadd")
+
+(define_insn_reservation "r10k_fp_miscmul" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fmul,fmove"))
+  "r10k_fpmpy")
+
+(define_insn_reservation "r10k_fp_cmove" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "condmove")
+            (eq_attr "mode" "SF,DF")))
+  "r10k_fpmpy")
+
+;; Runs through fp-adder first, then fp-multiplier
+(define_insn_reservation "r10k_fmadd" 4
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "fmadd"))
+  "r10k_fpadd, r10k_fpmpy")
+
+;; For cvt.s.w/cvt.s.l
+(define_insn_reservation "r10k_fcvt_single" 4
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fcvt")
+            (eq_attr "mode" "SF")))
+  "r10k_fpadd * 2")
+
+;; For all other cvt
+(define_insn_reservation "r10k_fcvt_other" 2
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fcvt")
+            (eq_attr "mode" "!SF")))
+  "r10k_fpadd")
+
+
+
+
+;; The latency for fmadd is 2 cycles if the result is used
+;; by another fmadd instruction
+(define_bypass 2 "r10k_fmadd" "r10k_fmadd")
+
+
+
+
+;; Divisions & square roots 
+(define_insn_reservation "r10k_fdiv_single" 12
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fdiv,frdiv")
+            (eq_attr "mode" "SF")))
+  "r10k_fpmpy + (r10k_fpdiv * 14)")
+
+(define_insn_reservation "r10k_fdiv_double" 19
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fdiv,frdiv")
+            (eq_attr "mode" "DF")))
+  "r10k_fpmpy + (r10k_fpdiv * 21)")
+
+(define_insn_reservation "r10k_fsqrt_single" 18
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fsqrt")
+            (eq_attr "mode" "SF")))
+  "r10k_fpmpy + (r10k_fpsqrt * 20)")
+
+(define_insn_reservation "r10k_fsqrt_double" 33
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "fsqrt")
+            (eq_attr "mode" "DF")))
+  "r10k_fpmpy + (r10k_fpsqrt * 35)")
+
+(define_insn_reservation "r10k_frsqrt_single" 30
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "frsqrt")
+            (eq_attr "mode" "SF")))
+  "r10k_fpmpy + (r10k_fpsqrt * 20)")
+
+(define_insn_reservation "r10k_frsqrt_double" 52
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (and (eq_attr "type" "frsqrt")
+            (eq_attr "mode" "DF")))
+  "r10k_fpmpy + (r10k_fpsqrt * 35)")
+
+
+
+
+;; Unknown/multi (this is a guess)
+(define_insn_reservation "r10k_unknown" 1
+  (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
+       (eq_attr "type" "unknown,multi"))
+  "r10k_alu1 + r10k_alu2 + r10k_fpadd + r10k_fpmpy + r10k_fpdiv + r10k_fpsqrt")
+
diff -Naurp gcc-4.0.2.orig/gcc/config/mips/mips.c gcc-4.0.2/gcc/config/mips/mips.c
--- gcc-4.0.2.orig/gcc/config/mips/mips.c	2005-05-08 07:56:53 -0400
+++ gcc-4.0.2/gcc/config/mips/mips.c	2005-12-18 19:12:24 -0500
@@ -688,6 +688,10 @@ const struct mips_cpu_info mips_cpu_info
 
   /* MIPS IV */
   { "r8000", PROCESSOR_R8000, 4 },
+  { "r10000", PROCESSOR_R10000, 4 },
+  { "r12000", PROCESSOR_R12000, 4 },
+  { "r14000", PROCESSOR_R14000, 4 },
+  { "r16000", PROCESSOR_R16000, 4 },
   { "vr5000", PROCESSOR_R5000, 4 },
   { "vr5400", PROCESSOR_R5400, 4 },
   { "vr5500", PROCESSOR_R5500, 4 },
@@ -9311,6 +9315,12 @@ mips_issue_rate (void)
 {
   switch (mips_tune)
     {
+    case PROCESSOR_R10000:
+    case PROCESSOR_R12000:
+    case PROCESSOR_R14000:
+    case PROCESSOR_R16000:
+      return 4;
+
     case PROCESSOR_R4130:
     case PROCESSOR_R5400:
     case PROCESSOR_R5500:
diff -Naurp gcc-4.0.2.orig/gcc/config/mips/mips.h gcc-4.0.2/gcc/config/mips/mips.h
--- gcc-4.0.2.orig/gcc/config/mips/mips.h	2005-04-15 03:00:18 -0400
+++ gcc-4.0.2/gcc/config/mips/mips.h	2005-12-18 19:12:24 -0500
@@ -58,6 +58,10 @@ enum processor_type {
   PROCESSOR_R7000,
   PROCESSOR_R8000,
   PROCESSOR_R9000,
+  PROCESSOR_R10000,
+  PROCESSOR_R12000,
+  PROCESSOR_R14000,
+  PROCESSOR_R16000,
   PROCESSOR_SB1,
   PROCESSOR_SR71000
 };
@@ -306,6 +310,10 @@ extern const struct mips_cpu_info *mips_
 #define TARGET_MIPS5500             (mips_arch == PROCESSOR_R5500)
 #define TARGET_MIPS7000             (mips_arch == PROCESSOR_R7000)
 #define TARGET_MIPS9000             (mips_arch == PROCESSOR_R9000)
+#define TARGET_MIPS10000            (mips_arch == PROCESSOR_R10000)
+#define TARGET_MIPS12000            (mips_arch == PROCESSOR_R12000)
+#define TARGET_MIPS14000            (mips_arch == PROCESSOR_R14000)
+#define TARGET_MIPS16000            (mips_arch == PROCESSOR_R16000)
 #define TARGET_SB1                  (mips_arch == PROCESSOR_SB1)
 #define TARGET_SR71K                (mips_arch == PROCESSOR_SR71000)
 
@@ -321,6 +329,10 @@ extern const struct mips_cpu_info *mips_
 #define TUNE_MIPS6000               (mips_tune == PROCESSOR_R6000)
 #define TUNE_MIPS7000               (mips_tune == PROCESSOR_R7000)
 #define TUNE_MIPS9000               (mips_tune == PROCESSOR_R9000)
+#define TUNE_MIPS10000              (mips_tune == PROCESSOR_R10000)
+#define TUNE_MIPS12000              (mips_tune == PROCESSOR_R12000)
+#define TUNE_MIPS14000              (mips_tune == PROCESSOR_R14000)
+#define TUNE_MIPS16000              (mips_tune == PROCESSOR_R16000)
 #define TUNE_SB1                    (mips_tune == PROCESSOR_SB1)
 
 /* True if the pre-reload scheduler should try to create chains of
diff -Naurp gcc-4.0.2.orig/gcc/config/mips/mips.md gcc-4.0.2/gcc/config/mips/mips.md
--- gcc-4.0.2.orig/gcc/config/mips/mips.md	2005-05-08 07:56:58 -0400
+++ gcc-4.0.2/gcc/config/mips/mips.md	2005-12-18 19:13:00 -0500
@@ -252,7 +252,7 @@
 ;; Attribute describing the processor.  This attribute must match exactly
 ;; with the processor_type enumeration in mips.h.
 (define_attr "cpu"
-  "default,4kc,5kc,20kc,m4k,r3000,r3900,r6000,r4000,r4100,r4111,r4120,r4130,r4300,r4600,r4650,r5000,r5400,r5500,r7000,r8000,r9000,sb1,sr71000"
+  "default,4kc,5kc,20kc,m4k,r3000,r3900,r6000,r4000,r4100,r4111,r4120,r4130,r4300,r4600,r4650,r5000,r5400,r5500,r7000,r8000,r9000,r10000,r12000,r14000,r16000,sb1,sr71000"
   (const (symbol_ref "mips_tune")))
 
 ;; The type of hardware hazard associated with this instruction.
@@ -498,6 +498,7 @@
 (include "6000.md")
 (include "7000.md")
 (include "9000.md")
+(include "10000.md")
 (include "sb1.md")
 (include "sr71k.md")
 (include "generic.md")
