This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
[committed] optimise MIPS %highest/%higher/%high sequences
- From: Richard Sandiford <rsandifo@redhat.com>
- To: gcc-patches@gcc.gnu.org
- Date: Thu, 02 Sep 2004 19:35:09 +0100
- Subject: [committed] optimise MIPS %highest/%higher/%high sequences
In static n64 code, there are two ways of loading a symbol value into
a register: a sequential version that takes 6 consecutive arithmetic
instructions, and a parallel version that takes 4 cycles on superscalar
machines. gcc already has code to use the latter where possible.
However, when accessing the contents of an object (rather than loading
its address), there are also sequential and (slightly) parallel versions,
and gcc doesn't yet generate the parallel one. Adding it has been on my
todo list for a while and is finally implemented with the patch below.
We can now generate:

    lui     op1,%highest(op2)
    lui     op0,%hi(op2)
    daddiu  op1,op1,%higher(op2)
    dsll32  op1,op1,0
    daddu   op1,op1,op0
    ...     access %lo(op2)(op1) ...
Tested by inspecting the output and (for sanity) by bootstrapping &
regression testing on mips-sgi-irix6.5. Applied to mainline.
Richard
* config/mips/mips.md (*lea_high64): Change split condition to
flow2_completed. Add a peephole2 to generate a more parallel version.
Index: config/mips/mips.md
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/mips/mips.md,v
retrieving revision 1.299
diff -c -p -F^\([(a-zA-Z0-9_]\|#define\) -r1.299 mips.md
*** config/mips/mips.md 31 Aug 2004 06:54:42 -0000 1.299
--- config/mips/mips.md 1 Sep 2004 18:28:30 -0000
*************** (define_insn "mov_<store>r"
*** 3103,3114 ****
;; dsll op0,op0,16
;; daddiu op0,op0,%hi(op1)
;; dsll op0,op0,16
(define_insn_and_split "*lea_high64"
[(set (match_operand:DI 0 "register_operand" "=d")
(high:DI (match_operand:DI 1 "general_symbolic_operand" "")))]
"TARGET_EXPLICIT_RELOCS && ABI_HAS_64BIT_SYMBOLS"
"#"
! "&& reload_completed"
[(set (match_dup 0) (high:DI (match_dup 2)))
(set (match_dup 0) (lo_sum:DI (match_dup 0) (match_dup 2)))
(set (match_dup 0) (ashift:DI (match_dup 0) (const_int 16)))
--- 3103,3117 ----
;; dsll op0,op0,16
;; daddiu op0,op0,%hi(op1)
;; dsll op0,op0,16
+ ;;
+ ;; The split is deferred until after flow2 to allow the peephole2 below
+ ;; to take effect.
(define_insn_and_split "*lea_high64"
[(set (match_operand:DI 0 "register_operand" "=d")
(high:DI (match_operand:DI 1 "general_symbolic_operand" "")))]
"TARGET_EXPLICIT_RELOCS && ABI_HAS_64BIT_SYMBOLS"
"#"
! "&& flow2_completed"
[(set (match_dup 0) (high:DI (match_dup 2)))
(set (match_dup 0) (lo_sum:DI (match_dup 0) (match_dup 2)))
(set (match_dup 0) (ashift:DI (match_dup 0) (const_int 16)))
*************** (define_insn_and_split "*lea_high64"
*** 3120,3125 ****
--- 3123,3151 ----
}
[(set_attr "length" "20")])
+ ;; Use a scratch register to reduce the latency of the above pattern
+ ;; on superscalar machines. The optimized sequence is:
+ ;;
+ ;; lui op1,%highest(op2)
+ ;; lui op0,%hi(op2)
+ ;; daddiu op1,op1,%higher(op2)
+ ;; dsll32 op1,op1,0
+ ;; daddu op1,op1,op0
+ (define_peephole2
+ [(match_scratch:DI 0 "d")
+ (set (match_operand:DI 1 "register_operand")
+ (high:DI (match_operand:DI 2 "general_symbolic_operand")))]
+ "TARGET_EXPLICIT_RELOCS && ABI_HAS_64BIT_SYMBOLS"
+ [(set (match_dup 1) (high:DI (match_dup 3)))
+ (set (match_dup 0) (high:DI (match_dup 4)))
+ (set (match_dup 1) (lo_sum:DI (match_dup 1) (match_dup 3)))
+ (set (match_dup 1) (ashift:DI (match_dup 1) (const_int 32)))
+ (set (match_dup 1) (plus:DI (match_dup 1) (match_dup 0)))]
+ {
+ operands[3] = mips_unspec_address (operands[2], SYMBOL_64_HIGH);
+ operands[4] = mips_unspec_address (operands[2], SYMBOL_64_LOW);
+ })
+
;; On most targets, the expansion of (lo_sum (high X) X) for a 64-bit
;; SYMBOL_GENERAL X will take 6 cycles. This next pattern allows combine
;; to merge the HIGH and LO_SUM parts of a move if the HIGH part is only