This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
[committed] optimise MIPS %highest/%higher/%high sequences
- From: Richard Sandiford <rsandifo@redhat.com>
- To: gcc-patches@gcc.gnu.org
- Date: Thu, 02 Sep 2004 19:35:09 +0100
- Subject: [committed] optimise MIPS %highest/%higher/%high sequences
In static n64 code, there are two ways of loading a symbol value into
a register: a sequential version that takes 6 consecutive arithmetic
instructions, and a parallel version that takes 4 cycles on superscalar
machines. gcc already has code to use the latter where possible.
However, when accessing the contents of an object (rather than loading
its address), there are also sequential and (slightly) parallel versions,
and gcc doesn't yet generate the parallel one. Adding it has been on my
todo list for a while and is finally implemented with the patch below.
We can now generate:

    lui     op1,%highest(op2)
    lui     op0,%hi(op2)
    daddiu  op1,op1,%higher(op2)
    dsll32  op1,op1,0
    daddu   op1,op1,op0
    ...     access %lo(op2)(op1) ...
Tested by inspecting the output and (for sanity) by bootstrapping &
regression testing on mips-sgi-irix6.5. Applied to mainline.
Richard
* config/mips/mips.md (*lea_high64): Change split condition to
flow2_completed. Add a peephole2 to generate a more parallel version.
Index: config/mips/mips.md
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/mips/mips.md,v
retrieving revision 1.299
diff -c -p -F^\([(a-zA-Z0-9_]\|#define\) -r1.299 mips.md
*** config/mips/mips.md 31 Aug 2004 06:54:42 -0000 1.299
--- config/mips/mips.md 1 Sep 2004 18:28:30 -0000
*************** (define_insn "mov_<store>r"
*** 3103,3114 ****
;; dsll op0,op0,16
;; daddiu op0,op0,%hi(op1)
;; dsll op0,op0,16
(define_insn_and_split "*lea_high64"
[(set (match_operand:DI 0 "register_operand" "=d")
(high:DI (match_operand:DI 1 "general_symbolic_operand" "")))]
"TARGET_EXPLICIT_RELOCS && ABI_HAS_64BIT_SYMBOLS"
"#"
! "&& reload_completed"
[(set (match_dup 0) (high:DI (match_dup 2)))
(set (match_dup 0) (lo_sum:DI (match_dup 0) (match_dup 2)))
(set (match_dup 0) (ashift:DI (match_dup 0) (const_int 16)))
--- 3103,3117 ----
;; dsll op0,op0,16
;; daddiu op0,op0,%hi(op1)
;; dsll op0,op0,16
+ ;;
+ ;; The split is deferred until after flow2 to allow the peephole2 below
+ ;; to take effect.
(define_insn_and_split "*lea_high64"
[(set (match_operand:DI 0 "register_operand" "=d")
(high:DI (match_operand:DI 1 "general_symbolic_operand" "")))]
"TARGET_EXPLICIT_RELOCS && ABI_HAS_64BIT_SYMBOLS"
"#"
! "&& flow2_completed"
[(set (match_dup 0) (high:DI (match_dup 2)))
(set (match_dup 0) (lo_sum:DI (match_dup 0) (match_dup 2)))
(set (match_dup 0) (ashift:DI (match_dup 0) (const_int 16)))
*************** (define_insn_and_split "*lea_high64"
*** 3120,3125 ****
--- 3123,3151 ----
}
[(set_attr "length" "20")])
+ ;; Use a scratch register to reduce the latency of the above pattern
+ ;; on superscalar machines. The optimized sequence is:
+ ;;
+ ;; lui op1,%highest(op2)
+ ;; lui op0,%hi(op2)
+ ;; daddiu op1,op1,%higher(op2)
+ ;; dsll32 op1,op1,0
+ ;; daddu op1,op1,op0
+ (define_peephole2
+ [(match_scratch:DI 0 "d")
+ (set (match_operand:DI 1 "register_operand")
+ (high:DI (match_operand:DI 2 "general_symbolic_operand")))]
+ "TARGET_EXPLICIT_RELOCS && ABI_HAS_64BIT_SYMBOLS"
+ [(set (match_dup 1) (high:DI (match_dup 3)))
+ (set (match_dup 0) (high:DI (match_dup 4)))
+ (set (match_dup 1) (lo_sum:DI (match_dup 1) (match_dup 3)))
+ (set (match_dup 1) (ashift:DI (match_dup 1) (const_int 32)))
+ (set (match_dup 1) (plus:DI (match_dup 1) (match_dup 0)))]
+ {
+ operands[3] = mips_unspec_address (operands[2], SYMBOL_64_HIGH);
+ operands[4] = mips_unspec_address (operands[2], SYMBOL_64_LOW);
+ })
+
;; On most targets, the expansion of (lo_sum (high X) X) for a 64-bit
;; SYMBOL_GENERAL X will take 6 cycles. This next pattern allows combine
;; to merge the HIGH and LO_SUM parts of a move if the HIGH part is only