This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[PATCH, ARM] Cortex-A8 backend fixes


This patch fixes a few issues in the pipeline description of the ARM Cortex-A8.

1) arm_no_early_alu_shift_value_dep() checks the early dependence only for one operand, ignoring a dependence on the register used as the shift amount. For example, this function is used as the condition of a bypass that sets dep_cost to 0 between MOV and ALU operations:

  mov r0, r1
  add r3, r4, r5, asr r0

This results in dep_cost returning 0 for these insns, while according
to the Technical Reference Manual it should be 1
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babcagee.html).



Also, PLUS and MINUS rtx expressions order their operands differently: PLUS has the shift expression as its first operand, while MINUS usually has the shift as its second operand. But arm_no_early_alu_shift_value_dep() checks only the first operand as EARLY_OP. We changed arm_no_early_alu_shift_dep() to use for_each_rtx() to find the SHIFT expression. Since all registers of the SHIFT expression are required at stage E1, it makes no difference whether the shift is the first or the second operand, so we use the new function instead of arm_no_early_alu_shift_value_dep() in the Cortex-A8 bypasses. The functions arm_no_early_alu_shift_[value_]dep() are also used in the Cortex-A5, Cortex-R4 and ARM1136JFS descriptions, so we named the modified function arm_cortex_a8_no_early_alu_shift_dep().
Besides SHIFTs and ROTATEs, the function also handles MULT (which is used to represent shifts by a constant), as well as ZERO_EXTEND and SIGN_EXTEND (they also have the alu_shift type).
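For illustration, the two operand orders look roughly like this in RTL (a sketch of the canonical forms, with register names taken from the example above rather than real register numbers):

```
;; add r3, r4, r5, asr r0  -- shift is the FIRST operand of PLUS
(set (reg:SI r3) (plus:SI (ashiftrt:SI (reg:SI r5) (reg:SI r0))
                          (reg:SI r4)))

;; sub r3, r4, r5, asr r0  -- shift is the SECOND operand of MINUS
(set (reg:SI r3) (minus:SI (reg:SI r4)
                           (ashiftrt:SI (reg:SI r5) (reg:SI r0))))
```

Because for_each_rtx walks the whole expression, the shift subexpression is found regardless of its position.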


2) The MUL to ALU bypass has an incorrect delay of 4 cycles, while according to the TRM it has to be 5 for MUL and 6 for MULL. The patch splits this bypass in two and sets the correct delay values.

3) In cortex-a8.md, MOV-with-shift instructions are matched to the wrong reservations (cortex_a8_alu_shift, cortex_a8_alu_shift_reg). Adding the "mov" insn attribute to the arm_shiftsi3 pattern in arm.md fixes that.

4) SMLALxy was moved from the cortex_a8_mull reservation to cortex_a8_smlald, which according to the TRM has the proper timing for this insn (1 cycle less than MULL).

5) The ARM Cortex-A8 TRM itself contains inaccurate timings for the availability of RdLo in some multiply instructions. Namely, the lower part of the result of the (S|U)MULL, (S|U)MLAL, UMAAL, SMLALxy, SMLALD and SMLSLD instructions is already available at the E4 stage (instead of E5 as stated in the TRM).

This information was initially found on the BeagleBoard mailing list, and it is confirmed by our tests and by these sites: http://www.avison.me.uk/ben/programming/cortex-a8.html and http://hilbert-space.de/?p=66

The patch adds two bypasses between these instructions and the MOV instruction; they use arm_mull_low_part_dep() to check whether the dependency is only on the low part of the MUL destination. Bypasses between MULL and ALU insns for RdLo can't be added, because bypasses already exist between this pair of reservations. However, in practice these multiply insns are rare, and in SPEC2K INT code the low part of the result of such insns is never used.

--
Best regards,
  Dmitry
2012-02-09  Ruben Buchatskiy  <rb@ispras.ru>

        * config/arm/arm-protos.h (arm_cortex_a8_no_early_alu_shift_dep,
        arm_mull_low_part_dep): Declare.
        * config/arm/arm.c (is_early_op, arm_cortex_a8_no_early_alu_shift_dep,
        arm_mull_low_part_dep): New functions.
        * config/arm/arm.md (arm_shiftsi3): Add "mov" insn attribute.
        * config/arm/cortex-a8.md: Use arm_cortex_a8_no_early_alu_shift_dep
        in bypasses; fix multiply to ALU bypass latencies; move smlalxy to
        cortex_a8_smlald; add MULL/SMLALD to MOV bypasses.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23a29c6..2a1334e 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -97,10 +97,12 @@ extern int neon_struct_mem_operand (rtx);
 extern int arm_no_early_store_addr_dep (rtx, rtx);
 extern int arm_early_store_addr_dep (rtx, rtx);
 extern int arm_early_load_addr_dep (rtx, rtx);
+extern int arm_cortex_a8_no_early_alu_shift_dep (rtx, rtx);
 extern int arm_no_early_alu_shift_dep (rtx, rtx);
 extern int arm_no_early_alu_shift_value_dep (rtx, rtx);
 extern int arm_no_early_mul_dep (rtx, rtx);
 extern int arm_mac_accumulator_is_mul_result (rtx, rtx);
+extern int arm_mull_low_part_dep (rtx, rtx);
 
 extern int tls_mentioned_p (rtx);
 extern int symbol_mentioned_p (rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ee26c51..e92c75b 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -23035,6 +23035,56 @@ arm_early_load_addr_dep (rtx producer, rtx consumer)
   return reg_overlap_mentioned_p (value, addr);
 }
 
+/* Return nonzero and copy *X to *DATA if *X is a shift-like operation.
+   This is a callback for for_each_rtx in arm_cortex_a8_no_early_alu_shift_dep().  */
+
+static int
+is_early_op (rtx *x, void *data)
+{
+  rtx *rtx_data = (rtx *) data;
+  enum rtx_code code;
+  code = GET_CODE (*x);
+
+  if (code == ASHIFT || code == ASHIFTRT || code == LSHIFTRT
+      || code == ROTATERT || code == ROTATE || code == MULT
+      || code == ZERO_EXTEND || code == SIGN_EXTEND)
+    {
+       *rtx_data = *x;
+       return 1;
+    }
+  else
+    return 0;
+}
+
+/* Return nonzero if the CONSUMER instruction (an ALU op) does not
+   have an early register shift value or amount dependency on the
+   result of PRODUCER.  */
+
+int
+arm_cortex_a8_no_early_alu_shift_dep (rtx producer, rtx consumer)
+{
+  rtx value = PATTERN (producer);
+  rtx op = PATTERN (consumer);
+  rtx early_op;
+
+  if (GET_CODE (value) == COND_EXEC)
+    value = COND_EXEC_CODE (value);
+  if (GET_CODE (value) == PARALLEL)
+    value = XVECEXP (value, 0, 0);
+  value = XEXP (value, 0);
+  if (GET_CODE (op) == COND_EXEC)
+    op = COND_EXEC_CODE (op);
+  if (GET_CODE (op) == PARALLEL)
+    op = XVECEXP (op, 0, 0);
+  op = XEXP (op, 1);
+
+  /* Traverse OP looking for a shift, rotate, extend or MULT subexpression.
+     EARLY_OP will hold the whole matching rtx.  */
+  for_each_rtx (&op, is_early_op, &early_op);
+
+  return !reg_overlap_mentioned_p (value, early_op);
+}
+
 /* Return nonzero if the CONSUMER instruction (an ALU op) does not
    have an early register shift value or amount dependency on the
    result of PRODUCER.  */
@@ -23132,6 +23182,42 @@ arm_no_early_mul_dep (rtx producer, rtx consumer)
   return 0;
 }
 
+/* Return nonzero if the CONSUMER (MULL insn) has a dependency only on the low
+   part of PRODUCER's result (RdLo register), which for some insns is available
+   one cycle earlier than its high part.  */
+
+int
+arm_mull_low_part_dep (rtx producer, rtx consumer)
+{
+  rtx value = PATTERN (producer);
+  rtx op = PATTERN (consumer);
+  enum machine_mode mode;
+  int dep = 0;
+
+  if (GET_CODE (value) == COND_EXEC)
+    value = COND_EXEC_CODE (value);
+  if (GET_CODE (value) == PARALLEL)
+    value = XVECEXP (value, 0, 0);
+  value = XEXP (value, 0);
+  if (GET_CODE (op) == COND_EXEC)
+    op = COND_EXEC_CODE (op);
+  if (GET_CODE (op) == PARALLEL)
+    op = XVECEXP (op, 0, 0);
+  op = XEXP (op, 1);
+
+  /* Save the current MODE of VALUE. */
+  mode = GET_MODE (value);
+  if (mode != SImode && mode != VOIDmode)
+    PUT_MODE(value, SImode);
+  if (reg_overlap_mentioned_p (value, op))
+    dep = 1;
+
+  /* Restore the saved MODE. */
+  PUT_MODE(value, mode);
+
+  return dep;
+}
+
 /* We can't rely on the caller doing the proper promotion when
    using APCS or ATPCS.  */
 
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 7ac3f5c..aef0ff5 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -3666,6 +3666,7 @@
   "* return arm_output_shift(operands, 0);"
   [(set_attr "predicable" "yes")
    (set_attr "shift" "1")
+   (set_attr "insn" "mov")
    (set (attr "type") (if_then_else (match_operand 2 "const_int_operand" "")
 		      (const_string "alu_shift")
 		      (const_string "alu_shift_reg")))]
diff --git a/gcc/config/arm/cortex-a8.md b/gcc/config/arm/cortex-a8.md
index 1922e5c..ef0b41b 100644
--- a/gcc/config/arm/cortex-a8.md
+++ b/gcc/config/arm/cortex-a8.md
@@ -117,19 +117,19 @@
 ;; (Such a pair can be issued in parallel, hence latency zero.)
 (define_bypass 0 "cortex_a8_mov" "cortex_a8_alu")
 (define_bypass 0 "cortex_a8_mov" "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 0 "cortex_a8_mov" "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; An ALU instruction followed by an ALU instruction with no early dep.
 (define_bypass 1 "cortex_a8_alu,cortex_a8_alu_shift,cortex_a8_alu_shift_reg"
                "cortex_a8_alu")
 (define_bypass 1 "cortex_a8_alu,cortex_a8_alu_shift,cortex_a8_alu_shift_reg"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 1 "cortex_a8_alu,cortex_a8_alu_shift,cortex_a8_alu_shift_reg"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; Multiplication instructions.  These are categorized according to their
 ;; reservation behavior and the need below to distinguish certain
@@ -149,7 +149,7 @@
 
 (define_insn_reservation "cortex_a8_mull" 7
   (and (eq_attr "tune" "cortexa8")
-       (eq_attr "insn" "smull,umull,smlal,umlal,umaal,smlalxy"))
+       (eq_attr "insn" "smull,umull,smlal,umlal,umaal"))
   "cortex_a8_multiply_3")
 
 (define_insn_reservation "cortex_a8_smulwy" 5
@@ -162,7 +162,7 @@
 ;; cannot go in cortex_a8_mla above.  (See below for bypass details.)
 (define_insn_reservation "cortex_a8_smlald" 6
   (and (eq_attr "tune" "cortexa8")
-       (eq_attr "insn" "smlald,smlsld"))
+       (eq_attr "insn" "smlald,smlsld,smlalxy"))
   "cortex_a8_multiply_2")
 
 ;; A multiply with a single-register result or an MLA, followed by an
@@ -174,17 +174,28 @@
 
 ;; A multiply followed by an ALU instruction needing the multiply
 ;; result only at E2 has lower latency than one needing it at E1.
-(define_bypass 4 "cortex_a8_mul,cortex_a8_mla,cortex_a8_mull,\
-                  cortex_a8_smulwy,cortex_a8_smlald"
+(define_bypass 5 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy,\
+                  cortex_a8_smlald"
+               "cortex_a8_alu")
+(define_bypass 6 "cortex_a8_mull"
                "cortex_a8_alu")
 (define_bypass 4 "cortex_a8_mul,cortex_a8_mla,cortex_a8_mull,\
                   cortex_a8_smulwy,cortex_a8_smlald"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 4 "cortex_a8_mul,cortex_a8_mla,cortex_a8_mull,\
                   cortex_a8_smulwy,cortex_a8_smlald"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
+
+;; A MULL followed by a MOV instruction needing the multiply
+;; result at E1; the low part (RdLo) is available only at stage E4.
+(define_bypass 6 "cortex_a8_mull"
+	       "cortex_a8_mov"
+	       "arm_mull_low_part_dep")
+(define_bypass 5 "cortex_a8_smlald"
+               "cortex_a8_mov"
+               "arm_mull_low_part_dep")
 
 ;; Load instructions.
 ;; The presence of any register writeback is ignored here.
@@ -201,10 +212,10 @@
                "cortex_a8_alu")
 (define_bypass 2 "cortex_a8_load1_2"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 2 "cortex_a8_load1_2"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; We do not currently model the fact that loads with scaled register
 ;; offsets that are not LSL #2 have an extra cycle latency (they issue
@@ -224,10 +235,10 @@
                "cortex_a8_alu")
 (define_bypass 4 "cortex_a8_load3_4"
                "cortex_a8_alu_shift"
-               "arm_no_early_alu_shift_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 (define_bypass 4 "cortex_a8_load3_4"
                "cortex_a8_alu_shift_reg"
-               "arm_no_early_alu_shift_value_dep")
+               "arm_cortex_a8_no_early_alu_shift_dep")
 
 ;; Store instructions.
 ;; Writeback is again ignored.
