[sel-sched] Fix speculation bug in selective scheduler; ia64 tuning

Alexander Monakov amonakov@ispras.ru
Wed Dec 26 14:50:00 GMT 2007


Hi.

The first part of this patch fixes a bug in speculation: when both
control speculation and data speculation are enabled, and we are trying
to move a control-speculative load through a potentially overlapping
store, thus making it also data-speculative, we must emit a branchy data
speculation check in place of the original control-speculative load, and
emit that load in the recovery block.  Before this patch, we simply
asked the target whether it prefers branchy checks, and we would lose
the control-speculativeness of the original load by emitting a simple
check; now we always create a recovery block when the original
instruction was already speculative.  This patch also marks the check
instruction as non-speculative even if it does not check all speculation
types applied to the instruction (so in the example above the check
instruction itself is not speculative, even though it is assumed to
check only the "data" aspect of the control+data speculated load), and
uses the attributes of the original load when emitting the load in the
recovery block, so that the latter gets the correct speculation
attributes.
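
To illustrate, here is a rough hand-written sketch of the code shapes
involved (ia64 assembly; registers and labels are invented for the
example, this is not compiler output):

        // Before motion: a load that is already control-speculative
        // sits below a potentially overlapping store.
                st8     [r34] = r35
                ld8.s   r32 = [r33]        // control-speculative load

        // After motion: the load is hoisted above the store and becomes
        // control+data speculative; a branchy check takes its old place.
                ld8.sa  r32 = [r33]        // control+data spec. load
                st8     [r34] = r35
                chk.a.clr r32, .Lrecovery  // branchy data spec. check
        .Lback:
                ...
        .Lrecovery:
                ld8.s   r32 = [r33]        // reload keeps the control-
                                           // speculative attribute
                br      .Lback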

The second part of the patch simply removes attempts to print liveness
information in scheduler dumps when it is not available (yet|already).

The third part of the patch adds a new heuristic to the ia64 back-end:
avoiding more than a specified number of memory accesses per cycle.
Itanium 2 is very sensitive to cache bank conflicts, so restricting the
number of memory accesses per cycle often helps to reduce cache latency.
Two new target flags are introduced: one specifies the maximum number of
memory accesses per cycle, the other specifies whether that limit is
"soft" (extra loads are presented to the scheduler as having low
priority) or "hard" (extra loads require a cycle advance to be issued).
The default behaviour is a soft limit of one memory access per cycle.
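
For example, to allow two memory insns per cycle, either as a soft
preference or as a hard limit (illustrative command lines, assuming a
gcc configured for ia64 with this patch applied):

    gcc -O2 -msched-max-memory-insns=2 foo.c
    gcc -O2 -msched-max-memory-insns=2 \
        -msched-max-memory-insns-hard-limit foo.c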

The fourth part of the patch removes a strange Itanium 2 bypass
specifying a 7-cycle delay between floating-point arithmetic
instructions and stores.  Intel's manuals do not mention it, and
benchmarks do not justify it either.

Committed by Andrey to the sel-sched branch.

2007-12-26  Alexander Monakov  <amonakov@ispras.ru>

	* sel-sched.c (create_speculation_check): Always create a recovery
	block when original instruction was speculative.  Make check
	instruction non-speculative regardless of which speculation types are
	checked.  Use attributes of original instruction when creating a load
	in the recovery block.
	(selective_scheduling_run): Do not try to print liveness information
	in before-init and after-finish dumps.
	* config/ia64/ia64.opt (msched-max-memory-insns,
	msched-max-memory-insns-hard-limit): New target flags.
	* config/ia64/ia64.c (mem_ops_in_group, current_cycle): New variables.
	(ia64_sched_init): Initialize them.
	(is_load_p, record_memory_reference): New functions.
	(ia64_dfa_sched_reorder): Lower priority of loads when limit is
	reached.
	(ia64_variable_issue): Note if a load or a store is issued.
	(ia64_first_cycle_multipass_dfa_lookahead_guard_spec): Require a cycle
	advance if the maximum number of loads or stores was issued on the
	current cycle.
	(ia64_dfa_new_cycle): Update current_cycle and mem_ops_in_group.
	(struct _ia64_sched_context): New members.
	(ia64_init_sched_context): Initialize them.
	(ia64_set_sched_context): Copy them.
	* config/ia64/itanium2.md: Remove strange bypass.

Index: gcc/sel-sched.c
===================================================================
--- gcc/sel-sched.c	(revision 131180)
+++ gcc/sel-sched.c	(working copy)
@@ -1462,8 +1462,10 @@ create_speculation_check (expr_t c_rhs,

    sel_dump_cfg ("before-gen-spec-check");

-  /* Create a recovery block if target is going to emit branchy check.  */
-  if (targetm.sched.needs_block_p (check_ds))
+  /* Create a recovery block if target is going to emit branchy check, or if
+     ORIG_INSN was speculative already.  */
+  if (targetm.sched.needs_block_p (check_ds)
+      || EXPR_SPEC_DONE_DS (INSN_EXPR (orig_insn)) != 0)
      {
        recovery_block = sel_create_recovery_block (orig_insn);
        label = BB_HEAD (recovery_block);
@@ -1487,7 +1489,7 @@ create_speculation_check (expr_t c_rhs,
  				      INSN_SEQNO (orig_insn), orig_insn);

    /* Make check to be non-speculative.  */
-  EXPR_SPEC_DONE_DS (INSN_EXPR (insn)) &= ~check_ds;
+  EXPR_SPEC_DONE_DS (INSN_EXPR (insn)) = 0;
    INSN_SPEC_CHECKED_DS (insn) = check_ds;

    if (recovery_block != NULL)
@@ -1499,7 +1501,8 @@ create_speculation_check (expr_t c_rhs,

        twin_rtx = copy_rtx (PATTERN (EXPR_INSN_RTX (c_rhs)));
        twin_rtx = create_insn_rtx_from_pattern (twin_rtx, NULL_RTX);
-      twin = sel_gen_recovery_insn_from_rtx_after (twin_rtx, INSN_EXPR (insn),
+      twin = sel_gen_recovery_insn_from_rtx_after (twin_rtx,
+						   INSN_EXPR (orig_insn),
  						   INSN_SEQNO (insn),
  						   bb_note (recovery_block));
      }
@@ -5871,8 +5874,7 @@ selective_scheduling_run (void)
    setup_dump_cfg_params (false);

    sel_dump_cfg_1 ("before-init",
-		  (SEL_DUMP_CFG_BB_INSNS | SEL_DUMP_CFG_FUNCTION_NAME
-		   | SEL_DUMP_CFG_BB_LIVE));
+		  (SEL_DUMP_CFG_BB_INSNS | SEL_DUMP_CFG_FUNCTION_NAME));

    sel_global_init ();

@@ -5903,8 +5905,7 @@ selective_scheduling_run (void)
    sel_global_finish ();

    sel_dump_cfg_1 ("after-finish",
-		  (SEL_DUMP_CFG_BB_INSNS | SEL_DUMP_CFG_FUNCTION_NAME
-		   | SEL_DUMP_CFG_BB_LIVE));
+		  (SEL_DUMP_CFG_BB_INSNS | SEL_DUMP_CFG_FUNCTION_NAME));

    sched_verbose_param = old_sched_verbose_param;
  }
Index: gcc/config/ia64/ia64.opt
===================================================================
--- gcc/config/ia64/ia64.opt	(revision 131180)
+++ gcc/config/ia64/ia64.opt	(working copy)
@@ -152,6 +152,14 @@ msched-fp-mem-deps-zero-cost
  Target Report Var(mflag_sched_fp_mem_deps_zero_cost) Init(1)
  Assume that floating-point stores and loads are not likely to cause conflict when placed into one instruction group

+msched-max-memory-insns=
+Target RejectNegative Joined UInteger Var(ia64_max_memory_insns) Init(1)
+Limit on number of memory insns per instruction group. Frequently useful to prevent cache bank conflicts. Default value is 1
+
+msched-max-memory-insns-hard-limit
+Target Report Var(mflag_sched_mem_insns_hard_limit) Init(0)
+Disallow more than `msched-max-memory-insns' in instruction group. Otherwise, limit is `soft' (prefer non-memory operations when limit is reached)
+
  msel-sched-renaming
  Target Report Var(mflag_sel_sched_renaming) Init(1)
  Do register renaming in selective scheduling
Index: gcc/config/ia64/ia64.c
===================================================================
--- gcc/config/ia64/ia64.c	(revision 131180)
+++ gcc/config/ia64/ia64.c	(working copy)
@@ -6276,6 +6276,12 @@ static int *add_cycles;
  /* The following variable value is number of data speculations in progress.  */
  static int pending_data_specs = 0;

+/* Number of memory references on current and three future processor cycles.  */
+static char mem_ops_in_group[4];
+
+/* Number of current processor cycle (from scheduler's point of view).  */
+static int current_cycle;
+
  static rtx ia64_single_set (rtx);
  static void ia64_emit_insn_before (rtx, rtx);

@@ -6449,6 +6455,9 @@ ia64_sched_init (FILE *dump ATTRIBUTE_UN
  #endif
    last_scheduled_insn = NULL_RTX;
    init_insn_group_barriers ();
+
+  current_cycle = 0;
+  memset (mem_ops_in_group, 0, sizeof (mem_ops_in_group));
  }

  /* We're beginning a scheduling pass.  Check assertion.  */
@@ -6469,12 +6478,46 @@ ia64_sched_finish_global (FILE *dump ATT
    gcc_assert (pending_data_specs == 0);
  }

+/* Return TRUE if INSN is a load (either normal or speculative, but not a
+   speculation check), FALSE otherwise.  */
+static bool
+is_load_p (rtx insn)
+{
+  enum attr_itanium_class insn_class = ia64_safe_itanium_class (insn);
+
+  return
+   ((insn_class == ITANIUM_CLASS_LD || insn_class == ITANIUM_CLASS_FLD)
+    && get_attr_check_load (insn) == CHECK_LOAD_NO);
+}
+
+/* If INSN is a memory reference, memoize it in MEM_OPS_IN_GROUP global array
+   (taking account for 3-cycle cache reference postponing for stores: Intel
+   Itanium 2 Reference Manual for Software Development and Optimization,
+   6.7.3.1).  */
+static void
+record_memory_reference (rtx insn)
+{
+  enum attr_itanium_class insn_class = ia64_safe_itanium_class (insn);
+
+  switch (insn_class) {
+    case ITANIUM_CLASS_FLD:
+    case ITANIUM_CLASS_LD:
+      mem_ops_in_group[current_cycle % 4]++;
+      break;
+    case ITANIUM_CLASS_STF:
+    case ITANIUM_CLASS_ST:
+      mem_ops_in_group[(current_cycle + 3) % 4]++;
+      break;
+    default:;
+  }
+}
+
  /* We are about to being issuing insns for this clock cycle.
     Override the default sort algorithm to better slot instructions.  */

  static int
  ia64_dfa_sched_reorder (FILE *dump, int sched_verbose, rtx *ready,
-			int *pn_ready, int clock_var ATTRIBUTE_UNUSED,
+			int *pn_ready, int clock_var,
  			int reorder_type)
  {
    int n_asms;
@@ -6554,6 +6597,27 @@ ia64_dfa_sched_reorder (FILE *dump, int
        ready += deleted;
      }

+  current_cycle = clock_var;
+  if (reload_completed && mem_ops_in_group[clock_var % 4] >= ia64_max_memory_insns)
+    {
+      int moved = 0;
+
+      insnp = e_ready;
+      /* Move down loads/stores, preserving relative order.  */
+      while (insnp-- > ready + moved)
+	while (insnp >= ready + moved)
+	  {
+	    rtx insn = *insnp;
+	    if (! is_load_p (insn))
+	      break;
+	    memmove (ready + 1, ready, (insnp - ready) * sizeof (rtx));
+	    *ready = insn;
+	    moved++;
+	  }
+      n_ready -= moved;
+      ready += moved;
+    }
+
    return 1;
  }

@@ -6612,6 +6676,8 @@ ia64_variable_issue (FILE *dump ATTRIBUT
  	init_insn_group_barriers ();
        stops_p [INSN_UID (insn)] = stop_before_p;
        stop_before_p = 0;
+
+      record_memory_reference (insn);
      }
    return 1;
  }
@@ -6625,7 +6691,10 @@ ia64_first_cycle_multipass_dfa_lookahead
    gcc_assert (insn && INSN_P (insn));
    return ((!reload_completed
  	   || !safe_group_barrier_needed (insn))
-	  && ia64_first_cycle_multipass_dfa_lookahead_guard_spec (insn));
+         && ia64_first_cycle_multipass_dfa_lookahead_guard_spec (insn)
+         && (!mflag_sched_mem_insns_hard_limit
+             || !is_load_p (insn)
+             || mem_ops_in_group[current_cycle % 4] < ia64_max_memory_insns));
  }

  /* We are choosing insn from the ready queue.  Return nonzero if INSN
@@ -6701,6 +6770,8 @@ ia64_dfa_new_cycle (FILE *dump, int verb
  		 last_clock == clock ? " + cycle advance" : "");

        stop_before_p = 1;
+      current_cycle = clock;
+      mem_ops_in_group[current_cycle % 4] = 0;

        if (last_clock == clock)
  	{
@@ -6800,6 +6871,9 @@ struct _ia64_sched_context
    struct reg_write_state rws_sum[NUM_REGS];
    struct reg_write_state rws_insn[NUM_REGS];
    int first_instruction;
+  int pending_data_specs;
+  int current_cycle;
+  char mem_ops_in_group[4];
  };
  typedef struct _ia64_sched_context *ia64_sched_context_t;

@@ -6825,6 +6899,9 @@ ia64_init_sched_context (void *_sc, bool
        memset (sc->rws_sum, 0, sizeof (rws_sum));
        memset (sc->rws_insn, 0, sizeof (rws_insn));
        sc->first_instruction = 1;
+      sc->pending_data_specs = 0;
+      sc->current_cycle = 0;
+      memset (sc->mem_ops_in_group, 0, sizeof (mem_ops_in_group));
      }
    else
      {
@@ -6833,6 +6910,9 @@ ia64_init_sched_context (void *_sc, bool
        memcpy (sc->rws_sum, rws_sum, sizeof (rws_sum));
        memcpy (sc->rws_insn, rws_insn, sizeof (rws_insn));
        sc->first_instruction = first_instruction;
+      sc->pending_data_specs = pending_data_specs;
+      sc->current_cycle = current_cycle;
+      memcpy (sc->mem_ops_in_group, mem_ops_in_group, sizeof (mem_ops_in_group));
      }
  }

@@ -6849,6 +6929,9 @@ ia64_set_sched_context (void *_sc)
    memcpy (rws_sum, sc->rws_sum, sizeof (rws_sum));
    memcpy (rws_insn, sc->rws_insn, sizeof (rws_insn));
    first_instruction = sc->first_instruction;
+  pending_data_specs = sc->pending_data_specs;
+  current_cycle = sc->current_cycle;
+  memcpy (mem_ops_in_group, sc->mem_ops_in_group, sizeof (mem_ops_in_group));
  }

  /* Clears the data in the _SC scheduling context.  */
Index: gcc/config/ia64/itanium2.md
===================================================================
--- gcc/config/ia64/itanium2.md	(revision 131180)
+++ gcc/config/ia64/itanium2.md	(working copy)
@@ -1072,7 +1072,6 @@ (define_bypass  3 "2_ilog" "2_mmalua,2_m
  (define_bypass  3 "2_ialu" "2_mmalua,2_mmmul,2_mmshf")
  (define_bypass  3 "2_mmalua,2_mmmul,2_mmshf" "2_ialu,2_ilog,2_ishf,2_st,2_ld,2_ldc")
  (define_bypass  6 "2_tofr"  "2_frfr,2_stf")
-(define_bypass  7 "2_fmac"  "2_frfr,2_stf")

  ;; We don't use here fcmp because scall may be predicated.
  (define_bypass  0 "2_fcvtfx,2_fld,2_flda,2_fldc,2_fmac,2_fmisc,2_frar_i,2_frar_m,\


