Unroll and Jam loop transformation

Michael Matz matz@suse.de
Mon Nov 14 04:41:00 GMT 2016


Hi,

I've been working on this for some time; I have various extensions to it 
in the works (hinted at in the comments), but haven't yet stabilized them.  
But this is useful on its own as is, so let's get it out before midnight 
at UTC-12 (hey, that's still 7 hours to go) ;)  Basically, unroll-and-jam 
deals with loops like this:

  unsigned long i, j;
  for (i = 0; i < m; i++) {
      for (j = 0; j < n; j++) {
          a[j*m+i] = b[j*m+i] + b[j*m+i+1];
      }
  }

Ignore the fact that this nest could also be interchanged to improve 
performance; note that there's data reuse going on for the 'i' index of 
the outer loop (the [i+1] load will be the [i] load in the next 
iteration).  Unroll-and-jam deals with this by effectively unrolling the 
outer loop and then fusing the now adjacent copies of the inner loop:

  for (i = 0; i < m; i+=2) {
      for (j = 0; j < n; j++) {
          a[j*m+i] = b[j*m+i] + b[j*m+i+1];
          a[j*m+i+1] = b[j*m+i+1] + b[j*m+i+1+1];
      }
  }

Now the two b[...i+1] loads are trivial to see and reuse.  Obviously this 
is only valid in certain circumstances.
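
To make the legality issue concrete, here is a small sketch (mirroring 
the invalid-fusion example in the comments of the patch below) where 
fusing changes semantics.  With B(i,j) being x[j-1] = x[j], the 
unrolled-but-unfused form shifts the array contents forward by two 
elements per outer iteration:

  for (i = 0; i < m; i += 2) {
      for (j = 1; j < n; j++)
          x[j-1] = x[j];    /* first inner copy */
      for (j = 1; j < n; j++)
          x[j-1] = x[j];    /* second inner copy */
  }

whereas after fusion x[j-1] = x[j] runs twice in a row for the same j, 
where the second execution is a no-op, so the contents only move 
forward by one element.  This is exactly the kind of backward-carried 
reuse the validity checks below have to rule out.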

So, see the patch.  For now I've tacked it onto the loop splitting pass 
(when no splitting happens, unroll-and-jam is tried).  Probably I should 
then also rename the pass (suggestions?).  The changes in the chrec and 
data-ref routines are necessary to fix some ICEs in the testsuite that 
already occur when simply running the existing data-dep analysis the way 
the patch does (i.e. on a two-loop nest), without any transformation.
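
As an illustration of the ICE situation, here is a hypothetical 
reduction (not a testcase from the patch): the example from above with 
the arrays replaced by pointer parameters.  Analyzing the dependence 
between the two b loads over the two-loop nest ends up subtracting two 
pointer-typed chrecs in analyze_miv_subscript; the tree-chrec.c and 
tree-data-ref.c hunks below accept that fold and use sizetype for the 
difference:

  void f (unsigned long m, unsigned long n, double *a, double *b)
  {
    unsigned long i, j;
    for (i = 0; i < m; i++)
      for (j = 0; j < n; j++)
        a[j*m+i] = b[j*m+i] + b[j*m+i+1];
  }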

The patch was tested with regstrapping on x86-64-linux with a change to 
force it active at -O2 already (so that the bootstrap exercises it).  
As proposed it would be enabled at -O3.

There are no regressions except some changes in expected dump output: 
the unrolling sometimes creates an epilogue loop that can be vectorized 
whenever the fused loop can, so the number of vectorized loops (or 
similar dump info) doubles, i.e. some testsuite churn.  Should I 
disable this pass for those testcases or adjust the expected output?

I'd like to get this into GCC 7 if possible (and will continue to work on 
the further improvements).

Comments?  Okay for trunk, with the nits fixed one way or the other? :)


Ciao,
Michael.
	* tree-chrec.c (chrec_fold_plus_poly_poly): Accept subtracting
	two pointers.
	* tree-data-ref.c (analyze_miv_subscript): When subtracting two
	pointers, use sizetype as the target type.

	* tree-ssa-loop-split.c (fix_loop_structure): New function.
	(bb_prevents_fusion_p): Ditto.
	(unroll_jam_possible_p): Ditto.
	(fuse_loops): Ditto.
	(determine_reuse): Ditto.
	(reverse_list): Ditto.
	(tree_loop_unroll_and_jam): Ditto.
	(pass_loop_split::execute): Call new pass.

diff --git a/gcc/tree-chrec.c b/gcc/tree-chrec.c
index cefac2c..6b900f2 100644
--- a/gcc/tree-chrec.c
+++ b/gcc/tree-chrec.c
@@ -106,8 +106,11 @@ chrec_fold_plus_poly_poly (enum tree_code code,
   gcc_assert (TREE_CODE (poly0) == POLYNOMIAL_CHREC);
   gcc_assert (TREE_CODE (poly1) == POLYNOMIAL_CHREC);
   if (POINTER_TYPE_P (chrec_type (poly0)))
-    gcc_checking_assert (ptrofftype_p (chrec_type (poly1))
-			 && useless_type_conversion_p (type, chrec_type (poly0)));
+    gcc_checking_assert ((POINTER_TYPE_P (chrec_type (poly1))
+			  && code == MINUS_EXPR)
+			 || (ptrofftype_p (chrec_type (poly1))
+			     && useless_type_conversion_p (type,
+							   chrec_type (poly0))));
   else
     gcc_checking_assert (useless_type_conversion_p (type, chrec_type (poly0))
 			 && useless_type_conversion_p (type, chrec_type (poly1)));
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 8152da3..85454ad 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -2932,9 +2932,15 @@ analyze_miv_subscript (tree chrec_a,
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "(analyze_miv_subscript \n");
 
-  type = signed_type_for_types (TREE_TYPE (chrec_a), TREE_TYPE (chrec_b));
-  chrec_a = chrec_convert (type, chrec_a, NULL);
-  chrec_b = chrec_convert (type, chrec_b, NULL);
+  type = TREE_TYPE (chrec_a);
+  if (type != TREE_TYPE (chrec_b))
+    {
+      type = signed_type_for_types (TREE_TYPE (chrec_a), TREE_TYPE (chrec_b));
+      chrec_a = chrec_convert (type, chrec_a, NULL);
+      chrec_b = chrec_convert (type, chrec_b, NULL);
+    }
+  else if (POINTER_TYPE_P (type))
+    type = sizetype;
   difference = chrec_fold_minus (type, chrec_a, chrec_b);
 
   if (eq_evolutions_p (chrec_a, chrec_b))
diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
index dac68e6..118a203 100644
--- a/gcc/tree-ssa-loop-split.c
+++ b/gcc/tree-ssa-loop-split.c
@@ -39,6 +39,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "gimple-fold.h"
 #include "gimplify-me.h"
+#include "cfgcleanup.h"
+#include "tree-data-ref.h"
+#include "tree-ssa-loop-ivopts.h"
 
 /* This file implements loop splitting, i.e. transformation of loops like
 
@@ -666,6 +669,459 @@ tree_ssa_split_loops (void)
   return 0;
 }
 
+/* Unroll and Jam
+
+   This is a combination of two transformations, where the second
+   is not always valid.  It's applicable if a loop nest has redundancies
+   over the iterations of an outer loop but not over those of the
+   inner loop.
+
+   Given this nest:
+       for (i) {
+	 for (j) {
+	   B(i,j)
+	 }
+       }
+
+   first unroll:
+       for (i by 2) {
+	 for (j) {
+	   B(i,j)
+	 }
+	 for (j) {
+	   B(i+1,j)
+	 }
+       }
+
+   then fuse the two adjacent inner loops resulting from that:
+       for (i by 2) {
+	 for (j) {
+	   B(i,j)
+	   B(i+1,j)
+	 }
+       }
+
+   As the order of evaluations of the body B changes, this is valid
+   only in certain situations: all distance vectors need to be forward.
+   Additionally, if there are multiple induction variables beyond just
+   a counting control IV (j above), we can deal with some situations.
+
+   The validity is checked by unroll_jam_possible_p, and the data-dep
+   testing below.
+
+   A trivial example where the fusion is wrong would be when
+   B(i,j) == x[j-1] = x[j];
+       for (i by 2) {
+	 for (j) {
+	   x[j-1] = x[j];
+	 }
+	 for (j) {
+	   x[j-1] = x[j];
+	 }
+       }  effect: move content in front by two elements
+       -->
+       for (i by 2) {
+	 for (j) {
+	   x[j-1] = x[j];
+	   x[j-1] = x[j];
+	 }
+       }  effect: move content to front by one element
+*/
+
+/* Modify the loop tree to reflect that all code that once belonged
+   to the OLD loop or to the outer loop of OLD is now inside LOOP.  */
+
+static void
+fix_loop_structure (struct loop *loop, struct loop *old)
+{
+  basic_block *bbs;
+  int i, n;
+  struct loop *subloop;
+  edge e;
+  edge_iterator ei;
+
+  /* Find its nodes.  */
+  bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun));
+  n = get_loop_body_with_size (loop, bbs, n_basic_blocks_for_fn (cfun));
+
+  for (i = 0; i < n; i++)
+    {
+      /* If the block was a direct child of the OLD loop it's now
+         part of LOOP.  If it was outside OLD, then it moved into
+	 LOOP as well.  This avoids changing the loop father for BBs
+	 in inner loops of OLD.  */
+      if (bbs[i]->loop_father == old
+	  || loop_depth (bbs[i]->loop_father) < loop_depth (old))
+	{
+	  remove_bb_from_loops (bbs[i]);
+	  add_bb_to_loop (bbs[i], loop);
+	  continue;
+	}
+
+      /* If we find a direct subloop of OLD, move it to LOOP.  */
+      subloop = bbs[i]->loop_father;
+      if (loop_outer (subloop) == old && subloop->header == bbs[i])
+	{
+	  flow_loop_tree_node_remove (subloop);
+	  flow_loop_tree_node_add (loop, subloop);
+	}
+    }
+
+  /* Update the information about loop exit edges.  */
+  for (i = 0; i < n; i++)
+    {
+      FOR_EACH_EDGE (e, ei, bbs[i]->succs)
+	{
+	  rescan_loop_exit (e, false, false);
+	}
+    }
+
+  loop->num_nodes = n;
+
+  free (bbs);
+}
+
+/* BB exits the outer loop of an unroll-and-jam situation.
+   Check if any statements therein would prevent the transformation.  */
+
+static bool
+bb_prevents_fusion_p (basic_block bb)
+{
+  gimple_stmt_iterator gsi;
+  /* BB is duplicated by outer unrolling and then the first N-1 copies
+     move into the body of the fused inner loop.  The last copy remains
+     the exit block of the outer loop and is still outside the inner loop
+     also after fusion.  Not all effects of BB can be allowed for this:
+       * stores or unknown side-effects prevent fusion
+       * loads don't
+       * computations into SSA names: these aren't problematic.  Their
+         result will be unused on the exit edges of the first N-1 copies
+	 (those aren't taken after unrolling).  If they are used on the
+	 other edge (the one leading to the outer latch block) they are
+	 loop-carried (on the outer loop) and the Nth copy of BB will
+	 compute them again (i.e. the first N-1 copies will be dead).  */
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *g = gsi_stmt (gsi);
+      if (gimple_vdef (g))
+	return true;
+    }
+  return false;
+}
+
+/* Given an inner loop LOOP (of some OUTER loop) determine if
+   we can safely fuse copies of it (generated by outer unrolling).
+   If so return true, otherwise return false.  */
+
+static bool
+unroll_jam_possible_p (struct loop *outer, struct loop *loop)
+{
+  basic_block *bbs;
+  int i, n;
+
+  /* When fusing the loops we skip the latch block
+     of the first one, so it mustn't have any effects to
+     preserve.  */
+  if (!empty_block_p (loop->latch))
+    return false;
+
+  if (!single_exit (loop))
+    return false;
+
+  /* We need a perfect nest.  Quick check for adjacent inner loops.  */
+  if (outer->inner != loop || loop->next)
+    return false;
+
+  /* And check the blocks belonging to just the outer loop.  */
+  bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun));
+  n = get_loop_body_with_size (outer, bbs, n_basic_blocks_for_fn (cfun));
+
+  for (i = 0; i < n; i++)
+    {
+      if (bbs[i]->loop_father == outer
+	  && bbs[i] != outer->latch && bbs[i] != outer->header
+	  && (!loop_exits_from_bb_p (outer, bbs[i])
+	      || bb_prevents_fusion_p (bbs[i])))
+	break;
+      /* XXX Note that the above disallows head-controlled inner loops,
+         which we usually have.  The guard block would need to be accepted
+	 (invariant condition either entering or skipping the loop),
+	 without also accepting arbitrary control flow.  When unswitching
+	 ran before us (as with -O3) this won't be a problem because its
+	 outer loop unswitching will have moved out the invariant condition.
+
+	 If we do that we need to extend fuse_loops() to cope with this
+	 by threading through the (still invariant) copied condition
+	 between the two loop copies.  */
+    }
+  free (bbs);
+  if (i != n)
+    return false;
+
+  /* For now we can safely fuse copies of LOOP only if all
+     loop-carried variables are inductions (or the virtual op).
+
+     We could handle reductions as well (the initial value in the second
+     body would be the after-iter value of the first body) if it's over
+     an associative and commutative operation.  We wouldn't
+     be able to handle unknown cycles.  */
+  gphi_iterator psi;
+  for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next (&psi))
+    {
+      affine_iv iv;
+      tree op = gimple_phi_result (psi.phi ());
+
+      if (virtual_operand_p (op))
+	continue;
+      if (!simple_iv (loop, loop, op, &iv, true))
+	return false;
+      /* The induction must be regular: loop-invariant step and
+         initial value.  */
+      if (!expr_invariant_in_loop_p (outer, iv.step)
+	  || !expr_invariant_in_loop_p (outer, iv.base))
+	return false;
+      /* XXX With more effort we could also be able to deal with inductions
+         where the initial value is loop variant but a simple IV in the
+	 outer loop.  The initial value for the second body would be
+	 the original initial value plus iv.base.step.  The next value
+	 for the fused loop would be the original next value of the first
+	 copy, _not_ the next value of the second body.  */
+    }
+
+  return true;
+}
+
+/* Fuse LOOP with all further neighbors.  The loops are expected to
+   be in appropriate form.  */
+
+static void
+fuse_loops (struct loop *loop)
+{
+  struct loop *next = loop->next;
+
+  while (next)
+    {
+      edge e;
+
+      remove_branch (single_pred_edge (loop->latch));
+      /* Make delete_basic_block not fiddle with the loop structure.  */
+      basic_block oldlatch = loop->latch;
+      loop->latch = NULL;
+      delete_basic_block (oldlatch);
+      e = redirect_edge_and_branch (loop_latch_edge (next),
+				    loop->header);
+      loop->latch = e->src;
+      flush_pending_stmts (e);
+
+      gcc_assert (EDGE_COUNT (next->header->preds) == 1);
+
+      /* The PHI nodes of the second body (single-argument now)
+         need adjustments to use the right values: either directly
+	 the value of the corresponding PHI in the first copy or
+	 the value leaving the first body, which unrolling created
+	 for us.
+
+	 See also unroll_jam_possible_p() for further possibilities.  */
+      gphi_iterator psi_first, psi_second;
+      e = single_pred_edge (next->header);
+      for (psi_first = gsi_start_phis (loop->header),
+	   psi_second = gsi_start_phis (next->header);
+	   !gsi_end_p (psi_first);
+	   gsi_next (&psi_first), gsi_next (&psi_second))
+	{
+	  gphi *phi_first = psi_first.phi ();
+	  gphi *phi_second = psi_second.phi ();
+	  tree firstop = gimple_phi_result (phi_first);
+	  /* The virtual operand is already correct as it's
+	     always live at exit, hence has an LCSSA node, and
+	     outer loop unrolling updated SSA form for it.  */
+	  if (virtual_operand_p (firstop))
+	    continue;
+
+	  /* Due to unroll_jam_possible_p() we know that this is
+	     an induction.  The second body goes over the same
+	     iteration space.  */
+	  add_phi_arg (phi_second, firstop, e,
+		       gimple_location (phi_first));
+	}
+      gcc_assert (gsi_end_p (psi_second));
+
+      fix_loop_structure (loop, next);
+      gcc_assert (!next->num_nodes);
+      struct loop *ln = next->next;
+      delete_loop (next);
+      next = ln;
+    }
+  rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_USE, loop);
+}
+
+/* Returns true if the distance in DDR can be determined and stores
+   its value in *DISTP.  Otherwise returns false.  */
+
+static bool
+determine_reuse (struct data_dependence_relation *ddr, int *distp)
+{
+  bool ret = false;
+  if (DDR_ARE_DEPENDENT (ddr) != chrec_known)
+    {
+      if (DDR_NUM_DIST_VECTS (ddr) == 0)
+	return false;
+      unsigned i;
+      lambda_vector dist_v;
+      FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
+	{
+	  /* We're interested in the distance wrt the outer loop.  */
+	  int dist = dist_v[0];
+	  if (!dist)
+	    continue;
+	  if (DDR_REVERSED_P (ddr))
+	    dist = -dist;
+	  *distp = dist;
+	  ret = true;
+	}
+    }
+  return ret;
+}
+
+/* Reverse the ->next list starting at loop L.  No code transformation
+   is done; it's purely internal.  */
+
+static struct loop *
+reverse_list (struct loop *l)
+{
+  struct loop *prev = 0, *next;
+  for (; l; l = next)
+    {
+      next = l->next;
+      l->next = prev;
+      prev = l;
+    }
+  return prev;
+}
+
+/* Main entry point for the unroll-and-jam transformation
+   described above.  */
+
+static unsigned int
+tree_loop_unroll_and_jam (void)
+{
+  struct loop *loop;
+  bool changed = false;
+
+  gcc_assert (scev_initialized_p ());
+
+  /* Go through all innermost loops.  */
+  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
+    {
+      struct loop *outer = loop_outer (loop);
+
+      if (loop_depth (loop) < 2)
+	continue;
+
+      if (!unroll_jam_possible_p (outer, loop))
+	continue;
+
+      vec<data_reference_p> datarefs;
+      vec<ddr_p> dependences;
+      int unroll_factor;
+      struct tree_niter_desc desc;
+      bool unroll = false;
+
+      auto_vec<loop_p, 3> loop_nest;
+      dependences.create (10);
+      datarefs.create (10);
+      if (!compute_data_dependences_for_loop (outer, true, &loop_nest,
+					       &datarefs, &dependences))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Cannot analyze data dependencies\n");
+	  free_data_refs (datarefs);
+	  free_dependence_relations (dependences);
+	  return 0;
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	dump_data_dependence_relations (dump_file, dependences);
+
+      unroll_factor = 2;
+
+      /* Check all dependencies.  */
+      unsigned i;
+      struct data_dependence_relation *ddr;
+      FOR_EACH_VEC_ELT (dependences, i, ddr)
+	{
+	  int distance;
+	  struct data_reference *dra, *drb;
+
+	  /* If the refs are independent there's nothing to do.  */
+	  if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	    continue;
+	  dra = DDR_A (ddr);
+	  drb = DDR_B (ddr);
+	  /* Nothing interesting for self-dependencies.  */
+	  if (dra == drb)
+	    continue;
+
+	  /* Now check the distance vector, both for determining a
+	     sensible outer unroll factor and for the validity of
+	     merging the inner loop copies.  */
+	  if (!determine_reuse (ddr, &distance))
+	    {
+	      /* Couldn't get the distance vector.  For two reads that's
+	         harmless (we assume we should unroll).  For at least
+		 one write this means we can't check the dependence direction
+		 and hence can't determine safety.  */
+
+	      if (DR_IS_WRITE (dra) || DR_IS_WRITE (drb))
+		{
+		  unroll_factor = 0;
+		  break;
+		}
+	    }
+	  else
+	    {
+	      /* A negative distance means we can't fuse the inner loop
+	         copies after outer unrolling.  */
+	      if (distance < 0)
+		{
+		  unroll_factor = 0;
+		  break;
+		}
+	      if (distance <= 4 && unroll_factor < distance)
+		unroll_factor = distance;
+	    }
+	}
+
+      unroll = (unroll_factor > 1
+		&& can_unroll_loop_p (outer, unroll_factor, &desc));
+
+      if (unroll)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Applying unroll and jam with factor %d\n",
+		     unroll_factor);
+	  initialize_original_copy_tables ();
+	  tree_unroll_loop (outer, unroll_factor, single_dom_exit (outer),
+			    &desc);
+	  free_original_copy_tables ();
+	  outer->inner = reverse_list (outer->inner);
+	  fuse_loops (outer->inner);
+	  changed = true;
+	}
+
+      loop_nest.release ();
+      free_dependence_relations (dependences);
+      free_data_refs (datarefs);
+    }
+
+  if (changed)
+    {
+      scev_reset ();
+      free_dominance_info (CDI_DOMINATORS);
+      return TODO_cleanup_cfg;
+    }
+  return 0;
+}
+
 /* Loop splitting pass.  */
 
 namespace {
@@ -699,10 +1155,14 @@ public:
 unsigned int
 pass_loop_split::execute (function *fun)
 {
+  unsigned int ret;
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  return tree_ssa_split_loops ();
+  ret = tree_ssa_split_loops ();
+  if (!ret)
+    ret = tree_loop_unroll_and_jam ();
+  return ret;
 }
 
 } // anon namespace


