[RFC] changes to -fprefetch-loop-arrays

Janis Johnson janis187@us.ibm.com
Fri Jun 7 14:01:00 GMT 2002


I've been trying out several changes to -fprefetch-loop-arrays.  The
patch isn't ready to submit, but I'd appreciate feedback from Honza
about some of the basic changes I'm examining.  All of my performance
testing has been on Itanium.  Some of it is based on toy programs that
helped me find optimum values for some of the PREFETCH parameters.  I'm
also running selected SPEC CPU2000 tests; I don't have the resources to
run them all regularly, but plan to do that before submitting a real
patch, using -fprefetch-loop-arrays with and without -funroll-loops, and
with and without my changes.

Honza, you approach prefetching arrays in loops from the angle of how
much memory is accessed over the lifetime of the loop, and prefetching
everything in a section of an array if the density of accesses within
that section is high enough.  I'm looking at it in terms of what portion
of an array is accessed within each loop iteration, and prefetching only
those cache lines that contain accesses.  I assume that if we know that
we are likely to access an array element then we should prefetch the
cache line that contains it, even if it's only a single byte of that
cache line.  My toy programs show that this is a valid assumption on
Itanium.  Is there a reason on other architectures to prefetch solid
chunks of data, rather than doing sparse prefetches of only those cache
blocks that contain data that will be accessed?  If sparse prefetches
are always worthwhile then the density checks can be removed, along with
the bytes_accessed and total_bytes values for each prefetch entry.  The
current density checks are wrong, by the way; they are based on a size
that is the number of bytes of the address, not the size of the data,
and they count multiple accesses of the same address multiple times.
If we keep the density checks we'll need to reconsider how it is
calculated.

One of the pieces of my patch is more investigation of prefetches within
the same array at the same stride, to merge prefetches that are in the
same or adjacent cache blocks.  This isn't necessary if all prefetches
are solid, as in the current implementation, but it minimizes the number
of sparse prefetches.  This is more likely to be relevant for loops that
access several members of a struct in an array than for arrays that
access multiple elements of a scalar array, although it also seems to
interact better with loop unrolling.

I'm currently using a new parameter, PREFETCH_DISTANCE, to determine how
many loop iterations ahead to prefetch for.  This value is based on
experimentation.  It should be calculated from several factors related
to the target and to the work done in the loop.

The various PREFETCH_* parameters have changed yet again.  I expect them
to continue to mutate for a while longer, but when they're stable I'll
document them properly in tm.texi, either in the "Miscellaneous
Parameters" section or in a new "Data Prefetch Parameters" section.  The
real patch will put my ia64 target values into ia64.h.

The current version of this patch is not very efficient, and I plan to
clean it up before submitting it.  I'll try to avoid allocating and
freeing lots of small blocks for each loop, but the current version has
been easier to play with.  Some of the loops that go through the
prefetches can be combined; again, this has made it easier to try out
changes.

Limited testing with SPEC CPU2000 doesn't show much difference except
in the FP tests 171.swim and 179.art, where the new results are quite
impressive.  I was planning to show comparative results (I can't give
out real numbers for the Itanium system I use), but I've made enough
dumb benchmarking mistakes in the past that I'll verify the numbers
carefully before sharing them.

We should be able to improve performance in general by playing with the
target prefetch parameters and by coming up with a good heuristic for
determining the prefetch distance.

Janis

P.S.  World Juggling Day is June 15.  Find an event near you at
http://wjd.juggle.org/02events.php.  There's one listed for Prague!


Index: loop.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/loop.c,v
retrieving revision 1.406
diff -u -p -r1.406 loop.c
--- loop.c	2 Jun 2002 21:09:34 -0000	1.406
+++ loop.c	7 Jun 2002 20:19:17 -0000
@@ -87,7 +87,17 @@ Software Foundation, 59 Temple Place - S
    easily for performance testing on new architecures.  These can be
    defined in target-dependent files.  */
 
-/* Prefetch is worthwhile only when loads/stores are dense.  */
+/* Temporary values that Janis is using for testing on ia64; these will
+   be moved to target-specific files for the real patch.  */
+#define PREFETCH_ONLY_DENSE_MEM 0
+#define PREFETCH_LOW_LOOPCNT 8
+#define PREFETCH_EXTREME_STRIDE 64
+#define PREFETCH_EXTREME_DIFFERENCE 128
+#define PREFETCH_CONST_DISTANCE 1
+#define PREFETCH_DISTANCE 12
+
+/* Prefetch is worthwhile only when loads/stores are dense.  Setting this to
+   one also disables the use of sparse prefetches.  */
 #ifndef PREFETCH_ONLY_DENSE_MEM
 #define PREFETCH_ONLY_DENSE_MEM 1
 #endif
@@ -108,10 +118,22 @@ Software Foundation, 59 Temple Place - S
 #define PREFETCH_LOW_LOOPCNT 32
 #endif
 
+/* Do not prefetch before a loop whose iteration count is constant and is
+   more than this value.  If this is zero then we never prefetch before
+   a loop.  */
+#ifndef PREFETCH_HIGH_LOOPCNT
+#define PREFETCH_HIGH_LOOPCNT 128
+#endif
+
 /* Do not prefetch for a loop that contains a function call; such a loop is
    probably not an internal loop.  */
 #ifndef PREFETCH_NO_CALL
-#define PREFETCH_NO_CALL 1
+#define PREFETCH_NO_CALL 0
+#endif
+
+/* Do not prefetch for a loop that contains a const function call.  */
+#ifndef PREFETCH_NO_NONCONST_CALL
+#define PREFETCH_NO_NONCONST_CALL 1
 #endif
 
 /* Do not prefetch accesses with an extreme stride.  */
@@ -131,12 +153,15 @@ Software Foundation, 59 Temple Place - S
 #endif
 
 /* Issue prefetch instructions before the loop to fetch data to be used
-   in the first few loop iterations.  */
+   in the first few loop iterations.
+   ??? This isn't currently used; turn off prefetch before a loop by
+   defining PREFETCH_HIGH_LOOPCNT to zero.  */
 #ifndef PREFETCH_BEFORE_LOOP
 #define PREFETCH_BEFORE_LOOP 1
 #endif
 
-/* Do not handle reversed order prefetches (negative stride).  */
+/* Do not handle reversed order prefetches (negative stride).
+   ??? Negative strides aren't yet handled correctly, so don't change this.  */
 #ifndef PREFETCH_NO_REVERSE_ORDER
 #define PREFETCH_NO_REVERSE_ORDER 1
 #endif
@@ -146,8 +171,20 @@ Software Foundation, 59 Temple Place - S
 #define PREFETCH_CONDITIONAL 1
 #endif
 
+/* Use a constant value for the prefetch distance (number of iterations
+   ahead for which to prefetch data) rather than computing it.  */
+#ifndef PREFETCH_CONST_DISTANCE
+#define PREFETCH_CONST_DISTANCE 0
+#endif
+
+/* Define the constant prefetch distance.  */
+#ifndef PREFETCH_DISTANCE
+#define PREFETCH_DISTANCE 12
+#endif
+
 /* If the loop requires more prefetches than the target can process in
-   parallel then don't prefetch anything in that loop.  */
+   parallel then don't prefetch anything in that loop.
+   ??? This isn't currently used.  */
 #ifndef PREFETCH_LIMIT_TO_SIMULTANEOUS
 #define PREFETCH_LIMIT_TO_SIMULTANEOUS 1
 #endif
@@ -3564,26 +3601,43 @@ loop_reg_used_before_p (loop, set, insn)
   return 0;
 }
 
-
-/* Information we collect about arrays that we might want to prefetch.  */
-struct prefetch_info
-{
+/* Information about an array access that we might want to prefetch.  */
+struct prefetch_acc_info {
   struct iv_class *class;	/* Class this prefetch is based on.  */
   struct induction *giv;	/* GIV this prefetch is based on.  */
-  rtx base_address;		/* Start prefetching from this address plus
-				   index.  */
-  HOST_WIDE_INT index;
-  HOST_WIDE_INT stride;		/* Prefetch stride in bytes in each
-				   iteration.  */
-  unsigned int bytes_accessed;	/* Sum of sizes of all acceses to this
-				   prefetch area in one iteration.  */
-  unsigned int total_bytes;	/* Total bytes loop will access in this block.
+  struct prefetch_acc_info *next;
+  				/* Entry for next access for this array.  */
+  HOST_WIDE_INT index;		/* Start prefetching from this index plus
+				   the base address stored for the array.  */
+  unsigned int bytes_accessed;	/* Total bytes loop will access in this block.
+				   This is set only for loops with known
+				   iteration counts and is 0xfffffffff
+				   otherwise.  */
+  unsigned int total_bytes;	/* Total bytes loop will prefetch in this block.
 				   This is set only for loops with known
 				   iteration counts and is 0xffffffff
 				   otherwise.  */
-  int prefetch_in_loop;		/* Number of prefetch insns in loop.  */
+  int prefetch_in_loop;		/* Number of prefetch insns within loop.  */
   int prefetch_before_loop;	/* Number of prefetch insns before loop.  */
-  unsigned int write : 1;	/* 1 for read/write prefetches.  */
+  int write;			/* 1 for read/write prefetches.  */
+  int solid;			/* 1 if access touches all consecutive cache
+				   blocks.  */
+  int next_line;		/* 1 if the next prefetch is for the next
+				   cache line.  */
+};
+
+/* Information we collect about an array and stride that we might want
+   to prefetch.  */
+struct prefetch_arr_info {
+  rtx base_address;		/* Start prefetching from this address plus
+				   the index for an access.  */
+  HOST_WIDE_INT stride;		/* Absolute value of prefetch stride in bytes
+				   for each iteration.  */
+  int stride_sign;		/* 1 for positive stride, -1 for negative.  */
+  struct prefetch_arr_info *next;
+  				/* Pointer to the next array in the list.  */
+  struct prefetch_acc_info *first_access;
+  				/* First access for this array and stride.  */
 };
 
 /* Data used by check_store function.  */
@@ -3596,6 +3650,15 @@ struct check_store_data
 static void check_store PARAMS ((rtx, rtx, void *));
 static void emit_prefetch_instructions PARAMS ((struct loop *));
 static int rtx_equal_for_prefetch_p PARAMS ((rtx, rtx));
+static struct prefetch_arr_info * prefetch_array_info
+	PARAMS ((struct prefetch_arr_info **, HOST_WIDE_INT, int, rtx));
+static void prefetch_access_add PARAMS ((struct prefetch_arr_info *,
+      					 HOST_WIDE_INT, unsigned int,
+					 struct iv_class *,
+					 struct induction *, int, int));
+static void prefetch_free_list PARAMS ((struct prefetch_arr_info *));
+static void prefetch_merge PARAMS ((struct prefetch_acc_info *, int));
+static void prefetch_array_merge PARAMS ((struct prefetch_arr_info *));
 
 /* Set mem_write when mem_address is found.  Used as callback to
    note_stores.  */
@@ -3696,6 +3759,265 @@ rtx_equal_for_prefetch_p (x, y)
     }
   return 1;
 }
+
+/* Given information about an array and stride for a data access to prefetch,
+   return an existing entry or create a new one and return it.  */
+
+static struct prefetch_arr_info *
+prefetch_array_info (arr_list, stride, stride_sign, address)
+     struct prefetch_arr_info ** arr_list;
+     HOST_WIDE_INT stride;
+     int stride_sign;
+     rtx address;
+{
+  struct prefetch_arr_info *arr;
+  struct prefetch_arr_info *last_arr = NULL;
+
+  /* Search the existing list for this array/stride combination.  While we're
+     going through the array, keep track of the last item so if we need to
+     create a new entry we can add it to the end of the list.  */
+  for (arr = *arr_list; arr; arr = arr->next)  
+    {
+      if (rtx_equal_for_prefetch_p (address, arr->base_address)
+	  && stride * stride_sign == arr->stride * arr->stride_sign)
+	return arr;
+      last_arr = arr;
+    }
+
+  arr = (struct prefetch_arr_info *) xmalloc (sizeof (struct prefetch_arr_info));
+  /* Add the new entry to the list.  */
+  if (last_arr)
+    last_arr->next = arr;
+  else
+    *arr_list = arr;
+
+  arr->base_address = address;
+  arr->stride = stride;
+  arr->stride_sign = stride_sign;
+  arr->first_access = NULL;
+  arr->next = NULL;
+  return arr; 
+}
+
+/* Given information about a specific access within an array, add an entry
+   for that access in order by index within the accesses for the array.  If
+   there is already an access at this index, ignore the new one.  */
+
+static void
+prefetch_access_add (arr, index, size, class, giv, write, solid)
+     struct prefetch_arr_info *arr;
+     HOST_WIDE_INT index;
+     unsigned int size;
+     struct iv_class *class;
+     struct induction *giv;
+     int write;
+     int solid;
+
+{
+  struct prefetch_acc_info *acc;
+  struct prefetch_acc_info *prev_access = NULL;
+
+  for (acc = arr->first_access; acc; acc = acc->next)
+    {
+      if (index == acc->index)
+	{
+	  /* We're already prefetching this element.  */
+	  acc->write |= write;
+	  return;
+	}
+
+      /* This goes at the beginning of the list.  */
+      if (index < acc->index)
+	break;
+
+      prev_access = acc;
+    }
+
+  /* The access is not already in the list, so create a new entry.  */
+  acc = (struct prefetch_acc_info *) xmalloc (sizeof(struct prefetch_acc_info));
+  acc->index = index;
+  acc->class = class;
+  acc->giv = giv;
+  acc->bytes_accessed = size;
+  acc->write = write;
+  acc->solid = solid;
+  acc->next_line = 0;
+  acc->prefetch_in_loop = 0;
+  acc->prefetch_before_loop = 0;
+
+  if (prev_access == NULL)
+    {
+      /* The new access is the first one in the list.  */
+      acc->next = arr->first_access;
+      arr->first_access = acc;
+    }
+  else
+    {
+      /* Insert the new entry after an existing entry in the list.  */
+      acc->next = prev_access->next;
+      prev_access->next= acc;
+    }
+}
+
+static void
+prefetch_free_list (arr_list)
+     struct prefetch_arr_info *arr_list;
+{
+  struct prefetch_arr_info * arr;
+  struct prefetch_arr_info * prev_arr;
+  struct prefetch_acc_info * acc;
+  struct prefetch_acc_info * prev_acc;
+
+  for (arr = arr_list; arr;)
+    {
+      for (acc = arr->first_access; acc;)
+	{
+	  prev_acc = acc;
+	  acc = acc->next;
+	  free (prev_acc);
+	}
+      prev_arr = arr;
+      arr = arr->next;
+      free (prev_arr);
+    }
+}
+
+/* Merge prefetches for array accesses ACC and ACC->NEXT.  If the flag
+   SECOND is 1 then use the index and address from the later access.
+   ??? In practice, SECOND doesn't seem to improve performance for merges
+   where it was expected to be useful.  */
+
+static void
+prefetch_merge (acc, second)
+  struct prefetch_acc_info *acc;
+  int second;
+{
+  struct prefetch_acc_info *next_acc = acc->next;
+  acc->bytes_accessed += next_acc->bytes_accessed;
+  acc->write |= next_acc->write;
+  acc->next = next_acc->next;
+  if (second)
+    {
+      acc->index = next_acc->index;
+      acc->giv = next_acc->giv;
+      acc->class = next_acc->class;
+    }
+  free (next_acc);
+}
+
+/* Merge prefetches for array accesses with the same stride if they are
+   reasonable close together.  Keep track of whether each prefetch is
+   solid (the entire section of the array is prefetched) or sparse (there
+   are gaps between the sections of the array that are prefetched).  The
+   accesses in the list are in increasing order by index.
+   ??? This doesn't yet account for negative strides.  */
+
+static void
+prefetch_array_merge (arr)
+  struct prefetch_arr_info *arr;
+{
+  struct prefetch_acc_info *acc;
+  int stride = arr->stride;
+
+  /* Process each pair of adjacent accesses or, in some case, a series of
+     consecutive entries.  The list loses entries as accesses are merged.  */
+  for (acc = arr->first_access; acc->next;)
+    {
+      HOST_WIDE_INT index_diff = acc->next->index - acc->index;
+
+      /* If the indices are far apart then it's possible that by the time
+	 the later access occurs, the data will no longer be in the cache;
+	 don't merge them.  */
+      if (index_diff >= PREFETCH_EXTREME_DIFFERENCE)
+	{
+	  acc = acc->next;
+	  continue;
+	}
+
+      /* If a prefetch of either access would get every cache line within the
+	  accessed portion of the array, or if the two prefetches are in the
+	  same or adjacent cache lines, merge them and call them solid.  */
+      if (acc->solid || acc->next->solid || (stride <= PREFETCH_BLOCK *2))
+	{
+	  acc->solid = 1;
+	  prefetch_merge (acc, 0);
+	  continue;
+	}
+
+      /* At this point we know that both accesses are sparse.  The prefetches
+	 for the accesses can be merged if the difference in their indices is
+	 a multiple of the stride, so that the later access is already in the
+	 cache from the earlier prefetch.  */
+      if ((index_diff % stride) == 0)
+	{
+	  prefetch_merge (acc, 0);
+	  continue;
+	}
+
+      /* If two sparse accesses are for adjacent cache blocks, keep track of
+	 that information; we might be able to merge them later.  */
+      if (index_diff <= PREFETCH_BLOCK)
+	{
+	  acc->next_line = 1;
+	  acc = acc->next;
+	  continue;
+	}
+
+      /* They can't be merged; go on to the next pair.  */
+      acc = acc->next;
+    }
+
+  /* Merge each series of accesses that touches every cache line for the
+     stride.  There can be more than one such series, with a gap between
+     them.  */
+  for (acc = arr->first_access; acc && acc->next;)
+    {
+      struct prefetch_acc_info *a = acc;
+
+      /* We're only interested in the start of a series.  */
+      if (acc->next_line == 0)
+	{
+	  acc = acc->next;
+	  continue;
+	}
+
+      /* Find the last access in the series.  */
+      while (a->next_line)
+	a = a->next;
+
+      if (a->index - acc->index > stride - PREFETCH_BLOCK)
+	/* The series touches every cache line within the stride; combine
+	   into a single solid prefetch.  */
+	while (acc->next_line)
+	  {
+	    acc->solid = 1;
+	    acc->next_line = acc->next->next_line;
+	    prefetch_merge (acc, 0);
+	  }
+      else
+	{
+	  /* Adjust adjacent sparse prefetches to be a cache line apart, and
+	     merge other prefetches that are now between them.  */
+	  while (acc && acc->next_line)
+	    {
+	      /* If either is for writes, make them both for writes.  */
+	      acc->write |= acc->next->write;
+	      acc->next->write |= acc->write;
+	      if (acc->next->index >= acc->index)
+		acc->next->index = acc->index + PREFETCH_BLOCK;
+	      else
+		{
+		  /* We've adjusted previous accesses enough that this one
+		     is no longer needed; remove it, along with any others
+		     between them.  */
+		  while (acc->next && acc->index >= acc->next->index)
+		    prefetch_merge (acc, 0);
+		}
+	      acc = acc->next;
+	    }
+	}
+    }
+}
 
 /* Remove constant addition value from the expression X (when present)
    and return it.  */
@@ -3744,7 +4066,7 @@ remove_constant_addition (x)
 }
 
 /* Attempt to identify accesses to arrays that are most likely to cause cache
-   misses, and emit prefetch instructions a few prefetch blocks forward.
+   misses, and emit prefetch instructions a few loop iterations forward.
 
    To detect the arrays we use the GIV information that was collected by the
    strength reduction pass.
@@ -3767,17 +4089,18 @@ static void
 emit_prefetch_instructions (loop)
      struct loop *loop;
 {
-  int num_prefetches = 0;
   int num_real_prefetches = 0;
   int num_real_write_prefetches = 0;
   int num_prefetches_before = 0;
   int num_write_prefetches_before = 0;
-  int ahead = 0;
-  int i;
+  int iterations_ahead = 0;
+  unsigned HOST_WIDE_INT iterations = LOOP_INFO (loop)->n_iterations;
   struct iv_class *bl;
   struct induction *iv;
-  struct prefetch_info info[MAX_PREFETCHES];
   struct loop_ivs *ivs = LOOP_IVS (loop);
+  struct prefetch_arr_info *arr_list = NULL;
+  struct prefetch_arr_info *arr;
+  struct prefetch_acc_info *acc;
 
   if (!HAVE_prefetch)
     return;
@@ -3792,10 +4115,19 @@ emit_prefetch_instructions (loop)
       return;
     }
 
+  /* Consider only loops without const calls.  */
+  if (PREFETCH_NO_NONCONST_CALL && LOOP_INFO (loop)->has_nonconst_call)
+    {
+      if (loop_dump_stream)
+	fprintf (loop_dump_stream, "Prefetch: ignoring loop: has nonconst call.\n");
+
+      return;
+    }
+
   /* Don't prefetch in loops known to have few iterations.  */
   if (PREFETCH_NO_LOW_LOOPCNT
-      && LOOP_INFO (loop)->n_iterations
-      && LOOP_INFO (loop)->n_iterations <= PREFETCH_LOW_LOOPCNT)
+      && iterations
+      && iterations <= PREFETCH_LOW_LOOPCNT)
     {
       if (loop_dump_stream)
 	fprintf (loop_dump_stream,
@@ -3859,12 +4191,12 @@ emit_prefetch_instructions (loop)
 	  rtx address;
 	  rtx temp;
 	  HOST_WIDE_INT index = 0;
-	  int add = 1;
 	  HOST_WIDE_INT stride = 0;
 	  int stride_sign = 1;
+	  int solid = 0;
 	  struct check_store_data d;
 	  const char *ignore_reason = NULL;
-	  int size = GET_MODE_SIZE (GET_MODE (iv));
+	  int size;
 
 	  /* See whether an induction variable is interesting to us and if
 	     not, report the reason.  */
@@ -3924,6 +4256,8 @@ emit_prefetch_instructions (loop)
 	  address = simplify_gen_binary (PLUS, Pmode, temp, address);
 	  index = remove_constant_addition (&address);
 
+	  size = GET_MODE_SIZE (GET_MODE (iv->mem));
+
 	  d.mem_write = 0;
 	  d.mem_address = *iv->location;
 
@@ -3939,173 +4273,164 @@ emit_prefetch_instructions (loop)
 	      continue;
 	    }
 
-	  /* Attempt to find another prefetch to the same array and see if we
-	     can merge this one.  */
-	  for (i = 0; i < num_prefetches; i++)
-	    if (rtx_equal_for_prefetch_p (address, info[i].base_address)
-		&& stride == info[i].stride)
-	      {
-		/* In case both access same array (same location
-		   just with small difference in constant indexes), merge
-		   the prefetches.  Just do the later and the earlier will
-		   get prefetched from previous iteration.
-		   The artificial threshold should not be too small,
-		   but also not bigger than small portion of memory usually
-		   traversed by single loop.  */
-		if (index >= info[i].index
-		    && index - info[i].index < PREFETCH_EXTREME_DIFFERENCE)
-		  {
-		    info[i].write |= d.mem_write;
-		    info[i].bytes_accessed += size;
-		    info[i].index = index;
-		    info[i].giv = iv;
-		    info[i].class = bl;
-		    info[num_prefetches].base_address = address;
-		    add = 0;
-		    break;
-		  }
-
-		if (index < info[i].index
-		    && info[i].index - index < PREFETCH_EXTREME_DIFFERENCE)
-		  {
-		    info[i].write |= d.mem_write;
-		    info[i].bytes_accessed += size;
-		    add = 0;
-		    break;
-		  }
-	      }
-
-	  /* Merging failed.  */
-	  if (add)
-	    {
-	      info[num_prefetches].giv = iv;
-	      info[num_prefetches].class = bl;
-	      info[num_prefetches].index = index;
-	      info[num_prefetches].stride = stride;
-	      info[num_prefetches].base_address = address;
-	      info[num_prefetches].write = d.mem_write;
-	      info[num_prefetches].bytes_accessed = size;
-	      num_prefetches++;
-	      if (num_prefetches >= MAX_PREFETCHES)
-		{
-		  if (loop_dump_stream)
-		    fprintf (loop_dump_stream,
-			     "Maximal number of prefetches exceeded.\n");
-		  return;
-		}
-	    }
+	  /* Record whether we're accessing every cache line for this
+	     prefetch.  */
+	  solid = (stride <= PREFETCH_BLOCK);
+
+	  /* Insert this access into a list of other access of the same
+	     array and stride.  Each array's list is ordered by index.  */
+	  prefetch_access_add (prefetch_array_info (&arr_list, stride,
+						    stride_sign, address),
+	      		       index, size, bl, iv, d.mem_write, solid);
 	}
     }
 
-  for (i = 0; i < num_prefetches; i++)
-    {
-      int density;
+  /* Merge prefetches for accesses with the same stride in the same array.  */
+  for (arr = arr_list; arr; arr = arr->next)
+    prefetch_array_merge (arr);
 
-      /* Attempt to calculate the total number of bytes fetched by all
-	 iterations of the loop.  Avoid overflow.  */
-      if (LOOP_INFO (loop)->n_iterations
-	  && ((unsigned HOST_WIDE_INT) (0xffffffff / info[i].stride)
-	      >= LOOP_INFO (loop)->n_iterations))
-	info[i].total_bytes = info[i].stride * LOOP_INFO (loop)->n_iterations;
-      else
-	info[i].total_bytes = 0xffffffff;
+  for (arr = arr_list; arr; arr = arr->next)
+    for (acc = arr->first_access; acc; acc = acc->next)
+      {
+	int density;
 
-      density = info[i].bytes_accessed * 100 / info[i].stride;
+	/* Attempt to calculate the total number of bytes fetched by all
+	   iterations of the loop.  Avoid overflow.  */
+	if (iterations
+	    && ((unsigned HOST_WIDE_INT) (0xffffffff / arr->stride)
+		>= iterations))
+	  acc->total_bytes = arr->stride * iterations;
+	else
+	  acc->total_bytes = 0xffffffff;
 
-      /* Prefetch might be worthwhile only when the loads/stores are dense.  */
-      if (PREFETCH_ONLY_DENSE_MEM)
-	if (density * 256 > PREFETCH_DENSE_MEM * 100
-	    && (info[i].total_bytes / PREFETCH_BLOCK
-		>= PREFETCH_BLOCKS_BEFORE_LOOP_MIN))
-	  {
-	    info[i].prefetch_before_loop = 1;
-	    info[i].prefetch_in_loop
-	      = (info[i].total_bytes / PREFETCH_BLOCK
-		 > PREFETCH_BLOCKS_BEFORE_LOOP_MAX);
-	  }
+	density = acc->bytes_accessed * 100 / arr->stride;
+
+	/* Determine whether we should prefetch within the loop and/or before
+	   the loop.  Later we'll calculate how many prefetch instructions to
+	   generate before the loop and within the loop.  */
+	if (PREFETCH_ONLY_DENSE_MEM)
+	  /* Prefetch might be worthwhile only when the loads/stores are dense.
+	     ??? Is density relevant if we're doing sparse prefetches?
+	     We should be concerned with which cache lines are actually accessed
+	     rather than what percentage of prefetched bytes are accessed.  */
+	  if (density * 256 > PREFETCH_DENSE_MEM * 100
+	      && (acc->total_bytes / PREFETCH_BLOCK
+		  >= PREFETCH_BLOCKS_BEFORE_LOOP_MIN))
+	    {
+	      acc->solid = 1;
+	      acc->prefetch_before_loop = 1;
+	      acc->prefetch_in_loop
+		= (acc->total_bytes / PREFETCH_BLOCK
+		   > PREFETCH_BLOCKS_BEFORE_LOOP_MAX);
+	    }
+	  else
+	    {
+	      acc->prefetch_in_loop = 0;
+	      acc->prefetch_before_loop = 0;
+	      if (loop_dump_stream)
+		fprintf (loop_dump_stream,
+		    "Prefetch: ignoring giv at %d: %d%% density is too low.\n",
+			 INSN_UID (acc->giv->insn), density);
+	    }
 	else
 	  {
-	    info[i].prefetch_in_loop = 0, info[i].prefetch_before_loop = 0;
-	    if (loop_dump_stream)
-	      fprintf (loop_dump_stream,
-		  "Prefetch: ignoring giv at %d: %d%% density is too low.\n",
-		       INSN_UID (info[i].giv->insn), density);
+	    acc->prefetch_in_loop = 1;
+	    acc->prefetch_before_loop = 1;	
 	  }
-      else
-	info[i].prefetch_in_loop = 1, info[i].prefetch_before_loop = 1;
 
-      /* Find how many prefetch instructions we'll use within the loop.  */
-      if (info[i].prefetch_in_loop != 0)
-	{
-	  info[i].prefetch_in_loop = ((info[i].stride + PREFETCH_BLOCK - 1)
-				  / PREFETCH_BLOCK);
-	  num_real_prefetches += info[i].prefetch_in_loop;
-	  if (info[i].write)
-	    num_real_write_prefetches += info[i].prefetch_in_loop;
-	}
-    }
+	/* Determine how many prefetch instructions we'll use within the loop.
+	   For a sparse prefetch (not solid), leave it at one.  */
+	if (acc->prefetch_in_loop != 0)
+	  {
+	    if (acc->solid)
+	      acc->prefetch_in_loop = ((arr->stride + PREFETCH_BLOCK - 1)
+				       / PREFETCH_BLOCK);
+	    num_real_prefetches += acc->prefetch_in_loop;
+	    if (acc->write)
+	      num_real_write_prefetches += acc->prefetch_in_loop;
+	  }
+      }
 
-  /* Determine how many iterations ahead to prefetch within the loop, based
-     on how many prefetches we currently expect to do within the loop.  */
-  if (num_real_prefetches != 0)
-    {
-      if ((ahead = SIMULTANEOUS_PREFETCHES / num_real_prefetches) == 0)
+  /* Determine how many iterations ahead to prefetch within the loop.
+     ??? Ideally this should take into account several factors, including
+     the cache size and memory latency on the target, the number of other
+     memory accesses within the loop, and the amount of other work done in
+     the loop.  */
+  if (PREFETCH_CONST_DISTANCE)
+    iterations_ahead = PREFETCH_DISTANCE;
+  else if (num_real_prefetches != 0)
+    {
+      /* Determine the prefetch distance based on how many prefetches we
+	 currently expect to do within the loop  */
+      if ((iterations_ahead = SIMULTANEOUS_PREFETCHES / num_real_prefetches)
+	  == 0)
 	{
 	  if (loop_dump_stream)
 	    fprintf (loop_dump_stream,
-		     "Prefetch: ignoring prefetches within loop: ahead is zero; %d < %d\n",
-		     SIMULTANEOUS_PREFETCHES, num_real_prefetches);
-	  num_real_prefetches = 0, num_real_write_prefetches = 0;
+		     "Prefetch: ignoring prefetches within loop: more prefetches (%d) than target can handle (%d)\n",
+		     num_real_prefetches, SIMULTANEOUS_PREFETCHES);
+	  num_real_prefetches = 0;
+	  num_real_write_prefetches = 0;
 	}
     }
-  /* We'll also use AHEAD to determine how many prefetch instructions to
-     emit before a loop, so don't leave it zero.  */
-  if (ahead == 0)
-    ahead = PREFETCH_BLOCKS_BEFORE_LOOP_MAX;
-
-  for (i = 0; i < num_prefetches; i++)
-    {
-      /* Update if we've decided not to prefetch anything within the loop.  */
-      if (num_real_prefetches == 0)
-	info[i].prefetch_in_loop = 0;
 
-      /* Find how many prefetch instructions we'll use before the loop.  */
-      if (info[i].prefetch_before_loop != 0)
-	{
-	  int n = info[i].total_bytes / PREFETCH_BLOCK;
-	  if (n > ahead)
-	    n = ahead;
-	  info[i].prefetch_before_loop = n;
-	  num_prefetches_before += n;
-	  if (info[i].write)
-	    num_write_prefetches_before += n;
-	}
+  /* We'll also use ITERATIONS_AHEAD to determine how many prefetch
+     instructions to emit before a loop, so don't leave it zero.  */
+  if (iterations_ahead == 0)
+    iterations_ahead = PREFETCH_BLOCKS_BEFORE_LOOP_MAX;
 
-      if (loop_dump_stream)
-	{
-	  if (info[i].prefetch_in_loop == 0
-	      && info[i].prefetch_before_loop == 0)
-	    continue;
-	  fprintf (loop_dump_stream, "Prefetch insn: %d",
-		   INSN_UID (info[i].giv->insn));
-	  fprintf (loop_dump_stream,
-		   "; in loop: %d; before: %d; %s\n",
-		   info[i].prefetch_in_loop,
-		   info[i].prefetch_before_loop,
-		   info[i].write ? "read/write" : "read only");
-	  fprintf (loop_dump_stream,
-		   " density: %d%%; bytes_accessed: %u; total_bytes: %u\n",
-		   (int) (info[i].bytes_accessed * 100 / info[i].stride),
-		   info[i].bytes_accessed, info[i].total_bytes);
-	  fprintf (loop_dump_stream, " index: ");
-	  fprintf (loop_dump_stream, HOST_WIDE_INT_PRINT_DEC, info[i].index);
-	  fprintf (loop_dump_stream, "; stride: ");
-	  fprintf (loop_dump_stream, HOST_WIDE_INT_PRINT_DEC, info[i].stride);
-	  fprintf (loop_dump_stream, "; address: ");
-	  print_rtl (loop_dump_stream, info[i].base_address);
-	  fprintf (loop_dump_stream, "\n");
-	}
-    }
+  for (arr = arr_list; arr; arr = arr->next)
+    for (acc = arr->first_access; acc; acc = acc->next)
+      {
+	/* Update if we've decided not to prefetch anything within the loop.  */
+	if (num_real_prefetches == 0)
+	  acc->prefetch_in_loop = 0;
+
+	/* Update if we're not doing prefetches before the loop.  We don't
+	   bother if the loop is known to have many iterations.  */
+	if (iterations >= PREFETCH_HIGH_LOOPCNT)
+	  acc->prefetch_before_loop = 0;
+
+	/* Find how many prefetch instructions we'll use before the loop.  */
+	if (acc->prefetch_before_loop != 0)
+	  {
+	    int n = ((iterations_ahead * arr->stride) / PREFETCH_BLOCK) + 1;
+	    if (n > iterations_ahead)
+	      n = iterations_ahead;
+	    acc->prefetch_before_loop = n;
+	    num_prefetches_before += n;
+	    if (acc->write)
+	      num_write_prefetches_before += n;
+	  }
+
+	if (loop_dump_stream)
+	  {
+	    if (acc->prefetch_in_loop == 0
+		&& acc->prefetch_before_loop == 0)
+	      continue;
+	    fprintf (loop_dump_stream, "Prefetch insn: %d",
+		     INSN_UID (acc->giv->insn));
+	    fprintf (loop_dump_stream,
+		     "; %s; in loop: %d; before: %d; %s\n",
+		     acc->solid ? "solid" : "sparse",
+		     acc->prefetch_in_loop,
+		     acc->prefetch_before_loop,
+		     acc->write ? "read/write" : "read only");
+	    fprintf (loop_dump_stream,
+		     " density: %d%%; bytes_accessed: %u; total_bytes: %u\n",
+		     (int) (acc->bytes_accessed * 100 / arr->stride),
+		     acc->bytes_accessed, acc->total_bytes);
+	    fprintf (loop_dump_stream, " index: ");
+	    fprintf (loop_dump_stream, HOST_WIDE_INT_PRINT_DEC, acc->index);
+	    fprintf (loop_dump_stream, "; stride: ");
+	    fprintf (loop_dump_stream, HOST_WIDE_INT_PRINT_DEC, arr->stride);
+	    fprintf (loop_dump_stream, "; address: ");
+	    print_rtl (loop_dump_stream, arr->base_address);
+	    fprintf (loop_dump_stream, "\n");
+	    fprintf (loop_dump_stream, "PREFETCH CHECK: solid: %d; %s\n",
+		     acc->solid, acc->write ? "read/write" : "read only");
+	  }
+      }
 
   if (num_real_prefetches + num_prefetches_before > 0)
     {
@@ -4114,91 +4439,96 @@ emit_prefetch_instructions (loop)
 
       if (loop_dump_stream)
 	{
-	  fprintf (loop_dump_stream, "Real prefetches needed within loop: %d (write: %d)\n",
+	  fprintf (loop_dump_stream,
+		   "Real prefetches needed within loop: %d (write: %d)\n",
 		   num_real_prefetches, num_real_write_prefetches);
-	  fprintf (loop_dump_stream, "Real prefetches needed before loop: %d (write: %d)\n",
+	  fprintf (loop_dump_stream,
+		   "Real prefetches needed before loop: %d (write: %d)\n",
 		   num_prefetches_before, num_write_prefetches_before);
 	}
     }
 
-  for (i = 0; i < num_prefetches; i++)
-    {
-      int y;
+  for (arr = arr_list; arr; arr = arr->next)
+    for (acc = arr->first_access; acc; acc = acc->next)
+      {
+	int y;
+	int prefetch_stride;
 
-      for (y = 0; y < info[i].prefetch_in_loop; y++)
-	{
-	  rtx loc = copy_rtx (*info[i].giv->location);
-	  rtx insn;
-	  int bytes_ahead = PREFETCH_BLOCK * (ahead + y);
-	  rtx before_insn = info[i].giv->insn;
-	  rtx prev_insn = PREV_INSN (info[i].giv->insn);
-	  rtx seq;
-
-	  /* We can save some effort by offsetting the address on
-	     architectures with offsettable memory references.  */
-	  if (offsettable_address_p (0, VOIDmode, loc))
-	    loc = plus_constant (loc, bytes_ahead);
-	  else
-	    {
-	      rtx reg = gen_reg_rtx (Pmode);
-	      loop_iv_add_mult_emit_before (loop, loc, const1_rtx,
-		      			    GEN_INT (bytes_ahead), reg,
-				  	    0, before_insn);
-	      loc = reg;
-	    }
+	/* Generate prefetch instructions within the loop.
+	   ??? This does not yet account for negative strides.  */
+	for (y = 0; y < acc->prefetch_in_loop; y++)
+	  {
+	    rtx loc = copy_rtx (*acc->giv->location);
+	    rtx insn;
+	    int bytes_ahead = (iterations_ahead * arr->stride)
+	      		      + (PREFETCH_BLOCK * y);
+	    rtx before_insn = acc->giv->insn;
+	    rtx prev_insn = PREV_INSN (acc->giv->insn);
+	    rtx seq;
+
+	    /* We can save some effort by offsetting the address on
+	       architectures with offsettable memory references.  */
+	    if (offsettable_address_p (0, VOIDmode, loc))
+	      loc = plus_constant (loc, bytes_ahead);
+	    else
+	      {
+		rtx reg = gen_reg_rtx (Pmode);
+		loop_iv_add_mult_emit_before (loop, loc, const1_rtx,
+		      			      GEN_INT (bytes_ahead), reg,
+				  	      0, before_insn);
+		loc = reg;
+	      }
 
-	  start_sequence ();
-	  /* Make sure the address operand is valid for prefetch.  */
-	  if (! (*insn_data[(int)CODE_FOR_prefetch].operand[0].predicate)
-		  (loc, insn_data[(int)CODE_FOR_prefetch].operand[0].mode))
-	    loc = force_reg (Pmode, loc);
-	  emit_insn (gen_prefetch (loc, GEN_INT (info[i].write),
-				   GEN_INT (3)));
-	  seq = gen_sequence ();
-	  end_sequence ();
-	  emit_insn_before (seq, before_insn);
-
-	  /* Check all insns emitted and record the new GIV
-	     information.  */
-	  insn = NEXT_INSN (prev_insn);
-	  while (insn != before_insn)
-	    {
-	      insn = check_insn_for_givs (loop, insn,
-					  info[i].giv->always_executed,
-					  info[i].giv->maybe_multiple);
-	      insn = NEXT_INSN (insn);
-	    }
-	}
+	    /* Make sure the address operand is valid for prefetch.  */
+	    if (! (*insn_data[(int)CODE_FOR_prefetch].operand[0].predicate)
+		    (loc, insn_data[(int)CODE_FOR_prefetch].operand[0].mode))
+	      loc = force_reg (Pmode, loc);
+
+	    start_sequence ();
+	    emit_insn (gen_prefetch (loc, GEN_INT (acc->write), GEN_INT (3)));
+	    seq = gen_sequence ();
+	    end_sequence ();
+	    emit_insn_before (seq, before_insn);
+
+	    /* Check all insns emitted and record the new GIV
+	       information.  */
+	    insn = NEXT_INSN (prev_insn);
+	    while (insn != before_insn)
+	      {
+		insn = check_insn_for_givs (loop, insn,
+					    acc->giv->always_executed,
+					    acc->giv->maybe_multiple);
+		insn = NEXT_INSN (insn);
+	      }
+	  }
 
-      if (PREFETCH_BEFORE_LOOP)
-	{
-	  /* Emit insns before the loop to fetch the first cache lines or,
-	     if we're not prefetching within the loop, everything we expect
-	     to need.  */
-	  for (y = 0; y < info[i].prefetch_before_loop; y++)
-	    {
-	      rtx reg = gen_reg_rtx (Pmode);
-	      rtx loop_start = loop->start;
-	      rtx init_val = info[i].class->initial_value;
-	      rtx add_val = simplify_gen_binary (PLUS, Pmode,
-						 info[i].giv->add_val,
-						 GEN_INT (y * PREFETCH_BLOCK));
-
-	      /* Functions called by LOOP_IV_ADD_EMIT_BEFORE expect a
-		 non-constant INIT_VAL to have the same mode as REG, which
-		 in this case we know to be Pmode.  */
-	      if (GET_MODE (init_val) != Pmode && !CONSTANT_P (init_val))
-		init_val = convert_to_mode (Pmode, init_val, 0);
-	      loop_iv_add_mult_emit_before (loop, init_val,
-					    info[i].giv->mult_val,
-					    add_val, reg, 0, loop_start);
-	      emit_insn_before (gen_prefetch (reg, GEN_INT (info[i].write),
-					      GEN_INT (3)),
-				loop_start);
-	    }
-	}
-    }
+	/* Emit insns before the loop to fetch the first cache lines or,
+	   if we're not prefetching within the loop, everything we expect
+	   to need.
+	   ??? This does not yet account for negative strides.  */
+	prefetch_stride = acc->solid ? PREFETCH_BLOCK : arr->stride;
+	for (y = 0; y < acc->prefetch_before_loop; y++)
+	  {
+	    rtx reg = gen_reg_rtx (Pmode);
+	    rtx loop_start = loop->start;
+	    rtx init_val = acc->class->initial_value;
+	    rtx add_val = simplify_gen_binary (PLUS, Pmode, acc->giv->add_val,
+					       GEN_INT (y * prefetch_stride));
+
+	    /* Functions called by LOOP_IV_ADD_EMIT_BEFORE expect a
+	       non-constant INIT_VAL to have the same mode as REG, which
+	       in this case we know to be Pmode.  */
+	    if (GET_MODE (init_val) != Pmode && !CONSTANT_P (init_val))
+	      init_val = convert_to_mode (Pmode, init_val, 0);
+	    loop_iv_add_mult_emit_before (loop, init_val, acc->giv->mult_val,
+					  add_val, reg, 0, loop_start);
+	    emit_insn_before (gen_prefetch (reg, GEN_INT (acc->write),
+					    GEN_INT (3)),
+			      loop_start);
+	  }
+      }
 
+  prefetch_free_list (arr_list);
   return;
 }
 



More information about the Gcc-patches mailing list