PATCH: Enhance Reuse Analysis of Long-stride Memory References in the Prefetcher

Shobaki, Ghassan Ghassan.Shobaki@amd.com
Wed Aug 12 15:51:00 GMT 2009


I have made both changes. I added a new function that computes the miss
probability over all possible alignments, and I switched back from
floating-point to integer arithmetic. Here is the revised patch. It
bootstrapped and passed CPU2006 with the expected results: lbm improves
by 8% and all other benchmarks are unaffected. Is it OK for trunk?
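
To illustrate the integer computation with the step = 96, cache line = 64,
delta = 30 example from the discussion quoted below (assuming an 8-byte
alignment unit, as for double-typed references): 6 of the 16 sampled
(alignment, iteration) positions fall in different cache lines, so the new
function returns 1000 * 6 / 16 = 375, which is compared directly against
ACCEPTABLE_MISS_RATE with no floating-point arithmetic involved.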

Thanks
-Ghassan


2009-08-12  Ghassan Shobaki  <ghassan.shobaki@amd.com>

	* tree-ssa-loop-prefetch.c 
	(prune_ref_by_group_reuse): Enhance probabilistic analysis 
	for long-stride pruning.
	(compute_miss_rate): New function to compute the probability
	that two memory references access different cache lines.


Index: tree-ssa-loop-prefetch.c
===================================================================
--- tree-ssa-loop-prefetch.c	(revision 150474)
+++ tree-ssa-loop-prefetch.c	(working copy)
@@ -593,6 +593,45 @@ ddown (HOST_WIDE_INT x, unsigned HOST_WI
     return (x + by - 1) / by;
 }
 
+/* Given a CACHE_LINE_SIZE and two inductive memory references 
+   with a common STEP greater than CACHE_LINE_SIZE and an address 
+   difference DELTA, compute the probability that they will fall 
+   in different cache lines.  DISTINCT_ITERS is the number of 
+   distinct iterations after which the pattern repeats itself.  
+   ALIGN_UNIT is the unit of alignment in bytes.  */
+
+static int
+compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size, 
+		   HOST_WIDE_INT step, HOST_WIDE_INT delta,
+		   unsigned HOST_WIDE_INT distinct_iters,
+		   int align_unit)
+{
+  unsigned align, iter;
+  int total_positions, miss_positions, miss_rate;
+  int address1, address2, cache_line1, cache_line2;
+
+  total_positions = 0;
+  miss_positions = 0;
+  
+  /* Iterate through all possible alignments of the first
+     memory reference within its cache line.  */
+  for (align = 0; align < cache_line_size; align += align_unit)
+
+    /* Iterate through all distinct iterations.  */
+    for (iter = 0; iter < distinct_iters; iter++)
+      {
+	address1 = align + step * iter;
+	address2 = address1 + delta;
+	cache_line1 = address1 / cache_line_size;
+	cache_line2 = address2 / cache_line_size;
+	total_positions += 1;
+	if (cache_line1 != cache_line2)
+	  miss_positions += 1;
+      }
+  miss_rate = 1000 * miss_positions / total_positions;
+  return miss_rate;
+}
+
 /* Prune the prefetch candidate REF using the reuse with BY.
    If BY_IS_BEFORE is true, BY is before REF in the loop.  */
 
@@ -606,6 +645,11 @@ prune_ref_by_group_reuse (struct mem_ref
   HOST_WIDE_INT delta = delta_b - delta_r;
   HOST_WIDE_INT hit_from;
   unsigned HOST_WIDE_INT prefetch_before, prefetch_block;
+  int miss_rate;
+  HOST_WIDE_INT reduced_step;
+  unsigned HOST_WIDE_INT reduced_prefetch_block;
+  tree ref_type;
+  int align_unit;
 
   if (delta == 0)
     {
@@ -667,25 +711,29 @@ prune_ref_by_group_reuse (struct mem_ref
       return;
     }
 
-  /* A more complicated case.  First let us ensure that size of cache line
-     and step are coprime (here we assume that PREFETCH_BLOCK is a power
-     of two.  */
+  /* A more complicated case with step > prefetch_block.  First reduce
+     the ratio between the step and the cache line size to its simplest
+     terms.  The resulting denominator will then represent the number of
+     distinct iterations after which each address will go back to its
+     initial location within the cache line.  This computation assumes
+     that PREFETCH_BLOCK is a power of two.  */
   prefetch_block = PREFETCH_BLOCK;
-  while ((step & 1) == 0
-	 && prefetch_block > 1)
+  reduced_prefetch_block = prefetch_block;
+  reduced_step = step;
+  while ((reduced_step & 1) == 0
+	 && reduced_prefetch_block > 1)
     {
-      step >>= 1;
-      prefetch_block >>= 1;
-      delta >>= 1;
+      reduced_step >>= 1;
+      reduced_prefetch_block >>= 1;
     }
 
-  /* Now step > prefetch_block, and step and prefetch_block are coprime.
-     Determine the probability that the accesses hit the same cache line.  */
-
   prefetch_before = delta / step;
   delta %= step;
-  if ((unsigned HOST_WIDE_INT) delta
-      <= (prefetch_block * ACCEPTABLE_MISS_RATE / 1000))
+  ref_type = TREE_TYPE (ref->mem);
+  align_unit = TYPE_ALIGN (ref_type) / 8;
+  miss_rate = compute_miss_rate(prefetch_block, step, delta, 
+				reduced_prefetch_block, align_unit);
+  if (miss_rate <= ACCEPTABLE_MISS_RATE)
     {
       if (prefetch_before < ref->prefetch_before)
 	ref->prefetch_before = prefetch_before;
@@ -696,8 +744,9 @@ prune_ref_by_group_reuse (struct mem_ref
   /* Try also the following iteration.  */
   prefetch_before++;
   delta = step - delta;
-  if ((unsigned HOST_WIDE_INT) delta
-      <= (prefetch_block * ACCEPTABLE_MISS_RATE / 1000))
+  miss_rate = compute_miss_rate(prefetch_block, step, delta, 
+				reduced_prefetch_block, align_unit);
+  if (miss_rate <= ACCEPTABLE_MISS_RATE) 
     {
       if (prefetch_before < ref->prefetch_before)
 	ref->prefetch_before = prefetch_before;
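
For anyone who wants to reproduce such numbers, here is a small standalone
sketch (not part of the patch) that mirrors the logic of the new
compute_miss_rate function and applies it to the step = 96, cache line = 64,
delta = 30 example from the discussion quoted below.  The 8-byte alignment
unit and the helper name miss_rate_sketch are illustrative assumptions, not
taken from the patch itself.

#include <stdio.h>

/* Mirror of the patch's miss-rate computation: sample every possible
   alignment of the first reference within its cache line and every
   distinct iteration, and return misses per 1000 sampled positions.  */
static int
miss_rate_sketch (long cache_line_size, long step, long delta,
                  unsigned distinct_iters, int align_unit)
{
  long align;
  unsigned iter;
  int total_positions = 0, miss_positions = 0;

  for (align = 0; align < cache_line_size; align += align_unit)
    for (iter = 0; iter < distinct_iters; iter++)
      {
        long address1 = align + step * iter;
        long address2 = address1 + delta;
        total_positions++;
        if (address1 / cache_line_size != address2 / cache_line_size)
          miss_positions++;
      }

  return 1000 * miss_positions / total_positions;
}

int
main (void)
{
  /* step = 96 and a 64-byte line reduce to 3/2, so the pattern repeats
     after 2 distinct iterations; assume 8-byte (double) alignment.  */
  printf ("miss rate = %d per mille\n", miss_rate_sketch (64, 96, 30, 2, 8));
  return 0;
}

With these inputs the sketch prints 375.  Since the patch only prunes a
reference when miss_rate <= ACCEPTABLE_MISS_RATE, any threshold below 375
keeps separate prefetches for both references in this example, as intended.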



-----Original Message-----
From: Shobaki, Ghassan 
Sent: Saturday, August 08, 2009 9:16 AM
To: Shobaki, Ghassan; Zdenek Dvorak; Joseph S. Myers
Cc: gcc-patches@gcc.gnu.org; Pop, Sebastian
Subject: RE: PATCH: Enhance Reuse Analysis of Long-stride Memory
References in the Prefetcher

Correction: I meant to say a weighted sum of miss probabilities over all
possible alignments. Sorry for having to send two emails!

-Ghassan

-----Original Message-----
From: Shobaki, Ghassan 
Sent: Saturday, August 08, 2009 9:13 AM
To: 'Zdenek Dvorak'; Joseph S. Myers
Cc: gcc-patches@gcc.gnu.org; Pop, Sebastian
Subject: RE: PATCH: Enhance Reuse Analysis of Long-stride Memory
References in the Prefetcher

Great feedback! I'll work on making both changes (use integer arithmetic
instead of FP arithmetic and compute a weighted sum of all possible
alignments). I'll then send an updated patch once the implementation and
testing are completed. 

Thanks
-Ghassan  

-----Original Message-----
From: Zdenek Dvorak [mailto:rakdver@kam.mff.cuni.cz] 
Sent: Friday, August 07, 2009 10:25 PM
To: Joseph S. Myers
Cc: Shobaki, Ghassan; gcc-patches@gcc.gnu.org; Pop, Sebastian
Subject: Re: PATCH: Enhance Reuse Analysis of Long-stride Memory
References in the Prefetcher

Hi,

Joseph S. Myers:
> On Fri, 7 Aug 2009, Shobaki, Ghassan wrote:
> 
> > 2. Using floating point arithmetic for a more precise computation of
> > the miss probability.
> 
> Can you prove that this cannot lead to results depending on the host
> system (for example, being different depending on whether float
> arithmetic has excess precision)?
> 
> > +  miss_rate = (float) delta / (float) prefetch_block;
> > +  acceptable_miss_rate = (float) ACCEPTABLE_MISS_RATE / 1000.0;
> 
> Is there some reason you can't simply use appropriately scaled integer
> arithmetic (miss_rate = 1000 * delta / prefetch_block)?  (This supposes
> that 1000 * delta can't overflow; I don't know if there are already
> checks elsewhere in the code that avoid that possibility, but if not
> you'd need to add them.)

as delta < prefetch_block (which is the cache line size, i.e., usually
something like 64), the overflow is not an issue, and I also believe that
using only integer arithmetic would be better.  Also,

> In particular, if the offset difference (delta) between the two memory
> references in question is less than the greatest common divisor (GCD)
> of the step and the cache_line_size, the current analysis will give a
> miss probability of zero, thus causing the pruning of one of the two
> memory references.  To see why this can be a significant
> under-estimate, consider, for example, two memory references with a
> step of 96, a cache_line_size of 64 and a delta of 30.  In this case
> the GCD is 32.  Dividing by this GCD gives a step of 3, a
> cache_line_size of 2 and a delta of zero.  This leads to a miss
> probability of zero.  The actual miss probability in this case depends
> on the alignment of the two memory references.  If the first memory
> reference is aligned on a cache line boundary, the miss probability
> will be 0% (both memory references will hit the same cache line in all
> iterations), but if the memory reference falls somewhere in the middle
> of a cache line (which is very likely), the miss probability will be
> 50% (the two memory references will hit different cache lines in every
> other iteration).  Obviously, when the miss probability is 50%, we
> would like to issue prefetches for both memory references.

As your discussion shows, the "miss rate" you compute only makes sense if
you average over all possible alignments (and consider them all to be
equally likely).  I think it would be preferable to compute explicitly
the miss rates that are actually possible, and their probabilities
depending on the alignment.
That is, in your example the miss rate is

0% with probability 4/64 (for the alignments 0, 1, 32 and 33 of the
first memory reference) and
50% with probability 60/64 (the remaining cases)

This should make the validity and the assumptions of the computation more
obvious, as well as make further changes (e.g., taking the actual
alignment of the references into account or altering it) easier.  The
patch is OK with these two changes,

Zdenek
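
For completeness, the 0%/50% distribution above can be reproduced with a
short standalone enumeration (illustrative only; the variable names and the
2-iteration period, which mirrors the reduced 3/2 ratio of step to cache
line size, are assumptions rather than part of any patch):

#include <stdio.h>

int
main (void)
{
  const long step = 96, delta = 30, cache_line_size = 64;
  const unsigned distinct_iters = 2;	/* 96/64 reduces to 3/2.  */
  long align;
  unsigned iter;
  int zero_miss = 0, half_miss = 0;

  /* For each byte alignment of the first reference, count in how many of
     the two distinct iterations the references hit different lines.  */
  for (align = 0; align < cache_line_size; align++)
    {
      int misses = 0;
      for (iter = 0; iter < distinct_iters; iter++)
        {
          long a1 = align + step * iter;
          long a2 = a1 + delta;
          if (a1 / cache_line_size != a2 / cache_line_size)
            misses++;
        }
      if (misses == 0)
        zero_miss++;	/* 0% miss rate for this alignment.  */
      else
        half_miss++;	/* here exactly one miss in two iterations: 50%.  */
    }

  printf ("0%% miss rate: %d/64 alignments, 50%%: %d/64\n",
          zero_miss, half_miss);
  return 0;
}

Running it prints 4/64 and 60/64, matching the distribution given above.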



