PATCH: Enhance Reuse Analysis of Long-stride Memory References in the Prefetcher

Shobaki, Ghassan Ghassan.Shobaki@amd.com
Sat Aug 8 00:50:00 GMT 2009


Hi,

This patch enhances the reuse analysis of long-stride memory references in the prefetching pass. The current reuse analysis for long-stride prefetching underestimates the probability that two memory references fall on different cache lines, which may cause important prefetches to be missed.
 
In particular, if the offset difference (delta) between the two memory references in question is less than the greatest common divisor (GCD) of the step and the cache_line_size, the current analysis gives a miss probability of zero and therefore prunes one of the two memory references.

To see why this can be a significant underestimate, consider two memory references with a step of 96, a cache_line_size of 64 and a delta of 30. The GCD here is 32. Dividing by this GCD gives a step of 3, a cache_line_size of 2 and a delta of zero, which leads to a miss probability of zero. The actual miss probability, however, depends on the alignment of the two memory references. If the first memory reference is aligned on a cache-line boundary, the miss probability is indeed 0% (both memory references hit the same cache line in every iteration), but if the first memory reference falls somewhere in the middle of a cache line (which is very likely), the miss probability is 50% (the two memory references hit different cache lines in every other iteration). Obviously, when the miss probability is 50%, we would like to issue prefetches for both memory references.
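To illustrate the numbers above, here is a small standalone C program (not part of the patch, just a demonstration) that enumerates the possible byte alignments of the first reference for the step-96 / 64-byte-line / delta-30 case and measures how often the two references land on different cache lines:

#include <stdio.h>

int
main (void)
{
  const long step = 96, block = 64, delta = 30;
  const int iters = 1000;
  double total = 0.0;
  long align;
  int i;

  for (align = 0; align < block; align++)
    {
      int misses = 0;
      for (i = 0; i < iters; i++)
        {
          long a = align + (long) i * step;   /* first reference  */
          long b = a + delta;                 /* second reference */
          if (a / block != b / block)         /* different cache lines?  */
            misses++;
        }
      if (align == 0 || align == 16)
        printf ("alignment %2ld: miss rate %.0f%%\n",
                align, 100.0 * misses / iters);
      total += (double) misses / iters;
    }

  /* Averaged over all possible alignments, the miss rate approaches
     delta / block (30 / 64, roughly 47%), which is the value the
     patched analysis uses.  */
  printf ("average over all alignments: %.1f%%\n", 100.0 * total / block);
  return 0;
}

This prints a 0% miss rate for the cache-line-aligned case, 50% for a mid-line alignment such as 16, and an average of about 47% over all alignments.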

The attached patch enhances the current computation of the miss probability by:

1.	Introducing an explicit variable for the miss probability, which makes the code more readable.
2.	Using floating point arithmetic for a more precise computation of the miss probability.
3.	Most importantly, accounting for the fact that multiple alignments are possible (see the sketch after this list). Note, though, that this patch does not try to identify the actual alignment or to ensure a particular alignment; it simply assumes that multiple alignments are possible. In a private discussion, Zdenek (who originally wrote this code) suggested ways of controlling the alignment, including setting DECL_ALIGN. My investigation of that idea indicates that a significant amount of work would be needed to implement it, so I plan to address it in a future patch that further enhances the analysis.
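To make item 3 concrete, the following minimal sketch (again not the patch itself) contrasts the old integer pruning test with the new floating-point miss-rate test for the step-96 example; the ACCEPTABLE_MISS_RATE value of 50 per mille below is an assumed placeholder for illustration, not necessarily the value used in tree-ssa-loop-prefetch.c:

#include <stdio.h>

#define ACCEPTABLE_MISS_RATE 50   /* assumed threshold, in per mille */

int
main (void)
{
  long step = 96, prefetch_block = 64, delta = 30;
  long s = step, b = prefetch_block, d = delta;
  float miss_rate, acceptable;

  /* Old analysis: divide out common powers of two, then compare the
     reduced delta against an integer threshold.  For this example the
     reduced delta becomes 0, so the reference is pruned.  */
  while ((s & 1) == 0 && b > 1)
    {
      s >>= 1;
      b >>= 1;
      d >>= 1;
    }
  printf ("old test: reduced delta %ld vs. threshold %ld -> %s\n",
          d, b * ACCEPTABLE_MISS_RATE / 1000,
          d <= b * ACCEPTABLE_MISS_RATE / 1000 ? "pruned" : "kept");

  /* New analysis: compute the miss rate directly in floating point
     (delta / prefetch_block is about 0.47 here), so both references
     are kept and prefetched.  */
  miss_rate = (float) delta / (float) prefetch_block;
  acceptable = (float) ACCEPTABLE_MISS_RATE / 1000.0f;
  printf ("new test: miss rate %.3f vs. %.3f -> %s\n",
          miss_rate, acceptable,
          miss_rate <= acceptable ? "pruned" : "kept");
  return 0;
}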

In terms of performance, I ran CPU2006 on an AMD Shanghai machine (using our best base flags, shown below) and found that this patch improves lbm by 8% without affecting any other benchmark. This improves the geometric mean of FP2006 by 0.5%.

The patch bootstrapped successfully on Linux. 

Is this patch OK for trunk?

Thanks
-Ghassan


2009-08-07  Ghassan Shobaki  <ghassan.shobaki@amd.com>

	* tree-ssa-loop-prefetch.c (prune_ref_by_group_reuse): Enhance
	probabilistic analysis for long-stride pruning.


Index: tree-ssa-loop-prefetch.c
===================================================================
--- tree-ssa-loop-prefetch.c	(revision 150474)
+++ tree-ssa-loop-prefetch.c	(working copy)
@@ -606,6 +606,7 @@ prune_ref_by_group_reuse (struct mem_ref
   HOST_WIDE_INT delta = delta_b - delta_r;
   HOST_WIDE_INT hit_from;
   unsigned HOST_WIDE_INT prefetch_before, prefetch_block;
+  float miss_rate, acceptable_miss_rate;
 
   if (delta == 0)
     {
@@ -671,21 +672,15 @@ prune_ref_by_group_reuse (struct mem_ref
      and step are coprime (here we assume that PREFETCH_BLOCK is a power
      of two.  */
   prefetch_block = PREFETCH_BLOCK;
-  while ((step & 1) == 0
-	 && prefetch_block > 1)
-    {
-      step >>= 1;
-      prefetch_block >>= 1;
-      delta >>= 1;
-    }
 
-  /* Now step > prefetch_block, and step and prefetch_block are coprime.
+  /* Now step > prefetch_block.
      Determine the probability that the accesses hit the same cache line.  */
 
   prefetch_before = delta / step;
   delta %= step;
-  if ((unsigned HOST_WIDE_INT) delta
-      <= (prefetch_block * ACCEPTABLE_MISS_RATE / 1000))
+  miss_rate = (float) delta / (float) prefetch_block;
+  acceptable_miss_rate = (float) ACCEPTABLE_MISS_RATE / 1000.0;
+  if (miss_rate <= acceptable_miss_rate)
     {
       if (prefetch_before < ref->prefetch_before)
 	ref->prefetch_before = prefetch_before;
@@ -696,12 +691,12 @@ prune_ref_by_group_reuse (struct mem_ref
   /* Try also the following iteration.  */
   prefetch_before++;
   delta = step - delta;
-  if ((unsigned HOST_WIDE_INT) delta
-      <= (prefetch_block * ACCEPTABLE_MISS_RATE / 1000))
+  miss_rate = (float) delta / (float) prefetch_block;
+  if (miss_rate <= acceptable_miss_rate)
     {
       if (prefetch_before < ref->prefetch_before)
 	ref->prefetch_before = prefetch_before;
 
       return;
     }



GCC Flags Used in the SPEC CPU2006 test:

INT2006
COPTIMIZE	= -O3 -funroll-all-loops -minline-all-stringops -mveclibabi=acml -m64 -march=amdfam10 -fno-sched-dep-count-heuristic -fprefetch-loop-arrays --param prefetch-min-insn-to-mem-ratio=4 --param simultaneous-prefetches=20 --param prefetch-latency=300
CXXOPTIMIZE	= -O3 -funroll-all-loops -minline-all-stringops -m32 -march=amdfam10 -static -fprefetch-loop-arrays

FP2006
COPTIMIZE	= -O3 -funroll-all-loops -ffast-math -mveclibabi=acml -m64 -march=amdfam10 -fno-tree-pre -fprefetch-loop-arrays --param prefetch-latency=700
CXXOPTIMIZE	= -O3 -mveclibabi=acml -m64 -march=amdfam10 -fprefetch-loop-arrays
FOPTIMIZE	= -O3 -funroll-all-loops -ffast-math -mveclibabi=acml -m64 -march=amdfam10 -fno-tree-pre


