This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
RE: [patch] Improve prefetch heuristics
- From: "Fang, Changpeng" <Changpeng dot Fang at amd dot com>
- To: Zdenek Dvorak <rakdver at kam dot mff dot cuni dot cz>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, "rguenther at suse dot de" <rguenther at suse dot de>, "sebpop at gmail dot com" <sebpop at gmail dot com>
- Date: Mon, 17 May 2010 11:44:26 -0500
- Subject: RE: [patch] Improve prefetch heuristics
- References: <20100430010543.GA30055@kam.mff.cuni.cz> <1C13CD442679CE45A2E80AE9251D7EF921803A43@SAUSEXMBP01.amd.com>,<20100512075207.GA8691@kam.mff.cuni.cz>
Hi,
Attached are two approved patches for gcc prefetch improvements. Note that patch 0004 is the first
part of the original patch 0001-Reduce-useless-and-redundant-prefetches.patch.
Both patches passed bootstrap and regression tests.
Thanks,
Changpeng
>>> 0001-Reduce-useless-and-redundant-prefetches.patch
>>> ==========================================
>>> This patch has two parts:
>>> First one is in schedule_prefetches. If the actual unroll_factor is far less than what is
>>> required by the prefetch (i.e. the loop is not sufficiently unrolled), redundant prefetches
>>> will be introduced. For example, if the prefetch requires unrolling 16 times and the actual
>>> unroll factor is 1 (not unrolled), 15 out of 16 iterations of the loop will issue redundant
>>> prefetches (16 prefetches fall on the same cache line).
>>> We add the following lines in schedule_prefetches to disable prefetches for such cases:
>>> if (prefetch_mod / unroll_factor > 8)
>>> continue;
>>
>> this part is OK.
________________________________________
From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
Sent: Wednesday, May 12, 2010 2:52 AM
To: Fang, Changpeng
Cc: gcc-patches@gcc.gnu.org; rguenther@suse.de; sebpop@gmail.com
Subject: Re: [patch] Improve prefetch heuristics
Hi,
> > >> Patch5: 0005-Also-apply-the-insn-to-prefetch-ratio-heuristic-to-l.patch
> > >> This patch applies the instruction to prefetch ratio heuristic also to loops with
> > >> known loop trip count. It improves 416.gamess by 3~4% and 445.gobmk by 3%.
> >
> >in principle, I agree that the patch might be reasonable. But, I would like to see
> >the comparison with the results that you get by decreasing SIMULTANEOUS_PREFETCHES to
> >some reasonable value (say 10),
>
> First of all, decreasing SIMULTANEOUS_PREFETCHES to 10 seems to drop some useful prefetches and
> causes a ~20+% performance degradation on 470.lbm.
>
> For 416.gamess, with SIMULTANEOUS_PREFETCHES =10, the performance is still worse (~1%) than what I got
> by applying the instruction-to-prefetch-ratio heuristic. But we see a ~2% gain compared with
> SIMULTANEOUS_PREFETCHES =100 (default).
>
> For 445.gobmk, the performance is the same with SIMULTANEOUS_PREFETCHES =10 and =100. So, for
> this benchmark, too many prefetches may not be the reason for the degradation.
thanks for the comparison. The patch is OK,
Zdenek
From fcc7524f515a65e4687efd94f31d98b59eab733c Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Wed, 12 May 2010 15:28:18 -0700
Subject: [PATCH 4/5] Define PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO heuristic
* tree-ssa-loop-prefetch.c (PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO): New.
(schedule_prefetches): Do not generate a prefetch if the unroll factor
is far from what is required by the prefetch.
---
gcc/tree-ssa-loop-prefetch.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 2cc16bb..5f5c34c 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -226,6 +226,17 @@ struct mem_ref_group
#define TRIP_COUNT_TO_AHEAD_RATIO 4
#endif
+/* Do not generate a prefetch if the unroll factor is significantly less
+ than what is required by the prefetch. This is to avoid redundant
+ prefetches. For example, if prefetch_mod is 16 and unroll_factor is
+ 1, this means prefetching requires unrolling the loop 16 times, but
+ the loop is not going to be unrolled. In this case (ratio = 16),
+ prefetching is not likely to be beneficial. */
+
+#ifndef PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO
+#define PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO 8
+#endif
+
/* The memory reference. */
struct mem_ref
@@ -907,6 +918,12 @@ schedule_prefetches (struct mem_ref_group *groups, unsigned unroll_factor,
if (!should_issue_prefetch_p (ref))
continue;
+ /* The loop is far from being sufficiently unrolled for this
+ prefetch. Do not generate the prefetch, to avoid many redundant
+ prefetches. */
+ if (ref->prefetch_mod / unroll_factor > PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO)
+ continue;
+
/* If we need to prefetch the reference each PREFETCH_MOD iterations,
and we unroll the loop UNROLL_FACTOR times, we need to insert
ceil (UNROLL_FACTOR / PREFETCH_MOD) instructions in each
--
1.6.3.3
From a4bfecdac69787065649817187e00ade61fbbe02 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Wed, 12 May 2010 15:50:43 -0700
Subject: [PATCH 5/5] Also apply the insn to prefetch ratio heuristic to loops with known trip count.
* doc/invoke.texi: Update documentation for min-insn-to-prefetch-ratio.
* tree-ssa-loop-prefetch.c (is_loop_prefetching_profitable):
Also apply the insn to prefetch ratio heuristic to loops with
known trip count.
---
gcc/doc/invoke.texi | 3 +--
gcc/tree-ssa-loop-prefetch.c | 29 +++++++++++++++--------------
2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index bf3cd18..ba0bfc5 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8414,8 +8414,7 @@ The size of L2 cache, in kilobytes.
@item min-insn-to-prefetch-ratio
The minimum ratio between the number of instructions and the
-number of prefetches to enable prefetching in a loop with an
-unknown trip count.
+number of prefetches to enable prefetching in a loop.
@item prefetch-min-insn-to-mem-ratio
The minimum ratio between the number of instructions and the
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 5f5c34c..2ef19f7 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1589,17 +1589,9 @@ is_loop_prefetching_profitable (unsigned ahead, HOST_WIDE_INT est_niter,
return false;
}
- /* Profitability of prefetching is highly dependent on the trip count.
- For a given AHEAD distance, the first AHEAD iterations do not benefit
- from prefetching, and the last AHEAD iterations execute useless
- prefetches. So, if the trip count is not large enough relative to AHEAD,
- prefetching may cause serious performance degradation. To avoid this
- problem when the trip count is not known at compile time, we
- conservatively skip loops with high prefetching costs. For now, only
- the I-cache cost is considered. The relative I-cache cost is estimated
- by taking the ratio between the number of prefetches and the total
- number of instructions. Since we are using integer arithmetic, we
- compute the reciprocal of this ratio.
+ /* Prefetching most likely causes performance degradation when the instruction
+ to prefetch ratio is too small. Too many prefetch instructions in a loop
+ may reduce the I-cache performance.
(unroll_factor * ninsns) is used to estimate the number of instructions in
the unrolled loop. This implementation is a bit simplistic -- the number
of issued prefetch instructions is also affected by unrolling. So,
@@ -1609,12 +1601,21 @@ is_loop_prefetching_profitable (unsigned ahead, HOST_WIDE_INT est_niter,
original loop * unroll_factor (at least the induction variable increases
and the exit branches will get eliminated), so it might be better to use
tree_estimate_loop_size + estimated_unrolled_size. */
- if (est_niter < 0)
+ insn_to_prefetch_ratio = (unroll_factor * ninsns) / prefetch_count;
+ if (insn_to_prefetch_ratio < MIN_INSN_TO_PREFETCH_RATIO)
{
- insn_to_prefetch_ratio = (unroll_factor * ninsns) / prefetch_count;
- return insn_to_prefetch_ratio >= MIN_INSN_TO_PREFETCH_RATIO;
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Not prefetching -- instruction to prefetch ratio (%d) too small\n",
+ insn_to_prefetch_ratio);
+ return false;
}
+ /* We cannot estimate further if the trip count is unknown. Just assume
+ prefetching is profitable. Too aggressive??? */
+ if (est_niter < 0)
+ return true;
+
if (est_niter < (HOST_WIDE_INT) (TRIP_COUNT_TO_AHEAD_RATIO * ahead))
{
if (dump_file && (dump_flags & TDF_DETAILS))
--
1.6.3.3