This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
RE: [patch] Improve prefetch heuristics
- From: "Fang, Changpeng" <Changpeng dot Fang at amd dot com>
- To: Zdenek Dvorak <rakdver at kam dot mff dot cuni dot cz>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, "rguenther at suse dot de" <rguenther at suse dot de>, "sebpop at gmail dot com" <sebpop at gmail dot com>
- Date: Mon, 17 May 2010 11:44:26 -0500
- Subject: RE: [patch] Improve prefetch heuristics
- References: <20100430010543.GA30055@kam.mff.cuni.cz> <1C13CD442679CE45A2E80AE9251D7EF921803A43@SAUSEXMBP01.amd.com>,<20100512075207.GA8691@kam.mff.cuni.cz>
Hi,
Attached are two approved patches for gcc prefetch improvements. Note that patch 0004 is the first
part of the original patch 0001-Reduce-useless-and-redundant-prefetches.patch.
Both patches passed bootstrap and regression tests.
Thanks,
Changpeng
>>> 0001-Reduce-useless-and-redundant-prefetches.patch
>>> ==========================================
>>> This patch has two parts:
>>> First one is in schedule_prefetches. If the actual unroll_factor is far less than what is
>>> required by the prefetch (i.e. the loop is not sufficiently unrolled), redundant prefetches
>>> will be introduced. For example, if the prefetch requires unrolling 16 times and the actual
>>> unroll factor is 1 (not unrolled), 15 out of 16 iterations of the loop will issue redundant
>>> prefetches (16 prefetches fall on the same cache line).
>>> We add the following lines in schedule_prefetches to disable prefetches for such cases:
>>> if (prefetch_mod / unroll_factor > 8)
>>> continue;
>>
>> this part is OK.
________________________________________
From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
Sent: Wednesday, May 12, 2010 2:52 AM
To: Fang, Changpeng
Cc: gcc-patches@gcc.gnu.org; rguenther@suse.de; sebpop@gmail.com
Subject: Re: [patch] Improve prefetch heuristics
Hi,
> > >> Patch5: 0005-Also-apply-the-insn-to-prefetch-ratio-heuristic-to-l.patch
> > >> This patch applies the instruction to prefetch ratio heuristic also to loops with
> > >> known loop trip count. It improves 416.gamess by 3~4% and 445.gobmk by 3%.
> >
> >in principle, I agree that the patch might be reasonable. But, I would like to see
> >the comparison with the results that you get by decreasing SIMULTANEOUS_PREFETCHES to
> >some reasonable value (say 10),
>
> First of all, decreasing SIMULTANEOUS_PREFETCHES to 10 seems to drop some useful prefetches and
> causes a ~20+% performance degradation on 470.lbm.
>
> For 416.gamess, with SIMULTANEOUS_PREFETCHES =10, the performance is still worse (~1%) than what I got
> by applying the instruction-to-prefetch-ratio heuristic. But we see a ~2% gain compared with
> SIMULTANEOUS_PREFETCHES =100 (default).
>
> For 445.gobmk, the performance is the same with SIMULTANEOUS_PREFETCHES =10 and =100. So, for
> this benchmark, too many prefetches may not be the reason for the degradation.
thanks for the comparison. The patch is OK,
Zdenek
From fcc7524f515a65e4687efd94f31d98b59eab733c Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Wed, 12 May 2010 15:28:18 -0700
Subject: [PATCH 4/5] Define PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO heuristic
* tree-ssa-loop-prefetch.c (PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO): New.
(schedule_prefetches): Do not generate a prefetch if the unroll factor
is far from what is required by the prefetch.
---
gcc/tree-ssa-loop-prefetch.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 2cc16bb..5f5c34c 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -226,6 +226,17 @@ struct mem_ref_group
#define TRIP_COUNT_TO_AHEAD_RATIO 4
#endif
+/* Do not generate a prefetch if the unroll factor is significantly less
+ than what is required by the prefetch. This is to avoid redundant
+ prefetches. For example, if prefetch_mod is 16 and unroll_factor is
+ 1, this means prefetching requires unrolling the loop 16 times, but
+ the loop is not going to be unrolled. In this case (ratio = 16),
+ prefetching is not likely to be beneficial. */
+
+#ifndef PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO
+#define PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO 8
+#endif
+
/* The memory reference. */
struct mem_ref
@@ -907,6 +918,12 @@ schedule_prefetches (struct mem_ref_group *groups, unsigned unroll_factor,
if (!should_issue_prefetch_p (ref))
continue;
+ /* The loop is far from being sufficiently unrolled for this
+ prefetch. Do not generate the prefetch, to avoid many redundant
+ prefetches. */
+ if (ref->prefetch_mod / unroll_factor > PREFETCH_MOD_TO_UNROLL_FACTOR_RATIO)
+ continue;
+
/* If we need to prefetch the reference each PREFETCH_MOD iterations,
and we unroll the loop UNROLL_FACTOR times, we need to insert
ceil (UNROLL_FACTOR / PREFETCH_MOD) instructions in each
--
1.6.3.3
From a4bfecdac69787065649817187e00ade61fbbe02 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@houghton.(none)>
Date: Wed, 12 May 2010 15:50:43 -0700
Subject: [PATCH 5/5] Also apply the insn to prefetch ratio heuristic to loops with known trip count.
* doc/invoke.texi: Update documentation for min-insn-to-prefetch-ratio.
* tree-ssa-loop-prefetch.c (is_loop_prefetching_profitable):
Also apply the insn to prefetch ratio heuristic to loops with
known trip count.
---
gcc/doc/invoke.texi | 3 +--
gcc/tree-ssa-loop-prefetch.c | 29 +++++++++++++++--------------
2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index bf3cd18..ba0bfc5 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8414,8 +8414,7 @@ The size of L2 cache, in kilobytes.
@item min-insn-to-prefetch-ratio
The minimum ratio between the number of instructions and the
-number of prefetches to enable prefetching in a loop with an
-unknown trip count.
+number of prefetches to enable prefetching in a loop.
@item prefetch-min-insn-to-mem-ratio
The minimum ratio between the number of instructions and the
diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 5f5c34c..2ef19f7 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -1589,17 +1589,9 @@ is_loop_prefetching_profitable (unsigned ahead, HOST_WIDE_INT est_niter,
return false;
}
- /* Profitability of prefetching is highly dependent on the trip count.
- For a given AHEAD distance, the first AHEAD iterations do not benefit
- from prefetching, and the last AHEAD iterations execute useless
- prefetches. So, if the trip count is not large enough relative to AHEAD,
- prefetching may cause serious performance degradation. To avoid this
- problem when the trip count is not known at compile time, we
- conservatively skip loops with high prefetching costs. For now, only
- the I-cache cost is considered. The relative I-cache cost is estimated
- by taking the ratio between the number of prefetches and the total
- number of instructions. Since we are using integer arithmetic, we
- compute the reciprocal of this ratio.
+ /* Prefetching most likely causes performance degradation when the instruction
+ to prefetch ratio is too small. Too many prefetch instructions in a loop
+ may reduce the I-cache performance.
(unroll_factor * ninsns) is used to estimate the number of instructions in
the unrolled loop. This implementation is a bit simplistic -- the number
of issued prefetch instructions is also affected by unrolling. So,
@@ -1609,12 +1601,21 @@ is_loop_prefetching_profitable (unsigned ahead, HOST_WIDE_INT est_niter,
original loop * unroll_factor (at least the induction variable increases
and the exit branches will get eliminated), so it might be better to use
tree_estimate_loop_size + estimated_unrolled_size. */
- if (est_niter < 0)
+ insn_to_prefetch_ratio = (unroll_factor * ninsns) / prefetch_count;
+ if (insn_to_prefetch_ratio < MIN_INSN_TO_PREFETCH_RATIO)
{
- insn_to_prefetch_ratio = (unroll_factor * ninsns) / prefetch_count;
- return insn_to_prefetch_ratio >= MIN_INSN_TO_PREFETCH_RATIO;
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ "Not prefetching -- instruction to prefetch ratio (%d) too small\n",
+ insn_to_prefetch_ratio);
+ return false;
}
+ /* We cannot estimate further if the trip count is unknown. Just assume
+ prefetching is profitable. Too aggressive??? */
+ if (est_niter < 0)
+ return true;
+
if (est_niter < (HOST_WIDE_INT) (TRIP_COUNT_TO_AHEAD_RATIO * ahead))
{
if (dump_file && (dump_flags & TDF_DETAILS))
--
1.6.3.3