This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



RE: [patch] Improve prefetch heuristics


Hi Zdenek,

Thank you so much for your comments.  Please refer to the embedded replies:

>> Patch1: 0001-Do-not-insert-prefetches-if-they-would-hit-the-same-.patch
>> This patch modifies the prefetch generation logic.  We don't issue a
>>prefetch if it would fall on the same cache line with an existing
>>memory reference or prefetch.  This patch improves the following
>> benchmarks: 416.gamess (~7%), 434.zeusmp (~4%), 454.calculix (~2%) and
>>445.gobmk (~2%).

>this seems redundant, we handle this by setting prefetch_mod so that
>the prefetches do not fall on the same cache line in prune_ref_by_self_reuse.
>Due to rounding issues, the distance between the prefetches may be slightly
>less than the cache line size; still, I am quite surprised you got any
>results by this change -- do you have some small example where it is useful?
>This might indicate some problem with prune_ref_by_self_reuse/issue_prefetch_ref logic.

You are right.  It looks like prune_ref_by_self_reuse has already adjusted the prefetch distance
through prefetch_mod.  The short prefetch distance I observed may be due to the induction variable
"ap" in the issue_prefetch_ref logic:

   for (ap = 0; ap < n_prefetches; ap++)    /* <---------------------------- */
     {
       /* Determine the address to prefetch.  */
       delta = (ahead + ap * ref->prefetch_mod) * ref->group->step;

When ap equals 0, the prune_ref_by_self_reuse adjustment is essentially ignored.  Applying the
following patch resolves the short prefetch distance problem:

 -  for (ap = 0; ap < n_prefetches; ap++)
 +  for (ap = 1; ap <= n_prefetches; ap++)
   {
       /* Determine the address to prefetch.  */
       delta = (ahead + ap * ref->prefetch_mod) * ref->group->step;


However, the 416.gamess and 445.gobmk regressions could not be fixed.  I am still looking for
other causes of the degradation.



>> Patch2: 0002-Do-not-unroll-when-loop-unrolling-is-not-enabled.patch
>> We disable the prefetch-driven unrolling completely when loop
>> unrolling is not enabled.  The purpose of this patch is to control the
>> code size increase by prefetch.  Applying this patch, we see a 7%
>> improvement on 465.tonto, and 3.75% degradation on 450.soplex.  We
>> have verified that the improvement and degradation are purely from the
>> unrolling.

>I disagree with this change.  By enabling prefetching, the user already
>indicated that he is fine with some code size increase. Also, in many cases
>prefetching without unrolling does not make sense, and you should not be
>forced to have -funroll-loops (and thus also have unrelated loops unrolled)
>to enable it.

I understand that loop unrolling is needed to avoid redundant prefetches on a cache
line.  But aggressive unrolling may add overhead.  We may need to tune the cost
model or the prefetch-driven unrolling logic for the prefetch to be beneficial.  I withdraw
this patch for now.  Thanks,

>> Patch3: 0003-Dump-a-diagnostic-for-a-insn-to-mem-ratio-too-small.patch
>> This patch adds the diagnostic information when the instruction to
>> memory ratio is too small.

>This patch is OK.
Thanks. We are going to check in this patch soon.


>> Patch4: 0004-Account-for-loop-unrolling-in-the-insn-to-prefetch-r.patch
>> This patch improves the instruction to prefetch ratio heuristics by accounting
>> for loop unrolling when computing the number of instructions in a loop.

>This implementation is a bit simplistic -- the number of issued prefetch instructions
>is also affected by unrolling.  So, prefetch_mod and the unroll factor should
>be taken into account when determining prefetch_count.  Also, the number of insns
>of the unrolled loop will usually be significantly smaller than the number of
>insns of the original loop * unroll_factor (at least the induction variable increases
>and the exit branches will get eliminated), so it might be better to use
>tree_estimate_loop_size + estimated_unrolled_size.

>Anyway, the patch is OK, assuming that you mention these problems in a comment.

Thanks. We are going to add the comments as you suggested and check in this patch.

I am going to investigate whether the TRIP_COUNT_TO_AHEAD_RATIO heuristic can handle
all three of the following problems.

Thanks,

Changpeng

> Patch5: 0005-Also-apply-the-insn-to-prefetch-ratio-heuristic-to-l.patch
> This patch applies the instruction to prefetch ratio heuristic also to loops with
> known loop trip count.  It improves 416.gamess by 3~4% and 445.gobmk by 3%.

I think it would be better to find out why the prefetching for the loops in
these examples is currently performed (or why it causes the degradation).  If
the instruction count is too small, AHEAD should be big enough to prevent the
prefetching.  Currently we just test whether est_niter <= ahead, which is
rather aggressive. We should only emit the prefetches if # of iterations is
significantly bigger than ahead, so that could be the place that needs to be
fixed.  Which ...

> Patch6: 0006-Define-the-TRIP_COUNT_TO_AHEAD_RATIO-heuristic.patch
> This patch defines the trip count to ahead ratio heuristic in the cost
> model: don't generate prefetches for loops where the trip count is
> less than TRIP_COUNT_TO_AHEAD_RATIO times the ahead iterations.

... you actually do here.  This is OK (I assume you did some experiments with
the value of TRIP_COUNT_TO_AHEAD_RATIO?)

> Patch 7: 0007-Don-t-generate-prefetches-for-loops-with-small-trip-.patch
> Don't generate prefetch for loops with small trip count.  We found that
> loop prefetch is not effective for loops with small trip counts, which
> usually have big loop bodies and appropriate prefetch scheduling is
> required.  This patch improves 454.calculix by ~5%.

This looks a bit redundant with the previous patch -- can't you just set
TRIP_COUNT_TO_AHEAD_RATIO to 5 (or whatever), so that it would take care
of this case as well?

Zdenek


