[patch] Improve prefetch heuristics

Fang, Changpeng Changpeng.Fang@amd.com
Mon May 3 23:17:00 GMT 2010


 
> >> You are right.  It looks like prune_ref_by_self_reuse has already adjusted the prefetch distance
> >> through prefetch_mod. The reason I observed the short prefetch distance may be due to the induction variable
> >> "ap" in the issue_prefetch_ref logic:
> >>
> >>    for (ap = 0; ap < n_prefetches; ap++)    /* <---------------------------- */
> >>      {
> >>        /* Determine the address to prefetch.  */
> >>        delta = (ahead + ap * ref->prefetch_mod) * ref->group->step;
> >>
> >> When ap equals 0, the prune_self_reuse adjustment is essentially ignored.  Applying the following patch
> >> can resolve the short prefetch distance problem:
> >>
> >>  -  for (ap = 0; ap < n_prefetches; ap++)
> >>  +  for (ap = 1; ap <= n_prefetches; ap++)
> >>    {
> >>        /* Determine the address to prefetch.  */
> >>        delta = (ahead + ap * ref->prefetch_mod) * ref->group->step;
>
> >I think there is some misunderstanding.  The statement "When ap equals 0, the prune_self_reuse adjustment
> >is essentially ignored." does not make sense to me.  Also, the change you propose only increases the prefetch
> >distance by a constant offset.
>>
>> In my experiments, n_prefetches always equals 1; as a result, ap * ref->prefetch_mod == 0. This is what I meant by
>> prefetch_mod having no effect, and why I observed such a short prefetch distance (delta < L1_CACHE_LINE_SIZE).

>then the problem is with determining `ahead',


I will use the following example to defend my original patch, 0001-Do-not-insert-prefetches-if-they-would-hit-the-same-.patch.
I think the current prefetch pass will generate prefetches that fall on the same cache line as an existing memory reference.
The case is from 416.gamess:

ahead 1, prefetch_mod 16,  step 4 , unroll factor 1, trip count 1001, insn count 1817, mem ref count 11, prefetch count 7

The case shows a loop with a big body (1817 instructions). For loop prefetching, the prefetch ahead for such a big loop can
only be 1 (any other value would make the prefetched data arrive in the cache too early, and may evict useful data from
the cache).

The loop is not going to be unrolled, and the step size is 4, so
 delta = (ahead + ap * ref->prefetch_mod) * ref->group->step
       = (1 + 0 * 16) * 4
       = 4 when ap = 0

I cannot see how we can generate effective prefetches for loops with a big body and a small step size.
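
The arithmetic for this case can be reproduced with a minimal sketch. It only mirrors the delta expression quoted from issue_prefetch_ref; the helper names and the 64-byte line size are assumptions made for illustration (GCC takes the real value from its l1-cache-line-size parameter), and this is not the pass's actual code.

```c
#include <stdbool.h>

/* Assumed cache line size in bytes (illustrative only; GCC reads the
   real value from the l1-cache-line-size parameter).  */
#define ASSUMED_CACHE_LINE_SIZE 64

/* Byte offset of the ap-th prefetch from the reference, following the
   expression quoted from issue_prefetch_ref.  Hypothetical helper.  */
static long
prefetch_delta (long ahead, long ap, long prefetch_mod, long step)
{
  return (ahead + ap * prefetch_mod) * step;
}

/* Hypothetical helper: true if the prefetch address is less than one
   cache line away from the reference, i.e. delta < line size.  */
static bool
delta_shorter_than_cache_line (long delta)
{
  return delta > -ASSUMED_CACHE_LINE_SIZE
	 && delta < ASSUMED_CACHE_LINE_SIZE;
}
```

With the values from the 416.gamess loop (ahead 1, prefetch_mod 16, step 4), prefetch_delta (1, 0, 16, 4) gives 4 bytes, which is well under the assumed line size, while ap = 1 would give (1 + 16) * 4 = 68 bytes.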

Thanks,

Changpeng


