This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



RE: [patch] Improve prefetch heuristics


Hi,
________________________________________
From: Christian Borntraeger [borntraeger@de.ibm.com]
Sent: Friday, April 30, 2010 2:13 AM
To: gcc-patches@gcc.gnu.org
Cc: Zdenek Dvorak; Fang, Changpeng; rguenther@suse.de; sebpop@gmail.com
Subject: Re: [patch] Improve prefetch heuristics

On Friday, 30 April 2010 03:05:43, Zdenek Dvorak wrote:
> Hi,
>
> > Patch1: 0001-Do-not-insert-prefetches-if-they-would-hit-the-same-.patch
> > This patch modifies the prefetch generation logic.  We don't issue a
> > prefetch if it would fall on the same cache line as an existing
> > memory reference or prefetch.  This patch improves the following
> > benchmarks: 416.gamess (~7%), 434.zeusmp (~4%), 454.calculix (~2%) and
> > 445.gobmk (~2%).
>
> this seems redundant; we handle this in prune_ref_by_self_reuse by setting
> prefetch_mod so that the prefetches do not fall on the same cache line.
> Due to rounding issues, the distance between the prefetches may be slightly
> less than the cache line size; still, I am quite surprised you got any
> results from this change -- do you have a small example where it is useful?
> This might indicate a problem with the prune_ref_by_self_reuse/issue_prefetch_ref logic.
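
For reference, the check Patch1 adds essentially amounts to the following
self-contained sketch (the helper name and data layout are illustrative only,
not the actual patch code):

  #include <stdbool.h>
  #include <stddef.h>

  #define CACHE_LINE_SIZE 64

  /* Skip a candidate prefetch whose target address falls on a cache line
     that is already touched by an existing memory reference or by a
     prefetch we have already decided to issue.  ADDRS holds those N
     addresses.  */
  bool
  prefetch_line_is_new_p (size_t candidate, const size_t *addrs, size_t n)
  {
    for (size_t i = 0; i < n; i++)
      if (candidate / CACHE_LINE_SIZE == addrs[i] / CACHE_LINE_SIZE)
        return false;
    return true;
  }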

> Thinking more about this, the good results on x86 might come from a completely
> different aspect.  This patch basically boils down to "issue fewer prefetches
> if the step size is very small".
> I can only speak for s390, but x86 might be similar.  On s390 we have a
> hardware stride prefetcher.  Because address translation is more expensive, our
> hw prefetcher does not cross page boundaries, but the prefetch instruction
> does.  That means that for small step sizes the HW prefetcher works most
> of the time and no prefetch instruction is needed.  If the step size gets
> bigger, the hardware prefetcher becomes less and less useful and the prefetch
> instruction becomes more and more important.

> If x86 is similar, then this patch might actually be a good thing.  The current
> prefetch code only allows specifying whether there is forward or backward prefetching,
> but not these "in-between" cases.

> So maybe the patch should look more like:

> [...]
> + #ifndef PREFETCH_MIN_STEP_SIZE
> + #define PREFETCH_MIN_STEP_SIZE 32
> + #endif
> [...]
> +      /* Don't issue a prefetch if the step size is so small that the hw
> +         stride prefetcher will do it anyway and page crossings are rare.  */
> +      if (abs (delta - start_offset) < PREFETCH_MIN_STEP_SIZE)
> +        /* Drop the prefetch.  */
> [...]

Thanks for mentioning the hardware prefetch consideration. The interaction between software and hardware
prefetches is an interesting topic.
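
If I read your suggestion correctly, the check boils down to something like
this self-contained sketch (the names are illustrative only, not the actual
tree-ssa-loop-prefetch.c code):

  #include <stdbool.h>
  #include <stdlib.h>

  #ifndef PREFETCH_MIN_STEP_SIZE
  #define PREFETCH_MIN_STEP_SIZE 32  /* bytes; would be a per-target tunable */
  #endif

  /* Return true if a software prefetch is worth issuing for a reference
     whose address advances by STEP bytes per iteration.  For smaller
     strides the hardware stride prefetcher keeps up and only rarely loses
     out at page boundaries.  */
  bool
  sw_prefetch_worthwhile_p (long step)
  {
    return labs (step) >= PREFETCH_MIN_STEP_SIZE;
  }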

But, in this patch, I would like to address the cache line reuse and useless prefetch issue.  As Zdenek
has pointed out, the short prefetch distance I have observed is due to the incorrect determination of the
prefetch "ahead" iteration count.  Consider:

  for (i = 0; i < n; i++)
    ... = a[i];

For single-precision floating point (4 bytes per element), I would expect "ahead" to be at least 16 in order to
avoid useless prefetches with a cache line size of 64 bytes.  Currently, however, "ahead" is determined only by the
prefetch latency and the loop body size.  We are investigating how to determine "ahead" effectively so that the
generated prefetches are useful.
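
The arithmetic is simply the cache line size over the per-iteration step; a
trivial sketch of that calculation (illustrative only, not the pass code):

  #include <stdio.h>

  int
  main (void)
  {
    int line_size = 64;                  /* cache line size in bytes */
    int step = sizeof (float);           /* a[i] advances 4 bytes per iteration */

    /* Iterations needed before we cross into a new cache line.  */
    int min_ahead = (line_size + step - 1) / step;   /* 64 / 4 = 16 */

    printf ("minimum useful ahead = %d iterations\n", min_ahead);
    return 0;
  }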

Thanks,

Changpeng

