cache optimization

Sat Nov 28 18:19:00 GMT 2009

--- On Thu, 11/26/09, Tim Prince <n8tm@aol.com> wrote:

> From: Tim Prince <n8tm@aol.com>
> Subject: Re: cache optimization
> To: "Łukasz" <blurrpp@yahoo.com>
> Cc: gcc-help@gcc.gnu.org
> Date: Thursday, November 26, 2009, 4:38 PM
> Łukasz wrote:
> > Hi, I want to learn how to optimize cache usage in gcc.
> > I found the builtin function __builtin_prefetch, which should
> > prefetch data into the cache, so I used the canonical :) example of
> > vector addition.
> > 
> > for (i = 0; i < n; i++)
> >   {
> >     a[i] = a[i] + b[i];
> >     __builtin_prefetch (&a[i+1], 1, 1);
> >     __builtin_prefetch (&b[i+1], 0, 1);
> >     /* ... */
> >   }
> > 
> > and compiled it with gcc without special options, but it is slower than
> > 
> > for (i = 0; i < n; i++)
> >   {
> >     a[i] = a[i] + b[i];
> >     /* ... */
> >   }
> > 
> > so maybe I should compile it with some extra options
> > to take advantage of cache prefetching?
> > (-fprefetch-loop-arrays doesn't work)
> > 
> > 
> > 
> Under normal settings, on CPUs of the last 6 years or so,
> you are prefetching what has already been prefetched by
> hardware prefetcher.  If your search engine doesn't
> find you many success stories about the use of this feature,
> that might be a clue that it involves some serious
> investigation. You would look for slow spots in your code
> which don't fall in the usual hardware supported prefetch
> patterns (linear access with not too large a stride, or
> pairs of cache lines), and experiment with fetching the data
> sufficiently far in advance for it to do some good, without
> exceeding your cache capacity.
> I do see a "success story" about prefetching for a reversed
> loop. As the author doesn't divulge the CPU in use, one
> suspects it might be something like the old Athlon32 which
> supported hardware prefetch only in the forward direction.
> Don't you like advice which assumes no one will ever use a
> CPU different (e.g. more up to date) than the author's
> favorite?
> 

You are completely right. In this example gcc compiles the loop into branch assembly, which of course is already "predicted" (meaning forward branches NOT TAKEN, backward branches TAKEN). But I am looking for a nice example for modern processors where prefetching really works, I mean one that actually speeds the program up (I am still searching the net). The Intel Optimization Reference Manual advises using PREFETCH for any predictable memory access pattern, but as you already mentioned, the processor can predict some patterns by itself.
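
A minimal sketch of the kind of case discussed above, where the access pattern is not one a hardware prefetcher is likely to recognize: a sum over data[] through an index array. The function name gather_sum and the constant PREFETCH_DISTANCE are hypothetical, chosen for illustration, and the value 16 is only a guess that would have to be tuned by measurement on the target CPU. Whether this is faster than the plain loop at all depends on the processor and on how large and how scattered the data is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
   The right value depends on the CPU and memory latency; measure it. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n).
   idx[] is read sequentially, which hardware prefetch handles well.
   data[] is read at arbitrary positions, so a software prefetch
   issued a few iterations early may help for large arrays. */
static double gather_sum(const double *data, const size_t *idx, size_t n)
{
    double sum = 0.0;

    assert(n == 0 || (data != NULL && idx != NULL));

    for (size_t i = 0; i < n; i++) {
        /* Stay inside idx[]: only prefetch while a later index exists. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);

        sum += data[idx[i]];
    }
    return sum;
}
```

The second argument 0 marks the prefetch as a read and the third argument 1 asks for low temporal locality, the same values as in the quoted vector addition example. For a purely sequential loop such as a[i] = a[i] + b[i], the same call only repeats what the hardware prefetcher already does, which matches the slowdown reported above.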