
RFC: Prefetch instruction support



Hi,
Over the last weekend I've implemented a basic prefetch instruction framework
(tested on Athlon), based on suggestions from Alex Dreyzen and Norbert Juffa
of AMD.

Basically it examines loops for the arrays they traverse and generates
prefetches a few cache lines ahead.  It is also able to suggest an
unroll_number to the loop unroller so that exactly one cache line is fetched
per loop iteration (in order to avoid prefetching the same cache line multiple
times).  This results in massive loop unrolling (up to 64 times on the
Athlon), but it seems to do a good job.  Later we may want to tweak this
heuristic.
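
As a concrete illustration, the suggested unroll factor is simply the cache
line size divided by the GIV step (a standalone sketch with names of my own;
64 bytes is the Athlon's cache line size):

#define CACHE_LINE_SIZE 64   /* Athlon L1 cache line, in bytes */

/* Suggested unroll factor: the number of iterations needed to advance
   one whole cache line, e.g. 64/4 = 16 for an int array, 64/1 = 64
   for a char array (the "up to 64 times" case above).  */
static int
suggest_unroll (int giv_step)
{
  return CACHE_LINE_SIZE / giv_step;
}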

Here are a few implementation details:

The prefetch instruction is represented as:

(insn 92 28 30 (unspec_volatile[ 
            (plus:SI (plus:SI (mult:SI (reg/v:SI 26)
                        (const_int 4 [0x4]))
                    (reg:SI 29))
                (const_int 128 [0x80]))
        ]  1) -1 (nil)
    (nil))
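
In terms of RTL generation, creating such an insn boils down to something like
this (a simplified sketch; emit_prefetch is a hypothetical name for
illustration, not necessarily what the patch uses):

/* ADDR is the prefetch address already expanded to RTL.  Wrap it in an
   unspec_volatile with code 1, as in the dump above, and emit it.  */
static void
emit_prefetch (rtx addr)
{
  emit_insn (gen_rtx_UNSPEC_VOLATILE (VOIDmode, gen_rtvec (1, addr), 1));
}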

To detect the arrays I use the GIV information collected by the strength
reduction pass, so I am currently missing arrays whose access step is the same
as the step of the basic induction variable, but I think I can easily modify
strength reduction to collect this information for me.

The prefetch instructions are generated after the GIV information is computed
and before the strength reduction proper.  The new GIVs are injected into
strength reduction's tables, so the prefetch addresses are optimized as well.

GIVs are split into a base address, a step, and a constant addition value.
GIVs with the same base address, the same step, and close addition values are
combined into a single prefetch.  Writes through GIVs are also detected, so
the Athlon's prefetchw instruction can be used for blocks we write to.
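
Roughly, two such blocks can share one prefetch under a test like this (an
illustrative sketch with ad-hoc types; the real pass works on its GIV
records):

#define CACHE_LINE_SIZE 64   /* assumed */

struct prefetch_block
{
  const void *base;  /* base address (symbol or register)        */
  long step;         /* bytes advanced per iteration             */
  long offset;       /* constant addition value                  */
  int write;         /* nonzero if written; when two blocks are
                        combined, a write in either one means
                        prefetchw is used for the result          */
};

/* Return nonzero if A and B can share a single prefetch: same base
   and step, addition values less than a cache line apart.  */
static int
combinable_p (const struct prefetch_block *a, const struct prefetch_block *b)
{
  long d = a->offset - b->offset;
  return a->base == b->base
         && a->step == b->step
         && (d < 0 ? -d : d) < CACHE_LINE_SIZE;
}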

In order to keep unrolling from making copies of the prefetch instructions,
I've modified toplev to unroll loops in the first loop pass and emit the
prefetch instructions in the second pass.  This is a kind of kludge and I am
open to suggestions on how to avoid it.  On the other hand, it seems to work
very well :)

To calculate the prefetch distance, I have two parameters: the cache line
size and the number of prefetches that can usefully be issued at once.  Each
array is prefetched max_prefetches/n_arrays cache lines ahead.  When the
number of arrays is higher than the number of prefetches, I give up.
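
In code the computation amounts to this (a sketch; the example values in the
comment are assumptions chosen to match the +128 offsets above):

#define CACHE_LINE_SIZE 64   /* assumed */

/* Return the prefetch distance in bytes, or -1 to give up; e.g. with
   max_prefetches == 4 and n_arrays == 2, each array is prefetched
   2 lines == 128 bytes ahead.  Assumes n_arrays > 0.  */
static int
prefetch_distance (int max_prefetches, int n_arrays)
{
  if (n_arrays > max_prefetches)
    return -1;
  return max_prefetches / n_arrays * CACHE_LINE_SIZE;
}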

There is some heuristic to avoid useless prefetching.  Basically I don't
prefetch if one of the following holds:
1) The loop has a known and low iteration count (the limit is 1000 at the
   moment).
2) The loop has more than 10 prefetch blocks.
3) The loop contains a function call (such a loop is probably not an inner
   loop).
4) The density of the prefetch (the sum of the sizes of the memory accesses
   relative to the prefetch step) is less than 80%; see the sketch after this
   list.
5) Additionally, when the writes do not have the always_executed flag, the
   write flag is not set (to avoid useless dirtying of cache lines).
6) The step is larger than 4096 bytes, or is negative.
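
The density test from point 4 boils down to this (sketch):

/* The bytes actually accessed per iteration must cover at least 80%
   of the step; in the example below, 4 bytes accessed with a step of
   4 gives a density of 100%.  */
static int
dense_enough_p (int access_bytes, int step)
{
  return access_bytes * 100 >= step * 80;
}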

On the Athlon I am currently getting the following results.  The AMD folks
promised a 100% speedup on the STREAM benchmark.  I am unable to get that even
by hand-modifying the loop, but the speedups are still notable and further
tuning is of course possible.

For example, take the following simple code:

int a[10000];
int b[10000];
main ()
{
  int i;
  for (i = 0; i < 10000; i++)
    {
      a[i] = a[i] + b[i];
    }
}

The loop pass detects two memory accesses:

Prefetch insn 37 address: (symbol_ref:SI ("a")) Index:4 step:4 density:100% read/write prefetch
Prefetch insn 31 address: (symbol_ref:SI ("b")) Index:4 step:4 density:100% read only prefetch
Real prefetches needed:2 (write:1)
Recommending 16 iterations unroll

The resulting code is:

main:
	xorl	%edx, %edx
	movl	$1000, %ecx
	.p2align 4
.L6:
	prefetch	b+128(%edx)
	movl	b(%edx), %eax
	addl	%eax, %eax
	prefetchw	a+128(%edx)
	movl	%eax, a(%edx)
	addl	$4, %edx
	decl	%ecx
	jne	.L6
	ret

With -funroll-all-loops, the situation looks worse, since the loop gets
unrolled 16 times.  Because the unroller is unable to find a divisor anywhere
near this number, it defaults to naive unrolling.  In the second pass of the
loop optimizer the following prefetches are detected:

Prefetch insn 431 address: (plus:SI (reg:SI 39)
    (symbol_ref:SI ("b"))) Index:64 step:64 density:100% read only prefetch
Prefetch insn 433 address: (reg:SI 93) Index:64 step:64 density:100% read/write prefetch
Real prefetches needed:2 (write:1)

main:
	pushl	%ebx
	xorl	%edx, %edx
	movl	b(%edx), %eax
	movl	$8, %ebx
	addl	%eax, %eax
	movl	%eax, a(%edx)
	addl	$4, %edx
	movl	b(%edx), %eax
	addl	%eax, %eax
	movl	%eax, a(%edx)
<snip>
	cmpl	$999, %ebx
	jg	.L9
	leal	a(%edx), %ecx
	addl	$b, %edx
	.p2align 4
.L6:
	movl	(%edx), %eax
	addl	%eax, %eax
	movl	%eax, (%ecx)
	movl	4(%edx), %eax
	addl	%eax, %eax
	movl	%eax, 4(%ecx)
	movl	8(%edx), %eax
	addl	%eax, %eax
	movl	%eax, 8(%ecx)
<snip>
	movl	56(%edx), %eax
	addl	%eax, %eax
	movl	%eax, 56(%ecx)
	prefetch	188(%edx)
	movl	60(%edx), %eax
	addl	%eax, %eax
	prefetchw	188(%ecx)
	addl	$16, %ebx
	addl	$64, %edx
	movl	%eax, 60(%ecx)
	addl	$64, %ecx
	cmpl	$999, %ebx
	jle	.L6
.L9:
	popl	%ebx
	ret
.Lfe1:
	.size	 main,.Lfe1-main
	.ident	"GCC: (GNU) 2.96 20000402 (experimental)"

Here are a few benchmarks I've made:

STREAM benchmark (moving, multiplying and summing 80MB long "double" arrays):
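
(For reference, the four kernels of the standard STREAM benchmark, in C:)

/* a, b, c are the large double arrays, q a scalar.  */
void
stream_kernels (double *a, double *b, double *c, double q, int n)
{
  int j;
  for (j = 0; j < n; j++) c[j] = a[j];              /* Copy  */
  for (j = 0; j < n; j++) b[j] = q * c[j];          /* Scale */
  for (j = 0; j < n; j++) c[j] = a[j] + b[j];       /* Add   */
  for (j = 0; j < n; j++) a[j] = b[j] + q * c[j];   /* Triad */
}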

Without prefetch:
(code size: 6016 bytes)
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         390.2439       0.4100       0.4100       0.4100
Scale:        258.0645       0.6200       0.6200       0.6200
Add:          444.4444       0.5400       0.5400       0.5400
Triad:        303.7975       0.7900       0.7900       0.7900

With prefetch:
(code size: 6144 bytes)
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         484.8485       0.3320       0.3300       0.3400
Scale:        347.8261       0.4660       0.4600       0.4700
Add:          533.3333       0.4500       0.4500       0.4500
Triad:        413.7931       0.5860       0.5800       0.5900

Without prefetch, with unrolling:
(code size: 7872 bytes)
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         390.2439       0.4180       0.4100       0.4200
Scale:        457.1429       0.3580       0.3500       0.3600
Add:          500.0000       0.4860       0.4800       0.4900
Triad:        436.3636       0.5500       0.5500       0.5500

With prefetch, with unrolling:
(code size: 8128 bytes)
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         444.4444       0.3600       0.3600       0.3600
Scale:        470.5882       0.3460       0.3400       0.3500
Add:          521.7391       0.4640       0.4600       0.4700
Triad:        521.7391       0.4640       0.4600       0.4700

I've noticed similar speedups in XaoS's filters (motion blur, emboss).  In
the BYTE benchmark there seems to be no measurable difference in either
direction, except for LU DECOMPOSITION, which is roughly 8% faster now.  With
-funroll-all-loops the code grows from 6.5 KB to 8.5 KB, which is quite a
lot.  Without -funroll-all-loops, the growth is about 0.5%.

That's probably all there is to say...

So I am looking forward to your ideas.  If you think the code is acceptable
for inclusion as is, I can clean it up and prepare the patch.

Honza
