This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Vectorization: Loop peeling with misaligned support.


On Sun, Nov 17, 2013 at 04:42:18PM +0100, Richard Biener wrote:
> "OndÅej BÃlka" <neleai@seznam.cz> wrote:
> >On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> >> "OndÅej BÃlka" <neleai@seznam.cz> wrote:
> >> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> >> 
> >> IIRC what can still be seen is store-buffer related slowdowns when
> >> you have a big unaligned store load in your loop.  Thus aligning
> >> stores still pays back last time I measured this.
> >
> >Then send your benchmark. What I did is a loop that stores 512 bytes.
> >Unaligned stores there are faster than aligned ones, so tell me when
> >aligning stores pays for itself. Note that when filling the store buffer
> >you must take into account the extra stores needed to make the loop
> >aligned.
> 
> The issue is that the effective write bandwidth can be limited by the store buffer if you have multiple write streams.  IIRC at least some amd CPUs have to use two entries for stores crossing a cache line boundary.
>
Performance can just as well be limited by branch misprediction. You need to
show that the likely bottleneck is too many writes and not some other factor.
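For concreteness, the 512-byte store loop I mean above is roughly of this
shape (a simplified sketch, not the attached benchmark; the constant, buffer
size and iteration count are only illustrative):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdlib.h>

/* Store 512 bytes per call; dst is either 16-byte aligned or off by one. */
static void store512(char *dst)
{
    __m128i v = _mm_set1_epi8(42);
    for (int i = 0; i < 512; i += 16)
        _mm_storeu_si128((__m128i *)(dst + i), v);
}

int main(int argc, char **argv)
{
    char *buf = aligned_alloc(64, 4096);
    char *dst = buf + (argc > 1);   /* ./a.out   -> aligned destination
                                       ./a.out 1 -> misaligned by one byte */
    for (long i = 0; i < 50000000; i++) {
        store512(dst);
        __asm__ volatile ("" :: "r"(buf) : "memory");  /* keep stores live */
    }
    return 0;
}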
 
> Anyway, a look into the optimization manuals will tell you what to do and the cost model should follow these recommendations.
> 
These tend to be quite out of date; you typically need to recheck
everything.

Take the
Intel 64 and IA-32 Architectures Optimization Reference Manual
from April 2012.

A suggestion there on store-to-load forwarding is to align loads and stores
to make forwarding work (with P4- and Core 2-specific advice).

However, this has been false since Nehalem. When I test a variant of memcpy
that is unaligned by one byte, the code is the following (full benchmark attached):

set:
.LFB0:
	.cfi_startproc
	xor	%rdx, %rdx		# rdx = byte offset into both buffers
	addq	$1, %rsi		# misalign the source by one byte
	lea	144(%rsi), %rdi		# destination = source + 144
.L:
	movdqu	0(%rsi,%rdx), %xmm0
	movdqu	16(%rsi,%rdx), %xmm1
	...
	movdqu	112(%rsi,%rdx), %xmm7
	movdqu	%xmm0, 0(%rdi,%rdx)
	...
	movdqu	%xmm7, 112(%rdi,%rdx)
	addq	$128, %rdx		# 128 bytes copied per iteration
	cmp	$5120, %rdx
	jle	.L
	ret
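
The driver is nothing special; a rough sketch of it is below (the real
harness is in the attached test.c, and the buffer size and iteration count
here are only illustrative):

#include <stdlib.h>

void set(char *dst_unused, char *src);   /* the routine from set2.s above */

int main(void)
{
    char *buf = malloc(1 << 16);
    /* set() ignores its first argument, re-derives the destination as
       src + 1 + 144 and copies ~5 KB per call with unaligned 16-byte
       loads and stores, so the stores of one iteration overlap the
       loads of the next and hit the store-to-load forwarding path. */
    for (long i = 0; i < 2000000; i++)
        set(NULL, buf);
    return 0;
}
/* built and timed roughly as: gcc -O2 test.c set2.s && time ./a.out */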

Then there is around a 10% slowdown versus the non-forwarding one.

real	0m2.098s
user	0m2.083s
sys	0m0.003s

However, when I change the 144 in 'lea 144(%rsi), %rdi' to 143 or another
non-multiple of 16, performance degrades.

real	0m3.495s
user	0m3.480s
sys	0m0.000s

And the other suggestions are similarly flimsy unless your target is a Pentium 4.

> >Also what do you do with loops that contain no store? If I modify test
> >to
> >
> >int set(int *p, int *q){
> >  int i;
> >  int sum = 0;
> >  for (i=0; i < 128; i++)
> >     sum += 42 * p[i];
> >  return sum;
> >}
> >
> >then it still does aligning.
> 
> Because the cost model simply does not exist for the decision whether to peel or not. Patches welcome.
> 
> >There may be a threshold after which aligning the buffer makes sense;
> >then you need to show that the loop spends most of its time on sizes
> >beyond that threshold.
> >
> >Also, do you have data on how common store-buffer slowdowns are? Without
> >knowing that you risk making a few loops faster at the expense of the
> >majority, which could well slow the whole application down. It would not
> >surprise me, as these loops may be run mostly on L1 cache data (which is
> >about the same level of assumption as assuming that the increased code
> >size fits into the instruction cache).
> >
> >
> >Actually these questions could be answered by a test: first compile
> >SPEC2006 with vanilla gcc -O3, then with a gcc that contains a patch to
> >use unaligned loads. The results will tell whether peeling is also good
> >in practice or not.
> 
> It should not be an on or off decision but rather a decision based on a cost model.
> 
You cannot decide that from a cost model alone, as performance is determined
by the runtime usage pattern. If you do profiling then you can make that
decision. Alternatively, you can add a runtime branch that enables peeling
only above a preset size threshold.
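
A hand-written sketch of what such a versioned loop could look like (the
threshold and the vector width are illustrative, not what gcc emits):

#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1 for _mm_mullo_epi32; build with -msse4.1 */

#define PEEL_MIN 64      /* illustrative threshold, would need tuning */

int sum42(const int *p, int n)
{
    int sum = 0, i = 0;

    if (n >= PEEL_MIN) {
        /* peel scalar iterations until p + i is 16-byte aligned */
        while (i < n && ((uintptr_t)(p + i) & 15)) {
            sum += 42 * p[i];
            i++;
        }

        /* main body uses aligned vector loads */
        __m128i acc = _mm_setzero_si128();
        __m128i c42 = _mm_set1_epi32(42);
        for (; i + 4 <= n; i += 4) {
            __m128i v = _mm_load_si128((const __m128i *)(p + i));
            acc = _mm_add_epi32(acc, _mm_mullo_epi32(v, c42));
        }
        int t[4];
        _mm_storeu_si128((__m128i *)t, acc);
        sum += t[0] + t[1] + t[2] + t[3];
    }

    /* below the threshold (and for the epilogue) stay scalar: no peeling cost */
    for (; i < n; i++)
        sum += 42 * p[i];
    return sum;
}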

Attachment: test.c
Description: Text document

Attachment: set2.s
Description: Text document

