This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Vectorization: Loop peeling with misaligned support.

On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> Also keep in mind that usually costs go up significantly if
> misalignment causes cache line splits (processor will fetch 2 lines).
> There are non-linear costs of filling up the store queue in modern
> out-of-order processors (x86). Bottom line is that it's much better to
> peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
> line boundaries otherwise. The solution is to either actually always
> peel for alignment, or insert an additional check for cache line
> boundaries (for high trip count loops).

That is quite bold claim do you have a benchmark to support that?

Since nehalem there is no overhead of unaligned sse loads except of fetching
cache lines. As haswell avx2 loads behave in similar way.

You are forgetting that loop needs both cache lines when it issues
unaligned load. This will generaly take maximum of times needed to
access these lines. Now with peeling you accesss first cache line, and
after that in loop access the second, effectively doubling running time
when both lines were in main memory.

You also need to compute all factors not just that one factor is
expensive. There are several factor in plays, cost of branch
misprediction is main argument againist doing peeling, so you need to
show that cost of unaligned loads is bigger than cost of branch
misprediction of a peeled implementation.

As a quick example why peeling is generaly bad idea I did a simple
benchmark. Could somebody with haswell also test attached code generated
by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?

For the test we repeately call a function set with a pointer randomly
picked from 262144 bytes to stress a L2 cache, relevant tester 
is following (file test.c)

for (i=0;i<100000000;i++){
     set (ptr + 64 * (p % (SIZE /64) + 60), ptr2 + 64 * (q % (SIZE /64) + 60));

First vectorize by following function. A vectorizer here does
peeling (assembly is bit long, see file set1.s)

void set(int *p, int *q){
  int i;
  for (i=0; i<128; i++)
     p[i] = 42 * p[i];

When ran it I got

$ gcc -O3 -DSIZE= test.c
$ gcc test.o set1.s
$ time ./a.out

real	0m3.724s
user	0m3.724s
sys	0m0.000s

Now what happens if we use separate input and output arrays? A gcc
vectorizer fortunately does not peel in this case (file set2.s) which
gives better performance

void set(int *p, int *q){
  int i;
  for (i=0; i<128; i++)
     p[i] = 42 * q[i];

$ gcc test.o set2.s
$ time ./a.out

real	0m3.169s
user	0m3.170s
sys	0m0.000s

A speedup here is can be partialy explained by fact that inplace
modifications run slower. To eliminate this possibility we change
assembly to make input same as output (file set3.s)

 	jb	.L15
 	xorl	%eax, %eax
+	movq	%rdi, %rsi
 	.p2align 4,,10
 	.p2align 3

$ gcc test.o set3.s
$ time ./a.out

real	0m3.169s
user	0m3.170s
sys	0m0.000s

Which is still faster than what peeling vectorizer generated.

And in this test I did not alignment is constant so branch misprediction
is not a issue.

Attachment: test.c
Description: Text document

Attachment: set1.s
Description: Text document

Attachment: set2.s
Description: Text document

Attachment: set3.s
Description: Text document

Attachment: set1_avx2.s
Description: Text document

Attachment: set3_avx2.s
Description: Text document

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]