Re: Vectorization: Loop peeling with misaligned support.
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>, Hendrik Greving <hendrik dot greving dot intel at gmail dot com>
- Cc: Bingfeng Mei <bmei at broadcom dot com>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Sat, 16 Nov 2013 11:37:36 +0100
- Subject: Re: Vectorization: Loop peeling with misaligned support.
"OndÅej BÃlka" <neleai@seznam.cz> wrote:
>On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
>> Also keep in mind that usually costs go up significantly if
>> misalignment causes cache line splits (the processor will fetch 2 lines).
>> There are non-linear costs of filling up the store queue in modern
>> out-of-order processors (x86). The bottom line is that it's much better
>> to peel e.g. for AVX2/AVX3 if the loop would otherwise cause loads that
>> cross cache line boundaries. The solution is to either actually always
>> peel for alignment, or to insert an additional check for cache line
>> boundaries (for high trip count loops).
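
To make the "peel for alignment" option concrete, the prologue the vectorizer emits has roughly the following shape in C; the function name, the 32-byte (AVX2 vector) alignment target, and the explicit length parameter are illustrative assumptions, not code from this thread:

#include <stdint.h>

/* Rough sketch: run scalar iterations until the pointer reaches a
   32-byte boundary, so the vectorized main loop never issues an
   access that splits a cache line.  */
void set_peeled (int *p, int n)
{
  int i = 0;

  /* Scalar prologue: peel until p + i is 32-byte aligned.  */
  while (i < n && ((uintptr_t) (p + i) & 31) != 0)
    {
      p[i] = 42 * p[i];
      i++;
    }

  /* Main loop: the compiler can now use aligned vector accesses.  */
  for (; i < n; i++)
    p[i] = 42 * p[i];
}

The additional run-time check mentioned above would presumably sit in front of such a prologue and branch to an unpeeled copy of the loop when the trip count is low or the accesses already stay within cache lines.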
>
>That is quite a bold claim; do you have a benchmark to support it?
>
>Since Nehalem there has been no overhead for unaligned SSE loads except for
>fetching the cache lines. Since Haswell, AVX2 loads behave in a similar way.
>
>You are forgetting that the loop needs both cache lines when it issues an
>unaligned load. That will generally take the maximum of the times needed to
>access those lines. With peeling you access the first cache line, and
>only afterwards, in the loop, the second, effectively doubling the running
>time when both lines were in main memory.
>
>You also need to account for all factors, not just show that one factor is
>expensive. There are several factors in play; the cost of branch
>misprediction is the main argument against doing peeling, so you need to
>show that the cost of unaligned loads is bigger than the cost of branch
>misprediction in a peeled implementation.
>
>As a quick example of why peeling is generally a bad idea I did a simple
>benchmark. Could somebody with Haswell also test the attached code
>generated by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?
>
>For the test we repeatedly call a function set with a pointer randomly
>picked from 262144 bytes to stress the L2 cache; the relevant tester
>is the following (file test.c):
>
>for (i = 0; i < 100000000; i++) {
>  set (ptr + 64 * (p % (SIZE / 64) + 60), ptr2 + 64 * (q % (SIZE / 64) + 60));
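
Since only this inner call from test.c is quoted, a self-contained harness along those lines could look roughly as follows; the buffer setup, the way p and q advance, and the concrete SIZE value are assumptions for illustration, not taken from the attachment:

#include <stdlib.h>

#ifndef SIZE
#define SIZE 262144   /* assumed: the 262144 bytes mentioned above */
#endif

void set (int *p, int *q);   /* provided by set1.s / set2.s / set3.s */

int main (void)
{
  /* Assumed buffer setup: leave enough slack past SIZE for the +60
     line offset and the 128 ints that set() touches.  */
  char *ptr  = malloc (SIZE + 64 * 128);
  char *ptr2 = malloc (SIZE + 64 * 128);
  unsigned long i, p = 0, q = 0;

  for (i = 0; i < 100000000; i++)
    {
      set ((int *) (ptr  + 64 * (p % (SIZE / 64) + 60)),
           (int *) (ptr2 + 64 * (q % (SIZE / 64) + 60)));
      /* Assumed: advance p and q with simple LCGs so successive calls
         land on pseudo-random cache lines inside the SIZE-byte window.  */
      p = p * 1103515245UL + 12345UL;
      q = q * 69069UL + 1UL;
    }

  free (ptr);
  free (ptr2);
  return 0;
}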
>
>First we vectorize the following function. The vectorizer does peeling
>here (the assembly is a bit long, see file set1.s):
>
>void set(int *p, int *q){
> int i;
> for (i=0; i<128; i++)
> p[i] = 42 * p[i];
>}
>
>When I ran it I got:
>
>$ gcc -O3 -c -DSIZE= test.c
>$ gcc test.o set1.s
>$ time ./a.out
>
>real 0m3.724s
>user 0m3.724s
>sys 0m0.000s
>
>Now what happens if we use separate input and output arrays? The GCC
>vectorizer fortunately does not peel in this case (file set2.s), which
>gives better performance:
>
>void set(int *p, int *q){
> int i;
> for (i=0; i<128; i++)
> p[i] = 42 * q[i];
>}
>
>$ gcc test.o set2.s
>$ time ./a.out
>
>real 0m3.169s
>user 0m3.170s
>sys 0m0.000s
>
>
>The speedup here can be partially explained by the fact that in-place
>modification runs slower. To eliminate this possibility we change the
>assembly to make the input the same as the output (file set3.s):
>
> jb .L15
> .L7:
> xorl %eax, %eax
>+ movq %rdi, %rsi
> .p2align 4,,10
> .p2align 3
> .L5:
>
>$ gcc test.o set3.s
>$ time ./a.out
>
>real 0m3.169s
>user 0m3.170s
>sys 0m0.000s
>
>This is still faster than what the peeling vectorizer generated.
>
>And note that in this test the alignment is constant, so branch
>misprediction is not an issue.
IIRC what can still be seen are store-buffer related slowdowns when you have a big unaligned store or load in your loop. Thus aligning stores still paid off the last time I measured this.
Richard.
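
As a rough illustration of what "aligning the stores" can mean at the source level (a sketch under assumptions, not code from this thread: the function name, the AVX2 intrinsics, and the explicit length parameter are all illustrative), the loads stay unaligned while a short scalar prologue brings the store pointer to a 32-byte boundary so the vector stores never split a cache line:

#include <immintrin.h>
#include <stdint.h>

void set_aligned_stores (int *p, const int *q, int n)
{
  const __m256i k = _mm256_set1_epi32 (42);
  int i = 0;

  /* Peel scalar iterations until the store address is 32-byte aligned.  */
  for (; i < n && ((uintptr_t) (p + i) & 31) != 0; i++)
    p[i] = 42 * q[i];

  /* Aligned vector stores; the loads may still be unaligned.  */
  for (; i + 8 <= n; i += 8)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) (q + i));
      _mm256_store_si256 ((__m256i *) (p + i), _mm256_mullo_epi32 (v, k));
    }

  /* Scalar epilogue for the remaining elements.  */
  for (; i < n; i++)
    p[i] = 42 * q[i];
}

Compiled with -mavx2, this keeps every vector store within a single cache line even when p and q have different misalignments, which is the situation the store-buffer remark above is concerned with.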