This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> Also keep in mind that usually costs go up significantly if
> misalignment causes cache line splits (processor will fetch 2 lines).
> There are non-linear costs of filling up the store queue in modern
> out-of-order processors (x86). Bottom line is that it's much better to
> peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
> line boundaries otherwise. The solution is to either actually always
> peel for alignment, or insert an additional check for cache line
> boundaries (for high trip count loops).

That is quite a bold claim; do you have a benchmark to support it? Since
Nehalem there is no overhead for unaligned SSE loads beyond fetching the
cache lines, and on Haswell AVX2 loads behave in a similar way.

You are forgetting that the loop needs both cache lines when it issues an
unaligned load, which will generally take the maximum of the times needed
to access those lines. With peeling you access the first cache line, and
only afterwards, in the loop, the second, effectively doubling the running
time when both lines were in main memory.

You also need to consider all the factors, not just that one factor is
expensive. Several factors are in play; the cost of branch misprediction is
the main argument against peeling, so you need to show that the cost of
unaligned loads is bigger than the cost of the branch mispredictions of a
peeled implementation.

As a quick example of why peeling is generally a bad idea I did a simple
benchmark. Could somebody with a Haswell also test the attached code
generated by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?

For the test we repeatedly call a function set with a pointer randomly
picked from 262144 bytes to stress the L2 cache; the relevant part of the
tester is the following (file test.c):

  for (i = 0; i < 100000000; i++){
    set (ptr + 64 * (p % (SIZE / 64) + 60), ptr2 + 64 * (q % (SIZE / 64) + 60));

First vectorize the following function. The vectorizer here does peeling
(the assembly is a bit long, see file set1.s):

void set(int *p, int *q){
  int i;
  for (i = 0; i < 128; i++)
    p[i] = 42 * p[i];
}

When I ran it I got:

$ gcc -O3 -DSIZE= test.c
$ gcc test.o set1.s
$ time ./a.out

real    0m3.724s
user    0m3.724s
sys     0m0.000s

Now what happens if we use separate input and output arrays? The gcc
vectorizer fortunately does not peel in this case (file set2.s), which
gives better performance:

void set(int *p, int *q){
  int i;
  for (i = 0; i < 128; i++)
    p[i] = 42 * q[i];
}

$ gcc test.o set2.s
$ time ./a.out

real    0m3.169s
user    0m3.170s
sys     0m0.000s

The speedup here can be partially explained by the fact that in-place
modification runs slower. To eliminate this possibility we change the
assembly to make the input the same as the output (file set3.s):

    jb      .L15
 .L7:
        xorl    %eax, %eax
+       movq    %rdi, %rsi
        .p2align 4,,10
        .p2align 3
 .L5:

$ gcc test.o set3.s
$ time ./a.out

real    0m3.169s
user    0m3.170s
sys     0m0.000s

which is still faster than what the peeling vectorizer generated. And in
this test the alignment is constant, so branch misprediction is not even
an issue.
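To make the comparison concrete in source form, here is a hand-written
sketch of the two strategies (this is not the actual vectorizer output in
set1.s/set2.s; the function names are mine, and it assumes AVX2 with
32-bit ints and int-aligned pointers):

#include <immintrin.h>
#include <stdint.h>

/* No peeling: just issue unaligned loads/stores.  Since Nehalem these cost
   the same as aligned ones unless they split a cache line.  */
void set_unpeeled (int *p, int n)
{
  __m256i c = _mm256_set1_epi32 (42);
  int i;
  for (i = 0; i + 8 <= n; i += 8)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) (p + i));
      _mm256_storeu_si256 ((__m256i *) (p + i), _mm256_mullo_epi32 (v, c));
    }
  for (; i < n; i++)   /* scalar tail */
    p[i] = 42 * p[i];
}

/* Peeling for alignment: a scalar prologue runs until p is 32-byte aligned,
   then the main loop uses aligned accesses.  The prologue trip count depends
   on the incoming pointer, so its exit branch is data dependent and can
   mispredict whenever the alignment varies between calls.  */
void set_peeled (int *p, int n)
{
  __m256i c = _mm256_set1_epi32 (42);
  int i = 0;
  while (i < n && ((uintptr_t) (p + i) & 31) != 0)   /* peeled prologue */
    {
      p[i] = 42 * p[i];
      i++;
    }
  for (; i + 8 <= n; i += 8)                         /* aligned body */
    {
      __m256i v = _mm256_load_si256 ((const __m256i *) (p + i));
      _mm256_store_si256 ((__m256i *) (p + i), _mm256_mullo_epi32 (v, c));
    }
  for (; i < n; i++)                                 /* scalar tail */
    p[i] = 42 * p[i];
}

The branch structure of set_peeled is what a peeled version pays for on
every call, while set_unpeeled only risks the occasional cache-line split
on its loads and stores.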
Attachment: test.c (text document)
Attachment: set1.s (text document)
Attachment: set2.s (text document)
Attachment: set3.s (text document)
Attachment: set1_avx2.s (text document)
Attachment: set3_avx2.s (text document)