[Bug target/83008] New: [performance] Is it better to avoid extra instructions in data passing between loops?
sergey.shalnov at intel dot com
gcc-bugzilla@gcc.gnu.org
Wed Nov 15 15:54:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
Bug ID: 83008
Summary: [performance] Is it better to avoid extra instructions
in data passing between loops?
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: sergey.shalnov at intel dot com
Target Milestone: ---
I found strange code generated by GCC 8.0/7.x with the following command-line options:
-g -Ofast -march=skylake-avx512 -ftree-vectorize
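For reference, the diagnostics below can be reproduced with an invocation along these lines (the file name test.c is taken from the notes; printing the vectorizer notes additionally needs an opt-info flag such as -fopt-info-vec-all, or an equivalent -fdump-tree-vect-details dump):

gcc -g -Ofast -march=skylake-avx512 -ftree-vectorize -fopt-info-vec-all -S test.c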
Two loops are not vectorized.
The first one is not vectorized because:
test.c:6:23: note: cost model: the vector iteration cost = 1488 divided by the
scalar iteration cost = 328 is greater or equal to the vectorization factor =
4.
test.c:6:23: note: not vectorized: vectorization not profitable.
test.c:6:23: note: not vectorized: vector version will never be profitable.
The second one is not vectorized because:
test.c:20:23: note: step unknown.
test.c:20:23: note: reduction: not commutative/associative: sum_87 = (int) _61;
test.c:20:23: note: Unknown def-use cycle pattern.
test.c:20:23: note: Unsupported pattern.
test.c:20:23: note: not vectorized: unsupported use in stmt.
test.c:20:23: note: unexpected pattern.
If we look at the asm, we find a strange way of passing the data in the “tmp” array
between the loops:
…loop 1 body…
13f: 41 8d 04 13 lea (%r11,%rdx,1),%eax
143: c5 f9 6e d8 vmovd %eax,%xmm3
147: c5 e1 62 db vpunpckldq %xmm3,%xmm3,%xmm3
14b: c5 f9 62 c0 vpunpckldq %xmm0,%xmm0,%xmm0
14f: c5 f1 62 c9 vpunpckldq %xmm1,%xmm1,%xmm1
153: c5 e9 62 d2 vpunpckldq %xmm2,%xmm2,%xmm2
157: c5 e9 6c d2 vpunpcklqdq %xmm2,%xmm2,%xmm2
15b: c5 f1 6c c9 vpunpcklqdq %xmm1,%xmm1,%xmm1
15f: c5 f9 6c c0 vpunpcklqdq %xmm0,%xmm0,%xmm0
163: c5 e1 6c db vpunpcklqdq %xmm3,%xmm3,%xmm3
167: c4 e3 6d 38 c9 01 vinserti128 $0x1,%xmm1,%ymm2,%ymm1
16d: c4 e3 7d 38 c3 01 vinserti128 $0x1,%xmm3,%ymm0,%ymm0
173: 62 f3 f5 48 3a c0 01 vinserti64x4 $0x1,%ymm0,%zmm1,%zmm0
17a: 62 f1 fd 48 7f 44 24 vmovdqa64 %zmm0,-0x40(%rsp)
181: ff
182: 8b 54 24 e0 mov -0x20(%rsp),%edx
186: 03 54 24 f0 add -0x10(%rsp),%edx
…loop 2 body…
If I'm not mistaken, the algorithm looks like the following:
1. Do the first loop and keep the values in GPRs
2. Move these GPRs to XMMs
3. Pack these XMMs into YMMs
4. Pack these YMMs into a ZMM
5. Spill the ZMM to the stack
6. Load the values from the stack into GPRs for the second loop
Would it be better, from a performance perspective, to pass the values from the first
loop directly to the second loop in GPRs (without all these vector registers)?
The reproducer is:
 1  int test(unsigned char * input1, unsigned char * input2)
 2  {
 3      unsigned int tmp[4][4];
 4      unsigned int var0, var1, var2, var3;
 5      int sum = 0;
 6      for (int i = 0; i < 4; i++, input1 += 4, input2 += 4) {
 7          var0 = (input1[0] + input2[0]) + (input1[4] + input2[4]);
 8          var1 = (input1[1] + input2[1]) + (input1[5] + input2[5]);
 9          var2 = (input1[2] + input2[2]) + (input1[6] + input2[6]);
10          var3 = (input1[3] + input2[3]) + (input1[7] + input2[7]);
11          int inter0 = var0 + var1;
12          int inter1 = var0 + var1;
13          int inter2 = var2 + var3;
14          int inter3 = var2 + var3;
15          tmp[i][0] = inter0 + inter2;
16          tmp[i][2] = inter0 + inter2;
17          tmp[i][1] = inter1 + inter3;
18          tmp[i][3] = inter1 + inter3;
19      }
20      for (int i = 0; i < 4; i++) {
21          int inter0 = tmp[0][i] + tmp[1][i];
22          int inter1 = tmp[0][i] + tmp[1][i];
23          int inter2 = tmp[2][i] + tmp[3][i];
24          int inter3 = tmp[2][i] + tmp[3][i];
25          var0 = inter0 + inter2;
26          var2 = inter0 + inter2;
27          var1 = inter1 + inter3;
28          var3 = inter1 + inter3;
29          sum += var0 + var1 + var2 + var3;
30      }
31
32      return sum;
33  }
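To illustrate the question above, here is a rough source-level sketch of what "passing the data in GPRs" could mean. This is only my illustration, not code from the original testcase, and the name test_scalars is made up: it performs exactly the same arithmetic as the reproducer, but tmp[4][4] is replaced by sixteen scalar locals (t00..t33) and the loops are unrolled with helper macros, so nothing at the source level forces the intermediate values through memory. Whether the compiler then actually keeps them in GPRs (or SLP-vectorizes the scalars anyway) is a separate question; the sketch only shows the intended data flow.

int test_scalars(unsigned char * input1, unsigned char * input2)
{
    /* One scalar per tmp[r][c]; t00..t33 take the place of tmp[4][4]. */
    unsigned int t00, t01, t02, t03, t10, t11, t12, t13;
    unsigned int t20, t21, t22, t23, t30, t31, t32, t33;
    int sum = 0;

    /* Same arithmetic as iteration r of the first loop. */
#define ROW(r) do { \
        unsigned char *a = input1 + 4 * (r), *b = input2 + 4 * (r); \
        unsigned int var0 = (a[0] + b[0]) + (a[4] + b[4]); \
        unsigned int var1 = (a[1] + b[1]) + (a[5] + b[5]); \
        unsigned int var2 = (a[2] + b[2]) + (a[6] + b[6]); \
        unsigned int var3 = (a[3] + b[3]) + (a[7] + b[7]); \
        int inter0 = var0 + var1, inter1 = var0 + var1; \
        int inter2 = var2 + var3, inter3 = var2 + var3; \
        t##r##0 = inter0 + inter2; t##r##2 = inter0 + inter2; \
        t##r##1 = inter1 + inter3; t##r##3 = inter1 + inter3; \
    } while (0)

    ROW(0); ROW(1); ROW(2); ROW(3);
#undef ROW

    /* Same arithmetic as iteration c of the second loop. */
#define COL(c) do { \
        int inter0 = t0##c + t1##c, inter1 = t0##c + t1##c; \
        int inter2 = t2##c + t3##c, inter3 = t2##c + t3##c; \
        unsigned int var0 = inter0 + inter2, var2 = inter0 + inter2; \
        unsigned int var1 = inter1 + inter3, var3 = inter1 + inter3; \
        sum += var0 + var1 + var2 + var3; \
    } while (0)

    COL(0); COL(1); COL(2); COL(3);
#undef COL

    return sum;
}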
Sergey