void test(double data[4][4])
{
  for (int i = 0; i < 4; i++) {
    for (int j = i; j < 4; j += 2) {
      data[i][j] = data[i][j] * data[i][j];
      data[i][j+1] = data[i][j+1] * data[i][j+1];
    }
  }
}

gcc creates this:

test(double (*) [4]):
        vmovsd  xmm0, QWORD PTR [rdi]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+8]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+8], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+16]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+16], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+24]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+24], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+40]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+40], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+48]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+48], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+56]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+56], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+64]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+64], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+80]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+80], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+88]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+88], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+120]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+120], xmm0
        vmovsd  xmm0, QWORD PTR [rdi+128]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rdi+128], xmm0
        ret

clang detects that it is possible to use packed operations instead of scalar
ones and produces the code below. Please implement a similar optimization in
gcc too.

test(double (*) [4]):                    # @test(double (*) [4])
        vmovupd xmm0, xmmword ptr [rdi]
        vmovupd xmm1, xmmword ptr [rdi + 16]
        vmovupd xmm2, xmmword ptr [rdi + 40]
        vmovupd xmm3, xmmword ptr [rdi + 56]
        vmulpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi], xmm0
        vmulpd  xmm0, xmm1, xmm1
        vmovupd xmmword ptr [rdi + 16], xmm0
        vmulpd  xmm0, xmm2, xmm2
        vmovupd xmmword ptr [rdi + 40], xmm0
        vmulpd  xmm0, xmm3, xmm3
        vmovupd xmmword ptr [rdi + 56], xmm0
        vmovupd xmm0, xmmword ptr [rdi + 80]
        vmulpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi + 80], xmm0
        vmovupd xmm0, xmmword ptr [rdi + 120]
        vmulpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi + 120], xmm0
        ret
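For reference, a minimal source-level sketch of what each vmovupd/vmulpd/vmovupd
triple in the clang output does, written with SSE intrinsics. This is an
illustration only, not gcc output; the helper name square_pair is made up.

#include <immintrin.h>

/* Squares the adjacent pair p[0], p[1] with one packed multiply, which is
   what a single vmovupd/vmulpd/vmovupd triple above does.  Hand-written
   sketch for illustration only.  */
static inline void square_pair(double *p)
{
  __m128d v = _mm_loadu_pd(p);   /* load p[0] and p[1]          */
  v = _mm_mul_pd(v, v);          /* square both lanes at once   */
  _mm_storeu_pd(p, v);           /* store the pair back         */
}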
This was compiled with -O3 -mavx -ftree-vectorize

After sending this I noticed that I wrote the inner loop incorrectly; I meant
the one below. Anyway, it is also not optimized:

for (int j = 0; j < i; j += 4)

I also checked code which could be optimized using operations on YMM registers:

void test(double data[8][8])
{
  for (int i = 0; i < 8; i++) {
    for (int j = 0; j < i; j += 4) {
      data[i][j] *= data[i][j];
      data[i][j+1] *= data[i][j+1];
      data[i][j+2] *= data[i][j+2];
      data[i][j+3] *= data[i][j+3];
    }
  }
}

gcc output is, hmm, interesting:

test(double (*) [8]):
        vmovupd xmm0, XMMWORD PTR [rdi+64]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+80], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+64], xmm0
        vextractf128 XMMWORD PTR [rdi+80], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+128]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+144], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+128], xmm0
        vextractf128 XMMWORD PTR [rdi+144], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+192]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+208], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+192], xmm0
        vextractf128 XMMWORD PTR [rdi+208], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+256]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+272], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+256], xmm0
        vextractf128 XMMWORD PTR [rdi+272], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+320]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+336], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+320], xmm0
        vextractf128 XMMWORD PTR [rdi+336], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+352]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+368], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+352], xmm0
        vextractf128 XMMWORD PTR [rdi+368], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+384]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+400], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+384], xmm0
        vextractf128 XMMWORD PTR [rdi+400], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+416]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+432], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+416], xmm0
        vextractf128 XMMWORD PTR [rdi+432], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+448]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+464], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+448], xmm0
        vextractf128 XMMWORD PTR [rdi+464], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+480]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+496], 0x1
        vmulpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+480], xmm0
        vextractf128 XMMWORD PTR [rdi+496], ymm0, 0x1
        vzeroupper
        ret
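For comparison, a sketch of the YMM form the 8x8 loop could map to, written
with AVX intrinsics and mirroring the scalar loop bounds above. Illustration
only, not compiler output; the function name test_ymm is made up.

#include <immintrin.h>

/* Same access pattern as the scalar 8x8 loop above, but each inner iteration
   handled by one 256-bit multiply instead of the vinsertf128/vextractf128
   sequences gcc emits.  Hand-written sketch for illustration only.  */
void test_ymm(double data[8][8])
{
  for (int i = 0; i < 8; i++) {
    for (int j = 0; j < i; j += 4) {
      __m256d v = _mm256_loadu_pd(&data[i][j]);  /* data[i][j..j+3] */
      v = _mm256_mul_pd(v, v);
      _mm256_storeu_pd(&data[i][j], v);
    }
  }
}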
with += 4 the inner loop doesn't iterate (its body runs only once per outer
iteration), so it's effectively

void test(double data[4][4])
{
  for (int i = 0; i < 4; i++) {
    data[i][i] = data[i][i] * data[i][i];
    data[i][i+1] = data[i][i+1] * data[i][i+1];
  }
}

we fail to SLP here because we get confused by the computed group size of 5,
as there's a gap of three elements between the first stores of each iteration
(see the sketch below). When later doing BB vectorization we fail to analyze
dependences, likely because we do not analyze refs as thoroughly as we do with
loops.

For your second example we fail to loop vectorize because we completely peel
the inner loop in cunrolli, leaving control flow inside the loop... I have a
patch for that one.
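To make the group size of 5 concrete, here is the same effective code viewed
through a flat pointer. This is only an illustration of the access pattern,
not GCC internals; the function name test_flat is made up.

/* Each outer iteration stores elements 5*i and 5*i + 1 of the flattened
   array: two consecutive stores, then a gap of three elements before the
   next pair, which is where the computed group size of 5 comes from.
   (For i == 3 the second store runs one element past the 4x4 array, just
   like the original code.)  */
void test_flat(double data[4][4])
{
  double *p = &data[0][0];
  for (int i = 0; i < 4; i++) {
    p[5*i]     = p[5*i]     * p[5*i];      /* data[i][i]   */
    p[5*i + 1] = p[5*i + 1] * p[5*i + 1];  /* data[i][i+1] */
  }
}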
For the other case the issue is, I think, that the SLP instance group size is
not the number of scalar stmts but is somehow set to the group-size. Changing
that has quite some ripple-down effects though. -> GCC 9.
One more case. The code has to process the diagonal half of a matrix and uses
SSE intrinsics - see test1() below. When n is constant, like in test2() below,
gcc unrolls the loops. However, one more transform could be performed: replace
pairs of SSE instructions with one AVX instruction.

#include <stdint.h>
#include "immintrin.h"

void test1(double data[100][100], unsigned int n)
{
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j += 2) {
      __m128d v = _mm_loadu_pd(&data[i][j]);
      v = _mm_mul_pd(v, v);
      _mm_storeu_pd(&data[i][j], v);
    }
  }
}

void test2(double data[100][100])
{
  const unsigned int n = 6;
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j += 2) {
      __m128d v = _mm_loadu_pd(&data[i][j]);
      v = _mm_mul_pd(v, v);
      _mm_storeu_pd(&data[i][j], v);
    }
  }
}
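As an illustration of the requested transform: merging the two 128-bit
sequences that cover data[i][0..3] in the unrolled test2() body into one
256-bit sequence would look like the sketch below. This is hand-written, not
something gcc produces, and the helper name square4 is made up.

#include <immintrin.h>

/* Replaces a pair of SSE load/mul/store sequences touching four consecutive
   doubles p[0..3] with one AVX sequence.  Only applicable where the unrolled
   body really processes four consecutive elements (rows i >= 3 in test2).
   Hand-written sketch for illustration only.  */
static inline void square4(double *p)
{
  __m256d v = _mm256_loadu_pd(p);  /* unaligned load of p[0..3] */
  v = _mm256_mul_pd(v, v);         /* square all four lanes     */
  _mm256_storeu_pd(p, v);
}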
Author: rguenth
Date: Wed Nov 29 14:38:06 2017
New Revision: 255233

URL: https://gcc.gnu.org/viewcvs?rev=255233&root=gcc&view=rev
Log:
2017-11-29  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/83202
        * tree-vect-slp.c (scalar_stmts_set_t): New typedef.
        (bst_fail): Use it.
        (vect_analyze_slp_cost_1): Add visited set, do not account
        SLP nodes vectorized to the same stmts multiple times.
        (vect_analyze_slp_cost): Allocate a visited set and pass it down.
        (vect_analyze_slp_instance): Adjust.
        (scalar_stmts_to_slp_tree_map_t): New typedef.
        (vect_schedule_slp_instance): Add a map recording the SLP node
        representing the vectorized stmts for a set of scalar stmts.
        Avoid code-generating redundancies.
        (vect_schedule_slp): Allocate map and pass it down.

        * gcc.dg/vect/costmodel/x86_64/costmodel-pr83202.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr83202.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c
There are multiple issues reflected in this bug. The last commit addressed the SLP cost model thing (not fixing any testcase on its own).
On Wed, 29 Nov 2017, bugzilla@poradnik-webmastera.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202
>
> --- Comment #4 from Daniel Fruzynski <bugzilla@poradnik-webmastera.com> ---
> One more case. Code has to process diagonal half of matrix and uses SSE
> intrinsics - see test1() below. When n is constant like in test2() below,
> gcc unrolls loops. However one more transform could be performed, replace
> pairs of SSE instructions with one AVX one.

GCC currently does not "vectorize" already vectorized code, so this is a much
farther away "goal", apart from eventually pattern-matching some very simple
cases.
Author: rguenth
Date: Thu Nov 30 07:53:31 2017
New Revision: 255267

URL: https://gcc.gnu.org/viewcvs?rev=255267&root=gcc&view=rev
Log:
2017-11-30  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/83202
        * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Add
        allow_peel argument and guard peeling.
        (canonicalize_loop_induction_variables): Likewise.
        (canonicalize_induction_variables): Pass false.
        (tree_unroll_loops_completely_1): Pass unroll_outer to disallow
        peeling from cunrolli.

        * gcc.dg/vect/pr83202-1.c: New testcase.
        * gcc.dg/tree-ssa/pr61743-1.c: Adjust.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr83202-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1.c
    trunk/gcc/tree-ssa-loop-ivcanon.c
The last commit fixed the testcase in comment #1.
The comment #4 case is something completely different. If it's really
interesting to re-vectorize already vectorized code, please file a different
bug. The other testcases seem to work fine for me now.