Vectorizer fails to handle this:
------------------------------------------------------------
#define NUMPOINTS 50000

#define align(x) __attribute__((align(x)))

typedef float align(16) MATRIX[3][3];

static float points[NUMPOINTS][4];
static align(16) float opoints[NUMPOINTS][4];
static bool flags[NUMPOINTS];
static MATRIX gmatrix;

void RotateVectors (void)
{
  int i, r;

  for (r = 0; r < 4; r++)
    {
      for (i = 0; i < NUMPOINTS; i++)
        {
          opoints[i][0] = gmatrix[0][0] * points[i][0]
                        + gmatrix[0][1] * points[i][1]
                        + gmatrix[0][2] * points[i][2];
          opoints[i][1] = gmatrix[1][0] * points[i][0]
                        + gmatrix[1][1] * points[i][1]
                        + gmatrix[1][2] * points[i][2];
          opoints[i][2] = gmatrix[2][0] * points[i][0]
                        + gmatrix[2][1] * points[i][1]
                        + gmatrix[2][2] * points[i][2];
          flags[i] = true;
        }
    }
}
------------------------------------------------------------

loop at bench.cc:52: not vectorized: complicated access pattern.
loop at bench.cc:52: bad data access.

Confirmed. ICC can do this but does not, because it is not very efficient to do so.

t.c:20: note: not vectorized: mixed data-types
t.c:20: note: can't determine vectorization factor.

Removing flags[i] = true; we get:

t.c:20: note: not consecutive access
t.c:20: note: not vectorized: complicated access pattern.

> t.c:20: note: not vectorized: mixed data-types
> t.c:20: note: can't determine vectorization factor.
>
> Removing flags[i] = true;

Multiple data-types vectorization is already supported in the autovect
branch, and the patches for mainline (starting from
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00941.html) will be committed
as soon as 4.3 is open.

> we get:
> t.c:20: note: not consecutive access
> t.c:20: note: not vectorized: complicated access pattern.

Vectorization of strided accesses is also already implemented in the
autovect branch (and will be committed to the mainline 4.3). However, this
case contains stores with gaps (stores to opoints[i][0], opoints[i][1], and
opoints[i][2], without a store to opoints[i][3]), and only loads with gaps
are currently supported. Therefore, this loop will be vectorizable in the
autovect branch (and soon in the mainline 4.3) if a store to opoints[i][3]
is added.

Ira
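For concreteness, a minimal sketch of the workaround Ira describes (my reconstruction, not code from the report), assuming the fourth component of opoints[i] is otherwise unused so it can be filled with a dummy value:

------------------------------------------------------------
      for (i = 0; i < NUMPOINTS; i++)
        {
          opoints[i][0] = gmatrix[0][0] * points[i][0]
                        + gmatrix[0][1] * points[i][1]
                        + gmatrix[0][2] * points[i][2];
          opoints[i][1] = gmatrix[1][0] * points[i][0]
                        + gmatrix[1][1] * points[i][1]
                        + gmatrix[1][2] * points[i][2];
          opoints[i][2] = gmatrix[2][0] * points[i][0]
                        + gmatrix[2][1] * points[i][1]
                        + gmatrix[2][2] * points[i][2];
          opoints[i][3] = 0.0f;  /* dummy store filling the gap, so all four
                                    consecutive floats of opoints[i] are written */
          flags[i] = true;
        }
------------------------------------------------------------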

Thanks, Ira. What about stores with gaps?

On the todo list. BTW, vectorization of strided accesses was committed to
the mainline 4.3.

Ira

Still not vectorized in recent GCC:

t.c:20: note: not vectorized: complicated access pattern.
t.c:22: note: not vectorized: complicated access pattern.

  1  typedef unsigned int bool;
  2  #define true 1
  3
  4  #define NUMPOINTS 50000
  5
  6  #define align(x) __attribute__((align(x)))
  7
  8  typedef float align(16) MATRIX[3][3];
  9
 10  static float points[NUMPOINTS][4];
 11  static align(16) float opoints[NUMPOINTS][4];
 12  static bool flags[NUMPOINTS];
 13  static MATRIX gmatrix;
 14
 15
 16  void RotateVectors (void)
 17  {
 18    int i, r;
 19
 20    for (r = 0; r < 4; r++)
 21      {
 22        for (i = 0; i < NUMPOINTS; i++)
 23          {
 24            opoints[i][0] = gmatrix[0][0] * points[i][0]
 25                          + gmatrix[0][1] * points[i][1]
 26                          + gmatrix[0][2] * points[i][2];
 27            opoints[i][1] = gmatrix[1][0] * points[i][0]
 28                          + gmatrix[1][1] * points[i][1]
 29                          + gmatrix[1][2] * points[i][2];
 30            opoints[i][2] = gmatrix[2][0] * points[i][0]
 31                          + gmatrix[2][1] * points[i][1]
 32                          + gmatrix[2][2] * points[i][2];
 33            flags[i] = true;
 34          }
 35      }
 36  }
 37

"GCC: (GNU) 4.6.0 20110312 (experimental) [trunk revision 170907]"

Link to vectorizer missed-optimization meta-bug.

The issue is that we cannot use a vector v4sf store to &opoints[i][0], as opoints[i][3] is not stored to. Such a "masked" store (or "interleaved store with gaps") is not supported by SLP.
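For illustration only, here is roughly what such a masked v4sf store could look like if written by hand with x86 AVX intrinsics (a sketch assuming AVX is available; the helper name store_xyz is made up, and this is not what GCC emits). Lanes 0..2 are written and lane 3, the gap, is left untouched in memory:

------------------------------------------------------------
#include <immintrin.h>

/* Store only the first three lanes of a v4sf value.  A lane is written
   when the high bit of the corresponding mask element is set, so lane 3
   is never touched.  Requires AVX (-mavx).  */
static inline void
store_xyz (float *dst, __m128 v)
{
  const __m128i mask = _mm_set_epi32 (0, -1, -1, -1);
  _mm_maskstore_ps (dst, mask, v);
}
------------------------------------------------------------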

I've looked into another case where inability to handle stores with gaps
generates sub-optimal code. I'm interested in spending some time on fixing
this, provided some guidance on the vectorizer.

Is it substantially more difficult to handle stores with gaps compared to
loads with gaps?

The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
which is the #2 function in the 462.libquantum profile. This cycle accounts
for about 25% of total 462.libquantum time.

===
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
===

This loop vectorizes into

<bb 5>:
  # vectp.8_39 = PHI <vectp.8_40(6), vectp.9_38(4)>
  vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
  vect__5.11_41 = vect_array.10[0];
  vect__5.12_42 = vect_array.10[1];
  vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
  _48 = BIT_FIELD_REF <vect__7.13_44, 64, 0>;
  MEM[(long long unsigned int *)ivtmp_45] = _48;
  ivtmp_50 = ivtmp_45 + 16;
  _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
  MEM[(long long unsigned int *)ivtmp_50] = _51;

which then becomes for aarch64:

.L4:
        ld2     {v0.2d - v1.2d}, [x1]
        add     w2, w2, 1
        cmp     w2, w7
        eor     v0.16b, v2.16b, v0.16b
        umov    x4, v0.d[1]
        st1     {v0.d}[0], [x1]
        add     x1, x1, 32
        str     x4, [x1, -16]
        bcc     .L4

(In reply to Maxim Kuvyrkov from comment #9)
> which then becomes for aarch64:
>
> .L4:
>         ld2     {v0.2d - v1.2d}, [x1]
>         add     w2, w2, 1
>         cmp     w2, w7
>         eor     v0.16b, v2.16b, v0.16b
>         umov    x4, v0.d[1]
>         st1     {v0.d}[0], [x1]
>         add     x1, x1, 32
>         str     x4, [x1, -16]
>         bcc     .L4

IIUC,
        umov    x4, v0.d[1]
        st1     {v0.d}[0], [x1]
        str     x4, [x1, -16]
could become just
        st2     {v0.2d - v1.2d}, [x1]

(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code. I'm interested in spending some time on fixing
> this, provided some guidance on the vectorizer.
>
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?
>
> The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
> which is the #2 function in the 462.libquantum profile. This cycle accounts
> for about 25% of total 462.libquantum time.
>
> ===
> struct node_struct
> {
>   float _Complex gap;
>   unsigned long long state;
> };
>
> struct reg_struct
> {
>   int size;
>   struct node_struct *node;
> };
>
> void
> func(int target, struct reg_struct *reg)
> {
>   int i;
>
>   for(i=0; i<reg->size; i++)
>     reg->node[i].state ^= ((unsigned long long) 1 << target);
> }
> ===
>
> This loop vectorizes into
>
> <bb 5>:
>   # vectp.8_39 = PHI <vectp.8_40(6), vectp.9_38(4)>
>   vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
>   vect__5.11_41 = vect_array.10[0];
>   vect__5.12_42 = vect_array.10[1];
>   vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
>   _48 = BIT_FIELD_REF <vect__7.13_44, 64, 0>;
>   MEM[(long long unsigned int *)ivtmp_45] = _48;
>   ivtmp_50 = ivtmp_45 + 16;
>   _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
>   MEM[(long long unsigned int *)ivtmp_50] = _51;
>
> which then becomes for aarch64:
>
> .L4:
>         ld2     {v0.2d - v1.2d}, [x1]
>         add     w2, w2, 1
>         cmp     w2, w7
>         eor     v0.16b, v2.16b, v0.16b
>         umov    x4, v0.d[1]
>         st1     {v0.d}[0], [x1]
>         add     x1, x1, 32
>         str     x4, [x1, -16]
>         bcc     .L4

What I did for thunderx was create a vector cost model which caused this
loop not to be vectorized, to keep the regression from happening. Note this
might actually be better code for some micro-arch. I need to check with the
new processor we have in house, but that is next week or so. I don't know
how much I can share next week though.

(In reply to Andrew Pinski from comment #11)
> (In reply to Maxim Kuvyrkov from comment #9)
> > which then becomes for aarch64:
> >
> > .L4:
> >         ld2     {v0.2d - v1.2d}, [x1]
> >         add     w2, w2, 1
> >         cmp     w2, w7
> >         eor     v0.16b, v2.16b, v0.16b
> >         umov    x4, v0.d[1]
> >         st1     {v0.d}[0], [x1]
> >         add     x1, x1, 32
> >         str     x4, [x1, -16]
> >         bcc     .L4
>
> What I did for thunderx was create a vector cost model which caused this
> loop not to be vectorized, to keep the regression from happening. Note this
> might actually be better code for some micro-arch. I need to check with the
> new processor we have in house, but that is next week or so. I don't know
> how much I can share next week though.

You are making an orthogonal point to this bug report: whether or not to
vectorize such a loop. But if the loop is vectorized, then on any
microarchitecture it is better to have "st2" than "umov; st1; str".
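To make that concrete, here is a hand-written sketch with AArch64 NEON intrinsics (my illustration, not code from this thread; the name func_neon and the scalar epilogue are assumptions). It maps onto ld2/eor/st2, and note that the st2 also writes the untouched 'gap' lanes back, which is exactly the issue discussed in the next comment:

===
#include <arm_neon.h>

struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func_neon (int target, struct reg_struct *reg)
{
  uint64x2_t mask = vdupq_n_u64 ((unsigned long long) 1 << target);
  int i;

  /* Two 16-byte nodes per iteration: ld2 de-interleaves the {gap, state}
     pairs, so v.val[0] holds the gap halves and v.val[1] the state fields.  */
  for (i = 0; i + 1 < reg->size; i += 2)
    {
      uint64x2x2_t v = vld2q_u64 ((const uint64_t *) &reg->node[i]);
      v.val[1] = veorq_u64 (v.val[1], mask);      /* xor only the state lanes */
      vst2q_u64 ((uint64_t *) &reg->node[i], v);  /* st2 re-interleaves; note it
                                                     also rewrites the gap lanes */
    }

  /* Scalar epilogue for an odd trip count.  */
  for (; i < reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
===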

(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code. I'm interested in spending some time on fixing
> this, provided some guidance on the vectorizer.
>
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?

It has the complication that we can't actually store to the gaps because
that creates store data races (and we'd need a load-modify-write cycle). So
we have to emit either scalar stores (which is what we currently do), emit
masked stores (not implemented yet), or something you suggest (I suppose
that's a store-lanes kind?).

A slight complication is that we have to avoid detecting the store group if
we want to end up with scalar stores (well, that's a vectorizer
implementation limit). This is why we simply split all groups at gap
boundaries.

Cost-based selection of the kind of store (or even load) implementation is
not implemented.
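A minimal sketch (my own illustration, not from this report) of why storing to the gaps is not a valid transformation: the widened load-modify-write version writes bytes the scalar loop never writes, which is a store data race if another thread concurrently updates the gap field.

===
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

/* What the scalar loop does per element: writes only 'state'.  */
void
scalar_update (struct node_struct *n, unsigned long long mask)
{
  n->state ^= mask;
}

/* What a gap-filling vectorization would effectively do: load the whole
   node, modify 'state', and store the whole node back.  The store to
   'gap' is new -- if another thread writes n->gap between our load and
   store, its update is silently lost, a data race the original scalar
   program did not have.  */
void
widened_update (struct node_struct *n, unsigned long long mask)
{
  struct node_struct tmp = *n;
  tmp.state ^= mask;
  *n = tmp;
}
===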

(In reply to Maxim Kuvyrkov from comment #12)
> You are making an orthogonal point to this bug report: whether or not to
> vectorize such a loop. But if the loop is vectorized, then on any
> microarchitecture it is better to have "st2" than "umov; st1; str".

Yes, but thinking about the problem some more, I do think there are some
vector cost model issues in the aarch64 backend where we don't model int vs
floating point cost differences. For example, ^ for scalar int might be one
cycle while the vector version is 4 cycles, but floating point scalar
addition is 4 cycles while the floating point vector addition is also just
4 cycles.

struct cpu_vector_cost
{
  const int scalar_stmt_cost;    /* Cost of any scalar operation,
                                    excluding load and store.  */
...
  const int vec_stmt_cost;       /* Cost of any vector operation,
                                    excluding load, store, permute,
                                    vector-to-scalar and scalar-to-vector
                                    operation.  */

Anyways, I filed PR 79262 for the regression.