Created attachment 44386 [details] bad vectorizatoin result for boundary size 16 For the case below, the code generated by “gcc -O3” is very ugly, and the inner loop can be correctly vectorized. Please refer to attached file test_loop_inner_16.s. char g_d[1024], g_s1[1024], g_s2[1024]; void test_loop(void) { char *d = g_d, *s1 = g_s1, *s2 = g_s2; for ( int y = 0; y < 128; y++ ) { for ( int x = 0; x < 16; x++ ) d[x] = s1[x] + s2[x]; d += 16; } } If we change inner loop “for ( int x = 0; x < 16; x++ )” to be like “for ( int x = 0; x < 32; x++ )”, i.e. the loop boundary size changes from 16 to 32, very beautiful vectorization code would be generated. For example, the code below is the aarch64 result for loop boundary size 32, and it the same case for x86. test_loop: .LFB0: .cfi_startproc adrp x2, g_s1 adrp x3, g_s2 add x2, x2, :lo12:g_s1 add x3, x3, :lo12:g_s2 adrp x0, g_d adrp x1, g_d+2048 add x0, x0, :lo12:g_d add x1, x1, :lo12:g_d+2048 ldp q1, q2, [x2] ldp q3, q0, [x3] add v1.16b, v1.16b, v3.16b add v0.16b, v0.16b, v2.16b .p2align 3,,7 .L2: str q1, [x0] str q0, [x0, 16]! cmp x0, x1 bne .L2 ret The code generated for loop boundary size 8 is also very bad. Any idea?
Created attachment 44387 [details] bad vectorizatoin result for boundary size 8
Confirmed, I've seen this in other similar loops. Thanks for filing a bug report for this
The issue is the inner complete loop unrolling pass which unrolls loops up to 16 times (a --param controls that number). You can get good code via -fdisable-tree-cunrolli for example. So the vectorization issue would be that basic-block vectorization doesn't catch this in a very nice way - on x86 we pull out the invariant computation and have a vectorized (outer) loop storing to d. But we fail to vectorize the add because we are restricted to a single basic-block and the stores are still in the inner loop (obviously): t.c:9:15: note: not vectorized: no grouped stores in basic block. instead we see _238 = MEM[(char *)&g_s2 + 15B]; _239 = (unsigned char) _238; _240 = _236 + _239; _242 = (char) _240; _234 = {_32, _46, _60, _74, _88, _102, _116, _130, _144, _158, _172, _186, _200, _214, _228, _242}; vect_cst__237 = _234; <bb 3> [local count: 63136020]: # vectp_g_d.0_227 = PHI <vectp_g_d.0_15(5), &g_d(2)> # ivtmp_31 = PHI <ivtmp_241(5), 0(2)> MEM[(char *)vectp_g_d.0_227] = vect_cst__237; vectp_g_d.0_15 = vectp_g_d.0_227 + 16; ivtmp_241 = ivtmp_31 + 1; if (ivtmp_241 < 128) goto <bb 5>; [99.00%] else goto <bb 4>; [1.00%] <bb 5> [local count: 62498283]: goto <bb 3>; [100.00%] <bb 4> [local count: 637738]: return; so this is a duplicate of the bug that says BB vectorization doesn't consider a vector CONSTRUCTOR as sink.
About to post a patch.
Hi Richard, Do you still plan on working on this? Otherwise I'd like to take it over for GCC10.
(In reply to Tamar Christina from comment #5) > Hi Richard, > > Do you still plan on working on this? Otherwise I'd like to take it over for > GCC10. I've already committed the patch mentioned in #c4 (can't remember which patch it was offhand) but it turned out not to fix this PR after all. So please go ahead.
Cheers, Thanks Richard, I'll grab this one too then.
(In reply to Richard Biener from comment #3) Hi Richard, > So the vectorization issue would be that basic-block vectorization doesn't > catch this in a very nice way - on x86 we pull out the invariant computation > and have a vectorized (outer) loop storing to d. Just a small clarification, do you mean to say that there is a difference between the way x86 and aarch64 handle this, as far as I can see they handle this in the same way.
(In reply to Joel Hutton from comment #8) > (In reply to Richard Biener from comment #3) > > Hi Richard, > > > So the vectorization issue would be that basic-block vectorization doesn't > > catch this in a very nice way - on x86 we pull out the invariant computation > > and have a vectorized (outer) loop storing to d. > > Just a small clarification, do you mean to say that there is a difference > between the way x86 and aarch64 handle this, as far as I can see they handle > this in the same way. No, I was just mentioning I inspected this on x86 because I do expect the handling to be the same.
Should be fixed on trunk
Should be fixed on trunk by r277784
Fixed in GCC 10.