This does not vectorize at -O3 on x86_64/-mavx or aarch64: int loop (int *data) { int tot = 0; for (int i = 0; i < 256; i++) data[i * 2] += 7; return tot; } -fdump-tree-vect-details reveals: loadstore.c:6:3: note: === vect_analyze_data_ref_accesses === loadstore.c:6:3: note: Detected single element interleaving *_8 step 8 loadstore.c:6:3: note: Data access with gaps requires scalar epilogue loop loadstore.c:6:3: note: not consecutive access *_8 = _10; loadstore.c:6:3: note: not vectorized: complicated access pattern. loadstore.c:6:3: note: bad data access. However, a similar testcase that only reads from those locations, vectorizes ok: int loop_12 (int *data) { int tot = 0; for (int i = 0; i < 256; i++) tot += data[i * 2]; return tot; } blocksort.c:6:3: note: === vect_analyze_data_ref_accesses === blocksort.c:6:3: note: Detected single element interleaving *_7 step 8 blocksort.c:6:3: note: Data access with gaps requires scalar epilogue loop
I believe this is a known issue, but have not identified an existing PR.
I believe Micha has patches for this?
While strided stores are now implemented the case is still not handled because single-element interleaving takes precedence (and single-element interleaving isn't supported for stores as that always produces gaps). I have a patch that produces .L2: movdqu 16(%rax), %xmm1 addq $32, %rax movdqu -32(%rax), %xmm0 shufps $136, %xmm1, %xmm0 paddd %xmm2, %xmm0 pshufd $85, %xmm0, %xmm1 movd %xmm0, -32(%rax) movd %xmm1, -24(%rax) movdqa %xmm0, %xmm1 punpckhdq %xmm0, %xmm1 pshufd $255, %xmm0, %xmm0 movd %xmm1, -16(%rax) movd %xmm0, -8(%rax) cmpq %rdx, %rax jne .L2 when you disable the cost model. Otherwise it's deemed not profitable. Using scatters for AVX could in theory make it profitable (not sure). t.c:5:3: note: Cost model analysis: Vector inside of loop cost: 13 Vector prologue cost: 1 Vector epilogue cost: 12 Scalar iteration cost: 3 Scalar outside cost: 0 Vector outside cost: 13 prologue iterations: 0 epilogue iterations: 4 t.c:5:3: note: cost model: the vector iteration cost = 13 divided by the scalar iteration cost = 3 is greater or equal to the vectorization factor = 4. t.c:5:3: note: not vectorized: vectorization not profitable. t.c:5:3: note: not vectorized: vector version will never be profitable. t.c:5:3: note: ==> examining statement: *_8 = _10; t.c:5:3: note: vect_is_simple_use: operand _10 t.c:5:3: note: def_stmt: _10 = _9 + 7; t.c:5:3: note: type of def: internal t.c:5:3: note: vect_model_store_cost: inside_cost = 8, prologue_cost = 0 . so the strided store has cost 8, that's 4 extracts plus 4 scalar stores. With AVX we generate vmovd %xmm0, -32(%rax) vpextrd $1, %xmm0, -24(%rax) vpextrd $2, %xmm0, -16(%rax) vpextrd $3, %xmm0, -8(%rax) so it can combine extract and store, with SSE2 we get pshufd $85, %xmm0, %xmm1 movd %xmm0, -32(%rax) movd %xmm1, -24(%rax) movdqa %xmm0, %xmm1 punpckhdq %xmm0, %xmm1 pshufd $255, %xmm0, %xmm0 movd %xmm1, -16(%rax) movd %xmm0, -8(%rax) which is even worse than expected ;) As usual the cost model isn't target aware enough here (and it errs on the conservative side here)
Fixed for GCC 6.
Author: rguenth Date: Thu Oct 22 13:33:17 2015 New Revision: 229172 URL: https://gcc.gnu.org/viewcvs?rev=229172&root=gcc&view=rev Log: 2015-10-22 Richard Biener <rguenther@suse.de> PR tree-optimization/19049 PR tree-optimization/65962 * tree-vect-data-refs.c (vect_analyze_group_access_1): Fall back to strided accesses if single-element interleaving doesn't work. * gcc.dg/vect/vect-strided-store-pr65962.c: New testcase. * gcc.dg/vect/vect-63.c: Adjust. * gcc.dg/vect/vect-70.c: Likewise. * gcc.dg/vect/vect-strided-u8-i2-gap.c: Likewise. * gcc.dg/vect/vect-strided-a-u8-i2-gap.c: Likewise. * gfortran.dg/vect/pr19049.f90: Likewise. * gfortran.dg/vect/vect-8.f90: Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/vect-63.c trunk/gcc/testsuite/gcc.dg/vect/vect-70.c trunk/gcc/testsuite/gcc.dg/vect/vect-strided-a-u8-i2-gap.c trunk/gcc/testsuite/gcc.dg/vect/vect-strided-u8-i2-gap.c trunk/gcc/testsuite/gfortran.dg/vect/pr19049.f90 trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90 trunk/gcc/tree-vect-data-refs.c
This commit (r229172) caused a vectorization failure for POWER: FAIL: gcc.dg/vect/vect-62.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 1 loops" 1 FAIL: gcc.dg/vect/vect-62.c scan-tree-dump-times vect "vectorized 1 loops" 1 Seems like an odd result, but that's what the bisection shows...
(In reply to Bill Schmidt from comment #6) > This commit (r229172) caused a vectorization failure for POWER: > > FAIL: gcc.dg/vect/vect-62.c -flto -ffat-lto-objects scan-tree-dump-times > vect "vectorized 1 loops" 1 > FAIL: gcc.dg/vect/vect-62.c scan-tree-dump-times vect "vectorized 1 loops" 1 > > Seems like an odd result, but that's what the bisection shows... Huh. Can you please attach vectorizer dumps before/after?
Created attachment 36589 [details] Vectorization dump for r229171
Created attachment 36590 [details] Vectorization dump for r229172
Above are the before-and-after vectorization dumps. I haven't looked at them in any detail myself yet.
The main difference is +/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:39:3: note: LOOP VECTORIZED +/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:39:3: note: OUTER LOOP VECTORIZED ... -/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:9:5: note: vectorized 1 loops in function. +/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:9:5: note: vectorized 2 loops in function. so we now vectorize two loops. The newly vectorized loop is /* Multidimensional array. Aligned. The "inner" dimensions are invariant in the inner loop. Vectorizable, but the vectorizer detects that everything is invariant and that the loop is better left untouched. (it should be optimized away). */ for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { ia[i][1][8] = ib[i]; } } on x86_64 the latch block is not empty - for some reason not so on ppc. I suspect that if we had a cddce pass after loop invariant/store motion (which should make the inner loop empty) we'd even remove the inner loop and vectorize this regularly. Ah, so on x86_64 we PREd ib[0] while on ppc the ib initializer is probably in a constant pool entry. Yes: <bb 2>: ib = *.LC0; vs. <bb 2>: ib[0] = 0; ib[1] = 3; ib[2] = 6; ib[3] = 9; ... The PRE heuristic to not confuse vectorization doesn't fire here. I have a fix for that (and the testcase).
Ah, great! More vectorization is good. :) Thanks for looking into this so quickly! Bill
Author: rguenth Date: Wed Oct 28 10:09:37 2015 New Revision: 229481 URL: https://gcc.gnu.org/viewcvs?rev=229481&root=gcc&view=rev Log: 2015-10-28 Richard Biener <rguenther@suse.de> PR tree-optimization/65962 * tree-ssa-pre.c (eliminate_dom_walker::before_dom_children): Avoid creating loop carried dependences also for outer loops of the loop a use to replace is in. * gcc.dg/vect/vect-62.c: Adjust. Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/vect-62.c trunk/gcc/tree-ssa-pre.c
After r229172, I am also seeing: FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 2 loops" 1 FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect "vectorized 2 loops" 1 on target armeb-none-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16 The test passes on arm-none-linux-gnueabihf
On Wed, 28 Oct 2015, clyon at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962 > > Christophe Lyon <clyon at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |clyon at gcc dot gnu.org > > --- Comment #14 from Christophe Lyon <clyon at gcc dot gnu.org> --- > After r229172, I am also seeing: > FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects > scan-tree-dump-times vect "vectorized 2 loops" 1 > FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect > "vectorized 2 loops" 1 > > on target armeb-none-linux-gnueabihf > --with-cpu=cortex-a9 > --with-fpu=neon-fp16 > > > The test passes on arm-none-linux-gnueabihf Can you please attach the vectorizer dump for the failing case? Not sure what should be special about big-endian here...
Created attachment 36614 [details] Vectorizer dump for little endian
Created attachment 36615 [details] Vectorizer dump for big endian
I've attached vectorizer dumps for LE and BE from trunk@229448.
Ok, so the difference is that BE doesn't support unaligned vector loads while LE does. The targetm.vectorize.support_vector_misalignment hook has static bool arm_builtin_support_vector_misalignment (machine_mode mode, const_tree type, int misalignment, bool is_packed) { if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) { whatever reason for. So the testcase would need adjustment with hw_misalign.
(In reply to Richard Biener from comment #19) > whatever reason for. So the testcase would need adjustment with hw_misalign. check_effective_target_vect_hw_misalign needs to be updated to include arm targets. I guess I should add arm-*-*, such that vect-strided-a-u8-i2-gap.c (after adding dg-require-effective-target vect_hw_misalign) still PASSes for arm-* and is UNSUPPORTED for armeb-*. I'm just wondering whether I should instead add arm*-*-*, if we want to consider there is a bug with the armeb behavior.