Bug 65962 - Missed vectorization of strided stores
Summary: Missed vectorization of strided stores
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 5.0
: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2015-05-01 11:45 UTC by Alan Lawrence
Modified: 2015-10-30 09:49 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work: 6.0
Known to fail:
Last reconfirmed: 2015-05-04 00:00:00


Attachments
Vectorization dump for r229171 (3.76 KB, text/plain)
2015-10-26 16:51 UTC, Bill Schmidt
Details
Vectorization dump for r229172 (4.67 KB, text/plain)
2015-10-26 16:51 UTC, Bill Schmidt
Details
Vectorizer dump for little endian (7.71 KB, text/plain)
2015-10-29 14:57 UTC, Christophe Lyon
Details
Vectorizer dump for big endian (5.11 KB, text/plain)
2015-10-29 14:58 UTC, Christophe Lyon
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Lawrence 2015-05-01 11:45:49 UTC
This does not vectorize at -O3 on x86_64/-mavx or aarch64:
int
loop (int *data)
{
  int tot = 0;
  for (int i = 0; i < 256; i++)
    data[i * 2] += 7;
  return tot;
}

-fdump-tree-vect-details reveals:

loadstore.c:6:3: note: === vect_analyze_data_ref_accesses ===
loadstore.c:6:3: note: Detected single element interleaving *_8 step 8
loadstore.c:6:3: note: Data access with gaps requires scalar epilogue loop
loadstore.c:6:3: note: not consecutive access *_8 = _10;

loadstore.c:6:3: note: not vectorized: complicated access pattern.
loadstore.c:6:3: note: bad data access.

However, a similar testcase that only reads from those locations, vectorizes ok:
int
loop_12 (int *data)
{
  int tot = 0;
  for (int i = 0; i < 256; i++)
    tot += data[i * 2];
  return tot;
}
blocksort.c:6:3: note: === vect_analyze_data_ref_accesses ===
blocksort.c:6:3: note: Detected single element interleaving *_7 step 8
blocksort.c:6:3: note: Data access with gaps requires scalar epilogue loop
Comment 1 Alan Lawrence 2015-05-01 11:46:18 UTC
I believe this is a known issue, but have not identified an existing PR.
Comment 2 Richard Biener 2015-05-04 11:19:49 UTC
I believe Micha has patches for this?
Comment 3 Richard Biener 2015-10-22 10:28:03 UTC
While strided stores are now implemented the case is still not handled because
single-element interleaving takes precedence (and single-element interleaving
isn't supported for stores as that always produces gaps).

I have a patch that produces

.L2:
        movdqu  16(%rax), %xmm1
        addq    $32, %rax
        movdqu  -32(%rax), %xmm0
        shufps  $136, %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)
        cmpq    %rdx, %rax
        jne     .L2

when you disable the cost model.  Otherwise it's deemed not profitable.  Using
scatters for AVX could in theory make it profitable (not sure).

t.c:5:3: note: Cost model analysis:
  Vector inside of loop cost: 13
  Vector prologue cost: 1
  Vector epilogue cost: 12
  Scalar iteration cost: 3
  Scalar outside cost: 0
  Vector outside cost: 13
  prologue iterations: 0
  epilogue iterations: 4
t.c:5:3: note: cost model: the vector iteration cost = 13 divided by the scalar iteration cost = 3 is greater or equal to the vectorization factor = 4.
t.c:5:3: note: not vectorized: vectorization not profitable.
t.c:5:3: note: not vectorized: vector version will never be profitable.

t.c:5:3: note: ==> examining statement: *_8 = _10;
t.c:5:3: note: vect_is_simple_use: operand _10
t.c:5:3: note: def_stmt: _10 = _9 + 7;
t.c:5:3: note: type of def: internal
t.c:5:3: note: vect_model_store_cost: inside_cost = 8, prologue_cost = 0 .

so the strided store has cost 8, that's 4 extracts plus 4 scalar stores.
With AVX we generate

        vmovd   %xmm0, -32(%rax)
        vpextrd $1, %xmm0, -24(%rax)
        vpextrd $2, %xmm0, -16(%rax)
        vpextrd $3, %xmm0, -8(%rax)

so it can combine extract and store, with SSE2 we get

        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)

which is even worse than expected ;)

As usual the cost model isn't target aware enough here (and it errs on the
conservative side here)
Comment 4 Richard Biener 2015-10-22 13:33:34 UTC
Fixed for GCC 6.
Comment 5 Richard Biener 2015-10-22 13:33:49 UTC
Author: rguenth
Date: Thu Oct 22 13:33:17 2015
New Revision: 229172

URL: https://gcc.gnu.org/viewcvs?rev=229172&root=gcc&view=rev
Log:
2015-10-22  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/19049
	PR tree-optimization/65962
	* tree-vect-data-refs.c (vect_analyze_group_access_1): Fall back
	to strided accesses if single-element interleaving doesn't work.

	* gcc.dg/vect/vect-strided-store-pr65962.c: New testcase.
	* gcc.dg/vect/vect-63.c: Adjust.
	* gcc.dg/vect/vect-70.c: Likewise.
	* gcc.dg/vect/vect-strided-u8-i2-gap.c: Likewise.
	* gcc.dg/vect/vect-strided-a-u8-i2-gap.c: Likewise.
	* gfortran.dg/vect/pr19049.f90: Likewise.
	* gfortran.dg/vect/vect-8.f90: Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/vect-63.c
    trunk/gcc/testsuite/gcc.dg/vect/vect-70.c
    trunk/gcc/testsuite/gcc.dg/vect/vect-strided-a-u8-i2-gap.c
    trunk/gcc/testsuite/gcc.dg/vect/vect-strided-u8-i2-gap.c
    trunk/gcc/testsuite/gfortran.dg/vect/pr19049.f90
    trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90
    trunk/gcc/tree-vect-data-refs.c
Comment 6 Bill Schmidt 2015-10-23 22:32:36 UTC
This commit (r229172) caused a vectorization failure for POWER:

FAIL: gcc.dg/vect/vect-62.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/vect/vect-62.c scan-tree-dump-times vect "vectorized 1 loops" 1

Seems like an odd result, but that's what the bisection shows...
Comment 7 Richard Biener 2015-10-26 11:51:07 UTC
(In reply to Bill Schmidt from comment #6)
> This commit (r229172) caused a vectorization failure for POWER:
> 
> FAIL: gcc.dg/vect/vect-62.c -flto -ffat-lto-objects  scan-tree-dump-times
> vect "vectorized 1 loops" 1
> FAIL: gcc.dg/vect/vect-62.c scan-tree-dump-times vect "vectorized 1 loops" 1
> 
> Seems like an odd result, but that's what the bisection shows...

Huh.  Can you please attach vectorizer dumps before/after?
Comment 8 Bill Schmidt 2015-10-26 16:51:02 UTC
Created attachment 36589 [details]
Vectorization dump for r229171
Comment 9 Bill Schmidt 2015-10-26 16:51:59 UTC
Created attachment 36590 [details]
Vectorization dump for r229172
Comment 10 Bill Schmidt 2015-10-26 16:52:39 UTC
Above are the before-and-after vectorization dumps.  I haven't looked at them in any detail myself yet.
Comment 11 Richard Biener 2015-10-27 11:16:00 UTC
The main difference is

+/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:39:3: 
note: LOOP VECTORIZED
+/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:39:3: 
note: OUTER LOOP VECTORIZED
...
-/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:9:5: note: vectorized 1 loops in function.
+/home/wschmidt/gcc/gcc-mainline-base/gcc/testsuite/gcc.dg/vect/vect-62.c:9:5: note: vectorized 2 loops in function.

so we now vectorize two loops.  The newly vectorized loop is

  /* Multidimensional array. Aligned. The "inner" dimensions
     are invariant in the inner loop. Vectorizable, but the
     vectorizer detects that everything is invariant and that
     the loop is better left untouched. (it should be optimized away). */
  for (i = 0; i < N; i++)
    {
      for (j = 0; j < N; j++)
        {
           ia[i][1][8] = ib[i];
        }
    }

on x86_64 the latch block is not empty - for some reason not so on ppc.
I suspect that if we had a cddce pass after loop invariant/store motion
(which should make the inner loop empty) we'd even remove the inner loop
and vectorize this regularly.

Ah, so on x86_64 we PREd ib[0] while on ppc the ib initializer is probably
in a constant pool entry.  Yes:

  <bb 2>:
  ib = *.LC0;

vs.

  <bb 2>:
  ib[0] = 0;
  ib[1] = 3;
  ib[2] = 6;
  ib[3] = 9;
...

The PRE heuristic to not confuse vectorization doesn't fire here.

I have a fix for that (and the testcase).
Comment 12 Bill Schmidt 2015-10-27 13:17:25 UTC
Ah, great!  More vectorization is good. :)  Thanks for looking into this so quickly!

Bill
Comment 13 Richard Biener 2015-10-28 10:10:09 UTC
Author: rguenth
Date: Wed Oct 28 10:09:37 2015
New Revision: 229481

URL: https://gcc.gnu.org/viewcvs?rev=229481&root=gcc&view=rev
Log:
2015-10-28  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/65962
	* tree-ssa-pre.c (eliminate_dom_walker::before_dom_children):
	Avoid creating loop carried dependences also for outer loops
	of the loop a use to replace is in.

	* gcc.dg/vect/vect-62.c: Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/vect-62.c
    trunk/gcc/tree-ssa-pre.c
Comment 14 Christophe Lyon 2015-10-28 16:47:11 UTC
After r229172, I am also seeing:
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 2 loops" 1
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect "vectorized 2 loops" 1

on target armeb-none-linux-gnueabihf
--with-cpu=cortex-a9
--with-fpu=neon-fp16


The test passes on arm-none-linux-gnueabihf
Comment 15 rguenther@suse.de 2015-10-29 09:48:47 UTC
On Wed, 28 Oct 2015, clyon at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962
> 
> Christophe Lyon <clyon at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |clyon at gcc dot gnu.org
> 
> --- Comment #14 from Christophe Lyon <clyon at gcc dot gnu.org> ---
> After r229172, I am also seeing:
> FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects 
> scan-tree-dump-times vect "vectorized 2 loops" 1
> FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect
> "vectorized 2 loops" 1
> 
> on target armeb-none-linux-gnueabihf
> --with-cpu=cortex-a9
> --with-fpu=neon-fp16
> 
> 
> The test passes on arm-none-linux-gnueabihf

Can you please attach the vectorizer dump for the failing case?
Not sure what should be special about big-endian here...
Comment 16 Christophe Lyon 2015-10-29 14:57:52 UTC
Created attachment 36614 [details]
Vectorizer dump for little endian
Comment 17 Christophe Lyon 2015-10-29 14:58:30 UTC
Created attachment 36615 [details]
Vectorizer dump for big endian
Comment 18 Christophe Lyon 2015-10-29 14:59:55 UTC
I've attached vectorizer dumps for LE and BE from trunk@229448.
Comment 19 Richard Biener 2015-10-29 15:14:27 UTC
Ok, so the difference is that BE doesn't support unaligned vector loads while LE does. The targetm.vectorize.support_vector_misalignment hook has

static bool
arm_builtin_support_vector_misalignment (machine_mode mode,
                                         const_tree type, int misalignment,
                                         bool is_packed)
{
  if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
    {

whatever reason for.  So the testcase would need adjustment with hw_misalign.
Comment 20 Christophe Lyon 2015-10-30 09:49:20 UTC
(In reply to Richard Biener from comment #19)

> whatever reason for.  So the testcase would need adjustment with hw_misalign.

check_effective_target_vect_hw_misalign needs to be updated to include arm targets.

I guess I should add arm-*-*, such that vect-strided-a-u8-i2-gap.c (after adding dg-require-effective-target vect_hw_misalign) still PASSes for arm-* and is UNSUPPORTED for armeb-*.

I'm just wondering whether I should instead add arm*-*-*, if we want to consider there is a bug with the armeb behavior.