Bug 83202 - Try joining operations on consecutive array elements during tree vectorization
Summary: Try joining operations on consecutive array elements during tree vectorization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.2.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on: 83326
Blocks: vectorizer 66974 69224 82286
Reported: 2017-11-28 21:09 UTC by Daniel Fruzynski
Modified: 2019-04-11 12:21 UTC
CC: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-11-29 00:00:00


Description Daniel Fruzynski 2017-11-28 21:09:06 UTC
void test(double data[4][4])
{
  for (int i = 0; i < 4; i++)
  {
    for (int j = i; j < 4; j+=2)
    {
      data[i][j] = data[i][j] * data[i][j];
      data[i][j+1] = data[i][j+1] * data[i][j+1];
    }
  }
}

gcc creates this:

test(double (*) [4]):
  vmovsd xmm0, QWORD PTR [rdi]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi], xmm0
  vmovsd xmm0, QWORD PTR [rdi+8]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+8], xmm0
  vmovsd xmm0, QWORD PTR [rdi+16]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+16], xmm0
  vmovsd xmm0, QWORD PTR [rdi+24]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+24], xmm0
  vmovsd xmm0, QWORD PTR [rdi+40]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+40], xmm0
  vmovsd xmm0, QWORD PTR [rdi+48]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+48], xmm0
  vmovsd xmm0, QWORD PTR [rdi+56]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+56], xmm0
  vmovsd xmm0, QWORD PTR [rdi+64]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+64], xmm0
  vmovsd xmm0, QWORD PTR [rdi+80]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+80], xmm0
  vmovsd xmm0, QWORD PTR [rdi+88]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+88], xmm0
  vmovsd xmm0, QWORD PTR [rdi+120]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+120], xmm0
  vmovsd xmm0, QWORD PTR [rdi+128]
  vmulsd xmm0, xmm0, xmm0
  vmovsd QWORD PTR [rdi+128], xmm0
  ret

clang detects that it is possible to use packed operations instead of scalar ones and produces the following. Please implement a similar optimization in gcc too.

test(double (*) [4]): # @test(double (*) [4])
  vmovupd xmm0, xmmword ptr [rdi]
  vmovupd xmm1, xmmword ptr [rdi + 16]
  vmovupd xmm2, xmmword ptr [rdi + 40]
  vmovupd xmm3, xmmword ptr [rdi + 56]
  vmulpd xmm0, xmm0, xmm0
  vmovupd xmmword ptr [rdi], xmm0
  vmulpd xmm0, xmm1, xmm1
  vmovupd xmmword ptr [rdi + 16], xmm0
  vmulpd xmm0, xmm2, xmm2
  vmovupd xmmword ptr [rdi + 40], xmm0
  vmulpd xmm0, xmm3, xmm3
  vmovupd xmmword ptr [rdi + 56], xmm0
  vmovupd xmm0, xmmword ptr [rdi + 80]
  vmulpd xmm0, xmm0, xmm0
  vmovupd xmmword ptr [rdi + 80], xmm0
  vmovupd xmm0, xmmword ptr [rdi + 120]
  vmulpd xmm0, xmm0, xmm0
  vmovupd xmmword ptr [rdi + 120], xmm0
  ret
Comment 1 Daniel Fruzynski 2017-11-28 21:26:55 UTC
This was compiled with -O3 -mavx -ftree-vectorize
After sending this I noticed that I wrote the inner loop incorrectly; I meant the one below. Anyway, it is also not optimized:
for (int j = 0; j < i; j+=4)

I also checked code which could be optimized using operations on YMM registers:

void test(double data[8][8])
{
  for (int i = 0; i < 8; i++)
  {
    for (int j = 0; j < i; j+=4)
    {
      data[i][j] *= data[i][j];
      data[i][j+1] *= data[i][j+1];
      data[i][j+2] *= data[i][j+2];
      data[i][j+3] *= data[i][j+3];
    }
  }
}

gcc output is, hmm, interesting:

test(double (*) [8]):
  vmovupd xmm0, XMMWORD PTR [rdi+64]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+80], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+64], xmm0
  vextractf128 XMMWORD PTR [rdi+80], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+128]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+144], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+128], xmm0
  vextractf128 XMMWORD PTR [rdi+144], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+192]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+208], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+192], xmm0
  vextractf128 XMMWORD PTR [rdi+208], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+256]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+272], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+256], xmm0
  vextractf128 XMMWORD PTR [rdi+272], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+320]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+336], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+320], xmm0
  vextractf128 XMMWORD PTR [rdi+336], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+352]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+368], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+352], xmm0
  vextractf128 XMMWORD PTR [rdi+368], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+384]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+400], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+384], xmm0
  vextractf128 XMMWORD PTR [rdi+400], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+416]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+432], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+416], xmm0
  vextractf128 XMMWORD PTR [rdi+432], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+448]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+464], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+448], xmm0
  vextractf128 XMMWORD PTR [rdi+464], ymm0, 0x1
  vmovupd xmm0, XMMWORD PTR [rdi+480]
  vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+496], 0x1
  vmulpd ymm0, ymm0, ymm0
  vmovups XMMWORD PTR [rdi+480], xmm0
  vextractf128 XMMWORD PTR [rdi+496], ymm0, 0x1
  vzeroupper
  ret
Comment 2 Richard Biener 2017-11-29 09:36:42 UTC
With += 4 the inner loop body executes only once, so it's effectively

void test(double data[4][4])
{
  for (int i = 0; i < 4; i++)
  {
    data[i][i] = data[i][i] * data[i][i];
    data[i][i+1] = data[i][i+1] * data[i][i+1];
  }
}

we fail to SLP here because we get confused by the computed group size of 5,
as there's a gap of three elements between the first stores of consecutive
iterations; the sketch below makes the arithmetic concrete.
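
A minimal sketch (added for illustration, not from the bug report) of the
offsets involved, assuming the row-major layout of double[4][4]:

#include <stdio.h>

/* Iteration i of the reduced loop stores to data[i][i] and data[i][i+1],
   i.e. to linear elements 4*i + i = 5*i and 5*i + 1 of the row-major
   array.  Consecutive first stores are therefore five doubles apart:
   two accessed elements plus a gap of three, hence the group size of 5. */
int main(void)
{
  for (int i = 0; i < 4; i++)
    printf("i=%d: linear elements %d and %d\n", i, 5 * i, 5 * i + 1);
  return 0;
}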

When later doing BB vectorization we fail to analyze dependences, likely
because we do not analyze refs as thoroughly as the loop vectorizer does.

For your second example we fail to loop-vectorize because we completely
peel the inner loop in cunrolli, leaving control flow inside the outer loop
(see the sketch below)...
I have a patch for that one.
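
To illustrate what complete peeling leaves behind, here is a hand-written
approximation (for illustration only, not actual cunrolli output) of the
second example after the inner loop is fully peeled; the guards survive as
control flow inside the outer loop:

void test_peeled(double data[8][8])
{
  for (int i = 0; i < 8; i++)
  {
    /* for (j = 0; j < i; j += 4) runs at most twice for i < 8, so
       complete peeling replaces it with guarded copies of its body. */
    if (i > 0)
    {
      data[i][0] *= data[i][0];
      data[i][1] *= data[i][1];
      data[i][2] *= data[i][2];
      data[i][3] *= data[i][3];
      if (i > 4)
      {
        data[i][4] *= data[i][4];
        data[i][5] *= data[i][5];
        data[i][6] *= data[i][6];
        data[i][7] *= data[i][7];
      }
    }
  }
}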
Comment 3 Richard Biener 2017-11-29 11:12:15 UTC
For the other case the issue is, I think, that the SLP instance group size is
not the number of scalar stmts but is somehow set to the group size.  Changing
that has quite some ripple-down effects, though.

-> GCC 9.
Comment 4 Daniel Fruzynski 2017-11-29 13:54:59 UTC
One more case. The code has to process the diagonal half of a matrix and uses SSE intrinsics - see test1() below. When n is constant, as in test2() below, gcc unrolls the loops. However, one more transform could be performed: replacing pairs of SSE instructions with single AVX ones (see the sketch after test2()).

#include <stdint.h>
#include "immintrin.h"

void test1(double data[100][100], unsigned int n)
{
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < i; j += 2)
        {
            __m128d v = _mm_loadu_pd(&data[i][j]);
            v = _mm_mul_pd(v, v);
            _mm_storeu_pd(&data[i][j], v);
        }
    }
}

void test2(double data[100][100])
{
    const unsigned int n = 6;
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < i; j += 2)
        {
            __m128d v = _mm_loadu_pd(&data[i][j]);
            v = _mm_mul_pd(v, v);
            _mm_storeu_pd(&data[i][j], v);
        }
    }
}
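
For comparison, here is a hand-written sketch (for illustration only, not
compiler output) of what the requested transform would produce for the inner
body - one 256-bit operation in place of two 128-bit ones.  It assumes at
least four elements remain, so a 128-bit iteration would still be needed
when (i - j) is not a multiple of 4:

#include "immintrin.h"

/* Squares four consecutive doubles with a single 256-bit
   load/multiply/store sequence; requires AVX (-mavx). */
static inline void square4(double *p)
{
    __m256d v = _mm256_loadu_pd(p);
    v = _mm256_mul_pd(v, v);
    _mm256_storeu_pd(p, v);
}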
Comment 5 Richard Biener 2017-11-29 14:38:37 UTC
Author: rguenth
Date: Wed Nov 29 14:38:06 2017
New Revision: 255233

URL: https://gcc.gnu.org/viewcvs?rev=255233&root=gcc&view=rev
Log:
2017-11-29  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/83202
	* tree-vect-slp.c (scalar_stmts_set_t): New typedef.
	(bst_fail): Use it.
	(vect_analyze_slp_cost_1): Add visited set, do not account SLP
	nodes vectorized to the same stmts multiple times.
	(vect_analyze_slp_cost): Allocate a visited set and pass it down.
	(vect_analyze_slp_instance): Adjust.
	(scalar_stmts_to_slp_tree_map_t): New typedef.
	(vect_schedule_slp_instance): Add a map recording the SLP node
	representing the vectorized stmts for a set of scalar stmts.
	Avoid code-generating redundancies.
	(vect_schedule_slp): Allocate map and pass it down.

	* gcc.dg/vect/costmodel/x86_64/costmodel-pr83202.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr83202.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-slp.c
Comment 6 Richard Biener 2017-11-29 14:42:34 UTC
There are multiple issues reflected in this bug.  The last commit addressed the SLP cost model thing (not fixing any testcase on its own).
Comment 7 rguenther@suse.de 2017-11-29 14:51:23 UTC
On Wed, 29 Nov 2017, bugzilla@poradnik-webmastera.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202
> 
> --- Comment #4 from Daniel Fruzynski <bugzilla@poradnik-webmastera.com> ---
> One more case. The code has to process the diagonal half of a matrix and
> uses SSE intrinsics - see test1() below. When n is constant, as in test2()
> below, gcc unrolls the loops. However, one more transform could be
> performed: replacing pairs of SSE instructions with single AVX ones.

GCC currently does not "vectorize" already-vectorized code, so this is a much
more distant goal, apart from eventually pattern-matching some very simple
cases.
Comment 8 Richard Biener 2017-11-30 07:54:02 UTC
Author: rguenth
Date: Thu Nov 30 07:53:31 2017
New Revision: 255267

URL: https://gcc.gnu.org/viewcvs?rev=255267&root=gcc&view=rev
Log:
2017-11-30  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/83202
	* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Add
	allow_peel argument and guard peeling.
	(canonicalize_loop_induction_variables): Likewise.
	(canonicalize_induction_variables): Pass false.
	(tree_unroll_loops_completely_1): Pass unroll_outer to disallow
	peeling from cunrolli.

	* gcc.dg/vect/pr83202-1.c: New testcase.
	* gcc.dg/tree-ssa/pr61743-1.c: Adjust.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr83202-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr61743-1.c
    trunk/gcc/tree-ssa-loop-ivcanon.c
Comment 9 Richard Biener 2017-11-30 07:54:34 UTC
The last commit fixed the testcase in comment #1.
Comment 10 Richard Biener 2019-04-11 12:21:01 UTC
The comment #4 case is something completely different.  If it's really interesting
to re-vectorize already-vectorized code, please file a different bug.

The other testcases seem to work fine for me now.