
Bug 18438 - vectorizer failed for vector matrix multiplication
Summary: vectorizer failed for vector matrix multiplication
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.0.0
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2004-11-12 01:29 UTC by Giovanni Bajo
Modified: 2017-01-28 08:36 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-01-28 00:00:00


Attachments

Description Giovanni Bajo 2004-11-12 01:29:03 UTC
Vectorizer fails to handle this:


------------------------------------------------------------
#define NUMPOINTS 50000

#define align(x) __attribute__((align(x)))

typedef float align(16) MATRIX[3][3];

static float points[NUMPOINTS][4];
static align(16) float opoints[NUMPOINTS][4];
static bool flags[NUMPOINTS];
static MATRIX gmatrix;


void RotateVectors (void)
{
  int i, r;

  for (r = 0; r < 4; r++)
  {
    for (i = 0; i < NUMPOINTS; i++)
    {
      opoints[i][0] =     gmatrix[0][0] * points[i][0]
                        + gmatrix[0][1] * points[i][1]
                        + gmatrix[0][2] * points[i][2];
      opoints[i][1] =     gmatrix[1][0] * points[i][0]
                        + gmatrix[1][1] * points[i][1]
                        + gmatrix[1][2] * points[i][2];
      opoints[i][2] =     gmatrix[2][0] * points[i][0]
                        + gmatrix[2][1] * points[i][1]
                        + gmatrix[2][2] * points[i][2];
      flags[i] = true;
    }
  }
}
------------------------------------------------------------
loop at bench.cc:52: not vectorized: complicated access pattern.
loop at bench.cc:52: bad data access.
Comment 1 Andrew Pinski 2004-11-12 02:43:35 UTC
Confirmed.  ICC can do this but does not, because it would not be very efficient to do so.
Comment 2 Andrew Pinski 2005-09-20 17:47:37 UTC
t.c:20: note: not vectorized: mixed data-types
t.c:20: note: can't determine vectorization factor.

Removing flags[i] = true;
we get:
t.c:20: note: not consecutive access
t.c:20: note: not vectorized: complicated access pattern.
Comment 3 Ira Rosen 2006-09-19 07:10:15 UTC
> t.c:20: note: not vectorized: mixed data-types
> t.c:20: note: can't determine vectorization factor.
>
> Removing flags[i] = true;

Multiple data-types vectorization is already supported in the autovect branch, and the patches for mainline (starting from http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00941.html) will be committed as soon as 4.3 is open.  


> we get:
> t.c:20: note: not consecutive access
> t.c:20: note: not vectorized: complicated access pattern.

Vectorization of strided accesses is also already implemented in the autovect branch (and will be committed to the mainline 4.3). However, this case contains stores with gaps (stores to opoints[i][0], opoints[i][1], and opoints[i][2], without a store to opoints[i][3]), and only loads with gaps are currently supported.

Therefore, this loop will be vectorizable in the autovect branch (and soon in the mainline 4.3) if a store to opoints[i][3] is added.
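For concreteness, a minimal sketch of the change described above, with an added store to opoints[i][3] so that the store group no longer has a gap (the zero written to element 3 is an arbitrary placeholder; only the presence of the store matters):

    for (i = 0; i < NUMPOINTS; i++)
    {
      opoints[i][0] =     gmatrix[0][0] * points[i][0]
                        + gmatrix[0][1] * points[i][1]
                        + gmatrix[0][2] * points[i][2];
      opoints[i][1] =     gmatrix[1][0] * points[i][0]
                        + gmatrix[1][1] * points[i][1]
                        + gmatrix[1][2] * points[i][2];
      opoints[i][2] =     gmatrix[2][0] * points[i][0]
                        + gmatrix[2][1] * points[i][1]
                        + gmatrix[2][2] * points[i][2];
      opoints[i][3] = 0.0f;   /* fills the gap; the value is never read */
      flags[i] = true;
    }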

Ira
Comment 4 Giovanni Bajo 2007-01-05 00:37:39 UTC
Thanks Ira. What about stores with gaps?
Comment 5 Ira Rosen 2007-01-07 07:40:29 UTC
On the todo list.

BTW, vectorization of strided accesses was committed to the mainline 4.3.

Ira
Comment 6 Steven Bosscher 2011-05-22 15:40:28 UTC
Still not vectorized in recent GCC 
t.c:20: note: not vectorized: complicated access pattern.
t.c:22: note: not vectorized: complicated access pattern.


     1	typedef unsigned int bool;
     2	#define true 1
     3	 
     4	#define NUMPOINTS 50000
     5	 
     6	#define align(x) __attribute__((align(x)))
     7	 
     8	typedef float align(16) MATRIX[3][3];
     9	 
    10	static float points[NUMPOINTS][4];
    11	static align(16) float opoints[NUMPOINTS][4];
    12	static bool flags[NUMPOINTS];
    13	static MATRIX gmatrix;
    14	 
    15	 
    16	void RotateVectors (void)
    17	{
    18	  int i, r;
    19	 
    20	  for (r = 0; r < 4; r++)
    21	  {
    22	    for (i = 0; i < NUMPOINTS; i++)
    23	    {
    24	      opoints[i][0] =     gmatrix[0][0] * points[i][0]
    25	                        + gmatrix[0][1] * points[i][1]
    26	                        + gmatrix[0][2] * points[i][2];
    27	      opoints[i][1] =     gmatrix[1][0] * points[i][0]
    28	                        + gmatrix[1][1] * points[i][1]
    29	                        + gmatrix[1][2] * points[i][2];
    30	      opoints[i][2] =     gmatrix[2][0] * points[i][0]
    31	                        + gmatrix[2][1] * points[i][1]
    32	                        + gmatrix[2][2] * points[i][2];
    33	      flags[i] = true;
    34	    }
    35	  }
    36	}
    37	

"GCC: (GNU) 4.6.0 20110312 (experimental) [trunk revision 170907]"
Comment 7 Richard Biener 2012-07-13 08:43:04 UTC
Link to vectorizer missed-optimization meta-bug.
Comment 8 Richard Biener 2013-03-27 11:27:31 UTC
The issue is that we cannot use a vector v4sf store to &opoints[i][0]
because opoints[i][3] is not stored to.  Such a "masked" store (or
"interleaved store with gaps") is not supported by SLP.
Comment 9 Maxim Kuvyrkov 2016-12-12 09:30:20 UTC
I've looked into another case where inability to handle stores with gaps generates sub-optimal code.  I'm interested in spending some time on fixing this, provided some guidance in the vectorizer.

Is it substantially more difficult to handle stores with gaps compared to loads with gaps?

The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(), which is the #2 function in the 462.libquantum profile.  This loop accounts for about 25% of total 462.libquantum time.

===struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
===

This loop vectorizes into
  <bb 5>:
  # vectp.8_39 = PHI <vectp.8_40(6), vectp.9_38(4)>
  vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
  vect__5.11_41 = vect_array.10[0];
  vect__5.12_42 = vect_array.10[1];
  vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
  _48 = BIT_FIELD_REF <vect__7.13_44, 64, 0>;
  MEM[(long long unsigned int *)ivtmp_45] = _48;
  ivtmp_50 = ivtmp_45 + 16;
  _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
  MEM[(long long unsigned int *)ivtmp_50] = _51;

which then becomes for aarch64:
.L4:
	ld2	{v0.2d - v1.2d}, [x1]
	add	w2, w2, 1
	cmp	w2, w7
	eor	v0.16b, v2.16b, v0.16b
	umov	x4, v0.d[1]
	st1	{v0.d}[0], [x1]
	add	x1, x1, 32
	str	x4, [x1, -16]
	bcc	.L4
Comment 10 Maxim Kuvyrkov 2016-12-12 09:34:58 UTC
(In reply to Maxim Kuvyrkov from comment #9)
> which then becomes for aarch64:
> .L4:
> 	ld2	{v0.2d - v1.2d}, [x1]
> 	add	w2, w2, 1
> 	cmp	w2, w7
> 	eor	v0.16b, v2.16b, v0.16b
> 	umov	x4, v0.d[1]
> 	st1	{v0.d}[0], [x1]
> 	add	x1, x1, 32
> 	str	x4, [x1, -16]
> 	bcc	.L4

IIUC,
	umov	x4, v0.d[1]
	st1	{v0.d}[0], [x1]
	str	x4, [x1, -16]
could become just
	st2	{v0.2d - v1.2d}, [x1]
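A rough intrinsics sketch of that ld2/eor/st2 shape (assuming AArch64 NEON and the struct layout from comment 9; func_neon is a made-up name, and the cast through uint64_t * is for illustration only).  Note that the st2 also rewrites the gap lanes with the values that were just loaded:

#include <arm_neon.h>
#include <stdint.h>

struct node_struct { float _Complex gap; unsigned long long state; };
struct reg_struct  { int size; struct node_struct *node; };

void
func_neon (int target, struct reg_struct *reg)
{
  uint64x2_t bit = vdupq_n_u64 (1ULL << target);
  uint64_t *p = (uint64_t *) reg->node;        /* nodes are 16 bytes: gap, state */
  int i = 0;
  for (; i + 1 < reg->size; i += 2, p += 4)
    {
      uint64x2x2_t v = vld2q_u64 (p);          /* ld2: val[0] = gaps, val[1] = states */
      v.val[1] = veorq_u64 (v.val[1], bit);    /* eor on the state lanes */
      vst2q_u64 (p, v);                        /* st2: re-interleave and store back */
    }
  for (; i < reg->size; i++)                   /* scalar epilogue */
    reg->node[i].state ^= 1ULL << target;
}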
Comment 11 Andrew Pinski 2016-12-12 09:39:42 UTC
(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code.  I'm interested in spending some time on fixing
> this, provided some guidance in the vectorizer.
> 
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?
> 
> The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
> which is the #2 function in the 462.libquantum profile.  This loop accounts for
> about 25% of total 462.libquantum time.
> 
> ===struct node_struct
> {
>   float _Complex gap;
>   unsigned long long state;
> };
> 
> struct reg_struct
> {
>   int size;
>   struct node_struct *node;
> };
> 
> void
> func(int target, struct reg_struct *reg)
> {
>   int i;
> 
>   for(i=0; i<reg->size; i++)
>     reg->node[i].state ^= ((unsigned long long) 1 << target);
> }
> ===
> 
> This loop vectorizes into
>   <bb 5>:
>   # vectp.8_39 = PHI <vectp.8_40(6), vectp.9_38(4)>
>   vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
>   vect__5.11_41 = vect_array.10[0];
>   vect__5.12_42 = vect_array.10[1];
>   vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
>   _48 = BIT_FIELD_REF <vect__7.13_44, 64, 0>;
>   MEM[(long long unsigned int *)ivtmp_45] = _48;
>   ivtmp_50 = ivtmp_45 + 16;
>   _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
>   MEM[(long long unsigned int *)ivtmp_50] = _51;
> 
> which then becomes for aarch64:
> .L4:
> 	ld2	{v0.2d - v1.2d}, [x1]
> 	add	w2, w2, 1
> 	cmp	w2, w7
> 	eor	v0.16b, v2.16b, v0.16b
> 	umov	x4, v0.d[1]
> 	st1	{v0.d}[0], [x1]
> 	add	x1, x1, 32
> 	str	x4, [x1, -16]
> 	bcc	.L4


What I did for thunderx was create a vector cost model which caused this loop not to be vectorized, to keep the regression from happening.  Note this might actually be better code for some micro-architectures.  I need to check with the new processor we have in house, but that is next week or so.  I don't know how much I can share next week though.
Comment 12 Maxim Kuvyrkov 2016-12-12 10:03:09 UTC
(In reply to Andrew Pinski from comment #11)
> (In reply to Maxim Kuvyrkov from comment #9)
> > which then becomes for aarch64:
> > .L4:
> > 	ld2	{v0.2d - v1.2d}, [x1]
> > 	add	w2, w2, 1
> > 	cmp	w2, w7
> > 	eor	v0.16b, v2.16b, v0.16b
> > 	umov	x4, v0.d[1]
> > 	st1	{v0.d}[0], [x1]
> > 	add	x1, x1, 32
> > 	str	x4, [x1, -16]
> > 	bcc	.L4
> 
> 
> What I did for thunderx was create a vector cost model which caused this
> loop not to be vectorized, to keep the regression from happening.  Note this
> might actually be better code for some micro-architectures.  I need to check
> with the new processor we have in house, but that is next week or so.  I
> don't know how much I can share next week though.

You are making a point that is orthogonal to this bug report: whether or not to vectorize such a loop at all.  But if the loop is vectorized, then on any microarchitecture it is better to have "st2" than "umov; st1; str".
Comment 13 Richard Biener 2016-12-13 09:49:43 UTC
(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code.  I'm interested in spending some time on fixing
> this, provided some guidance in the vectorizer.
> 
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?

It has the complication that we can't actually store to the gaps because
that creates store data races (and we'd need a load-modify-write cycle).

So we have to emit either scalar stores (which is what we currently do),
emit masked stores (not implemented yet) or something you suggest
(I suppose that's a store-lanes kind?).

A slight complication is that we have to avoid detecting the store group
if we want to end up with scalar stores (well, that's a vectorizer
implementation limit).  This is why we simply split all groups at gap
boundaries.  Cost-based selection of the kind of store (or even load)
implementation is not implemented.
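As a sketch of the masked-store option for the loop from comment 9 (assuming AVX2 intrinsics; func_masked is a made-up name, and this is not something the vectorizer emits today), only the state lanes are written, so the gap elements are never stored to and no store data race is created:

#include <immintrin.h>

struct node_struct { float _Complex gap; unsigned long long state; };
struct reg_struct  { int size; struct node_struct *node; };

void
func_masked (int target, struct reg_struct *reg)
{
  /* 256 bits hold gap0, state0, gap1, state1; store lanes 1 and 3 only.  */
  const __m256i mask = _mm256_set_epi64x (-1, 0, -1, 0);
  const __m256i bit  = _mm256_set1_epi64x ((long long) (1ULL << target));
  long long *p = (long long *) reg->node;
  int i = 0;
  for (; i + 1 < reg->size; i += 2, p += 4)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) p);
      v = _mm256_xor_si256 (v, bit);        /* xor every lane; gap lanes are discarded */
      _mm256_maskstore_epi64 (p, mask, v);  /* write only the state lanes */
    }
  for (; i < reg->size; i++)                /* scalar epilogue */
    reg->node[i].state ^= 1ULL << target;
}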
Comment 14 Andrew Pinski 2017-01-28 08:36:31 UTC
(In reply to Maxim Kuvyrkov from comment #12) 
> You are making a point that is orthogonal to this bug report: whether or not
> to vectorize such a loop at all.  But if the loop is vectorized, then on any
> microarchitecture it is better to have "st2" than "umov; st1; str".

Yes, but thinking about the problem some more, I do think there are some vector cost model issues in the aarch64 backend where we don't model integer vs. floating-point cost differences.  For example, ^ on a scalar integer might be one cycle while the vector version is 4 cycles, whereas scalar floating-point addition is 4 cycles and the vector floating-point addition is also just 4 cycles.  A single scalar/vector cost pair cannot express that difference:
struct cpu_vector_cost
{
  const int scalar_stmt_cost;            /* Cost of any scalar operation,
                                            excluding load and store.  */
...

  const int vec_stmt_cost;               /* Cost of any vector operation,
                                            excluding load, store, permute,
                                            vector-to-scalar and
                                            scalar-to-vector operation.  */


Anyways I filed PR 79262 for the regression.