18437 – vectorizer failed for matrix multiplication

Summary: vectorizer failed for matrix multiplication

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	4.0.0

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer

Reported:	2004-11-12 01:22 UTC by Giovanni Bajo
Modified:	2023-08-04 20:20 UTC
CC List:	2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2011-05-22 17:36:32

Description Giovanni Bajo 2004-11-12 01:22:17 UTC

Vectorizer fails to handle this:

----------------------------------------------------
#define align(x) __attribute__((align(x)))
typedef float align(16) MATRIX[3][3];

void RotateMatrix(MATRIX ret, MATRIX a, MATRIX b)
{
  int i, j;

  for (j = 0; j < 3; j++)
    for (i = 0; i < 3; i++)
      ret[j][i] =   a[j][0] * b[0][i]
                  + a[j][1] * b[1][i]
                  + a[j][2] * b[2][i];
}
----------------------------------------------------

loop at bench.cc:33: not vectorized: unsupported scalar cycle.
loop at bench.cc:33: bad scalar cycle.

Comment 1 Andrew Pinski 2004-11-12 02:43:04 UTC

Confirmed. ICC can do this but does not, because it is not very efficient to do so.

Comment 2 Andrew Pinski 2005-02-11 14:12:15 UTC

We now get:
t3.c:9: note: not vectorized: can't determine dependence between: (*D.1338_16)[0] and (*D.1336_10)[i_53]

Comment 3 Andrew Pinski 2005-09-20 17:44:17 UTC

Oh, the issue here is that a, b, and ret could all point to the same array, because the parameter type is (float[3])*, i.e. arrayptr in:
typedef float array[3];
typedef array *arrayptr;

If we change ret, a, and b to be global variables, then the loop could be vectorized, except that we then get:
t.c:11: note: not vectorized: iteration count too small.
t.c:11: note: bad operation or unsupported loop bound.
t.c:11: note: vectorized 0 loops in function.
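
One way to state the no-aliasing assumption without moving the arrays to globals is C99 `restrict`. The following is only a sketch: the names `Row3` and `rotate_matrix_restrict` are made up for illustration, and whether any given GCC version then vectorizes the loop is not claimed here (the iteration count of 3 noted above remains a separate obstacle).

```c
#include <assert.h>

typedef float Row3[3];   /* hypothetical name for one matrix row */

/* Same computation as RotateMatrix, but each parameter is a
   restrict-qualified pointer to a row, promising the compiler that
   ret, a and b do not overlap. */
void rotate_matrix_restrict(Row3 *restrict ret,
                            Row3 *restrict a,
                            Row3 *restrict b)
{
    int i, j;

    /* The restrict promise is the caller's responsibility; this only
       catches the most obvious violation. */
    assert(ret != a && ret != b);

    for (j = 0; j < 3; j++)
        for (i = 0; i < 3; i++)
            ret[j][i] =   a[j][0] * b[0][i]
                        + a[j][1] * b[1][i]
                        + a[j][2] * b[2][i];
}
```

Calling it with overlapping arguments would be undefined behaviour, which is exactly the freedom the dependence analysis lacks for the unqualified signature.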

Comment 4 Steven Bosscher 2011-05-22 15:36:52 UTC

Test case of comment #0 is not vectorized in recent GCC:

     1	#define align(x) __attribute__((align(x)))
     2	typedef float align(16) MATRIX[3][3];
     3	 
     4	void RotateMatrix(MATRIX ret, MATRIX a, MATRIX b)
     5	{
     6	  int i, j;
     7	 
     8	  for (j = 0; j < 3; j++)
     9	    for (i = 0; i < 3; i++)
    10	      ret[j][i] =   a[j][0] * b[0][i]
    11	                  + a[j][1] * b[1][i]
    12	                  + a[j][2] * b[2][i];
    13	}


t.c:8: note: not vectorized: loop contains function calls or data references that cannot be analyzed
t.c:8: note: bad data references.
t.c:4: note: vectorized 0 loops in function.

"GCC: (GNU) 4.6.0 20110312 (experimental) [trunk revision 170907]"

Comment 5 Richard Biener 2011-07-27 12:38:20 UTC

The initial testcase is probably a bad example (3x3 matrix).  The following
testcase is borrowed from Polyhedron rnflow and is vectorized by ICC but
not by GCC (the ICC variant is 15% faster):

      function trs2a2 (j, k, u, d, m)
      real, dimension (1:m,1:m) :: trs2a2  
      real, dimension (1:m,1:m) :: u, d
      integer, intent (in)      :: j, k, m
      real (kind = selected_real_kind (10,50)) :: dtmp
      trs2a2 = 0.0
      do iclw1 = j, k - 1
         do iclw2 = j, k - 1
            dtmp = 0.0d0
            do iclww = j, k - 1
               dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2)
            enddo
            trs2a2 (iclw1, iclw2) = dtmp
         enddo
      enddo
      return
      end function trs2a2

The reason GCC cannot vectorize this is that the load from U has
a non-constant stride, so vectorization would need to load two scalars
and build up a vector (ICC does that).  If the stride were constant
but not a power of two, GCC would reject that as well, probably to avoid
confusing the interleaving code.  Data dependence analysis also rejects
non-constant strides.

A further complication (for the cost model) is that the accumulator has
type double while the data are float.  ICC uses only
half of the float vectors here to handle mixed float/double type
loops (but it still unrolls the loop).
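
To make the access pattern concrete, here is a rough C rendering of the kernel. It is a hand-written sketch, not compiler output: `trs2a2_c`, `cm_index` and the `out` parameter are hypothetical names, the arrays are assumed to be stored column-major as in Fortran, and the two-at-a-time inner loop only imitates "load two scalars and build up a vector" with ordinary scalar code.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of the 1-based element (row, col) of an m-by-m
   array, matching the Fortran layout of u, d and the result. */
static size_t cm_index(int row, int col, int m)
{
    return (size_t)(row - 1) + (size_t)(col - 1) * (size_t)m;
}

/* Hypothetical C rendering of trs2a2; `out` plays the role of the
   Fortran function result. */
void trs2a2_c(int j, int k, int m,
              const float *u, const float *d, float *out)
{
    assert(m > 0 && j >= 1 && k - 1 <= m);

    for (size_t n = 0; n < (size_t)m * (size_t)m; n++)
        out[n] = 0.0f;

    for (int r = j; r <= k - 1; r++) {
        for (int c = j; c <= k - 1; c++) {
            double acc = 0.0;   /* double accumulator, float data */
            int w = j;

            /* Two iterations at a time: gather two scalars per operand
               into a small two-lane "vector" by hand. */
            for (; w + 1 <= k - 1; w += 2) {
                float u_lane[2] = { u[cm_index(r, w, m)],       /* stride m */
                                    u[cm_index(r, w + 1, m)] };
                float d_lane[2] = { d[cm_index(w, c, m)],       /* stride 1 */
                                    d[cm_index(w + 1, c, m)] };
                /* Accumulate in the original order: no reassociation. */
                acc += (double)(u_lane[0] * d_lane[0]);
                acc += (double)(u_lane[1] * d_lane[1]);
            }

            /* Scalar remainder for an odd trip count. */
            for (; w <= k - 1; w++)
                acc += (double)(u[cm_index(r, w, m)] * d[cm_index(w, c, m)]);

            out[cm_index(r, c, m)] = (float)acc;
        }
    }
}
```

The consecutive `u` elements used by the inner loop are `m` floats apart, and `m` is only known at run time, which is the non-constant stride described above; the `d` elements are contiguous.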

Comment 6 Michael Matz 2012-04-17 13:54:36 UTC

Author: matz
Date: Tue Apr 17 13:54:26 2012
New Revision: 186530

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=186530
Log:
	PR tree-optimization/18437

	* tree-vectorizer.h (_stmt_vec_info.stride_load_p): New member.
	(STMT_VINFO_STRIDE_LOAD_P): New accessor.
	(vect_check_strided_load): Declare.
	* tree-vect-data-refs.c (vect_check_strided_load): New function.
	(vect_analyze_data_refs): Use it to accept strided loads.
	* tree-vect-stmts.c (vectorizable_load): Ditto and handle them.

testsuite/
	* gfortran.dg/vect/rnflow-trs2a2.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/vect/rnflow-trs2a2.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
    trunk/gcc/tree-vectorizer.h

Comment 7 Richard Biener 2012-05-09 12:59:49 UTC

Author: rguenth
Date: Wed May  9 12:59:46 2012
New Revision: 187330

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187330
Log:
2012-05-09  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/18437
	* gfortran.dg/vect/rnflow-trs2a2.f90: Move ...
	* gfortran.dg/vect/fast-math-rnflow-trs2a2.f90: ... here.

Added:
    trunk/gcc/testsuite/gfortran.dg/vect/fast-math-rnflow-trs2a2.f90
      - copied unchanged from r187329, trunk/gcc/testsuite/gfortran.dg/vect/rnflow-trs2a2.f90
Removed:
    trunk/gcc/testsuite/gfortran.dg/vect/rnflow-trs2a2.f90
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 8 Richard Biener 2012-07-13 08:49:47 UTC

Link to vectorizer missed-optimization meta-bug.

Comment 9 Andrew Pinski 2023-08-04 20:19:12 UTC

For the original testcase in comment #0, with `-O3 -fno-vect-cost-model` GCC can vectorize it on aarch64 but not on x86_64.

Comment 10 Andrew Pinski 2023-08-04 20:20:16 UTC

(In reply to Andrew Pinski from comment #9)
> For the original testcase in comment #0, with `-O3 -fno-vect-cost-model` GCC
> can vectorize it on aarch64 but not on x86_64.

I should say starting in GCC 6.