Bug 37021 - Fortran Complex reduction / multiplication not vectorized
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.4.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Richard Biener
Keywords: alias, missed-optimization
Depends on: 43434 54939 56902
Blocks: vectorizer
 
Reported: 2008-08-04 17:56 UTC by Richard Biener
Modified: 2014-05-15 12:10 UTC (History)
6 users

Last reconfirmed: 2009-01-21 15:43:08


Description Richard Biener 2008-08-04 17:56:06 UTC
subroutine to_product_of(self,a,b,a1,a2)
  complex(kind=8) :: self (:)
  complex(kind=8), intent(in) :: a(:,:)
  complex(kind=8), intent(in) :: b(:)
  integer a1,a2
  self = ZERO
  do i = 1,a1
    do j = 1,a2
      self(i) = self(i) + a(i,j)*b(j)
    end do
  end do
end subroutine

There are numerous reasons this is not vectorized at the moment:
 1) alias analysis does not disambiguate the data of self, a and b
 2) the reduction is obfuscated by extra SSA_NAME copies from store-motion
    (after you fix 1)
 3) REALPART_EXPR < (*ptr)[i] > is not handled by data dependency analysis
 4) ... ?
Comment 1 Richard Biener 2008-08-04 17:58:30 UTC
Patch for 1) http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00221.html
Patch for 2) http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00226.html

I haven't started on 3) yet, so 4) is still unknown (as are 5, 6, ... ;))
Comment 2 Richard Biener 2008-08-19 15:29:59 UTC
3) is because data-ref requires a constant step

  else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "failed: evolution of offset is not affine.\n");
      return;

but the step is (<unnamed-signed:64>) ((<unnamed-unsigned:64>) stride.3_36 * 16)

as we are dealing with general incoming arrays which are arbitrarily strided.
Fixing this requires, for example, versioning for a constant stride.
Comment 3 Richard Biener 2009-01-21 15:43:08 UTC
Mine.  I am working on adding versioning for non-constant strides.
Comment 4 Richard Biener 2009-01-23 15:33:06 UTC
The testcase should be

subroutine to_product_of(self,a,b,a1,a2)
  complex(kind=8) :: self (:)
  complex(kind=8), intent(in) :: a(:,:)
  complex(kind=8), intent(in) :: b(:)
  integer a1,a2
  do i = 1,a1
    do j = 1,a2
      self(i) = self(i) + a(j,i)*b(j)
    end do
  end do
end subroutine

to be meaningful - otherwise we are accessing a in a non-contiguous way in the
inner loop, which would prevent vectorization.

With the versioning for stride == 1 I get then

.L13:
        movupd  16(%rax), %xmm1
        movupd  (%rax), %xmm3
        incl    %ecx
        movupd  (%rdx), %xmm4
        addq    $32, %rax
        movapd  %xmm3, %xmm0
        unpckhpd        %xmm1, %xmm3
        unpcklpd        %xmm1, %xmm0
        movupd  16(%rdx), %xmm1
        movapd  %xmm4, %xmm2
        addq    $32, %rdx
        movapd  %xmm3, %xmm9
        cmpl    %ecx, %r8d
        unpcklpd        %xmm1, %xmm2
        unpckhpd        %xmm1, %xmm4
        movapd  %xmm4, %xmm1
        movapd  %xmm2, %xmm4
        mulpd   %xmm1, %xmm9
        mulpd   %xmm0, %xmm4
        mulpd   %xmm3, %xmm2
        mulpd   %xmm1, %xmm0
        subpd   %xmm9, %xmm4
        addpd   %xmm2, %xmm0
        addpd   %xmm4, %xmm6
        addpd   %xmm0, %xmm5
        ja      .L13
        haddpd  %xmm5, %xmm5
        cmpl    %r15d, %edi
        movl    -4(%rsp), %ecx
        haddpd  %xmm6, %xmm6
        addsd   %xmm5, %xmm8
        addsd   %xmm6, %xmm7
        jne     .L12
        jmp     .L14

for the innermost loop, followed by a tail loop (peel for niters).  This is
about 15% faster on AMD K10 than the non-vectorized loop (if you disable
the cost-model and make sure to have enough iterations in the inner loop
to pay back for the extra guarding conditions).
Comment 5 Richard Biener 2009-01-23 15:36:07 UTC
So,

 4) The vectorized version sucks because we have to use peeling for niters
    because we need to unroll the loop once and cannot apply SLP here.

Q1: does SLP work with reductions at all?
Q2: does SLP do pattern recognition?

First of all we would need to recognize a complex reduction as a single
vectorized reduction.  Second we need to vectorize the complex multiplication
with SLP, feeding the reduction with one resulting complex vector.
Comment 6 Ira Rosen 2009-01-25 09:12:54 UTC
(In reply to comment #5)
> So,
>  4) The vectorized version sucks because we have to use peeling for niters
>     because we need to unroll the loop once and cannot apply SLP here.

What do you mean by "unroll the loop once"?

> Q1: does SLP work with reductions at all?

No. SLP currently originates from groups of strided stores.

> Q2: does SLP do pattern recognition?

Pattern recognition is done before SLP, and SLP handles stmts that were marked as part of a pattern. There is no SLP-specific pattern recognition.

> First of all we would need to recognize a complex reduction as a single
> vectorized reduction.  Second we need to vectorize the complex multiplication
> with SLP, feeding the reduction with one resulting complex vector.

Comment 7 rguenther@suse.de 2009-01-25 11:04:23 UTC
Subject: Re:  Fortran Complex reduction /
 multiplication not vectorized

On Sun, 25 Jan 2009, irar at il dot ibm dot com wrote:

> 
> 
> ------- Comment #6 from irar at il dot ibm dot com  2009-01-25 09:12 -------
> (In reply to comment #5)
> > So,
> >  4) The vectorized version sucks because we have to use peeling for niters
> >     because we need to unroll the loop once and cannot apply SLP here.
> 
> What do you mean by "unroll the loop once"?

The vectorization factor is two, so we need two copies of the loop body
(which means unrolling it once and creating an epilogue loop, because
niter may be odd).

> > Q1: does SLP work with reductions at all?
> 
> No. SLP currently originates from groups of strided stores.

Ah, I see.  In this loop we have two reductions, so to apply SLP
we would need to see that we can use a group of reductions for SLP?

> > Q2: does SLP do pattern recognition?
> 
> Pattern recognition is done before SLP, and SLP handles stmts that were marked
> as part of a pattern. There is no SLP-specific pattern recognition.

Ok, but with a reduction it won't help me here.

Can a loop be vectorized with just pattern recognition?  Hm, if I
remember correctly we detect scalar patterns and then vectorize them.
We don't support detecting "vector patterns" from scalar code, correct?

Thanks,
Richard.
Comment 8 Ira Rosen 2009-01-25 12:17:29 UTC
(In reply to comment #7)
> > > Q1: does SLP work with reductions at all?
> > 
> > No. SLP currently originates from groups of strided stores.
> Ah, I see.  In this loop we have two reductions, so to apply SLP
> we would need to see that we can use a group of reductions for SLP?

Yes, I think this will work.

> > > Q2: does SLP do pattern recognition?
> > 
> > Pattern recognition is done before SLP, and SLP handles stmts that were marked
> > as part of a pattern. There is no SLP-specific pattern recognition.
> Ok, but with a reduction it won't help me here.
> Can a loop be vectorized with just pattern recognition?  Hm, if I
> remember correctly we detect scalar patterns and then vectorize them.
> We don't support detecting "vector patterns" from scalar code, correct?

Yes, if I understand you correctly, we detect scalar patterns, but adding vector pattern detection does not seem to be complicated.

Comment 9 dorit 2009-01-27 12:40:26 UTC
(In reply to comment #4)
> The testcase should be
> subroutine to_product_of(self,a,b,a1,a2)
>   complex(kind=8) :: self (:)
>   complex(kind=8), intent(in) :: a(:,:)
>   complex(kind=8), intent(in) :: b(:)
>   integer a1,a2
>   do i = 1,a1
>     do j = 1,a2
>       self(i) = self(i) + a(j,i)*b(j)
>     end do
>   end do
> end subroutine
> to be meaningful - otherwise we are accessing a in non-continuous ways in the
> inner loop which would prevent vectorization.

this change from a(i,j) to a(j,i) is not required if we try to vectorize the outer loop, where the stride is 1. Outer-loop vectorization is also a better way to vectorize the reduction. A few limitations along the way, though, are:

1) somehow don't let gcc create guard code around the innermost loop to check that it executes more than zero iterations. This creates a complicated control-flow structure within the outer loop. Because of that, for now you have to have a constant number of iterations for the inner loop, or insert a statement like "if (a2<=0) return;" before the loop...

2) use -fno-tree-sink, because otherwise it moves the loop IV increment to the latch block, and the vectorizer likes to have the latch block empty...

(see also PR33113 for related reference).


> With the versioning for stride == 1 I get then
> .L13:
>         movupd  16(%rax), %xmm1
>         movupd  (%rax), %xmm3
>         incl    %ecx
>         movupd  (%rdx), %xmm4
>         addq    $32, %rax
>         movapd  %xmm3, %xmm0
>         unpckhpd        %xmm1, %xmm3
>         unpcklpd        %xmm1, %xmm0
>         movupd  16(%rdx), %xmm1
>         movapd  %xmm4, %xmm2
>         addq    $32, %rdx
>         movapd  %xmm3, %xmm9
>         cmpl    %ecx, %r8d
>         unpcklpd        %xmm1, %xmm2
>         unpckhpd        %xmm1, %xmm4
>         movapd  %xmm4, %xmm1
>         movapd  %xmm2, %xmm4
>         mulpd   %xmm1, %xmm9
>         mulpd   %xmm0, %xmm4
>         mulpd   %xmm3, %xmm2
>         mulpd   %xmm1, %xmm0
>         subpd   %xmm9, %xmm4
>         addpd   %xmm2, %xmm0
>         addpd   %xmm4, %xmm6
>         addpd   %xmm0, %xmm5
>         ja      .L13
>         haddpd  %xmm5, %xmm5
>         cmpl    %r15d, %edi
>         movl    -4(%rsp), %ecx
>         haddpd  %xmm6, %xmm6
>         addsd   %xmm5, %xmm8
>         addsd   %xmm6, %xmm7
>         jne     .L12
>         jmp     .L14
> for the innermost loop, followed by a tail loop (peel for niters).  This is
> about 15% faster on AMD K10 than the non-vectorized loop (if you disable
> the cost-model and make sure to have enough iterations in the inner loop
> to pay back for the extra guarding conditions).

Comment 10 sebastian.hegler 2011-03-25 10:45:47 UTC
This one, as well as PR 33133, should be handled by "-floop-interchange". 

Fortran is column-major, so interchanging inner and outer loop would allow the loops to be coalesced into one, which in turn should be easily vectorized (if complex numbers can be vectorized, see PR 40770). 

Can you please give me some hints on how to find out if "-floop-interchange" actually does that? Thanks!
Comment 11 sebastian.hegler 2011-03-25 11:38:37 UTC
Forget what I said about folding the loops into one; I hadn't had my morning coffee yet. However, the rest still applies. 

I'm looking forward to some help in that regard.

Thanks.
Comment 12 rguenther@suse.de 2011-03-25 12:40:10 UTC
On Fri, 25 Mar 2011, sebastian.hegler@tu-dresden.de wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
> 
> --- Comment #11 from sebastian.hegler@tu-dresden.de 2011-03-25 11:38:37 UTC ---
> Forget that about folding stuff into one loop, I didn't have my morning coffee
> yet. However, the rest still applies. 
> 
> I'm looking forward to some help in that regard.

Look at dump files (-fdump-tree-all-details).

Richard.
Comment 13 Richard Biener 2012-07-13 08:45:12 UTC
Link to vectorizer missed-optimization meta-bug.
Comment 14 Richard Biener 2013-02-13 15:58:31 UTC
The following testcase shows the issue well:

_Complex double self[1024];
_Complex double a[1024][1024];
_Complex double b[1024];

void foo (void)
{
  int i, j;
  for (i = 0; i < 1024; i+=3)
    for (j = 0; j < 1024; j+=3)
      self[i] = self[i] + a[i][j]*b[j];
}

we have to get the complex multiplication pattern recognized by SLP
which looks like (without PRE):

  <bb 3>:

  <bb 4>:
  # j_21 = PHI <j_13(3), 0(7)>
  # self_I_RE_lsm.2_12 = PHI <_26(3), self_I_RE_lsm.2_7(7)>
  # self_I_IM_lsm.3_28 = PHI <_27(3), self_I_IM_lsm.3_8(7)>
  # ivtmp_16 = PHI <ivtmp_1(3), 342(7)>
  _2 = REALPART_EXPR <a[i_20][j_21]>;
  _18 = IMAGPART_EXPR <a[i_20][j_21]>;
  _19 = REALPART_EXPR <b[j_21]>;
  _17 = IMAGPART_EXPR <b[j_21]>;
  _4 = _19 * _2;
  _3 = _18 * _17;
  _6 = _17 * _2;
  _23 = _19 * _18;
  _24 = _4 - _3;
  _25 = _23 + _6;
  _26 = _24 + self_I_RE_lsm.2_12;
  _27 = _25 + self_I_IM_lsm.3_28;
  j_13 = j_21 + 3;
  ivtmp_1 = ivtmp_16 - 1;
  if (ivtmp_1 != 0)
    goto <bb 3>;

we fail to build the SLP tree for _25 = _23 + _6 because the matching
stmt is _24 = _4 - _3, which has a different operation (SSE3 addsub
would support vectorizing this).  I don't see how we can easily
make this supported with the current pattern support ... the
support doesn't allow tying together two SLP group members.
Simply allowing analysis to proceed here reveals that
the interleaving has a gap of 6, which makes the analysis fail.
Allowing it to proceed for ncopies == 1 (thus no actual interleaving
required) reveals that the next check is slightly bogus in that case.
Fixing that leaves us with

t.c:9: note: Load permutation 0 0 1 0 1 1 0 1
t.c:9: note: Build SLP failed: unsupported load permutation _27 = _25 + self_I_IM_lsm.3_28;

... (to be continued)
Comment 15 Richard Biener 2013-03-27 10:39:00 UTC
Author: rguenth
Date: Wed Mar 27 10:38:29 2013
New Revision: 197158

URL: http://gcc.gnu.org/viewcvs?rev=197158&root=gcc&view=rev
Log:
2013-03-27  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/37021
	* tree-vect-data-refs.c (vect_check_strided_load): Allow
	REALPART/IMAGPART_EXPRs around the supported refs.
	* tree-ssa-structalias.c (find_func_aliases): Assume that
	floating-point values are not used to transfer pointers.

	* gfortran.dg/vect/fast-math-pr37021.f90: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/vect/fast-math-pr37021.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
    trunk/gcc/tree-vect-data-refs.c
Comment 16 Richard Biener 2013-03-27 10:40:40 UTC
We now vectorize this testcase by means of strided loads, relying on
store motion turning the store to self(i) in the innermost loop into a
reduction (there is no support for vectorized strided stores).
Comment 17 Dominique d'Humieres 2013-04-07 13:18:27 UTC
The test gfortran.dg/vect/fast-math-pr37021.f90 fails on powerpc*-* (see http://gcc.gnu.org/ml/gcc-testresults/2013-04/msg00677.html ). Isn't that expected? AFAICT doubles are not vectorized, at least on a G5.