This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Hi,

This patch brings over from autovect-branch the ability to vectorize outer loops (doubly-nested loops). Here's an example of a loop that can be vectorized with this optimization (it's an FIR filter, extremely common in multimedia applications):

    for (i=0; i<N; i++){
      s = 0;
      for (j=0; j<M; j++)
        s += in[i+j] * coeffs[j];
      out[i] = s;
    }

...when outer-loop vectorized it would look something like this:

    for (i=0; i<N; i+=4){
      vs = [0,0,0,0];
      for (j=0; j<M; j++)
        vs += in[i+j, (i+1)+j, (i+2)+j, (i+3)+j] * coeffs[j, j, j, j];
      out[i, i+1, i+2, i+3] = vs;
    }

Note that the inner loop still executes M iterations sequentially, but in each inner-loop iteration we do 4 consecutive outer-loop iterations together.

Why is it better to vectorize the outer loop in this case, rather than the inner loop?

* Correctness: sometimes you can't vectorize the inner loop, because it computes a reduction (e.g., a summation of floats, as in the example above) and you are not allowed to change the order of the computation. Inner-loop vectorization would change the order of the computation, but with outer-loop vectorization the reduction in the inner loop is computed in the original order (in parallel for 4 different outputs).

* Performance: Inner-loop vectorization would create a reduction epilog (to reduce the vector of partial results into a scalar result) after the inner loop, executed in each outer-loop iteration. Outer-loop vectorization does not require a reduction epilog after the inner loop. And of course, with outer-loop vectorization everything is vectorized (both inside and outside the inner loop).

There are potentially other ways to vectorize outer loops. One is performing loop interchange, but note that we have a non-perfect nest here, which makes it much more difficult both to interchange and to vectorize afterwards.
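To make the transformation concrete, here is a hand-written C sketch of both versions of the FIR filter, using GCC's generic vector extensions to stand in for what the vectorizer would generate (the function names, the fixed vector width of 4 floats, and the assumption that N is a multiple of 4 are illustration choices of mine, not part of the patch):

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Scalar FIR filter, as in the example above.  */
static void
fir_scalar (int n, int m, const float *in, const float *coeffs, float *out)
{
  for (int i = 0; i < n; i++)
    {
      float s = 0.0f;
      for (int j = 0; j < m; j++)
        s += in[i + j] * coeffs[j];
      out[i] = s;
    }
}

/* Outer-loop-vectorized version: four consecutive outer-loop
   iterations run in parallel, one per vector lane.  The inner loop
   still executes m iterations, and each lane accumulates its own sum
   in the original order, so no reduction epilog is needed after the
   inner loop.  Assumes n is a multiple of 4.  */
static void
fir_outer_vec (int n, int m, const float *in, const float *coeffs, float *out)
{
  for (int i = 0; i < n; i += 4)
    {
      v4sf vs = { 0.0f, 0.0f, 0.0f, 0.0f };
      for (int j = 0; j < m; j++)
        {
          v4sf va = { in[i + j], in[i + 1 + j],
                      in[i + 2 + j], in[i + 3 + j] };
          v4sf vb = { coeffs[j], coeffs[j], coeffs[j], coeffs[j] };
          vs += va * vb;
        }
      out[i] = vs[0];
      out[i + 1] = vs[1];
      out[i + 2] = vs[2];
      out[i + 3] = vs[3];
    }
}
```

Because each output's summation order is unchanged, the two versions agree even for floating point (absent FMA contraction), which is exactly the correctness argument above.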
Another, I think more promising, way is doing unroll-and-jam (unroll the outer loop and jam together the copies of the inner loop), and then applying SLP in the outer loop (and inner loop) on top of that. Operating on the outer loop as a whole has the advantage of a global view of the memory ranges accessed in different loop iterations, which can be exploited for data reuse and better alignment handling. Potentially such optimizations could also be implemented separately, after vectorization, on top of the SLP-ed code. When we have unroll-and-jam and a mature enough SLP in GCC, it would be interesting to give that a try and compare the two schemes.

Attached below is the complete patch (almost half of it is testcases); however, I plan to commit it in two parts. The first part brings in the most basic/minimal outer-loop vectorization support - it can vectorize outer loops only if there are no memory references in the inner loop. The second patch brings in the support for memory references in the inner loop. Each of the patches was bootstrapped on powerpc64-linux, bootstrapped with vectorization enabled on i386-linux, and passed full regression testing on both platforms. I'll submit the two parts tomorrow, and will wait at least a week before committing them, to allow people to review and comment.

The main features that are not supported yet (and that I will continue to implement) are:
- multiple types in the inner-loop
- strided accesses in the inner-loop
- things that require peeling/versioning of nested loops (unknown loop bound in the outer-loop, and sometimes misaligned accesses in the outer-loop)
- reduction detection improvements

thanks,
dorit

(See attached file: mainlineouterloopdiff123t.txt)
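[Editorial note: the unroll-and-jam scheme mentioned in the message can be sketched in plain C as follows. This is a hand-written illustration, not compiler output; the function name and the unroll factor of 4 are illustrative choices. The outer loop is unrolled by 4 and the four copies of the inner loop are jammed into one, leaving four isomorphic statements in the inner-loop body - exactly the kind of group an SLP pass could pack into vector operations.]

```c
/* The FIR nest after unroll-and-jam by 4.  Assumes n is a multiple
   of 4.  */
static void
fir_unroll_and_jam (int n, int m, const float *in, const float *coeffs,
                    float *out)
{
  for (int i = 0; i < n; i += 4)
    {
      float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
      for (int j = 0; j < m; j++)
        {
          /* Four isomorphic statements - candidates for SLP packing.  */
          s0 += in[i + 0 + j] * coeffs[j];
          s1 += in[i + 1 + j] * coeffs[j];
          s2 += in[i + 2 + j] * coeffs[j];
          s3 += in[i + 3 + j] * coeffs[j];
        }
      out[i + 0] = s0;
      out[i + 1] = s1;
      out[i + 2] = s2;
      out[i + 3] = s3;
    }
}
```

Note that, like outer-loop vectorization, this keeps each output's reduction in its original sequential order; the difference is that the vectorization opportunity is left for SLP to discover in the straight-line jammed body.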
Attachment:
mainlineouterloopdiff123t.txt
Description: Text document