This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
| Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
|---|---|---|
| Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
| Other format: | [Raw text] | |
Hi,
This patch brings over from autovect-branch the ability to vectorize
outer-loops (doubly-nested loops).
Here's an example for a loop that could be vectorized with this
optimization (it's an FIR-filter, extremely common in multimedia
applications):
for (i=0; i<N; i++){
s=0;
for (j=0; j<M; j+=4)
s += in[i+j] * coeffs[j];
out[i]=s;
}
...when outer-loop-vectorized it would look something like this:
for (i=0; i<N; i+=4){
vs=[0,0,0,0]
for (j=0; j<M; j+=4)
vs += a[i+j,(i+1)+j,(i+2)+j,(i+3)+j] * b[j,j,j,j]
a[i,i+1,i+2,i+3] = vs
}
Note that the inner-loop still executes M iterations sequentially, but in
each inner-loop iteration we do 4 consecutive outer-loop iterations
together.
Why is it better to vectorize the outer-loop in this case, rather than the
inner-loop?
* Correctness: sometimes you can't vectorize the inner-loop, cause maybe it
computes a reduction (e.g., summation of floats, like in the example
above), and you are not allowed to change the order of the computation.
Inner-loop vectorization would change the order of the computation, but
with outer-loop vectorization the reduction in the inner-loop will be
computed in the original order (in parallel for 4 different outputs).
* Performance: Inner-loop vectorization would create a reduction epilog (to
reduce the vector of partial results into a scalar result) after the
inner-loop (executed in each outer-loop iteration). Outer-loop
vectorization does not require a reduction epilog after the inner-loop. And
of course, with outer-loop vectorization everything is vectorized (both in
and out of the inner-loop).
There may be potentially other ways to vectorize outer-loops. One is
performing loop-interchange, but note that we have a non-perfect nest here,
which makes it much more difficult both to interchange, and to vectorize
after that. Another, I think more promising way, is doing unroll-and-jam
(unroll the outer-loop and jam together the copies of the inner-loop), and
then apply SLP in the outer-loop (and inner-loop) on top of that. Operating
on the outer-loop as a whole has the advantage of having a global view of
the memory ranges accessed in different loop-iteration, to exploit
data-reuse and improve alignment handling. Potentially such optimizations
could also be implemented separately after vectorization on top of SLP-ed
code. When we have unroll-and-jam and mature enough SLP in GCC, it would be
interesting to give that a try and compare the two schemes.
Attached below is the complete patch (almost half of it is testcases),
however I plan to commit it in two parts: The first part brings in the most
basic/minimal outer-loop vectorization support - it can vectorize
outerloops only if there are no memory-references in the inner-loop. The
second patch brings in the support for memory-references in the inner-loop.
Each of the patches was bootstrapped on powerp64-linux,
bootstraped with vectorization enabled on i386-linux,
and passed full regression testing on both platforms.
I'll submit the two parts tomorrow, and will wait at least a week before
committing them to allow people to review and comment.
The main features that are not supported yet (and that I will continue to
implment) are:
- multiple types in the inner-loop
- strided-accesses in the inner-loop
- things that require peeling/versioning of nested loops (unknown loop
bound in the outer-loop, and sometimes misaligned accesses in the
outer-loop)
- reduction detection improvements
thanks,
dorit
(See attached file: mainlineouterloopdiff123t.txt)Attachment:
mainlineouterloopdiff123t.txt
Description: Text document
| Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
|---|---|---|
| Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |