This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Hi,

This patch brings over from autovect-branch the ability to vectorize outer loops (doubly-nested loops). Here's an example of a loop that can be vectorized with this optimization (it's an FIR filter, extremely common in multimedia applications):

    for (i=0; i<N; i++){
      s = 0;
      for (j=0; j<M; j++)
        s += in[i+j] * coeffs[j];
      out[i] = s;
    }

...when outer-loop vectorized it would look something like this:

    for (i=0; i<N; i+=4){
      vs = [0,0,0,0];
      for (j=0; j<M; j++)
        vs += in[i+j, (i+1)+j, (i+2)+j, (i+3)+j] * coeffs[j, j, j, j];
      out[i, i+1, i+2, i+3] = vs;
    }

Note that the inner loop still executes M iterations sequentially, but in each inner-loop iteration we do 4 consecutive outer-loop iterations together.

Why is it better to vectorize the outer loop in this case, rather than the inner loop?

* Correctness: sometimes you can't vectorize the inner loop, because it computes a reduction (e.g., a summation of floats, as in the example above) and you are not allowed to change the order of the computation. Inner-loop vectorization would change the order of the computation, but with outer-loop vectorization the reduction in the inner loop is computed in the original order (in parallel for 4 different outputs).

* Performance: Inner-loop vectorization would create a reduction epilog (to reduce the vector of partial results into a scalar result) after the inner loop, executed in each outer-loop iteration. Outer-loop vectorization does not require a reduction epilog after the inner loop. And of course, with outer-loop vectorization everything is vectorized (both inside and outside the inner loop).

There are potentially other ways to vectorize outer loops. One is performing loop interchange, but note that we have a non-perfect nest here, which makes it much more difficult both to interchange and to vectorize afterwards.
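To make the transformation concrete, here is a hand-written C sketch of both versions of the FIR filter, using GCC's generic vector extensions to stand in for what the vectorizer would generate (the function names, the fixed vector width of 4 floats, and the assumption that N is a multiple of 4 are illustration choices of mine, not part of the patch):

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Scalar FIR filter, as in the example above.  */
static void
fir_scalar (int n, int m, const float *in, const float *coeffs, float *out)
{
  for (int i = 0; i < n; i++)
    {
      float s = 0.0f;
      for (int j = 0; j < m; j++)
        s += in[i + j] * coeffs[j];
      out[i] = s;
    }
}

/* Outer-loop-vectorized version: four consecutive outer-loop
   iterations run in parallel, one per vector lane.  The inner loop
   still executes m iterations, and each lane accumulates its own sum
   in the original order, so no reduction epilog is needed after the
   inner loop.  Assumes n is a multiple of 4.  */
static void
fir_outer_vec (int n, int m, const float *in, const float *coeffs, float *out)
{
  for (int i = 0; i < n; i += 4)
    {
      v4sf vs = { 0.0f, 0.0f, 0.0f, 0.0f };
      for (int j = 0; j < m; j++)
        {
          v4sf va = { in[i + j], in[i + 1 + j],
                      in[i + 2 + j], in[i + 3 + j] };
          v4sf vb = { coeffs[j], coeffs[j], coeffs[j], coeffs[j] };
          vs += va * vb;
        }
      out[i] = vs[0];
      out[i + 1] = vs[1];
      out[i + 2] = vs[2];
      out[i + 3] = vs[3];
    }
}
```

Because each output's summation order is unchanged, the two versions agree even for floating point (absent FMA contraction), which is exactly the correctness argument above.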
Another, I think more promising, way is doing unroll-and-jam (unroll the outer loop and jam together the copies of the inner loop), and then applying SLP in the outer loop (and inner loop) on top of that. Operating on the outer loop as a whole has the advantage of a global view of the memory ranges accessed in different loop iterations, which can be exploited for data reuse and better alignment handling. Potentially such optimizations could also be implemented separately, after vectorization, on top of the SLP-ed code. When we have unroll-and-jam and a mature enough SLP in GCC, it would be interesting to give that a try and compare the two schemes.

Attached below is the complete patch (almost half of it is testcases); however, I plan to commit it in two parts. The first part brings in the most basic/minimal outer-loop vectorization support - it can vectorize outer loops only if there are no memory references in the inner loop. The second patch brings in the support for memory references in the inner loop. Each of the patches was bootstrapped on powerpc64-linux, bootstrapped with vectorization enabled on i386-linux, and passed full regression testing on both platforms. I'll submit the two parts tomorrow, and will wait at least a week before committing them, to allow people to review and comment.

The main features that are not supported yet (and that I will continue to implement) are:
- multiple types in the inner-loop
- strided accesses in the inner-loop
- things that require peeling/versioning of nested loops (unknown loop bound in the outer-loop, and sometimes misaligned accesses in the outer-loop)
- reduction detection improvements

thanks,
dorit

(See attached file: mainlineouterloopdiff123t.txt)
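[Editorial note: the unroll-and-jam scheme mentioned in the message can be sketched in plain C as follows. This is a hand-written illustration, not compiler output; the function name and the unroll factor of 4 are illustrative choices. The outer loop is unrolled by 4 and the four copies of the inner loop are jammed into one, leaving four isomorphic statements in the inner-loop body - exactly the kind of group an SLP pass could pack into vector operations.]

```c
/* The FIR nest after unroll-and-jam by 4.  Assumes n is a multiple
   of 4.  */
static void
fir_unroll_and_jam (int n, int m, const float *in, const float *coeffs,
                    float *out)
{
  for (int i = 0; i < n; i += 4)
    {
      float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
      for (int j = 0; j < m; j++)
        {
          /* Four isomorphic statements - candidates for SLP packing.  */
          s0 += in[i + 0 + j] * coeffs[j];
          s1 += in[i + 1 + j] * coeffs[j];
          s2 += in[i + 2 + j] * coeffs[j];
          s3 += in[i + 3 + j] * coeffs[j];
        }
      out[i + 0] = s0;
      out[i + 1] = s1;
      out[i + 2] = s2;
      out[i + 3] = s3;
    }
}
```

Note that, like outer-loop vectorization, this keeps each output's reduction in its original sequential order; the difference is that the vectorization opportunity is left for SLP to discover in the straight-line jammed body.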
Attachment:
mainlineouterloopdiff123t.txt
Description: Text document