[autovect] [patch] support misalignment in outer-loop vectorization

Dorit Nuzman DORIT@il.ibm.com
Tue Jun 5 09:53:00 GMT 2007


This patch adds support for vectorizing misaligned accesses in the
inner-loop during outer-loop vectorization. It does not yet include
peeling or versioning for alignment; instead it generates misaligned
vector moves (as for SSE) or explicit realignment (i.e. using
realign-load, as for Altivec and the SPU). With this patch we can vectorize
a FIR-filter when it is written in a vectorizer-friendly way (alignment-wise),
though not yet as efficiently as it could be vectorized.

A few words about the bigger picture:
In the explicit realignment case there are two scenarios to consider:
(1) The misalignment remains fixed throughout the iterations of the
inner-loop. This happens only when the stride of the access in the
inner-loop is a multiple of the vector size (VS), as is the case in the
following example, operating on floats (we are vectorizing the i-loop):

    for (k = 0; k < 4; k++) {
      for (i = 0; i < N; i++) {
        diff = 0;
        for (j = k; j < M; j+=4) {
          diff += in[j+i]*coeff[j];
        }
        out[i] += diff;
      }
    }

...since the stride in the inner-loop is 4 (== VS), the misalignment of the
'in' and 'coeff' accesses remains constant throughout the inner-loop.
(e.g., the misalignment is always 0 when k=0, it is always 1 when k=1 etc).
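
As a quick sanity check of the claim above, here is a minimal C sketch that
computes the misalignment (in elements) of each in[j+i] access in the
case (1) inner-loop. It assumes that 'in' itself is aligned on a vector
boundary and that VS is 4 elements; the helper names misalign_elems and
case1_misalignment_is_fixed are made up for illustration and are not part
of the patch.

```c
#include <assert.h>
#include <stddef.h>

enum { VS = 4 };  /* assumed vector size in elements (4 floats) */

/* Misalignment, in elements, of element index IDX relative to a base
   that is assumed to be aligned on a VS-element boundary.  */
static size_t misalign_elems (size_t idx)
{
  return idx % VS;
}

/* Return 1 if every access in[j+i] of the case (1) inner-loop
   (j = k; j < m; j += VS) has the same misalignment, else 0.
   I is the first lane of the vectorized outer-loop iteration.  */
static int case1_misalignment_is_fixed (size_t k, size_t i, size_t m)
{
  assert (k < VS);  /* k is the pass number of the outermost loop */
  size_t first = misalign_elems (k + i);
  for (size_t j = k; j < m; j += VS)
    if (misalign_elems (j + i) != first)
      return 0;
  return 1;
}
```

For i a multiple of VS, misalign_elems (k + i) is simply k, which matches
the "0 when k=0, 1 when k=1" observation.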

(2) The misalignment does *not* remain fixed throughout the iterations of
the loop. This happens when the stride of the access in the inner-loop is
*not* a multiple of VS, as is the case in the following example, also
operating on floats (again, we are vectorizing the i-loop):

     for (i = 0; i < N; i++) {
       diff = 0;
       for (j = 0; j < M; j++) {
         diff += in[j+i]*coeff[j];
       }
       fir_out[i] = diff;
     }

...the misalignment of the inner-loop accesses is 0,1,2,3,0,1,2,3,... for
j=0,1,2,3,4,5,6,7,... respectively (in other words it is different in
different j-loop iterations).
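
The same kind of check for case (2), again only as a sketch: it records the
misalignment (in elements) of in[j+i] for a unit-stride j-loop, assuming
'in' is aligned on a vector boundary and VS is 4 elements. The function
name case2_misalignments is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { VS = 4 };  /* assumed vector size in elements (4 floats) */

/* Store in mis[0..n-1] the misalignment, in elements, of in[j+i] for
   j = 0..n-1 (unit stride).  Assumes 'in' is aligned on a VS-element
   boundary.  */
static void case2_misalignments (size_t i, size_t n, size_t *mis)
{
  assert (mis != NULL);
  for (size_t j = 0; j < n; j++)
    mis[j] = (j + i) % VS;
}
```

With i = 0 and n = 8 this fills mis with 0,1,2,3,0,1,2,3, so no single
value can be hoisted out of the inner-loop.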


By the way, the two loop examples above are two ways to write the same
thing, which is basically an FIR-filter.
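
To illustrate that equivalence, here is a small self-contained sketch of
both forms as C functions. The function names fir_strided and fir_plain and
the explicit n/m parameters are illustrative only. Note that the case (1)
form accumulates into 'out', so this sketch assumes the caller has zeroed
it first.

```c
#include <assert.h>
#include <stddef.h>

/* Case (1) form: the j-loop is split into 4 passes with stride 4.
   OUT must be zero-initialized by the caller because each pass
   accumulates into it.  IN must hold at least n + m - 1 elements.  */
static void fir_strided (const float *in, const float *coeff, float *out,
                         size_t n, size_t m)
{
  assert (in && coeff && out);
  for (size_t k = 0; k < 4; k++)
    for (size_t i = 0; i < n; i++)
      {
        float diff = 0;
        for (size_t j = k; j < m; j += 4)
          diff += in[j + i] * coeff[j];
        out[i] += diff;
      }
}

/* Case (2) form: a single unit-stride j-loop.  */
static void fir_plain (const float *in, const float *coeff, float *out,
                       size_t n, size_t m)
{
  assert (in && coeff && out);
  for (size_t i = 0; i < n; i++)
    {
      float diff = 0;
      for (size_t j = 0; j < m; j++)
        diff += in[j + i] * coeff[j];
      out[i] = diff;
    }
}
```

The two forms add the products in a different order, so with general
floating-point inputs the results may differ in the last bits; they match
exactly when all intermediate sums are exactly representable.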


Case (1) can be vectorized using the optimized realignment-scheme (which
used to be called the "software-pipelined" scheme), roughly as follows:

            mis = p&0x3;
            v1 = vload (align_ref (p));
      inner_loop:
            v2 = vload (align_ref (p+VS-1));
            va = realign_load (v1, v2, mis);
            v1 = v2; p += VS;

...the computation of the misalignment can be taken out of the loop, and
only one additional vector load is generated in each iteration instead of 2
(we basically do predictive-commoning here). (This can be done only if the
stride of j == VS. If the stride is larger we still need 2 vloads in each
iteration).
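
The optimized scheme can be emulated in scalar C, which may help when
reasoning about it. This is only a sketch under several assumptions:
pointers are modeled as element indices into a float array, VS is 4
elements, a small struct stands in for a vector register, and the C
definitions of align_ref, vload and realign_load are illustrative (only the
names come from the pseudocode). In particular, this sketch has
realign_load return v2 when mis is 0; that is an assumption, but one the
scheme needs, since otherwise the carried-over v1 would lag one vector
behind for an aligned pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { VS = 4 };  /* assumed vector size in elements (4 floats) */

typedef struct { float e[VS]; } vec;  /* stand-in for a vector register */

/* Round element index P down to a VS-element boundary.  */
static size_t align_ref (size_t p) { return p & ~(size_t) (VS - 1); }

/* Aligned vector load of VS elements starting at ALIGNED_IDX.  */
static vec vload (const float *mem, size_t aligned_idx)
{
  vec v;
  memcpy (v.e, mem + aligned_idx, sizeof v.e);
  return v;
}

/* Select VS consecutive elements from the concatenation v1|v2, starting
   at offset MIS.  Assumption of this sketch: MIS == 0 selects v2.  */
static vec realign_load (vec v1, vec v2, size_t mis)
{
  assert (mis < VS);
  size_t start = mis ? mis : VS;
  vec r;
  for (size_t l = 0; l < VS; l++)
    {
      size_t src = start + l;
      r.e[l] = src < VS ? v1.e[src] : v2.e[src - VS];
    }
  return r;
}

/* Read NVEC vectors starting at the possibly misaligned element index P,
   with stride VS, issuing one aligned load per iteration.  */
static void load_stream_optimized (const float *mem, size_t p, size_t nvec,
                                   float *dst)
{
  size_t mis = p % VS;                   /* hoisted out of the loop */
  vec v1 = vload (mem, align_ref (p));   /* prologue load */
  for (size_t n = 0; n < nvec; n++)
    {
      vec v2 = vload (mem, align_ref (p + VS - 1));
      vec va = realign_load (v1, v2, mis);
      memcpy (dst + n * VS, va.e, sizeof va.e);
      v1 = v2;                           /* reuse this load next time */
      p += VS;
    }
}
```

The caller has to make sure 'mem' is readable up to the end of the last
aligned block that gets touched, since the aligned loads may reach past the
last element actually used.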

Case (2) cannot be vectorized using the optimized scheme. Instead we need
to compute the misalignment inside the loop along with the two vector
loads, as follows:

      inner_loop:
            mis = p&0x3;
            v1 = vload (align_ref (p));
            v2 = vload (align_ref (p+VS-1));
            va = realign_load (v1, v2, mis);
            v1 = v2; p += VS;

(...another alternative would be to unroll the loop by 4 (effectively
transforming it to the loop in case (1)) to do something more efficient,
but this is Future Work).
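
For comparison, the un-optimized scheme can be sketched in scalar C as
follows. Same caveats as any emulation: pointers are modeled as element
indices into a float array, VS is assumed to be 4 elements, and the names
realigned_load and load_stream_unoptimized (as well as the explicit stride
parameter) are hypothetical. The misalignment and both aligned blocks are
recomputed for every vector, so any stride works.

```c
#include <assert.h>
#include <stddef.h>

enum { VS = 4 };  /* assumed vector size in elements (4 floats) */

/* One explicitly realigned load of mem[p..p+VS-1].  Only the two aligned
   blocks that contain the requested elements are read.  */
static void realigned_load (const float *mem, size_t p, float *out)
{
  size_t mis = p % VS;                     /* recomputed for every load */
  /* Block at align_ref (p).  */
  const float *v1 = mem + (p - mis);
  /* Block at align_ref (p+VS-1); same block as v1 when mis == 0.  */
  const float *v2 = mem + ((p + VS - 1) & ~(size_t) (VS - 1));
  for (size_t l = 0; l < VS; l++)
    out[l] = mis + l < VS ? v1[mis + l] : v2[mis + l - VS];
}

/* Read NVEC vectors starting at element index P, advancing by STRIDE
   elements each time (STRIDE == 1 corresponds to case (2)).  */
static void load_stream_unoptimized (const float *mem, size_t p,
                                     size_t stride, size_t nvec, float *dst)
{
  assert (stride > 0);
  for (size_t n = 0; n < nvec; n++, p += stride)
    realigned_load (mem, p, dst + n * VS);
}
```

With stride 1 the misalignment seen by realigned_load cycles through
0,1,2,3, which is why nothing can be carried over from one iteration to the
next in this form.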


This patch adds the ability to do the (un-optimized) explicit realignment
scheme.

With this patch we can vectorize a FIR-filter when it is written as in
case (1) above, using a variation of the un-optimized explicit realignment
scheme. (By the way, for this we also needed the bit committed here
http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00247.html, because
autovect-branch currently can't figure out that k is less than 4, and so it
thinks that the number of iterations may be negative. This will be fixed
with the next merge from mainline.)

Follow-up patches will add support to:
1. vectorize case (1) using the optimized realignment scheme
2. vectorize a FIR-filter when written like in case (2) above


Bootstrapped with vectorization enabled, and tested on the vectorizer
testcases, on i386-linux and powerpc-linux.
Committed to autovect-branch.

dorit

        * tree-vectorizer.c (vect_supportable_dr_alignment): Misaligned
        accesses now supported for inner-loop references within outer-loop
        vectorization.
        * tree-vectorizer.h (dr_alignment_support): Renamed
        dr_unaligned_software_pipeline to dr_explicit_realign_optimized.
        Added a new value dr_explicit_realign.
        * tree-vect-transform.c (vect_setup_realignment): Takes additional
        argument dr_alignment_support.
        (vect_create_data_ref_ptr): Added comment and an assert.
        (bump_vector_ptr): Updated to support the new dr_explicit_realign
        scheme: takes additional argument bump; argument ptr_incr is now
        optional; updated documentation.
        (vectorizable_store): Fix typos. Call bump_vector_ptr with additional
        argument.
        (vect_setup_realignment): Support the dr_explicit_realign scheme.
        Takes additional argument alignment_support_scheme. Updated
        documentation.
        (vectorizable_load): Fix typos. Support the dr_explicit_realign
        scheme. Support misaligned inner-loop references within outer-loop
        vectorization.  Call vect_setup_realignment with additional argument.
        Call bump_vector_ptr with additional argument.

        * gcc.dg/vect/vect-outer-4.c: Fix loop count.
        * gcc.dg/vect/vect-outer-fir.c: Fix typos. Add initialization.
        Test now gets vectorized.

(See attached file: autovect.jun4.wo.loop.txt)
Name: autovect.jun4.wo.loop.txt
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20070605/9fa653d1/attachment.txt>

