This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[autovect] [patch] initial outer-loop vectorization


Hi,

This patch adds initial support for outer-loop vectorization.
As a first step, we vectorize outer-loops only if there are no
memory-references in the inner-loop. This is just in order to allow simple
incremental patches. So the only kind of (toy) loops we can now vectorize
looks like this:
      for (i=0; i<M; i++) {
            s = i;
            for (j=0; i<N; j++)
                  s += j;
            a[i] = s;
      }
Since scev-cprop likes to replace such inner-loops with a constant (and
rightfully so), I'm using -fno-tree-scev-cprop to test this initial
functionality.

This roughly how the vectorized outer-loop would look like:

for (i=0; i<M; i+=4){
      vs = [i,i+1,i+2,i+3];
      vj = [0,0,0,0];
      for (j=0; i<N; j++){
            vs = vj + vs;
            vj = vj + [1,1,1,1];
      }
      a[i,i+1,i+2,i+3] = vs;
}

Note that the inner-loop still executes N iterations sequentially, but in
each inner-loop iteration we do 4 outer-loop iterations together, as
opposed to vectorizing the inner-loop, which would look like this:

for (i=0; i<M; i++){
      vs = [0,0,0,0];
      vj = [0,1,2,3];
      for (j=0; i<N/4; j++){
            vs = vj + vs;
            vj = vj + [4,4,4,4];
      }
      s = extract_scalar(vs) #code that does the reduction epilog
      s = s + i;
      a[i] = s;
}

As you can see, the outer-vectorization has the potential of being more
efficient because we don't need to compute the reduction epilog in each
iteration of the outer-loop. Also, in cases when we are not allowed to
change the order of computation - we can't vectorize this loop unless we
use outer-loop vectorization (which preserves the order of computation in
the inner-loop).

So with this patch we can vectorize reductions and inductions in the
inner-loop. The next step is to support also datarefs in the inner-loop.

And here is the long list of other required extensions that are not
included in this first initial patch (I plan to addressed these gradually,
after I add the support for datarefs in the inner-loop):

1) Support more general loop-bounds in the outer-loop (see testcase
no-scevccp-outer-11.c): we are currently restricted to known loop-bounds
that divide by the vectorization factor, because otherwise we need to peel
the last few iterations, and peeling for nested-loops is not yet supported
in gcc (we'll have to extend the current available utility).

2) Support unaligned references in the outer-loop that require
peeling/versioning (see testcase no-scevccp-outer-8.c). This is missing for
the same reason as above - peeling and versioning for nested-loops is not
yet supported in gcc. (However, we do vectorize misaligned accesses in the
outer-loop that are supported by the target without peeling/versioning -
see testcase no-scevccp-outer-6.c).

3) Support more general forms of reductions. This is part of a long due
todo item that becomes even more necessary if we want to recognize
reductions as in testcase no-scevccp-outer-2.c, in which the reduction is
both in the inner and outer loops:
      s = 0;
      for (i=0; i<N; i++) {
            for (j=0; i<N; j++)
                  s += j;
      }
We can however vectorize "regular" reductions in the outer-loop (as in
testcases no-scevccp-outer-[13,14,17].c), e.g.:
      sum = 0;
      for (i=0; i<N; i++) {
            s = 0;
            for (j=0; i<N; j++)
                  s += j;
            sum += s;
      }
Similarly, induction that advances both through the inner- and outer-loops,
creates a def-use cycle that is currently not recognizable (see testcase
no-scevccp-outer-4.c).

4) pattern-detection may need be extended. We do detect and vectorize
patterns in the outer-loop (see testcase no-scevccp-outer-7.c), but not as
much as we could (if it looks just like it does when it's in an inner-most
loop then we'll catch it, but if we have to go through the inner-loop on
the way - we may fail to detect the pattern).

5) we currently don't support multiple-types in the inner-loop (within the
context of outer-loop vectorization). (Nothing special here, just need to
do it). See testcase no-scevccp-outer-12.c

6) The handling of inner-loop reduction (within the context of outer-loop
vectorization) has a bug and a feature: As explained above, the inner-loop
executes sequentially. However, we use the same routines as we have now to
classify scalar-cycles; this means that (1) we miss an optimization
opportunities: we think that we can't vectorize the loop if we operate in
the inner-loop on floats and fast-math is not on, and we don't support
inner-loop reductions when they are used in the inner-loop, although we
could. (2) we have a bug: stmts that are used only by operations that are
classified as a reduction are marked as "used_by_reduction", which
indicates that the order of computation when vectorizing these stmts
doesn't matter. However, in the case of inner-loop "reduction" it does
matter, cause we do preserve the order of computation. So we really need to
classify "inner-loop reductions" differently than regular reductions.


So much for what we can't do with this patch... here's what this patch does
include (not that many changes actually):

1) So far we supported single-BB loops (+empty latch), so the order by
which we traversed the loop BBs did not matter. Now, it does - we sort in
BBs in dfs order (since we don't allow if's in the loop, this should
guarantee visiting defs before their uses).

2) vect_analyze_loop_form was extend to allow a restricted form of
outer-loops. We currently support doubly-nested loops that consist of a
header, a single inner(most)-loop, a tail, and an empty latch (5 BBs all
together). This means, for example, that we can't vectorize outer-loops
when the inner-loop count is unknown, and we can't prove n>0 (because there
will be a guard code to skip the loop if n<=0, which means we'll have an
if-stmt in the loop). So this is a missed optimization that would be nice
to overcome. Also, there are cases in which the number of BBs is only 4 -
cause the inner- and outer-loops share the latch block (when the outer-loop
header is empty). This is also something I'd want to fix at some point. See
testcases no-scevccp-outer-[9,10].c and no-scevccp-no-reassoc-outer-1.c.

3) vect_analyze_loop_form calls a new function - vect_analyze_loop_1 - to
do a few analyses on the inner-loop (currently only two analyses:
analyze_loop_form and analyze_scalar_cycles), and to build a loop_info for
the inner-loop. It is destroyed soon after, but w/o destroying the
stmt_info's that were set up for the inner-loop stmts. This is something
we'll revisit as we continue to extend the outer-loop support - see if we
want to keep the inner-loop_info around, see which analyses should be
applied at the inner-loop level, etc.

4) One assumption that the support for outer-loops breaks in the current
code is the assumption that phi nodes are only in the loop-header, and
represent a scalar-cycle (induction or reduction). In outer-loops we also
have phi-nodes inside the loop - these are the loop-closed phis after the
inner-loop. This required a way to distinguish between these two kinds of
phis (we use 'is_loop_header_bb_p' for that), and a few small changes in
several places:
o new_stmt_vec_info: different def-type initialization for the two kinds of
phis
o vect_is_simple_reduction: the uses that are not the reduction-variable
can now be defined by a phi, though not a loop-header phi.
o vect_recog_dot_prod_pattern: a vect_loop_def might be a phi, and not
necessarily a gimple_modify_stmt.
o vect_get_vec_def_for_oprnd: a vect_loop_def can be a phi node, and not
necessarily a gimple_modify_stmt.

5) the enum "relevant" has two new values -
vect_used_in_outer[_by_reduction], which are propagated during the
mark_relevant pass.

The more significant/interesting changes are to vectorization of reduction
and induction. In both cases we need to be aware of whether the
induction/reduction-phi that we are vectorizing is in the same nest that is
being vectorized, or is 'nested_in_vect_loop' (is inside the inner-loop
while vectorizing the outer-loop):

6) vectorization of induction: In get_initial_def_for_induction, if this is
a 'nested_in_vect_loop' case, then:
o the initialization vector can be obtained using
vect_get_vec_def_for_operand (does not need to be built from scratch).
o the vector that holds the step of the vectorized induction is {S,S,S,S}
rather than {VF*S,VF*S,VF*S,VF*S} (where S is the step of the induction),
because in the vectorized inner-loop we are advancing sequentially (though
in parallel for VF (4 in the example) outer-loop iterations).
o the final vector for inductions is recorded in the corresponding
loop-exit phi (of the inner-loop) so that we can easily obtain it when we
vectorize stmts in the outer-loop that use it.

7) vectorization of reduction: The main thing here is that we don't need to
reduce the reduction to a single result; the final vector of partial
results will feed the vector operations that may use it in the outer-loop.
So:
o In get_initial_def_for_reduction, we may return a vector for the epilog
adjustment, rather than a scalar.
o epilog_for_reduction - skip the part that computes the final scalar
result in case this is a 'nested_in_vect_loop' case.
o and in vectorizable_reduction, we don't check that the reduction is
LIVE_P anymore (used out of the loop), cause it may be not used outside the
(outer) loop, but used inside the outer-loop (so as far as the inner-loop
reduction is concerned, it is used_in_outer_loop, but not live).


Bootstrapped on i386-linux, and passed full regression testing.
Bootstrapped with vectorization enabled on powerpc-linux, and passed full
regression testing.

Committed to autovect-branch.

see you after Passover!

thanks,
dorit

ChangeLogs:

        * tree-vectorizer.c (new_stmt_vec_info): When setting def_type for
        phis differentiate loop-header phis from other phis.
        (bb_in_loop_p): New function.
        (new_loop_vec_info): Inner-loop phis already have a stmt_vinfo, so
just
        update their loop_vinfo.  Order of BB traversal now matters - call
        dfs_enumerate_from with bb_in_loop_p.
        (destroy_loop_vec_info): Takes additional argument to control
whether
        stmt_vinfo of the loop stmts should be destroyed as well.
        (vect_is_simple_use): Check for multiple-types in inner-loop
(currently
        not supported during outer-loop vectorization).
        (vect_is_simple_reduction): Allow the "non-reduction" use of a
        reduction stmt to be defines by a non loop-header phi.
        (vectorize_loops): Call destroy_loop_vec_info with additional
argument.
        * tree-vectorizer.h (vect_relevant): Add enum values
        vect_used_in_outer_by_reduction and vect_used_in_outer.
        (is_loop_header_bb_p): New. Used to differentiate loop-header phis
        from other phis in the loop.
        (destroy_loop_vec_info): Add additional argument to declaration.
        * tree-vect-analyze.c (analyze_operations): Check for inner-loop
        loop-closed exit-phis during outer-loop vectorization that are live
or
        not used in the outerloop, cause this requires special handling.
        (vect_enhance_data_refs_alignment): Don't consider versioning for
        nested-loops.
        (vect_analyze_data_refs): Check that there are no datarefs in the
        inner-loop.
        (vect_mark_stmts_to_be_vectorized): Also consider
vect_used_in_outer
        and vect_used_in_outer_by_reduction cases.
        (process_use): Also consider the case of outer-loop stmt defining
an
        inner-loop stmt.
        (vect_analyze_loop_1): New function.
        (vect_analyze_loop_form): Extend, to allow a restricted form of
nested
        loops.  Call vect_analyze_loop_1.
        (vect_analyze_loop): Skip (inner-)loops within outer-loops that
have
        been vectorized.  Call destroy_loop_vec_info with additional
argument.
        * tree-vect-patterns.c (vect_recog_dot_prod_pattern): Add check for
        GIMPLE_MODIFY_STMT (in case we encounter a phi in the loop).
        * tree-vect-transform.c (vect_init_vector): Check which loop to
work
        with, in case there's an inner-loop.
        (get_initial_def_for_inducion): Extend to handle outer-loop
        vectorization.
        (vect_get_vec_def_for_operand): Support phis in the case
vect_loop_def.
        In the case vect_induction_def get the vector def from the
induction
        phi node (instead of calling get_initial_def_for_inducion.
        (vect_finish_stmt_generation): Support the case that the new stmt
is
        to be added after a label (rather than before some other stmt).
        (get_initial_def_for_reduction): Extend to handle outer-loop
        vectorization.
        (vect_create_epilog_for_reduction): Extend to handle outer-loop
        vectorization.
        (vectorizable_reduction): Check for multiple-types in inner-loop
        (currently not supported during outer-loop vectorization).
        (vect_transform_loop): Change assert to just skip this case.  Add a
        dump printout.


        * gcc.dg/vect/vect.exp: Compile tests with -fno-tree-scev-cprop
        and -fno-tree-reassoc.
        * gcc.dg/vect/no-tree-scev-cprop-vect-iv-1.c: Moved to...
        * gcc.dg/vect/no-scevccp-vect-iv-1.c: New test.
        * gcc.dg/vect/no-tree-scev-cprop-vect-iv-2.c: Moved to...
        * gcc.dg/vect/no-scevccp-vect-iv-2.c: New test.
        * gcc.dg/vect/no-scevccp-noreassoc-outer-1.c: New test.
        * gcc.dg/vect/no-scevccp-noreassoc-outer-2.c: New test.
        * gcc.dg/vect/no-scevccp-noreassoc-outer-3.c: New test.
        * gcc.dg/vect/no-scevccp-noreassoc-outer-4.c: New test.
        * gcc.dg/vect/no-scevccp-noreassoc-outer-5.c: New test.
        * gcc.dg/vect/no-scevccp-outer-1.c: New test.
        * gcc.dg/vect/no-scevccp-outer-2.c: New test.
        * gcc.dg/vect/no-scevccp-outer-3.c: New test.
        * gcc.dg/vect/no-scevccp-outer-4.c: New test.
        * gcc.dg/vect/no-scevccp-outer-5.c: New test.
        * gcc.dg/vect/no-scevccp-outer-6.c: New test.
        * gcc.dg/vect/no-scevccp-outer-7.c: New test.
        * gcc.dg/vect/no-scevccp-outer-8.c: New test.
        * gcc.dg/vect/no-scevccp-outer-9.c: New test.
        * gcc.dg/vect/no-scevccp-outer-9a.c: New test.
        * gcc.dg/vect/no-scevccp-outer-9b.c: New test.
        * gcc.dg/vect/no-scevccp-outer-10.c: New test.
        * gcc.dg/vect/no-scevccp-outer-10a.c: New test.
        * gcc.dg/vect/no-scevccp-outer-10b.c: New test.
        * gcc.dg/vect/no-scevccp-outer-11.c: New test.
        * gcc.dg/vect/no-scevccp-outer-12.c: New test.
        * gcc.dg/vect/no-scevccp-outer-13.c: New test.
        * gcc.dg/vect/no-scevccp-outer-14.c: New test.
        * gcc.dg/vect/no-scevccp-outer-15.c: New test.
        * gcc.dg/vect/no-scevccp-outer-16.c: New test.
        * gcc.dg/vect/no-scevccp-outer-17.c: New test.
        * gcc.dg/vect/no-scevccp-outer-18.c: New test.
        * gcc.dg/vect/no-scevccp-outer-19.c: New test.
        * gcc.dg/vect/no-scevccp-outer-20.c: New test.
        * gcc.dg/vect/no-scevccp-outer-21.c: New test.
        * gcc.dg/vect/no-scevccp-outer-22.c: New test.

Patch:

(See attached file: outerloopdiff.april2.txt)

Attachment: outerloopdiff.april2.txt
Description: Text document


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]