This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[rfc] and [autovect patch] Vectorize reduction.





Initial support for vectorization of reduction.
The first kind of reduction that is supported is summation (e.g.
'for(i=i;i<N;i++){sum+=a[i]}').
Also target specific bits for summation of v4si and v4sf modes are added
for altivec.

We're recognizing roughly the following:
[Example1]
loop:
      a_1 = phi <a_0, a_2>
s1:   x = ...
s2:   a_2 = op <x, a_1>
loop_exit:
      a_3 = phi <a_2>
s3:   use <a_3>
s4:   use <a_3>

as a reduction idiom,
and if we decide it's ok to change the computation order,
we transform it into:
[Example2]
loop:
      va_1 = phi <va_0, va_2>
vs1:  vx = ...
vs2:  va_2 = vop <vx, va_1>
loop_exit:
      a_3 = phi <a_2> (dead)
      va_3 = phi <va_2>
vs3:  va_4 = reduc_op <va_3>
vs4:  a_5 = bit_field_ref <va_4, 0>
s3:   use <a_5>
s4:   use <a_5>

(there are other alternatives, that are discussed below. comments are
welcome!).

More enhancements to be done:
(1) currently we support reduction only when peeling is not going to take
place; this requires enhancing the code that preserves loop-closed form in
case of peeling in the vectorizer.
(2) w.r.t stmt s2: currently we assume that the first argument is defined
in the current iteration and the second argument is defined by the previous
iteration; support can be added to also handle the swapped case, e.g.
'a_2=op<a_1,x>'.
(3) support additional operations. i.e, currently we only introduced
support for the case that op in s2 is plus_expr; support can be added for
min, max, or, etc.
(4) support additional reduction patterns. i.e, idioms, like dot product,
that have special target support.
(5) currently reduction testcases are enabled for powerpc only; it should
be enabled for more targets (possibly add a new target keyword).
(6) more target specific bits
(7) based on this infrastructure, we can also add support for vectorization
of induction (e.g: 'for(i=i;i<N;i++){a[i]=i}').
(8) based on this infrastructure, we can also add support for vectorization
in the presence of values that are used out of the loop (currently this is
supported only for reduction).

While I continue to work down the list above, there are some issues I'd
like to consult on:

(a) There are several options for expressing the reduction epilog code:

option 1: single stmt (new tree-code) that reduces the vector of partial
results into a scalar register. This scheme exposes the least details at
the tree-level and is probably the most general:
[Example3]
      a_5 = reduc_op <va_3>

option 2: one stmt (new tree-code) that reduces the vector of partial
results into a scalar result that is placed in a vector register (at the
first/last index of the vector); an additional stmt extracts the scalar
result from the vector register into a scalar register. This scheme exposes
more implementation details, and is less general (doesn't fit targets that
directly support vector-reg-input-to-scalar-reg-output in one insn; but
this rarely exists I think). Also this scheme implies that we need to
assume in which index the scalar result is placed, or let each target tell
us (via target hook?). It was easy to implement this one (both from the
tree-level end and the target-specific end), which is the main reason why
it is the one used now:
[Example4]
      va_4 = reduc_op <va_3>
      a_5 = bit_field_ref_expr <va_4, offset>

option 3: for targets that support whole vector shifts, we can do the
following at the tree level (VS is the vector size in bytes):

for (offset = VS; offset > element_size; offset/=2) {
  build:    va' = vec_shift_left <va, offset>
  build:    va = vop <va, va'>
}
  build:    va = bit_field_ref_expr <va, offset>

For example, when VS=16 bytes and the data type operated upon is ints, this
will be generated:
[Example5]
      va_4 = vec_shift_left <va_3, 8>
      va_5 = vop <va_3, va_4>
      va_6 = vec_shift_left <va_5, 4>
      va_7 = vop <va_5, va_6>
      a_5 = bit_field_ref_expr <va_7, offset>

This scheme exposes even more implementation details, yet it's a common
method for implementing reduction, so it may allow avoiding code
duplication between targets. Of course this requires adding new tree-codes
for the vector-shifts, and of course we don't want to use this scheme if a
target has better support.

option 4: Can always store the vector of partial results to memory, and
then sequentially compute the reduction on the scalar elements.

option 5: all of the above:
      if (support vector-reg-to-scalar-reg-in-one-stmt)
        generate option 1
      else if (support vector-reg-to-vector-reg-reduction)
        generate option 2
      else if (support whole-vector-shift)
        generate option 3
      else
        generate option 4

(b) w.r.t how we use bit_field_ref_expr: the offset where the scalar result
is expected to be placed is currenly hard-coded (it is expectd to be at
offset 0), but probably should be left target specific, via a target hook?

(c) vectorization of reduction changes the order of the computations, which
in some cases may result in changing the semantics of the program.
Currently we vectorize reduction only in the following cases:
- case 1: we're operating on floats, and unsafe-math-optimizations is set.
(ok? other flags that should be used?)
- case 2: we're operating on unsigned integers. (we assume wrapping
arithmetic here. ok?)
- case 3: we're operating on signed integers, and wrapv is set. (ok? other
flags that should be checked?)
- case 4 [experimental...]: we're operating on signed integers, and wrapv
is not set, but the target supports saturating reduction. this is of course
unsafe, and should be guarded by a flag (which flag?
unsafe-math-optimizations? fast-math?). This seems to make (some) sense if
we can interpret the behavior of overflow in case of no-wrapv as saturating
arithmetic (although I guess the behavior is really undefined?). Of course
in saturating arithmetic changing the order of the computation is unsafe,
but if the user allows it, we'd like to vectorize. For this purpose I also
added opcodes/optabs for sat_reduc_plus. (we may also want to use
saturating operations inside the loop in this case ?). Alternatively we
could generate regular reduction (not saturating) in this case, guarded of
course by an appropriate flag. We could use saturating reduction only when
the original computation explicitly uses saturating arithmetic. Bottom
line, there are two issues here: (1) I'd like to be able to vectorize
reduction on signed integers under no-wrapv, and (2) I'd also like to be
able to use saturating reduction when appropriate.

comment: another option, if we can't change the order of the computation,
and the target supports a reduction operation that is guaranteed to
preserve the order of the computation (does there exist such functionality
in any target?), we could do the reduction of the partial sums in each
iteration instead of in the epilog. This is a different vectorization
scheme, less efficient, but the only way that we could vectorize in this
situation.

(d) So far I added REDUC_PLUS_EXPR as a new tree-code (and
SAT_REDUC_PLUS_EXPR as described above), with the intention to add
additional codes such as REDUC_MIN_EXPR, REDUC_MAX_EXPR etc. ok names?

(e) I added reduc_plus_optab (and sat_reduc_plus_optab). We may need
seperate optabs for the signed and unsigned cases (sreduc_plus_optab,
ureduc_plus_optab ?).

(f) I'm using replace_immediate_uses to replaces uses of a_3 in [Example1]
with uses of a_5 [Example2]. Is there a better way around this? (is there a
way to replace the defining stmt of an ssa-name with a new defining stmt
(for the same ssa-name), removing the original defining stmt? (without
calling the renamer?) e.g., in [Example2] generate something like:
loop_exit:
      [a_3 = phi <a_2> removed]
      va_3 = phi <va_2>
vs3:  va_4 = reduc_op <va_3>
vs4:  a_3 = bit_field_ref <va_4, 0>
s3:   use <a_3> [unchanged]
s4:   use <a_3> [unchanged]
?)

Bootstrapped and tested (including SPEC) on powerpc-darwin.
Committed to autovect.

dorit

Changelog:
        * tree.def (REDUC_PLUS_EXPR, SAT_REDUC_PLUS_EXPR): New.
        * expr.c (expand_expr_real_1): Added new cases SAT_REDUC_PLUS_EXPR
and
        REDUC_PLUS_EXPR.
        * tree-pretty-print.c (dump_generic_node, op_prio, op_symbol):
Likewise.
        * tree-ssa-operands.c (get_expr_operands): Likewise.
        * optabs.c (optab_for_tree_code): Likewise.
        (init_optabs): Initialize new optabs reduc_plus_optab and
        sat_reduc_plus_optab.
        * genopinit.c (reduc_plus_optab, sat_reduc_plus_optab): New optabs.
        * optabs.h (OTI_reduc_plus, OTI_sat_reduc_plus): New.
        (reduc_plus_optab, sat_reduc_plus_optab) New.
        * tree-flow.h (replace_immediate_uses): Add declaration.
        * tree-ssa.c (replace_immediate_uses): Externalize.
        * tree-vectorizer.h (vect_scalar_var): New value for enum
vect_var_kind.
        (after_loop_use): New filed in _stmt_vec_info.
        (STMT_VINFO_EXTERNAL_USE): New access macro.
        * tree-vectorizer.c (vectorizable_reduction): New.
        Call replace_immediate_uses.
        (vect_is_simple_reduction): New.
        (reduction_code_for_scalar_code): New.
        (vect_get_vec_def_for_operand): Handle vect_reduction_def case.
        (vect_transform_stmt): Handle reduc_vec_info_type case.
        (vect_is_simple_use): Don't fail on vect_reduction_def.
        (vect_determine_vectorization_factor): Check if vect_reduction_def.
        (vect_analyze_operations): Call vectorizable_reduction.
        (vect_analyze_scalar_cycles): Call vect_is_simple_reduction.
        (vect_transform_loop): Check STMT_VINFO_LIVE_P.
        (vectorizable_assignment): Check STMT_VINFO_RELEVANT_P and
        STMT_VINFO_DEF_TYPE.
        (vectorizable_operation, vectorizable_load, vectorizable_select):
        Likewise.
        (vect_stmt_relevant_p): Set STMT_VINFO_EXTERNAL_USE.
        (new_stmt_vec_info): Initialize new field STMT_VINFO_EXTERNAL_USE.
        (vect_get_new_vect_var): Handle new vect_scalar_var case.
        (vect_create_destination_var): Likewise.
        * config/rs6000/altivec.md (sat_reduc_plus_v4si): New
define_expand.
        (reduc_plus_v4si, reduc_plus_v4sf): New define_expand.

Patch:
(See attached file: reduc.patch.feb22)

Attachment: reduc.patch.feb22
Description: Binary data


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]