This is the mail archive of the
`gcc-patches@gcc.gnu.org`
mailing list for the GCC project.

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |

Other format: | [Raw text] |

*From*: Dorit Naishlos <DORIT at il dot ibm dot com>*To*: gcc-patches at gcc dot gnu dot org*Cc*: Richard Henderson <rth at redhat dot com>*Date*: Wed, 23 Feb 2005 15:17:59 +0200*Subject*: [rfc] and [autovect patch] Vectorize reduction.*Reply-to*:*Sensitivity*:

Initial support for vectorization of reduction. The first kind of reduction that is supported is summation (e.g. 'for(i=i;i<N;i++){sum+=a[i]}'). Also target specific bits for summation of v4si and v4sf modes are added for altivec. We're recognizing roughly the following: [Example1] loop: a_1 = phi <a_0, a_2> s1: x = ... s2: a_2 = op <x, a_1> loop_exit: a_3 = phi <a_2> s3: use <a_3> s4: use <a_3> as a reduction idiom, and if we decide it's ok to change the computation order, we transform it into: [Example2] loop: va_1 = phi <va_0, va_2> vs1: vx = ... vs2: va_2 = vop <vx, va_1> loop_exit: a_3 = phi <a_2> (dead) va_3 = phi <va_2> vs3: va_4 = reduc_op <va_3> vs4: a_5 = bit_field_ref <va_4, 0> s3: use <a_5> s4: use <a_5> (there are other alternatives, that are discussed below. comments are welcome!). More enhancements to be done: (1) currently we support reduction only when peeling is not going to take place; this requires enhancing the code that preserves loop-closed form in case of peeling in the vectorizer. (2) w.r.t stmt s2: currently we assume that the first argument is defined in the current iteration and the second argument is defined by the previous iteration; support can be added to also handle the swapped case, e.g. 'a_2=op<a_1,x>'. (3) support additional operations. i.e, currently we only introduced support for the case that op in s2 is plus_expr; support can be added for min, max, or, etc. (4) support additional reduction patterns. i.e, idioms, like dot product, that have special target support. (5) currently reduction testcases are enabled for powerpc only; it should be enabled for more targets (possibly add a new target keyword). (6) more target specific bits (7) based on this infrastructure, we can also add support for vectorization of induction (e.g: 'for(i=i;i<N;i++){a[i]=i}'). (8) based on this infrastructure, we can also add support for vectorization in the presence of values that are used out of the loop (currently this is supported only for reduction). While I continue to work down the list above, there are some issues I'd like to consult on: (a) There are several options for expressing the reduction epilog code: option 1: single stmt (new tree-code) that reduces the vector of partial results into a scalar register. This scheme exposes the least details at the tree-level and is probably the most general: [Example3] a_5 = reduc_op <va_3> option 2: one stmt (new tree-code) that reduces the vector of partial results into a scalar result that is placed in a vector register (at the first/last index of the vector); an additional stmt extracts the scalar result from the vector register into a scalar register. This scheme exposes more implementation details, and is less general (doesn't fit targets that directly support vector-reg-input-to-scalar-reg-output in one insn; but this rarely exists I think). Also this scheme implies that we need to assume in which index the scalar result is placed, or let each target tell us (via target hook?). It was easy to implement this one (both from the tree-level end and the target-specific end), which is the main reason why it is the one used now: [Example4] va_4 = reduc_op <va_3> a_5 = bit_field_ref_expr <va_4, offset> option 3: for targets that support whole vector shifts, we can do the following at the tree level (VS is the vector size in bytes): for (offset = VS; offset > element_size; offset/=2) { build: va' = vec_shift_left <va, offset> build: va = vop <va, va'> } build: va = bit_field_ref_expr <va, offset> For example, when VS=16 bytes and the data type operated upon is ints, this will be generated: [Example5] va_4 = vec_shift_left <va_3, 8> va_5 = vop <va_3, va_4> va_6 = vec_shift_left <va_5, 4> va_7 = vop <va_5, va_6> a_5 = bit_field_ref_expr <va_7, offset> This scheme exposes even more implementation details, yet it's a common method for implementing reduction, so it may allow avoiding code duplication between targets. Of course this requires adding new tree-codes for the vector-shifts, and of course we don't want to use this scheme if a target has better support. option 4: Can always store the vector of partial results to memory, and then sequentially compute the reduction on the scalar elements. option 5: all of the above: if (support vector-reg-to-scalar-reg-in-one-stmt) generate option 1 else if (support vector-reg-to-vector-reg-reduction) generate option 2 else if (support whole-vector-shift) generate option 3 else generate option 4 (b) w.r.t how we use bit_field_ref_expr: the offset where the scalar result is expected to be placed is currenly hard-coded (it is expectd to be at offset 0), but probably should be left target specific, via a target hook? (c) vectorization of reduction changes the order of the computations, which in some cases may result in changing the semantics of the program. Currently we vectorize reduction only in the following cases: - case 1: we're operating on floats, and unsafe-math-optimizations is set. (ok? other flags that should be used?) - case 2: we're operating on unsigned integers. (we assume wrapping arithmetic here. ok?) - case 3: we're operating on signed integers, and wrapv is set. (ok? other flags that should be checked?) - case 4 [experimental...]: we're operating on signed integers, and wrapv is not set, but the target supports saturating reduction. this is of course unsafe, and should be guarded by a flag (which flag? unsafe-math-optimizations? fast-math?). This seems to make (some) sense if we can interpret the behavior of overflow in case of no-wrapv as saturating arithmetic (although I guess the behavior is really undefined?). Of course in saturating arithmetic changing the order of the computation is unsafe, but if the user allows it, we'd like to vectorize. For this purpose I also added opcodes/optabs for sat_reduc_plus. (we may also want to use saturating operations inside the loop in this case ?). Alternatively we could generate regular reduction (not saturating) in this case, guarded of course by an appropriate flag. We could use saturating reduction only when the original computation explicitly uses saturating arithmetic. Bottom line, there are two issues here: (1) I'd like to be able to vectorize reduction on signed integers under no-wrapv, and (2) I'd also like to be able to use saturating reduction when appropriate. comment: another option, if we can't change the order of the computation, and the target supports a reduction operation that is guaranteed to preserve the order of the computation (does there exist such functionality in any target?), we could do the reduction of the partial sums in each iteration instead of in the epilog. This is a different vectorization scheme, less efficient, but the only way that we could vectorize in this situation. (d) So far I added REDUC_PLUS_EXPR as a new tree-code (and SAT_REDUC_PLUS_EXPR as described above), with the intention to add additional codes such as REDUC_MIN_EXPR, REDUC_MAX_EXPR etc. ok names? (e) I added reduc_plus_optab (and sat_reduc_plus_optab). We may need seperate optabs for the signed and unsigned cases (sreduc_plus_optab, ureduc_plus_optab ?). (f) I'm using replace_immediate_uses to replaces uses of a_3 in [Example1] with uses of a_5 [Example2]. Is there a better way around this? (is there a way to replace the defining stmt of an ssa-name with a new defining stmt (for the same ssa-name), removing the original defining stmt? (without calling the renamer?) e.g., in [Example2] generate something like: loop_exit: [a_3 = phi <a_2> removed] va_3 = phi <va_2> vs3: va_4 = reduc_op <va_3> vs4: a_3 = bit_field_ref <va_4, 0> s3: use <a_3> [unchanged] s4: use <a_3> [unchanged] ?) Bootstrapped and tested (including SPEC) on powerpc-darwin. Committed to autovect. dorit Changelog: * tree.def (REDUC_PLUS_EXPR, SAT_REDUC_PLUS_EXPR): New. * expr.c (expand_expr_real_1): Added new cases SAT_REDUC_PLUS_EXPR and REDUC_PLUS_EXPR. * tree-pretty-print.c (dump_generic_node, op_prio, op_symbol): Likewise. * tree-ssa-operands.c (get_expr_operands): Likewise. * optabs.c (optab_for_tree_code): Likewise. (init_optabs): Initialize new optabs reduc_plus_optab and sat_reduc_plus_optab. * genopinit.c (reduc_plus_optab, sat_reduc_plus_optab): New optabs. * optabs.h (OTI_reduc_plus, OTI_sat_reduc_plus): New. (reduc_plus_optab, sat_reduc_plus_optab) New. * tree-flow.h (replace_immediate_uses): Add declaration. * tree-ssa.c (replace_immediate_uses): Externalize. * tree-vectorizer.h (vect_scalar_var): New value for enum vect_var_kind. (after_loop_use): New filed in _stmt_vec_info. (STMT_VINFO_EXTERNAL_USE): New access macro. * tree-vectorizer.c (vectorizable_reduction): New. Call replace_immediate_uses. (vect_is_simple_reduction): New. (reduction_code_for_scalar_code): New. (vect_get_vec_def_for_operand): Handle vect_reduction_def case. (vect_transform_stmt): Handle reduc_vec_info_type case. (vect_is_simple_use): Don't fail on vect_reduction_def. (vect_determine_vectorization_factor): Check if vect_reduction_def. (vect_analyze_operations): Call vectorizable_reduction. (vect_analyze_scalar_cycles): Call vect_is_simple_reduction. (vect_transform_loop): Check STMT_VINFO_LIVE_P. (vectorizable_assignment): Check STMT_VINFO_RELEVANT_P and STMT_VINFO_DEF_TYPE. (vectorizable_operation, vectorizable_load, vectorizable_select): Likewise. (vect_stmt_relevant_p): Set STMT_VINFO_EXTERNAL_USE. (new_stmt_vec_info): Initialize new field STMT_VINFO_EXTERNAL_USE. (vect_get_new_vect_var): Handle new vect_scalar_var case. (vect_create_destination_var): Likewise. * config/rs6000/altivec.md (sat_reduc_plus_v4si): New define_expand. (reduc_plus_v4si, reduc_plus_v4sf): New define_expand. Patch: (See attached file: reduc.patch.feb22)

**Attachment:
reduc.patch.feb22**

**Follow-Ups**:**Re: [rfc] and [autovect patch] Vectorize reduction.***From:*Richard Henderson

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |