[PATCH, vec-tails 07/10] Support loop epilogue combining
Ilya Enkovich
enkovich.gnu@gmail.com
Mon Jul 11 13:39:00 GMT 2016
Ping
2016-06-28 15:24 GMT+03:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
> On 16 Jun 10:54, Jeff Law wrote:
>> On 05/19/2016 01:44 PM, Ilya Enkovich wrote:
>> >Hi,
>> >
>> >This patch introduces support for loop epilogue combining. This includes
>> >support in cost estimation and all changes required to mask the
>> >vectorized loop.
>> >
>> >Thanks,
>> >Ilya
>> >--
>> >gcc/
>> >
>> >2016-05-19 Ilya Enkovich <ilya.enkovich@intel.com>
>> >
>> > * dbgcnt.def (vect_tail_combine): New.
>> > * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>> > * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>> > * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>> > epilogue combined with loop body.
>> > (vect_do_peeling_for_loop_bound): Likewise.
>> > * tree-vect-loop.c: Include alias.h and dbgcnt.h.
>> > (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
>> > arg, compute number of iterations for which loop epilogue combining is
>> > profitable.
>> > (vect_generate_tmps_on_preheader): Support combined epilogue.
>> > (vect_gen_ivs_for_masking): New.
>> > (vect_get_mask_index_for_elems): New.
>> > (vect_get_mask_index_for_type): New.
>> > (vect_gen_loop_masks): New.
>> > (vect_mask_reduction_stmt): New.
>> > (vect_mask_mask_load_store_stmt): New.
>> > (vect_mask_load_store_stmt): New.
>> > (vect_combine_loop_epilogue): New.
>> > (vect_transform_loop): Support combined epilogue.
>> >
>> >
>> >diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
>> >index fab5879..b3c0668 100644
>> >--- a/gcc/tree-vect-loop-manip.c
>> >+++ b/gcc/tree-vect-loop-manip.c
>> >@@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>> > bb_between_loops = new_exit_bb;
>> > bb_after_second_loop = split_edge (single_exit (second_loop));
>> >
>> >- pre_condition =
>> >- fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
>> >- skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
>> >- bb_after_second_loop, bb_before_first_loop,
>> >- inverse_probability (second_guard_probability));
>> >+ if (skip_second_after_first)
>> >+ /* We can just redirect edge from bb_between_loops to
>> >+ bb_after_second_loop but we have many code assuming
>> >+ we have a guard after the first loop. So just make
>> >+ always taken condtion. */
>> >+ pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
>> >+ integer_zero_node);
>> This isn't ideal, but I don't think it's that big of an issue.
>>
>> >@@ -1758,8 +1772,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>> > basic_block preheader;
>> > int loop_num;
>> > int max_iter;
>> >+ int bound2;
>> > tree cond_expr = NULL_TREE;
>> > gimple_seq cond_expr_stmt_list = NULL;
>> >+ bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
>> >
>> > if (dump_enabled_p ())
>> > dump_printf_loc (MSG_NOTE, vect_location,
>> >@@ -1769,12 +1785,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>> >
>> > loop_num = loop->num;
>> >
>> >+ bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> Can you document what the TH parameter is to the various routines that use
>> it in tree-vect-loop-manip.c? I realize you didn't add it, but it would
>> help anyone looking at this code in the future to know it's the threshold of
>> iterations for vectorization without having to find it in other function
>> comment headers ;-)
>>
>> That's pre-approved to go in immediately :-)
>>
>> >@@ -1803,7 +1820,11 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>> > max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> > ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
>> > : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
>> >- if (check_profitability)
>> >+ /* When epilogue is combined only profitability
>> >+ treshold matters. */
>> s/treshold/threshold/
>>
>>
>>
>> > static void
>> > vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>> > int *ret_min_profitable_niters,
>> >- int *ret_min_profitable_estimate)
>> >+ int *ret_min_profitable_estimate,
>> >+ int *ret_min_profitable_combine_niters)
>> I'm torn a bit here. There's all kinds of things missing/incomplete in the
>> function comments throughout the vectorizer. And in some cases, like this
>> one, the parameters are largely self-documenting. But we've also got coding
>> standards that we'd like to adhere to.
>>
>> I don't think it's fair to require you to fix all these issues in the
>> vectorizer (though if you wanted to, I'd fully support those as independent
>> cleanups).
>>
>> Perhaps just document LOOP_VINFO with a generic comment about the ret_*
>> parameters for this function rather than a comment for each ret_* parameter.
>> Pre-approved for the trunk independent of the vec-tails work.
>>
>>
>> >@@ -3728,6 +3784,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>> > min_profitable_estimate);
>> >
>> >+
>> >+ unsigned combine_treshold
>> >+ = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
>> >+ /* Calculate profitability combining epilogue with the main loop.
>> >+ We have a threshold for inside cost overhead (not applied
>> >+ for low trip count loop case):
>> >+ MIC * 100 < VIC * CT
>> >+ Masked iteration should be better than a scalar prologue:
>> >+ MIC + VIC < SIC * epilogue_niters */
>> Can you double-check the whitespace formatting here. Where does the "100"
>> come from and should it be a param?
>
> I checked the formatting. The 100 is there because combine_treshold
> is measured in percent. E.g. a value of 2 means the iteration masking
> overhead shouldn't exceed 2% of the vector iteration cost.
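
A minimal sketch of the percent check described above, in case it helps review. The helper name and the cost values are hypothetical and not part of the patch; only the inequality MIC * 100 < VIC * CT and the default parameter value of 10 come from it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of the patch.  Returns true when the
   per-iteration masking cost (MIC) stays below CT_PERCENT percent of
   the vector iteration cost (VIC), i.e. MIC * 100 < VIC * CT.  */
bool
masking_overhead_ok (unsigned mic, unsigned vic, unsigned ct_percent)
{
  return mic * 100 < vic * ct_percent;
}

/* Usage sketch with made-up costs.  */
void
masking_overhead_demo (void)
{
  assert (masking_overhead_ok (1, 100, 2));    /* 100 < 200: acceptable.  */
  assert (!masking_overhead_ok (3, 100, 2));   /* 300 >= 200: too costly.  */
  assert (masking_overhead_ok (9, 100, 10));   /* 900 < 1000 with CT = 10.  */
}
```

Multiplying the left side by 100 keeps the comparison in integer arithmetic, which is presumably why the constant is not a separate param.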
>
>>
>>
>> >@@ -6886,6 +7030,485 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>> > return;
>> > }
>> >
>>
>> >+
>> >+/* Function vect_gen_loop_masks.
>> >+
>> >+ Create masks to mask a loop desvribed by LOOP_VINFO. Masks
>> s/desvribed/described/
>>
>> >+ are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
>> >+ into MASKS vector.
>> >+
>> >+ Index of a mask in a vector is computed according to a number
>> >+ of masks's elements. Masks are sorted by number of its elements
>> >+ in descending order. Index 0 is used to access a mask with
>> >+ current_vector_size elements. Among masks with the same number
>> >+ of elements the one with lower index is used to mask iterations
>> >+ with smaller iteration counter. Note that you may get NULL elements
>> >+ for masks which are not required. Use vect_get_mask_index_for_elems
>> >+ or vect_get_mask_index_for_type to access resulting vector. */
>> >+
>> >+static void
>> >+vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
>> >+{
>> >+ struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> >+ edge pe = loop_preheader_edge (loop);
>> >+ tree niters = LOOP_VINFO_NITERS (loop_vinfo);
>> >+ unsigned min_mask_elems, max_mask_elems, nmasks;
>> >+ unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
>> >+ auto_vec<tree> ivs;
>> >+ tree vectype, mask_type;
>> >+ tree vec_niters, vec_niters_val, mask;
>> >+ gimple *stmt;
>> >+ basic_block bb;
>> >+ gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
>> >+ unsigned vec_size;
>> >+
>> >+ /* Create required IVs. */
>> >+ vect_gen_ivs_for_masking (loop_vinfo, &ivs);
>> >+ vectype = TREE_TYPE (ivs[0]);
>> >+
>> >+ vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
>> >+ iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
>> >+
>> >+ /* Get a proper niter to build a vector. */
>> >+ if (!is_gimple_val (niters))
>> >+ {
>> >+ gimple_seq seq = NULL;
>> >+ niters = force_gimple_operand (niters, &seq, true, NULL);
>> >+ gsi_insert_seq_on_edge_immediate (pe, seq);
>> >+ }
>> >+ /* We may need a type cast in case niter has a too small type
>> >+ for generated IVs. */
>> Nit. There should be vertical whitespace between the close brace and the
>> comment for the next logical block of code. Can you do a scan over the
>> patchkit looking for other instances where the vertical whitespace is
>> needed?
>>
>> Generally, if you find that a blob of code needs a comment, then the comment
>> and blob of code should have that vertical whitespace to visually separate
>> it from everything else.
>>
>>
>>
>> >+/* Function vect_combine_loop_epilogue.
>> >+
>> >+ Combine loop epilogue with the main vectorized body. It requires
>> >+ masking of memory accesses and reductions. */
>> So you mask reductions, loads & stores. Is there anything else that we
>> might potentially need to mask to combine the loop & epilogue via masking?
>>
>>
>> I don't see anything particularly worrisome here either -- I have a slight
>> concern about correctness issues with only masking loads/stores and
>> reductions. But I will defer to your judgment on whether or not there's
>> other stuff that we need to mask to combine the epilogue with the loop via
>> masking.
>
> We have to mask operations which may cause errors if executed speculatively.
> For the others we just ignore the produced result. So we don't truly mask
> reductions but fix up their results. I assume memory accesses are the only
> ones we have to truly mask (plus non-const calls, which are rejected now).
> For signalling arithmetic I assumed we just don't vectorize it.
>
> Basically we should act similarly to if-conversion. I'll check if it has
> restrictions I missed.
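
To illustrate the distinction drawn above, here is a scalar model of a loop with its epilogue folded into the vector body. It is only a sketch under stated assumptions: VF = 4, the function name, and the -999 filler for masked-off lanes are all made up for illustration and do not appear in the patch. Loads and stores are guarded by the lane mask, while the reduction is computed unconditionally and then fixed up with a select, in the spirit of the VEC_COND_EXPR fix-up.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };  /* Illustrative vectorization factor (an assumption).  */

/* Scalar model of a loop whose epilogue is combined with the vector
   body.  Stores dst[i] = src[i] * 2 for i < n and returns the sum of
   src[0..n-1].  Hypothetical; not code from the patch.  */
int
masked_loop_model (int *dst, const int *src, size_t n)
{
  int acc[VF] = { 0 };
  size_t ratio = (n + VF - 1) / VF;   /* Rounded-up vector trip count.  */

  for (size_t it = 0; it < ratio; it++)
    for (size_t lane = 0; lane < VF; lane++)
      {
        size_t iv = it * VF + lane;
        int mask = iv < n;                  /* Lane mask: IV < NITERS.  */
        int val = mask ? src[iv] : -999;    /* Masked load; filler otherwise.  */
        int tmp = acc[lane] + val;          /* Arithmetic is not masked.  */
        acc[lane] = mask ? tmp : acc[lane]; /* Reduction result fix-up.  */
        if (mask)
          dst[iv] = val * 2;                /* Masked store.  */
      }

  int sum = 0;
  for (size_t lane = 0; lane < VF; lane++)
    sum += acc[lane];
  return sum;
}
```

With n = 6 the second vector iteration has two inactive lanes: their filler values flow through the addition but are discarded by the select, and nothing is written past dst[5].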
>
>>
>> Jeff
>
> Here is an updated patch version.
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-28 Ilya Enkovich <ilya.enkovich@intel.com>
>
> * dbgcnt.def (vect_tail_combine): New.
> * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
> * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
> * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
> epilogue combined with loop body.
> (vect_do_peeling_for_loop_bound): Likewise.
> (vect_do_peeling_for_alignment): ???
> * tree-vect-loop.c: Include alias.h and dbgcnt.h.
> (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
> arg, compute number of iterations for which loop epilogue combining is
> profitable.
> (vect_generate_tmps_on_preheader): Support combined epilogue.
> (vect_gen_ivs_for_masking): New.
> (vect_get_mask_index_for_elems): New.
> (vect_get_mask_index_for_type): New.
> (vect_gen_loop_masks): New.
> (vect_mask_reduction_stmt): New.
> (vect_mask_mask_load_store_stmt): New.
> (vect_mask_load_store_stmt): New.
> (vect_combine_loop_epilogue): New.
> (vect_transform_loop): Support combined epilogue.
>
>
> diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
> index 78ddcc2..73c2966 100644
> --- a/gcc/dbgcnt.def
> +++ b/gcc/dbgcnt.def
> @@ -192,4 +192,5 @@ DEBUG_COUNTER (treepre_insert)
> DEBUG_COUNTER (tree_sra)
> DEBUG_COUNTER (vect_loop)
> DEBUG_COUNTER (vect_slp)
> +DEBUG_COUNTER (vect_tail_combine)
> DEBUG_COUNTER (dom_unreachable_edges)
> diff --git a/gcc/params.def b/gcc/params.def
> index 62a1e40..98d6c5a 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1220,6 +1220,11 @@ DEFPARAM (PARAM_MAX_SPECULATIVE_DEVIRT_MAYDEFS,
> "Maximum number of may-defs visited when devirtualizing "
> "speculatively", 50, 0, 0)
>
> +DEFPARAM (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD,
> + "vect-cost-increase-combine-threshold",
> + "Cost increase threshold to mask main loop for epilogue.",
> + 10, 0, 300)
> +
> /*
>
> Local variables:
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index a902a50..26e0cc1 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -4007,6 +4007,9 @@ vect_get_new_ssa_name (tree type, enum vect_var_kind var_kind, const char *name)
> case vect_scalar_var:
> prefix = "stmp";
> break;
> + case vect_mask_var:
> + prefix = "mask";
> + break;
> case vect_pointer_var:
> prefix = "vectp";
> break;
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index c26aa1d..7403686 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -1195,6 +1195,7 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
> int first_guard_probability = 2 * REG_BR_PROB_BASE / 3;
> int second_guard_probability = 2 * REG_BR_PROB_BASE / 3;
> int probability_of_second_loop;
> + bool skip_second_after_first = false;
>
> if (!slpeel_can_duplicate_loop_p (loop, e))
> return NULL;
> @@ -1393,7 +1394,11 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
> {
> loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> tree scalar_loop_iters = LOOP_VINFO_NITERSM1 (loop_vinfo);
> - unsigned limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
> + unsigned limit = 0;
> + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> + skip_second_after_first = true;
> + else
> + limit = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1;
> if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> limit = limit + 1;
> if (check_profitability
> @@ -1464,11 +1469,20 @@ slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
> bb_between_loops = new_exit_bb;
> bb_after_second_loop = split_edge (single_exit (second_loop));
>
> - pre_condition =
> - fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> - skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> - bb_after_second_loop, bb_before_first_loop,
> - inverse_probability (second_guard_probability));
> + if (skip_second_after_first)
> + /* We could just redirect the edge from bb_between_loops to
> + bb_after_second_loop, but a lot of code assumes there is
> + a guard after the first loop. So just create an
> + always-true condition. */
> + pre_condition = fold_build2 (EQ_EXPR, boolean_type_node, integer_zero_node,
> + integer_zero_node);
> + else
> + pre_condition =
> + fold_build2 (EQ_EXPR, boolean_type_node, *first_niters, niters);
> + skip_e
> + = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> + bb_after_second_loop, bb_before_first_loop,
> + inverse_probability (second_guard_probability));
> scale_loop_profile (second_loop, probability_of_second_loop, bound2);
> slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
> second_loop == new_loop, &new_exit_bb);
> @@ -1762,8 +1776,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
> basic_block preheader;
> int loop_num;
> int max_iter;
> + int bound2;
> tree cond_expr = NULL_TREE;
> gimple_seq cond_expr_stmt_list = NULL;
> + bool combine = LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo);
>
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> @@ -1773,12 +1789,13 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
>
> loop_num = loop->num;
>
> + bound2 = combine ? th : LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> new_loop
> = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
> &ratio_mult_vf_name, ni_name, false,
> th, check_profitability,
> cond_expr, cond_expr_stmt_list,
> - 0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> + 0, bound2);
> gcc_assert (new_loop);
> gcc_assert (loop_num == loop->num);
> slpeel_checking_verify_cfg_after_peeling (loop, new_loop);
> @@ -1807,7 +1824,12 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo,
> max_iter = (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 2
> : LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 2;
> - if (check_profitability)
> +
> + /* When epilogue is combined only profitability
> + threshold matters. */
> + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> + max_iter = (int) th - 1;
> + else if (check_profitability)
> max_iter = MAX (max_iter, (int) th - 1);
> record_niter_bound (new_loop, max_iter, false, true);
> dump_printf (MSG_NOTE,
> @@ -2044,7 +2066,8 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo, tree ni_name,
> bound, 0);
>
> gcc_assert (new_loop);
> - slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
> + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> + slpeel_checking_verify_cfg_after_peeling (new_loop, loop);
> /* For vectorization factor N, we need to copy at most N-1 values
> for alignment and this means N-2 loopback edge executions. */
> max_iter = LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 2;
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 41b9380..08fad82 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -50,6 +50,8 @@ along with GCC; see the file COPYING3. If not see
> #include "gimple-fold.h"
> #include "cgraph.h"
> #include "tree-if-conv.h"
> +#include "alias.h"
> +#include "dbgcnt.h"
>
> /* Loop Vectorization Pass.
>
> @@ -149,7 +151,8 @@ along with GCC; see the file COPYING3. If not see
> http://gcc.gnu.org/projects/tree-ssa/vectorization.html
> */
>
> -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
> +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
> + int *);
>
> /* Function vect_determine_vectorization_factor
>
> @@ -2310,8 +2313,10 @@ start_over:
>
> /* Analyze cost. Decide if worth while to vectorize. */
> int min_profitable_estimate, min_profitable_iters;
> + int min_profitable_combine_iters;
> vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
> - &min_profitable_estimate);
> + &min_profitable_estimate,
> + &min_profitable_combine_iters);
>
> if (min_profitable_iters < 0)
> {
> @@ -2420,6 +2425,52 @@ start_over:
> gcc_assert (vectorization_factor
> == (unsigned)LOOP_VINFO_VECT_FACTOR (loop_vinfo));
>
> + if (!LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> + {
> + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
> + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = false;
> + }
> + else if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> + && min_profitable_combine_iters >= 0)
> + {
> + if (((LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> + && (LOOP_VINFO_INT_NITERS (loop_vinfo)
> + >= (unsigned) min_profitable_combine_iters))
> + || estimated_niter == -1
> + || estimated_niter >= min_profitable_combine_iters)
> + && dbg_cnt (vect_tail_combine))
> + {
> + LOOP_VINFO_MASK_EPILOGUE (loop_vinfo) = false;
> + LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo) = true;
> +
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "Decided to combine loop with its epilogue.\n");
> +
> + /* When the epilogue is combined, adjust the profitability
> + check to account for the additional vector iteration
> + and the profitable combine iteration count. */
> + if ((int)(min_profitable_combine_iters + vectorization_factor)
> + > min_scalar_loop_bound)
> + {
> + LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)
> + = (unsigned) min_profitable_combine_iters;
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "Updated runtime profitability threshold: %d\n",
> + min_profitable_combine_iters);
> +
> + }
> + }
> + else
> + {
> + if (!LOOP_VINFO_NEED_MASKING (loop_vinfo) && dump_enabled_p ())
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "Not combined loop with epilogue: iterations "
> + "count is too low (threshold is %d).\n",
> + min_profitable_combine_iters);
> + }
> + }
> +
> /* Ok to vectorize! */
> return true;
>
> @@ -3392,12 +3443,18 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
> profitability check.
>
> *RET_MIN_PROFITABLE_ESTIMATE is a profitability threshold to be used
> - for static check against estimated number of iterations. */
> + for static check against estimated number of iterations.
> +
> + *RET_MIN_PROFITABLE_COMBINE_NITERS is a cost model profitability threshold
> + of iterations for vectorization with combined loop epilogue. -1 means
> + combining is not profitable. Value may be used for dynamic profitability
> + check. */
>
> static void
> vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> int *ret_min_profitable_niters,
> - int *ret_min_profitable_estimate)
> + int *ret_min_profitable_estimate,
> + int *ret_min_profitable_combine_niters)
> {
> int min_profitable_iters;
> int min_profitable_estimate;
> @@ -3641,6 +3698,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> vec_prologue_cost);
> dump_printf (MSG_NOTE, " Vector epilogue cost: %d\n",
> vec_epilogue_cost);
> + dump_printf (MSG_NOTE, " Masking prologue cost: %d\n",
> + masking_prologue_cost);
> + dump_printf (MSG_NOTE, " Masking inside cost: %d\n",
> + masking_inside_cost);
> dump_printf (MSG_NOTE, " Scalar iteration cost: %d\n",
> scalar_single_iter_cost);
> dump_printf (MSG_NOTE, " Scalar outside cost: %d\n",
> @@ -3744,6 +3805,77 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> min_profitable_estimate);
>
> *ret_min_profitable_estimate = min_profitable_estimate;
> +
> + *ret_min_profitable_combine_niters = -1;
> +
> + /* Don't try to vectorize epilogue of epilogue. */
> + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> + return;
> +
> + if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> + {
> + if (flag_vect_epilogue_cost_model == VECT_COST_MODEL_UNLIMITED)
> + {
> + if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
> + *ret_min_profitable_combine_niters = 0;
> + return;
> + }
> +
> + unsigned combine_treshold
> + = PARAM_VALUE (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD);
> + /* Calculate profitability combining epilogue with the main loop.
> + We have a threshold for inside cost overhead (not applied
> + for low trip count loop case):
> + MIC * 100 < VIC * CT
> + Masked iteration should be better than a scalar epilogue:
> + MIC + VIC < SIC * epilogue_niters */
> + if (masking_inside_cost * 100 >= vec_inside_cost * combine_treshold)
> + {
> + if (dump_enabled_p ())
> + {
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "Combining loop with epilogue is not "
> + "profitable.\n");
> + dump_printf_loc (MSG_NOTE, vect_location,
> + " Combining overhead %d%% exceeds "
> + "threshold %d%%.\n",
> + masking_inside_cost * 100 / vec_inside_cost,
> + combine_treshold);
> + }
> + *ret_min_profitable_combine_niters = -1;
> + }
> + else if ((int)(masking_inside_cost + vec_inside_cost)
> + >= scalar_single_iter_cost * peel_iters_epilogue)
> + {
> + if (dump_enabled_p ())
> + {
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "Combining loop with epilogue is not "
> + "profitable.\n");
> + dump_printf_loc (MSG_NOTE, vect_location,
> + " Scalar epilogue is faster than a "
> + "single masked iteration.\n");
> + }
> + *ret_min_profitable_combine_niters = -1;
> + }
> + else if (flag_tree_vectorize_epilogues & VECT_EPILOGUE_COMBINE)
> + {
> + int inside_cost = vec_inside_cost + masking_inside_cost;
> + int outside_cost = vec_outside_cost + masking_prologue_cost;
> + int profitable_iters = ((outside_cost - scalar_outside_cost) * vf
> + - inside_cost * peel_iters_prologue
> + - inside_cost * peel_iters_epilogue)
> + / ((scalar_single_iter_cost * vf)
> + - inside_cost);
> +
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "Combining loop with epilogue "
> + "profitability threshold = %d\n",
> + profitable_iters);
> + *ret_min_profitable_combine_niters = profitable_iters;
> + }
> + }
> }
>
> /* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
> @@ -6852,20 +6984,37 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
> else
> ni_minus_gap_name = ni_name;
>
> - /* Create: ratio = ni >> log2(vf) */
> - /* ??? As we have ni == number of latch executions + 1, ni could
> - have overflown to zero. So avoid computing ratio based on ni
> - but compute it using the fact that we know ratio will be at least
> - one, thus via (ni - vf) >> log2(vf) + 1. */
> - ratio_name
> - = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> - fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> - fold_build2 (MINUS_EXPR, TREE_TYPE (ni_name),
> - ni_minus_gap_name,
> - build_int_cst
> - (TREE_TYPE (ni_name), vf)),
> - log_vf),
> - build_int_cst (TREE_TYPE (ni_name), 1));
> + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> + {
> + /* Create: ratio = (ni + (vf - 1)) >> log2(vf) if epilogue is combined. */
> + gcc_assert (!LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> + ratio_name
> + = fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> + fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> + ni_name,
> + build_int_cst (TREE_TYPE (ni_name),
> + vf - 1)),
> + log_vf);
> + }
> + else
> + {
> + /* Create: ratio = ni >> log2(vf) */
> + /* ??? As we have ni == number of latch executions + 1, ni could
> + have overflown to zero. So avoid computing ratio based on ni
> + but compute it using the fact that we know ratio will be at least
> + one, thus via (ni - vf) >> log2(vf) + 1. */
> + ratio_name
> + = fold_build2 (PLUS_EXPR, TREE_TYPE (ni_name),
> + fold_build2 (RSHIFT_EXPR, TREE_TYPE (ni_name),
> + fold_build2 (MINUS_EXPR,
> + TREE_TYPE (ni_name),
> + ni_minus_gap_name,
> + build_int_cst
> + (TREE_TYPE (ni_name), vf)),
> + log_vf),
> + build_int_cst (TREE_TYPE (ni_name), 1));
> + }
> +
> if (!is_gimple_val (ratio_name))
> {
> var = create_tmp_var (TREE_TYPE (ni_name), "bnd");
> @@ -6895,6 +7044,489 @@ vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
> return;
> }
>
> +/* Function vect_gen_ivs_for_masking.
> +
> + Create IVs to be used for mask computation when masking the loop described
> + by LOOP_VINFO. Created IVs are stored in the IVS vector.
> +
> + The initial IV value is {0, 1, ..., VF - 1} (possibly split into several
> + vectors; in that case IVS elements with lower index hold the IV with
> + smaller numbers). The IV step is {VF, VF, ..., VF}. VF is the
> + vectorization factor in use. */
> +
> +static void
> +vect_gen_ivs_for_masking (loop_vec_info loop_vinfo, vec<tree> *ivs)
> +{
> + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> + tree vectype = vect_get_masking_iv_type (loop_vinfo);
> + tree type = TREE_TYPE (vectype);
> + int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
> + int ncopies = vf / elems;
> + int i, k;
> + tree iv, init_val, step_val;
> + bool insert_after;
> + gimple_stmt_iterator gsi;
> + tree *vtemp;
> +
> + /* Create {VF, ..., VF} vector constant. */
> + step_val = build_vector_from_val (vectype, build_int_cst (type, vf));
> +
> + vtemp = XALLOCAVEC (tree, vf);
> + for (i = 0; i < ncopies; i++)
> + {
> + /* Create initial IV value. */
> + for (k = 0; k < vf; k++)
> + vtemp[k] = build_int_cst (type, k + i * elems);
> + init_val = build_vector (vectype, vtemp);
> +
> + /* Create an induction variable including phi node. */
> + standard_iv_increment_position (loop, &gsi, &insert_after);
> + create_iv (init_val, step_val, NULL, loop, &gsi, insert_after,
> + &iv, NULL);
> + ivs->safe_push (iv);
> + }
> +}
> +
> +/* Function vect_get_mask_index_for_elems.
> +
> + A helper function to access masks vector. See vect_gen_loop_masks
> + for masks vector sorting description. Return index of the first
> + mask having MASK_ELEMS elements. */
> +
> +static inline unsigned
> +vect_get_mask_index_for_elems (unsigned mask_elems)
> +{
> + return current_vector_size / mask_elems - 1;
> +}
> +
> +/* Function vect_get_mask_index_for_type.
> +
> + A helper function to access masks vector. See vect_gen_loop_masks
> + for masks vector sorting description. Return index of the first
> + mask appropriate for VECTYPE. */
> +
> +static inline unsigned
> +vect_get_mask_index_for_type (tree vectype)
> +{
> + unsigned elems = TYPE_VECTOR_SUBPARTS (vectype);
> + return vect_get_mask_index_for_elems (elems);
> +}
> +
> +/* Function vect_gen_loop_masks.
> +
> + Create masks to mask a loop described by LOOP_VINFO. Masks
> + are created according to LOOP_VINFO_REQUIRED_MASKS and are stored
> + into MASKS vector.
> +
> + The index of a mask in the vector is computed from the number
> + of the mask's elements. Masks are sorted by their number of elements
> + in descending order. Index 0 is used to access a mask with
> + current_vector_size elements. Among masks with the same number
> + of elements the one with lower index is used to mask iterations
> + with smaller iteration counter. Note that you may get NULL elements
> + for masks which are not required. Use vect_get_mask_index_for_elems
> + or vect_get_mask_index_for_type to access the resulting vector. */
> +
> +static void
> +vect_gen_loop_masks (loop_vec_info loop_vinfo, vec<tree> *masks)
> +{
> + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> + edge pe = loop_preheader_edge (loop);
> + tree niters = LOOP_VINFO_NITERS (loop_vinfo);
> + unsigned min_mask_elems, max_mask_elems, nmasks;
> + unsigned iv_elems, cur_mask, prev_mask, cur_mask_elems;
> + auto_vec<tree> ivs;
> + tree vectype, mask_type;
> + tree vec_niters, vec_niters_val, mask;
> + gimple *stmt;
> + basic_block bb;
> + gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
> + unsigned vec_size;
> +
> + /* Create required IVs. */
> + vect_gen_ivs_for_masking (loop_vinfo, &ivs);
> + vectype = TREE_TYPE (ivs[0]);
> +
> + vec_size = tree_to_uhwi (TYPE_SIZE_UNIT (vectype));
> + iv_elems = TYPE_VECTOR_SUBPARTS (vectype);
> +
> + /* Get a proper niter to build a vector. */
> + if (!is_gimple_val (niters))
> + {
> + gimple_seq seq = NULL;
> + niters = force_gimple_operand (niters, &seq, true, NULL);
> + gsi_insert_seq_on_edge_immediate (pe, seq);
> + }
> +
> + /* We may need a type cast in case NITERS has too small a type
> + for the generated IVs. */
> + if (!types_compatible_p (TREE_TYPE (vectype), TREE_TYPE (niters)))
> + {
> + tree new_niters = make_temp_ssa_name (TREE_TYPE (vectype),
> + NULL, "niters");
> + stmt = gimple_build_assign (new_niters, CONVERT_EXPR, niters);
> + bb = gsi_insert_on_edge_immediate (pe, stmt);
> + gcc_assert (!bb);
> + niters = new_niters;
> + }
> +
> + /* Create {NITERS, ..., NITERS} vector and put to SSA_NAME. */
> + vec_niters_val = build_vector_from_val (vectype, niters);
> + vec_niters = vect_get_new_ssa_name (vectype, vect_simple_var, "niters");
> + stmt = gimple_build_assign (vec_niters, vec_niters_val);
> + bb = gsi_insert_on_edge_immediate (pe, stmt);
> + gcc_assert (!bb);
> +
> + /* Determine which masks we need to compute and how many. */
> + vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
> + nmasks = vect_get_mask_index_for_elems (MIN (min_mask_elems, iv_elems) / 2);
> + masks->safe_grow_cleared (nmasks);
> +
> + /* Now create base masks through comparison IV < VEC_NITERS. */
> + mask_type = build_same_sized_truth_vector_type (vectype);
> + cur_mask = vect_get_mask_index_for_elems (iv_elems);
> + for (unsigned i = 0; i < ivs.length (); i++)
> + {
> + tree iv = ivs[i];
> + mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> + stmt = gimple_build_assign (mask, LT_EXPR, iv, vec_niters);
> + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> + (*masks)[cur_mask++] = mask;
> + }
> +
> + /* Create narrowed masks. */
> + cur_mask_elems = iv_elems;
> + nmasks = ivs.length ();
> + while (cur_mask_elems < max_mask_elems)
> + {
> + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> + cur_mask_elems <<= 1;
> + nmasks >>= 1;
> +
> + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> + mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> +
> + for (unsigned i = 0; i < nmasks; i++)
> + {
> + tree mask_low = (*masks)[prev_mask++];
> + tree mask_hi = (*masks)[prev_mask++];
> + mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> + stmt = gimple_build_assign (mask, VEC_PACK_TRUNC_EXPR,
> + mask_low, mask_hi);
> + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> + (*masks)[cur_mask++] = mask;
> + }
> + }
> +
> + /* Create widened masks. */
> + cur_mask_elems = iv_elems;
> + nmasks = ivs.length ();
> + while (cur_mask_elems > min_mask_elems)
> + {
> + prev_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> + cur_mask_elems >>= 1;
> + nmasks <<= 1;
> +
> + cur_mask = vect_get_mask_index_for_elems (cur_mask_elems);
> +
> + mask_type = build_truth_vector_type (cur_mask_elems, vec_size);
> +
> + for (unsigned i = 0; i < nmasks; i += 2)
> + {
> + tree orig_mask = (*masks)[prev_mask++];
> +
> + mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> + stmt = gimple_build_assign (mask, VEC_UNPACK_LO_EXPR, orig_mask);
> + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> + (*masks)[cur_mask++] = mask;
> +
> + mask = vect_get_new_ssa_name (mask_type, vect_mask_var);
> + stmt = gimple_build_assign (mask, VEC_UNPACK_HI_EXPR, orig_mask);
> + gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
> + (*masks)[cur_mask++] = mask;
> + }
> + }
> +}
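
For readers following the mask generation above, here is a minimal scalar sketch of the three steps: a base mask from IV < NITERS, a narrowed mask built from two masks, and widened masks split from one mask. It only models lane contents. The helper names (base_mask, pack_masks, unpack_mask) are hypothetical and not part of the patch, and the assumption that the "lo" half holds the first lanes is a simplification that ignores target endianness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Base mask: lane J is active while IV[J] < NITERS.  */
static void
base_mask (const unsigned *iv, size_t n, unsigned niters, bool *mask)
{
  for (size_t j = 0; j < n; j++)
    mask[j] = iv[j] < niters;
}

/* Model of narrowing: two N-lane masks become one 2N-lane mask
   (assumed lane order: LO first, then HI).  */
static void
pack_masks (const bool *lo, const bool *hi, size_t n, bool *out)
{
  for (size_t j = 0; j < n; j++)
    {
      out[j] = lo[j];
      out[n + j] = hi[j];
    }
}

/* Model of widening: one 2N-lane mask becomes two N-lane masks.  */
static void
unpack_mask (const bool *in, size_t n, bool *lo, bool *hi)
{
  for (size_t j = 0; j < n; j++)
    {
      lo[j] = in[j];
      hi[j] = in[n + j];
    }
}
```

With NITERS = 10 and IVs {4,5,6,7} and {8,9,10,11}, the first base mask is fully set and the second has only its first two lanes set; packing them gives an 8-lane mask with lanes 0..5 active.
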
> +
> +/* Function vect_mask_reduction_stmt.
> +
> + Mask given vectorized reduction statement STMT using
> + MASK. If the scalar reduction statement is vectorized
> + into several vector statements, PREV holds the
> + preceding vector statement copy for STMT.
> +
> + Masking is performed using VEC_COND_EXPR. E.g.
> +
> + S1: r_1 = r_2 + d_3
> +
> + is transformed into:
> +
> + S1': r_4 = r_2 + d_3
> + S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>
> +
> + Return generated condition statement. */
> +
> +static gimple *
> +vect_mask_reduction_stmt (gimple *stmt, tree mask, gimple *prev)
> +{
> + gimple_stmt_iterator gsi;
> + tree vectype;
> + tree lhs, rhs, tmp;
> + gimple *new_stmt, *phi;
> +
> + lhs = gimple_assign_lhs (stmt);
> + vectype = TREE_TYPE (lhs);
> +
> + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
> +
> + /* Find operand RHS defined by PHI node. */
> + rhs = gimple_assign_rhs1 (stmt);
> + gcc_assert (TREE_CODE (rhs) == SSA_NAME);
> + phi = SSA_NAME_DEF_STMT (rhs);
> +
> + if (phi != prev && gimple_code (phi) != GIMPLE_PHI)
> + {
> + rhs = gimple_assign_rhs2 (stmt);
> + gcc_assert (TREE_CODE (rhs) == SSA_NAME);
> + phi = SSA_NAME_DEF_STMT (rhs);
> + gcc_assert (phi == prev || gimple_code (phi) == GIMPLE_PHI);
> + }
> +
> + /* Convert reduction stmt to ordinary assignment to TMP. */
> + tmp = vect_get_new_ssa_name (vectype, vect_simple_var, NULL);
> + gimple_assign_set_lhs (stmt, tmp);
> +
> + /* Create VEC_COND_EXPR and insert it after STMT. */
> + new_stmt = gimple_build_assign (lhs, VEC_COND_EXPR, mask, tmp, rhs);
> + gsi = gsi_for_stmt (stmt);
> + gsi_insert_after (&gsi, new_stmt, GSI_SAME_STMT);
> +
> + return new_stmt;
> +}
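
The effect of the VEC_COND_EXPR described in the comment above can be sketched in scalar C as follows. This is an illustration under assumptions, not the patch's code: VF and masked_sum are hypothetical names, and inactive lanes of the operand are given an arbitrary value (99) to show that the select, not the operand, keeps the accumulator unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { VF = 4 };  /* Hypothetical vectorization factor.  */

/* Sum D[0..N-1] with a "vector" accumulator R of VF lanes.  The last
   iteration may be partial; lanes beyond N are masked off.  */
static int
masked_sum (const int *d, size_t n)
{
  int r[VF] = { 0 };

  for (size_t i = 0; i < n; i += VF)
    for (size_t j = 0; j < VF; j++)
      {
        bool mask = i + j < n;
        /* Models a masked load: inactive lanes hold an arbitrary value.  */
        int val = mask ? d[i + j] : 99;
        /* S1': r_4 = r_2 + d_3  */
        int tmp = r[j] + val;
        /* S2': r_1 = VEC_COND_EXPR<MASK, r_4, r_2>  */
        r[j] = mask ? tmp : r[j];
      }

  /* Final reduction of the accumulator lanes.  */
  int sum = 0;
  for (size_t j = 0; j < VF; j++)
    sum += r[j];
  return sum;
}
```
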
> +
> +/* Function vect_mask_mask_load_store_stmt.
> +
> + Mask given vectorized MASK_LOAD or MASK_STORE statement
> + STMT using MASK. Function replaces a mask used by STMT
> + with its conjunction with MASK. */
> +
> +static void
> +vect_mask_mask_load_store_stmt (gimple *stmt, tree mask)
> +{
> + gimple *new_stmt;
> + tree old_mask, new_mask;
> + gimple_stmt_iterator gsi;
> +
> + gsi = gsi_for_stmt (stmt);
> + old_mask = gimple_call_arg (stmt, 2);
> +
> + gcc_assert (types_compatible_p (TREE_TYPE (old_mask), TREE_TYPE (mask)));
> +
> + new_mask = vect_get_new_ssa_name (TREE_TYPE (mask), vect_simple_var, NULL);
> + new_stmt = gimple_build_assign (new_mask, BIT_AND_EXPR, old_mask, mask);
> + gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
> +
> + gimple_call_set_arg (stmt, 2, new_mask);
> + update_stmt (stmt);
> +}
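
A small scalar model of the conjunction above, assuming boolean lanes: a store that is already conditional gets its mask ANDed with the loop mask, so a lane is written only if both the original condition and the loop bound allow it. The names and_masks and masked_store are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* new_mask = old_mask & loop_mask, lane by lane.  */
static void
and_masks (const bool *old_mask, const bool *loop_mask, size_t n, bool *out)
{
  for (size_t j = 0; j < n; j++)
    out[j] = old_mask[j] && loop_mask[j];
}

/* Scalar model of a masked store: only active lanes are written.  */
static void
masked_store (int *dst, const int *val, const bool *mask, size_t n)
{
  for (size_t j = 0; j < n; j++)
    if (mask[j])
      dst[j] = val[j];
}
```
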
> +
> +
> +/* Function vect_mask_load_store_stmt.
> +
> + Mask given vectorized load or store statement STMT using
> + MASK. DR is a data reference for a scalar memory access.
> + Assignment is transformed into MASK_LOAD or MASK_STORE
> + statement. SI is either an iterator pointing to STMT and
> + is to be updated or NULL. */
> +
> +static void
> +vect_mask_load_store_stmt (gimple *stmt, tree vectype, tree mask,
> + data_reference *dr, gimple_stmt_iterator *si)
> +{
> + tree mem, val, addr, ptr;
> + gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> + unsigned align, misalign;
> + tree elem_type = TREE_TYPE (vectype);
> + gimple *new_stmt;
> +
> + gcc_assert (!si || gsi_stmt (*si) == stmt);
> +
> + gsi = gsi_for_stmt (stmt);
> + if (gimple_store_p (stmt))
> + {
> + val = gimple_assign_rhs1 (stmt);
> + mem = gimple_assign_lhs (stmt);
> + }
> + else
> + {
> + val = gimple_assign_lhs (stmt);
> + mem = gimple_assign_rhs1 (stmt);
> + }
> +
> + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> + == TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)));
> +
> + addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (mem),
> + true, NULL_TREE, true,
> + GSI_SAME_STMT);
> +
> + align = TYPE_ALIGN_UNIT (vectype);
> + if (aligned_access_p (dr))
> + misalign = 0;
> + else if (DR_MISALIGNMENT (dr) == -1)
> + {
> + align = TYPE_ALIGN_UNIT (elem_type);
> + misalign = 0;
> + }
> + else
> + misalign = DR_MISALIGNMENT (dr);
> + set_ptr_info_alignment (get_ptr_info (addr), align, misalign);
> + ptr = build_int_cst (reference_alias_ptr_type (mem),
> + misalign ? misalign & -misalign : align);
> +
> + if (gimple_store_p (stmt))
> + new_stmt = gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
> + mask, val);
> + else
> + {
> + new_stmt = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr, ptr,
> + mask);
> + gimple_call_set_lhs (new_stmt, val);
> + }
> + gsi_replace (si ? si : &gsi, new_stmt, false);
> +}
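
The expression `misalign ? misalign & -misalign : align` above takes the lowest set bit of the misalignment, i.e. the largest power of two known to divide the address, and falls back to the full alignment when the misalignment is zero. A minimal sketch of just that arithmetic (known_alignment is a hypothetical name, not a GCC function):

```c
#include <assert.h>

/* Largest power-of-two alignment that can be assumed for an address
   which is MISALIGN bytes past an ALIGN-byte boundary.  */
static unsigned
known_alignment (unsigned align, unsigned misalign)
{
  /* For unsigned X, X & -X isolates the lowest set bit of X.  */
  return misalign ? (misalign & -misalign) : align;
}
```

For example, with a 16-byte vector alignment a misalignment of 12 gives a known alignment of 4, and a misalignment of 8 gives 8.
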
> +
> +/* Function vect_combine_loop_epilogue.
> +
> + Combine loop epilogue with the main vectorized body. It requires
> + masking of memory accesses and reductions. */
> +
> +static void
> +vect_combine_loop_epilogue (loop_vec_info loop_vinfo)
> +{
> + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> + basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
> + unsigned mask_no;
> + auto_vec<tree> masks;
> +
> + vect_gen_loop_masks (loop_vinfo, &masks);
> +
> + /* Convert reduction statements if any. */
> + for (unsigned i = 0; i < LOOP_VINFO_REDUCTIONS (loop_vinfo).length (); i++)
> + {
> + gimple *stmt = LOOP_VINFO_REDUCTIONS (loop_vinfo)[i];
> + gimple *prev_stmt = NULL;
> + stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +
> + mask_no = vect_get_mask_index_for_type (STMT_VINFO_VECTYPE (stmt_info));
> +
> + stmt = STMT_VINFO_VEC_STMT (stmt_info);
> + while (stmt)
> + {
> + prev_stmt = vect_mask_reduction_stmt (stmt, masks[mask_no++],
> + prev_stmt);
> + stmt_info = vinfo_for_stmt (stmt);
> + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
> + }
> + }
> +
> + /* Scan all loop statements to convert vector load/store including masked
> + form. */
> + for (unsigned i = 0; i < loop->num_nodes; i++)
> + {
> + basic_block bb = bbs[i];
> + for (gimple_stmt_iterator si = gsi_start_bb (bb);
> + !gsi_end_p (si); gsi_next (&si))
> + {
> + gimple *stmt = gsi_stmt (si);
> + stmt_vec_info stmt_info = NULL;
> + tree vectype = NULL;
> + data_reference *dr;
> +
> + /* Mask load case. */
> + if (is_gimple_call (stmt)
> + && gimple_call_internal_p (stmt)
> + && gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> + && !VECTOR_TYPE_P (TREE_TYPE (gimple_call_arg (stmt, 2))))
> + {
> + stmt_info = vinfo_for_stmt (stmt);
> + if (!STMT_VINFO_VEC_STMT (stmt_info))
> + continue;
> + stmt = STMT_VINFO_VEC_STMT (stmt_info);
> + vectype = STMT_VINFO_VECTYPE (stmt_info);
> + }
> + /* Mask store case. */
> + else if (is_gimple_call (stmt)
> + && gimple_call_internal_p (stmt)
> + && gimple_call_internal_fn (stmt) == IFN_MASK_STORE
> + && vinfo_for_stmt (stmt)
> + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
> + {
> + stmt_info = vinfo_for_stmt (stmt);
> + vectype = TREE_TYPE (gimple_call_arg (stmt, 2));
> + }
> + /* Load case. */
> + else if (gimple_assign_load_p (stmt)
> + && !VECTOR_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt))))
> + {
> + stmt_info = vinfo_for_stmt (stmt);
> +
> + /* Skip loads that were not vectorized. */
> + if (!STMT_VINFO_VEC_STMT (stmt_info))
> + continue;
> +
> + /* Skip invariant loads. */
> + if (integer_zerop (nested_in_vect_loop_p (loop, stmt)
> + ? STMT_VINFO_DR_STEP (stmt_info)
> + : DR_STEP (STMT_VINFO_DATA_REF (stmt_info))))
> + continue;
> + stmt = STMT_VINFO_VEC_STMT (stmt_info);
> + vectype = STMT_VINFO_VECTYPE (stmt_info);
> + }
> + /* Store case. */
> + else if (gimple_code (stmt) == GIMPLE_ASSIGN
> + && gimple_store_p (stmt)
> + && vinfo_for_stmt (stmt)
> + && STMT_VINFO_FIRST_COPY_P (vinfo_for_stmt (stmt)))
> + {
> + stmt_info = vinfo_for_stmt (stmt);
> + vectype = STMT_VINFO_VECTYPE (stmt_info);
> + }
> + else
> + continue;
> +
> + /* Skip hoisted out statements. */
> + if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> + continue;
> +
> + mask_no = vect_get_mask_index_for_type (vectype);
> +
> + dr = STMT_VINFO_DATA_REF (stmt_info);
> + while (stmt)
> + {
> + if (is_gimple_call (stmt))
> + vect_mask_mask_load_store_stmt (stmt, masks[mask_no++]);
> + else
> + vect_mask_load_store_stmt (stmt, vectype, masks[mask_no++], dr,
> + /* Have to update iterator only if
> + it points to stmt we mask. */
> + stmt == gsi_stmt (si) ? &si : NULL);
> +
> + stmt_info = vinfo_for_stmt (stmt);
> + stmt = stmt_info ? STMT_VINFO_RELATED_STMT (stmt_info) : NULL;
> + }
> + }
> + }
> +
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "=== Loop epilogue was combined ===\n");
> +}
>
> /* Function vect_transform_loop.
>
> @@ -6936,7 +7568,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> run at least the vectorization factor number of times checking
> is pointless, too. */
> th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> - if (th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
> + if ((th >= LOOP_VINFO_VECT_FACTOR (loop_vinfo) - 1
> + || (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
> + && th > 1))
> && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> {
> if (dump_enabled_p ())
> @@ -6985,12 +7619,18 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> {
> tree ratio_mult_vf;
> if (!ni_name)
> - ni_name = vect_build_loop_niters (loop_vinfo);
> + {
> + ni_name = vect_build_loop_niters (loop_vinfo);
> + LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
> + }
> vect_generate_tmps_on_preheader (loop_vinfo, ni_name, &ratio_mult_vf,
> &ratio);
> - epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
> - ratio_mult_vf, th,
> - check_profitability);
> + /* If epilogue is combined with main loop, peeling is not needed. */
> + if (!LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo)
> + || check_profitability)
> + epilogue = vect_do_peeling_for_loop_bound (loop_vinfo, ni_name,
> + ratio_mult_vf, th,
> + check_profitability);
> }
> else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
> @@ -6998,7 +7638,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> else
> {
> if (!ni_name)
> - ni_name = vect_build_loop_niters (loop_vinfo);
> + {
> + ni_name = vect_build_loop_niters (loop_vinfo);
> + LOOP_VINFO_NITERS (loop_vinfo) = ni_name;
> + }
> vect_generate_tmps_on_preheader (loop_vinfo, ni_name, NULL, &ratio);
> }
>
> @@ -7252,6 +7895,9 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>
> slpeel_make_loop_iterate_ntimes (loop, ratio);
>
> + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> + vect_combine_loop_epilogue (loop_vinfo);
> +
> /* Reduce loop iterations by the vectorization factor. */
> scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor),
> expected_iterations / vectorization_factor);
> @@ -7263,20 +7909,28 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> loop->nb_iterations_likely_upper_bound
> = loop->nb_iterations_likely_upper_bound - 1;
> }
> - loop->nb_iterations_upper_bound
> - = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
> - vectorization_factor) - 1;
> - loop->nb_iterations_likely_upper_bound
> - = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + 1,
> - vectorization_factor) - 1;
> +
> + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> + loop->nb_iterations_upper_bound
> + = wi::div_ceil (loop->nb_iterations_upper_bound + 1,
> + vectorization_factor, UNSIGNED) - 1;
> + else
> + loop->nb_iterations_upper_bound
> + = wi::udiv_floor (loop->nb_iterations_upper_bound + 1,
> + vectorization_factor) - 1;
>
> if (loop->any_estimate)
> {
> - loop->nb_iterations_estimate
> - = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
> - if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> - && loop->nb_iterations_estimate != 0)
> - loop->nb_iterations_estimate = loop->nb_iterations_estimate - 1;
> + if (LOOP_VINFO_COMBINE_EPILOGUE (loop_vinfo))
> + loop->nb_iterations_estimate
> + = wi::div_ceil (loop->nb_iterations_estimate, vectorization_factor,
> + UNSIGNED);
> + else
> + loop->nb_iterations_estimate
> + = wi::udiv_floor (loop->nb_iterations_estimate, vectorization_factor);
> + if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> + && loop->nb_iterations_estimate != 0)
> + loop->nb_iterations_estimate -= 1;
> }
>
> if (dump_enabled_p ())
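
To illustrate the change in the iteration bounds in the last hunk: with a peeled epilogue the vector loop runs floor(N / VF) times, while with a combined epilogue the final partial iteration is executed under a mask, so it runs ceil(N / VF) times. A sketch of that arithmetic on plain integers, assuming the upper bound counts latch executions (iterations minus one) as in the hunk; bound_peeled and bound_combined are hypothetical names:

```c
#include <assert.h>

/* Upper bound on latch executions of the vector loop when the
   epilogue is peeled: floor ((UB + 1) / VF) - 1.  */
static unsigned long
bound_peeled (unsigned long ub, unsigned long vf)
{
  assert (vf > 0 && ub + 1 >= vf);
  return (ub + 1) / vf - 1;
}

/* Same bound when the epilogue is combined with the loop body:
   ceil ((UB + 1) / VF) - 1.  */
static unsigned long
bound_combined (unsigned long ub, unsigned long vf)
{
  assert (vf > 0);
  return (ub + 1 + vf - 1) / vf - 1;
}
```

For a scalar loop of at most 10 iterations (UB = 9) and VF = 4, the peeled form gives a bound of 1 (two vector iterations plus a scalar epilogue), and the combined form gives 2 (three vector iterations, the last one masked).
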