[RFC][PATCH, vec-tails 00/10] Support vectorization of loop epilogues

Richard Biener richard.guenther@gmail.com
Wed Jun 15 12:06:00 GMT 2016


On Thu, May 19, 2016 at 9:35 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> Hi,
>
> This series is an extension of previous work on loop epilogue combining [1].
>
> It introduces three ways to handle vectorized loop epilogues: combine it with
> vectorized loop, vectorize it with masks, vectorize it using a smaller vector
> size.
>
> Also it supports vectorization of loops with low trip count.
>
> Epilogue combining is used as a basic masking transformation.  Epilogue
> masking and low trip count loop vectorization are considered as epilogue
> combining with a zero trip count vector loop.
>
> Epilogue vectorization is controlled via a new option -ftree-vectorize-epilogues=
> which takes a comma-separated list of enabled modes: combine, mask,
> nomask.  There is a separate option -ftree-vectorize-short-loops for low trip
> count loops.
>
> To support epilogue vectorization I use a queue of loops to be vectorized in
> vectorize_loops and change vect_transform_loop to return the generated epilogue
> (in case we want to try to vectorize it).  If an epilogue is returned then it is
> queued for processing.  This variant of epilogue processing was chosen because
> it is simple and works for all epilogue processing options.
>
> There are currently some limitations implied by this scheme:
>  - Copied loop misses some required optimization info (e.g. scev info)
> which may result in an epilogue which cannot be vectorized
>  - Loop epilogue may require if-conversion
>  - Alias/alignment checks are not inherited and therefore will be performed
> one more time for epilogue.  For now epilogue vectorization is just disabled
> in case alias versioning is required and alignment enhancement is
> disabled for epilogues.
>
> There is a set of new fields added to _loop_vec_info to support epilogues
> vectorization.
>
> LOOP_VINFO_CAN_BE_MASKED - true if vectorized loop can be masked.  It is
> computed during vectorization analysis (in various vectorizable_* functions).
>
> LOOP_VINFO_REQUIRED_MASKS - for loop which can be masked it holds all masks
> required to mask the loop.
>
> LOOP_VINFO_COMBINE_EPILOGUE - true if we decided vectorized loop should be
> masked.
>
> LOOP_VINFO_MASK_EPILOGUE - true if we decided an epilogue of this loop
> should be vectorized and masked
>
> LOOP_VINFO_NEED_MASKING - true if vectorized loop has to be masked (set for
> epilogues we want to mask and low trip count loops).
>
> LOOP_VINFO_ORIG_LOOP_INFO - for epilogues this holds loop_vec_info of the
> original vectorized loop.
>
> To decide whether we want to mask or combine a loop epilogue, the
> cost model is extended with masking costs.  This includes vect_masking_prologue
> and vect_masking_body elements added to the vect_cost_model_location enum and
> finish_cost extended with two additional return values correspondingly.  In
> addition to add_stmt_cost, I also add add_stmt_masking_cost to compute
> a cost for masking a statement.
>
> vect_estimate_min_profitable_iters checks if epilogue masking is profitable
> and also computes a number of iterations required to have profitable
> epilogue combining (this number may be used as a threshold in vectorized
> loop guard).
>
> These patches do not enable any of the new features by default at any
> optimization level.  Masking features are expected to be mostly used for AVX-512
> targets, and the lack of hardware suitable for wide performance testing is the
> reason the cost model is not tuned and the optimizations are not enabled by
> default.  With small tests using a small number of loop iterations and 'heavy'
> epilogues (e.g. number of iterations is VF*2-1) I see the expected ~2x gain on
> existing KNL hardware.  Later this year we expect to get access to KNL machines
> and have an opportunity to tune the masking cost model.
>
> On Haswell hardware I don't see performance gains on similar loops, which means
> masked code is not better than scalar code when masks are used heavily.
> This still might be useful in case the number of statements requiring masking is
> relatively small (I used the test a[i] += b[i], which needs masking for 3 out of 4
> vector statements).  We will continue searching for cases where masking is
> profitable for Haswell to tune masking costs appropriately.
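
For illustration, a minimal sketch of what a masked epilogue means for the
a[i] += b[i] test mentioned above.  It is plain C that emulates masking with
a per-lane flag, not what the vectorizer emits; VF = 4 and both function
names are assumptions made for the example.  The masked operations are
presumably the two loads and the store, with the add left unmasked.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };  /* assumed vectorization factor; illustrative only */

/* Emulate one vector iteration of a[i] += b[i] starting at BASE.
   Lane L takes part only if MASK[L] is nonzero, which stands in for a
   masked load of a, a masked load of b and a masked store to a.  */
static void
masked_vector_iter (int *a, const int *b, size_t base, const int mask[VF])
{
  for (size_t l = 0; l < VF; l++)
    if (mask[l])
      a[base + l] += b[base + l];
}

/* Unmasked vector loop followed by at most one masked iteration that
   replaces the scalar epilogue.  For n < VF the vector loop runs zero
   times and only the masked iteration executes.  */
void
add_with_masked_epilogue (int *a, const int *b, size_t n)
{
  int mask[VF];
  size_t i = 0;

  for (size_t l = 0; l < VF; l++)
    mask[l] = 1;                       /* all lanes active */
  for (; i + VF <= n; i += VF)
    masked_vector_iter (a, b, i, mask);

  assert (n - i < VF);                 /* fewer than VF elements remain */
  if (i < n)
    {
      for (size_t l = 0; l < VF; l++)
        mask[l] = i + l < n;           /* active lanes: remaining elements */
      masked_vector_iter (a, b, i, mask);
    }
}
```

The n < VF case corresponds to the "zero trip count vector loop" view of low
trip count loops in the description above.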

So I've gone over the patches and gave mostly high-level comments.  The
vectorizer is already in a somewhat messy (aka not easy to follow) state,
and this series doesn't improve the situation (heh).  Esp. the high-level
structure for code generation and its documentation needs work (where we do
versioning / peeling and how we use the copies in which condition and
where, etc).

Now - given my question on the profitability code for vectorized body
masking I wonder if vectorized body masking shouldn't be better done via
adding another version for low tripcount loops (not < vf but say < vf * N
with N determined by a cost model).  Otherwise I can't see how we'd ever
mask the vectorized body for loops with a parametric number of iterations
(most loops in real life).
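
A minimal source-level sketch of what such a version split could look like,
again for a[i] += b[i].  This is plain C emulating masks, not vectorizer
output; VF = 4, the function names and the factor parameter (standing in
for N) are assumptions made for illustration, and a real N would come from
the cost model.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };  /* assumed vectorization factor; illustrative only */

/* Version for low trip counts: every iteration is masked, no epilogue.  */
static void
add_fully_masked (int *a, const int *b, size_t n)
{
  for (size_t i = 0; i < n; i += VF)
    for (size_t l = 0; l < VF; l++)
      if (i + l < n)                   /* lane mask */
        a[i + l] += b[i + l];
}

/* Version for larger trip counts: unmasked vector body, scalar epilogue.  */
static void
add_unmasked (int *a, const int *b, size_t n)
{
  size_t i = 0;
  for (; i + VF <= n; i += VF)
    for (size_t l = 0; l < VF; l++)
      a[i + l] += b[i + l];
  for (; i < n; i++)
    a[i] += b[i];
}

/* Runtime guard choosing between the two versions.  FACTOR plays the
   role of N; returns 1 if the masked version was used, 0 otherwise.  */
int
add_versioned (int *a, const int *b, size_t n, size_t factor)
{
  assert (factor > 0);
  if (n < VF * factor)
    {
      add_fully_masked (a, b, n);
      return 1;
    }
  add_unmasked (a, b, n);
  return 0;
}
```

With a runtime guard like this, the masked body is only entered when the
trip count is known to be small, even if n itself is not a compile-time
constant.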

Thanks,
Richard.

> Below are ChangeLogs for whole series.
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03014.html
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * common.opt (flag_tree_vectorize_epilogues): New.
>         (ftree-vectorize-short-loops): New.
>         (ftree-vectorize-epilogues=): New.
>         (fno-tree-vectorize-epilogues): New.
>         (fvect-epilogue-cost-model=): New.
>         * flag-types.h (enum vect_epilogue_mode): New.
>         * opts.c (parse_vectorizer_options): New.
>         (common_handle_option): Support -ftree-vectorize-epilogues=
>         and -fno-tree-vectorize-epilogues options.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * tree-vectorizer.h (struct _loop_vec_info): Add new fields
>         can_be_masked, required_masks, mask_epilogue, combine_epilogue,
>         need_masking, orig_loop_info.
>         (LOOP_VINFO_CAN_BE_MASKED): New.
>         (LOOP_VINFO_REQUIRED_MASKS): New.
>         (LOOP_VINFO_COMBINE_EPILOGUE): New.
>         (LOOP_VINFO_MASK_EPILOGUE): New.
>         (LOOP_VINFO_NEED_MASKING): New.
>         (LOOP_VINFO_ORIG_LOOP_INFO): New.
>         (LOOP_VINFO_EPILOGUE_P): New.
>         (LOOP_VINFO_ORIG_MASK_EPILOGUE): New.
>         (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>         * tree-vect-loop.c (new_loop_vec_info): Initialize new
>         _loop_vec_info fields.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * tree-if-conv.c (tree_if_conversion): Make public.
>         * tree-if-conv.h: New file.
>         * tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Don't
>         try to enhance alignment for epilogues.
>         * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Return
>         created loop.
>         * tree-vect-loop.c: Include tree-if-conv.h.
>         (destroy_loop_vec_info): Preserve LOOP_VINFO_ORIG_LOOP_INFO in
>         loop->aux.
>         (vect_analyze_loop_form): Init LOOP_VINFO_ORIG_LOOP_INFO and reset
>         loop->aux.
>         (vect_analyze_loop): Reset loop->aux.
>         (vect_transform_loop): Check if created epilogue should be returned
>         for further vectorization.  If-convert epilogue if required.
>         * tree-vectorizer.c (vectorize_loops): Add a queue of loops to
>         process and insert vectorized loop epilogues into this queue.
>         * tree-vectorizer.h (vect_do_peeling_for_loop_bound): Return created
>         loop.
>         (vect_transform_loop): Return created loop.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * config/i386/i386.c (ix86_init_cost): Extend costs array.
>         (ix86_add_stmt_masking_cost): New.
>         (ix86_finish_cost): Add masking_prologue_cost and masking_body_cost
>         args.
>         (TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
>         * config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): New.
>         * config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New.
>         * config/rs6000/rs6000.c (_rs6000_cost_data): Extend cost array.
>         (rs6000_init_cost): Initialize new cost elements.
>         (rs6000_finish_cost): Add masking_prologue_cost and masking_body_cost.
>         * config/spu/spu.c (spu_init_cost): Extend costs array.
>         (spu_finish_cost): Add masking_prologue_cost and masking_body_cost args.
>         * doc/tm.texi.in (TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
>         * doc/tm.texi: Regenerated.
>         * target.def (add_stmt_masking_cost): New.
>         (finish_cost): Add masking_prologue_cost and masking_body_cost args.
>         * target.h (enum vect_cost_for_stmt): Add vector_mask_load and
>         vector_mask_store.
>         (enum vect_cost_model_location): Add vect_masking_prologue
>         and vect_masking_body.
>         * targhooks.c (default_builtin_vectorization_cost): Support
>         vector_mask_load and vector_mask_store.
>         (default_init_cost): Extend costs array.
>         (default_add_stmt_masking_cost): New.
>         (default_finish_cost): Add masking_prologue_cost and masking_body_cost
>         args.
>         * targhooks.h (default_add_stmt_masking_cost): New.
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Adjust
>         finish_cost call.
>         * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Likewise.
>         * tree-vectorizer.h (add_stmt_masking_cost): New.
>         (finish_cost): Add masking_prologue_cost and masking_body_cost args.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * tree-vect-loop.c: Include insn-config.h and recog.h.
>         (vect_check_required_masks_widening): New.
>         (vect_check_required_masks_narrowing): New.
>         (vect_get_masking_iv_elems): New.
>         (vect_get_masking_iv_type): New.
>         (vect_get_extreme_masks): New.
>         (vect_check_required_masks): New.
>         (vect_analyze_loop_operations): Add vect_check_required_masks
>         call to compute LOOP_VINFO_CAN_BE_MASKED.
>         (vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
>         LOOP_VINFO_NEED_MASKING before starting over.
>         (vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
>         masking cost.
>         * tree-vect-stmts.c (can_mask_load_store): New.
>         (vect_model_load_masking_cost): New.
>         (vect_model_store_masking_cost): New.
>         (vect_model_simple_masking_cost): New.
>         (vectorizable_mask_load_store): Compute LOOP_VINFO_CAN_BE_MASKED
>         and masking cost.
>         (vectorizable_simd_clone_call): Likewise.
>         (vectorizable_store): Likewise.
>         (vectorizable_load): Likewise.
>         (vect_stmt_should_be_masked_for_epilogue): New.
>         (vect_add_required_mask_for_stmt): New.
>         (vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
>         * tree-vectorizer.h (vect_model_load_masking_cost): New.
>         (vect_model_store_masking_cost): New.
>         (vect_model_simple_masking_cost): New.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * tree-vect-stmts.c (vectorizable_mask_load_store): Mark
>         the first copy of generated vector stores.
>         (vectorizable_store): Mark the first copy of generated
>         vector stores and provide it with vectype and the original
>         data reference.
>         * tree-vectorizer.h (struct _stmt_vec_info): Add first_copy_p
>         field.
>         (STMT_VINFO_FIRST_COPY_P): New.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * dbgcnt.def (vect_tail_combine): New.
>         * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>         * tree-vect-data-refs.c (vect_get_new_ssa_name): Support vect_mask_var.
>         * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>         epilogue combined with loop body.
>         (vect_do_peeling_for_loop_bound): Likewise.
>         (vect_do_peeling_for_alignment): ???
>         * tree-vect-loop.c: Include alias.h and dbgcnt.h.
>         (vect_estimate_min_profitable_iters): Add ret_min_profitable_combine_niters
>         arg, compute number of iterations for which loop epilogue combining is
>         profitable.
>         (vect_generate_tmps_on_preheader): Support combined epilogue.
>         (vect_gen_ivs_for_masking): New.
>         (vect_get_mask_index_for_elems): New.
>         (vect_get_mask_index_for_type): New.
>         (vect_gen_loop_masks): New.
>         (vect_mask_reduction_stmt): New.
>         (vect_mask_mask_load_store_stmt): New.
>         (vect_mask_load_store_stmt): New.
>         (vect_combine_loop_epilogue): New.
>         (vect_transform_loop): Support combined epilogue.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * dbgcnt.def (vect_tail_mask): New.
>         * tree-vect-loop.c (vect_analyze_loop_2): Support masked loop
>         epilogues and low trip count loops.
>         (vect_get_known_peeling_cost): Ignore scalar epilogue cost for
>         loops we are going to mask.
>         (vect_estimate_min_profitable_iters): Support masked loop
>         epilogues and low trip count loops.
>         * tree-vectorizer.c (vectorize_loops): Add a message for a case
>         when loop epilogue can't be vectorized.
>
>
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * tree-vect-loop.c (vect_transform_loop): Print more info
>         about vectorized loop and specify used vector size.
>
