This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH GCC][4/5]Improve loop distribution to handle hmmer


On Fri, Jun 2, 2017 at 1:51 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> This is the main patch of the change.  It improves loop distribution by versioning loop under
> runtime alias check conditions, as well as better partition fusion.  As described in comments,
> the patch basically implements distribution in the following steps:
>
>      1) Seed partitions with specific type statements.  For now we support
>         two types seed statements: statement defining variable used outside
>         of loop; statement storing to memory.
>      2) Build reduced dependence graph (RDG) for loop to be distributed.
>         The vertices (RDG:V) model all statements in the loop and the edges
>         (RDG:E) model flow and control dependences between statements.
>      3) Apart from RDG, compute data dependences between memory references.
>      4) Starting from seed statement, build up partition by adding depended
>         statements according to RDG's dependence information.  Partition is
>         classified as parallel type if it can be executed parallelly; or as
>         sequential type if it can't.  Parallel type partition is further
>         classified as different builtin kinds if it can be implemented as
>         builtin function calls.
>      5) Build partition dependence graph (PG) based on data dependences.
>         The vertices (PG:V) model all partitions and the edges (PG:E) model
>         all data dependences between every partitions pair.  In general,
>         data dependence is either compilation time known or unknown.  In C
>         family languages, there exists quite amount compilation time unknown
>         dependences because of possible alias relation of data references.
>         We categorize PG's edge to two types: "true" edge that represents
>         compilation time known data dependences; "alias" edge for all other
>         data dependences.
>      6) Traverse subgraph of PG as if all "alias" edges don't exist.  Merge
>         partitions in each strong connected commponent (SCC) correspondingly.
>         Build new PG for merged partitions.
>      7) Traverse PG again and this time with both "true" and "alias" edges
>         included.  We try to break SCCs by removing some edges.  Because
>         SCCs by "true" edges are all fused in step 6), we can break SCCs
>         by removing some "alias" edges.  It's NP-hard to choose optimal
>         edge set, fortunately simple approximation is good enough for us
>         given the small problem scale.
>      8) Collect all data dependences of the removed "alias" edges.  Create
>         runtime alias checks for collected data dependences.
>      9) Version loop under the condition of runtime alias checks.  Given
>         loop distribution generally introduces additional overhead, it is
>         only useful if vectorization is achieved in distributed loop.  We
>         version loop with internal function call IFN_LOOP_DIST_ALIAS.  If
>         no distributed loop can be vectorized, we simply remove distributed
>         loops and recover to the original one.
>
> Also, there are some more to improve in the future (which shouldn't be difficult):
>    TODO:
>      1) We only distribute innermost loops now.  This pass should handle loop
>         nests in the future.
>      2) We only fuse partitions in SCC now.  A better fusion algorithm is
>         desired to minimize loop overhead, maximize parallelism and maximize
>
> This patch also fixes couple of latent bugs in the original implementation.

It would be nice to split this patch up, for example fixing the latent
bugs first
(so we can backport that part).

You now compute _all_ dependences in the loop while I originally designed loop
distribution to do less dependence computation by only computing dependences
between datarefs in different partitions (and delaying that until
after partition fusing
required by things not requiring dependence info).
Please do not remove this optimization.

Richard.

> After this change, kernel loop in hmmer can be distributed and vectorized as a result.
> This gives obvious performance improvement.  There is still inefficient code generation
> issue which I will try to fix in loop split.  Apart from this, the next opportunity in hmmer
> is to eliminate number of dead stores under proper alias information.
> Bootstrap and test at O2/O3 on x86_64 and AArch64.  is it OK?
>
> Thanks,
> bin
> 2017-05-31  Bin Cheng  <bin.cheng@arm.com>
>
>         * cfgloop.h (struct loop): New field ldist_alias_id.
>         * cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for
>         new internal function.
>         * internal-fn.c (expand_LOOP_DIST_ALIAS): New function.
>         * internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function.
>         * tree-loop-distribution.c: Add general explanantion on the pass.
>         Include header file.
>         (struct ddr_entry, struct ddr_entry_hasher): New structs.
>         (ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions.
>         (bb_top_order_index, bb_top_order_index_size): New static vars.
>         (bb_top_order_cmp): New function.
>         (stmts_from_loop): Get basic blocks in topological order.  Don't
>         free data references.
>         (build_rdg): New parameter pointing to vector of data references.
>         Store data references in it.
>         (enum partition_type): New enum.
>         (enum partition_kind, struct partition): Add comments.  New fields.
>         (partition_alloc, partition_free): Handle new fields of partition.
>         (enum fuse_type, fuse_message): New enum and varaible.
>         (partition_merge_into): New parameter.  Update implementation wrto
>         new fields of partition.
>         (generate_loops_for_partition): Set ldist_alias_id information.
>         (record_ddr, get_ddr, possible_data_dep_cycle_p): New functions.
>         (build_rdg_partition_for_vertex): New parameter.  Compute type info
>         for partition.
>         (classify_partition): New parameter.  Only classify partition as
>         reduction if the reduction is not included by all partitions.
>         Retrieve cached ddr, rather than compute it on the fly.
>         (ref_base_address): Delete.
>         (similar_memory_accesses): Rename to ...
>         (share_memory_accesses): ... this.
>         (rdg_build_partitions): New parameter and update uses.
>         (pg_add_dependence_edges): New parameter.  Store ddr in parameter
>         vector if it could be resolved by runtime alias check.
>         (rdg_compute_data_dependence): New function.
>         (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
>         (init_partition_graph_vertices, add_partition_graph_edge): New.
>         (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
>         (free_partition_graph_vdata, build_partition_graph): New.
>         (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
>         (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
>         (data_ref_segment_size, latch_dominated_by_data_ref): New.
>         (compute_alias_check_pairs, version_loop_by_alias_check): New.
>         (version_for_distribution_p, finalize_partitions): New.
>         (distribute_loop): Update uses.  Break SCC and version loop
>         with runtime alias checks.
>         (pass_loop_distribution::execute): Compute topological order for
>         basic blocks.  Update uses.
>         * tree-vectorizer.c (vect_loop_dist_alias_call): New.
>         (fold_loop_dist_alias_call): New.
>         (vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on
>         successful vectorization or not.
>
> gcc/testsuite/ChangeLog
> 2017-05-31  Bin Cheng  <bin.cheng@arm.com>
>
>         * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
>         * gcc.dg/tree-ssa/ldist-12.c: Ditto.
>         * gcc.dg/tree-ssa/ldist-13.c: Ditto.
>         * gcc.dg/tree-ssa/ldist-14.c: Ditto.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]