This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH GCC][4/5]Improve loop distribution to handle hmmer

From: "Bin.Cheng" <amker dot cheng at gmail dot com>
To: Richard Biener <richard dot guenther at gmail dot com>
Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
Date: Wed, 7 Jun 2017 11:19:00 +0100
Subject: Re: [PATCH GCC][4/5]Improve loop distribution to handle hmmer
Authentication-results: sourceware.org; auth=none
References: <VI1PR0802MB2176D6EF8010E74F69340CD1E7F70@VI1PR0802MB2176.eurprd08.prod.outlook.com> <CAFiYyc104VabnRQETrp9yuU4_S0DwoskZRNyvRDFcTQMxA8Zpw@mail.gmail.com>

On Wed, Jun 7, 2017 at 11:03 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Jun 2, 2017 at 1:51 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> This is the main patch of the change.  It improves loop distribution by versioning loop under
>> runtime alias check conditions, as well as better partition fusion.  As described in comments,
>> the patch basically implements distribution in the following steps:
>>
>>      1) Seed partitions with specific type statements.  For now we support
>>         two types seed statements: statement defining variable used outside
>>         of loop; statement storing to memory.
>>      2) Build reduced dependence graph (RDG) for loop to be distributed.
>>         The vertices (RDG:V) model all statements in the loop and the edges
>>         (RDG:E) model flow and control dependences between statements.
>>      3) Apart from RDG, compute data dependences between memory references.
>>      4) Starting from seed statement, build up partition by adding depended
>>         statements according to RDG's dependence information.  Partition is
>>         classified as parallel type if it can be executed parallelly; or as
>>         sequential type if it can't.  Parallel type partition is further
>>         classified as different builtin kinds if it can be implemented as
>>         builtin function calls.
>>      5) Build partition dependence graph (PG) based on data dependences.
>>         The vertices (PG:V) model all partitions and the edges (PG:E) model
>>         all data dependences between every partitions pair.  In general,
>>         data dependence is either compilation time known or unknown.  In C
>>         family languages, there exists quite amount compilation time unknown
>>         dependences because of possible alias relation of data references.
>>         We categorize PG's edge to two types: "true" edge that represents
>>         compilation time known data dependences; "alias" edge for all other
>>         data dependences.
>>      6) Traverse subgraph of PG as if all "alias" edges don't exist.  Merge
>>         partitions in each strong connected commponent (SCC) correspondingly.
>>         Build new PG for merged partitions.
>>      7) Traverse PG again and this time with both "true" and "alias" edges
>>         included.  We try to break SCCs by removing some edges.  Because
>>         SCCs by "true" edges are all fused in step 6), we can break SCCs
>>         by removing some "alias" edges.  It's NP-hard to choose optimal
>>         edge set, fortunately simple approximation is good enough for us
>>         given the small problem scale.
>>      8) Collect all data dependences of the removed "alias" edges.  Create
>>         runtime alias checks for collected data dependences.
>>      9) Version loop under the condition of runtime alias checks.  Given
>>         loop distribution generally introduces additional overhead, it is
>>         only useful if vectorization is achieved in distributed loop.  We
>>         version loop with internal function call IFN_LOOP_DIST_ALIAS.  If
>>         no distributed loop can be vectorized, we simply remove distributed
>>         loops and recover to the original one.
>>
>> Also, there are some more to improve in the future (which shouldn't be difficult):
>>    TODO:
>>      1) We only distribute innermost loops now.  This pass should handle loop
>>         nests in the future.
>>      2) We only fuse partitions in SCC now.  A better fusion algorithm is
>>         desired to minimize loop overhead, maximize parallelism and maximize
>>
>> This patch also fixes couple of latent bugs in the original implementation.
>
> It would be nice to split this patch up, for example fixing the latent
> bugs first
> (so we can backport that part).
I only remember the major one in which topological order is needed
when sorting statements, dominance order is not enough.  I will try to
split it as much as possible, but the change falls in a big chunk to
some extend.

>
> You now compute _all_ dependences in the loop while I originally designed loop
> distribution to do less dependence computation by only computing dependences
> between datarefs in different partitions (and delaying that until
> after partition fusing
> required by things not requiring dependence info).
> Please do not remove this optimization.
Right, I did that and cache it in hash table because we need to query
dependence for two references multiple times.  I will restore the
on-demand computing behavior.

Thanks,
bin
>
> Richard.
>
>> After this change, kernel loop in hmmer can be distributed and vectorized as a result.
>> This gives obvious performance improvement.  There is still inefficient code generation
>> issue which I will try to fix in loop split.  Apart from this, the next opportunity in hmmer
>> is to eliminate number of dead stores under proper alias information.
>> Bootstrap and test at O2/O3 on x86_64 and AArch64.  is it OK?
>>
>> Thanks,
>> bin
>> 2017-05-31  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * cfgloop.h (struct loop): New field ldist_alias_id.
>>         * cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for
>>         new internal function.
>>         * internal-fn.c (expand_LOOP_DIST_ALIAS): New function.
>>         * internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function.
>>         * tree-loop-distribution.c: Add general explanantion on the pass.
>>         Include header file.
>>         (struct ddr_entry, struct ddr_entry_hasher): New structs.
>>         (ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions.
>>         (bb_top_order_index, bb_top_order_index_size): New static vars.
>>         (bb_top_order_cmp): New function.
>>         (stmts_from_loop): Get basic blocks in topological order.  Don't
>>         free data references.
>>         (build_rdg): New parameter pointing to vector of data references.
>>         Store data references in it.
>>         (enum partition_type): New enum.
>>         (enum partition_kind, struct partition): Add comments.  New fields.
>>         (partition_alloc, partition_free): Handle new fields of partition.
>>         (enum fuse_type, fuse_message): New enum and varaible.
>>         (partition_merge_into): New parameter.  Update implementation wrto
>>         new fields of partition.
>>         (generate_loops_for_partition): Set ldist_alias_id information.
>>         (record_ddr, get_ddr, possible_data_dep_cycle_p): New functions.
>>         (build_rdg_partition_for_vertex): New parameter.  Compute type info
>>         for partition.
>>         (classify_partition): New parameter.  Only classify partition as
>>         reduction if the reduction is not included by all partitions.
>>         Retrieve cached ddr, rather than compute it on the fly.
>>         (ref_base_address): Delete.
>>         (similar_memory_accesses): Rename to ...
>>         (share_memory_accesses): ... this.
>>         (rdg_build_partitions): New parameter and update uses.
>>         (pg_add_dependence_edges): New parameter.  Store ddr in parameter
>>         vector if it could be resolved by runtime alias check.
>>         (rdg_compute_data_dependence): New function.
>>         (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
>>         (init_partition_graph_vertices, add_partition_graph_edge): New.
>>         (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
>>         (free_partition_graph_vdata, build_partition_graph): New.
>>         (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
>>         (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
>>         (data_ref_segment_size, latch_dominated_by_data_ref): New.
>>         (compute_alias_check_pairs, version_loop_by_alias_check): New.
>>         (version_for_distribution_p, finalize_partitions): New.
>>         (distribute_loop): Update uses.  Break SCC and version loop
>>         with runtime alias checks.
>>         (pass_loop_distribution::execute): Compute topological order for
>>         basic blocks.  Update uses.
>>         * tree-vectorizer.c (vect_loop_dist_alias_call): New.
>>         (fold_loop_dist_alias_call): New.
>>         (vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on
>>         successful vectorization or not.
>>
>> gcc/testsuite/ChangeLog
>> 2017-05-31  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
>>         * gcc.dg/tree-ssa/ldist-12.c: Ditto.
>>         * gcc.dg/tree-ssa/ldist-13.c: Ditto.
>>         * gcc.dg/tree-ssa/ldist-14.c: Ditto.

References:
- [PATCH GCC][4/5]Improve loop distribution to handle hmmer
  - From: Bin Cheng
- Re: [PATCH GCC][4/5]Improve loop distribution to handle hmmer
  - From: Richard Biener

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]