This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH GCC][4/5]Improve loop distribution to handle hmmer
On Wed, Jun 7, 2017 at 11:03 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Jun 2, 2017 at 1:51 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> This is the main patch of the change. It improves loop distribution by versioning loop under
>> runtime alias check conditions, as well as better partition fusion. As described in comments,
>> the patch basically implements distribution in the following steps:
>>
>> 1) Seed partitions with specific type statements. For now we support
>> two types seed statements: statement defining variable used outside
>> of loop; statement storing to memory.
>> 2) Build reduced dependence graph (RDG) for loop to be distributed.
>> The vertices (RDG:V) model all statements in the loop and the edges
>> (RDG:E) model flow and control dependences between statements.
>> 3) Apart from RDG, compute data dependences between memory references.
>> 4) Starting from seed statement, build up partition by adding depended
>> statements according to RDG's dependence information. Partition is
>> classified as parallel type if it can be executed parallelly; or as
>> sequential type if it can't. Parallel type partition is further
>> classified as different builtin kinds if it can be implemented as
>> builtin function calls.
>> 5) Build partition dependence graph (PG) based on data dependences.
>> The vertices (PG:V) model all partitions and the edges (PG:E) model
>> all data dependences between every partitions pair. In general,
>> data dependence is either compilation time known or unknown. In C
>> family languages, there exists quite amount compilation time unknown
>> dependences because of possible alias relation of data references.
>> We categorize PG's edge to two types: "true" edge that represents
>> compilation time known data dependences; "alias" edge for all other
>> data dependences.
>> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
>> partitions in each strong connected commponent (SCC) correspondingly.
>> Build new PG for merged partitions.
>> 7) Traverse PG again and this time with both "true" and "alias" edges
>> included. We try to break SCCs by removing some edges. Because
>> SCCs by "true" edges are all fused in step 6), we can break SCCs
>> by removing some "alias" edges. It's NP-hard to choose optimal
>> edge set, fortunately simple approximation is good enough for us
>> given the small problem scale.
>> 8) Collect all data dependences of the removed "alias" edges. Create
>> runtime alias checks for collected data dependences.
>> 9) Version loop under the condition of runtime alias checks. Given
>> loop distribution generally introduces additional overhead, it is
>> only useful if vectorization is achieved in distributed loop. We
>> version loop with internal function call IFN_LOOP_DIST_ALIAS. If
>> no distributed loop can be vectorized, we simply remove distributed
>> loops and recover to the original one.
>>
>> Also, there are some more to improve in the future (which shouldn't be difficult):
>> TODO:
>> 1) We only distribute innermost loops now. This pass should handle loop
>> nests in the future.
>> 2) We only fuse partitions in SCC now. A better fusion algorithm is
>> desired to minimize loop overhead, maximize parallelism and maximize
>>
>> This patch also fixes couple of latent bugs in the original implementation.
>
> It would be nice to split this patch up, for example fixing the latent
> bugs first
> (so we can backport that part).
I only remember the major one in which topological order is needed
when sorting statements, dominance order is not enough. I will try to
split it as much as possible, but the change falls in a big chunk to
some extend.
>
> You now compute _all_ dependences in the loop while I originally designed loop
> distribution to do less dependence computation by only computing dependences
> between datarefs in different partitions (and delaying that until
> after partition fusing
> required by things not requiring dependence info).
> Please do not remove this optimization.
Right, I did that and cache it in hash table because we need to query
dependence for two references multiple times. I will restore the
on-demand computing behavior.
Thanks,
bin
>
> Richard.
>
>> After this change, kernel loop in hmmer can be distributed and vectorized as a result.
>> This gives obvious performance improvement. There is still inefficient code generation
>> issue which I will try to fix in loop split. Apart from this, the next opportunity in hmmer
>> is to eliminate number of dead stores under proper alias information.
>> Bootstrap and test at O2/O3 on x86_64 and AArch64. is it OK?
>>
>> Thanks,
>> bin
>> 2017-05-31 Bin Cheng <bin.cheng@arm.com>
>>
>> * cfgloop.h (struct loop): New field ldist_alias_id.
>> * cfgloopmanip.c (lv_adjust_loop_entry_edge): Refine comment for
>> new internal function.
>> * internal-fn.c (expand_LOOP_DIST_ALIAS): New function.
>> * internal-fn.def (IFN_LOOP_DIST_ALIAS): New internal function.
>> * tree-loop-distribution.c: Add general explanantion on the pass.
>> Include header file.
>> (struct ddr_entry, struct ddr_entry_hasher): New structs.
>> (ddr_entry_hasher::hash, ddr_entry_hasher::equal): New functions.
>> (bb_top_order_index, bb_top_order_index_size): New static vars.
>> (bb_top_order_cmp): New function.
>> (stmts_from_loop): Get basic blocks in topological order. Don't
>> free data references.
>> (build_rdg): New parameter pointing to vector of data references.
>> Store data references in it.
>> (enum partition_type): New enum.
>> (enum partition_kind, struct partition): Add comments. New fields.
>> (partition_alloc, partition_free): Handle new fields of partition.
>> (enum fuse_type, fuse_message): New enum and varaible.
>> (partition_merge_into): New parameter. Update implementation wrto
>> new fields of partition.
>> (generate_loops_for_partition): Set ldist_alias_id information.
>> (record_ddr, get_ddr, possible_data_dep_cycle_p): New functions.
>> (build_rdg_partition_for_vertex): New parameter. Compute type info
>> for partition.
>> (classify_partition): New parameter. Only classify partition as
>> reduction if the reduction is not included by all partitions.
>> Retrieve cached ddr, rather than compute it on the fly.
>> (ref_base_address): Delete.
>> (similar_memory_accesses): Rename to ...
>> (share_memory_accesses): ... this.
>> (rdg_build_partitions): New parameter and update uses.
>> (pg_add_dependence_edges): New parameter. Store ddr in parameter
>> vector if it could be resolved by runtime alias check.
>> (rdg_compute_data_dependence): New function.
>> (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
>> (init_partition_graph_vertices, add_partition_graph_edge): New.
>> (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
>> (free_partition_graph_vdata, build_partition_graph): New.
>> (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
>> (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
>> (data_ref_segment_size, latch_dominated_by_data_ref): New.
>> (compute_alias_check_pairs, version_loop_by_alias_check): New.
>> (version_for_distribution_p, finalize_partitions): New.
>> (distribute_loop): Update uses. Break SCC and version loop
>> with runtime alias checks.
>> (pass_loop_distribution::execute): Compute topological order for
>> basic blocks. Update uses.
>> * tree-vectorizer.c (vect_loop_dist_alias_call): New.
>> (fold_loop_dist_alias_call): New.
>> (vectorize_loops): Fold IFN_LOOP_DIST_ALIAS call depending on
>> successful vectorization or not.
>>
>> gcc/testsuite/ChangeLog
>> 2017-05-31 Bin Cheng <bin.cheng@arm.com>
>>
>> * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
>> * gcc.dg/tree-ssa/ldist-12.c: Ditto.
>> * gcc.dg/tree-ssa/ldist-13.c: Ditto.
>> * gcc.dg/tree-ssa/ldist-14.c: Ditto.