This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check
On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> This is the main patch rewriting loop distribution in order to handle hmmer.
>>> It improves loop distribution by versioning loop under runtime alias check conditions.
>>> As described in comments, the patch basically implements distribution in the following
>>> steps:
>>>
>>> 1) Seed partitions with specific type statements. For now we support
>>> two types seed statements: statement defining variable used outside
>>> of loop; statement storing to memory.
>>> 2) Build reduced dependence graph (RDG) for loop to be distributed.
>>> The vertices (RDG:V) model all statements in the loop and the edges
>>> (RDG:E) model flow and control dependencies between statements.
>>> 3) Apart from RDG, compute data dependencies between memory references.
>>> 4) Starting from seed statement, build up partition by adding depended
>>> statements according to RDG's dependence information. Partition is
>>> classified as parallel type if it can be executed paralleled; or as
>>> sequential type if it can't. Parallel type partition is further
>>> classified as different builtin kinds if it can be implemented as
>>> builtin function calls.
>>> 5) Build partition dependence graph (PG) based on data dependencies.
>>> The vertices (PG:V) model all partitions and the edges (PG:E) model
>>> all data dependencies between every partitions pair. In general,
>>> data dependence is either compilation time known or unknown. In C
>>> family languages, there exists quite amount compilation time unknown
>>> dependencies because of possible alias relation of data references.
>>> We categorize PG's edge to two types: "true" edge that represents
>>> compilation time known data dependencies; "alias" edge for all other
>>> data dependencies.
>>> 6) Traverse subgraph of PG as if all "alias" edges don't exist. Merge
>>> partitions in each strong connected component (SCC) correspondingly.
>>> Build new PG for merged partitions.
>>> 7) Traverse PG again and this time with both "true" and "alias" edges
>>> included. We try to break SCCs by removing some edges. Because
>>> SCCs by "true" edges are all fused in step 6), we can break SCCs
>>> by removing some "alias" edges. It's NP-hard to choose optimal
>>> edge set, fortunately simple approximation is good enough for us
>>> given the small problem scale.
>>> 8) Collect all data dependencies of the removed "alias" edges. Create
>>> runtime alias checks for collected data dependencies.
>>> 9) Version loop under the condition of runtime alias checks. Given
>>> loop distribution generally introduces additional overhead, it is
>>> only useful if vectorization is achieved in distributed loop. We
>>> version loop with internal function call IFN_LOOP_DIST_ALIAS. If
>>> no distributed loop can be vectorized, we simply remove distributed
>>> loops and recover to the original one.
>>>
>>> Also, there are some more to improve in the future (which isn't difficult I think):
>>> TODO:
>>> 1) We only distribute innermost loops now. This pass should handle loop
>>> nests in the future.
>>> 2) We only fuse partitions in SCC now. A better fusion algorithm is
>>> desired to minimize loop overhead, maximize parallelism and maximize
>>>
>>> Bootstrap and test on x86_64 and AArch64. Is it OK?
>>>
>> Trivial updated due to changes in previous patches. Also fixed issues
>> mentioned by Kugan.
> Rebased V3 for changes in previous patches. Bootstap and test on
> x86_64 and aarch64.
why is ldist-12.c no longer distributed? your comment says it doesn't expose
more "parallelism" but the point is to reduce memory bandwith requirements
which it clearly does.
Likewise for -13.c, -14.c. -4.c may be a questionable case but the wording
of "parallelism" still confuses me.
Can you elaborate on that. Now onto the patch:
+ Loop distribution is the dual of loop fusion. It separates statements
+ of a loop (or loop nest) into multiple loops (or loop nests) with the
+ same loop header. The major goal is to separate statements which may
+ be vectorized from those that can't. This pass implements distribution
+ in the following steps:
misses the goal of being a memory stream optimization, not only a vectorization
enabler. distributing a loop can also reduce register pressure.
You introduce ldist_alias_id in struct loop (probably in 01/n which I
didn't look
into yet). If you don't use that please introduce it separately.
+ /* Be conservative. If data references are not well analyzed,
+ or the two data references have the same base address and
+ offset, add dependence and consider it alias to each other.
+ In other words, the dependence can not be resolved by
+ runtime alias check. */
+ if (!DR_BASE_ADDRESS (dr1) || !DR_BASE_ADDRESS (dr2)
+ || !DR_OFFSET (dr1) || !DR_OFFSET (dr2)
+ || !DR_INIT (dr1) || !DR_INIT (dr2)
+ || !DR_STEP (dr1) || !tree_fits_uhwi_p (DR_STEP (dr1))
+ || !DR_STEP (dr2) || !tree_fits_uhwi_p (DR_STEP (dr2))
+ || res == 0)
ISTR a helper that computes whether we can handle a runtime alias check for
a specific case?
+ /* Depend on vectorizer to fold IFN_LOOP_DIST_ALIAS. */
+ if (flag_tree_loop_vectorize)
+ {
so at this point I'd condition the whole runtime-alias check generating
on flag_tree_loop_vectorize. You seem to support versioning w/o
that here but in other places disable versioning w/o flag_tree_loop_vectorize.
That looks somewhat inconsistent...
+ /* Don't version loop if any partition is builtin. */
+ for (i = 0; partitions->iterate (i, &partition); ++i)
+ {
+ if (partition->kind != PKIND_NORMAL)
+ break;
+ }
why's that? Do you handle the case where only a subset of partitions
require a runtime alias check to be generated? Thus from a loop
generate three loops, only two of them being versioned? The
complication would be that both the runtime alias and non-alias
versions would be "transformed". Or do we rely on recursively
distributing the distribution result, thus if we have partitions that
can be handled w/o runtime alias check fuse the remaining partitions
and recurse on those?
Otherwise the patch looks good to me.
Thanks,
Richard.
> Thanks,
> bin
>>
>> Thanks,
>> bin
>>
>> 2017-06-17 Bin Cheng <bin.cheng@arm.com>
>>
>> * tree-loop-distribution.c: Add general explanantion on the pass.
>> (pg_add_dependence_edges): New parameter. handle alias data
>> dependence specially and record it in the parameter if asked.
>> (struct pg_vdata, pg_edata, pg_edge_callback_data): New structs.
>> (init_partition_graph_vertices, add_partition_graph_edge): New.
>> (pg_skip_alias_edge, free_partition_graph_edata_cb): New.
>> (free_partition_graph_vdata, build_partition_graph): New.
>> (sort_partitions_by_post_order, merge_dep_scc_partitions): New.
>> (pg_collect_alias_ddrs, break_alias_scc_partitions): New.
>> (data_ref_segment_size, latch_dominated_by_data_ref): New.
>> (compute_alias_check_pairs, version_loop_by_alias_check): New.
>> (version_for_distribution_p, finalize_partitions): New.
>> (distribute_loop): Handle alias data dependence specially. Factor
>> out loop fusion code as functions and call these functions.
>>
>> gcc/testsuite/ChangeLog
>> 2017-06-17 Bin Cheng <bin.cheng@arm.com>
>>
>> * gcc.dg/tree-ssa/ldist-4.c: Adjust test string.
>> * gcc.dg/tree-ssa/ldist-12.c: Ditto.
>> * gcc.dg/tree-ssa/ldist-13.c: Ditto.
>> * gcc.dg/tree-ssa/ldist-14.c: Ditto.