[PATCH GCC]A simple implementation of loop interchange
Richard Biener
richard.guenther@gmail.com
Mon Sep 4 13:55:00 GMT 2017
On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> This patch implements a simple loop interchange pass in GCC, as described by its comments:
>>> +/* This pass performs loop interchange: for example, the loop nest
>>> +
>>> + for (int j = 0; j < N; j++)
>>> + for (int k = 0; k < N; k++)
>>> + for (int i = 0; i < N; i++)
>>> + c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> + is transformed to
>>> +
>>> + for (int i = 0; i < N; i++)
>>> + for (int j = 0; j < N; j++)
>>> + for (int k = 0; k < N; k++)
>>> + c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> + This pass implements loop interchange in the following steps:
>>> +
>>> + 1) Find perfect loop nest for each innermost loop and compute data
>>> + dependence relations for it. For above example, loop nest is
>>> + <loop_j, loop_k, loop_i>.
>>> + 2) From innermost to outermost loop, this pass tries to interchange
>>> + each loop pair. For above case, it firstly tries to interchange
>>> + <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>> + Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>> + <loop_i, loop_j, loop_k>. The overall effect is to move innermost
>>> + loop to the outermost position. For loop pair <loop_i, loop_j>
>>> + to be interchanged, we:
>>> + 3) Check if data dependence relations are valid for loop interchange.
>>> + 4) Check if both loops can be interchanged in terms of transformation.
>>> + 5) Check if interchanging the two loops is profitable.
>>> + 6) Interchange the two loops by mapping induction variables.
>>> +
>>> + This pass also handles reductions in loop nest. So far we only support
>>> + simple reduction of inner loop and double reduction of the loop nest. */
>>>
>>> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
>>> than permutation. Also as a traditional loop optimizer, it only works for perfect loop
>>> nest. I put it just after loop distribution thus ideally loop split/distribution could
>>> create perfect nest for it. Unfortunately, we don't get any perfect nest from distribution
>>> for now because it only works for innermost loop. For example, the motivation case in
>>> spec2k6/bwaves is not handled on this pass alone. I have a patch extending distribution
>>> for (innermost) loop nest and with that patch bwaves case can be handled.
>>> Another point is I deliberately make both the cost model and code transformation (very)
>>> conservative. We can support more cases, or more transformations with great care when
>>> it is for sure known beneficial. IMHO, we already hit over-baked issues quite often and
>>> don't want to introduce more.
>>> As for code generation, this patch has an issue that invariant code in outer loop could
>>> be moved to inner loop. For the moment, we rely on the last lim pass to handle all INV
>>> generated during interchange. In the future, we may need to avoid that in interchange
>>> itself, or another lim pass just like the one after graphite optimizations.
>>>
>>> Boostrap and test on x86_64 and AArch64. Various benchmarks built and run successfully.
>>> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
>>> programs with it enabled by default. Any comments?
>>
> Thanks for quick review.
>> +/* The same as above, but this one is only used for interchanging not
>> + innermost loops. */
>> +#define OUTER_STRIDE_RATIO (2)
>>
>> please make all these knobs --params.
>>
>> +/* Enum type for loop reduction variable. */
>> +enum reduction_type
>> +{
>> + UNKNOWN_RTYPE = 0,
>> + SIMPLE_RTYPE,
>> + DOUBLE_RTYPE
>> +};
>>
>> seeing this we should have some generic data structure / analysis for
>> reduction detection. This adds a third user (autopar and vectorizer
>> are the others). Just an idea.
>>
>> +/* Return true if E is abnormal edge. */
>> +
>> +static inline bool
>> +abnormal_edge (edge e)
>> +{
>> + return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>> +}
>>
>> bad name/comment for what it does.
>>
>> ... jumping to end of file / start of pass
>>
>> + /* Get the outer loop. */
>> + loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>
>> loop_outer (loop)?
>>
>> + /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>> + depends on outer loop's IV. */
>> + if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
>> + break;
>>
>> but you don't check for a three-nest whether niters depends on outer outer
>> loop's IV that way. Either the check is superfluous here or incomplete.
> It is checked for multi-nest case in can_interchange_loops. I will
> move the check to this function so that we can save compilation time.
>>
>> + /* Check if start_loop forms a perfect loop nest. */
>> + loop_nest->create (3);
>> + do {
>> + datarefs->create (20);
>> + ddrs->create (20);
>> + loop_nest->truncate (0);
>> + if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
>> + datarefs, ddrs))
>> + {
>> + unsigned i;
>> + struct data_dependence_relation *ddr;
>> +
>> + for (i = 0; ddrs->iterate (i, &ddr); ++i)
>> + {
>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>> + continue;
>> + /* Give up on any unknown dependence. */
>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
>> + || DDR_NUM_DIR_VECTS (ddr) == 0)
>> + break;
>> + }
>> +
>> + if (i == ddrs->length ())
>> + return true;
>>
>> better open-code this so we dont' waste time computing all dependences
>> when we give up in the majority of cases (unknown dependence).
> Right. Will make the change.
>> Memleak here (ddrs and datarefs).
>>
>> + start_loop = start_loop->inner;
>> + } while (start_loop && start_loop->inner);
>>
>> ick. So this is cubic -- nest depth * #drs * #drs ... (exactly why I
>> never committed loop distribution for nests ;)).
> Hmm, loop distribution for (innermost) nest is necessary for this pass
> to handle bwaves unfortunately. It is also necessary to distribute
> for (i;)
> for (j)
> arr[i * len + j] = 0;
> into a single memcall, rather than a loop of sub-memcall.
I know... bwaves is what I implemented the change for to enable interchange.
I think right now store-motion makes the nest difficult to handle as it
has a reduction (but I see you handle reductions in interchange -- just this
one cannot be handled I think?).
Richard.
>>
>> I see that should_interchange_loops only uses datarefs. This means
>> I'd rather do that as a very first step before considering validity
>> (and computing dependences). That analysis (for all possible
>> interchanges) should be much cheaper? I see it probably doesn't
> Yes it makes the code less straightforward. I think we can
> additionally call should_interchange_loops before computing
> dependencies. This will bypass for most cases, for example, ~3500
> loops can be interchanged without cost model, while only ~200 loops
> actually pass the cost model.
>
>> fit with the iteration you do very well... can't we somehow compute
>> a loop permutation and apply it in a single step rather than
>> piecewise with update_data_refs/deps?
> Arbitrary loop permutation would be non-trivial. If you meant
> computing loop shit (outermosting here) and transforming the code once
> for all, I think it's doable. Cost model check can be easily extended
> in that way, though transformation part needs quite lot rework. One
> of my concern is, other than matrix-mul, I don't have motivation case
> for multi-nest interchange so far.
>>
>> valid_data_dependences has almost no comments, it would be nice
>> to add some (overall) one(s).
> Sorry, I just realized it's wrong because direct vector is misused for
> distance vector... Surprised it didn't bite with thousands of
> interchanged loops in spec... Will fix it.
>
>>
>> Thanks,
>> Richard.
>>
>>
>>> Thanks,
>>> bin
>>> 2017-08-29 Bin Cheng <bin.cheng@arm.com>
>>>
>>> * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>> * common.opt (ftree-loop-interchange): New option.
>>> * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>> * passes.def (pass_linterchange): New pass.
>>> * timevar.def (TV_LINTERCHANGE): New time var.
>>> * tree-pass.h (make_pass_linterchange): New declaration.
>>> * tree-ssa-loop-interchange.cc: New file.
>>> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>>
>>> gcc/testsuite
>>> 2017-08-29 Bin Cheng <bin.cheng@arm.com>
>>>
>>> * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
More information about the Gcc-patches
mailing list