[PATCH GCC]A simple implementation of loop interchange

Mon Sep 4 13:55:00 GMT 2017

On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> This patch implements a simple loop interchange pass in GCC, as described by its comments:
>>> +/* This pass performs loop interchange: for example, the loop nest
>>> +
>>> +   for (int j = 0; j < N; j++)
>>> +     for (int k = 0; k < N; k++)
>>> +       for (int i = 0; i < N; i++)
>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> +   is transformed to
>>> +
>>> +   for (int i = 0; i < N; i++)
>>> +     for (int j = 0; j < N; j++)
>>> +       for (int k = 0; k < N; k++)
>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> +   This pass implements loop interchange in the following steps:
>>> +
>>> +     1) Find perfect loop nest for each innermost loop and compute data
>>> +       dependence relations for it.  For above example, loop nest is
>>> +       <loop_j, loop_k, loop_i>.
>>> +     2) From innermost to outermost loop, this pass tries to interchange
>>> +       each loop pair.  For above case, it firstly tries to interchange
>>> +       <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>> +       Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>> +       <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>>> +       loop to the outermost position.  For loop pair <loop_i, loop_j>
>>> +       to be interchanged, we:
>>> +     3) Check if data dependence relations are valid for loop interchange.
>>> +     4) Check if both loops can be interchanged in terms of transformation.
>>> +     5) Check if interchanging the two loops is profitable.
>>> +     6) Interchange the two loops by mapping induction variables.
>>> +
>>> +   This pass also handles reductions in loop nest.  So far we only support
>>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>>
>>> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
>>> than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
>>> nest.  I put it just after loop distribution thus ideally loop split/distribution could
>>> create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
>>> for now because it only works for innermost loop.  For example, the motivation case in
>>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
>>> for (innermost) loop nest and with that patch bwaves case can be handled.
>>> Another point is I deliberately make both the cost model and code transformation (very)
>>> conservative.  We can support more cases, or more transformations with great care when
>>> it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
>>> don't want to introduce more.
>>> As for code generation, this patch has an issue that invariant code in outer loop could
>>> be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
>>> generated during interchange.  In the future, we may need to avoid that in interchange
>>> itself, or another lim pass just like the one after graphite optimizations.
>>>
>>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
>>> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
>>> programs with it enabled by default.  Any comments?
>>
> Thanks for quick review.
>> +/* The same as above, but this one is only used for interchanging not
>> +   innermost loops.  */
>> +#define OUTER_STRIDE_RATIO     (2)
>>
>> please make all these knobs --params.
>>
>> +/* Enum type for loop reduction variable.  */
>> +enum reduction_type
>> +{
>> +  UNKNOWN_RTYPE = 0,
>> +  SIMPLE_RTYPE,
>> +  DOUBLE_RTYPE
>> +};
>>
>> seeing this we should have some generic data structure / analysis for
>> reduction detection.  This adds a third user (autopar and vectorizer
>> are the others).  Just an idea.
>>
>> +/* Return true if E is abnormal edge.  */
>> +
>> +static inline bool
>> +abnormal_edge (edge e)
>> +{
>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>> +}
>>
>> bad name/comment for what it does.
>>
>> ... jumping to end of file / start of pass
>>
>> +      /* Get the outer loop.  */
>> +      loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>
>> loop_outer (loop)?
>>
>> +      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>> +        depends on outer loop's IV.  */
>> +      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
>> +       break;
>>
>> but you don't check for a three-nest whether niters depends on outer outer
>> loop's IV that way.  Either the check is superfluous here or incomplete.
> It is checked for multi-nest case in can_interchange_loops.  I will
> move the check to this function so that we can save compilation time.
>>
>> +  /* Check if start_loop forms a perfect loop nest.  */
>> +  loop_nest->create (3);
>> +  do {
>> +    datarefs->create (20);
>> +    ddrs->create (20);
>> +    loop_nest->truncate (0);
>> +    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
>> +                                          datarefs, ddrs))
>> +      {
>> +       unsigned i;
>> +       struct data_dependence_relation *ddr;
>> +
>> +       for (i = 0; ddrs->iterate (i, &ddr); ++i)
>> +         {
>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>> +             continue;
>> +           /* Give up on any unknown dependence.  */
>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
>> +               || DDR_NUM_DIR_VECTS (ddr) == 0)
>> +             break;
>> +         }
>> +
>> +       if (i == ddrs->length ())
>> +         return true;
>>
>> better open-code this so we dont' waste time computing all dependences
>> when we give up in the majority of cases (unknown dependence).
> Right.  Will make the change.
>> Memleak here (ddrs and datarefs).
>>
>> +    start_loop = start_loop->inner;
>> +  } while (start_loop && start_loop->inner);
>>
>> ick.  So this is cubic -- nest depth * #drs * #drs ... (exactly why I
>> never committed loop distribution for nests ;)).
> Hmm, loop distribution for (innermost) nest is necessary for this pass
> to handle bwaves unfortunately.  It is also necessary to distribute
> for (i;)
>   for (j)
>     arr[i * len + j] = 0;
> into a single memcall, rather than a loop of sub-memcall.

I know...  bwaves is what I implemented the change for to enable interchange.
I think right now store-motion makes the nest difficult to handle as it
has a reduction (but I see you handle reductions in interchange -- just this
one cannot be handled I think?).

Richard.

>>
>> I see that should_interchange_loops only uses datarefs.  This means
>> I'd rather do that as a very first step before considering validity
>> (and computing dependences).  That analysis (for all possible
>> interchanges) should be much cheaper?  I see it probably doesn't
> Yes it makes the code less straightforward.  I think we can
> additionally call should_interchange_loops before computing
> dependencies.  This will bypass for most cases, for example, ~3500
> loops can be interchanged without cost model, while only  ~200 loops
> actually pass the cost model.
>
>> fit with the iteration you do very well...  can't we somehow compute
>> a loop permutation and apply it in a single step rather than
>> piecewise with update_data_refs/deps?
> Arbitrary loop permutation would be non-trivial.  If you meant
> computing loop shit (outermosting here) and transforming the code once
> for all, I think it's doable.  Cost model check can be easily extended
> in that way, though transformation part needs quite lot rework.  One
> of my concern is, other than matrix-mul, I don't have motivation case
> for multi-nest interchange so far.
>>
>> valid_data_dependences has almost no comments, it would be nice
>> to add some (overall) one(s).
> Sorry, I just realized it's wrong because direct vector is misused for
> distance vector...  Surprised it didn't bite with thousands of
> interchanged loops in spec...  Will fix it.
>
>>
>> Thanks,
>> Richard.
>>
>>
>>> Thanks,
>>> bin
>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>>         * common.opt (ftree-loop-interchange): New option.
>>>         * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>>         * passes.def (pass_linterchange): New pass.
>>>         * timevar.def (TV_LINTERCHANGE): New time var.
>>>         * tree-pass.h (make_pass_linterchange): New declaration.
>>>         * tree-ssa-loop-interchange.cc: New file.
>>>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>>
>>> gcc/testsuite
>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.