[PATCH GCC]A simple implementation of loop interchange

Bin.Cheng amker.cheng@gmail.com
Mon Sep 4 14:48:00 GMT 2017


On Mon, Sep 4, 2017 at 2:54 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> This patch implements a simple loop interchange pass in GCC, as described by its comments:
>>>> +/* This pass performs loop interchange: for example, the loop nest
>>>> +
>>>> +   for (int j = 0; j < N; j++)
>>>> +     for (int k = 0; k < N; k++)
>>>> +       for (int i = 0; i < N; i++)
>>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   is transformed to
>>>> +
>>>> +   for (int i = 0; i < N; i++)
>>>> +     for (int j = 0; j < N; j++)
>>>> +       for (int k = 0; k < N; k++)
>>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   This pass implements loop interchange in the following steps:
>>>> +
>>>> +     1) Find perfect loop nest for each innermost loop and compute data
>>>> +       dependence relations for it.  For above example, loop nest is
>>>> +       <loop_j, loop_k, loop_i>.
>>>> +     2) From innermost to outermost loop, this pass tries to interchange
>>>> +       each loop pair.  For above case, it firstly tries to interchange
>>>> +       <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>>> +       Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>>> +       <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>>>> +       loop to the outermost position.  For loop pair <loop_i, loop_j>
>>>> +       to be interchanged, we:
>>>> +     3) Check if data dependence relations are valid for loop interchange.
>>>> +     4) Check if both loops can be interchanged in terms of transformation.
>>>> +     5) Check if interchanging the two loops is profitable.
>>>> +     6) Interchange the two loops by mapping induction variables.
>>>> +
>>>> +   This pass also handles reductions in loop nest.  So far we only support
>>>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>>>
>>>> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
>>>> than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
>>>> nest.  I put it just after loop distribution thus ideally loop split/distribution could
>>>> create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
>>>> for now because it only works for innermost loop.  For example, the motivation case in
>>>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
>>>> for (innermost) loop nest and with that patch bwaves case can be handled.
>>>> Another point is I deliberately make both the cost model and code transformation (very)
>>>> conservative.  We can support more cases, or more transformations with great care when
>>>> it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
>>>> don't want to introduce more.
>>>> As for code generation, this patch has an issue that invariant code in outer loop could
>>>> be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
>>>> generated during interchange.  In the future, we may need to avoid that in interchange
>>>> itself, or another lim pass just like the one after graphite optimizations.
>>>>
>>>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
>>>> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
>>>> programs with it enabled by default.  Any comments?
>>>
>> Thanks for quick review.
>>> +/* The same as above, but this one is only used for interchanging not
>>> +   innermost loops.  */
>>> +#define OUTER_STRIDE_RATIO     (2)
>>>
>>> please make all these knobs --params.
>>>
>>> +/* Enum type for loop reduction variable.  */
>>> +enum reduction_type
>>> +{
>>> +  UNKNOWN_RTYPE = 0,
>>> +  SIMPLE_RTYPE,
>>> +  DOUBLE_RTYPE
>>> +};
>>>
>>> seeing this we should have some generic data structure / analysis for
>>> reduction detection.  This adds a third user (autopar and vectorizer
>>> are the others).  Just an idea.
>>>
>>> +/* Return true if E is abnormal edge.  */
>>> +
>>> +static inline bool
>>> +abnormal_edge (edge e)
>>> +{
>>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>>> +}
>>>
>>> bad name/comment for what it does.
>>>
>>> ... jumping to end of file / start of pass
>>>
>>> +      /* Get the outer loop.  */
>>> +      loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>>
>>> loop_outer (loop)?
>>>
>>> +      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>>> +        depends on outer loop's IV.  */
>>> +      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
>>> +       break;
>>>
>>> but you don't check for a three-nest whether niters depends on outer outer
>>> loop's IV that way.  Either the check is superfluous here or incomplete.
>> It is checked for multi-nest case in can_interchange_loops.  I will
>> move the check to this function so that we can save compilation time.
>>>
>>> +  /* Check if start_loop forms a perfect loop nest.  */
>>> +  loop_nest->create (3);
>>> +  do {
>>> +    datarefs->create (20);
>>> +    ddrs->create (20);
>>> +    loop_nest->truncate (0);
>>> +    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
>>> +                                          datarefs, ddrs))
>>> +      {
>>> +       unsigned i;
>>> +       struct data_dependence_relation *ddr;
>>> +
>>> +       for (i = 0; ddrs->iterate (i, &ddr); ++i)
>>> +         {
>>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>> +             continue;
>>> +           /* Give up on any unknown dependence.  */
>>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
>>> +               || DDR_NUM_DIR_VECTS (ddr) == 0)
>>> +             break;
>>> +         }
>>> +
>>> +       if (i == ddrs->length ())
>>> +         return true;
>>>
>>> better open-code this so we dont' waste time computing all dependences
>>> when we give up in the majority of cases (unknown dependence).
>> Right.  Will make the change.
>>> Memleak here (ddrs and datarefs).
>>>
>>> +    start_loop = start_loop->inner;
>>> +  } while (start_loop && start_loop->inner);
>>>
>>> ick.  So this is cubic -- nest depth * #drs * #drs ... (exactly why I
>>> never committed loop distribution for nests ;)).
>> Hmm, loop distribution for (innermost) nest is necessary for this pass
>> to handle bwaves unfortunately.  It is also necessary to distribute
>> for (i;)
>>   for (j)
>>     arr[i * len + j] = 0;
>> into a single memcall, rather than a loop of sub-memcall.
>
> I know...  bwaves is what I implemented the change for to enable interchange.
> I think right now store-motion makes the nest difficult to handle as it
> has a reduction (but I see you handle reductions in interchange -- just this
> one cannot be handled I think?).
Yes, it's necessary to handle reduction in loop interchange because
store motion in GCC is powerful enough to change cases like bwaves and
matrix-mul into reduction.
As for bwaves, the transformed reduction is supported, it is the
initialization statement not handled by this patch.  Well, it actually
can but that's effectively doing distribution in interchange pass.
One of my idea is to only implement single-objective passes, so I'd
rather rely on distribution for that.

Thanks,
bin
>
> Richard.
>
>>>
>>> I see that should_interchange_loops only uses datarefs.  This means
>>> I'd rather do that as a very first step before considering validity
>>> (and computing dependences).  That analysis (for all possible
>>> interchanges) should be much cheaper?  I see it probably doesn't
>> Yes it makes the code less straightforward.  I think we can
>> additionally call should_interchange_loops before computing
>> dependencies.  This will bypass for most cases, for example, ~3500
>> loops can be interchanged without cost model, while only  ~200 loops
>> actually pass the cost model.
>>
>>> fit with the iteration you do very well...  can't we somehow compute
>>> a loop permutation and apply it in a single step rather than
>>> piecewise with update_data_refs/deps?
>> Arbitrary loop permutation would be non-trivial.  If you meant
>> computing loop shit (outermosting here) and transforming the code once
>> for all, I think it's doable.  Cost model check can be easily extended
>> in that way, though transformation part needs quite lot rework.  One
>> of my concern is, other than matrix-mul, I don't have motivation case
>> for multi-nest interchange so far.
>>>
>>> valid_data_dependences has almost no comments, it would be nice
>>> to add some (overall) one(s).
>> Sorry, I just realized it's wrong because direct vector is misused for
>> distance vector...  Surprised it didn't bite with thousands of
>> interchanged loops in spec...  Will fix it.
>>
>>>
>>> Thanks,
>>> Richard.
>>>
>>>
>>>> Thanks,
>>>> bin
>>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>>>         * common.opt (ftree-loop-interchange): New option.
>>>>         * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>>>         * passes.def (pass_linterchange): New pass.
>>>>         * timevar.def (TV_LINTERCHANGE): New time var.
>>>>         * tree-pass.h (make_pass_linterchange): New declaration.
>>>>         * tree-ssa-loop-interchange.cc: New file.
>>>>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>>>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>>>
>>>> gcc/testsuite
>>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.



More information about the Gcc-patches mailing list