[PATCH GCC][V2]A simple implementation of loop interchange

Tue Nov 28 15:55:00 GMT 2017

Hi,
This is updated patch with review comments resolved.  Some explanation embedded below.

On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guenther@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <matz@suse.de> wrote:
>>> Hello,
>>>
>>> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>>>
>>>> This is updated patch for loop interchange with review suggestions
>>>> resolved.  Changes are:
>>>>   1) It does more light weight checks like rectangle loop nest check
>>>> earlier than before.
>>>>   2) It checks profitability of interchange before data dependence computation.
>>>>   3) It calls find_data_references_in_loop only once for a loop nest now.
>>>>   4) Data dependence is open-computed so that we can skip instantly at
>>>> unknown dependence.
>>>>   5) It improves code generation in mapping induction variables for
>>>> loop nest, as well as
>>>>      adding a simple dead code elimination pass.
>>>>   6) It changes magic constants into parameters.
>>>
>>> So I have a couple comments/questions.  Something stylistic:
>> Hi Michael,
>> Thanks for reviewing.
>>
>>>
>>>> +class loop_cand
>>>> +{
>>>> +public:
>>>> ...
>>>> +  friend class tree_loop_interchange;
>>>> +private:
>>>
>>> Just make this all public (and hence a struct, not class).
>>> No need for friends in file local classes.
>> Done.
>>
>>>
>>>> +single_use_in_loop (tree var, struct loop *loop)
>>>> ...
>>>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>>>> +    {
>>>> +      stmt = USE_STMT (use_p);
>>>> ...
>>>> +      basic_block bb = gimple_bb (stmt);
>>>> +      gcc_assert (bb != NULL);
>>>
>>> This pattern reoccurs often in your patch: you check for a bb associated
>>> for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need
>>> for checking.
>> Done.
>>
>>>
>>> Then, something about your handling of simple reductions:
>>>
>>>> +void
>>>> +loop_cand::classify_simple_reduction (reduction_p re)
>>>> +{
>>>> ...
>>>> +  /* Require memory references in producer and consumer are the same so
>>>> +     that we can undo reduction during interchange.  */
>>>> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>>>> +    return;
>>>
>>> Where is it checked that the undoing transformation is legal also
>>> from a data dep point of view?  Think code like this:
>>>
>>>    sum = X[i];
>>>    for (j ...)
>>>      sum += X[j];
>>>    X[i] = sum;
>>>
>>> Moving the store into the inner loop isn't always correct and I don't seem
>>> to find where the above situation is rejected.
>> Yeah.  for the old patch, it's possible to have such loop wrongly interchanged;
>> in practice, it's hard to create an example.  The pass will give up
>> when computing
>> data dep between references in inner/outer loops.  In this updated
>> patch, it's fixed
>> by giving up if there is any dependence between references of inner/outer loops.
>>
>>>
>>> Maybe I'm confused because I also don't see where you even can get into
>>> the above situation (though I do see testcases about this).  The thing is,
>>> for an 2d loop nest to contain something like the above reduction it can't
>>> be perfect:
>>>
>>>    for (j) {
>>>      int sum = X[j];      // 1
>>>      for (i)
>>>        sum += Y[j][i];
>>>      X[j] = sum;          // 2
>>>    }
>>>
>>> But you do check for perfectness in proper_loop_form_for_interchange and
>>> prepare_perfect_loop_nest, so either you can't get into the situation or
>>> the checking can't be complete, or you define the above to be perfect
>>> nevertheless (probably because the load and store are in outer loop
>>> header/exit blocks?).  The latter would mean that you accept also other
>>> code in header/footer of loops from a pure CFG perspective, so where is it
>>> checked that that other code (which aren't simple reductions) isn't
>>> harmful to the transformation?
>> Yes, I used the name perfect loop nest, but the pass can handle special form
>> imperfect loop nest for the simple reduction.  I added comments describing
>> this before function prepare_perfect_loop_nest.
>>
>>>
>>> Then, the data dependence part of the new pass:
>>>
>>>> +bool
>>>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
>>>> +{
>>>> +  struct data_dependence_relation *ddr;
>>>> +
>>>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>>> +    {
>>>> +      /* Skip no-dependence case.  */
>>>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>>> +     continue;
>>>> +
>>>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>>> +     {
>>>> +       lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>>>> +       unsigned level = dependence_level (dist_vect, loop_nest.length ());
>>>> +
>>>> +       /* If there is no carried dependence.  */
>>>> +       if (level == 0)
>>>> +         continue;
>>>> +
>>>> +       level --;
>>>> +       /* Skip case which has '>' as the leftmost direction.  */
>>>> +       if (!lambda_vector_lexico_pos (dist_vect, level))
>>>> +         return false;
>>>
>>> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.
>> Done.
>>
>>>
>>>> +       /* If dependence is carried by outer loop of the two loops for
>>>> +          interchange.  */
>>>> +       if (level < outer)
>>>> +         continue;
>>>> +
>>>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>>> +       /* If directions at both inner/outer levels are the same.  */
>>>> +       if (dir_vect[inner] == dir_vect[outer])
>>>> +         continue;
>>>> +
>>>> +       /* Be conservative, skip case if either direction at inner/outer
>>>> +          levels is not '=' or '<'.  */
>>>> +       if (dir_vect[inner] != dir_equal
>>>> +           && dir_vect[inner] != dir_positive
>>>> +           && dir_vect[inner] != dir_independent
>>>> +           && dir_vect[inner] != dir_positive_or_equal)
>>>> +         return false;
>>>> +
>>>> +       if (dir_vect[outer] != dir_equal
>>>> +           && dir_vect[outer] != dir_positive
>>>> +           && dir_vect[outer] != dir_independent
>>>> +           && dir_vect[outer] != dir_positive_or_equal)
>>>> +         return false;
>>>
>>> Checking dir vectors doesn't make much sense in GCC: the elements are only
>>> ever set to dir_positive, dir_negative or dir_equal, exactly when distance
>>> is
>>>  > 0, < 0 or == 0.  So checking dist vector is enough. (though sameness of
>>> direction checks sameness of sign with zero).  Incidentally:
>> Done.
>>
>>>
>>>> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
>>>> +{
>>>> +  struct data_dependence_relation *ddr;
>>>> +
>>>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>>> +    {
>>>> +      /* Skip no-dependence case.  */
>>>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>>> +     continue;
>>>> +
>>>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>>> +     {
>>>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>>> +       std::swap (dir_vect[inner], dir_vect[outer]);
>>>> +     }
>>>> +    }
>>>> +}
>>>
>>> Here you swap only the direction but not the distance vector, which can't
>>> be right.  I suggest only using (and updating) the distance vector.
>> Yeah, fixed.
>>
>>>
>>> And then your usage and update of DR_ACCESS_FNs: there's quite some
>>> complexity connected with that and I'm not sure how worthwhile it is.
>>> You're basically using the ACCESS_FNs to determine profitability (and not
>>> for validity, and that's good).  But e.g. for pointer based accesses like
>>> in fortran with explicit address arithmetic the relation between access-fn
>>> step and stride and actual access stride isn't that easy (e.g. in your
>>> should_interchange_loops function iloop_stride and oloop_stride will
>>> always be one for pointer based accesses).
>>>
>>> Conceptually what you should check is how the access address for each data
>>> ref revolves for each loop, so why not doing this explicitely?  What I
>>> mean is: calculate a (complicated) chrec for the DR addresses for the
>>> whole nest at the beginning.  It should be in the form like (assume "+"
>>> always):
>>>
>>>   {{{init, s1}_l1, s2}_l2, s3}_l3
>>>
>>> (i.e. all steps should be invariants/constants, and only one non-chrec
>>> init value).  Addresses which aren't in this form you're already ignoring
>>> right now, so you could continue doing that.  (Or better said, all
>>> non-constant steps you regard as being AVG_DIM_SIZE, which you still can
>>> continue doing).
>>>
>>> Now, with the above form you can form expressions for the difference
>>> between addresses per iteration for each loop (i.e. the address stride per
>>> loop); store these.  Then, when interchanging loops you need to merely
>>> swap these expressions like you have to with the distance vector, instead
>>> of fiddling inside the DR_ACCESS_FNs themself.  Much code would go away.
>> Yeah.  Did similar thing in loop nest distribution pass.  See
>> compute_access_range
>> in tree-loop-distribution.c.  Actually, I would do the same here if I
>> had implemented
>> this pass after loop nest distribution patches.  Done in this updated patch.
>>
>>>
>>> Testcases: given that we had to remove our old separate interchange pass
>>> because it miscompiled stuff all over I'm missing some testcases where
>>> interchange should _not_ happen for validity reasons, like my above
>>> example with an reduction that can't be moved inside.  Perhaps you can
>>> think of some more.
>> As mentioned above, it's hard to create test that fail exactly for this reason.
>> I added one that data dependence prevents us from interchanging the loop.
>>
>>>
>>> I hope this is of some help to you :)
>> Thanks again, it's very helpful.
>>
>> I also fixed several bugs of previous implementation, mostly about debug info
>> statements and simple reductions.  As for test, I enabled this pass by default,
>> bootstrap and regtest GCC, I also build/run specs.  There must be some other
>> latent bugs in it, but guess we have to exercise it by enabling it at
>> some point.
>>
>> So any comments?
>
>  bool
> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg)
>  {
>
> that you need this suggests you do stmt removal in wrong order (you need to
> do reverse dom order).
As below code in handling debug uses, this updated patch gives up on more cases
by scrapping debug uses now.  Hopefully this isn't a problem, debugging experience
for interchange loops is bad already?
>
> +/* Maximum number of statements in loop nest for loop interchange.  */
> +
> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
> +         "loop-interchange-max-num-stmts",
> +         "The maximum number of stmts in loop nest for loop interchange.",
> +         64, 0, 0)
>
> is that to limit dependence computation?  In this case you should probably
> limit the number of data references instead?
Hmm, I kept this one and it is to limit the size of loops for interchange.

>
> +ftree-loop-interchange
> +Common Report Var(flag_tree_loop_interchange) Optimization
> +Enable loop interchange on trees.
> +
>
> please re-use -floop-interchange instead and change the GRAPHITE tests
> to use -floop-nest-optimize.  You can do that as pre-approved thing now.
Done.  I will send an independent patch adjusting GRAPHITE tests.

>
> Please enable the pass by default at O3 via opts.c.
I will do it in a separated patch because many vectorization tests are vulnerable 
to interchange.  I checked these tests, interchange is good, we need to disable
explicitly.

>
> diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
>
> gimple-loop-interchange.cc please.
>
> new file mode 100644
> index 0000000..abffbf6
> --- /dev/null
> +++ b/gcc/tree-ssa-loop-interchange.cc
> @@ -0,0 +1,2274 @@
> +/* Loop invariant motion.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
>
> Loop invariant motion? ... ;)
>
> Please add a "Contributed by ..." to have an easy way to figure people to blame.
>
> +}*induction_p;
> +
>
> space after '*'
>
> +}*reduction_p;
> +
>
> likewise.
All done.

>
> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
> +   ssa var appearing in any abnormal phi node.  */
> +
> +static inline bool
> +unsupported_phi_node (gphi *phi)
> +{
> +  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
> +    return true;
> +
> +  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> +    {
> +      tree arg = PHI_ARG_DEF (phi, i);
> +      if (TREE_CODE (arg) == SSA_NAME
> +         && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
> +       return true;
> +    }
> +
> +  return false;
>
> I believe the above isn't necessary given you rule out abnormal edges
> into the loop.
> Did you have a testcase that broke?  A minor thing I guess if it is
> just for extra
> safety...
Extra safety I guess.  I now remove this and haven't run into compilation issues so far.

>
> +/* Return true if all stmts in BB can be supported by loop interchange,
> +   otherwise return false.  ILOOP is not NULL if this loop_cand is the
> +   outer loop in loop nest.  */
> +
> +bool
> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
> +{
>
> docs and return value suggest this be named supported_operation
Done.

>
> +      /* Or it's invariant memory reference and only used by inner loop.  */
> +      if (gimple_assign_single_p (stmt)
> +         && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> +         && TREE_CODE (lhs) == SSA_NAME
> +         && single_use_in_loop (lhs, iloop->loop))
> +       continue;
>
> comment suggests multiple uses in loop would be ok?
Comment changed.

>
> +      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
> +         || lhs != re->init)
> +       return;
> +
> +      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
> +         || !REFERENCE_CLASS_P (rhs))
> +       return;
>
> lhs and rhs are never NULL.  Please initialize them outside of the if.
> You want to disallow DECL_P rhs here?
Done.

>
> Can you add an overall function comment what this function does?  It seems
> to detect a reduction as produced by loop store-motion?  Thus it tries to
> get at enough information to perform the reverse transform?
Yes.  Comment added.

>
> During review I have a hard time distinguishing between locals and members
> so can you please prefix all member variables with m_ as according to our
> code guidelines?  I guess what adds to the confusion is the loop_cand argument
> that sometimes is present for loop_cand member functions...
> (I personally prefer to prefix all member accesses with this-> but that's harder
> to enforce)
Done by using m_* stuff.

>
> +static void
> +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer)
> +{
>
> this extracts stmts starting from consumer but rules out other
> consumers (recursively).
> Is that intended?  I wonder if you can avoid this dance...
Okay, so in case the reduction is initialized from constant value, we need to generate new
load memory reference when undoing it.  The new memory reference may depends on idx
variables like in ARRAY_REF.  This function tries to find all depended stmts for inserting it
I didn't look into how GRAPHITE tracks the idx variable and regenerate it when necessary,
maybe we can do the same in the future.

>
> +      /* Transform simple reduction of below form:
> +
> +          init = 0;
> +          loop:
> +            var = phi<init, next>
> +            next = var op ...
> +          reduc_sum = phi<next>
> +          MEM_REF[...] = reduc_sum
> +
> +        into:
> +
>
> which one is 'consumer'?  I wonder if you can simply leave all the
> dead code in place
> just emitting the load-update-store stmts and removing the
> MEM_REF[...] = reduc_sum
> above?
Done.  I simplified the code generation for the pass and uses prerequisite simple dce
interface to remove dead code.  Hope it's clearer now.

>
> +/* Eliminate dead code after loop interchange.  */
> +
> +void
> +loop_cand::eliminate_dead_code (void)
> +{
>
> PRE tracks "possible" dead defs and has a worklist algorithm in
> remove_dead_inserted_code.
> I wonder if you can do sth similar?  That is, I wonder if doing a
> sweep from last to first
> stmt wouldn't be better here?
>
> +         /* Given copy propagation is done during interchange, we can
> +            simply check zero uses of var and eliminate it.  */
> +         if (is_gimple_assign (stmt)
> +             && !gimple_vuse (stmt)
>
> you probably meant to check gimple_vdef?
>
> +             && !gimple_has_volatile_ops (stmt)
> +             && !gimple_has_side_effects (stmt)
>
> the former is redundant
>
> +             && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> +             && TREE_CODE (lhs) == SSA_NAME
> +             && has_zero_uses (lhs))
>
> if you use gimple_get_lhs () you can also handle calls.
>
> That said, this seems to be a very poor DCE, why is it necessary at all?
Local DCE removed by reusing new interface simple_dce_from_worklist.
But the DCE is necessary giving dead code lasts to vectorizer.  At least we
can save compilation time.

>
> +/* Interchange niters info of ILOOP and OLOOP while reset any other niters
> +   estimates information for now.  */
> +
> +static inline void
> +interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
> +{
> +  tree nb_iterations = oloop->nb_iterations;
> +
> +  oloop->any_upper_bound = false;
> +  oloop->any_likely_upper_bound = false;
> +  free_numbers_of_iterations_estimates (oloop);
> +
> +  oloop->nb_iterations = iloop->nb_iterations;
> +
> +  iloop->any_upper_bound = false;
> +  iloop->any_likely_upper_bound = false;
> +  free_numbers_of_iterations_estimates (iloop);
> +
> +  iloop->nb_iterations = nb_iterations;
>
> use std::swap?  Also I think if you can keep nb_iterations you
> can certainly keep the upper bounds.  You're probably
> afraid of the ->stmt references in the nb_iter_bound entries?
>
> Anyway, either scrap everything or try to keep everything.
Yeah, not only the stmts, but also the control_iv information because the SCEV
information may be corrupted during code transformation.
Now I discarded all the information.

>
> +  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
> +    {
> +      if (re->type != DOUBLE_RTYPE)
> +       gcc_unreachable ();
> +
> +      use_operand_p use_p;
> +      imm_use_iterator iterator;
> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
> +      if (TREE_CODE (re->init) == SSA_NAME)
> +       {
> +         FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
> +           mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
> +       }
>
> can you add a comment what you are doing here?
>
> Note that other loop opts simply scrap all debug stmts ...
As mentioned above, updated patch doesn't try hard to maintain debug use info any more.

>
> +static void
> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
> +                      data_reference_p dr)
> +{
> ...
> +  tree ref = DR_REF (dr);
> +  tree scev_base = build_fold_addr_expr (ref);
> +  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
> +  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
> +  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
> +
> +  do {
> +    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
> +    if (chrec_contains_undetermined (scev_fn)
> +       || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
> +      break;
> ...
> +    strides->safe_push (scev_step);
> +  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
> +
>
> I _think_ you want to do
>
>    scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
> DR_STMT (dr) is in loop
>    scev_fn = instantiate_parameters (nest, scev_fn);
>    if (chrec_contains_undetermined (scev_fn))
>      return; // false?
>
> and analyze the result which should be of the form
>
>   { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>
> if canonical.  I think estimate_val_by_simplify_replace isn't needed
> if you do that
> (it also looks odd to replace all vairables in step by niter...).
I replied on this in previous message, instantiate_parameters doesn't always
give canonical form result as expected.  The loop here could be seen as a
local instantiate process, right?  
Also estimate_val_by_simplify_replace is needed for pointers, where strides
are computed from niters of loops which could be non compilation time constant.
But yes, it's an odd fixup after I failed to do anything better.

>
> I think keeping the chrec in the above form is also more suitable for what
> the caller does so the outermost loop is simply
>
>   loop = loop_nest;
>   loop-over-all-dr-chrecs
>     if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
>       loop = CHREC_LOOP (chrec);
>
> given the outermost loop is the outer evolution.  If you sort the
> stride vecs from inner
> to outer you don't need prune_access_strides_not_in_loop.
Hmm, So stripping outer loops prefer inner to outer sort of strides, but cost computation
during interchange prefers outer to inner sort because loop_nest in tree-data-ref is sorted
in this way.  Seems a single prune_* function is better than fiddling with cost computation.

>
> +/* Count and return the number of loops in LOOP_NEST.  */
> +
> +unsigned int
> +num_loops_in_loop_nest (struct loop *loop_nest)
> +{
> +  unsigned num_loops;
> +  for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
> +    ;
> +  return num_loops;
>
> loop_depth (innermost) - loop_depth (nest)?
Done.

>
> +static bool
> +should_interchange_loops (unsigned i_idx, unsigned o_idx,
> +                         vec<data_reference_p> datarefs,
> +                         bool innermost_loops_p, bool dump_info_p = true)
> +{
>
> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs
> and figure a permutation sequence to arrive there?  That is for the local
> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
> when we should interchange?
>
> That subloop_stride_p and tracking invariant DRs looks a bit odd.  For loops
> where a DR is invariant you simply do not have an evolution in that loop.
> You seem to simply add strides in the inner and outer loops for each DR,
> can you explain how that works?  Also I guess strides bigger than the
> various cache-line size parameters should be treated 'equal'?  That is,
> if we don't get any DR to a step that results in L1 cache hits because
> two DRs share a cache line the interchange is pointless?
So given below loop:

  for (int i = 0; i < M; i++)
    for (int j = 0; j < M; j++)
      for (int k = 0; k < M; k++)
        a[k][0][i] = b[k][0][i]

We check if memory reference is invariant wrto a loop only if it has zero strides within
current loop nest.  In this example, there is no invariant given address changes in the
innermost loop.
For strides bigger than cache-line size, it's also possible the interchange is wanted, as
in below example:

  for (int i = 0; i < M; i++)          //loop 1
    for (int j = 0; j < M; j++)        //loop 2
      for (int k = 0; k < M; k++)    //loop 3
        a[j][i][k] = b[j][i][k]

Strides for loop 1/2 are very like to be big, but after interchange, we will have stream
access of both arrays.

More advanced heuristics may be possible, but so far the estimation is quite good by
checking all interchanges I looked into.

>
> +/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
> +
> +static inline void
> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
> +{
> +  struct data_reference *dr;
> +
> +  for (unsigned i = 0; datarefs.iterate (i, &dr);)
> +    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
> +      i++;
> +    else
> +      {
> +       datarefs.ordered_remove (i);
>
> that's expensive.  It's better to keep moving DRs we want to keep
> when walking the array.  That is, add a j you increment only when
> we keep a DR, moving *i to *j.
Done.

>
> +       if (dr->aux)
> +         {
> +           DR_ACCESS_STRIDE (dr)->release ();
> +           free (dr->aux);
> +         }
> +       free_data_ref (dr);
> +      }
> +}
>
> +
> +  start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
> +
>
> Hmm.  If you instantiate the SCEV for the niters for each loop in the nest
> wrt the nest you can figure if it has any evolution in sth else than the
> loop (evolution_function_is_univariate_p).   That is, this is not a problem
> until you arrive at analyzing DR strides, right?  At which point you
> can check for the appropriate form?
Hmm, not really.  The niter relation may not appear in SCEV of reference addr.
For example, below loop:

  for (int i = 0; i < M; i++)          //loop 1
    for (int j = 0; j < M; j++)        //loop 2
      for (int k = 0; k < i; k++)    //loop 3
        a[k][0][i] = b[k][0][i]

There is no information in data reference about i/j loops.
Anyway, I refactored the code and put this check in proper_loop_form_for_interchange.
Simpler I think.

>
> +  if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
> +      /* Check if there is no data reference.  */
> +      || datarefs->length () == 0
> +      /* Check if there are too many of data references.  */
> +      || ((int) datarefs->length ()
> +         > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
> +      /* Check if there is any data reference in loop latch.  We can't handle
> +        loops which loop header and data references have different execution
> +        times.  */
> +      || dataref_niters_diff_to_loop_header (*datarefs)
>
> this suggests to do your own find_data_references_in_loop so you can early
> out.
I refactored the code a bit.  Now this check is in proper_loop_form_for_interchange,
but I do customized my own data references finder.  It's needed to strip outer loops
once a difficult reference is found.

>
> Overall the flow through the pass is a bit hard to follow given there are
> IMHO too many functions.
Yeah, I removed quite number of small functions and refactor the code a lot.  Hope this
version is more straightforward.  
>
> +unsigned int
> +pass_linterchange::execute (function *fun)
> +{
> +  if (number_of_loops (fun) <= 2)
> +    return 0;
> +
> +  bool changed_p = false;;
> +  struct loop *loop;
> +  vec<loop_p> loop_nest;
> +  vec<data_reference_p> datarefs;
> +  vec<ddr_p> ddrs;
> +
> +  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
> +    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
> +      {
> +       tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
> +       changed_p |= loop_interchange.interchange ();
> +      }
>
> you leak datarefs/ddrs?
It was release in destructor, but I refactored it anyway.  I will push the code to branch
gcc.gnu.org/svn/gcc/branches/gimple-linterchange.

Thanks again for the comment of you two.

Thanks,
bin
2017-11-27  Bin Cheng  <bin.cheng@arm.com>

	* Makefile.in (gimple-loop-interchange.o): New object file.
	* common.opt (floop-interchange): Reuse the option from graphite.
	* doc/invoke.texi (-floop-interchange): Ditto.  New document.
	* gimple-loop-interchange.cc: New file.
	* params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
	(PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
	* passes.def (pass_linterchange): New pass.
	* timevar.def (TV_LINTERCHANGE): New time var.
	* tree-pass.h (make_pass_linterchange): New declaration.
	* tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
	interchange.  Record IV before/after increment in new parameters.
	* tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.

gcc/testsuite
2017-11-27  Bin Cheng  <bin.cheng@arm.com>

	* gcc.dg/tree-ssa/loop-interchange-1.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-2.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-3.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-4.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-5.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-6.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-7.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-8.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-9.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-10.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-11.c: New test.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: linterchange-20171126.txt
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20171128/d6b97733/attachment.txt>