[PATCH GCC][V2] A simple implementation of loop interchange

Richard Biener richard.guenther@gmail.com
Thu Nov 30 14:34:00 GMT 2017


On Thu, Nov 30, 2017 at 2:01 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> This is the updated patch with review comments resolved.  Some explanations are embedded below.
>>
>> On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guenther@gmail.com> wrote:
>>> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <matz@suse.de> wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>>>>>
>>>>>> This is the updated patch for loop interchange with review suggestions
>>>>>> resolved.  Changes are:
>>>>>>   1) It does more lightweight checks, like the rectangular loop nest
>>>>>> check, earlier than before.
>>>>>>   2) It checks profitability of interchange before data dependence computation.
>>>>>>   3) It calls find_data_references_in_loop only once for a loop nest now.
>>>>>>   4) Data dependence is open-coded so that we can bail out immediately on
>>>>>> an unknown dependence.
>>>>>>   5) It improves code generation in mapping induction variables for the
>>>>>> loop nest, and adds a simple dead code elimination pass.
>>>>>>   6) It changes magic constants into parameters.
>>>>>
>>>>> So I have a couple of comments/questions.  Something stylistic:
>>>> Hi Michael,
>>>> Thanks for reviewing.
>>>>
>>>>>
>>>>>> +class loop_cand
>>>>>> +{
>>>>>> +public:
>>>>>> ...
>>>>>> +  friend class tree_loop_interchange;
>>>>>> +private:
>>>>>
>>>>> Just make this all public (and hence a struct, not class).
>>>>> No need for friends in file local classes.
>>>> Done.
>>>>
>>>>>
>>>>>> +single_use_in_loop (tree var, struct loop *loop)
>>>>>> ...
>>>>>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>>>>>> +    {
>>>>>> +      stmt = USE_STMT (use_p);
>>>>>> ...
>>>>>> +      basic_block bb = gimple_bb (stmt);
>>>>>> +      gcc_assert (bb != NULL);
>>>>>
>>>>> This pattern recurs often in your patch: you check for a bb associated
>>>>> with a USE_STMT.  Uses of SSA names always occur in basic blocks, no need
>>>>> for checking.
>>>> Done.
>>>>
>>>>>
>>>>> Then, something about your handling of simple reductions:
>>>>>
>>>>>> +void
>>>>>> +loop_cand::classify_simple_reduction (reduction_p re)
>>>>>> +{
>>>>>> ...
>>>>>> +  /* Require memory references in producer and consumer are the same so
>>>>>> +     that we can undo reduction during interchange.  */
>>>>>> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>>>>>> +    return;
>>>>>
>>>>> Where is it checked that the undoing transformation is legal also
>>>>> from a data dep point of view?  Think code like this:
>>>>>
>>>>>    sum = X[i];
>>>>>    for (j ...)
>>>>>      sum += X[j];
>>>>>    X[i] = sum;
>>>>>
>>>>> Moving the store into the inner loop isn't always correct and I don't seem
>>>>> to find where the above situation is rejected.
>>>> Yeah.  For the old patch, it's possible to have such a loop wrongly
>>>> interchanged; in practice, it's hard to create an example.  The pass will
>>>> give up when computing data dependence between references in the inner/outer
>>>> loops.  In this updated patch, it's fixed by giving up if there is any
>>>> dependence between references of the inner/outer loops.
>>>>
>>>>>
>>>>> Maybe I'm confused because I also don't see where you can even get into
>>>>> the above situation (though I do see testcases about this).  The thing is,
>>>>> for a 2d loop nest to contain something like the above reduction it can't
>>>>> be perfect:
>>>>>
>>>>>    for (j) {
>>>>>      int sum = X[j];      // 1
>>>>>      for (i)
>>>>>        sum += Y[j][i];
>>>>>      X[j] = sum;          // 2
>>>>>    }
>>>>>
>>>>> But you do check for perfectness in proper_loop_form_for_interchange and
>>>>> prepare_perfect_loop_nest, so either you can't get into the situation, or
>>>>> the checking can't be complete, or you define the above to be perfect
>>>>> nevertheless (probably because the load and store are in the outer loop
>>>>> header/exit blocks?).  The latter would mean that you also accept other
>>>>> code in the header/footer of loops from a pure CFG perspective, so where is
>>>>> it checked that such other code (which isn't a simple reduction) isn't
>>>>> harmful to the transformation?
>>>> Yes, I used the name perfect loop nest, but the pass can handle a special
>>>> form of imperfect loop nest for the simple reduction.  I added a comment
>>>> describing this before function prepare_perfect_loop_nest.
>>>>
>>>>>
>>>>> Then, the data dependence part of the new pass:
>>>>>
>>>>>> +bool
>>>>>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
>>>>>> +{
>>>>>> +  struct data_dependence_relation *ddr;
>>>>>> +
>>>>>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>>>>> +    {
>>>>>> +      /* Skip no-dependence case.  */
>>>>>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>>>>> +     continue;
>>>>>> +
>>>>>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>>>>> +     {
>>>>>> +       lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>>>>>> +       unsigned level = dependence_level (dist_vect, loop_nest.length ());
>>>>>> +
>>>>>> +       /* If there is no carried dependence.  */
>>>>>> +       if (level == 0)
>>>>>> +         continue;
>>>>>> +
>>>>>> +       level --;
>>>>>> +       /* Skip case which has '>' as the leftmost direction.  */
>>>>>> +       if (!lambda_vector_lexico_pos (dist_vect, level))
>>>>>> +         return false;
>>>>>
>>>>> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.
>>>> Done.
>>>>
>>>>>
>>>>>> +       /* If dependence is carried by outer loop of the two loops for
>>>>>> +          interchange.  */
>>>>>> +       if (level < outer)
>>>>>> +         continue;
>>>>>> +
>>>>>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>>>>> +       /* If directions at both inner/outer levels are the same.  */
>>>>>> +       if (dir_vect[inner] == dir_vect[outer])
>>>>>> +         continue;
>>>>>> +
>>>>>> +       /* Be conservative, skip case if either direction at inner/outer
>>>>>> +          levels is not '=' or '<'.  */
>>>>>> +       if (dir_vect[inner] != dir_equal
>>>>>> +           && dir_vect[inner] != dir_positive
>>>>>> +           && dir_vect[inner] != dir_independent
>>>>>> +           && dir_vect[inner] != dir_positive_or_equal)
>>>>>> +         return false;
>>>>>> +
>>>>>> +       if (dir_vect[outer] != dir_equal
>>>>>> +           && dir_vect[outer] != dir_positive
>>>>>> +           && dir_vect[outer] != dir_independent
>>>>>> +           && dir_vect[outer] != dir_positive_or_equal)
>>>>>> +         return false;
>>>>>
>>>>> Checking dir vectors doesn't make much sense in GCC: the elements are only
>>>>> ever set to dir_positive, dir_negative or dir_equal, exactly when the
>>>>> distance is > 0, < 0 or == 0.  So checking the dist vector is enough
>>>>> (though sameness of direction checks sameness of sign with zero).
>>>>> Incidentally:
>>>> Done.
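>>>> For reference, checking the distance vector directly boils down to roughly
>>>> this (a sketch, not the exact patch code):
>>>>
>>>>   for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); ++j)
>>>>     {
>>>>       lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>>>>       /* A negative distance at either level would mean a '>' direction,
>>>>          so be conservative and refuse to interchange.  */
>>>>       if (dist_vect[inner] < 0 || dist_vect[outer] < 0)
>>>>         return false;
>>>>     }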
>>>>
>>>>>
>>>>>> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
>>>>>> +{
>>>>>> +  struct data_dependence_relation *ddr;
>>>>>> +
>>>>>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>>>>> +    {
>>>>>> +      /* Skip no-dependence case.  */
>>>>>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>>>>> +     continue;
>>>>>> +
>>>>>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>>>>> +     {
>>>>>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>>>>> +       std::swap (dir_vect[inner], dir_vect[outer]);
>>>>>> +     }
>>>>>> +    }
>>>>>> +}
>>>>>
>>>>> Here you swap only the direction but not the distance vector, which can't
>>>>> be right.  I suggest only using (and updating) the distance vector.
>>>> Yeah, fixed.
>>>>
>>>>>
>>>>> And then your usage and update of DR_ACCESS_FNs: there's quite some
>>>>> complexity connected with that and I'm not sure how worthwhile it is.
>>>>> You're basically using the ACCESS_FNs to determine profitability (and not
>>>>> validity, and that's good).  But e.g. for pointer-based accesses like
>>>>> in Fortran with explicit address arithmetic the relation between access-fn
>>>>> step and stride and the actual access stride isn't that easy (e.g. in your
>>>>> should_interchange_loops function iloop_stride and oloop_stride will
>>>>> always be one for pointer-based accesses).
>>>>>
>>>>> Conceptually what you should check is how the access address of each data
>>>>> ref evolves for each loop, so why not do this explicitly?  What I
>>>>> mean is: calculate a (complicated) chrec for the DR addresses for the
>>>>> whole nest at the beginning.  It should be in a form like (assume "+"
>>>>> always):
>>>>>
>>>>>   {{{init, s1}_l1, s2}_l2, s3}_l3
>>>>>
>>>>> (i.e. all steps should be invariants/constants, and only one non-chrec
>>>>> init value).  Addresses which aren't in this form you're already ignoring
>>>>> right now, so you could continue doing that.  (Or better said, all
>>>>> non-constant steps you regard as being AVG_DIM_SIZE, which you still can
>>>>> continue doing).
>>>>>
>>>>> Now, with the above form you can form expressions for the difference
>>>>> between addresses per iteration for each loop (i.e. the address stride per
>>>>> loop); store these.  Then, when interchanging loops you merely need to
>>>>> swap these expressions, like you have to with the distance vector, instead
>>>>> of fiddling inside the DR_ACCESS_FNs themselves.  Much code would go away.
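>>>>>
>>>>> To make that concrete (just an illustration, not from the patch; N is a
>>>>> compile-time constant and A is double A[N][N]):
>>>>>
>>>>>   for (i = 0; i < N; i++)      /* loop 1 */
>>>>>     for (j = 0; j < N; j++)    /* loop 2 */
>>>>>       A[j][i] += 1.0;
>>>>>
>>>>> The address of A[j][i] over the nest would be
>>>>>
>>>>>   {{&A, 8}_1, 8*N}_2
>>>>>
>>>>> i.e. the address stride is 8 bytes per iteration of loop 1 and 8*N bytes
>>>>> per iteration of loop 2.  Interchange then only has to swap those two
>>>>> stored stride expressions; afterwards the innermost loop advances by 8
>>>>> bytes, which is what we want.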
>>>> Yeah.  I did a similar thing in the loop nest distribution pass; see
>>>> compute_access_range in tree-loop-distribution.c.  Actually, I would have
>>>> done the same here if I had implemented this pass after the loop nest
>>>> distribution patches.  Done in this updated patch.
>>>>
>>>>>
>>>>> Testcases: given that we had to remove our old separate interchange pass
>>>>> because it miscompiled stuff all over, I'm missing some testcases where
>>>>> interchange should _not_ happen for validity reasons, like my above
>>>>> example with a reduction that can't be moved inside.  Perhaps you can
>>>>> think of some more.
>>>> As mentioned above, it's hard to create a test that fails exactly for this
>>>> reason.  I added one in which data dependence prevents us from interchanging
>>>> the loops.
>>>>
>>>>>
>>>>> I hope this is of some help to you :)
>>>> Thanks again, it's very helpful.
>>>>
>>>> I also fixed several bugs in the previous implementation, mostly about debug
>>>> info statements and simple reductions.  As for testing, I enabled this pass
>>>> by default, bootstrapped and regtested GCC, and also built/ran SPEC.  There
>>>> must be some other latent bugs in it, but I guess we have to exercise it by
>>>> enabling it at some point.
>>>>
>>>> So any comments?
>>>
>>>  bool
>>> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
>>> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg)
>>>  {
>>>
>>> That you need this suggests you do stmt removal in the wrong order (you need
>>> to do reverse dom order).
>> As with the code below handling debug uses, this updated patch gives up on more
>> cases by scrapping debug uses now.  Hopefully this isn't a problem; the debugging
>> experience for interchanged loops is bad already?
>>>
>>> +/* Maximum number of statements in loop nest for loop interchange.  */
>>> +
>>> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
>>> +         "loop-interchange-max-num-stmts",
>>> +         "The maximum number of stmts in loop nest for loop interchange.",
>>> +         64, 0, 0)
>>>
>>> Is that to limit dependence computation?  In that case you should probably
>>> limit the number of data references instead?
>> Hmm, I kept this one; it is there to limit the size of loops considered for
>> interchange.
>>
>>>
>>> +ftree-loop-interchange
>>> +Common Report Var(flag_tree_loop_interchange) Optimization
>>> +Enable loop interchange on trees.
>>> +
>>>
>>> please re-use -floop-interchange instead and change the GRAPHITE tests
>>> to use -floop-nest-optimize.  You can do that as pre-approved thing now.
>> Done.  I will send an independent patch adjusting GRAPHITE tests.
>>
>>>
>>> Please enable the pass by default at O3 via opts.c.
>> I will do it in a separate patch because many vectorization tests are vulnerable
>> to interchange.  I checked these tests; interchange is good, but we need to
>> disable it explicitly for them.
>>
>>>
>>> diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
>>>
>>> gimple-loop-interchange.cc please.
>>>
>>> new file mode 100644
>>> index 0000000..abffbf6
>>> --- /dev/null
>>> +++ b/gcc/tree-ssa-loop-interchange.cc
>>> @@ -0,0 +1,2274 @@
>>> +/* Loop invariant motion.
>>> +   Copyright (C) 2017 Free Software Foundation, Inc.
>>>
>>> Loop invariant motion? ... ;)
>>>
>>> Please add a "Contributed by ..." to have an easy way to figure people to blame.
>>>
>>> +}*induction_p;
>>> +
>>>
>>> space after '*'
>>>
>>> +}*reduction_p;
>>> +
>>>
>>> likewise.
>> All done.
>>
>>>
>>> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
>>> +   ssa var appearing in any abnormal phi node.  */
>>> +
>>> +static inline bool
>>> +unsupported_phi_node (gphi *phi)
>>> +{
>>> +  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
>>> +    return true;
>>> +
>>> +  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
>>> +    {
>>> +      tree arg = PHI_ARG_DEF (phi, i);
>>> +      if (TREE_CODE (arg) == SSA_NAME
>>> +         && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
>>> +       return true;
>>> +    }
>>> +
>>> +  return false;
>>>
>>> I believe the above isn't necessary given you rule out abnormal edges
>>> into the loop.
>>> Did you have a testcase that broke?  A minor thing I guess if it is
>>> just for extra safety...
>> Extra safety, I guess.  I have now removed it and haven't run into compilation
>> issues so far.
>>
>>>
>>> +/* Return true if all stmts in BB can be supported by loop interchange,
>>> +   otherwise return false.  ILOOP is not NULL if this loop_cand is the
>>> +   outer loop in loop nest.  */
>>> +
>>> +bool
>>> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
>>> +{
>>>
>>> The docs and return value suggest this should be named supported_operation.
>> Done.
>>
>>>
>>> +      /* Or it's invariant memory reference and only used by inner loop.  */
>>> +      if (gimple_assign_single_p (stmt)
>>> +         && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
>>> +         && TREE_CODE (lhs) == SSA_NAME
>>> +         && single_use_in_loop (lhs, iloop->loop))
>>> +       continue;
>>>
>>> comment suggests multiple uses in loop would be ok?
>> Comment changed.
>>
>>>
>>> +      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
>>> +         || lhs != re->init)
>>> +       return;
>>> +
>>> +      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
>>> +         || !REFERENCE_CLASS_P (rhs))
>>> +       return;
>>>
>>> lhs and rhs are never NULL.  Please initialize them outside of the if.
>>> Do you want to disallow a DECL_P rhs here?
>> Done.
>>
>>>
>>> Can you add an overall function comment what this function does?  It seems
>>> to detect a reduction as produced by loop store-motion?  Thus it tries to
>>> get at enough information to perform the reverse transform?
>> Yes.  Comment added.
>>
>>>
>>> During review I have a hard time distinguishing between locals and members,
>>> so can you please prefix all member variables with m_ according to our
>>> code guidelines?  I guess what adds to the confusion is the loop_cand argument
>>> that is sometimes present for loop_cand member functions...
>>> (I personally prefer to prefix all member accesses with this-> but that's harder
>>> to enforce.)
>> Done by using m_* stuff.
>>
>>>
>>> +static void
>>> +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer)
>>> +{
>>>
>>> This extracts stmts starting from consumer but rules out other consumers
>>> (recursively).  Is that intended?  I wonder if you can avoid this dance...
>> Okay, so in case the reduction is initialized from a constant value, we need to
>> generate a new load memory reference when undoing it.  The new memory reference
>> may depend on idx variables, as in an ARRAY_REF.  This function tries to find
>> all dependent stmts needed for inserting it.  I didn't look into how GRAPHITE
>> tracks the idx variable and regenerates it when necessary; maybe we can do the
>> same in the future.
>
> Ok.
>
>>>
>>> +      /* Transform simple reduction of below form:
>>> +
>>> +          init = 0;
>>> +          loop:
>>> +            var = phi<init, next>
>>> +            next = var op ...
>>> +          reduc_sum = phi<next>
>>> +          MEM_REF[...] = reduc_sum
>>> +
>>> +        into:
>>> +
>>>
>>> which one is 'consumer'?  I wonder if you can simply leave all the dead code
>>> in place, just emitting the load-update-store stmts and removing the
>>> MEM_REF[...] = reduc_sum above?
>> Done.  I simplified the code generation for the pass and use the prerequisite
>> simple DCE interface to remove dead code.  Hope it's clearer now.
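>> At the source level the intent is roughly the following (illustration only,
>> with made-up array names; not the actual generated gimple):
>>
>>   /* Before: reduction kept in a scalar, store after the inner loop.  */
>>   for (i = 0; i < n; i++)
>>     {
>>       sum = a[i];
>>       for (j = 0; j < n; j++)
>>         sum += b[j][i];
>>       a[i] = sum;
>>     }
>>
>>   /* After undoing the reduction into memory form, the nest is perfect
>>      and can be interchanged.  */
>>   for (j = 0; j < n; j++)
>>     for (i = 0; i < n; i++)
>>       a[i] += b[j][i];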
>>
>>>
>>> +/* Eliminate dead code after loop interchange.  */
>>> +
>>> +void
>>> +loop_cand::eliminate_dead_code (void)
>>> +{
>>>
>>> PRE tracks "possible" dead defs and has a worklist algorithm in
>>> remove_dead_inserted_code.  I wonder if you can do sth similar?  That is,
>>> I wonder if doing a sweep from last to first stmt wouldn't be better here?
>>>
>>> +         /* Given copy propagation is done during interchange, we can
>>> +            simply check zero uses of var and eliminate it.  */
>>> +         if (is_gimple_assign (stmt)
>>> +             && !gimple_vuse (stmt)
>>>
>>> you probably meant to check gimple_vdef?
>>>
>>> +             && !gimple_has_volatile_ops (stmt)
>>> +             && !gimple_has_side_effects (stmt)
>>>
>>> the former is redundant
>>>
>>> +             && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
>>> +             && TREE_CODE (lhs) == SSA_NAME
>>> +             && has_zero_uses (lhs))
>>>
>>> if you use gimple_get_lhs () you can also handle calls.
>>>
>>> That said, this seems to be a very poor DCE, why is it necessary at all?
>> The local DCE is removed by reusing the new interface simple_dce_from_worklist.
>> But the DCE is necessary, given that otherwise dead code lasts until the
>> vectorizer.  At least we can save compilation time.
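>> Usage of the helper is roughly this (a sketch, not the exact patch code):
>>
>>   bitmap dce_seeds = BITMAP_ALLOC (NULL);
>>   /* Record SSA names whose definitions may have become dead during
>>      the transformation ...  */
>>   bitmap_set_bit (dce_seeds, SSA_NAME_VERSION (var));
>>   ...
>>   /* ... and let the shared helper remove whatever really is dead.  */
>>   simple_dce_from_worklist (dce_seeds);
>>   BITMAP_FREE (dce_seeds);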
>>
>>>
>>> +/* Interchange niters info of ILOOP and OLOOP while reset any other niters
>>> +   estimates information for now.  */
>>> +
>>> +static inline void
>>> +interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
>>> +{
>>> +  tree nb_iterations = oloop->nb_iterations;
>>> +
>>> +  oloop->any_upper_bound = false;
>>> +  oloop->any_likely_upper_bound = false;
>>> +  free_numbers_of_iterations_estimates (oloop);
>>> +
>>> +  oloop->nb_iterations = iloop->nb_iterations;
>>> +
>>> +  iloop->any_upper_bound = false;
>>> +  iloop->any_likely_upper_bound = false;
>>> +  free_numbers_of_iterations_estimates (iloop);
>>> +
>>> +  iloop->nb_iterations = nb_iterations;
>>>
>>> use std::swap?  Also I think if you can keep nb_iterations you
>>> can certainly keep the upper bounds.  You're probably
>>> afraid of the ->stmt references in the nb_iter_bound entries?
>>>
>>> Anyway, either scrap everything or try to keep everything.
>> Yeah, not only the stmts, but also the control_iv information because the SCEV
>> information may be corrupted during code transformation.
>> Now I discarded all the information.
>
> Note that given you interchange the loops but not the CFG or the loop structures
> you might want to swap loop->num and flags like ->force_vectorize.  That is,
> essentially change the ->header/latch association (and other CFG related stuff
> like recorded exits).
>
> It might also be that we want to / need to disable interchange for, say,
> ->force_vectorize inner loops or ->unroll != 0?  Or we need to clear them,
> maybe optionally diagnosing that fact.
>
> At least we need to think about what it means to preserve loop structure
> (semantically, loop->num should maintain its association to the same
> source-level loop throughout the compilation) for transforms like interchange.
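>
> Something along these lines, I mean (just a sketch of the idea, untested):
>
>   /* Besides niters, also exchange per-loop annotations that are really
>      properties of the source-level loop.  */
>   std::swap (iloop->force_vectorize, oloop->force_vectorize);
>   std::swap (iloop->dont_vectorize, oloop->dont_vectorize);
>   std::swap (iloop->safelen, oloop->safelen);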
>
>>>
>>> +  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
>>> +    {
>>> +      if (re->type != DOUBLE_RTYPE)
>>> +       gcc_unreachable ();
>>> +
>>> +      use_operand_p use_p;
>>> +      imm_use_iterator iterator;
>>> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
>>> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
>>> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
>>> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
>>> +      if (TREE_CODE (re->init) == SSA_NAME)
>>> +       {
>>> +         FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
>>> +           mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
>>> +       }
>>>
>>> Can you add a comment about what you are doing here?
>>>
>>> Note that other loop opts simply scrap all debug stmts ...
>> As mentioned above, the updated patch doesn't try hard to maintain debug use
>> info anymore.
>>
>>>
>>> +static void
>>> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
>>> +                      data_reference_p dr)
>>> +{
>>> ...
>>> +  tree ref = DR_REF (dr);
>>> +  tree scev_base = build_fold_addr_expr (ref);
>>> +  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
>>> +  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
>>> +  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
>>> +
>>> +  do {
>>> +    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
>>> +    if (chrec_contains_undetermined (scev_fn)
>>> +       || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
>>> +      break;
>>> ...
>>> +    strides->safe_push (scev_step);
>>> +  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
>>> +
>>>
>>> I _think_ you want to do
>>>
>>>    scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
>>> DR_STMT (dr) is in loop
>>>    scev_fn = instantiate_parameters (nest, scev_fn);
>>>    if (chrec_contains_undetermined (scev_fn))
>>>      return; // false?
>>>
>>> and analyze the result which should be of the form
>>>
>>>   { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>>>
>>> if canonical.  I think estimate_val_by_simplify_replace isn't needed if you
>>> do that (it also looks odd to replace all variables in the step by niter...).
>> I replied to this in the previous message; instantiate_parameters doesn't always
>> give a canonical form result as expected.  The loop here could be seen as a
>> local instantiation process, right?
>
> Kind of.  I'll see if I can reproduce the difference with any of your
> interchange testcases - any hint which one to look at?

So, doing

  tree scev = analyze_scalar_evolution (loop, scev_base);
  scev = instantiate_scev (loop_preheader_edge (loop_nest), loop, scev);
  if (! chrec_contains_undetermined (scev))
    {
      tree sl = scev;
      struct loop *expected = loop;
      while (TREE_CODE (sl) == POLYNOMIAL_CHREC)
        {
          struct loop *sl_loop = get_chrec_loop (sl);
          while (sl_loop != expected)
            {
              strides2->safe_push (size_int (0));
              expected = loop_outer (expected);
            }
          strides2->safe_push (CHREC_RIGHT (sl));
          sl = CHREC_LEFT (sl);
          expected = loop_outer (expected);
        }
      while (expected != loop_outer (loop_nest))
        {
          strides2->safe_push (size_int (0));
          expected = loop_outer (expected);
        }
    }

produces exactly the same strides array as yours.

>> Also estimate_val_by_simplify_replace is needed for pointers, where strides
>> are computed from niters of loops which may not be compile-time constants.
>> But yes, it's an odd fixup after I failed to do anything better.
>
> But you are, for example, computing _1 - _2 to zero, right?  Because both _1
> and _2 are not constant and thus you replace them with the same (symbolic)
> constant 'niter'.
>
> I think that asks for garbage-in-garbage-out ...
>
> Which testcase is this important for so I can have a look?

.. which means your testcases don't cover this case?  And they also don't cover
the cases of EV_DIR_DECREASES/UNKNOWN?

If you want to do incremental analysis of the SCEV, to be able to shrink the
loop nest when an outer loop poses an issue, you should use
analyze_scalar_evolution (loop, scev_base) once and then instantiate_scev (...)
as above for the outer loop preheader edge you are looking at.

But having a testcase covering that case would be nice.
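Something like the following would do, I guess (an untested sketch, just to
have a decreasing evolution in the inner dimension):

  #define N 256
  double a[N][N], b[N][N];

  void
  foo (void)
  {
    for (int i = 0; i < N; ++i)
      for (int j = N - 1; j >= 0; --j)
        a[j][i] = b[j][i] + 1.0;
  }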

>>>
>>> I think keeping the chrec in the above form is also more suitable for what
>>> the caller does so the outermost loop is simply
>>>
>>>   loop = loop_nest;
>>>   loop-over-all-dr-chrecs
>>>     if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
>>>       loop = CHREC_LOOP (chrec);
>>>
>>> given the outermost loop is the outer evolution.  If you sort the stride vecs
>>> from inner to outer you don't need prune_access_strides_not_in_loop.
>> Hmm, so stripping outer loops prefers an inner-to-outer sort of the strides, but
>> the cost computation during interchange prefers an outer-to-inner sort because
>> loop_nest in tree-data-ref is sorted that way.  It seems a single prune_*
>> function is better than fiddling with the cost computation.
>
> Not sure how to interpret your answer...  I'll see if I can come up with a more
> detailed suggestion after playing with the code a bit.
>
>>>
>>> +/* Count and return the number of loops in LOOP_NEST.  */
>>> +
>>> +unsigned int
>>> +num_loops_in_loop_nest (struct loop *loop_nest)
>>> +{
>>> +  unsigned num_loops;
>>> +  for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
>>> +    ;
>>> +  return num_loops;
>>>
>>> loop_depth (innermost) - loop_depth (nest)?
>> Done.
>>
>>>
>>> +static bool
>>> +should_interchange_loops (unsigned i_idx, unsigned o_idx,
>>> +                         vec<data_reference_p> datarefs,
>>> +                         bool innermost_loops_p, bool dump_info_p = true)
>>> +{
>>>
>>> Isn't all we need to associate the above CHREC, sort by the CHREC_RIGHTs
>>> and figure out a permutation sequence to arrive there?  That is, for the local
>>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
>>> when we should interchange?
>>>
>>> That subloop_stride_p and the tracking of invariant DRs look a bit odd.  For
>>> loops where a DR is invariant you simply do not have an evolution in that loop.
>>> You seem to simply add up the strides in the inner and outer loops for each DR;
>>> can you explain how that works?  Also I guess strides bigger than the
>>> various cache-line size parameters should be treated 'equal'?  That is,
>>> if we don't get any DR to a step that results in L1 cache hits because
>>> two DRs share a cache line the interchange is pointless?
>> So given the below loop:
>>
>>   for (int i = 0; i < M; i++)
>>     for (int j = 0; j < M; j++)
>>       for (int k = 0; k < M; k++)
>>         a[k][0][i] = b[k][0][i]
>>
>> We check whether a memory reference is invariant wrt a loop only if it has zero
>> stride within the current loop nest.  In this example, there is no invariant
>> reference, given that the address changes in the innermost loop.
>
> But they simply wouldn't take part in the sorting?  That is, invariant refs in
> a loop shouldn't prevent it from becoming more inner or more outer, no?
>
>> For strides bigger than the cache-line size, it's also possible the interchange
>> is wanted, as in the below example:
>>
>>   for (int i = 0; i < M; i++)          //loop 1
>>     for (int j = 0; j < M; j++)        //loop 2
>>       for (int k = 0; k < M; k++)    //loop 3
>>         a[j][i][k] = b[j][i][k]
>>
>> Strides for loops 1/2 are very likely to be big, but after interchange, we will
>> have streaming access for both arrays.
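>> That is, after interchange the nest above becomes (for illustration):
>>
>>   for (int j = 0; j < M; j++)
>>     for (int i = 0; i < M; i++)
>>       for (int k = 0; k < M; k++)
>>         a[j][i][k] = b[j][i][k];
>>
>> which walks both arrays in memory order.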
>>
>> More advanced heuristics may be possible, but so far the estimation is quite
>> good, judging by all the interchanges I looked into.
>>
>>>
>>> +/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
>>> +
>>> +static inline void
>>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
>>> +{
>>> +  struct data_reference *dr;
>>> +
>>> +  for (unsigned i = 0; datarefs.iterate (i, &dr);)
>>> +    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
>>> +      i++;
>>> +    else
>>> +      {
>>> +       datarefs.ordered_remove (i);
>>>
>>> That's expensive.  It's better to keep moving the DRs we want to keep
>>> when walking the array.  That is, add a j that you increment only when
>>> we keep a DR, moving *i to *j.
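>>> I.e. roughly this (untested sketch):
>>>
>>>   unsigned i, j;
>>>   struct data_reference *dr;
>>>   for (i = 0, j = 0; datarefs.iterate (i, &dr); ++i)
>>>     if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
>>>       datarefs[j++] = dr;
>>>     else
>>>       /* ... releasing dr->aux / the stride vector as you do now ...  */
>>>       free_data_ref (dr);
>>>   datarefs.truncate (j);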
>> Done.
>>
>>>
>>> +       if (dr->aux)
>>> +         {
>>> +           DR_ACCESS_STRIDE (dr)->release ();
>>> +           free (dr->aux);
>>> +         }
>>> +       free_data_ref (dr);
>>> +      }
>>> +}
>>>
>>> +
>>> +  start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
>>> +
>>>
>>> Hmm.  If you instantiate the SCEV for the niters for each loop in the nest
>>> wrt the nest you can figure if it has any evolution in sth else than the
>>> loop (evolution_function_is_univariate_p).   That is, this is not a problem
>>> until you arrive at analyzing DR strides, right?  At which point you
>>> can check for the appropriate form?
>> Hmm, not really.  The niter relation may not appear in the SCEV of the
>> reference address.  For example, in the below loop:
>>
>>   for (int i = 0; i < M; i++)          //loop 1
>>     for (int j = 0; j < M; j++)        //loop 2
>>       for (int k = 0; k < i; k++)    //loop 3
>>         a[k][0][i] = b[k][0][i]
>>
>> There is no information in the data references about the i/j loops.
>> Anyway, I refactored the code and put this check in
>> proper_loop_form_for_interchange.  It's simpler, I think.
>>
>>>
>>> +  if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
>>> +      /* Check if there is no data reference.  */
>>> +      || datarefs->length () == 0
>>> +      /* Check if there are too many of data references.  */
>>> +      || ((int) datarefs->length ()
>>> +         > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
>>> +      /* Check if there is any data reference in loop latch.  We can't handle
>>> +        loops which loop header and data references have different execution
>>> +        times.  */
>>> +      || dataref_niters_diff_to_loop_header (*datarefs)
>>>
>>> this suggests to do your own find_data_references_in_loop so you can early
>>> out.
>> I refactored the code a bit.  Now this check is in proper_loop_form_for_interchange,
>> but I did customize my own data reference finder.  It's needed to strip outer
>> loops once a difficult reference is found.
>>
>>>
>>> Overall the flow through the pass is a bit hard to follow given there are,
>>> IMHO, too many functions.
>> Yeah, I removed quite a number of small functions and refactored the code a
>> lot.  Hope this version is more straightforward.
>>>
>>> +unsigned int
>>> +pass_linterchange::execute (function *fun)
>>> +{
>>> +  if (number_of_loops (fun) <= 2)
>>> +    return 0;
>>> +
>>> +  bool changed_p = false;;
>>> +  struct loop *loop;
>>> +  vec<loop_p> loop_nest;
>>> +  vec<data_reference_p> datarefs;
>>> +  vec<ddr_p> ddrs;
>>> +
>>> +  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
>>> +    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
>>> +      {
>>> +       tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
>>> +       changed_p |= loop_interchange.interchange ();
>>> +      }
>>>
>>> you leak datarefs/ddrs?
>> They were released in the destructor, but I refactored it anyway.  I will push
>> the code to the branch gcc.gnu.org/svn/gcc/branches/gimple-linterchange.
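>> With the refactoring the cleanup is simply something like (sketch):
>>
>>   free_dependence_relations (ddrs);
>>   free_data_refs (datarefs);
>>   loop_nest.release ();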
>>
>> Thanks again to you two for the comments.
>
> Digging into the code now...
>
> Richard.
>
>> Thanks,
>> bin
>> 2017-11-27  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * Makefile.in (gimple-loop-interchange.o): New object file.
>>         * common.opt (floop-interchange): Reuse the option from graphite.
>>         * doc/invoke.texi (-floop-interchange): Ditto.  New document.
>>         * gimple-loop-interchange.cc: New file.
>>         * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
>>         (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>>         * passes.def (pass_linterchange): New pass.
>>         * timevar.def (TV_LINTERCHANGE): New time var.
>>         * tree-pass.h (make_pass_linterchange): New declaration.
>>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
>>         interchange.  Record IV before/after increment in new parameters.
>>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>
>> gcc/testsuite
>> 2017-11-27  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-11.c: New test.


