[PATCH] Unswitching outer loops.

Tue Oct 6 07:59:00 GMT 2015

On Mon, Oct 5, 2015 at 3:13 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Thanks Richard.
> I'd like to answer on your last comment related to using of exit edge
> argument for edge that skips loop.
> Let's consider the following test-case:
>
> #include <stdlib.h>
> #define N 32
> float *foo(int ustride, int size, float *src)
> {
>    float *buffer, *p;
>    int i, k;
>
>    if (!src)
>     return NULL;
>
>    buffer = (float *) malloc(N * size * sizeof(float));
>
>    if(buffer)
>       for(i=0, p=buffer; i<N; i++, src+=ustride)
> for(k=0; k<size; k++)
>  *p++ = src[k];
>
>    return buffer;
> }
>
> Before adding new edge we have in post-header bb:
>   <bb 9>:
>   # _6 = PHI <0B(8), buffer_20(16)>
>   return _6;
>
> It is clear that we must preserve function semantic and transform it to
> _6 = PHI <0B(12), buffer_19(9), buffer_19(4)>

Ah, yeah.  I was confusing the loop exit of the inner vs. the outer loop.

Richard.

>
> 2015-10-05 13:57 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Wed, Sep 30, 2015 at 12:46 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Hi Richard,
>>>
>>> I re-designed outer loop unswitching using basic idea of 23855 patch -
>>> hoist invariant guard if loop is empty without guard. Note that this
>>> was added to loop unswitching pass with simple modifications - using
>>> another loop iterator etc.
>>>
>>> Bootstrap and regression testing did not show any new failures.
>>> What is your opinion?
>>
>> Overall it looks good.  Some comments below - a few more testcases would
>> be nice as well.
>>
>> +  /* Loop must not be infinite.  */
>> +  if (!finite_loop_p (loop))
>> +    return false;
>>
>> why's that?
>>
>> +  body = get_loop_body_in_dom_order (loop);
>> +  for (i = 0; i < loop->num_nodes; i++)
>> +    {
>> +      if (body[i]->loop_father != loop)
>> +       continue;
>> +      if (!empty_bb_without_guard_p (loop, body[i]))
>>
>> I wonder if there is a better way to iterate over the interesting
>> blocks and PHIs
>> we need to check for side-effects (and thus we maybe can avoid gathering
>> the loop in DOM order).
>>
>> +      FOR_EACH_SSA_TREE_OPERAND (name, stmt, op_iter, SSA_OP_DEF)
>> +       {
>> +         if (may_be_used_outside
>>
>> may_be_used_outside can be hoisted above the loop.  I wonder if we can take
>> advantage of loop-closed SSA form here (and the fact we have a single exit
>> from the loop).  Iterating over exit dest PHIs and determining whether the
>> exit edge DEF is inside the loop part it may not be should be enough.
>>
>> +  gcc_assert (single_succ_p (pre_header));
>>
>> that should be always true.
>>
>> +  gsi_remove (&gsi, false);
>> +  bb = guard->dest;
>> +  remove_edge (guard);
>> +  /* Update dominance for destination of GUARD.  */
>> +  if (EDGE_COUNT (bb->preds) == 0)
>> +    {
>> +      basic_block s_bb;
>> +      gcc_assert (single_succ_p (bb));
>> +      s_bb = single_succ (bb);
>> +      delete_basic_block (bb);
>> +      if (single_pred_p (s_bb))
>> +       set_immediate_dominator (CDI_DOMINATORS, s_bb, single_pred (s_bb));
>>
>> all this massaging should be simplified by leaving it to CFG cleanup by
>> simply adjusting the CONDs condition to always true/false.  There is
>> gimple_cond_make_{true,false} () for this (would be nice to have a variant
>> taking a bool).
>>
>> +  new_edge = make_edge (pre_header, exit->dest, flags);
>> +  if (fix_dom_of_exit)
>> +    set_immediate_dominator (CDI_DOMINATORS, exit->dest, pre_header);
>> +  update_stmt (gsi_stmt (gsi));
>>
>> the update_stmt should be not necessary, it's done by gsi_insert_after already.
>>
>> +  /* Add NEW_ADGE argument for all phi in post-header block.  */
>> +  bb = exit->dest;
>> +  for (gphi_iterator gsi = gsi_start_phis (bb);
>> +       !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      gphi *phi = gsi.phi ();
>> +      /* edge_iterator ei; */
>> +      tree arg;
>> +      if (virtual_operand_p (gimple_phi_result (phi)))
>> +       {
>> +         arg = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
>> +         add_phi_arg (phi, arg, new_edge, UNKNOWN_LOCATION);
>> +       }
>> +      else
>> +       {
>> +         /* Use exit edge argument.  */
>> +         arg = PHI_ARG_DEF_FROM_EDGE (phi, exit);
>> +         add_phi_arg (phi, arg, new_edge, UNKNOWN_LOCATION);
>>
>> Hum.  How is it ok to use the exit edge argument for the edge that skips
>> the loop?  Why can't you always use the pre-header edge value?
>> That is, if we have
>>
>>  for(i=0;i<m;++i)
>>    {
>>      if (n > 0)
>>     {
>>      for (;;)
>>        {
>>        }
>>      }
>>    }
>>   ... = i;
>>
>> then i is used after the loop and the correct value to use if
>> n > 0 is false is '0'.  Maybe this way we can also relax
>> what check_exit_phi does?  IMHO the only restriction is
>> if sth defined inside the loop before the header check for
>> the inner loop is used after the loop.
>>
>> Thanks,
>> Richard.
>>
>>> Thanks.
>>>
>>> ChangeLog:
>>> 2015-09-30  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>>
>>> * tree-ssa-loop-unswitch.c: Include "gimple-iterator.h" and
>>> "cfghooks.h", add prototypes for introduced new functions.
>>> (tree_ssa_unswitch_loops): Use from innermost loop iterator, move all
>>> checks on ability of loop unswitching to tree_unswitch_single_loop;
>>> invoke tree_unswitch_single_loop or tree_unswitch_outer_loop depending
>>> on innermost loop check.
>>> (tree_unswitch_single_loop): Add all required checks on ability of
>>> loop unswitching under zero recursive level guard.
>>> (tree_unswitch_outer_loop): New function.
>>> (find_loop_guard): Likewise.
>>> (empty_bb_without_guard_p): Likewise.
>>> (used_outside_loop_p): Likewise.
>>> (hoist_guard): Likewise.
>>> (check_exit_phi): Likewise.
>>>
>>>    gcc/testsuite/ChangeLog:
>>> * gcc.dg/loop-unswitch-2.c: New test.
>>>
>>> 2015-09-16 11:26 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>> Yeah, as said, the patch wasn't fully ready and it also felt odd to do
>>>> this hoisting in loop header copying.  Integrating it
>>>> with LIM would be a better fit eventually.
>>>>
>>>> Note that we did agree to go forward with your original patch just
>>>> making it more "generically" perform outer loop
>>>> unswitching.  Did you explore that idea further?
>>>>
>>>>
>>>>
>>>> On Tue, Sep 15, 2015 at 6:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>> Thanks Richard.
>>>>>
>>>>> I found one more issue that could not be fixed simply. In 23855 you
>>>>> consider the following test-case:
>>>>> void foo(int *ie, int *je, double *x)
>>>>> {
>>>>>   int i, j;
>>>>>   for (j=0; j<*je; ++j)
>>>>>     for (i=0; i<*ie; ++i)
>>>>>       x[i+j] = 0.0;
>>>>> }
>>>>> and proposed to hoist up a check on *ie out of loop. It requires
>>>>> memref alias analysis since in general x and ie can alias (if their
>>>>> types are compatible - int *ie & int * x). Such analysis is performed
>>>>> by pre or lim passes. Without such analysis we can not hoist a test on
>>>>> non-zero for *ie out of loop using 238565 patch.
>>>>>  The second concern is that proposed copy header algorithm changes
>>>>> loop structure significantly and it is not accepted by vectorizer
>>>>> since latch is not empty (such transformation assumes loop peeling for
>>>>> one iteration. So I can propose to implement simple guard hoisting
>>>>> without copying header and tail blocks (if it is possible).
>>>>>
>>>>> I will appreciate you for any advice or help since without such
>>>>> hoisting we are not able to perform outer loop vectorization for
>>>>> important benchmark.
>>>>> and
>>>>>
>>>>> 2015-09-15 14:22 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>> On Thu, Sep 3, 2015 at 6:32 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>> Hi Richard,
>>>>>>>
>>>>>>> I started learning, tuning and debugging patch proposed in 23855 and
>>>>>>> discovered thta it does not work properly.
>>>>>>> So I wonder is it tested patch and it should work?
>>>>>>
>>>>>> I don't remember, but as it wasn't committed it certainly wasn't ready.
>>>>>>
>>>>>>> Should it accept for hoisting the following loop nest
>>>>>>>   for (i=0; i<n; i++) {
>>>>>>>     s = 0;
>>>>>>>     for (j=0; j<m; j++)
>>>>>>>       s += a[i] * b[j];
>>>>>>>     c[i] = s;
>>>>>>>   }
>>>>>>> Note that i-loop will nit be empty if m is equal to 0.
>>>>>>
>>>>>> if m is equal to 0 then we still have the c[i] = s store, no?  Of course
>>>>>> we could unswitch the outer loop on m == 0 but simple hoisting wouldn't work.
>>>>>>
>>>>>> Richard.
>>>>>>
>>>>>>> 2015-08-03 10:27 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>> On Fri, Jul 31, 2015 at 1:17 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>> Hi Richard,
>>>>>>>>>
>>>>>>>>> I learned your updated patch for 23825 and it is more general in
>>>>>>>>> comparison with my.
>>>>>>>>> I'd like to propose you a compromise - let's consider my patch only
>>>>>>>>> for force-vectorize outer loop only to allow outer-loop
>>>>>>>>> vecctorization.
>>>>>>>>
>>>>>>>> I don't see why we should special-case that if the approach in 23825
>>>>>>>> is sensible.
>>>>>>>>
>>>>>>>>> Note that your approach will not hoist invariant
>>>>>>>>> guards if loops contains something else except for inner-loop, i.e. it
>>>>>>>>> won't be empty for taken branch.
>>>>>>>>
>>>>>>>> Yes, it does not perform unswitching but guard hoisting.  Note that this
>>>>>>>> is originally Zdenek Dvoraks patch.
>>>>>>>>
>>>>>>>>> I also would like to answer on your last question - CFG cleanup is
>>>>>>>>> invoked to perform deletion of single-argument phi nodes from tail
>>>>>>>>> block through substitution - such phi's prevent outer-loop
>>>>>>>>> vectorization. But it is clear that such transformation can be done
>>>>>>>>> other pass.
>>>>>>>>
>>>>>>>> Hmm, I wonder why the copy_prop pass after unswitching does not
>>>>>>>> get rid of them?
>>>>>>>>
>>>>>>>>> What is your opinion?
>>>>>>>>
>>>>>>>> My opinion is that if we want to enhance unswitching to catch this
>>>>>>>> (or similar) cases then we should make it a lot more general than
>>>>>>>> your pattern-matching approach.  I see nothing that should prevent
>>>>>>>> us from considering unswitching non-innermost loops in general.
>>>>>>>> It should be only a cost consideration to not do non-innermost loop
>>>>>>>> unswitching (in addition to maybe a --param specifying the maximum
>>>>>>>> depth of a loop nest to unswitch).
>>>>>>>>
>>>>>>>> So my first thought when seeing your patch still holds - the patch
>>>>>>>> looks very much too specific.
>>>>>>>>
>>>>>>>> Richard.
>>>>>>>>
>>>>>>>>> Yuri.
>>>>>>>>>
>>>>>>>>> 2015-07-28 13:50 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>>>> On Thu, Jul 23, 2015 at 4:45 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>
>>>>>>>>>>> I checked that both test-cases from 23855 are sucessfully unswitched
>>>>>>>>>>> by proposed patch. I understand that it does not catch deeper loop
>>>>>>>>>>> nest as
>>>>>>>>>>>    for (i=0; i<10; i++)
>>>>>>>>>>>      for (j=0;j<n;j++)
>>>>>>>>>>>         for (k=0;k<20;k++)
>>>>>>>>>>>   ...
>>>>>>>>>>> but duplication of middle-loop does not look reasonable.
>>>>>>>>>>>
>>>>>>>>>>> Here is dump for your second test-case:
>>>>>>>>>>>
>>>>>>>>>>> void foo(int *ie, int *je, double *x)
>>>>>>>>>>> {
>>>>>>>>>>>   int i, j;
>>>>>>>>>>>   for (j=0; j<*je; ++j)
>>>>>>>>>>>     for (i=0; i<*ie; ++i)
>>>>>>>>>>>       x[i+j] = 0.0;
>>>>>>>>>>> }
>>>>>>>>>>> grep -i unswitch t6.c.119t.unswitch
>>>>>>>>>>> ;; Unswitching outer loop
>>>>>>>>>>
>>>>>>>>>> I was saying that why go with a limited approach when a patch (in
>>>>>>>>>> unknown state...)
>>>>>>>>>> is available that does it more generally?  Also unswitching is quite
>>>>>>>>>> expensive compared
>>>>>>>>>> to "moving" the invariant condition.
>>>>>>>>>>
>>>>>>>>>> In your patch:
>>>>>>>>>>
>>>>>>>>>> +  if (!nloop->force_vectorize)
>>>>>>>>>> +    nloop->force_vectorize = true;
>>>>>>>>>> +  if (loop->safelen != 0)
>>>>>>>>>> +    nloop->safelen = loop->safelen;
>>>>>>>>>>
>>>>>>>>>> I see no guard on force_vectorize so = true looks bogus here.  Please just use
>>>>>>>>>> copy_loop_info.
>>>>>>>>>>
>>>>>>>>>> +  if (integer_nonzerop (cond_new))
>>>>>>>>>> +    gimple_cond_set_condition_from_tree (cond_stmt, boolean_true_node);
>>>>>>>>>> +  else if (integer_zerop (cond_new))
>>>>>>>>>> +    gimple_cond_set_condition_from_tree (cond_stmt, boolean_false_node);
>>>>>>>>>>
>>>>>>>>>> gimple_cond_make_true/false (cond_stmt);
>>>>>>>>>>
>>>>>>>>>> btw, seems odd that we have to recompute which loop is the true / false variant
>>>>>>>>>> when we just fed a guard condition to loop_version.  Can't we statically
>>>>>>>>>> determine whether loop or nloop has the in-loop condition true or false?
>>>>>>>>>>
>>>>>>>>>> +  /* Clean-up cfg to remove useless one-argument phi in exit block of
>>>>>>>>>> +     outer-loop.  */
>>>>>>>>>> +  cleanup_tree_cfg ();
>>>>>>>>>>
>>>>>>>>>> I know unswitching is already O(number-of-unswitched-loops * size-of-function)
>>>>>>>>>> because it updates SSA form after each individual unswitching (and it does that
>>>>>>>>>> because it invokes itself recursively on unswitched loops).  But do you really
>>>>>>>>>> need to invoke CFG cleanup here?
>>>>>>>>>>
>>>>>>>>>> Richard.
>>>>>>>>>>
>>>>>>>>>>> Yuri.
>>>>>>>>>>>
>>>>>>>>>>> 2015-07-14 14:06 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>>>>>> On Fri, Jul 10, 2015 at 12:02 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is presented simple transformation which tries to hoist out of
>>>>>>>>>>>>> outer-loop a check on zero trip count for inner-loop. This is very
>>>>>>>>>>>>> restricted transformation since it accepts outer-loops with very
>>>>>>>>>>>>> simple cfg, as for example:
>>>>>>>>>>>>>     acc = 0;
>>>>>>>>>>>>>    for (i = 1; i <= m; i++) {
>>>>>>>>>>>>>       for (j = 0; j < n; j++)
>>>>>>>>>>>>>          if (l[j] == i) { v[j] = acc; acc++; };
>>>>>>>>>>>>>       acc <<= 1;
>>>>>>>>>>>>>    }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that degenerative outer loop (without inner loop) will be
>>>>>>>>>>>>> completely deleted as dead code.
>>>>>>>>>>>>> The main goal of this transformation was to convert outer-loop to form
>>>>>>>>>>>>> accepted by outer-loop vectorization (such test-case is also included
>>>>>>>>>>>>> to patch).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Bootstrap and regression testing did not show any new failures.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is it OK for trunk?
>>>>>>>>>>>>
>>>>>>>>>>>> I think this is
>>>>>>>>>>>>
>>>>>>>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=23855
>>>>>>>>>>>>
>>>>>>>>>>>> as well.  It has a patch adding a invariant loop guard hoisting
>>>>>>>>>>>> phase to loop-header copying.  Yeah, it needs updating to
>>>>>>>>>>>> trunk again I suppose.  It's always non-stage1 when I come
>>>>>>>>>>>> back to that patch.
>>>>>>>>>>>>
>>>>>>>>>>>> Your patch seems to be very specific and only handles outer
>>>>>>>>>>>> loops of innermost loops.
>>>>>>>>>>>>
>>>>>>>>>>>> Richard.
>>>>>>>>>>>>
>>>>>>>>>>>>> ChangeLog:
>>>>>>>>>>>>> 2015-07-10  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> * tree-ssa-loop-unswitch.c: Include "tree-cfgcleanup.h" and
>>>>>>>>>>>>> "gimple-iterator.h", add prototype for tree_unswitch_outer_loop.
>>>>>>>>>>>>> (tree_ssa_unswitch_loops): Add invoke of tree_unswitch_outer_loop.
>>>>>>>>>>>>> (tree_unswitch_outer_loop): New function.
>>>>>>>>>>>>>
>>>>>>>>>>>>> gcc/testsuite/ChangeLog:
>>>>>>>>>>>>> * gcc.dg/tree-ssa/unswitch-outer-loop-1.c: New test.
>>>>>>>>>>>>> * gcc.dg/vect/vect-outer-simd-3.c: New test.