[PATCH 2/3] Extended if-conversion

Fri Jan 9 12:31:00 GMT 2015

On Mon, Dec 22, 2014 at 3:39 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
> Richard,
>
> I changed algorithm for bool pattern repair.
> It turned out that ifcvt_local_dce phaase is required since for
> test-case I sent you in previous mail vectorization is not performed
> without dead code elimination:
>
> For the loop
> #pragma omp simd safelen(8)
>   for (i=0; i<512; i++)
>   {
>     float t = a[i];
>     if (t > 0.0f & t < 1.0e+17f)
>       if (c[i] != 0)
> res += 1;
>   }
>
> I've got the following message from vectorizer:
>
> t3.c:10:11: note: ==> examining statement: _ifc__39 = t_5 > 0.0;
>
> t3.c:10:11: note: bit-precision arithmetic not supported.
> t3.c:10:11: note: not vectorized: relevant stmt not supported:
> _ifc__39 = t_5 > 0.0;
>
> It is caused by the following dead predicate computations after
> critical edge splitting:
>
> (after combine blocks):
>
> <bb 3>:
> # res_15 = PHI <res_1(7), 0(19)>
> # i_16 = PHI <i_11(7), 0(19)>
> # ivtmp_14 = PHI <ivtmp_13(7), 512(19)>
> t_5 = a[i_16];
> _6 = t_5 > 0.0;
> _7 = t_5 < 9.9999998430674944e+16;
> _8 = _6 & _7;
> _10 = &c[i_16];
> _ifc__36 = _8 ? 4294967295 : 0;
> _9 = MASK_LOAD (_10, 0B, _ifc__36);
> _28 = _8;
> _29 = _9 != 0;
> _30 = _28 & _29;
> // Statements below are dead!!
> _31 = _8;
> _32 = _9 != 0;
> _33 = ~_32;
> _34 = _31 & _33;
> // End of dead statements.
> _ifc__35 = _30 ? 1 : 0;
> res_1 = res_15 + _ifc__35;
> i_11 = i_16 + 1;
> ivtmp_13 = ivtmp_14 - 1;
> if (ivtmp_13 != 0)
>   goto <bb 7>;
> else
>   goto <bb 8>;
>
> But if we delete these statements loop will be vectorized.

Hm, ok.  We insert predicates too early obviously and not only when
needed.  But let's fix that later.

> New patch is attached.

 fold_build_cond_expr (tree type, tree cond, tree rhs, tree lhs)
 {
   tree rhs1, lhs1, cond_expr;
+
+  /* If COND is comparison r != 0 and r has boolean type, convert COND
+     to SSA_NAME to accept by vect bool pattern.  */
+  if (TREE_CODE (cond) == NE_EXPR)
+    {
+      tree op0 = TREE_OPERAND (cond, 0);
+      tree op1 = TREE_OPERAND (cond, 1);
+      if (TREE_CODE (op0) == SSA_NAME
+         && TREE_CODE (TREE_TYPE (op0)) == BOOLEAN_TYPE
+         && (integer_zerop (op1)))
+       cond = op0;
+      else if (TREE_CODE (op1) == SSA_NAME
+              && TREE_CODE (TREE_TYPE (op1)) == BOOLEAN_TYPE
+              && (integer_zerop (op0)))
+       cond = op1;

The 2nd form, 0 != SSA_NAME doesn't happen due to operand
canonicalization.  Please remove its handling.

+      if (gimple_phi_num_args (phi) != 2)
+       {
+         if (!aggressive_if_conv)

&& !aggressive_if_conv

+  if (EDGE_COUNT (bb->preds) > 2)
+    {
+      if (!aggressive_if_conv)

Likewise.

-      gimple reduc;
+         && (rhs = gimple_phi_arg_def (phi, 0)))) {

the { goes to the next line

 static void
 predicate_mem_writes (loop_p loop)
 {
-  unsigned int i, orig_loop_num_nodes = loop->num_nodes;
+  unsigned int i, j, orig_loop_num_nodes = loop->num_nodes;
+  tree mask_vec[10];

an upper limit of 10?

+      for (j=0; j<10; j++)

spaces around '<' and '='

+       mask_vec[j] = NULL_TREE;
+

+           gcc_assert (exact_log2 (bitsize) != -1);
+           if ((mask = mask_vec[exact_log2 (bitsize)]) == NULL_TREE)
+             {

this seems to be a completely separate "optimization"?  Note that
there are targets with non-power-of-two bitsize modes (PSImode),
so the assert will likely trigger.  I would prefer if you separate this
part of the patch.

+      if ( gimple_code (stmt) != GIMPLE_ASSIGN)
+       continue;

no space before gimple_code

+  imm_use_iterator imm_iter;
+
+
+  worklist.create (64);

excessive vertical space.

The patch misses the addition of new testcases - please add some,
otherwise the code will be totally untested.

I assume the patch passes bootstrap and regtest (you didn't say so).
Can you also do a bootstrap with aggressive_if_conv forced to
true and --with-build-config=bootstrap-O3 --disable-werror?

Thanks,
Richard.

> Thanks.
> Yuri.
>
> 2014-12-19 14:45 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>> On Thu, Dec 18, 2014 at 2:45 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>> Richard,
>>>
>>> I am sending you full patch (~1000 lines) but if you need only patch.1
>>> and patch.2 will let me know and i'll send you reduced patch.
>>>
>>> Below are few comments regarding your remarks for patch.3.
>>>
>>> 1. I deleted sub-phase ifcvt_local_dce since I did not find test-case
>>> when dead code elimination is required to vectorize loop, i.e. dead
>>> statement is marked as relevant.
>>> 2. You wrote:
>>>> The "retry" code also looks odd - why do you walk the BB multiple
>>>> times instead of just doing sth like
>>>>
>>>>  while (!has_single_use (lhs))
>>>>    {
>>>>      gimple copy = ifcvt_split_def_stmt (def_stmt);
>>>>      ifcvt_walk_pattern_tree (copy);
>>>>    }
>>>>
>>>> thus returning the copy you create and re-process it (the copy should
>>>> now have a single-use).
>>>
>>> The problem is that not only top SSA_NAME (lhs) may have multiple uses
>>> but some intermediate variables too. For example, for the following
>>> test-case
>>>
>>> float a[1000];
>>> int c[1000];
>>>
>>> int foo()
>>> {
>>>   int i, res = 0;
>>> #pragma omp simd safelen(8)
>>>   for (i=0; i<512; i++)
>>>   {
>>>     float t = a[i];
>>>     if (t > 0.0f & t < 1.0e+17f)
>>>       if (c[i] != 0)
>>> res += 1;
>>>   }
>>>   return res;
>>> }
>>>
>>> After combine_blocks we have the following bb:
>>>
>>> <bb 3>:
>>> # res_15 = PHI <res_1(7), 0(15)>
>>> # i_16 = PHI <i_11(7), 0(15)>
>>> # ivtmp_14 = PHI <ivtmp_13(7), 512(15)>
>>> t_5 = a[i_16];
>>> _6 = t_5 > 0.0;
>>> _7 = t_5 < 9.9999998430674944e+16;
>>> _8 = _6 & _7;
>>> _10 = &c[i_16];
>>> _ifc__32 = _8 ? 4294967295 : 0;
>>> _9 = MASK_LOAD (_10, 0B, _ifc__32);
>>> _28 = _8;
>>> _29 = _9 != 0;
>>> _30 = _28 & _29;
>>> _ifc__31 = _30 ? 1 : 0;
>>> res_1 = res_15 + _ifc__31;
>>> i_11 = i_16 + 1;
>>> ivtmp_13 = ivtmp_14 - 1;
>>> if (ivtmp_13 != 0)
>>>   goto <bb 7>;
>>> else
>>>   goto <bb 8>;
>>>
>>> and we can see that _8 has multiple uses. Also note that after splitting of
>>> _8 = _6 & _7
>>> we also get multiple uses for definition of  _6 and _7. So I used this
>>> iterative algorithm as the simplest one.
>>
>> But it walks the entire pattern again and again while you only need to
>> ensure you walk the pattern tree of the now single-use DEF again
>> (in fact, rather than replacing a random USE in ifcvt_split_def_stmt
>> you should pass down the user_operand_p that you need to make
>> single-use).
>>
>>> I think it would be nice to re-use some utility from tree-vect-patterns.c
>>> for stmt_is_root_of_bool_pattern.
>>>
>>> I assume that function stmt_is_root_of_bool_pattern can be simplified
>>> to check on COND_EXPR only since PHI predication and memory access
>>> predication produced only such statements,i.e. it can look like
>>>
>>> static bool
>>> stmt_is_root_of_bool_pattern (gimple stmt, tree *var)
>>> {
>>>   enum tree_code code;
>>>   tree lhs, rhs;
>>>
>>>   code = gimple_assign_rhs_code (stmt);
>>>   if (code == COND_EXPR)
>>>     {
>>>       rhs = gimple_assign_rhs1 (stmt);
>>>       if (TREE_CODE (rhs) != SSA_NAME)
>>> return false;
>>>       *var = rhs;
>>>       return true;
>>>     }
>>>   return false;
>>> }
>>>
>>> I also did few minor changes in patch.2.
>>>
>>> 3. You can also notice that I inserted code in tree_if_conversion to
>>> do loop version if explicit option "-ftree-loop-if-convert" was not
>>> passed to compiler, i.e. we perform if-conversion for loop
>>> vectorization only and if it does not take place, we should delete
>>> if-converted version of loop.
>>> What is your opinion?
>>
>> Overall part 1 and part 2 look good to me, predicate_scalar_phi
>> looks in need of some refactoring to avoid duplicate code.  We can
>> do that a followup.
>>
>> Part 3 still needs the iteration to be resolved and make the use we
>> actually care about single-use, not a random one so we can avoid
>> iterating completely.
>>
>> Richard.
>>
>>> Thanks.
>>> Yuri.
>>>
>>> 2014-12-17 18:41 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>> On Tue, Dec 16, 2014 at 4:15 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> Here is updated patch which includes
>>>>> (1) split critical edges for aggressive if conversion.
>>>>> (2) delete all stuff related to support of critical edge predication.
>>>>> (3) only one function - predicate_scalar_phi performs predication.
>>>>> (4) function find_phi_replacement_condition was deleted since it was
>>>>> included in predicate_scalar_phi for phi with two arguments.
>>>>>
>>>>> I checked that patch works in stress testing mode, i.e. with
>>>>> aggressive if conversion by default.
>>>>>
>>>>> What is your opinion?
>>>>
>>>> Looks ok overall, but please simply do
>>>>
>>>>   FOR_EACH_EDGE (e, ei, bb->succs)
>>>>     if (EDGE_CRITICAL_P (e) && e->dest->loop_father == loop)
>>>>       split_edge (e);
>>>>
>>>> for all blocks apart from the latch.
>>>>
>>>> Can you please send a combined patch up to this one?  Looking at
>>>> the incremental diff is somewhat hard.  Thus a patch including all
>>>> patches from patch1 to this one.
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>>
>>>>> Thanks.
>>>>> Yuri.
>>>>>
>>>>> 2014-12-11 11:59 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>> On Wed, Dec 10, 2014 at 4:22 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>> Richard,
>>>>>>>
>>>>>>> Thanks for your reply!
>>>>>>>
>>>>>>> I didn't understand your point:
>>>>>>>
>>>>>>> Well, I don't mind splitting all critical edges unconditionally
>>>>>>>
>>>>>>> but you do it unconditionally in proposed patch.
>>>>>>
>>>>>> I don't mind means I am fine with it.
>>>>>>
>>>>>>> Also I assume that
>>>>>>> call of split_critical_edges() can break ssa. For example, we can
>>>>>>> split headers of loops, loop exit blocks etc.
>>>>>>
>>>>>> How does that "break SSA"?  You mean loop-closed SSA?  I'd
>>>>>> be surprised if so but that may be possible.
>>>>>>
>>>>>>> I prefer to do something
>>>>>>> more loop-specialized, e.g. call edge_split() for critical edges
>>>>>>> outgoing from bb ending with GIMPLE_COND stmt (assuming that edge
>>>>>>> destination bb belongs to loop).
>>>>>>
>>>>>> That works for me as well but it is more complicated to implement.
>>>>>> Ideally you'd only split one edge if you find a block with only critical
>>>>>> predecessors (where we'd currently give up).  But note that this
>>>>>> requires re-computation of ifc_bbs in if_convertible_loop_p_1 and it
>>>>>> will change loop->num_nodes so we have to be more careful in
>>>>>> constructing the loop calling if_convertible_bb_p.
>>>>>>
>>>>>> Richard.
>>>>>>
>>>>>>>
>>>>>>> 2014-12-10 17:31 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>> On Wed, Dec 10, 2014 at 11:54 AM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>> Richard,
>>>>>>>>>
>>>>>>>>> Sorry that I forgot to delete debug dump from my fix.
>>>>>>>>> I have few questions about your comments.
>>>>>>>>>
>>>>>>>>> 1. You wrote :
>>>>>>>>>> You also still have two functions for PHI predication.  And the
>>>>>>>>>> new extended variant doesn't commonize the 2-args and general
>>>>>>>>>> path
>>>>>>>>>  Did you mean that I must combine predicate_scalar_phi and
>>>>>>>>> predicate_extended scalar phi to one function?
>>>>>>>>> Please note that if additional flag was not set up (i.e.
>>>>>>>>> aggressive_if_conv is false) extended predication is required more
>>>>>>>>> compile time since it builds hash_map.
>>>>>>>>
>>>>>>>> It's compile-time complexity is reasonable enough even for
>>>>>>>> non-aggressive if-conversion.
>>>>>>>>
>>>>>>>>> 2. About critical edge splitting.
>>>>>>>>>
>>>>>>>>> Did you mean that we should perform it (1) under aggressive_if_conv
>>>>>>>>> option only; (2) should we split all critical edges.
>>>>>>>>> Note that this leads to recomputing of topological order.
>>>>>>>>
>>>>>>>> Well, I don't mind splitting all critical edges unconditionally, thus
>>>>>>>> do something like
>>>>>>>>
>>>>>>>> Index: gcc/tree-if-conv.c
>>>>>>>> ===================================================================
>>>>>>>> --- gcc/tree-if-conv.c  (revision 218515)
>>>>>>>> +++ gcc/tree-if-conv.c  (working copy)
>>>>>>>> @@ -2235,12 +2235,21 @@ pass_if_conversion::execute (function *f
>>>>>>>>    if (number_of_loops (fun) <= 1)
>>>>>>>>      return 0;
>>>>>>>>
>>>>>>>> +  bool critical_edges_split_p = false;
>>>>>>>>    FOR_EACH_LOOP (loop, 0)
>>>>>>>>      if (flag_tree_loop_if_convert == 1
>>>>>>>>         || flag_tree_loop_if_convert_stores == 1
>>>>>>>>         || ((flag_tree_loop_vectorize || loop->force_vectorize)
>>>>>>>>             && !loop->dont_vectorize))
>>>>>>>> -      todo |= tree_if_conversion (loop);
>>>>>>>> +      {
>>>>>>>> +       if (!critical_edges_split_p)
>>>>>>>> +         {
>>>>>>>> +           split_critical_edges ();
>>>>>>>> +           critical_edges_split_p = true;
>>>>>>>> +           todo |= TODO_cleanup_cfg;
>>>>>>>> +         }
>>>>>>>> +       todo |= tree_if_conversion (loop);
>>>>>>>> +      }
>>>>>>>>
>>>>>>>>  #ifdef ENABLE_CHECKING
>>>>>>>>    {
>>>>>>>>
>>>>>>>>> It is worth noting that in current implementation bb's with 2
>>>>>>>>> predecessors and both are on critical edges are accepted without
>>>>>>>>> additional option.
>>>>>>>>
>>>>>>>> Yes, I know.
>>>>>>>>
>>>>>>>> tree-if-conv.c is a mess right now and if we can avoid adding more
>>>>>>>> to it and even fix the critical edge missed optimization with splitting
>>>>>>>> critical edges then I am all for that solution.
>>>>>>>>
>>>>>>>> Richard.
>>>>>>>>
>>>>>>>>> Thanks ahead.
>>>>>>>>> Yuri.
>>>>>>>>> 2014-12-09 18:20 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>>>> On Tue, Dec 9, 2014 at 2:11 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>>>> Richard,
>>>>>>>>>>>
>>>>>>>>>>> Here is updated patch2 with the following changes:
>>>>>>>>>>> 1. Delete functions  phi_has_two_different_args and find_insertion_point.
>>>>>>>>>>> 2. Use only one function for extended predication -
>>>>>>>>>>> predicate_extended_scalar_phi.
>>>>>>>>>>> 3. Save gsi before insertion of predicate computations for basic
>>>>>>>>>>> blocks if it has 2 predecessors and
>>>>>>>>>>> both incoming edges are critical or it gas more than 2 predecessors
>>>>>>>>>>> and at least one incoming edge
>>>>>>>>>>> is critical. This saved iterator can be used by extended phi predication.
>>>>>>>>>>>
>>>>>>>>>>> Here is motivated test-case which explains this point.
>>>>>>>>>>> Test-case is attached (t5.c) and it must be compiled with -O2
>>>>>>>>>>> -ftree-loop-vectorize -fopenmp options.
>>>>>>>>>>> The problem phi is in bb-7:
>>>>>>>>>>>
>>>>>>>>>>>   bb_5 (preds = {bb_4 }, succs = {bb_7 bb_9 })
>>>>>>>>>>>   {
>>>>>>>>>>>     <bb 5>:
>>>>>>>>>>>     xmax_edge_18 = xmax_edge_36 + 1;
>>>>>>>>>>>     if (xmax_17 == xmax_27)
>>>>>>>>>>>       goto <bb 7>;
>>>>>>>>>>>     else
>>>>>>>>>>>       goto <bb 9>;
>>>>>>>>>>>
>>>>>>>>>>>   }
>>>>>>>>>>>   bb_6 (preds = {bb_4 }, succs = {bb_7 bb_8 })
>>>>>>>>>>>   {
>>>>>>>>>>>     <bb 6>:
>>>>>>>>>>>     if (xmax_17 == xmax_27)
>>>>>>>>>>>       goto <bb 7>;
>>>>>>>>>>>     else
>>>>>>>>>>>       goto <bb 8>;
>>>>>>>>>>>
>>>>>>>>>>>   }
>>>>>>>>>>>   bb_7 (preds = {bb_6 bb_5 }, succs = {bb_11 })
>>>>>>>>>>>   {
>>>>>>>>>>>     <bb 7>:
>>>>>>>>>>>     # xmax_edge_30 = PHI <xmax_edge_36(6), xmax_edge_18(5)>
>>>>>>>>>>>     xmax_edge_19 = xmax_edge_39 + 1;
>>>>>>>>>>>     goto <bb 11>;
>>>>>>>>>>>
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>> Note that both incoming edges to bb_7 are critical. If we comment out
>>>>>>>>>>> restoring gsi in predicate_all_scalar_phi:
>>>>>>>>>>> #if 0
>>>>>>>>>>>  if ((EDGE_COUNT (bb->preds) == 2 && all_preds_critical_p (bb))
>>>>>>>>>>>      || (EDGE_COUNT (bb->preds) > 2 && has_pred_critical_p (bb)))
>>>>>>>>>>>    gsi = bb_insert_point (bb);
>>>>>>>>>>>  else
>>>>>>>>>>> #endif
>>>>>>>>>>>    gsi = gsi_after_labels (bb);
>>>>>>>>>>>
>>>>>>>>>>> we will get ICE:
>>>>>>>>>>> t5.c: In function 'foo':
>>>>>>>>>>> t5.c:9:6: error: definition in block 4 follows the use
>>>>>>>>>>>  void foo (int n)
>>>>>>>>>>>       ^
>>>>>>>>>>> for SSA_NAME: _1 in statement:
>>>>>>>>>>> _52 = _1 & _3;
>>>>>>>>>>> t5.c:9:6: internal compiler error: verify_ssa failed
>>>>>>>>>>>
>>>>>>>>>>> smce predicate computations were inserted in bb_7.
>>>>>>>>>>
>>>>>>>>>> The issue is obviously that the predicates have already been emitted
>>>>>>>>>> in the target BB - that's of course the wrong place.  This is done
>>>>>>>>>> by insert_gimplified_predicates.
>>>>>>>>>>
>>>>>>>>>> This just shows how edge predicate handling is broken - we don't
>>>>>>>>>> seem to have a sequence of gimplified stmts for edge predicates
>>>>>>>>>> but push those to e->dest which makes this really messy.
>>>>>>>>>>
>>>>>>>>>> Rather than having a separate phase where we insert all
>>>>>>>>>> gimplified bb predicates we should do that on-demand when
>>>>>>>>>> predicating a PHI.
>>>>>>>>>>
>>>>>>>>>> Your patch writes to stderr - that's bad - use dump_file and guard
>>>>>>>>>> the printfs properly.
>>>>>>>>>>
>>>>>>>>>> You also still have two functions for PHI predication.  And the
>>>>>>>>>> new extended variant doesn't commonize the 2-args and general
>>>>>>>>>> paths.
>>>>>>>>>>
>>>>>>>>>> I'm not at all happy with this code.  It may be existing if-conv codes
>>>>>>>>>> fault but making it even worse is not an option.
>>>>>>>>>>
>>>>>>>>>> Again - what's wrong with simply splitting critical edges if
>>>>>>>>>> aggressive_if_conv?  I think that would very much simplify
>>>>>>>>>> things here.  Or alternatively use gsi_insert_on_edge and
>>>>>>>>>> commit edge insertions before merging the blocks.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Richard.
>>>>>>>>>>
>>>>>>>>>>> ChangeLog is
>>>>>>>>>>>
>>>>>>>>>>> 2014-12-09  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>>>>>>>>>>
>>>>>>>>>>> * tree-if-conv.c : Include hash-map.h.
>>>>>>>>>>> (struct bb_predicate_s): Add new field to save copy of gimple
>>>>>>>>>>> statement iterator.
>>>>>>>>>>> (bb_insert_point): New function.
>>>>>>>>>>> (set_bb_insert_point): New function.
>>>>>>>>>>> (has_pred_critical_p): New function.
>>>>>>>>>>> (if_convertible_bb_p): Allow bb has more than 2 predecessors if
>>>>>>>>>>> AGGRESSIVE_IF_CONV is true.
>>>>>>>>>>> (if_convertible_bb_p): Delete check that bb has at least one
>>>>>>>>>>> non-critical incoming edge.
>>>>>>>>>>> (is_cond_scalar_reduction): Add arguments ARG_0, ARG_1 and EXTENDED.
>>>>>>>>>>> Allow interchange PHI arguments if EXTENDED is false.
>>>>>>>>>>> Change check that block containing reduction statement candidate
>>>>>>>>>>> is predecessor of phi-block since phi may have more than two arguments.
>>>>>>>>>>> (predicate_scalar_phi): Add new arguments for call of
>>>>>>>>>>> is_cond_scalar_reduction.
>>>>>>>>>>> (get_predicate_for_edge): New function.
>>>>>>>>>>> (struct phi_args_hash_traits): New type.
>>>>>>>>>>> (phi_args_hash_traits::hash): New function.
>>>>>>>>>>> (phi_args_hash_traits::equal_keys): New function.
>>>>>>>>>>> (gen_phi_arg_condition): New function.
>>>>>>>>>>> (predicate_extended_scalar_phi): New function.
>>>>>>>>>>> (predicate_all_scalar_phis): Add boolean variable EXTENDED and set it
>>>>>>>>>>> to true if BB containing phi has more than 2 predecessors or both
>>>>>>>>>>> incoming edges are critical. Invoke find_phi_replacement_condition and
>>>>>>>>>>> predicate_scalar_phi if EXTENDED is false. Use saved gsi if BB
>>>>>>>>>>> has 2 predecessors and both incoming edges are critical or it has more
>>>>>>>>>>> than 2 predecessors and atleast one incoming edge is critical.
>>>>>>>>>>> Use standard gsi_after_labels otherwise.
>>>>>>>>>>> Invoke predicate_extended_scalar_phi if EXTENDED is true.
>>>>>>>>>>> (insert_gimplified_predicates): Add bool variable EXTENDED_PREDICATION
>>>>>>>>>>> to save gsi before insertion of predicate computations. SEt-up it to
>>>>>>>>>>> true for BB with 2 predecessors and critical incoming edges either
>>>>>>>>>>>         number of predecessors is geater 2 and at least one incoming edge is
>>>>>>>>>>> critical.
>>>>>>>>>>> Add check that non-predicated block may have statements to insert.
>>>>>>>>>>> Insert predicate computation of BB just after label if
>>>>>>>>>>> EXTENDED_PREDICATION is true.
>>>>>>>>>>> (tree_if_conversion): Add initialization of AGGRESSIVE_IF_CONV which
>>>>>>>>>>> is copy of inner or outer loop force_vectorize field.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-12-04 16:37 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>>>>>> On Thu, Dec 4, 2014 at 2:15 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>>>>>> Richard,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did simple change by saving gsi iterator for each bb that has
>>>>>>>>>>>>> critical edges by adding additional field to bb_predicate_s:
>>>>>>>>>>>>>
>>>>>>>>>>>>> typedef struct bb_predicate_s {
>>>>>>>>>>>>>
>>>>>>>>>>>>>   /* The condition under which this basic block is executed.  */
>>>>>>>>>>>>>   tree predicate;
>>>>>>>>>>>>>
>>>>>>>>>>>>>   /* PREDICATE is gimplified, and the sequence of statements is
>>>>>>>>>>>>>      recorded here, in order to avoid the duplication of computations
>>>>>>>>>>>>>      that occur in previous conditions.  See PR44483.  */
>>>>>>>>>>>>>   gimple_seq predicate_gimplified_stmts;
>>>>>>>>>>>>>
>>>>>>>>>>>>>   /* Insertion point for blocks having incoming critical edges.  */
>>>>>>>>>>>>>   gimple_stmt_iterator gsi;
>>>>>>>>>>>>> } *bb_predicate_p;
>>>>>>>>>>>>>
>>>>>>>>>>>>> and this iterator is saved in  insert_gimplified_predicates before
>>>>>>>>>>>>> insertion code for predicate computation. I checked that this fix
>>>>>>>>>>>>> works.
>>>>>>>>>>>>
>>>>>>>>>>>> Huh?  I still wonder what the issue is with inserting everything
>>>>>>>>>>>> after the PHI we predicate.
>>>>>>>>>>>>
>>>>>>>>>>>> Well, your updated patch will come with testcases for the testsuite
>>>>>>>>>>>> that will hopefully fail if doing that.
>>>>>>>>>>>>
>>>>>>>>>>>> Richard.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I am implementing merging of predicate_extended.. and
>>>>>>>>>>>>> predicate_arbitrary.. functions as you proposed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards.
>>>>>>>>>>>>> Yuri.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2014-12-04 15:41 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>>>>>>>> On Tue, Dec 2, 2014 at 4:28 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>>>>>>>> Thanks Richard for your quick reply!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. I agree that we can combine predicate_extended_ and
>>>>>>>>>>>>>>> predicate_arbitrary_ to one function as you proposed.
>>>>>>>>>>>>>>> 2. What is your opinion about using more simple decision about
>>>>>>>>>>>>>>> insertion point - if bb has use of phi result insert phi predication
>>>>>>>>>>>>>>> before it and at the bb end otherwise. I assume that critical edge
>>>>>>>>>>>>>>> splitting is not a good decision.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why not always insert before the use?  Which would be after labels,
>>>>>>>>>>>>>> what we do for two-arg PHIs.  That is, how can it be that you predicate
>>>>>>>>>>>>>> a PHI in BB1 and then for an edge predicate on one of its incoming
>>>>>>>>>>>>>> edges you get SSA uses with defs that are in BB1 itself?  That
>>>>>>>>>>>>>> can only happen for backedges but those you can't remove in any case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Richard.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards.
>>>>>>>>>>>>>>> Yuri.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2014-12-02 16:28 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>>>>>>>>>> On Mon, Dec 1, 2014 at 4:53 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I resend you patch1 and patch2 with minor changes:
>>>>>>>>>>>>>>>>> 1. I renamed flag_force_vectorize to aggressive_if_conv.
>>>>>>>>>>>>>>>>> 2. Use static cast for the first argument of gimple_phi_arg_edge.
>>>>>>>>>>>>>>>>> I also very sorry that I sent you bad patch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now let me answer on your questions related to second patch.
>>>>>>>>>>>>>>>>> 1. Why we need both predicate_extended_scalar_phi and
>>>>>>>>>>>>>>>>> predicate_arbitrary_scalar_phi?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let's consider the following simple test-case:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   #pragma omp simd safelen(8)
>>>>>>>>>>>>>>>>>   for (i=0; i<512; i++)
>>>>>>>>>>>>>>>>>   {
>>>>>>>>>>>>>>>>>     float t = a[i];
>>>>>>>>>>>>>>>>>     if (t > 0.0f & t < 1.0e+17f)
>>>>>>>>>>>>>>>>>       if (c[i] != 0)  /* c is integer array. */
>>>>>>>>>>>>>>>>> res += 1;
>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> we can see the following phi node correspondent to res:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # res_1 = PHI <res_15(3), res_15(4), res_10(5)>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It is clear that we can optimize it to phi node with 2 arguments only
>>>>>>>>>>>>>>>>> and only one check can be used for phi predication (for reduction in
>>>>>>>>>>>>>>>>> our case), namely predicate of bb_5. In general case we can't do it
>>>>>>>>>>>>>>>>> even if we sort all phi argument values since we still have to produce
>>>>>>>>>>>>>>>>> a chain of cond expressions to perform phi predication (see comments
>>>>>>>>>>>>>>>>> for predicate_arbitrary_scalar_phi).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How so?  We can always use !(condition) for the "last" value, thus
>>>>>>>>>>>>>>>> treat it as an 'else' case.  That even works for
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # res_1 = PHI <res_15(3), res_15(4), res_10(5), res_10(7)>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> where the condition for edges 5 and 7 can be computed as
>>>>>>>>>>>>>>>> ! (condition for 3 || condition for 4).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Of course it is worthwhile to also sort single-occurances first
>>>>>>>>>>>>>>>> so your case gets just the condiiton for edge 5 and its inversion
>>>>>>>>>>>>>>>> used for edges 3 and 4 combined.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2. Why we need to introduce find_insertion_point?
>>>>>>>>>>>>>>>>>  Let's consider another test-case extracted from 175.vpr ( t5.c is
>>>>>>>>>>>>>>>>> attached) and we can see that bb_7 and bb_9 containig phi nodes has
>>>>>>>>>>>>>>>>> only critical incoming edges and both contain code computing edge
>>>>>>>>>>>>>>>>> predicates, e.g.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <bb 7>:
>>>>>>>>>>>>>>>>> # xmax_edge_30 = PHI <xmax_edge_36(6), xmax_edge_18(5)>
>>>>>>>>>>>>>>>>> _46 = xmax_17 == xmax_37;
>>>>>>>>>>>>>>>>> _47 = xmax_17 == xmax_27;
>>>>>>>>>>>>>>>>> _48 = _46 & _47;
>>>>>>>>>>>>>>>>> _53 = xmax_17 == xmax_37;
>>>>>>>>>>>>>>>>> _54 = ~_53;
>>>>>>>>>>>>>>>>> _55 = xmax_17 == xmax_27;
>>>>>>>>>>>>>>>>> _56 = _54 & _55;
>>>>>>>>>>>>>>>>> _57 = _48 | _56;
>>>>>>>>>>>>>>>>> xmax_edge_19 = xmax_edge_39 + 1;
>>>>>>>>>>>>>>>>> goto <bb 11>;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It is evident that we can not put phi predication at the block
>>>>>>>>>>>>>>>>> beginning but need to put it after predicate computations.
>>>>>>>>>>>>>>>>> Note also that if there are no critical edges for phi arguments
>>>>>>>>>>>>>>>>> insertion point will be "after labels" Note also that phi result can
>>>>>>>>>>>>>>>>> have use in this block too, so we can't put predication code to the
>>>>>>>>>>>>>>>>> block end.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So the issue is that predicate insertion for edge predicates does
>>>>>>>>>>>>>>>> not happen on the edge but somewhere else (generally impossible
>>>>>>>>>>>>>>>> for critical edges unless you split them).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think I've told you before that I prefer simple solutions to such issues,
>>>>>>>>>>>>>>>> like splitting the edge!  Certainly not involving a function walking
>>>>>>>>>>>>>>>> GENERIC expressions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Richard.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let me know if you still have any questions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best regards.
>>>>>>>>>>>>>>>>> Yuri.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2014-11-28 15:43 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
>>>>>>>>>>>>>>>>>> On Wed, Nov 12, 2014 at 2:35 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here is the second patch related to extended predication.
>>>>>>>>>>>>>>>>>>> Few comments which explain a main goal of design.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. I don't want to insert any critical edge splitting since it may
>>>>>>>>>>>>>>>>>>> lead to less efficient binaries.
>>>>>>>>>>>>>>>>>>> 2. One special case of extended PHI node predication was introduced
>>>>>>>>>>>>>>>>>>> when #arguments is more than 2 but only two arguments are different
>>>>>>>>>>>>>>>>>>> and one argument has the only occurrence. For such PHI conditional
>>>>>>>>>>>>>>>>>>> scalar reduction is applied.
>>>>>>>>>>>>>>>>>>> This is correspondent to the following statement:
>>>>>>>>>>>>>>>>>>>     if (q1 && q2 && q3) var++
>>>>>>>>>>>>>>>>>>>  New function phi_has_two_different_args was introduced to detect such phi.
>>>>>>>>>>>>>>>>>>> 3. Original algorithm for PHI predication used assumption that at
>>>>>>>>>>>>>>>>>>> least one incoming edge for blocks containing PHI is not critical - it
>>>>>>>>>>>>>>>>>>> guarantees that all computations related to predicate of normal edge
>>>>>>>>>>>>>>>>>>> are already inserted above this block and
>>>>>>>>>>>>>>>>>>> code related to PHI predication can be inserted at the beginning of
>>>>>>>>>>>>>>>>>>> block. But this is not true for critical edges for which predicate
>>>>>>>>>>>>>>>>>>> computations are  in the block where code for phi predication must be
>>>>>>>>>>>>>>>>>>> inserted. So new function find_insertion_point is introduced which is
>>>>>>>>>>>>>>>>>>> simply found out the last statement in block defining predicates
>>>>>>>>>>>>>>>>>>> correspondent to all incoming edges and insert phi predication code
>>>>>>>>>>>>>>>>>>> after it (with some minor exceptions).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Unfortunately the patch doesn't apply for me - I get
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> patch: **** malformed patch at line 505: @@ -1720,6 +2075,8 @@
>>>>>>>>>>>>>>>>>> predicate_all_scalar_phis (struct loop *loop)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> a few remarks nevertheless.  I don't see how we need both
>>>>>>>>>>>>>>>>>> predicate_extended_scalar_phi and predicate_arbitrary_scalar_phi.
>>>>>>>>>>>>>>>>>> Couldn't we simply sort an array of (edge, value) pairs after value
>>>>>>>>>>>>>>>>>> and handle equal values specially in predicate_extended_scalar_phi?
>>>>>>>>>>>>>>>>>> That would even make PHI <a, a, b, c, c> more optimal.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't understand the need for find_insertion_point.  All SSA names
>>>>>>>>>>>>>>>>>> required for the predicates are defined upward - and the complex CFG
>>>>>>>>>>>>>>>>>> is squashed to a single basic-block, thus the defs will dominate the
>>>>>>>>>>>>>>>>>> inserted code if you insert after labels just like for the other case.
>>>>>>>>>>>>>>>>>> Or what am I missing?  ("flattening" of the basic-blocks of course needs
>>>>>>>>>>>>>>>>>> to happen in dominator order - but I guess that happens already?)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd like the extended PHI handling to be enablable by a flag even
>>>>>>>>>>>>>>>>>> for !force-vectorization - I've seen cases with 3 PHI args multiple
>>>>>>>>>>>>>>>>>> times that would have been nice to vectorize.  I suggest to
>>>>>>>>>>>>>>>>>> add -ftree-loop-if-convert-aggressive for this.  We can do this as
>>>>>>>>>>>>>>>>>> followup, but please rename the local flag_force_vectorize flag
>>>>>>>>>>>>>>>>>> to something less looking like a flag, like simply 'aggressive'.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Otherwise patch 2 looks ok to me.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Richard.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ChangeLog:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2014-10-24  Yuri Rumyantsev  <ysrumyan@gmail.com>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> * tree-if-conv.c (ifcvt_can_use_mask_load_store): Use
>>>>>>>>>>>>>>>>>>> FLAG_FORCE_VECTORIZE instead of loop flag.
>>>>>>>>>>>>>>>>>>> (if_convertible_bb_p): Allow bb has more than 2 predecessors if
>>>>>>>>>>>>>>>>>>> FLAG_FORCE_VECTORIZE is true.
>>>>>>>>>>>>>>>>>>> (if_convertible_bb_p): Delete check that bb has at least one
>>>>>>>>>>>>>>>>>>> non-critical incoming edge.
>>>>>>>>>>>>>>>>>>> (phi_has_two_different_args): New function.
>>>>>>>>>>>>>>>>>>> (is_cond_scalar_reduction): Add argument EXTENDED to choose access
>>>>>>>>>>>>>>>>>>> to phi arguments. Invoke phi_has_two_different_args to get phi
>>>>>>>>>>>>>>>>>>> arguments if EXTENDED is true. Change check that block
>>>>>>>>>>>>>>>>>>> containing reduction statement candidate is predecessor
>>>>>>>>>>>>>>>>>>> of phi-block since phi may have more than two arguments.
>>>>>>>>>>>>>>>>>>> (convert_scalar_cond_reduction): Add argument BEFORE to insert
>>>>>>>>>>>>>>>>>>> statement before/after gsi point.
>>>>>>>>>>>>>>>>>>> (predicate_scalar_phi): Add argument false (which means non-extended
>>>>>>>>>>>>>>>>>>> predication) to call of is_cond_scalar_reduction. Add argument
>>>>>>>>>>>>>>>>>>> true (which correspondent to argument BEFORE) to call of
>>>>>>>>>>>>>>>>>>> convert_scalar_cond_reduction.
>>>>>>>>>>>>>>>>>>> (get_predicate_for_edge): New function.
>>>>>>>>>>>>>>>>>>> (predicate_arbitrary_scalar_phi): New function.
>>>>>>>>>>>>>>>>>>> (predicate_extended_scalar_phi): New function.
>>>>>>>>>>>>>>>>>>> (find_insertion_point): New function.
>>>>>>>>>>>>>>>>>>> (predicate_all_scalar_phis): Add two boolean variables EXTENDED and
>>>>>>>>>>>>>>>>>>> BEFORE. Initialize EXTENDED to true if BB containing phi has more
>>>>>>>>>>>>>>>>>>> than 2 predecessors or both incoming edges are critical. Invoke
>>>>>>>>>>>>>>>>>>> find_phi_replacement_condition and predicate_scalar_phi or
>>>>>>>>>>>>>>>>>>> find_insertion_point and predicate_extended_scalar_phi depending on
>>>>>>>>>>>>>>>>>>> EXTENDED value.
>>>>>>>>>>>>>>>>>>> (insert_gimplified_predicates): Add check that non-predicated block
>>>>>>>>>>>>>>>>>>> may have statements to insert. Insert predicate of BB just after label
>>>>>>>>>>>>>>>>>>> if FLAG_FORCE_VECTORIZE is true.
>>>>>>>>>>>>>>>>>>> (tree_if_conversion): Add initialization of FLAG_FORCE_VECTORIZE which
>>>>>>>>>>>>>>>>>>> is copy of inner or outer loop field force_vectorize.