[RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE

Jeff Law law@redhat.com
Fri May 29 15:54:00 GMT 2015


On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote:
> I have designed and implemented path splitting of loops with conditional IF-THEN-ELSE using the following design.
> The implementation has gone through a bootstrap for the Microblaze target along with the DejaGnu regression tests and
> runs of the MiBench/EEMBC benchmarks. No regressions are seen in the DejaGnu tests, and the DejaGnu C++ test results are
> better with the path splitting changes (fewer failures). The MiBench/EEMBC benchmarks show no performance degradation;
> a performance improvement of 9.15% is observed in telcom_gsm (MiBench) and of 9.4% in rgbcmy01_lite (EEMBC).
>
> A proposal along these lines for path splitting was made earlier, and the reference is attached.
The first thing I notice is you haven't included any tests for the new 
optimization.  We're trying really hard these days to have tests in the 
suite for this kind of new optimization/feature work.

I don't offhand know if any of the benchmarks you cite above are 
free-enough to derive a testcase from.  But one trick many of us use is 
to instrument the pass and compile some known free software (often gcc 
itself) to find triggering code and use that to generate tests for the 
new transformation.


>
> Design changes.
>
> 1. The dominators of the block containing the conditional IF statement, say BB1, are found, and the join node of the IF-THEN-ELSE
> inside the loop is identified as the block that is dominated by BB1 but is not a successor of BB1.
This isn't necessarily my area of strength, but it doesn't seem right to 
me.  I guess it would work if there aren't any more nodes in the loop 
after the join block, except for the loop latch, but it doesn't seem 
right in the general case.  It might work if you also verify that the IF 
block is the immediate dominator of the potential join node.
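
If it helps, a cheap way to encode that extra check would be something 
like the sketch below; the helper name is mine, not something in the 
patch, and it assumes the dominator info is current.

  /* Sketch: accept JOIN as a candidate only if COND_BB (the block ending
     in the conditional) is its immediate dominator and both arms reach
     JOIN directly.  */
  static bool
  join_candidate_p (basic_block cond_bb, basic_block join)
  {
    if (EDGE_COUNT (join->preds) != 2)
      return false;
    return get_immediate_dominator (CDI_DOMINATORS, join) == cond_bb;
  }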



>
> 2. The join node is the same as the source of the loop's latch basic block.
OK.  This is why your initial algorithm in #1 works.

>
> 3. If the conditions in (1) and (2) hold, the join node (the source of the loop latch node) is moved into its
> predecessors, and the join node is left as an empty statement block containing only
> the phi nodes.
Hopefully you do this by duplicating the join node and manipulating the 
CFG so that the original predecessors of the join each go to their own 
copy of the join node.  The copied join nodes then unconditionally pass 
control to the original join node.
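
In GCC terms, my reading of that transformation looks roughly like the 
sketch below.  It is illustrative only (the function name is mine); it 
leans on duplicate_block from cfghooks.c and leaves the phi arguments on 
the new edges into the join, and the SSA form inside the copies, to be 
fixed up afterwards.

  static void
  split_paths_through_join (basic_block join)
  {
    edge e;
    edge_iterator ei;
    unsigned i;

    /* Snapshot the non-latch predecessor edges; duplicate_block
       redirects edges as we go.  */
    auto_vec<edge> preds;
    FOR_EACH_EDGE (e, ei, join->preds)
      if (!(e->flags & EDGE_DFS_BACK))
        preds.safe_push (e);

    FOR_EACH_VEC_ELT (preds, i, e)
      {
        /* Copy JOIN's statements, recreate its outgoing edges and
           redirect E to the copy.  */
        basic_block copy = duplicate_block (join, e, e->src);

        /* Drop the copied successor edges; the copy falls through to
           the original join, which keeps only the phis.  */
        while (EDGE_COUNT (copy->succs) != 0)
          remove_edge (EDGE_SUCC (copy, 0));
        make_edge (copy, join, EDGE_FALLTHRU);
      }
  }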


>
> 4. While moving the join node into its predecessors, each phi result in the join node is replaced with the
> corresponding phi argument, chosen according to which predecessor the copy of the join came from.
Right.  Note there's a phi-only cprop pass which is designed to do this 
propagation and in theory should be very fast.
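
To make that concrete, here's a hand-written (not dumped from a real 
compile) GIMPLE-level illustration of the propagation in the copy of the 
join that is only reachable from, say, the THEN arm:

    # x_3 = PHI <x_1(THEN), x_2(ELSE)>    <- copied phi, now degenerate
    a_4 = x_3 + 1;

becomes, once the degenerate phi is propagated:

    a_4 = x_1 + 1;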

> 5. The dominator info is updated after performing steps (1) through (4).
Right.  Getting the sequencing of the various updates right can be 
tricky.  Even more so if you try to do it as a part of jump threading, 
because you have to (for example) deal with unreachable blocks in the CFG.
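
When the incremental updates get too hard to sequence, the coarse but 
safe fallback is simply to rebuild the info, e.g.:

    free_dominance_info (CDI_DOMINATORS);
    calculate_dominance_info (CDI_DOMINATORS);

That's much easier to reach for in a standalone pass than inside the 
threader.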


>
> 6. The join node, which is the same as the source of the loop latch node, is made empty, containing only its phi nodes.
Right.

>
> 7. The implementation is done in the jump threading phase of the machine-independent optimizations on the tree-based
> representation. Path splitting is called after the jump threading optimization is performed.
> The following files are modified:
>
>      a) tree-vrp.c
>      b) tree-ssa-threadedge.c
>      c) tree-cfg.c
>      d) tree-ssa-threadedge.h
>      e) tree-cfg.h
>      f) cfghooks.c
>
> The diff is attached along with this mail and pasted below.
I would recommend this as a separate pass.  One of the goals for GCC 6 
is to break out the threader into its own pass; if you're piggybacking 
on the threader, it just makes this task harder.

Also by implementing this as a distinct pass you have more freedom to 
put it where it really belongs in the optimization pipeline.  Given the 
duplication may end up exposing partial redundancies, dead code & 
propagation opportunities, it's probably best placed before DOM.  That 
also eliminates the need for running phi-only cprop because DOM will 
take care of it for you.

When jump threading is broken out into its own distinct pass, it'll 
probably end up just before or just after your path splitting pass.
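
For reference, the skeleton of a standalone GIMPLE pass looks roughly 
like this; the pass name, its placement and the tuning of the pass_data 
fields here are all hypothetical, just to show the shape:

  namespace {

  const pass_data pass_data_path_split =
  {
    GIMPLE_PASS, /* type */
    "path-split", /* name (hypothetical) */
    OPTGROUP_NONE, /* optinfo_flags */
    TV_NONE, /* tv_id */
    PROP_ssa | PROP_cfg, /* properties_required */
    0, /* properties_provided */
    0, /* properties_destroyed */
    0, /* todo_flags_start */
    TODO_update_ssa, /* todo_flags_finish */
  };

  class pass_path_split : public gimple_opt_pass
  {
  public:
    pass_path_split (gcc::context *ctxt)
      : gimple_opt_pass (pass_data_path_split, ctxt)
    {}

    virtual unsigned int execute (function *fun)
    {
      /* Walk the blocks of FUN, find candidate joins and split the
         paths through them; details elided.  */
      return 0;
    }
  };

  } // anonymous namespace

  gimple_opt_pass *
  make_pass_path_split (gcc::context *ctxt)
  {
    return new pass_path_split (ctxt);
  }

plus a declaration in tree-pass.h and a NEXT_PASS (pass_path_split) 
entry in passes.def, placed just before pass_dominator per the above.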


>
> Please share your thoughts and feedbacks on the above optimization and the design and the coding and implementation
> done for the given above design.
Some general notes.  Every function should have a block comment which 
describes what the function does, its arguments & return value.  There 
are many examples throughout the code you can use to understand what 
we're looking for in those function comments.

Those comments are particularly useful during initial review and later 
maintenance, so in the future, please include them, even on RFCs.
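
For instance, something along these lines on the entry point (the 
wording is just an example):

  /* Look in the loop containing BB, which ends with a GIMPLE_COND, for a
     join block of the IF-THEN-ELSE that is also the source of the loop
     latch; if found, duplicate the join into each arm so that each path
     to the latch is distinct, leaving only the phis in the original
     join.  */
  void
  process_path_splitting (basic_block bb)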

You'll need to write a ChangeLog entry.  It's not particularly exciting 
or fun to do, but I regularly find that writing the ChangeLog forces me 
to look at each hunk in the patch and re-justify its purpose.  It's 
common that in doing so I rework the patch because I see something wrong 
or a cleaner/better way to solve a particular problem.
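
For this patch the entries would look something like the following 
(based on the hunks below; the exact list will depend on the final shape 
of the code):

	* cfghooks.c (delete_basic_block): Guard against a NULL loop.
	* tree-cfg.c (gimple_threaded_merge_blocks): New function.
	(remove_bb): Guard against a NULL loop.
	(gimple_duplicate_bb, add_phi_args_after_copy_edge): No longer static.
	* tree-cfg.h: Declare them.
	* tree-ssa-threadedge.c (replace_threaded_uses,
	is_def_stmt_assert_expr, process_path_splitting): New functions.
	* tree-ssa-threadedge.h (process_path_splitting): Declare.
	* tree-vrp.c (identify_jump_threads): Call process_path_splitting.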


>
> diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c
> index 9faa339..559ca96 100644
> --- a/gcc/cfghooks.c
> +++ b/gcc/cfghooks.c
> @@ -581,7 +581,7 @@ delete_basic_block (basic_block bb)
>
>         /* If we remove the header or the latch of a loop, mark the loop for
>   	 removal.  */
> -      if (loop->latch == bb
> +      if (loop && loop->latch == bb
>   	  || loop->header == bb)
>   	mark_loop_for_removal (loop);
So what caused you to add this additional test?  In general loop 
structures are supposed to always be available.  The change here implies 
that the loop structures were gone at some point.  That seems at first 
glance a mistake.


>
> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
> index aed5254..b25e409 100644
> --- a/gcc/tree-cfg.c
> +++ b/gcc/tree-cfg.c
> @@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val)
>       }
>   }
>
> +void
> +gimple_threaded_merge_blocks (basic_block a, basic_block b)
If we keep this function it will need a block comment.

I say "if" for a couple reasons.  First we already have support routines 
that know how to merge blocks.  If you really need to merge blocks you 
should try to use them.

Second, I'm not sure that you really need to worry about block merging 
in this pass.  Just create the duplicates, wire them into the CFG and 
let the existing block merging support handle this problem.
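
If a merge really is needed, the generic hook in cfghooks.c already does 
the work, e.g.:

    if (can_merge_blocks_p (a, b))
      merge_blocks (a, b);

which moves B's statements into A and removes B, rather than open-coding 
it in gimple_threaded_merge_blocks.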



>
>   static void
> @@ -2061,7 +2119,7 @@ remove_bb (basic_block bb)
>
>         /* If a loop gets removed, clean up the information associated
>   	 with it.  */
> -      if (loop->latch == bb
> +      if (loop && loop->latch == bb
Same concern here as earlier, why don't you have the loop structures 
available when this routine gets called?


>   	  || loop->header == bb)
>   	free_numbers_of_iterations_estimates_loop (loop);
>       }
> @@ -5779,7 +5837,7 @@ gimple_can_duplicate_bb_p (const_basic_block bb ATTRIBUTE_UNUSED)
>   /* Create a duplicate of the basic block BB.  NOTE: This does not
>      preserve SSA form.  */
>
> -static basic_block
> +basic_block
>   gimple_duplicate_bb (basic_block bb)
There's already interfaces for duplicating blocks that you can/should be 
using.  In particular "duplicate_block" or "gimple_duplicate_sese_region".

>   {
>     basic_block new_bb;
> @@ -5858,7 +5916,7 @@ gimple_duplicate_bb (basic_block bb)
>
>   /* Adds phi node arguments for edge E_COPY after basic block duplication.  */
>
> -static void
> +void
>   add_phi_args_after_copy_edge (edge e_copy)
I don't think you'll need this if you use the existing interfaces for 
block duplication.
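
For reference, the existing interface in cfghooks.c does the copying and 
the edge redirection in one call, roughly:

    /* Copy BB1, place the copy after SRC1 and redirect SRC1's edge to
       the copy; the copy's outgoing edges are recreated as well.  The
       phi arguments on those new edges still need filling in
       afterwards.  */
    edge e = find_edge (src1, bb1);
    basic_block copy = duplicate_block (bb1, e, src1);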


> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
> index 4303a18..2c7d36d 100644
> --- a/gcc/tree-ssa-threadedge.c
> +++ b/gcc/tree-ssa-threadedge.c
> @@ -1359,6 +1359,322 @@ thread_through_normal_block (edge e,
>     return 0;
>   }
>
> +static void
> +replace_threaded_uses (basic_block a,basic_block b)
If you keep this function, then it'll need a function comment.

It looks like this is just doing const/copy propagation.  I think a 
better structure is to implement your optimization as a distinct pass, 
then rely on existing passes such as update_ssa, DOM, CCP to handle 
updating the SSA graph and propagation opportunities exposed by your 
transformation.


Similarly for the other replace_ functions.
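
That is, once the duplicate blocks are wired in, marking the affected 
names and letting the standard machinery run is usually enough, e.g.:

    update_ssa (TODO_update_ssa);

or, in a standalone pass, just return TODO_update_ssa from the execute 
hook and let the following DOM/CCP passes pick up the propagation 
opportunities.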

> +
> +static bool
> +is_def_stmt_assert_expr (basic_block src1, basic_block bb,
> +                        int pred_index)
As a distinct pass I don't think you'd need to handle ASSERT_EXPRs as 
they're removed after VRP has completed.  The only reason you're running 
into them is that VRP runs the jump threader as a subroutine.

And it's odd that you also check for MIN_EXPR; I don't see how that's 
useful.  And yet you don't check for MAX_EXPR, so something definitely 
seems odd here.

> +void
> +process_path_splitting (basic_block bb)
Function comment.  This seems to be the meat of the transformation.



> +{
> +  vec<basic_block> bbs;
> +  basic_block bb1;
> +  unsigned int i;
> +  edge_iterator ei;
> +  edge e1;
> +  bool found = false ,found1;
> +
> +  bbs = get_all_dominated_blocks (CDI_DOMINATORS,
> +                                  bb );
> +  FOR_EACH_VEC_ELT (bbs, i, bb1)
> +  {
> +    found1 = false;
> +    FOR_EACH_EDGE (e1, ei, bb->succs)
> +    {
> +      if (bb1 == e1->dest)
> +        {
> +          found = true;
> +          found1 = true;
> +        }
> +    }
> +    if (!found1 && found)
> +      {
> +        unsigned int num_preds = 0;
> +        found = false;
> +
> +        FOR_EACH_EDGE (e1, ei, bb1->succs)
> +        {
> +          if (e1->flags & (EDGE_DFS_BACK))
> +            found = true;
> +        }
> +
> +        if (found && EDGE_COUNT(bb1->preds) == 2)
> +          {
> +            basic_block temp_bb = bb1;
> +            basic_block src1,src2;
> +            unsigned int k = 0, num_stmt = 0, num_phis = 0;
> +            bool is_src1 = false, is_src2 = false;
> +            gimple_stmt_iterator  psi,gsi;
> +
> +            FOR_EACH_EDGE (e1, ei, bb1->preds)
> +            {
> +              if ((e1->flags & (EDGE_DFS_BACK)))
> +                continue;
> +              if (k == 1)
> +                {
> +                  if (single_succ_p(e1->src) &&
> +                      single_succ_edge (e1->src)->flags & EDGE_FALLTHRU)
> +                    {
> +                      src2 = e1->src;
> +                      is_src2 = true;
> +                    }
> +                }
> +                else
> +                  {
> +                    if (single_succ_p(e1->src) &&
> +                        single_succ_edge (e1->src)->flags & EDGE_FALLTHRU)
> +                      {
> +                        is_src1 = true;
> +                        src1 = e1->src;
> +                      }
> +                  }
> +                k++;
> +           }
> +          for (gsi = gsi_start_bb (bb1); !gsi_end_p (gsi); gsi_next (&gsi))
> +            num_stmt++;
> +
> +          if ((num_stmt > 1) && is_src1 && is_src2)
> +            {
> +              bool found_non_virtual_result = false;
> +
> +              for (psi = gsi_start_phis (bb1); !gsi_end_p (psi); )
> +                 {
> +                   gimple stmt = gsi_stmt (psi);
> +
> +                   if (!virtual_operand_p (gimple_phi_result (stmt)))
> +                     num_phis++;
> +                   else
> +                     found_non_virtual_result = true;
> +
> +                   gsi_next(&psi);
> +                }
> +
> +              if ((num_phis >1) || found_non_virtual_result)
> +                return;
> +
> +              if (!(is_def_stmt_assert_expr (src1, bb1, 1) &&
> +                  is_def_stmt_assert_expr (src2, bb1, 2)))
> +                return;
So up to this point it's just identification of candidates.  Seems like 
it ought to factor into its own function.

And from here to the end of the function is the actual transformation, 
which feels like it should factor into its own function as well.
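
Structurally, something like the split below (the names and the trivial 
driver are purely illustrative):

  static basic_block
  find_path_split_candidate (basic_block bb)
  {
    /* ... the dominated-block walk, EDGE_DFS_BACK test, two-predecessor
       and phi checks from above ... */
    return NULL;
  }

  static void
  perform_path_split (basic_block join)
  {
    /* ... the duplication, edge rewiring, phi propagation and dominator
       update below ... */
  }

  void
  process_path_splitting (basic_block bb)
  {
    if (basic_block join = find_path_split_candidate (bb))
      perform_path_split (join);
  }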

> +
> +              temp_bb = gimple_duplicate_bb (bb1);
> +
> +              replace_threaded_uses (bb1, temp_bb);
> +
> +              make_edge (src1, temp_bb, EDGE_FALLTHRU);
> +
> +              make_edge (src2,temp_bb,EDGE_FALLTHRU);
> +
> +              FOR_EACH_EDGE (e1, ei, temp_bb->preds)
> +                add_phi_args_after_copy_edge (e1);
> +
> +              replace_phis(temp_bb, bb1);
> +
> +              replace_uses_phi(bb1, 1);
> +
> +              gimple_threaded_merge_blocks (src1, bb1);
> +
> +              replace_uses_phi(temp_bb, 2);
> +
> +              gimple_threaded_merge_blocks (src2, temp_bb);

> +
> +              while (EDGE_COUNT (temp_bb->preds) != 0)
> +                remove_edge (EDGE_PRED (temp_bb, 0));
> +
> +              while (EDGE_COUNT (temp_bb->succs) != 0)
> +                remove_edge (EDGE_SUCC (temp_bb, 0));
> +
> +              if (dom_info_available_p (CDI_DOMINATORS))
> +                delete_from_dominance_info (CDI_DOMINATORS, temp_bb);
> +
> +              if (dom_info_available_p (CDI_POST_DOMINATORS))
> +                delete_from_dominance_info (CDI_POST_DOMINATORS, temp_bb);
> +
> +              remove_phi_nodes (temp_bb);
> +              expunge_block (temp_bb);
> +
> +              return;
> +            }
> +          }
> +      }
> +  }
> +}
> +


> diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c
> index fbecdf7..c0db184 100644
> --- a/gcc/tree-vrp.c
> +++ b/gcc/tree-vrp.c
> @@ -10141,6 +10141,10 @@ identify_jump_threads (void)
>   	      thread_across_edge (dummy, e, true, &equiv_stack,
>   				  simplify_stmt_for_jump_threading);
>   	    }
> +          if (gimple_code(last) == GIMPLE_COND)
> +            {
> +              process_path_splitting(bb);
So instead of running this as a subroutine of vrp, make it a first class 
pass.  Then find the right place to put the pass.  I think that'll 
ultimately be cleaner and simpler as you won't be trying to duplicate 
things like const/copy propagation.


Overall I like the transformation you're trying to make, but I would 
like to see this as a separate pass; if it can't be a separate pass, 
please indicate why.  Second, you should be looking to re-use existing 
block duplication routines, SSA updating, etc. as much as possible.

jeff


