[Patch] Improving jump-thread pass for PR 54742

Mon Dec 1 21:06:00 GMT 2014

On 11/25/14 14:16, Sebastian Pop wrote:
> Sebastian Pop wrote:
>> >I will bootstrap and regression test this patch on x86_64-linux and
>> >powerpc64-linux.  I will also run it on our internal benchmarks, coremark, and
>> >the llvm test-suite.
>> >
>> >I will also include a longer testcase that makes sure we do not regress on
>> >coremark.
> Done all the above.  Attached is the new patch with a new testcase.  I have also
> added verify_seme inspired by the recent patch adding verify_sese.
>
> Sebastian
>
>
> 0001-extend-jump-thread-for-finite-state-automata-PR-5474.patch
>
>
>  From ca222d5222fb976c7aa258d3e3c04e593f42f7a2 Mon Sep 17 00:00:00 2001
> From: Sebastian Pop<s.pop@samsung.com>
> Date: Fri, 26 Sep 2014 14:54:20 -0500
> Subject: [PATCH] extend jump thread for finite state automata PR 54742
>
> Adapted from a patch from James Greenhalgh.
>
> 	* params.def (max-fsm-thread-path-insns, max-fsm-thread-length,
> 	max-fsm-thread-paths): New.
>
> 	* doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length,
> 	max-fsm-thread-paths): Documented.
>
> 	* tree-cfg.c (split_edge_bb_loc): Export.
> 	* tree-cfg.h (split_edge_bb_loc): Declared extern.
>
> 	* tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the
> 	original value of cond when simplification fails.
> 	(fsm_find_thread_path): New.
> 	(fsm_find_control_statement_thread_paths): New.
> 	(fsm_thread_through_normal_block): Call find_control_statement_thread_paths.
>
> 	* tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print
> 	EDGE_START_FSM_THREAD.
> 	(verify_seme): New.
> 	(duplicate_seme_region): New.
> 	(thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges
> 	calling gimple_duplicate_sese_region.
>
> 	* tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD.
>
> 	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New.
> 	* testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New.
> ---
>   gcc/doc/invoke.texi                              |   12 ++
>   gcc/params.def                                   |   15 ++
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c |   43 +++++
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c |  127 +++++++++++++
>   gcc/tree-cfg.c                                   |    2 +-
>   gcc/tree-cfg.h                                   |    1 +
>   gcc/tree-ssa-threadedge.c                        |  215 +++++++++++++++++++++-
>   gcc/tree-ssa-threadupdate.c                      |  201 +++++++++++++++++++-
>   gcc/tree-ssa-threadupdate.h                      |    1 +
>   9 files changed, 614 insertions(+), 3 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 89edddb..074183f 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level
>   @option{-O1} and higher.  This parameter is a maximum nubmer of statements
>   in a single generated constructor.  Default value is 5000.
>
> +@item max-fsm-thread-path-insns
> +Maximum number of instructions to copy when duplicating blocks on a
> +finite state automaton jump thread path.  The default is 100.
> +
> +@item max-fsm-thread-length
> +Maximum number of basic blocks on a finite state automaton jump thread
> +path.  The default is 10.
> +
> +@item max-fsm-thread-paths
> +Maximum number of new jump thread paths to create for a finite state
> +automaton.  The default is 50.
Has there been any tuning on these defaults.  I don't have any strong 
opinions about what they ought to be, this is more to get any such 
information recorded on the lists for historical purposes.

I think it's worth a note in the debug dump anytime you abort threading 
when you hit a limit.

I'm a bit worried about compile-time impacts of the all the recursion, 
but I'm willing to wait and see if it turns out to be a problem in practice.

> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c
> index 8b0b7b8..c9fe212 100644
> --- a/gcc/tree-ssa-threadedge.c
> +++ b/gcc/tree-ssa-threadedge.c
> @@ -56,6 +56,8 @@ along with GCC; see the file COPYING3.  If not see
>   #include "params.h"
>   #include "tree-ssa-threadedge.h"
>   #include "builtins.h"
> +#include "cfg.h"
> +#include "cfganal.h"
>
>   /* To avoid code explosion due to jump threading, we limit the
>      number of statements we are going to copy.  This variable
> @@ -661,6 +663,7 @@ simplify_control_stmt_condition (edge e,
>        rather than use a relational operator.  These are simpler to handle.  */
>     if (TREE_CODE (cond) == SSA_NAME)
>       {
> +      tree original_lhs = cond;
>         cached_lhs = cond;
>
>         /* Get the variable's current value from the equivalence chains.
> @@ -689,6 +692,12 @@ simplify_control_stmt_condition (edge e,
>   	 pass specific callback to try and simplify it further.  */
>         if (cached_lhs && ! is_gimple_min_invariant (cached_lhs))
>           cached_lhs = (*simplify) (stmt, stmt);
> +
> +      /* We couldn't find an invariant.  But, callers of this
> +	 function may be able to do something useful with the
> +	 unmodified destination.  */
> +      if (!cached_lhs)
> +	cached_lhs = original_lhs;
>       }
>     else
>       cached_lhs = NULL;
Can't you just use COND rather than stuffing its value away into 
ORIGINAL_LHS?    CACHED_LHS may be better in some cases if it's an 
SSA_NAME (and it should be), but I doubt it matters in practice.

Or is it the case that you have to have the original condition -- 
without any context sensitive equivalences used to "simplify" the condition.

> @@ -948,6 +957,188 @@ thread_around_empty_blocks (edge taken_edge,
>     return false;
>   }
>
> +/* Return true if there is at least one path from START_BB to END_BB.
> +   VISITED_BBS is used to make sure we don't fall into an infinite loop.  */
> +
> +static bool
> +fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
> +		      vec<basic_block, va_gc> *&path,
> +		      hash_set<basic_block> *visited_bbs, int n_insns)
> +{
> +  if (start_bb == end_bb)
> +    {
> +      vec_safe_push (path, start_bb);
> +      return true;
> +    }
> +
> +  if (!visited_bbs->add (start_bb))
> +    {
> +      edge e;
> +      edge_iterator ei;
> +      FOR_EACH_EDGE (e, ei, start_bb->succs)
> +	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, n_insns))
> +	  {
> +	    vec_safe_push (path, start_bb);
> +	    return true;
> +	  }
> +    }
> +
> +  return false;
Update comment to indicate how PATH is used to return a path from 
START_BB to END_BB.

> +		  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
> +		       gsi_next (&gsi))
> +		    ++n_insns;
Probably don't want to count labels and GIMPLE_NOPs.  Probably do want 
to count non-virtual PHIs since those may end up as a copies or constant 
initializations.

> +		      if (j == 0)
> +			kind = EDGE_START_FSM_THREAD;
> +		      else if (single_pred_p (e->src))
> +			kind = EDGE_NO_COPY_SRC_BLOCK;
> +		      else {
> +			kind = EDGE_COPY_SRC_JOINER_BLOCK;
> +			++joiners;
> +		      }
Presumably the mis-formatting was added when you tracked the # joiners. 
  AFAICT that is a write-only variable and ought to be removed.  Along 
with the braces on the final ELSE which should restore proper formatting.

> @@ -2343,6 +2493,55 @@ thread_through_all_blocks (bool may_peel_loop_headers)
>     threaded_blocks = BITMAP_ALLOC (NULL);
>     memset (&thread_stats, 0, sizeof (thread_stats));
>
> +  for (i = 0; i < paths.length ();)
Comment before this loop.  I can see what you're doing, but I'm already 
very familiar with this code.  Basically what are you looking for in 
this loop and what do you do?

Overall I think this is very very close and I really like the overall 
direction.  There's a few minor issues noted above and with those 
addressed, I think we should be ready to go.

Looking further out....

Removing most of tree-ssa-threadupdate.c and using SEME duplication 
would be a huge step forward for making this code more understandable. I 
look forward to any work you do in this space in the future.

Similarly moving towards a backwards dataflow driven model is definitely 
on my long term plan for this code.  Ideally with some kind of knob that 
says "optimize the trivial jump threads you can find and do so very 
quickly" (say by restricting the lookup to a single block) and a more 
expensive version.

The simple version could run early which would solve some problems Jan 
has run into.  Running the simple version early would also help DOM/VRP.

Ideally I want to disentangle threading from VRP and DOM -- most 
threading opportunities are fairly simple to find and exploit.  Yet 
right now we have to run DOM or VRP which are insanely expensive.

Jeff