[PATCH v3] pass: Run cleanup passes before SLP [PR96789]

Kewen.Lin linkw@linux.ibm.com
Mon Nov 2 09:00:02 GMT 2020


Hi Richard,

Thanks a lot for the review and sorry for the late reply.

Per the review comments, the changes against v2 are:
  - Fix the introduced tabification issues in passes.def.
  - Add pending TODOs group (different bitmask numbering).
  - Move pending_TODOs unsigned to struct function.
  - Fix some comments as advised.
  - Update pending TODOs setting in tree_unroll_loops_completely_1.

[... snipped]

>> +extern unsigned int pending_TODOs;
> 
> Would it be better to put this in struct function?  At least that
> would prevent any possibility of the flags being accidentally
> carried over between functions.  (Obviously that would be a bug,
> but there's a danger that it might be a mostly silent bug.)
> 

Good point, updated!
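For what it's worth, the per-function flag scheme can be sketched roughly like this (all names below are invented for the illustration; they are not GCC's actual declarations):

```cpp
#include <cassert>

// Illustrative sketch of the per-function pending-TODO pattern discussed
// above.  The names are made up for this example only.
enum : unsigned { PENDING_TODO_scalar_cleanup = 1u << 1 };

struct fn_state
{
  // Mirrors cfun->pending_TODOs: zero-initialized when each function's
  // state is allocated, so flags can never leak between functions.
  unsigned pending_todos = 0;
};

// Producer side: e.g. the unroller records that a later cleanup pass
// would be worthwhile for this function.
void request_cleanup (fn_state &fn)
{
  fn.pending_todos |= PENDING_TODO_scalar_cleanup;
}

// Consumer side: the cleanup pass gates on the flag ...
bool cleanup_wanted (const fn_state &fn)
{
  return (fn.pending_todos & PENDING_TODO_scalar_cleanup) != 0;
}

// ... and clears it once it has acted, so the request is not carried
// further down the pipeline.
void cleanup_done (fn_state &fn)
{
  fn.pending_todos &= ~PENDING_TODO_scalar_cleanup;
}
```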

>> +	/* Once we process our father we will have processed
>> +	   the fathers of our children as well, so avoid doing
>> +	   redundant work and clear fathers we've gathered sofar.
>> +	   But don't clear it for one outermost loop since its
>> +	   propagation doesn't run necessarily all the time.  */
>> +	bitmap_clear (father_bbs);
>> +
>> +      bitmap_set_bit (father_bbs, loop_father->header->index);
> 
> Is this a way of telling the caller to set “outermost_unrolled”?
> If so, with the suggestion above, we could just set pending_TODOs
> directly.

Yes, updated.

> 
> I had to read the above several times before I (hopefully) understood it.
> It seems that it's really overloading the bitmap for two different things
> (hence the need to explain why the clearing doesn't happen for the outer
> loop).

Sorry, the appended part "But don't clear it for one outermost loop
since ..." still doesn't make this clear.  You are right: if the bit is
set for the outermost loop (its father), it's not guaranteed that we will
do the propagation for it (and its children), so we shouldn't clear
father_bbs, otherwise some expected propagation might not happen.
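A toy model of the two cases may make the clearing rule easier to see (invented names only; this is not GCC's real data structure or code):

```cpp
#include <cassert>
#include <set>

// Toy model of the father_bbs handling discussed above; the names are
// made up for illustration.
struct unroll_state
{
  std::set<int> gathered;      // stands in for the father_bbs bitmap
  unsigned pending_todos = 0;  // stands in for cfun->pending_TODOs
};

const unsigned TODO_scalar_cleanup = 1u << 1;

// Called after a loop has been fully unrolled.  If the father is a real
// loop, a later propagation over the father is guaranteed and covers
// the children gathered so far, so their entries are redundant and can
// be dropped.  If the father is the outermost (root) node, that
// propagation is not guaranteed to run, so nothing may be dropped; we
// only request the pending scalar cleanup.
void after_unroll (unroll_state &st, int father_header, bool father_is_root)
{
  if (!father_is_root)
    {
      st.gathered.clear ();              // father subsumes the children
      st.gathered.insert (father_header);
    }
  else
    st.pending_todos |= TODO_scalar_cleanup;
}
```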

Bootstrapped/regtested on powerpc64le-linux-gnu P8 again.

Does it look better?  Thanks again!

BR,
Kewen
-----
gcc/ChangeLog:

	PR tree-optimization/96789
	* function.c (allocate_struct_function): Init pending_TODOs.
	* function.h (struct function): New member unsigned pending_TODOs.
	* passes.c (class pass_pre_slp_scalar_cleanup): New class.
	(make_pass_pre_slp_scalar_cleanup): New function.
	(pass_data_pre_slp_scalar_cleanup): New pass data.
	* passes.def (pass_pre_slp_scalar_cleanup): New pass; add
	pass_fre and pass_dse as its children.
	* timevar.def (TV_SCALAR_CLEANUP): New timevar.
	* tree-pass.h (PENDING_TODO_force_next_scalar_cleanup): New
	pending TODO flag.
	(make_pass_pre_slp_scalar_cleanup): New declaration.
	* tree-ssa-loop-ivcanon.c (tree_unroll_loops_completely_1):
	Once any outermost loop gets unrolled, set
	PENDING_TODO_force_next_scalar_cleanup in cfun->pending_TODOs.

gcc/testsuite/ChangeLog:

	PR tree-optimization/96789
	* gcc.dg/tree-ssa/ssa-dse-28.c: Adjust.
	* gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.
	* gcc.dg/vect/bb-slp-41.c: Likewise.
	* gcc.dg/tree-ssa/pr96789.c: New test.

diff --git a/gcc/function.c b/gcc/function.c
index 2c8fa217f1f..3e92ee9c665 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -4841,6 +4841,8 @@ allocate_struct_function (tree fndecl, bool abstract_p)
      binding annotations among them.  */
   cfun->debug_nonbind_markers = lang_hooks.emits_begin_stmt
     && MAY_HAVE_DEBUG_MARKER_STMTS;
+
+  cfun->pending_TODOs = 0;
 }
 
 /* This is like allocate_struct_function, but pushes a new cfun for FNDECL
diff --git a/gcc/function.h b/gcc/function.h
index d55cbddd0b5..ffed6520bf9 100644
--- a/gcc/function.h
+++ b/gcc/function.h
@@ -269,6 +269,13 @@ struct GTY(()) function {
   /* Value histograms attached to particular statements.  */
   htab_t GTY((skip)) value_histograms;
 
+  /* Different from normal TODO_flags which are handled right at the
+     beginning or the end of one pass execution, the pending_TODOs are
+     passed down in the pipeline until one of its consumers can
+     perform the requested action.  Consumers should then clear the
+     flags for the actions that they have taken.  */
+  unsigned int pending_TODOs;
+
   /* For function.c.  */
 
   /* Points to the FUNCTION_DECL of this function.  */
diff --git a/gcc/passes.c b/gcc/passes.c
index 6ff31ec37d7..eb8efc11c90 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -731,7 +731,54 @@ make_pass_late_compilation (gcc::context *ctxt)
   return new pass_late_compilation (ctxt);
 }
 
+/* Pre-SLP scalar cleanup; it runs several cleanup passes like FRE and DSE.  */
 
+namespace {
+
+const pass_data pass_data_pre_slp_scalar_cleanup =
+{
+  GIMPLE_PASS, /* type */
+  "*pre_slp_scalar_cleanup", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_SCALAR_CLEANUP, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_pre_slp_scalar_cleanup : public gimple_opt_pass
+{
+public:
+  pass_pre_slp_scalar_cleanup (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_pre_slp_scalar_cleanup, ctxt)
+  {
+  }
+
+  virtual bool
+  gate (function *fun)
+  {
+    return flag_tree_slp_vectorize
+	   && (fun->pending_TODOs & PENDING_TODO_force_next_scalar_cleanup);
+  }
+
+  virtual unsigned int
+  execute (function *fun)
+  {
+    fun->pending_TODOs &= ~PENDING_TODO_force_next_scalar_cleanup;
+    return 0;
+  }
+
+}; // class pass_pre_slp_scalar_cleanup
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_pre_slp_scalar_cleanup (gcc::context *ctxt)
+{
+  return new pass_pre_slp_scalar_cleanup (ctxt);
+}
 
 /* Set the static pass number of pass PASS to ID and record that
    in the mapping from static pass number to pass.  */
diff --git a/gcc/passes.def b/gcc/passes.def
index c0098d755bf..1f41f36396e 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -288,11 +288,16 @@ along with GCC; see the file COPYING3.  If not see
 	  /* pass_vectorize must immediately follow pass_if_conversion.
 	     Please do not add any other passes in between.  */
 	  NEXT_PASS (pass_vectorize);
-          PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
+	  PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
 	      NEXT_PASS (pass_dce);
-          POP_INSERT_PASSES ()
-          NEXT_PASS (pass_predcom);
+	  POP_INSERT_PASSES ()
+	  NEXT_PASS (pass_predcom);
 	  NEXT_PASS (pass_complete_unroll);
+	  NEXT_PASS (pass_pre_slp_scalar_cleanup);
+	  PUSH_INSERT_PASSES_WITHIN (pass_pre_slp_scalar_cleanup)
+	      NEXT_PASS (pass_fre, false /* may_iterate */);
+	      NEXT_PASS (pass_dse);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_slp_vectorize);
 	  NEXT_PASS (pass_loop_prefetch);
 	  /* Run IVOPTs after the last pass that uses data-reference analysis
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr96789.c b/gcc/testsuite/gcc.dg/tree-ssa/pr96789.c
new file mode 100644
index 00000000000..d6139a014d8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr96789.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -funroll-loops -ftree-vectorize -fdump-tree-dse-details" } */
+
+/* Test if scalar cleanup pass takes effects, mainly check
+   its secondary pass DSE can remove dead stores on array
+   tmp.  */
+
+#include <stdint.h>
+
+static inline void
+foo (int16_t *diff, int i_size, uint8_t *val1, int i_val1, uint8_t *val2,
+     int i_val2)
+{
+  for (int y = 0; y < i_size; y++)
+    {
+      for (int x = 0; x < i_size; x++)
+	diff[x + y * i_size] = val1[x] - val2[x];
+      val1 += i_val1;
+      val2 += i_val2;
+    }
+}
+
+void
+bar (int16_t res[16], uint8_t *val1, uint8_t *val2)
+{
+  int16_t d[16];
+  int16_t tmp[16];
+
+  foo (d, 4, val1, 16, val2, 32);
+
+  for (int i = 0; i < 4; i++)
+    {
+      int s03 = d[i * 4 + 0] + d[i * 4 + 3];
+      int s12 = d[i * 4 + 1] + d[i * 4 + 2];
+      int d03 = d[i * 4 + 0] - d[i * 4 + 3];
+      int d12 = d[i * 4 + 1] - d[i * 4 + 2];
+
+      tmp[0 * 4 + i] = s03 + s12;
+      tmp[1 * 4 + i] = 2 * d03 + d12;
+      tmp[2 * 4 + i] = s03 - s12;
+      tmp[3 * 4 + i] = d03 - 2 * d12;
+    }
+
+  for (int i = 0; i < 4; i++)
+    {
+      int s03 = tmp[i * 4 + 0] + tmp[i * 4 + 3];
+      int s12 = tmp[i * 4 + 1] + tmp[i * 4 + 2];
+      int d03 = tmp[i * 4 + 0] - tmp[i * 4 + 3];
+      int d12 = tmp[i * 4 + 1] - tmp[i * 4 + 2];
+
+      res[i * 4 + 0] = s03 + s12;
+      res[i * 4 + 1] = 2 * d03 + d12;
+      res[i * 4 + 2] = s03 - s12;
+      res[i * 4 + 3] = d03 - 2 * d12;
+    }
+}
+
+/* { dg-final { scan-tree-dump {Deleted dead store:.*tmp} "dse3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
index d3a1bbc5ce5..b81cabe604a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-28.c
@@ -17,5 +17,5 @@ int foo (int *p, int b)
 
 /* { dg-final { scan-tree-dump-not "Deleted dead store" "dse1"} } */
 /* { dg-final { scan-tree-dump-not "Deleted dead store" "dse2"} } */
-/* { dg-final { scan-tree-dump-not "Deleted dead store" "dse3"} } */
+/* { dg-final { scan-tree-dump-not "Deleted dead store" "dse4"} } */
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-29.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-29.c
index 31529e7caed..f4ef89c590c 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-29.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-29.c
@@ -22,5 +22,5 @@ foo(int cond, struct z *s)
 
 /* { dg-final { scan-tree-dump-times "Deleted dead store" 3 "dse1"} } */
 /* { dg-final { scan-tree-dump-not "Deleted dead store" "dse2"} } */
-/* { dg-final { scan-tree-dump-not "Deleted dead store" "dse3"} } */
+/* { dg-final { scan-tree-dump-not "Deleted dead store" "dse4"} } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-41.c b/gcc/testsuite/gcc.dg/vect/bb-slp-41.c
index 7de5ed1f5be..72245115f30 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-41.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-41.c
@@ -10,7 +10,10 @@ foo (int *a, int *b)
     a[i] = b[0] + b[1] + b[i+1] + b[i+2];
 }
 
-void bar (int *a, int *b)
+/* Disable pre-slp FRE to avoid unexpected SLP on the epilogue
+   of the 1st loop.  */
+void __attribute__((optimize("-fno-tree-fre")))
+bar (int *a, int *b)
 {
   int i;
   for (i = 0; i < (ARR_SIZE - 2); ++i)
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 7dd1e2685a4..7549a331bc2 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -193,6 +193,7 @@ DEFTIMEVAR (TV_TREE_LOOP_UNSWITCH    , "tree loop unswitching")
 DEFTIMEVAR (TV_LOOP_SPLIT            , "loop splitting")
 DEFTIMEVAR (TV_LOOP_JAM              , "unroll and jam")
 DEFTIMEVAR (TV_COMPLETE_UNROLL       , "complete unrolling")
+DEFTIMEVAR (TV_SCALAR_CLEANUP        , "scalar cleanup")
 DEFTIMEVAR (TV_TREE_PARALLELIZE_LOOPS, "tree parallelize loops")
 DEFTIMEVAR (TV_TREE_VECTORIZATION    , "tree vectorization")
 DEFTIMEVAR (TV_TREE_SLP_VECTORIZATION, "tree slp vectorization")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index f01e811917d..2561963df2e 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -310,6 +310,11 @@ protected:
 
 #define TODO_verify_all TODO_verify_il
 
+/* To-do flags for pending_TODOs.  */
+
+/* Tell the next scalar cleanup pass that there is
+   work for it to do.  */
+#define PENDING_TODO_force_next_scalar_cleanup  (1 << 1)
 
 /* Register pass info. */
 
@@ -380,6 +385,7 @@ extern gimple_opt_pass *make_pass_simduid_cleanup (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_pre_slp_scalar_cleanup (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index 298ab215530..9a9076cee67 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -1411,6 +1411,13 @@ tree_unroll_loops_completely_1 (bool may_increase_size, bool unroll_outer,
 	  bitmap_clear (father_bbs);
 	  bitmap_set_bit (father_bbs, loop_father->header->index);
 	}
+      else if (unroll_outer
+	       && !(cfun->pending_TODOs
+		    & PENDING_TODO_force_next_scalar_cleanup))
+	{
+	  /* Trigger scalar cleanup once any outermost loop gets unrolled.  */
+	  cfun->pending_TODOs |= PENDING_TODO_force_next_scalar_cleanup;
+	}
 
       return true;
     }

