[PATCH][RFC] Add an early loop unrolling pass, address PRs 18754 and 34223

Richard Guenther rguenther@suse.de
Wed Apr 23 11:55:00 GMT 2008


This (again) tries to add an early complete loop unrolling pass to GCC
to unroll loops completely that expose optimization opportunities like
full scalarization of DOT_PRODUCT in Fortran or index computation in C++.

Unlike previous attempts this doesn't come at a compile-time penalty
but in the tramp3d case even reduces compile-time by 2%.  tramp3d
shows up to 15% performance improvements (for the -fopenmp case),
Polyhedron benefits as well, 8% for ac and 17% for induct with
no negative runtime effects on other tests.  The patch is neutral
for SPEC2000 with some minor improvements and minor regressions that
can be all accounted as noise (given that the tester also tracks
mainline changes).  All tests performed on x86_64.

I have to still address some fallout in the vectorization testsuite
(loops are no longer vectorized because they are DCEd) and a
wrong-code bug in the vectorizer that makes gfortran.dg/array_1.f90
fail at -O3.

Implementationwise the early loop unrolling is scheduled right after
final inlining (the earliest possibility would be after the profile
pass so basic block frequencies are set up) to let the following CCP
pass clean up.  The unrolling will not unroll outer loops unless
this is a size win to allow the vectorizer to continue to work
which needs at least one outer loop for SLP to work.  Unless -O3
is specified unrolling is only done if it doesn't increase code size.

Compile-time is saved compared to previous attempts by only rewriting
loops into loop closed SSA form once we decided to unroll one loop
(I have an alternate implementation that only puts single loops into
loop closed SSA form, but that in the end is slower than the current
approach).

The patch independently of the new pass fixes one shortcoming in
the loop unroller, the inability to continue unrolling non-innermost
loops after an innermost loop has been unrolled.  It does so by
cleaning up the CFG which will update the loop structures and makes
non-innermost loops innermost.  It also makes sure to walk the loops
in depth first order.  Technically these parts could go in separately.

Otherwise this bootstrapped and tested ok on x86_64-unknown-linux-gnu.

Does this sound reasonable?  Possible changes include adding a
new optimization option for this and/or restricting the unrolling more.

Thanks,
Richard.

2008-04-21  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/18754
	PR tree-optimization/34223
	* tree-pass.h (pass_complete_unrolli): Declare.
	* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Print
	loop size before and after unconditionally of UL_NO_GROWTH in effect.
	Rewrite loop into loop closed SSA form if it is not already.
	(tree_unroll_loops_completely): Re-structure to iterate over
	innermost loops with intermediate CFG cleanups.
	Unroll outermost loops only if requested or the code does not grow
	doing so.
	* tree-ssa-loop.c (tree_complete_unroll): Also unroll outermost
	loops.
	(tree_complete_unroll_inner): New function.
	(gate_tree_complete_unroll_inner): Likewise.
	(pass_complete_unrolli): New pass.
	* tree-flow.h (tree_unroll_loops_completely): Add extra parameter.
	* passes.c (init_optimization_passes): Schedule complete inner
	loop unrolling pass before the first CCP pass after final inlining.

	* gcc.dg/tree-ssa/loop-36.c: New testcase.
	* gcc.dg/tree-ssa/loop-37.c: Likewise.
	* gcc.dg/vect/vect-118.c: Likewise.

Index: gcc/tree-ssa-loop-manip.c
===================================================================
*** gcc/tree-ssa-loop-manip.c.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/tree-ssa-loop-manip.c	2008-04-22 13:59:31.000000000 +0200
*************** find_uses_to_rename_use (basic_block bb,
*** 248,257 ****
      return;
    def_loop = def_bb->loop_father;
  
!   /* If the definition is not inside loop, it is not interesting.  */
    if (!loop_outer (def_loop))
      return;
  
    if (!use_blocks[ver])
      use_blocks[ver] = BITMAP_ALLOC (NULL);
    bitmap_set_bit (use_blocks[ver], bb->index);
--- 248,262 ----
      return;
    def_loop = def_bb->loop_father;
  
!   /* If the definition is not inside a loop, it is not interesting.  */
    if (!loop_outer (def_loop))
      return;
  
+   /* If the use is not outside of the loop it is defined in, it is not
+      interesting.  */
+   if (flow_bb_inside_loop_p (def_loop, bb))
+     return;
+ 
    if (!use_blocks[ver])
      use_blocks[ver] = BITMAP_ALLOC (NULL);
    bitmap_set_bit (use_blocks[ver], bb->index);
Index: gcc/tree-pass.h
===================================================================
*** gcc/tree-pass.h.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/tree-pass.h	2008-04-22 13:59:31.000000000 +0200
*************** extern struct gimple_opt_pass pass_if_co
*** 290,295 ****
--- 290,296 ----
  extern struct gimple_opt_pass pass_loop_distribution;
  extern struct gimple_opt_pass pass_vectorize;
  extern struct gimple_opt_pass pass_complete_unroll;
+ extern struct gimple_opt_pass pass_complete_unrolli;
  extern struct gimple_opt_pass pass_parallelize_loops;
  extern struct gimple_opt_pass pass_loop_prefetch;
  extern struct gimple_opt_pass pass_iv_optimize;
Index: gcc/tree-ssa-loop-ivcanon.c
===================================================================
*** gcc/tree-ssa-loop-ivcanon.c.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/tree-ssa-loop-ivcanon.c	2008-04-22 14:01:36.000000000 +0200
*************** try_unroll_loop_completely (struct loop 
*** 187,209 ****
  	  > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS))
  	return false;
  
!       if (ul == UL_NO_GROWTH)
  	{
- 	  unr_insns = estimated_unrolled_size (ninsns, n_unroll);
- 	  
  	  if (dump_file && (dump_flags & TDF_DETAILS))
! 	    {
! 	      fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
! 	      fprintf (dump_file, "  Estimated size after unrolling: %d\n",
! 		       (int) unr_insns);
! 	    }
! 	  
! 	  if (unr_insns > ninsns)
! 	    {
! 	      if (dump_file && (dump_flags & TDF_DETAILS))
! 		fprintf (dump_file, "Not unrolling loop %d:\n", loop->num);
! 	      return false;
! 	    }
  	}
      }
  
--- 187,206 ----
  	  > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS))
  	return false;
  
!       unr_insns = estimated_unrolled_size (ninsns, n_unroll);
!       if (dump_file && (dump_flags & TDF_DETAILS))
! 	{
! 	  fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
! 	  fprintf (dump_file, "  Estimated size after unrolling: %d\n",
! 		   (int) unr_insns);
! 	}
! 
!       if (ul == UL_NO_GROWTH
! 	  && unr_insns > ninsns)
  	{
  	  if (dump_file && (dump_flags & TDF_DETAILS))
! 	    fprintf (dump_file, "Not unrolling loop %d.\n", loop->num);
! 	  return false;
  	}
      }
  
*************** try_unroll_loop_completely (struct loop 
*** 214,219 ****
--- 211,220 ----
        unsigned i;
        VEC (edge, heap) *to_remove = NULL;
  
+       /* Re-write this loop into loop closed SSA form if necessary.  */
+       if (!loops_state_satisfies_p (LOOP_CLOSED_SSA))
+ 	rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+ 
        initialize_original_copy_tables ();
        wont_exit = sbitmap_alloc (n_unroll + 1);
        sbitmap_ones (wont_exit);
*************** canonicalize_induction_variables (void)
*** 339,368 ****
     size of the code does not increase.  */
  
  unsigned int
! tree_unroll_loops_completely (bool may_increase_size)
  {
    loop_iterator li;
    struct loop *loop;
!   bool changed = false;
    enum unroll_level ul;
  
!   FOR_EACH_LOOP (li, loop, 0)
      {
!       if (may_increase_size && maybe_hot_bb_p (loop->header))
! 	ul = UL_ALL;
!       else
! 	ul = UL_NO_GROWTH;
!       changed |= canonicalize_loop_induction_variables (loop,
! 							false, ul,
! 							!flag_tree_loop_ivcanon);
!     }
  
!   /* Clean up the information about numbers of iterations, since complete
!      unrolling might have invalidated it.  */
!   scev_reset ();
  
-   if (changed)
-     return TODO_cleanup_cfg;
    return 0;
  }
  
--- 340,384 ----
     size of the code does not increase.  */
  
  unsigned int
! tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer)
  {
    loop_iterator li;
    struct loop *loop;
!   bool changed;
    enum unroll_level ul;
  
!   do
      {
!       changed = false;
  
!       FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
! 	{
! 	  if (may_increase_size && maybe_hot_bb_p (loop->header)
! 	      /* Unroll outermost loops only if asked to do so or they do
! 		 not cause code growth.  */
! 	      && (unroll_outer
! 		  || (loop_outer (loop) && loop_outer (loop_outer (loop)))))
! 	    ul = UL_ALL;
! 	  else
! 	    ul = UL_NO_GROWTH;
! 	  changed |= canonicalize_loop_induction_variables
! 		       (loop, false, ul, !flag_tree_loop_ivcanon);
! 	}
! 
!       if (changed)
! 	{
! 	  /* This will take care of removing completely unrolled loops
! 	     from the loop structures so we can continue unrolling now
! 	     innermost loops.  */
! 	  cleanup_tree_cfg ();
! 
! 	  /* Clean up the information about numbers of iterations, since
! 	     complete unrolling might have invalidated it.  */
! 	  scev_reset ();
! 	}
!     }
!   while (changed);
  
    return 0;
  }
  
Index: gcc/tree-ssa-loop.c
===================================================================
*** gcc/tree-ssa-loop.c.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/tree-ssa-loop.c	2008-04-22 13:59:31.000000000 +0200
*************** tree_complete_unroll (void)
*** 466,472 ****
  
    return tree_unroll_loops_completely (flag_unroll_loops
  				       || flag_peel_loops
! 				       || optimize >= 3);
  }
  
  static bool
--- 466,472 ----
  
    return tree_unroll_loops_completely (flag_unroll_loops
  				       || flag_peel_loops
! 				       || optimize >= 3, true);
  }
  
  static bool
*************** struct gimple_opt_pass pass_complete_unr
*** 495,500 ****
--- 495,547 ----
   }
  };
  
+ /* Complete unrolling of inner loops.  */
+ 
+ static unsigned int
+ tree_complete_unroll_inner (void)
+ {
+   unsigned ret = 0;
+ 
+   loop_optimizer_init (LOOPS_NORMAL
+ 		       | LOOPS_HAVE_RECORDED_EXITS);
+   if (number_of_loops () > 1)
+     {
+       scev_initialize ();
+       ret = tree_unroll_loops_completely (optimize >= 3, false);
+       free_numbers_of_iterations_estimates ();
+       scev_finalize ();
+     }
+   loop_optimizer_finalize ();
+ 
+   return ret;
+ }
+ 
+ static bool
+ gate_tree_complete_unroll_inner (void)
+ {
+   return optimize >= 2;
+ }
+ 
+ struct gimple_opt_pass pass_complete_unrolli =
+ {
+  {
+   GIMPLE_PASS,
+   "cunrolli",				/* name */
+   gate_tree_complete_unroll_inner,	/* gate */
+   tree_complete_unroll_inner,	       	/* execute */
+   NULL,					/* sub */
+   NULL,					/* next */
+   0,					/* static_pass_number */
+   TV_COMPLETE_UNROLL,	  		/* tv_id */
+   PROP_cfg | PROP_ssa,			/* properties_required */
+   0,					/* properties_provided */
+   0,					/* properties_destroyed */
+   0,					/* todo_flags_start */
+   TODO_dump_func | TODO_verify_loops
+     | TODO_ggc_collect 			/* todo_flags_finish */
+  }
+ };
+ 
  /* Parallelization.  */
  
  static bool
Index: gcc/tree-flow.h
===================================================================
*** gcc/tree-flow.h.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/tree-flow.h	2008-04-22 13:59:31.000000000 +0200
*************** basic_block *blocks_in_phiopt_order (voi
*** 1016,1022 ****
  void tree_ssa_lim (void);
  unsigned int tree_ssa_unswitch_loops (void);
  unsigned int canonicalize_induction_variables (void);
! unsigned int tree_unroll_loops_completely (bool);
  unsigned int tree_ssa_prefetch_arrays (void);
  unsigned int remove_empty_loops (void);
  void tree_ssa_iv_optimize (void);
--- 1016,1022 ----
  void tree_ssa_lim (void);
  unsigned int tree_ssa_unswitch_loops (void);
  unsigned int canonicalize_induction_variables (void);
! unsigned int tree_unroll_loops_completely (bool, bool);
  unsigned int tree_ssa_prefetch_arrays (void);
  unsigned int remove_empty_loops (void);
  void tree_ssa_iv_optimize (void);
Index: gcc/passes.c
===================================================================
*** gcc/passes.c.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/passes.c	2008-04-22 13:59:31.000000000 +0200
*************** init_optimization_passes (void)
*** 567,572 ****
--- 567,573 ----
        NEXT_PASS (pass_rename_ssa_copies);
  
        /* Initial scalar cleanups.  */
+       NEXT_PASS (pass_complete_unrolli);
        NEXT_PASS (pass_ccp);
        NEXT_PASS (pass_phiprop);
        NEXT_PASS (pass_fre);
Index: gcc/testsuite/gcc.dg/Wunreachable-8.c
===================================================================
*** gcc/testsuite/gcc.dg/Wunreachable-8.c.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/testsuite/gcc.dg/Wunreachable-8.c	2008-04-22 13:59:31.000000000 +0200
*************** float Factorial(float X)
*** 6,12 ****
    int k,j;
    for (k=1; k < 5; k++)
      {
!       val += 1.0;
      }
    return (val); /* { dg-bogus "will never be executed" } */
  }
--- 6,12 ----
    int k,j;
    for (k=1; k < 5; k++)
      {
!       val += 1.0; /* { dg-bogus "will never be executed" "" { xfail *-*-* } } */
      }
    return (val); /* { dg-bogus "will never be executed" } */
  }
Index: gcc/testsuite/gcc.dg/tree-ssa/loop-36.c
===================================================================
*** /dev/null	1970-01-01 00:00:00.000000000 +0000
--- gcc/testsuite/gcc.dg/tree-ssa/loop-36.c	2008-04-22 13:59:31.000000000 +0200
***************
*** 0 ****
--- 1,21 ----
+ /* { dg-do compile } */
+ /* { dg-options "-O2 -fdump-tree-dce2" } */
+ 
+ struct X { float array[4]; };
+ 
+ struct X a,b;
+ 
+ float foobar () {
+   float s = 0;
+   unsigned int d;
+   struct X c;
+   for (d=0; d<4; ++d)
+     c.array[d] = a.array[d] * b.array[d];
+   for (d=0; d<4; ++d)
+     s+=c.array[d];
+   return s;
+ }
+ 
+ /* The temporary structure should have been promoted to registers
+    by FRE after the loops have been unrolled by the early unrolling pass.  */
+ /* { dg-final { scan-tree-dump-not "c\.array" "dce2" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/loop-37.c
===================================================================
*** /dev/null	1970-01-01 00:00:00.000000000 +0000
--- gcc/testsuite/gcc.dg/tree-ssa/loop-37.c	2008-04-22 13:59:31.000000000 +0200
***************
*** 0 ****
--- 1,27 ----
+ /* { dg-do link } */
+ /* { dg-options "-O2 -fdump-tree-optimized" } */
+ 
+ extern void link_error (void);
+ static const int my_array [3] = { 4, 5, 6 };
+ 
+ void f0 (void)
+ {
+   int j, sum = 0;
+   for (j = 0; j < 3; j ++)
+     sum += my_array [j];
+   if (15 != sum)
+     link_error ();
+ }
+ 
+ int f1 (int a [])
+ {
+   int j, sum = 0;
+   for (j = 0; j < 3; j ++)
+     sum += a [j] + my_array [j];
+   return sum;
+ }
+ 
+ int main() { }
+ 
+ /* { dg-final { scan-tree-dump-not "my_array" "optimized" } } */
+ /* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-118.c
===================================================================
*** /dev/null	1970-01-01 00:00:00.000000000 +0000
--- gcc/testsuite/gcc.dg/vect/vect-118.c	2008-04-22 13:59:31.000000000 +0200
***************
*** 0 ****
--- 1,33 ----
+ /* { dg-require-effective-target vect_int } */
+ /* { dg-options "-O3 -fdump-tree-vect-details" } */
+ 
+ #include "tree-vect.h"
+ 
+ #define M 10
+ #define N 3
+ 
+ void __attribute__((noinline))
+ foo (int n, int *ub, int *uc)
+ {
+   int i, j, tmp1;
+ 
+   for (i = 0; i < n; i++)
+     {
+       tmp1 = 0;
+       for (j = 0; j < M; j++)
+         {
+           tmp1 += uc[i] * ((int)(j << N) / M);
+         }
+       ub[i] = tmp1;
+     }
+ }
+ 
+ int main()
+ {
+   int uc[16], ub[16];
+   check_vect ();
+   foo (16, uc, ub);
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-66.c
===================================================================
*** gcc/testsuite/gcc.dg/vect/vect-66.c.orig	2008-04-22 13:56:16.000000000 +0200
--- gcc/testsuite/gcc.dg/vect/vect-66.c	2008-04-22 13:59:31.000000000 +0200
***************
*** 6,12 ****
  #define N 16
  
  __attribute__ ((noinline))
! int main1 ()
  {
    int i, j;
    int ib[6] = {0,3,6,9,12,15};
--- 6,12 ----
  #define N 16
  
  __attribute__ ((noinline))
! void main1 ()
  {
    int i, j;
    int ib[6] = {0,3,6,9,12,15};
*************** int main1 ()
*** 31,36 ****
--- 31,46 ----
                  abort();
          }
      }
+ }
+ 
+ __attribute__ ((noinline))
+ void main2 ()
+ {
+   int i, j;
+   int ib[6] = {0,3,6,9,12,15};
+   int ia[8][5][6];
+   int ic[16][16][5][6];
+ 
    /* Multidimensional array. Aligned. */
    for (i = 0; i < 16; i++)
      {
*************** int main1 ()
*** 47,52 ****
--- 57,71 ----
                  abort();
          }
      }
+ }
+ 
+ __attribute__ ((noinline))
+ void main3 ()
+ {
+   int i, j;
+   int ib[6] = {0,3,6,9,12,15};
+   int ia[8][5][6];
+   int ic[16][16][5][6];
  
    /* Multidimensional array. Not aligned. */
    for (i = 0; i < 16; i++)
*************** int main1 ()
*** 66,80 ****
                  abort();
          }
      }
- 
-   return 0;
  }
  
  int main (void)
  { 
    check_vect ();
  
!   return main1 ();
  }
  
  /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */
--- 85,101 ----
                  abort();
          }
      }
  }
  
  int main (void)
  { 
    check_vect ();
  
!   main1 ();
!   main2 ();
!   main3 ();
! 
!   return 0;
  }
  
  /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" } } */



More information about the Gcc-patches mailing list