This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
- From: Jack Howarth <howarth at bromo dot med dot uc dot edu>
- To: "Fang, Changpeng" <Changpeng dot Fang at amd dot com>
- Cc: Zdenek Dvorak <rakdver at kam dot mff dot cuni dot cz>, Richard Guenther <richard dot guenther at gmail dot com>, Xinliang David Li <davidxl at google dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Date: Mon, 3 Jan 2011 22:04:26 -0500
- Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
- References: <20101214075629.GA10020@kam.mff.cuni.cz> <D4C76825A6780047854A11E93CDE84D004C1768401@SAUSEXMBP01.amd.com> <20101214210552.GA19633@kam.mff.cuni.cz> <AANLkTim241bdY_JeZc2z-eLy9WU1+=wTz355XyjDfQXW@mail.gmail.com> <D4C76825A6780047854A11E93CDE84D004C1768405@SAUSEXMBP01.amd.com> <AANLkTinzdo0Sjx2WxGC9F-50fyRjXL2vCRNtyvAKT17y@mail.gmail.com> <20101215092220.GA9872@kam.mff.cuni.cz> <D4C76825A6780047854A11E93CDE84D004C1768412@SAUSEXMBP01.amd.com>
On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> Hi, Jack:
>
> Thanks for the testing.
>
> This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> It would be helpful if you can provide analysis why they are slowed down.
Changpeng,
The corrected merge against gcc trunk of...
Index: gcc/basic-block.h
===================================================================
--- gcc/basic-block.h (revision 168437)
+++ gcc/basic-block.h (working copy)
@@ -247,11 +247,14 @@
Only used in cfgcleanup.c. */
BB_NONTHREADABLE_BLOCK = 1 << 11,
+ /* Set on blocks that are headers of non-rolling loops. */
+ BB_HEADER_OF_NONROLLING_LOOP = 1 << 12,
+
/* Set on blocks that were modified in some way. This bit is set in
df_set_bb_dirty, but not cleared by df_analyze, so it can be used
to test whether a block has been modified prior to a df_analyze
call. */
- BB_MODIFIED = 1 << 12
+ BB_MODIFIED = 1 << 13
};
/* Dummy flag for convenience in the hot/cold partitioning code. */
for the proposed patch from http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01344.html
eliminated the performance regressions on x86_64-apple-darwin10. I now get...
Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
Execution Time
-m32
stock patched %increase
ac 10.59 10.59 0.0
aermod 19.49 19.13 -1.8
air 6.07 6.07 0.0
capacita 44.60 44.61 0.0
channel 1.98 1.98 0.0
doduc 31.19 31.31 0.4
fatigue 9.90 10.29 3.9
gas_dyn 4.72 4.71 -0.2
induct 13.93 13.93 0.0
linpk 15.50 15.49 -0.1
mdbx 11.28 11.26 -0.2
nf 27.62 27.58 -0.1
protein 38.70 38.60 -0.3
rnflow 24.68 24.68 0.0
test_fpu 10.13 10.13 0.0
tfft 1.92 1.92 0.0
Geometric Mean 12.06 12.08 0.2
Execution Time
-m64
stock patched %increase
ac 8.80 8.80 0.0
aermod 17.34 17.17 -1.0
air 5.48 5.52 0.7
capacita 32.38 32.50 0.4
channel 1.84 1.84 0.0
doduc 26.50 26.52 0.1
fatigue 8.35 8.33 -0.2
gas_dyn 4.30 4.29 -0.2
induct 12.83 12.83 0.0
linpk 15.49 15.49 0.0
mdbx 11.23 11.22 -0.1
nf 30.21 30.16 -0.2
protein 34.13 32.07 -6.0
rnflow 23.18 23.19 0.0
test_fpu 8.04 8.02 -0.2
tfft 1.87 1.86 -0.5
Geometric Mean 10.87 10.82 -0.5
Execution Time
>
> We did see a significant compilation time reduction for most pb05 programs.
> (I don't know why you do not have executable size data).
>
> >I would note that intel darwin now defaults
> >to -mtune=core2 and always has defaulted to -fPIC.
>
> I could not understand these default for darwin. My understanding is that,
> for x86_64, the default should be -mtune=generic.
>
> Thanks,
>
> Changpeng
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Thursday, December 16, 2010 11:31 PM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > Hi,
> >
> > Based on previous discussions, I modified the patch as such.
> >
> > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > functions will be affected.
> >
> > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > size decreases 20%, while a small (0.5%) performance increase was found which may
> > be just noise.
> >
> > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> >
> > Is it OK to commit to trunk?
>
> Changpeng,
> On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
>
> Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
>
> I get without the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 23:36:27
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark Compile Executable Ave Run Number Estim
> Name (secs) (bytes) (secs) Repeats Err %
> --------- ------- ---------- ------- ------- ------
> ac 1.76 10000 8.78 12 0.0081
> aermod 54.94 10000 17.28 10 0.0307
> air 3.44 10000 5.53 13 0.0734
> capacita 2.64 10000 32.65 10 0.0096
> channel 0.89 10000 1.84 20 0.0977
> doduc 8.12 10000 27.00 10 0.0132
> fatigue 3.06 10000 8.36 10 0.0104
> gas_dyn 5.00 10000 4.30 17 0.0915
> induct 6.24 10000 12.42 10 0.0100
> linpk 1.02 10000 15.50 12 0.0542
> mdbx 2.55 10000 11.24 10 0.0256
> nf 2.90 10000 30.16 20 0.0989
> protein 7.98 10000 33.72 10 0.0070
> rnflow 9.34 10000 23.21 10 0.0551
> test_fpu 6.72 10000 8.05 10 0.0426
> tfft 0.76 10000 1.87 10 0.0597
>
> Geometric Mean Execution Time = 10.87 seconds
>
> and with the patch...
>
> ================================================================================
> Date & Time : 16 Dec 2010 21:31:06
> Test Name : gfortran_lin_O3
> Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> Maximum Times : 2000.0
> Target Error % : 0.100
> Minimum Repeats : 10
> Maximum Repeats : 100
>
> Benchmark Compile Executable Ave Run Number Estim
> Name (secs) (bytes) (secs) Repeats Err %
> --------- ------- ---------- ------- ------- ------
> ac 1.19 10000 8.78 10 0.0099
> aermod 47.91 10000 16.95 10 0.0123
> air 2.85 10000 5.34 12 0.0715
> capacita 1.63 10000 33.10 10 0.0361
> channel 0.67 10000 1.87 10 0.0884
> doduc 6.42 10000 27.35 10 0.0206
> fatigue 2.10 10000 8.32 10 0.0194
> gas_dyn 2.07 10000 4.30 17 0.0843
> induct 5.38 10000 12.58 10 0.0088
> linpk 0.71 10000 15.69 18 0.0796
> mdbx 1.95 10000 11.41 10 0.0238
> nf 1.24 10000 31.34 12 0.0991
> protein 3.88 10000 35.13 10 0.0659
> rnflow 4.73 10000 25.97 10 0.0629
> test_fpu 3.66 10000 8.88 11 0.0989
> tfft 0.52 10000 1.89 10 0.0403
>
> Geometric Mean Execution Time = 11.09 seconds
>
> This shows about a 2.0% performance reduction in the Geometric
> Mean Execution Time. I would note that intel darwin now defaults
> to -mtune=core2 and always has defaulted to -fPIC.
> Jack
>
> >
> > Thanks,
> >
> > Changpeng
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > Sent: Thursday, December 16, 2010 12:47 PM
> > To: Fang, Changpeng
> > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> >
> > Hi,
> >
> > > For prefetching of prologue or epilogue loops, we have two choices (1) prefetching not
> > > not unrolling, (2) not prefetching. Which one do you prefer?
> >
> > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > loops),
> >
> > Zdenek
> >
>
> Content-Description: polyhedron1.txt
> > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> >
> > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> > ============================================================================================================================
> > || | Before Patch | After Patch | Changes ||
> > ||========================================================================================================================||
> > || Benchmark | Compile Binary Size Run Time | Compile Binary Size Run Time | Compile Binary Performance ||
> > || Name | Time (s) (bytes) (s) | Time(s) (bytes) (secs) | Time (%) Size(%) (%) ||
> > ||========================================================================================================================||
> > || ac | 3.36 41976 13.26 | 2.52 34424 13.21 | -25.00 -17.99 0.38 ||
> > || aermod | 103.45 1412221 44.36 | 86.55 1268861 43.08 | -16.34 -10.15 2.97 ||
> > || air | 6.11 75186 11.73 | 5.72 71090 11.56 | -6.38 -5.45 1.47 ||
> > || capacita | 6.83 91257 88.40 | 4.70 74585 88.01 | -31.19 -18.27 0.44 ||
> > || channel | 2.14 39984 6.65 | 1.84 35888 6.69 | -14.02 -10.24 -0.60 ||
> > || doduc | 12.78 198624 38.59 | 12.20 186336 38.18 | -4.54 -6.19 1.07 ||
> > || fatigue | 9.11 110008 10.15 | 5.93 92472 10.12 | -34.91 -15.94 0.30 ||
> > || gas_dyn | 15.69 149726 7.14 | 8.45 109342 6.96 | -46.14 -26.97 2.59 ||
> > || induct | 10.98 191800 20.66 | 10.62 188168 20.61 | -3.28 -1.89 0.24 ||
> > || linpk | 2.27 46073 19.03 | 1.68 33401 19.03 | -25.99 -27.50 0.00 ||
> > || mdbx | 5.63 103731 21.41 | 4.24 83251 21.33 | -24.69 -19.74 0.38 ||
> > || nf | 14.18 118451 22.88 | 5.55 60499 23.14 | -60.86 -48.92 -1.12 ||
> > || protein | 34.20 177700 47.04 | 19.16 135012 46.94 | -43.98 -24.02 0.21 ||
> > || rnflow | 42.13 283645 40.30 | 20.92 178477 40.65 | -50.34 -37.08 -0.86 ||
> > || test_fpu | 30.17 252080 14.46 | 14.44 149136 14.32 | -52.14 -40.84 0.98 ||
> > || tfft | 1.50 32450 7.71 | 1.12 26546 7.67 | -25.33 -18.19 0.52 ||
> > ||========================================================================================================================||
> > || average | 19.57 | 19.46 | -29.07 -20.59 0.56 ||
> > ============================================================================================================================
>
> Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > From: Changpeng Fang <chfang@houghton.(none)>
> > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> >
> > * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> > * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
> > * cfgloop.h (mark_non_rolling_loop): New function declaration.
> > (non_rolling_loop_p): New function declaration.
> > * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> > NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
> > marked NON-ROLLING.
> > * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> > non-rolling loop.
> > * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> > function. (non_rolling_loop_p): Implement the new function.
> > * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> > non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
> > loop.
> > ---
> > gcc/basic-block.h | 6 +++++-
> > gcc/cfg.c | 7 ++++---
> > gcc/cfgloop.h | 2 ++
> > gcc/predict.c | 6 ++++++
> > gcc/tree-ssa-loop-manip.c | 3 +++
> > gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
> > gcc/tree-vect-loop-manip.c | 8 ++++++++
> > 7 files changed, 48 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > index be0a1d1..850472d 100644
> > --- a/gcc/basic-block.h
> > +++ b/gcc/basic-block.h
> > @@ -245,7 +245,11 @@ enum bb_flags
> >
> > /* Set on blocks that cannot be threaded through.
> > Only used in cfgcleanup.c. */
> > - BB_NONTHREADABLE_BLOCK = 1 << 11
> > + BB_NONTHREADABLE_BLOCK = 1 << 11,
> > +
> > + /* Set on blocks that are headers of non-rolling loops. */
> > + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > +
> > };
> >
> > /* Dummy flag for convenience in the hot/cold partitioning code. */
> > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > index c8ef799..e59a637 100644
> > --- a/gcc/cfg.c
> > +++ b/gcc/cfg.c
> > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> > connect_src (e);
> > }
> >
> > -/* Clear all basic block flags, with the exception of partitioning and
> > - setjmp_target. */
> > +/* Clear all basic block flags, with the exception of partitioning,
> > + setjmp_target, and the non-rolling loop marker. */
> > void
> > clear_bb_flags (void)
> > {
> > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> >
> > FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> > bb->flags = (BB_PARTITION (bb)
> > - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > + + BB_HEADER_OF_NONROLLING_LOOP)));
> > }
> >
> > /* Check the consistency of profile information. We can't do that
> > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > index bf2614e..e856a78 100644
> > --- a/gcc/cfgloop.h
> > +++ b/gcc/cfgloop.h
> > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> > void estimate_numbers_of_iterations_loop (struct loop *, bool);
> > HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> > bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > +void mark_non_rolling_loop (struct loop *);
> > +bool non_rolling_loop_p (struct loop *);
> >
> > /* Loop manipulation. */
> > extern bool can_duplicate_loop_p (const struct loop *loop);
> > diff --git a/gcc/predict.c b/gcc/predict.c
> > index c691990..bf729f8 100644
> > --- a/gcc/predict.c
> > +++ b/gcc/predict.c
> > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> > bool
> > optimize_loop_for_size_p (struct loop *loop)
> > {
> > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > + if (non_rolling_loop_p (loop))
> > + return true;
> > return optimize_bb_for_size_p (loop->header);
> > }
> >
> > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> > bool
> > optimize_loop_for_speed_p (struct loop *loop)
> > {
> > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > + if (non_rolling_loop_p (loop))
> > + return false;
> > return optimize_bb_for_speed_p (loop->header);
> > }
> >
> > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > index 87b2c0d..bc977bb 100644
> > --- a/gcc/tree-ssa-loop-manip.c
> > +++ b/gcc/tree-ssa-loop-manip.c
> > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> > gcc_assert (new_loop != NULL);
> > update_ssa (TODO_update_ssa);
> >
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > /* Determine the probability of the exit edge of the unrolled loop. */
> > new_est_niter = est_niter / factor;
> >
> > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > index ee85f6f..1e2e4b2 100644
> > --- a/gcc/tree-ssa-loop-niter.c
> > +++ b/gcc/tree-ssa-loop-niter.c
> > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> > fold_undefer_and_ignore_overflow_warnings ();
> > }
> >
> > +/* Mark LOOP as a non-rolling loop. */
> > +
> > +void
> > +mark_non_rolling_loop (struct loop *loop)
> > +{
> > + gcc_assert (loop && loop->header);
> > + loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > +}
> > +
> > +/* Return true if LOOP is a non-rolling loop. */
> > +
> > +bool
> > +non_rolling_loop_p (struct loop *loop)
> > +{
> > + int masked_flags;
> > + gcc_assert (loop && loop->header);
> > + masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > + return (masked_flags != 0);
> > +}
> > +
> > /* Returns true if statement S1 dominates statement S2. */
> >
> > bool
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index 6ecd304..216de78 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> > cond_expr, cond_expr_stmt_list);
> > gcc_assert (new_loop);
> > gcc_assert (loop_num == loop->num);
> > +
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > #ifdef ENABLE_CHECKING
> > slpeel_verify_cfg_after_peeling (loop, new_loop);
> > #endif
> > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> > th, true, NULL_TREE, NULL);
> >
> > gcc_assert (new_loop);
> > +
> > + /* NEW_LOOP is a non-rolling loop. */
> > + mark_non_rolling_loop (new_loop);
> > +
> > #ifdef ENABLE_CHECKING
> > slpeel_verify_cfg_after_peeling (new_loop, loop);
> > #endif
> > --
> > 1.6.3.3
> >