This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
On Fri, Dec 17, 2010 at 10:13:24AM -0600, Fang, Changpeng wrote:
>
> Hi, Jack:
>
> Have you tested my original patch (I remembered you bootstrapped it) in this thread?
> That is with the same logic which only applies to prefetch, unswitch and unrolling?
>
> It helps a lot if you have data about that?
Not for pb05 benchmarking but I try that patch and benchmark against it as well.
Jack
>
> Thanks,
>
> Changpeng
>
>
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Friday, December 17, 2010 9:12 AM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
>
> On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> > Hi, Jack:
> >
> > Thanks for the testing.
> >
> > This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> > It would be helpful if you can provide analysis why they are slowed down.
>
> Unfortunately gprof is pretty broken on darwin10 but I can try to do profiling
> in Shark with -fno-omit-frame-pointer. Could you try rebenchmarking with the flags...
>
> -ffast-math -funroll-loops -O3 -fPIC -mtune=generic
>
> I'll try benchmarking again with '-Ofast -funroll-loops -mtune=generic' and
> see if the performance loss is reduced.
>
> >
> > We did see a significant compilation time reduction for most pb05 programs.
> > (I don't know why you do not have executable size data).
>
> The sources I have for pb05 aren't fixed for detecting the code size properly
> under darwin and I never spent the the time trying to fix that.
>
> >
> > >I would note that intel darwin now defaults
> > >to -mtune=core2 and always has defaulted to -fPIC.
> >
> > I could not understand these default for darwin. My understanding is that,
> > for x86_64, the default should be -mtune=generic.
>
> This is a recent change.
>
> Author: mrs
> Date: Wed Dec 8 23:32:27 2010
> New Revision: 167611
>
> URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=167611
> Log:
> 2010-12-08 Iain Sandoe <iains@gcc.gnu.org>
>
> gcc/config.gcc (with_cpu): Default i[34567]86-*-darwin* and
> x86_64-*-darwin* to with_cpu:-core2.
> gcc/config/i386/mmx.md (*mov<mode>_internal_rex64): Replace movq
> with movd for darwin assembler.
> gcc/config/i386/sse.md (*vec_concatv2di_rex64_sse4_1): Ditto.
> (*vec_concatv2di_rex64_sse): Ditto.
>
> Modified:
> trunk/gcc/ChangeLog
> trunk/gcc/config.gcc
> trunk/gcc/config/i386/mmx.md
> trunk/gcc/config/i386/sse.md
>
> We had intended to do this awhile back (since all darwin hardware is core2
> or recent Xeon), but held off until the core2 costs tables were improved to
> match those of generic.
>
> http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00000.html
>
> >
> > Thanks,
> >
> > Changpeng
> >
> > ________________________________________
> > From: Jack Howarth [howarth@bromo.med.uc.edu]
> > Sent: Thursday, December 16, 2010 11:31 PM
> > To: Fang, Changpeng
> > Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> >
> > On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > > Hi,
> > >
> > > Based on previous discussions, I modified the patch as such.
> > >
> > > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > > functions will be affected.
> > >
> > > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > > size decreases 20%, while a small (0.5%) performance increase was found which may
> > > be just noise.
> > >
> > > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> > >
> > > Is it OK to commit to trunk?
> >
> > Changpeng,
> > On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> > the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
> >
> > Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
> >
> > I get without the patch...
> >
> > ================================================================================
> > Date & Time : 16 Dec 2010 23:36:27
> > Test Name : gfortran_lin_O3
> > Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> > Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> > Maximum Times : 2000.0
> > Target Error % : 0.100
> > Minimum Repeats : 10
> > Maximum Repeats : 100
> >
> > Benchmark Compile Executable Ave Run Number Estim
> > Name (secs) (bytes) (secs) Repeats Err %
> > --------- ------- ---------- ------- ------- ------
> > ac 1.76 10000 8.78 12 0.0081
> > aermod 54.94 10000 17.28 10 0.0307
> > air 3.44 10000 5.53 13 0.0734
> > capacita 2.64 10000 32.65 10 0.0096
> > channel 0.89 10000 1.84 20 0.0977
> > doduc 8.12 10000 27.00 10 0.0132
> > fatigue 3.06 10000 8.36 10 0.0104
> > gas_dyn 5.00 10000 4.30 17 0.0915
> > induct 6.24 10000 12.42 10 0.0100
> > linpk 1.02 10000 15.50 12 0.0542
> > mdbx 2.55 10000 11.24 10 0.0256
> > nf 2.90 10000 30.16 20 0.0989
> > protein 7.98 10000 33.72 10 0.0070
> > rnflow 9.34 10000 23.21 10 0.0551
> > test_fpu 6.72 10000 8.05 10 0.0426
> > tfft 0.76 10000 1.87 10 0.0597
> >
> > Geometric Mean Execution Time = 10.87 seconds
> >
> > and with the patch...
> >
> > ================================================================================
> > Date & Time : 16 Dec 2010 21:31:06
> > Test Name : gfortran_lin_O3
> > Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> > Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> > Maximum Times : 2000.0
> > Target Error % : 0.100
> > Minimum Repeats : 10
> > Maximum Repeats : 100
> >
> > Benchmark Compile Executable Ave Run Number Estim
> > Name (secs) (bytes) (secs) Repeats Err %
> > --------- ------- ---------- ------- ------- ------
> > ac 1.19 10000 8.78 10 0.0099
> > aermod 47.91 10000 16.95 10 0.0123
> > air 2.85 10000 5.34 12 0.0715
> > capacita 1.63 10000 33.10 10 0.0361
> > channel 0.67 10000 1.87 10 0.0884
> > doduc 6.42 10000 27.35 10 0.0206
> > fatigue 2.10 10000 8.32 10 0.0194
> > gas_dyn 2.07 10000 4.30 17 0.0843
> > induct 5.38 10000 12.58 10 0.0088
> > linpk 0.71 10000 15.69 18 0.0796
> > mdbx 1.95 10000 11.41 10 0.0238
> > nf 1.24 10000 31.34 12 0.0991
> > protein 3.88 10000 35.13 10 0.0659
> > rnflow 4.73 10000 25.97 10 0.0629
> > test_fpu 3.66 10000 8.88 11 0.0989
> > tfft 0.52 10000 1.89 10 0.0403
> >
> > Geometric Mean Execution Time = 11.09 seconds
> >
> > This shows about a 2.0% performance reduction in the Geometric
> > Mean Execution Time. I would note that intel darwin now defaults
> > to -mtune=core2 and always has defaulted to -fPIC.
> > Jack
> >
> > >
> > > Thanks,
> > >
> > > Changpeng
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > ________________________________________
> > > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > > Sent: Thursday, December 16, 2010 12:47 PM
> > > To: Fang, Changpeng
> > > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> > >
> > > Hi,
> > >
> > > > For prefetching of prologue or epilogue loops, we have two choices (1) prefetching not
> > > > not unrolling, (2) not prefetching. Which one do you prefer?
> > >
> > > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > > loops),
> > >
> > > Zdenek
> > >
> >
> > Content-Description: polyhedron1.txt
> > > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> > >
> > > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> > > ============================================================================================================================
> > > || | Before Patch | After Patch | Changes ||
> > > ||========================================================================================================================||
> > > || Benchmark | Compile Binary Size Run Time | Compile Binary Size Run Time | Compile Binary Performance ||
> > > || Name | Time (s) (bytes) (s) | Time(s) (bytes) (secs) | Time (%) Size(%) (%) ||
> > > ||========================================================================================================================||
> > > || ac | 3.36 41976 13.26 | 2.52 34424 13.21 | -25.00 -17.99 0.38 ||
> > > || aermod | 103.45 1412221 44.36 | 86.55 1268861 43.08 | -16.34 -10.15 2.97 ||
> > > || air | 6.11 75186 11.73 | 5.72 71090 11.56 | -6.38 -5.45 1.47 ||
> > > || capacita | 6.83 91257 88.40 | 4.70 74585 88.01 | -31.19 -18.27 0.44 ||
> > > || channel | 2.14 39984 6.65 | 1.84 35888 6.69 | -14.02 -10.24 -0.60 ||
> > > || doduc | 12.78 198624 38.59 | 12.20 186336 38.18 | -4.54 -6.19 1.07 ||
> > > || fatigue | 9.11 110008 10.15 | 5.93 92472 10.12 | -34.91 -15.94 0.30 ||
> > > || gas_dyn | 15.69 149726 7.14 | 8.45 109342 6.96 | -46.14 -26.97 2.59 ||
> > > || induct | 10.98 191800 20.66 | 10.62 188168 20.61 | -3.28 -1.89 0.24 ||
> > > || linpk | 2.27 46073 19.03 | 1.68 33401 19.03 | -25.99 -27.50 0.00 ||
> > > || mdbx | 5.63 103731 21.41 | 4.24 83251 21.33 | -24.69 -19.74 0.38 ||
> > > || nf | 14.18 118451 22.88 | 5.55 60499 23.14 | -60.86 -48.92 -1.12 ||
> > > || protein | 34.20 177700 47.04 | 19.16 135012 46.94 | -43.98 -24.02 0.21 ||
> > > || rnflow | 42.13 283645 40.30 | 20.92 178477 40.65 | -50.34 -37.08 -0.86 ||
> > > || test_fpu | 30.17 252080 14.46 | 14.44 149136 14.32 | -52.14 -40.84 0.98 ||
> > > || tfft | 1.50 32450 7.71 | 1.12 26546 7.67 | -25.33 -18.19 0.52 ||
> > > ||========================================================================================================================||
> > > || average | 19.57 | 19.46 | -29.07 -20.59 0.56 ||
> > > ============================================================================================================================
> >
> > Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > > From: Changpeng Fang <chfang@houghton.(none)>
> > > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> > >
> > > * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> > > * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
> > > * cfgloop.h (mark_non_rolling_loop): New function declaration.
> > > (non_rolling_loop_p): New function declaration.
> > > * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> > > NON-ROLLING. (optimize_loop_for_speed_p): Return false if the loop was
> > > marked NON-ROLLING.
> > > * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> > > non-rolling loop.
> > > * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> > > function. (non_rolling_loop_p): Implement the new function.
> > > * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> > > non-rolling loop. (vect_do_peeling_for_alignment): Mark the non-rolling
> > > loop.
> > > ---
> > > gcc/basic-block.h | 6 +++++-
> > > gcc/cfg.c | 7 ++++---
> > > gcc/cfgloop.h | 2 ++
> > > gcc/predict.c | 6 ++++++
> > > gcc/tree-ssa-loop-manip.c | 3 +++
> > > gcc/tree-ssa-loop-niter.c | 20 ++++++++++++++++++++
> > > gcc/tree-vect-loop-manip.c | 8 ++++++++
> > > 7 files changed, 48 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > > index be0a1d1..850472d 100644
> > > --- a/gcc/basic-block.h
> > > +++ b/gcc/basic-block.h
> > > @@ -245,7 +245,11 @@ enum bb_flags
> > >
> > > /* Set on blocks that cannot be threaded through.
> > > Only used in cfgcleanup.c. */
> > > - BB_NONTHREADABLE_BLOCK = 1 << 11
> > > + BB_NONTHREADABLE_BLOCK = 1 << 11,
> > > +
> > > + /* Set on blocks that are headers of non-rolling loops. */
> > > + BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > > +
> > > };
> > >
> > > /* Dummy flag for convenience in the hot/cold partitioning code. */
> > > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > > index c8ef799..e59a637 100644
> > > --- a/gcc/cfg.c
> > > +++ b/gcc/cfg.c
> > > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> > > connect_src (e);
> > > }
> > >
> > > -/* Clear all basic block flags, with the exception of partitioning and
> > > - setjmp_target. */
> > > +/* Clear all basic block flags, with the exception of partitioning,
> > > + setjmp_target, and the non-rolling loop marker. */
> > > void
> > > clear_bb_flags (void)
> > > {
> > > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> > >
> > > FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> > > bb->flags = (BB_PARTITION (bb)
> > > - | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > > + | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > > + + BB_HEADER_OF_NONROLLING_LOOP)));
> > > }
> > >
> > > /* Check the consistency of profile information. We can't do that
> > > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > > index bf2614e..e856a78 100644
> > > --- a/gcc/cfgloop.h
> > > +++ b/gcc/cfgloop.h
> > > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> > > void estimate_numbers_of_iterations_loop (struct loop *, bool);
> > > HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> > > bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > > +void mark_non_rolling_loop (struct loop *);
> > > +bool non_rolling_loop_p (struct loop *);
> > >
> > > /* Loop manipulation. */
> > > extern bool can_duplicate_loop_p (const struct loop *loop);
> > > diff --git a/gcc/predict.c b/gcc/predict.c
> > > index c691990..bf729f8 100644
> > > --- a/gcc/predict.c
> > > +++ b/gcc/predict.c
> > > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> > > bool
> > > optimize_loop_for_size_p (struct loop *loop)
> > > {
> > > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > > + if (non_rolling_loop_p (loop))
> > > + return true;
> > > return optimize_bb_for_size_p (loop->header);
> > > }
> > >
> > > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> > > bool
> > > optimize_loop_for_speed_p (struct loop *loop)
> > > {
> > > + /* Loops marked NON-ROLLING are not likely to be hot. */
> > > + if (non_rolling_loop_p (loop))
> > > + return false;
> > > return optimize_bb_for_speed_p (loop->header);
> > > }
> > >
> > > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > > index 87b2c0d..bc977bb 100644
> > > --- a/gcc/tree-ssa-loop-manip.c
> > > +++ b/gcc/tree-ssa-loop-manip.c
> > > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> > > gcc_assert (new_loop != NULL);
> > > update_ssa (TODO_update_ssa);
> > >
> > > + /* NEW_LOOP is a non-rolling loop. */
> > > + mark_non_rolling_loop (new_loop);
> > > +
> > > /* Determine the probability of the exit edge of the unrolled loop. */
> > > new_est_niter = est_niter / factor;
> > >
> > > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > > index ee85f6f..1e2e4b2 100644
> > > --- a/gcc/tree-ssa-loop-niter.c
> > > +++ b/gcc/tree-ssa-loop-niter.c
> > > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> > > fold_undefer_and_ignore_overflow_warnings ();
> > > }
> > >
> > > +/* Mark LOOP as a non-rolling loop. */
> > > +
> > > +void
> > > +mark_non_rolling_loop (struct loop *loop)
> > > +{
> > > + gcc_assert (loop && loop->header);
> > > + loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > > +}
> > > +
> > > +/* Return true if LOOP is a non-rolling loop. */
> > > +
> > > +bool
> > > +non_rolling_loop_p (struct loop *loop)
> > > +{
> > > + int masked_flags;
> > > + gcc_assert (loop && loop->header);
> > > + masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > > + return (masked_flags != 0);
> > > +}
> > > +
> > > /* Returns true if statement S1 dominates statement S2. */
> > >
> > > bool
> > > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > > index 6ecd304..216de78 100644
> > > --- a/gcc/tree-vect-loop-manip.c
> > > +++ b/gcc/tree-vect-loop-manip.c
> > > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> > > cond_expr, cond_expr_stmt_list);
> > > gcc_assert (new_loop);
> > > gcc_assert (loop_num == loop->num);
> > > +
> > > + /* NEW_LOOP is a non-rolling loop. */
> > > + mark_non_rolling_loop (new_loop);
> > > +
> > > #ifdef ENABLE_CHECKING
> > > slpeel_verify_cfg_after_peeling (loop, new_loop);
> > > #endif
> > > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> > > th, true, NULL_TREE, NULL);
> > >
> > > gcc_assert (new_loop);
> > > +
> > > + /* NEW_LOOP is a non-rolling loop. */
> > > + mark_non_rolling_loop (new_loop);
> > > +
> > > #ifdef ENABLE_CHECKING
> > > slpeel_verify_cfg_after_peeling (new_loop, loop);
> > > #endif
> > > --
> > > 1.6.3.3
> > >
>