This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops


On Fri, Dec 17, 2010 at 10:13:24AM -0600, Fang, Changpeng wrote:
> 
> Hi, Jack:
> 
> Have you tested my original patch (I remembered you bootstrapped it) in this thread?
> That is with the same logic which only applies to prefetch, unswitch and unrolling?
> 
> It helps a lot if you have data about that?

Not for pb05 benchmarking but I try that patch and benchmark against it as well.
    Jack

> 
> Thanks,
> 
> Changpeng
> 
> 
> ________________________________________
> From: Jack Howarth [howarth@bromo.med.uc.edu]
> Sent: Friday, December 17, 2010 9:12 AM
> To: Fang, Changpeng
> Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> 
> On Fri, Dec 17, 2010 at 01:14:49AM -0600, Fang, Changpeng wrote:
> > Hi, Jack:
> >
> > Thanks for the testing.
> >
> > This patch is not supposed to slow down a program by 10% (rnflow and test_fpu).
> > It would be helpful if you can provide analysis why they are slowed down.
> 
> Unfortunately gprof is pretty broken on darwin10 but I can try to do profiling
> in Shark with -fno-omit-frame-pointer. Could you try rebenchmarking with the flags...
> 
> -ffast-math -funroll-loops -O3 -fPIC -mtune=generic
> 
> I'll try benchmarking again with '-Ofast -funroll-loops -mtune=generic' and
> see if the performance loss is reduced.
> 
> >
> > We did see a significant compilation time reduction for most pb05 programs.
> > (I don't know why you do not have executable size data).
> 
>   The sources I have for pb05 aren't fixed for detecting the code size properly
> under darwin and I never spent the the time trying to fix that.
> 
> >
> > >I would note that intel darwin now defaults
> > >to -mtune=core2 and always has defaulted to -fPIC.
> >
> > I could not understand these default for darwin. My understanding is that,
> > for x86_64, the default should be -mtune=generic.
> 
> This is a recent change.
> 
> Author: mrs
> Date: Wed Dec  8 23:32:27 2010
> New Revision: 167611
> 
> URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=167611
> Log:
> 2010-12-08  Iain Sandoe <iains@gcc.gnu.org>
> 
>         gcc/config.gcc (with_cpu): Default i[34567]86-*-darwin* and
>         x86_64-*-darwin* to with_cpu:-core2.
>         gcc/config/i386/mmx.md (*mov<mode>_internal_rex64): Replace movq
>         with movd for darwin assembler.
>         gcc/config/i386/sse.md (*vec_concatv2di_rex64_sse4_1): Ditto.
>         (*vec_concatv2di_rex64_sse): Ditto.
> 
> Modified:
>     trunk/gcc/ChangeLog
>     trunk/gcc/config.gcc
>     trunk/gcc/config/i386/mmx.md
>     trunk/gcc/config/i386/sse.md
> 
> We had intended to do this awhile back (since all darwin hardware is core2
> or recent Xeon), but held off until the core2 costs tables were improved to
> match those of generic.
> 
> http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00000.html
> 
> >
> > Thanks,
> >
> > Changpeng
> >
> > ________________________________________
> > From: Jack Howarth [howarth@bromo.med.uc.edu]
> > Sent: Thursday, December 16, 2010 11:31 PM
> > To: Fang, Changpeng
> > Cc: Zdenek Dvorak; Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> >
> > On Thu, Dec 16, 2010 at 06:13:52PM -0600, Fang, Changpeng wrote:
> > > Hi,
> > >
> > > Based on previous discussions, I modified the patch as such.
> > >
> > > If a loop is marked as non-rolling, optimize_loop_for_size_p returns TRUE
> > > and optimize_loop_for_speed_p returns FALSE. All users of these two
> > > functions will be affected.
> > >
> > > After applying the modified patch, pb05 compilation time decreases 29%, binary
> > > size decreases 20%, while a small (0.5%) performance increase was found which may
> > > be just noise.
> > >
> > > Modified patch passed bootstrapping and gcc regression tests on x86_64-unknown-linux-gnu.
> > >
> > > Is it OK to commit to trunk?
> >
> > Changpeng,
> >     On x86_64-apple-darwin10, I am finding a more severe penalty for this patch with
> > the pb05 benchmarks. Using a profiledbootstrap BOOT_CFLAGS="-g -O3" build with...
> >
> > Configured with: ../gcc-4.6-20101216/configure --prefix=/sw --prefix=/sw/lib/gcc4.6 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.6/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.6 --enable-checking=yes --enable-cloog-backend=isl --enable-build-with-cxx
> >
> > I get without the patch...
> >
> > ================================================================================
> > Date & Time     : 16 Dec 2010 23:36:27
> > Test Name       : gfortran_lin_O3
> > Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> > Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> > Maximum Times   :     2000.0
> > Target Error %  :      0.100
> > Minimum Repeats :    10
> > Maximum Repeats :   100
> >
> >    Benchmark   Compile  Executable   Ave Run  Number   Estim
> >         Name    (secs)     (bytes)    (secs) Repeats   Err %
> >    ---------   -------  ----------   ------- -------  ------
> >           ac      1.76       10000      8.78      12  0.0081
> >       aermod     54.94       10000     17.28      10  0.0307
> >          air      3.44       10000      5.53      13  0.0734
> >     capacita      2.64       10000     32.65      10  0.0096
> >      channel      0.89       10000      1.84      20  0.0977
> >        doduc      8.12       10000     27.00      10  0.0132
> >      fatigue      3.06       10000      8.36      10  0.0104
> >      gas_dyn      5.00       10000      4.30      17  0.0915
> >       induct      6.24       10000     12.42      10  0.0100
> >        linpk      1.02       10000     15.50      12  0.0542
> >         mdbx      2.55       10000     11.24      10  0.0256
> >           nf      2.90       10000     30.16      20  0.0989
> >      protein      7.98       10000     33.72      10  0.0070
> >       rnflow      9.34       10000     23.21      10  0.0551
> >     test_fpu      6.72       10000      8.05      10  0.0426
> >         tfft      0.76       10000      1.87      10  0.0597
> >
> > Geometric Mean Execution Time =      10.87 seconds
> >
> > and with the patch...
> >
> > ================================================================================
> > Date & Time     : 16 Dec 2010 21:31:06
> > Test Name       : gfortran_lin_O3
> > Compile Command : gfortran -ffast-math -funroll-loops -O3 %n.f90 -o %n
> > Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
> > Maximum Times   :     2000.0
> > Target Error %  :      0.100
> > Minimum Repeats :    10
> > Maximum Repeats :   100
> >
> >    Benchmark   Compile  Executable   Ave Run  Number   Estim
> >         Name    (secs)     (bytes)    (secs) Repeats   Err %
> >    ---------   -------  ----------   ------- -------  ------
> >           ac      1.19       10000      8.78      10  0.0099
> >       aermod     47.91       10000     16.95      10  0.0123
> >          air      2.85       10000      5.34      12  0.0715
> >     capacita      1.63       10000     33.10      10  0.0361
> >      channel      0.67       10000      1.87      10  0.0884
> >        doduc      6.42       10000     27.35      10  0.0206
> >      fatigue      2.10       10000      8.32      10  0.0194
> >      gas_dyn      2.07       10000      4.30      17  0.0843
> >       induct      5.38       10000     12.58      10  0.0088
> >        linpk      0.71       10000     15.69      18  0.0796
> >         mdbx      1.95       10000     11.41      10  0.0238
> >           nf      1.24       10000     31.34      12  0.0991
> >      protein      3.88       10000     35.13      10  0.0659
> >       rnflow      4.73       10000     25.97      10  0.0629
> >     test_fpu      3.66       10000      8.88      11  0.0989
> >         tfft      0.52       10000      1.89      10  0.0403
> >
> > Geometric Mean Execution Time =      11.09 seconds
> >
> > This shows about a 2.0% performance reduction in the Geometric
> > Mean Execution Time. I would note that intel darwin now defaults
> > to -mtune=core2 and always has defaulted to -fPIC.
> >            Jack
> >
> > >
> > > Thanks,
> > >
> > > Changpeng
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > ________________________________________
> > > From: Zdenek Dvorak [rakdver@kam.mff.cuni.cz]
> > > Sent: Thursday, December 16, 2010 12:47 PM
> > > To: Fang, Changpeng
> > > Cc: Richard Guenther; Xinliang David Li; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH, Loop optimizer]: Add logic to disable certain loop optimizations on pre-/post-loops
> > >
> > > Hi,
> > >
> > > > For prefetching of prologue or epilogue loops, we have two choices (1) prefetching not
> > > > not unrolling, (2) not prefetching.  Which one do you prefer?
> > >
> > > it is better not to prefetch (the current placement of prefetches is not good for non-rolling
> > > loops),
> > >
> > > Zdenek
> > >
> >
> > Content-Description: polyhedron1.txt
> > > Comparison of Polyhedron (2005) Compile Time, Binary Size and Performance After Applying the NON-ROLLING Marking Patch
> > >
> > > gfortran -Ofast -funroll-loops -march=amdfam10 %n.f90 -o %n
> > >     ============================================================================================================================
> > > ||             |  Before Patch                     |  After Patch                       | Changes                       ||
> > > ||========================================================================================================================||
> > > ||  Benchmark    |  Compile    Binary Size  Run Time |  Compile   Binary Size Run Time  | Compile  Binary   Performance ||
> > > ||        Name   |  Time (s)   (bytes)           (s)       |  Time(s)     (bytes)      (secs)   | Time (%) Size(%)    (%)       ||
> > > ||========================================================================================================================||
> > > ||          ac   |    3.36       41976            13.26    |    2.52       34424      13.21     |  -25.00   -17.99    0.38      ||
> > > ||      aermod   |  103.45     1412221            44.36    |   86.55     1268861      43.08     |  -16.34   -10.15    2.97      ||
> > > ||         air   |    6.11       75186            11.73    |    5.72       71090      11.56     |   -6.38    -5.45    1.47      ||
> > > ||    capacita   |    6.83       91257            88.40    |    4.70       74585      88.01     |  -31.19   -18.27    0.44      ||
> > > ||     channel   |    2.14       39984             6.65    |    1.84       35888       6.69     |  -14.02   -10.24   -0.60      ||
> > > ||       doduc   |   12.78      198624            38.59    |   12.20      186336      38.18     |   -4.54    -6.19    1.07      ||
> > > ||     fatigue   |    9.11      110008            10.15    |    5.93       92472      10.12     |  -34.91   -15.94    0.30      ||
> > > ||     gas_dyn   |   15.69      149726             7.14    |    8.45      109342       6.96     |  -46.14   -26.97    2.59      ||
> > > ||      induct   |   10.98      191800            20.66    |   10.62      188168      20.61     |   -3.28    -1.89    0.24      ||
> > > ||       linpk   |    2.27       46073            19.03    |    1.68       33401      19.03     |  -25.99   -27.50    0.00      ||
> > > ||        mdbx   |    5.63      103731            21.41    |    4.24       83251      21.33     |  -24.69   -19.74    0.38      ||
> > > ||          nf   |   14.18      118451            22.88    |    5.55       60499      23.14     |  -60.86   -48.92   -1.12      ||
> > > ||     protein   |   34.20      177700            47.04    |   19.16      135012      46.94     |  -43.98   -24.02    0.21      ||
> > > ||      rnflow   |   42.13      283645            40.30    |   20.92      178477      40.65     |  -50.34   -37.08   -0.86      ||
> > > ||    test_fpu   |   30.17      252080            14.46    |   14.44      149136      14.32     |  -52.14   -40.84    0.98      ||
> > > ||        tfft   |    1.50       32450             7.71    |    1.12       26546       7.67     |  -25.33   -18.19    0.52      ||
> > > ||========================================================================================================================||
> > > ||       average   |                      19.57    |                          19.46     |  -29.07   -20.59    0.56      ||
> > > ============================================================================================================================
> >
> > Content-Description: 0001-Don-t-perform-certain-loop-optimizations-on-pre-post.patch
> > > From cd8b85bba1b39e108235f44d9d07918179ff3d79 Mon Sep 17 00:00:00 2001
> > > From: Changpeng Fang <chfang@houghton.(none)>
> > > Date: Mon, 13 Dec 2010 12:01:49 -0800
> > > Subject: [PATCH] Don't perform certain loop optimizations on pre/post loops
> > >
> > >       * basic-block.h (bb_flags): Add a new flag BB_HEADER_OF_NONROLLING_LOOP.
> > >       * cfg.c (clear_bb_flags): Keep BB_HEADER_OF_NONROLLING marker.
> > >       * cfgloop.h (mark_non_rolling_loop): New function declaration.
> > >         (non_rolling_loop_p): New function declaration.
> > >       * predict.c (optimize_loop_for_size_p): Return true if the loop was marked
> > >         NON-ROLLING.  (optimize_loop_for_speed_p): Return false if the loop was
> > >         marked NON-ROLLING.
> > >       * tree-ssa-loop-manip.c (tree_transform_and_unroll_loop): Mark the
> > >         non-rolling loop.
> > >       * tree-ssa-loop-niter.c (mark_non_rolling_loop): Implement the new
> > >         function.  (non_rolling_loop_p): Implement the new function.
> > >       * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Mark the
> > >         non-rolling loop.  (vect_do_peeling_for_alignment): Mark the non-rolling
> > >         loop.
> > > ---
> > >  gcc/basic-block.h          |    6 +++++-
> > >  gcc/cfg.c                  |    7 ++++---
> > >  gcc/cfgloop.h              |    2 ++
> > >  gcc/predict.c              |    6 ++++++
> > >  gcc/tree-ssa-loop-manip.c  |    3 +++
> > >  gcc/tree-ssa-loop-niter.c  |   20 ++++++++++++++++++++
> > >  gcc/tree-vect-loop-manip.c |    8 ++++++++
> > >  7 files changed, 48 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/gcc/basic-block.h b/gcc/basic-block.h
> > > index be0a1d1..850472d 100644
> > > --- a/gcc/basic-block.h
> > > +++ b/gcc/basic-block.h
> > > @@ -245,7 +245,11 @@ enum bb_flags
> > >
> > >    /* Set on blocks that cannot be threaded through.
> > >       Only used in cfgcleanup.c.  */
> > > -  BB_NONTHREADABLE_BLOCK = 1 << 11
> > > +  BB_NONTHREADABLE_BLOCK = 1 << 11,
> > > +
> > > +  /* Set on blocks that are headers of non-rolling loops.  */
> > > +  BB_HEADER_OF_NONROLLING_LOOP = 1 << 12
> > > +
> > >  };
> > >
> > >  /* Dummy flag for convenience in the hot/cold partitioning code.  */
> > > diff --git a/gcc/cfg.c b/gcc/cfg.c
> > > index c8ef799..e59a637 100644
> > > --- a/gcc/cfg.c
> > > +++ b/gcc/cfg.c
> > > @@ -425,8 +425,8 @@ redirect_edge_pred (edge e, basic_block new_pred)
> > >    connect_src (e);
> > >  }
> > >
> > > -/* Clear all basic block flags, with the exception of partitioning and
> > > -   setjmp_target.  */
> > > +/* Clear all basic block flags, with the exception of partitioning,
> > > +   setjmp_target, and the non-rolling loop marker.  */
> > >  void
> > >  clear_bb_flags (void)
> > >  {
> > > @@ -434,7 +434,8 @@ clear_bb_flags (void)
> > >
> > >    FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
> > >      bb->flags = (BB_PARTITION (bb)
> > > -              | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET)));
> > > +              | (bb->flags & (BB_DISABLE_SCHEDULE + BB_RTL + BB_NON_LOCAL_GOTO_TARGET
> > > +                              + BB_HEADER_OF_NONROLLING_LOOP)));
> > >  }
> > >
> > >  /* Check the consistency of profile information.  We can't do that
> > > diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
> > > index bf2614e..e856a78 100644
> > > --- a/gcc/cfgloop.h
> > > +++ b/gcc/cfgloop.h
> > > @@ -279,6 +279,8 @@ extern rtx doloop_condition_get (rtx);
> > >  void estimate_numbers_of_iterations_loop (struct loop *, bool);
> > >  HOST_WIDE_INT estimated_loop_iterations_int (struct loop *, bool);
> > >  bool estimated_loop_iterations (struct loop *, bool, double_int *);
> > > +void mark_non_rolling_loop (struct loop *);
> > > +bool non_rolling_loop_p (struct loop *);
> > >
> > >  /* Loop manipulation.  */
> > >  extern bool can_duplicate_loop_p (const struct loop *loop);
> > > diff --git a/gcc/predict.c b/gcc/predict.c
> > > index c691990..bf729f8 100644
> > > --- a/gcc/predict.c
> > > +++ b/gcc/predict.c
> > > @@ -279,6 +279,9 @@ optimize_insn_for_speed_p (void)
> > >  bool
> > >  optimize_loop_for_size_p (struct loop *loop)
> > >  {
> > > +  /* Loops marked NON-ROLLING are not likely to be hot.  */
> > > +  if (non_rolling_loop_p (loop))
> > > +    return true;
> > >    return optimize_bb_for_size_p (loop->header);
> > >  }
> > >
> > > @@ -287,6 +290,9 @@ optimize_loop_for_size_p (struct loop *loop)
> > >  bool
> > >  optimize_loop_for_speed_p (struct loop *loop)
> > >  {
> > > +  /* Loops marked NON-ROLLING are not likely to be hot.  */
> > > +  if (non_rolling_loop_p (loop))
> > > +    return false;
> > >    return optimize_bb_for_speed_p (loop->header);
> > >  }
> > >
> > > diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> > > index 87b2c0d..bc977bb 100644
> > > --- a/gcc/tree-ssa-loop-manip.c
> > > +++ b/gcc/tree-ssa-loop-manip.c
> > > @@ -931,6 +931,9 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
> > >    gcc_assert (new_loop != NULL);
> > >    update_ssa (TODO_update_ssa);
> > >
> > > +  /* NEW_LOOP is a non-rolling loop.  */
> > > +  mark_non_rolling_loop (new_loop);
> > > +
> > >    /* Determine the probability of the exit edge of the unrolled loop.  */
> > >    new_est_niter = est_niter / factor;
> > >
> > > diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
> > > index ee85f6f..1e2e4b2 100644
> > > --- a/gcc/tree-ssa-loop-niter.c
> > > +++ b/gcc/tree-ssa-loop-niter.c
> > > @@ -3011,6 +3011,26 @@ estimate_numbers_of_iterations (bool use_undefined_p)
> > >    fold_undefer_and_ignore_overflow_warnings ();
> > >  }
> > >
> > > +/* Mark LOOP as a non-rolling loop.  */
> > > +
> > > +void
> > > +mark_non_rolling_loop (struct loop *loop)
> > > +{
> > > +  gcc_assert (loop && loop->header);
> > > +  loop->header->flags |= BB_HEADER_OF_NONROLLING_LOOP;
> > > +}
> > > +
> > > +/* Return true if LOOP is a non-rolling loop.  */
> > > +
> > > +bool
> > > +non_rolling_loop_p (struct loop *loop)
> > > +{
> > > +  int masked_flags;
> > > +  gcc_assert (loop && loop->header);
> > > +  masked_flags = (loop->header->flags & BB_HEADER_OF_NONROLLING_LOOP);
> > > +  return (masked_flags != 0);
> > > +}
> > > +
> > >  /* Returns true if statement S1 dominates statement S2.  */
> > >
> > >  bool
> > > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > > index 6ecd304..216de78 100644
> > > --- a/gcc/tree-vect-loop-manip.c
> > > +++ b/gcc/tree-vect-loop-manip.c
> > > @@ -1938,6 +1938,10 @@ vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> > >                                           cond_expr, cond_expr_stmt_list);
> > >    gcc_assert (new_loop);
> > >    gcc_assert (loop_num == loop->num);
> > > +
> > > +  /* NEW_LOOP is a non-rolling loop.  */
> > > +  mark_non_rolling_loop (new_loop);
> > > +
> > >  #ifdef ENABLE_CHECKING
> > >    slpeel_verify_cfg_after_peeling (loop, new_loop);
> > >  #endif
> > > @@ -2191,6 +2195,10 @@ vect_do_peeling_for_alignment (loop_vec_info loop_vinfo)
> > >                                  th, true, NULL_TREE, NULL);
> > >
> > >    gcc_assert (new_loop);
> > > +
> > > +  /* NEW_LOOP is a non-rolling loop.  */
> > > +  mark_non_rolling_loop (new_loop);
> > > +
> > >  #ifdef ENABLE_CHECKING
> > >    slpeel_verify_cfg_after_peeling (new_loop, loop);
> > >  #endif
> > > --
> > > 1.6.3.3
> > >
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]