[PATCH] New fdo summary-based icache sensitive unrolling (issue6351086)
Teresa Johnson
tejohnson@google.com
Wed Jul 18 15:48:00 GMT 2012
Ping (retrying ping in plain text mode so that it goes through properly).
Thanks,
Teresa
On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote:
> Ports some patches related to improving FDO program summary information
> and using it to guide loop unrolling from google branches to mainline.
> The patch is enhanced to add additional summary information to aid
> in determining hot/cold decisions.
>
> The original patch description is at:
> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
> and further discussion about incorporating onto mainline is at:
> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414
>
> Honza, can you take a look to see if this patch meets your needs?
>
> Full description:
>
> This patch adds new program summary information to the gcov
> profile files that indicate how many profiled counts compose
> the majority of the program's execution time. This is used to
> provide an indication of the overall code size of the program.
>
> The new profile summary information is then used to guide
> codesize based unroll and peel decisions, to prevent those
> optimizations from increasing code size too much when the
> program may be sensitive to icache effects.
>
> This patch also pulls in dependent portions of google/main r187660 that cache
> additional loop analysis results in the niter_desc auxiliary information
> hanging off the loop structure (the optimization portions of that
> change are not included here, and have an outstanding review request
> for mainline).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>
> Thanks,
> Teresa
>
> 2012-07-11 Teresa Johnson <tejohnson@google.com>
>
> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
> (gcov_compute_cutoff_values): Ditto.
> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary
> information.
> * gcc/doc/invoke.texi: Document new options
> -fpeel-codesize-limit and -funroll-codesize-limit, and new params
> unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
> * gcc/gcov-io.c (gcov_write_summary): Write new summary info.
> (gcov_read_summary): Read new summary info.
> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
> hot_cutoff_value.
> * gcc/loop-unroll.c (code_size_limit_factor): New function.
> (decide_unroll_runtime_iterations): Call code_size_limit_factor
> to control the unroll factor, and retrieve number of branches from
> niter_desc instead of via function that walks loop.
> (decide_peel_simple, decide_unroll_stupid): Ditto.
> * gcc/coverage.c (read_counts_file): Propagate new summary info.
> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
> function, and add guards to enable this function to work for the
> outermost loop.
> * gcc/common.opt: Add -fpeel-codesize-limit and
> -funroll-codesize-limit.
> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
> (num_loop_branches): Remove.
> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
> additional loop analysis information.
> (num_loop_branches): Remove.
> (analyze_loop_insns): Declare.
> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
> * gcc/gcov-dump.c (tag_summary): Dump new summary info.
>
> Index: libgcc/libgcov.c
> ===================================================================
> --- libgcc/libgcov.c (revision 189413)
> +++ libgcc/libgcov.c (working copy)
> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
> return 1;
> }
>
> +/* Used by qsort to sort gcov values in descending order. */
> +
> +static int
> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
> +{
> + const gcov_type a = *(gcov_type const *)pa;
> + const gcov_type b = *(gcov_type const *)pb;
> +
> + if (b > a)
> + return 1;
> + else if (b == a)
> + return 0;
> + else
> + return -1;
> +}
> +
> +/* Determine the number of counters required to cover a given percentage
> + of the total sum of execution counts in the summary, and record the
> + result in SUM. */
> +
> +static void
> +gcov_compute_cutoff_values (struct gcov_summary *sum)
> +{
> + struct gcov_info *gi_ptr;
> + const struct gcov_fn_info *gfi_ptr;
> + const struct gcov_ctr_info *ci_ptr;
> + struct gcov_ctr_summary *cs_ptr;
> + unsigned t_ix, f_ix, i, ctr_info_ix, index;
> + gcov_unsigned_t c_num;
> + gcov_type *value_array;
> + gcov_type cum, cum_cutoff;
> + char *cutoff_str;
> + unsigned cutoff_perc;
> +
> +#define CUM_CUTOFF_PERCENT_TIMES_10 999
> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
> + if (cutoff_str && strlen (cutoff_str))
> + cutoff_perc = atoi (cutoff_str);
> + else
> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
> +
> + /* This currently only applies to arc counters. */
> + t_ix = GCOV_COUNTER_ARCS;
> +
> + /* First check if there are any counts recorded for this counter. */
> + cs_ptr = &(sum->ctrs[t_ix]);
> + if (!cs_ptr->num)
> + return;
> +
> + /* Determine the cumulative counter value at the specified cutoff
> + percentage and record the percentage for use by gcov consumers.
> + Check for overflow when sum_all is multiplied by the cutoff_perc,
> + and if so, do the divide first. */
> + if (cs_ptr->sum_all * cutoff_perc < cs_ptr->sum_all)
> + /* Overflow, do the divide first. */
> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
> + else
> + /* Otherwise multiply first to get the correct value for small
> + values of sum_all. */
> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
> +
> + /* Next, walk through all the per-object structures and save each of
> + the count values in value_array. */
> + index = 0;
> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
> + {
> + if (!gi_ptr->merge[t_ix])
> + continue;
> +
> + /* Find the appropriate index into the gcov_ctr_info array
> + for the counter we are currently working on based on the
> + existence of the merge function pointer for this object. */
> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
> + {
> + if (gi_ptr->merge[i])
> + ctr_info_ix++;
> + }
> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
> + {
> + gfi_ptr = gi_ptr->functions[f_ix];
> +
> + if (!gfi_ptr || gfi_ptr->key != gi_ptr)
> + continue;
> +
> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
> + /* Sanity check that there are enough entries in value_array
> + for this function's counters. Gracefully handle the case when
> + there are not, in case something in the profile info is
> + corrupted. */
> + c_num = ci_ptr->num;
> + if (index + c_num > cs_ptr->num)
> + c_num = cs_ptr->num - index;
> + /* Copy over this function's counter values. */
> + memcpy (&value_array[index], ci_ptr->values,
> + sizeof (gcov_type) * c_num);
> + index += c_num;
> + }
> + }
> +
> + /* Sort all the counter values by descending value and finally
> + accumulate the values from hottest on down until reaching
> + the cutoff value computed earlier. */
> + qsort (value_array, cs_ptr->num, sizeof (gcov_type),
> + sort_by_reverse_gcov_value);
> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
> + cum += value_array[c_num];
> + /* Record the number of counters required to reach the cutoff value,
> + as well as the counter value that caused cum to reach the cutoff. */
> + cs_ptr->num_hot_counters = c_num;
> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
> + free (value_array);
> +}
> +
> /* Dump the coverage counts. We merge with existing counts when
> possible, to avoid growing the .da files ad infinitum. We use this
> program's checksum to make sure we only accumulate whole program
> @@ -347,6 +461,7 @@ gcov_exit (void)
> }
> }
> }
> + gcov_compute_cutoff_values (&this_prg);
>
> {
> /* Check if the level of dirs to strip off specified. */
> @@ -598,7 +713,14 @@ gcov_exit (void)
> if (gi_ptr->merge[t_ix])
> {
> if (!cs_prg->runs++)
> - cs_prg->num = cs_tprg->num;
> + {
> + cs_prg->num = cs_tprg->num;
> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
> + {
> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
> + }
> + }
> cs_prg->sum_all += cs_tprg->sum_all;
> if (cs_prg->run_max < cs_tprg->run_max)
> cs_prg->run_max = cs_tprg->run_max;
> Index: gcc/doc/invoke.texi
> ===================================================================
> --- gcc/doc/invoke.texi (revision 189413)
> +++ gcc/doc/invoke.texi (working copy)
> @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}.
> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
> -fprefetch-loop-arrays @gol
> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
> -fprofile-generate=@var{path} @gol
> @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}.
> -ftree-reassoc @gol
> -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
> -fwhole-program -fwpa -fuse-linker-plugin @gol
> @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs
> @option{-funroll-all-loops} implies the same options as
> @option{-funroll-loops}.
>
> +@item -funroll-codesize-limit
> +@opindex funroll-codesize-limit
> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
> +when the profile summary estimates a large code footprint. Enabled by default
> +with @option{-fprofile-use}. Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
> @item -fpeel-loops
> @opindex fpeel-loops
> Peels loops for which there is enough information that they do not
> @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns
>
> Enabled with @option{-fprofile-use}.
>
> +@item -fpeel-codesize-limit
> +@opindex fpeel-codesize-limit
> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
> +when the profile summary estimates a large code footprint. Enabled by default
> +with @option{-fprofile-use}. Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
> @item -fmove-loop-invariants
> @opindex fmove-loop-invariants
> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled
> @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s
> @item max-completely-peel-loop-nest-depth
> The maximum depth of a loop nest suitable for complete peeling.
>
> +@item unrollpeel-codesize-threshold
> +Maximum profile-based code size footprint estimate for loop unrolling and
> +peeling.
> +
> +@item unrollpeel-hotness-threshold
> +Maximum ratio of total execution count to loop entry block count under which
> +most profile-based code size limits will be ignored for loop unrolling and
> +peeling.
> +
> @item max-unswitch-insns
> The maximum number of insns of an unswitched loop.
>
> Index: gcc/gcov-io.c
> ===================================================================
> --- gcc/gcov-io.c (revision 189413)
> +++ gcc/gcov-io.c (working copy)
> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
> {
> gcov_write_unsigned (csum->num);
> + gcov_write_unsigned (csum->num_hot_counters);
> + gcov_write_counter (csum->hot_cutoff_value);
> gcov_write_unsigned (csum->runs);
> gcov_write_counter (csum->sum_all);
> gcov_write_counter (csum->run_max);
> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
> {
> csum->num = gcov_read_unsigned ();
> + csum->num_hot_counters = gcov_read_unsigned ();
> + csum->hot_cutoff_value = gcov_read_counter ();
> csum->runs = gcov_read_unsigned ();
> csum->sum_all = gcov_read_counter ();
> csum->run_max = gcov_read_counter ();
> Index: gcc/gcov-io.h
> ===================================================================
> --- gcc/gcov-io.h (revision 189413)
> +++ gcc/gcov-io.h (working copy)
> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */
> #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
> #define GCOV_TAG_SUMMARY_LENGTH \
> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
>
> /* Counters that are collected. */
> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */
> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
> struct gcov_ctr_summary
> {
> gcov_unsigned_t num; /* number of counters. */
> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
> + percent of sum_all. */
> + gcov_type hot_cutoff_value; /* value of smallest counter included in
> + num_hot_counters. */
> gcov_unsigned_t runs; /* number of program runs */
> gcov_type sum_all; /* sum of all counters accumulated. */
> gcov_type run_max; /* maximum value on a single run. */
> Index: gcc/loop-unroll.c
> ===================================================================
> --- gcc/loop-unroll.c (revision 189413)
> +++ gcc/loop-unroll.c (working copy)
> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see
> #include "hashtab.h"
> #include "recog.h"
> #include "target.h"
> +#include "gcov-io.h"
>
> /* This pass performs loop unrolling and peeling. We only perform these
> optimizations on innermost loops (with single exception) because
> @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc
> basic_block);
> static rtx get_expansion (struct var_to_expand *);
>
> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
> + based on code footprint estimates. Returns the codesize-based factor to be
> + divided into the max instructions in an unrolled or peeled loop:
> + 1) For size <= threshold, do not limit (by returning 1).
> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
> + unrolled instructions according to loop hotness.
> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning
> + INT_MAX). */
> +
> +static int
> +code_size_limit_factor (struct loop *loop)
> +{
> + unsigned size_threshold;
> + struct niter_desc *desc = get_simple_loop_desc (loop);
> + gcov_type sum_to_header_ratio;
> + int hotness_ratio_threshold;
> + int limit_factor;
> +
> + /* First check if the application has a large codesize footprint.
> + This is estimated from FDO profile summary information for the
> + program, where the num_hot_counters indicates the number of hottest
> + counters (blocks) that compose most of the execution time of
> + the program. A large value would indicate a large flat execution
> + profile where icache misses may be a concern. */
> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
> + if (!profile_info
> + || profile_info->num_hot_counters <= size_threshold
> + || !profile_info->sum_all)
> + return 1;
> +
> + /* Next, exclude some loops where unrolling/peeling may be more
> + important to overall performance. */
> +
> + /* Ignore FP loops, which are more likely to benefit heavily from
> + unrolling. */
> + if (desc->has_fp)
> + return 1;
> +
> + /* Next, set the value of the codesize-based unroll factor divisor which in
> + most loops will need to be set to a value that will reduce or eliminate
> + unrolling/peeling. */
> + if (profile_info->num_hot_counters < size_threshold * 2)
> + {
> + /* For applications that are less than twice the codesize limit, allow
> + limited unrolling for very hot loops. */
> + sum_to_header_ratio = profile_info->sum_all / loop->header->count;
> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
> + /* When the ratio of the profile count sum to the loop entry header
> + count is smaller than the threshold (i.e. the loop entry is hot
> + enough), the divisor is set to 1 so the unroll/peel factor is not
> + reduced. When it is bigger than the threshold, increase the divisor
> + by the factor by which this ratio exceeds the threshold, which
> + quickly reduces the unroll/peel factor to zero as the loop's
> + hotness decreases. */
> + if (sum_to_header_ratio > hotness_ratio_threshold)
> + {
> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
> + gcc_assert (limit_factor >= 1);
> + }
> + else
> + limit_factor = 1;
> + }
> + else
> + /* For applications that are at least twice the codesize limit, set
> + the divisor to a large value that will force the unroll factor to 0. */
> + limit_factor = INT_MAX;
> +
> + return limit_factor;
> +}
> +
> /* Unroll and/or peel (depending on FLAGS) LOOPS. */
> void
> unroll_and_peel_loops (int flags)
> @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo
> {
> unsigned nunroll, nunroll_by_av, i;
> struct niter_desc *desc;
> + int limit_factor = 1;
>
> if (!(flags & UAP_UNROLL))
> {
> @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo
> "\n;; Considering unrolling loop with runtime "
> "computable number of iterations\n");
>
> + if (flag_unroll_codesize_limit)
> + {
> + /* Determine whether to limit code size growth from unrolling,
> + using FDO profile summary information that gives an
> + estimated number of executed blocks. */
> + limit_factor = code_size_limit_factor (loop);
> + if (dump_file && limit_factor > 1)
> + {
> + fprintf (dump_file,
> + ";; Due to large code size footprint estimate, limit "
> + "max unrolled insns by divisor %d\n", limit_factor);
> + }
> + }
> +
> /* nunroll = total number of copies of the original loop body in
> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */
> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> + / loop->ninsns;
> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> + / limit_factor / loop->av_ninsns;
> if (nunroll > nunroll_by_av)
> nunroll = nunroll_by_av;
> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags)
> {
> unsigned npeel;
> struct niter_desc *desc;
> + int limit_factor = 1;
>
> if (!(flags & UAP_PEEL))
> {
> @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags)
> if (dump_file)
> fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>
> + if (flag_peel_codesize_limit)
> + {
> + /* Determine whether to limit code size growth from peeling,
> + using FDO profile summary information that gives an
> + estimated number of executed blocks. */
> + limit_factor = code_size_limit_factor (loop);
> + if (dump_file && limit_factor > 1)
> + {
> + fprintf (dump_file,
> + ";; Due to large code size footprint estimate, limit "
> + "max peeled insns by divisor %d\n", limit_factor);
> + }
> + }
> +
> /* npeel = number of iterations to peel. */
> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
> + / loop->ninsns;
> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>
> @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags)
>
> /* Do not simply peel loops with branches inside -- it increases number
> of mispredicts. */
> - if (num_loop_branches (loop) > 1)
> + if (desc->num_branches > 1)
> {
> if (dump_file)
> fprintf (dump_file, ";; Not peeling, contains branches\n");
> @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags
> {
> unsigned nunroll, nunroll_by_av, i;
> struct niter_desc *desc;
> + int limit_factor = 1;
>
> if (!(flags & UAP_UNROLL_ALL))
> {
> @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags
> if (dump_file)
> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>
> + if (flag_unroll_codesize_limit)
> + {
> + /* Determine whether to limit code size growth from unrolling,
> + using FDO profile summary information that gives an
> + estimated number of executed blocks. */
> + limit_factor = code_size_limit_factor (loop);
> + if (dump_file && limit_factor > 1)
> + {
> + fprintf (dump_file,
> + ";; Due to large code size footprint estimate, limit "
> + "max unrolled insns by divisor %d\n", limit_factor);
> + }
> + }
> +
> /* nunroll = total number of copies of the original loop body in
> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */
> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> - nunroll_by_av
> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> + / loop->ninsns;
> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> + / limit_factor / loop->av_ninsns;
> if (nunroll > nunroll_by_av)
> nunroll = nunroll_by_av;
> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>
> /* Do not unroll loops with branches inside -- it increases number
> of mispredicts. */
> - if (num_loop_branches (loop) > 1)
> + if (desc->num_branches > 1)
> {
> if (dump_file)
> fprintf (dump_file, ";; Not unrolling, contains branches\n");
> Index: gcc/coverage.c
> ===================================================================
> --- gcc/coverage.c (revision 189413)
> +++ gcc/coverage.c (working copy)
> @@ -245,6 +245,14 @@ read_counts_file (void)
> gcov_read_summary (&sum);
> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
> {
> + if (summary.ctrs[ix].num_hot_counters
> + < sum.ctrs[ix].num_hot_counters)
> + {
> + summary.ctrs[ix].num_hot_counters
> + = sum.ctrs[ix].num_hot_counters;
> + summary.ctrs[ix].hot_cutoff_value
> + = sum.ctrs[ix].hot_cutoff_value;
> + }
> summary.ctrs[ix].runs += sum.ctrs[ix].runs;
> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
> Index: gcc/loop-iv.c
> ===================================================================
> --- gcc/loop-iv.c (revision 189413)
> +++ gcc/loop-iv.c (working copy)
> @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop)
> /* At least desc->infinite is not always initialized by
> find_simple_loop_exit. */
> desc = XCNEW (struct niter_desc);
> - iv_analysis_loop_init (loop);
> - find_simple_exit (loop, desc);
> + if (loop->latch != EXIT_BLOCK_PTR)
> + {
> + iv_analysis_loop_init (loop);
> + find_simple_exit (loop, desc);
> + }
> + analyze_loop_insns (loop, desc);
> loop->aux = desc;
>
> if (desc->simple_p && (desc->assumptions || desc->infinite))
> Index: gcc/common.opt
> ===================================================================
> --- gcc/common.opt (revision 189413)
> +++ gcc/common.opt (working copy)
> @@ -1551,6 +1551,10 @@ fpcc-struct-return
> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
> Return small aggregates in memory, not registers
>
> +fpeel-codesize-limit
> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
> +
> fpeel-loops
> Common Report Var(flag_peel_loops) Optimization
> Perform loop peeling
> @@ -2108,6 +2112,10 @@ funroll-all-loops
> Common Report Var(flag_unroll_all_loops) Optimization
> Perform loop unrolling for all loops
>
> +funroll-codesize-limit
> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
> +
> ; Nonzero means that loop optimizer may assume that the induction variables
> ; that control loops do not overflow and that the loops with nontrivial
> ; exit condition are not infinite
> Index: gcc/cfgloop.c
> ===================================================================
> --- gcc/cfgloop.c (revision 189413)
> +++ gcc/cfgloop.c (working copy)
> @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop)
> return edges;
> }
>
> -/* Counts the number of conditional branches inside LOOP. */
> +/* Determine if INSN is a floating point set. */
>
> -unsigned
> -num_loop_branches (const struct loop *loop)
> +static bool
> +insn_has_fp_set (rtx insn)
> {
> - unsigned i, n;
> - basic_block * body;
> + int i;
> + rtx pat = PATTERN (insn);
> + if (GET_CODE (pat) == SET)
> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
> + else if (GET_CODE (pat) == PARALLEL)
> + {
> + for (i = 0; i < XVECLEN (pat, 0); i++)
> + {
> + rtx sub = XVECEXP (pat, 0, i);
> + if (GET_CODE (sub) == SET)
> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
> + }
> + }
> + return false;
> +}
>
> - gcc_assert (loop->latch != EXIT_BLOCK_PTR);
> +/* Analyze the instructions inside LOOP, updating DESC. Currently counts
> + the number of conditional branch instructions, calls and fp instructions,
> + as well as the average number of branches executed per iteration. */
>
> +void
> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
> +{
> + unsigned i, nbranch;
> + gcov_type weighted_nbranch;
> + bool has_call, has_fp;
> + basic_block * body, bb;
> + rtx insn;
> + gcov_type header_count = loop->header->count;
> +
> + nbranch = weighted_nbranch = 0;
> + has_call = has_fp = false;
> +
> body = get_loop_body (loop);
> - n = 0;
> for (i = 0; i < loop->num_nodes; i++)
> - if (EDGE_COUNT (body[i]->succs) >= 2)
> - n++;
> + {
> + bb = body[i];
> +
> + if (EDGE_COUNT (bb->succs) >= 2)
> + {
> + nbranch++;
> +
> + /* If this block is executed less frequently than the header (loop
> + entry), then it is weighted based on its execution count, which
> + will be turned into a ratio compared to the loop header below. */
> + if (bb->count < header_count)
> + weighted_nbranch += bb->count;
> +
> + /* When it is executed more frequently than the header (i.e. it is
> + in a nested inner loop), simply weight the branch the same as the
> + header execution count, so that it will contribute 1 branch to
> + the ratio computed below. */
> + else
> + weighted_nbranch += header_count;
> + }
> +
> + /* No need to iterate through the instructions below if
> + both flags have already been set. */
> + if (has_call && has_fp)
> + continue;
> +
> + FOR_BB_INSNS (bb, insn)
> + {
> + if (!INSN_P (insn))
> + continue;
> +
> + if (!has_call)
> + has_call = CALL_P (insn);
> +
> + if (!has_fp)
> + has_fp = insn_has_fp_set (insn);
> + }
> + }
> free (body);
>
> - return n;
> + desc->num_branches = nbranch;
> + /* Now divide the weights computed above by the loop header execution count,
> + to compute the average number of branches through the loop. By adding
> + header_count/2 to the numerator we round to nearest with integer
> + division. */
> + if (header_count != 0)
> + desc->av_num_branches
> + = (weighted_nbranch + header_count/2) / header_count;
> + else
> + desc->av_num_branches = 0;
> + desc->has_call = has_call;
> + desc->has_fp = has_fp;
> }
>
> /* Adds basic block BB to LOOP. */
> Index: gcc/cfgloop.h
> ===================================================================
> --- gcc/cfgloop.h (revision 189413)
> +++ gcc/cfgloop.h (working copy)
> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
>
> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
> edge single_exit (const struct loop *);
> -extern unsigned num_loop_branches (const struct loop *);
>
> extern edge loop_preheader_edge (const struct loop *);
> extern edge loop_latch_edge (const struct loop *);
> @@ -366,7 +365,8 @@ struct rtx_iv
> };
>
> /* The description of an exit from the loop and of the number of iterations
> - till we take the exit. */
> + till we take the exit. Also includes other information used primarily
> + by the loop unroller. */
>
> struct niter_desc
> {
> @@ -407,6 +407,18 @@ struct niter_desc
>
> /* The number of iterations of the loop. */
> rtx niter_expr;
> +
> + /* The number of branches in the loop. */
> + unsigned num_branches;
> +
> + /* The number of executed branches per iteration. */
> + unsigned av_num_branches;
> +
> + /* Whether the loop contains a call instruction. */
> + bool has_call;
> +
> + /* Whether the loop contains fp instructions. */
> + bool has_fp;
> };
>
> extern void iv_analysis_loop_init (struct loop *);
> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
>
> extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
> extern void free_simple_loop_desc (struct loop *loop);
> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>
> static inline struct niter_desc *
> simple_loop_desc (struct loop *loop)
> Index: gcc/params.def
> ===================================================================
> --- gcc/params.def (revision 189413)
> +++ gcc/params.def (working copy)
> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
> "max-completely-peel-loop-nest-depth",
> "The maximum depth of a loop nest we completely peel",
> 8, 0, 0)
> +/* The maximum code size estimate under which loop unrolling and peeling
> + is allowed in a profile feedback compile. This currently applies to loops
> + with non-constant iteration counts and no floating point computations. */
> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
> + "unrollpeel-codesize-threshold",
> + "Maximum profile-based code size footprint estimate for loop unrolling "
> + "and peeling",
> + 15000, 0, 0)
> +/* The maximum ratio of total profiled execution counts to loop entry block
> + count that must be exceeded to ignore most code size limits when unrolling
> + and peeling. */
> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
> + "unrollpeel-hotness-threshold",
> + "Maximum ratio of total profiled execution count to loop entry "
> + "block count under which most codesize limits for unrolling and "
> + "peeling will be ignored",
> + 100, 1, 0)
>
> /* The maximum number of insns of an unswitched loop. */
> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
> Index: gcc/gcov-dump.c
> ===================================================================
> --- gcc/gcov-dump.c (revision 189413)
> +++ gcc/gcov-dump.c (working copy)
> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
> {
> printf ("\n");
> print_prefix (filename, 0, 0);
> - printf ("\t\tcounts=%u, runs=%u",
> - summary.ctrs[ix].num, summary.ctrs[ix].runs);
> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
> + summary.ctrs[ix].num,
> + summary.ctrs[ix].num_hot_counters,
> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
> + summary.ctrs[ix].runs);
>
> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);
>
> --
> This patch is available for review at http://codereview.appspot.com/6351086
--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413