Patch to Avoid Bad Prefetching

Fri Jun 5 15:01:00 GMT 2009

On Fri, Jun 5, 2009 at 4:53 PM, Shobaki, Ghassan<Ghassan.Shobaki@amd.com> wrote:
> A ChangeLog entry was included in the original email. Here it is again. Does it look OK?

It misses the doc/invoke.texi changes but otherwise looks ok.

Richard.

> 2009-06-02  Ghassan Shobaki  <ghassan.shobaki@amd.com>
>
>        * tree-ssa-loop-prefetch.c
>        (gather_memory_references): Introduced a counter for the number of
>        memory references.
>        (anything_to_prefetch_p): Introduced a counter for the number of
>        prefetches.
>        (is_loop_prefetching_profitable): New function with a cost model
>        for prefetching.
>        (loop_prefetch_arrays): Use the new cost model to determine if
>        prefetching is profitable.
>        * params.def (MIN_INSN_TO_PREFETCH_RATIO,
>        PREFETCH_MIN_INSN_TO_MEM_RATIO): New parameters.
>        * params.h (MIN_INSN_TO_PREFETCH_RATIO,
>        PREFETCH_MIN_INSN_TO_MEM_RATIO): New parameters.
>
> Thanks
> -Ghassan
>
> -----Original Message-----
> From: Richard Guenther [mailto:richard.guenther@gmail.com]
> Sent: Friday, June 05, 2009 2:01 AM
> To: Shobaki, Ghassan
> Cc: Zdenek Dvorak; gcc-patches@gcc.gnu.org
> Subject: Re: Patch to Avoid Bad Prefetching
>
> On Fri, Jun 5, 2009 at 4:29 AM, Shobaki, Ghassan
> <Ghassan.Shobaki@amd.com> wrote:
>> Here is the revised patch. I made the changes requested by Zdenek and
>> changed the default parameter value to 3. I have also added entries for
>> the new parameters to doc/invoke.texi. The patch bootstrapped and passed
>> "make check". Is it now OK for check in?
>
> Ok with a proper ChangeLog entry.
>
> Thanks,
> Richard.
>
>> Thanks
>> -Ghassan
>>
>>
>> Index: doc/invoke.texi
>> ===================================================================
>> --- doc/invoke.texi     (revision 148159)
>> +++ doc/invoke.texi     (working copy)
>> @@ -7839,6 +7839,15 @@ The size of L1 cache, in kilobytes.
>>  @item l2-cache-size
>>  The size of L2 cache, in kilobytes.
>>
>> +@item min-insn-to-prefetch-ratio
>> +The minimum ratio between the number of instructions and the
>> +number of prefetches to enable prefetching in a loop with an
>> +unknown trip count.
>> +
>> +@item prefetch-min-insn-to-mem-ratio
>> +The minimum ratio between the number of instructions and the
>> +number of memory references to enable prefetching in a loop.
>> +
>>  @item use-canonical-types
>>  Whether the compiler should use the ``canonical'' type system.  By
>>  default, this should always be 1, which uses a more efficient internal
>> Index: params.h
>> ===================================================================
>> --- params.h    (revision 148159)
>> +++ params.h    (working copy)
>> @@ -166,4 +166,8 @@ typedef enum compiler_param
>>   PARAM_VALUE (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP)
>>  #define SLP_MAX_INSNS_IN_BB \
>>   PARAM_VALUE (PARAM_SLP_MAX_INSNS_IN_BB)
>> +#define MIN_INSN_TO_PREFETCH_RATIO \
>> +  PARAM_VALUE (PARAM_MIN_INSN_TO_PREFETCH_RATIO)
>> +#define PREFETCH_MIN_INSN_TO_MEM_RATIO \
>> +  PARAM_VALUE (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO)
>>  #endif /* ! GCC_PARAMS_H */
>> Index: tree-ssa-loop-prefetch.c
>> ===================================================================
>> --- tree-ssa-loop-prefetch.c    (revision 148159)
>> +++ tree-ssa-loop-prefetch.c    (working copy)
>> @@ -109,6 +109,23 @@ along with GCC; see the file COPYING3.
>>       prefetch instructions with guards in cases where 5) was not
>> sufficient
>>       to satisfy the constraints?
>>
>> +   The function is_loop_prefetching_profitable() implements a cost
>> model
>> +   to determine if prefetching is profitable for a given loop. The cost
>> +   model has two heuristcs:
>> +   1. A heuristic that determines whether the given loop has enough CPU
>> +      ops that can be overlapped with cache missing memory ops.
>> +      If not, the loop won't benefit from prefetching. This is
>> implemented
>> +      by requirung the ratio between the instruction count and the mem
>> ref
>> +      count to be above a certain minimum.
>> +   2. A heuristic that disables prefetching in a loop with an unknown
>> trip
>> +      count if the prefetching cost is above a certain limit. The
>> relative
>> +      prefetching cost is estimated by taking the ratio between the
>> +      prefetch count and the total intruction count (this models the
>> I-cache
>> +      cost).
>> +   The limits used in these heuristics are defined as parameters with
>> +   reasonable default values. Machine-specific default values will be
>> +   added later.
>> +
>>    Some other TODO:
>>       -- write and use more general reuse analysis (that could be also
>> used
>>         in other cache aimed loop optimizations)
>> @@ -476,7 +493,7 @@ gather_memory_references_ref (struct loo
>>    true if there are no other memory references inside the loop.  */
>>
>>  static struct mem_ref_group *
>> -gather_memory_references (struct loop *loop, bool *no_other_refs)
>> +gather_memory_references (struct loop *loop, bool *no_other_refs,
>> unsigned *ref_count)
>>  {
>>   basic_block *body = get_loop_body_in_dom_order (loop);
>>   basic_block bb;
>> @@ -487,6 +504,7 @@ gather_memory_references (struct loop *l
>>   struct mem_ref_group *refs = NULL;
>>
>>   *no_other_refs = true;
>> +  *ref_count = 0;
>>
>>   /* Scan the loop body in order, so that the former references precede
>> the
>>      later ones.  */
>> @@ -513,11 +531,17 @@ gather_memory_references (struct loop *l
>>          rhs = gimple_assign_rhs1 (stmt);
>>
>>          if (REFERENCE_CLASS_P (rhs))
>> +           {
>>            *no_other_refs &= gather_memory_references_ref (loop, &refs,
>>                                                            rhs, false,
>> stmt);
>> +           *ref_count += 1;
>> +           }
>>          if (REFERENCE_CLASS_P (lhs))
>> +           {
>>            *no_other_refs &= gather_memory_references_ref (loop, &refs,
>>                                                            lhs, true,
>> stmt);
>> +           *ref_count += 1;
>> +           }
>>        }
>>     }
>>   free (body);
>> @@ -846,20 +870,20 @@ schedule_prefetches (struct mem_ref_grou
>>   return any;
>>  }
>>
>> -/* Determine whether there is any reference suitable for prefetching
>> -   in GROUPS.  */
>> +/* Estimate the number of prefetches in the given GROUPS.  */
>>
>> -static bool
>> -anything_to_prefetch_p (struct mem_ref_group *groups)
>> +static int
>> +estimate_prefetch_count (struct mem_ref_group *groups)
>>  {
>>   struct mem_ref *ref;
>> +  int prefetch_count = 0;
>>
>>   for (; groups; groups = groups->next)
>>     for (ref = groups->refs; ref; ref = ref->next)
>>       if (should_issue_prefetch_p (ref))
>> -       return true;
>> +         prefetch_count++;
>>
>> -  return false;
>> +  return prefetch_count;
>>  }
>>
>>  /* Issue prefetches for the reference REF into loop as decided before.
>> @@ -1449,6 +1473,71 @@ determine_loop_nest_reuse (struct loop *
>>     }
>>  }
>>
>> +/* Do a cost-benefit analysis to determine if prefetching is profitable
>> +   for the current loop given the following parameters:
>> +   AHEAD: the iteration ahead distance,
>> +   EST_NITER: the estimated trip count,
>> +   NINSNS: estimated number of instructions in the loop,
>> +   PREFETCH_COUNT: an estimate of the number of prefetches
>> +   MEM_REF_COUNT: total number of memory references in the loop.  */
>> +
>> +static bool
>> +is_loop_prefetching_profitable (unsigned ahead, HOST_WIDE_INT
>> est_niter,
>> +                               unsigned ninsns, unsigned
>> prefetch_count,
>> +                               unsigned mem_ref_count)
>> +{
>> +  int insn_to_mem_ratio, insn_to_prefetch_ratio;
>> +
>> +  if (mem_ref_count == 0)
>> +    return false;
>> +
>> +  /* Prefetching improves performance by overlapping cache missing
>> +     memory accesses with CPU operations.  If the loop does not have
>> +     enough CPU operations to overlap with memory operations,
>> prefetching
>> +     won't give a significant benefit.  One approximate way of checking
>>
>> +     this is to require the ratio of instructions to memory references
>> to
>> +     be above a certain limit.  This approximation works well in
>> practice.
>> +     TODO: Implement a more precise computation by estimating the time
>> +     for each CPU or memory op in the loop. Time estimates for memory
>> ops
>> +     should account for cache misses.  */
>> +  insn_to_mem_ratio = ninsns / mem_ref_count;
>> +
>> +  if (insn_to_mem_ratio < PREFETCH_MIN_INSN_TO_MEM_RATIO)
>> +    return false;
>> +
>> +  /* Profitability of prefetching is highly dependent on the trip
>> count.
>> +     For a given AHEAD distance, the first AHEAD iterations do not
>> benefit
>> +     from prefetching, and the last AHEAD iterations execute useless
>> +     prefetches.  So, if the trip count is not large enough relative to
>> AHEAD,
>> +     prefetching may cause serious performance degradation.  To avoid
>> this
>> +     problem when the trip count is not known at compile time, we
>> +     conservatively skip loops with high prefetching costs.  For now,
>> only
>> +     the I-cache cost is considered.  The relative I-cache cost is
>> estimated
>> +     by taking the ratio between the number of prefetches and the total
>> +     number of instructions.  Since we are using integer arithmetic, we
>> +     compute the reciprocal of this ratio.
>> +     TODO: Account for loop unrolling, which may reduce the costs of
>> +     shorter stride prefetches.  Note that not accounting for loop
>> +     unrolling over-estimates the cost and hence gives more
>> conservative
>> +     results.  */
>> +  if (est_niter < 0)
>> +    {
>> +      insn_to_prefetch_ratio = ninsns / prefetch_count;
>> +      return insn_to_prefetch_ratio >= MIN_INSN_TO_PREFETCH_RATIO;
>> +    }
>> +
>> +  if (est_niter <= (HOST_WIDE_INT) ahead)
>> +    {
>> +      if (dump_file && (dump_flags & TDF_DETAILS))
>> +       fprintf (dump_file,
>> +                "Not prefetching -- loop estimated to roll only %d
>> times\n",
>> +                (int) est_niter);
>> +      return false;
>> +    }
>> +  return true;
>> +}
>> +
>> +
>>  /* Issue prefetch instructions for array references in LOOP.  Returns
>>    true if the LOOP was unrolled.  */
>>
>> @@ -1460,6 +1549,8 @@ loop_prefetch_arrays (struct loop *loop)
>>   HOST_WIDE_INT est_niter;
>>   struct tree_niter_desc desc;
>>   bool unrolled = false, no_other_refs;
>> +  unsigned prefetch_count;
>> +  unsigned mem_ref_count;
>>
>>   if (optimize_loop_nest_for_size_p (loop))
>>     {
>> @@ -1469,12 +1560,13 @@ loop_prefetch_arrays (struct loop *loop)
>>     }
>>
>>   /* Step 1: gather the memory references.  */
>> -  refs = gather_memory_references (loop, &no_other_refs);
>> +  refs = gather_memory_references (loop, &no_other_refs,
>> &mem_ref_count);
>>
>>   /* Step 2: estimate the reuse effects.  */
>>   prune_by_reuse (refs);
>>
>> -  if (!anything_to_prefetch_p (refs))
>> +  prefetch_count = estimate_prefetch_count (refs);
>> +  if (prefetch_count == 0)
>>     goto fail;
>>
>>   determine_loop_nest_reuse (loop, refs, no_other_refs);
>> @@ -1485,27 +1577,22 @@ loop_prefetch_arrays (struct loop *loop)
>>      the loop body.  */
>>   time = tree_num_loop_insns (loop, &eni_time_weights);
>>   ahead = (PREFETCH_LATENCY + time - 1) / time;
>> -  est_niter = estimated_loop_iterations_int (loop, false);
>> -
>> -  /* The prefetches will run for AHEAD iterations of the original loop.
>> Unless
>> -     the loop rolls at least AHEAD times, prefetching the references
>> does not
>> -     make sense.  */
>> -  if (est_niter >= 0 && est_niter <= (HOST_WIDE_INT) ahead)
>> -    {
>> -      if (dump_file && (dump_flags & TDF_DETAILS))
>> -       fprintf (dump_file,
>> -                "Not prefetching -- loop estimated to roll only %d
>> times\n",
>> -                (int) est_niter);
>> -      goto fail;
>> -    }
>> -
>> -  mark_nontemporal_stores (loop, refs);
>> +  est_niter = estimated_loop_iterations_int (loop, false);
>>
>>   ninsns = tree_num_loop_insns (loop, &eni_size_weights);
>>   unroll_factor = determine_unroll_factor (loop, refs, ninsns, &desc,
>>                                           est_niter);
>>   if (dump_file && (dump_flags & TDF_DETAILS))
>> -    fprintf (dump_file, "Ahead %d, unroll factor %d\n", ahead,
>> unroll_factor);
>> +    fprintf (dump_file, "Ahead %d, unroll factor %d, trip count %ld\n"
>> +            "insn count %d, mem ref count %d, prefetch count %d\n",
>> +            ahead, unroll_factor, est_niter, ninsns, mem_ref_count,
>> +            prefetch_count);
>> +
>> +  if (!is_loop_prefetching_profitable (ahead, est_niter, ninsns,
>> +                                      prefetch_count, mem_ref_count))
>> +    goto fail;
>> +
>> +  mark_nontemporal_stores (loop, refs);
>>
>>   /* Step 4: what to prefetch?  */
>>   if (!schedule_prefetches (refs, unroll_factor, ahead))
>> @@ -1556,7 +1643,11 @@ tree_ssa_prefetch_arrays (void)
>>       fprintf (dump_file, "    L1 cache size: %d lines, %d kB\n",
>>               L1_CACHE_SIZE_BYTES / L1_CACHE_LINE_SIZE, L1_CACHE_SIZE);
>>       fprintf (dump_file, "    L1 cache line size: %d\n",
>> L1_CACHE_LINE_SIZE);
>> -      fprintf (dump_file, "    L2 cache size: %d kB\n", L2_CACHE_SIZE);
>> +      fprintf (dump_file, "    L2 cache size: %d kB\n", L2_CACHE_SIZE);
>>
>> +      fprintf (dump_file, "    min insn-to-prefetch ratio: %d \n",
>> +              MIN_INSN_TO_PREFETCH_RATIO);
>> +      fprintf (dump_file, "    min insn-to-mem ratio: %d \n",
>> +              PREFETCH_MIN_INSN_TO_MEM_RATIO);
>>       fprintf (dump_file, "\n");
>>     }
>>
>> Index: params.def
>> ===================================================================
>> --- params.def  (revision 148159)
>> +++ params.def  (working copy)
>> @@ -741,6 +741,17 @@ DEFPARAM (PARAM_SLP_MAX_INSNS_IN_BB,
>>           "Maximum number of instructions in basic block to be
>> considered for SLP vectorization",
>>           1000, 0, 0)
>>
>> +DEFPARAM (PARAM_MIN_INSN_TO_PREFETCH_RATIO,
>> +         "min-insn-to-prefetch-ratio",
>> +         "min. ratio of insns to prefetches to enable prefetching for "
>> +          "a loop with an unknown trip count",
>> +         10, 0, 0)
>> +
>> +DEFPARAM (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO,
>> +         "prefetch-min-insn-to-mem-ratio",
>> +         "min. ratio of insns to mem ops to enable prefetching in a
>> loop",
>> +         3, 0, 0)
>> +
>>  /*
>>  Local variables:
>>  mode:c
>>
>> -----Original Message-----
>> From: Shobaki, Ghassan
>> Sent: Thursday, June 04, 2009 2:54 PM
>> To: 'Richard Guenther'
>> Cc: Zdenek Dvorak; gcc-patches@gcc.gnu.org
>> Subject: RE: Patch to Avoid Bad Prefetching
>>
>>
>> I meant the time estimate.
>>
>> For example, consider a loop with 2 cache missing memory ops and 4 CPU
>> ops. Assuming that we set the min_insn_to_mem_ratio to 3, the current
>> *imprecise* heuristic will calculate an insn-to-mem ratio of (2+4)/2=3
>> and conclude that this loop will benefit from prfetching whether the 4
>> CPU ops are integer adds or FP divides. A more precise calculation will
>> go like this:
>>
>> Time for memory ops = 200 cycle (assuming a cache-miss latency of 200
>> and that the machine can do 2 mem ops simultaneously)
>>
>> If the 4 CPU ops are integer arithmetic:
>> Time for CPU ops = 4*1=4 (assuming that int arith ops take one cycle
>> each)
>> Max potential benefit from prefetching = 4/204 = 2% (probably not
>> significant enough to pay off for the prefething cost)
>>
>> On the other hand,
>> If the 4 CPU ops are FP divides:
>> Time for CPU ops = 4*20=80 (assuming that FP divides take 20 cycles
>> each)
>> Max potential benefit from prefetching = 80/(200+80) = 29% (most likely
>> significant enough to pay off for the prefething cost and deliver a good
>> performance gain)
>>
>> As you can see, this is much more precise than simply looking at the
>> ratio, but it requires a good time estimate for each operation. I assume
>> that the function tree_num_loop_insns() internally computes such time
>> estimates if we pass it time weights. Of course, I know that middle-end
>> time estimates will not be very precise compared to backend estimates.
>> BTW, are the middle-end time estimates machine dependent?
>>
>> Are you OK with checking in the patch with the ratio for now?
>>
>> Thanks
>> -Ghassan
>>
>> -----Original Message-----
>> From: Richard Guenther [mailto:richard.guenther@gmail.com]
>> Sent: Thursday, June 04, 2009 1:57 PM
>> To: Shobaki, Ghassan
>> Cc: Zdenek Dvorak; gcc-patches@gcc.gnu.org
>> Subject: Re: Patch to Avoid Bad Prefetching
>>
>> On Thu, Jun 4, 2009 at 9:24 PM, Shobaki, Ghassan
>> <Ghassan.Shobaki@amd.com> wrote:
>>>
>>> I agree with Zdenek that the first heuristic is imprecise and needs
>> improvement. At least it has to distinguish between simple integer
>> arithmetic ops and expensive FP ops. I am planning on doing this very
>> soon. As I explained in the previous email, that will require computing
>> an instruction-by-instruction time estimate using the estimates that GCC
>> already computes but adapting them to properly account for the costs of
>> cache missing memory ops.
>>
>> GCC has both size and time estimates for code - I guess this is where
>> you use
>> the size estimate but really ment the time estimate, no?
>>
>> Richard.
>>
>>
>>
>
>
>