This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Patch to Avoid Bad Prefetching


On Fri, Jun 5, 2009 at 8:01 AM, Richard
Guenther<richard.guenther@gmail.com> wrote:
> On Fri, Jun 5, 2009 at 4:53 PM, Shobaki, Ghassan<Ghassan.Shobaki@amd.com> wrote:
>> A ChangeLog entry was included in the original email. Here it is again. Does it look OK?
>
> It misses the doc/invoke.texi changes but otherwise looks ok.


I think this caused:

FAIL: gcc.dg/tree-ssa/prefetch-5.c scan-tree-dump-times aprefetch
"Issued prefetch" 2

on Linux/ia32.


H.J.
--
> Richard.
>
>> 2009-06-02 ?Ghassan Shobaki ?<ghassan.shobaki@amd.com>
>>
>> ? ? ? ?* tree-ssa-loop-prefetch.c
>> ? ? ? ?(gather_memory_references): Introduced a counter for the number of
>> ? ? ? ?memory references.
>> ? ? ? ?(anything_to_prefetch_p): Introduced a counter for the number of
>> ? ? ? ?prefetches.
>> ? ? ? ?(is_loop_prefetching_profitable): New function with a cost model
>> ? ? ? ?for prefetching.
>> ? ? ? ?(loop_prefetch_arrays): Use the new cost model to determine if
>> ? ? ? ?prefetching is profitable.
>> ? ? ? ?* params.def (MIN_INSN_TO_PREFETCH_RATIO,
>> ? ? ? ?PREFETCH_MIN_INSN_TO_MEM_RATIO): New parameters.
>> ? ? ? ?* params.h (MIN_INSN_TO_PREFETCH_RATIO,
>> ? ? ? ?PREFETCH_MIN_INSN_TO_MEM_RATIO): New parameters.
>>
>> Thanks
>> -Ghassan
>>
>> -----Original Message-----
>> From: Richard Guenther [mailto:richard.guenther@gmail.com]
>> Sent: Friday, June 05, 2009 2:01 AM
>> To: Shobaki, Ghassan
>> Cc: Zdenek Dvorak; gcc-patches@gcc.gnu.org
>> Subject: Re: Patch to Avoid Bad Prefetching
>>
>> On Fri, Jun 5, 2009 at 4:29 AM, Shobaki, Ghassan
>> <Ghassan.Shobaki@amd.com> wrote:
>>> Here is the revised patch. I made the changes requested by Zdenek and
>>> changed the default parameter value to 3. I have also added entries for
>>> the new parameters to doc/invoke.texi. The patch bootstrapped and passed
>>> "make check". Is it now OK for check in?
>>
>> Ok with a proper ChangeLog entry.
>>
>> Thanks,
>> Richard.
>>
>>> Thanks
>>> -Ghassan
>>>
>>>
>>> Index: doc/invoke.texi
>>> ===================================================================
>>> --- doc/invoke.texi ? ? (revision 148159)
>>> +++ doc/invoke.texi ? ? (working copy)
>>> @@ -7839,6 +7839,15 @@ The size of L1 cache, in kilobytes.
>>> ?@item l2-cache-size
>>> ?The size of L2 cache, in kilobytes.
>>>
>>> +@item min-insn-to-prefetch-ratio
>>> +The minimum ratio between the number of instructions and the
>>> +number of prefetches to enable prefetching in a loop with an
>>> +unknown trip count.
>>> +
>>> +@item prefetch-min-insn-to-mem-ratio
>>> +The minimum ratio between the number of instructions and the
>>> +number of memory references to enable prefetching in a loop.
>>> +
>>> ?@item use-canonical-types
>>> ?Whether the compiler should use the ``canonical'' type system. ?By
>>> ?default, this should always be 1, which uses a more efficient internal
>>> Index: params.h
>>> ===================================================================
>>> --- params.h ? ?(revision 148159)
>>> +++ params.h ? ?(working copy)
>>> @@ -166,4 +166,8 @@ typedef enum compiler_param
>>> ? PARAM_VALUE (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP)
>>> ?#define SLP_MAX_INSNS_IN_BB \
>>> ? PARAM_VALUE (PARAM_SLP_MAX_INSNS_IN_BB)
>>> +#define MIN_INSN_TO_PREFETCH_RATIO \
>>> + ?PARAM_VALUE (PARAM_MIN_INSN_TO_PREFETCH_RATIO)
>>> +#define PREFETCH_MIN_INSN_TO_MEM_RATIO \
>>> + ?PARAM_VALUE (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO)
>>> ?#endif /* ! GCC_PARAMS_H */
>>> Index: tree-ssa-loop-prefetch.c
>>> ===================================================================
>>> --- tree-ssa-loop-prefetch.c ? ?(revision 148159)
>>> +++ tree-ssa-loop-prefetch.c ? ?(working copy)
>>> @@ -109,6 +109,23 @@ along with GCC; see the file COPYING3.
>>> ? ? ? prefetch instructions with guards in cases where 5) was not
>>> sufficient
>>> ? ? ? to satisfy the constraints?
>>>
>>> + ? The function is_loop_prefetching_profitable() implements a cost
>>> model
>>> + ? to determine if prefetching is profitable for a given loop. The cost
>>> + ? model has two heuristcs:
>>> + ? 1. A heuristic that determines whether the given loop has enough CPU
>>> + ? ? ?ops that can be overlapped with cache missing memory ops.
>>> + ? ? ?If not, the loop won't benefit from prefetching. This is
>>> implemented
>>> + ? ? ?by requirung the ratio between the instruction count and the mem
>>> ref
>>> + ? ? ?count to be above a certain minimum.
>>> + ? 2. A heuristic that disables prefetching in a loop with an unknown
>>> trip
>>> + ? ? ?count if the prefetching cost is above a certain limit. The
>>> relative
>>> + ? ? ?prefetching cost is estimated by taking the ratio between the
>>> + ? ? ?prefetch count and the total intruction count (this models the
>>> I-cache
>>> + ? ? ?cost).
>>> + ? The limits used in these heuristics are defined as parameters with
>>> + ? reasonable default values. Machine-specific default values will be
>>> + ? added later.
>>> +
>>> ? ?Some other TODO:
>>> ? ? ? -- write and use more general reuse analysis (that could be also
>>> used
>>> ? ? ? ? in other cache aimed loop optimizations)
>>> @@ -476,7 +493,7 @@ gather_memory_references_ref (struct loo
>>> ? ?true if there are no other memory references inside the loop. ?*/
>>>
>>> ?static struct mem_ref_group *
>>> -gather_memory_references (struct loop *loop, bool *no_other_refs)
>>> +gather_memory_references (struct loop *loop, bool *no_other_refs,
>>> unsigned *ref_count)
>>> ?{
>>> ? basic_block *body = get_loop_body_in_dom_order (loop);
>>> ? basic_block bb;
>>> @@ -487,6 +504,7 @@ gather_memory_references (struct loop *l
>>> ? struct mem_ref_group *refs = NULL;
>>>
>>> ? *no_other_refs = true;
>>> + ?*ref_count = 0;
>>>
>>> ? /* Scan the loop body in order, so that the former references precede
>>> the
>>> ? ? ?later ones. ?*/
>>> @@ -513,11 +531,17 @@ gather_memory_references (struct loop *l
>>> ? ? ? ? ?rhs = gimple_assign_rhs1 (stmt);
>>>
>>> ? ? ? ? ?if (REFERENCE_CLASS_P (rhs))
>>> + ? ? ? ? ? {
>>> ? ? ? ? ? ?*no_other_refs &= gather_memory_references_ref (loop, &refs,
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rhs, false,
>>> stmt);
>>> + ? ? ? ? ? *ref_count += 1;
>>> + ? ? ? ? ? }
>>> ? ? ? ? ?if (REFERENCE_CLASS_P (lhs))
>>> + ? ? ? ? ? {
>>> ? ? ? ? ? ?*no_other_refs &= gather_memory_references_ref (loop, &refs,
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?lhs, true,
>>> stmt);
>>> + ? ? ? ? ? *ref_count += 1;
>>> + ? ? ? ? ? }
>>> ? ? ? ?}
>>> ? ? }
>>> ? free (body);
>>> @@ -846,20 +870,20 @@ schedule_prefetches (struct mem_ref_grou
>>> ? return any;
>>> ?}
>>>
>>> -/* Determine whether there is any reference suitable for prefetching
>>> - ? in GROUPS. ?*/
>>> +/* Estimate the number of prefetches in the given GROUPS. ?*/
>>>
>>> -static bool
>>> -anything_to_prefetch_p (struct mem_ref_group *groups)
>>> +static int
>>> +estimate_prefetch_count (struct mem_ref_group *groups)
>>> ?{
>>> ? struct mem_ref *ref;
>>> + ?int prefetch_count = 0;
>>>
>>> ? for (; groups; groups = groups->next)
>>> ? ? for (ref = groups->refs; ref; ref = ref->next)
>>> ? ? ? if (should_issue_prefetch_p (ref))
>>> - ? ? ? return true;
>>> + ? ? ? ? prefetch_count++;
>>>
>>> - ?return false;
>>> + ?return prefetch_count;
>>> ?}
>>>
>>> ?/* Issue prefetches for the reference REF into loop as decided before.
>>> @@ -1449,6 +1473,71 @@ determine_loop_nest_reuse (struct loop *
>>> ? ? }
>>> ?}
>>>
>>> +/* Do a cost-benefit analysis to determine if prefetching is profitable
>>> + ? for the current loop given the following parameters:
>>> + ? AHEAD: the iteration ahead distance,
>>> + ? EST_NITER: the estimated trip count,
>>> + ? NINSNS: estimated number of instructions in the loop,
>>> + ? PREFETCH_COUNT: an estimate of the number of prefetches
>>> + ? MEM_REF_COUNT: total number of memory references in the loop. ?*/
>>> +
>>> +static bool
>>> +is_loop_prefetching_profitable (unsigned ahead, HOST_WIDE_INT
>>> est_niter,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned ninsns, unsigned
>>> prefetch_count,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned mem_ref_count)
>>> +{
>>> + ?int insn_to_mem_ratio, insn_to_prefetch_ratio;
>>> +
>>> + ?if (mem_ref_count == 0)
>>> + ? ?return false;
>>> +
>>> + ?/* Prefetching improves performance by overlapping cache missing
>>> + ? ? memory accesses with CPU operations. ?If the loop does not have
>>> + ? ? enough CPU operations to overlap with memory operations,
>>> prefetching
>>> + ? ? won't give a significant benefit. ?One approximate way of checking
>>>
>>> + ? ? this is to require the ratio of instructions to memory references
>>> to
>>> + ? ? be above a certain limit. ?This approximation works well in
>>> practice.
>>> + ? ? TODO: Implement a more precise computation by estimating the time
>>> + ? ? for each CPU or memory op in the loop. Time estimates for memory
>>> ops
>>> + ? ? should account for cache misses. ?*/
>>> + ?insn_to_mem_ratio = ninsns / mem_ref_count;
>>> +
>>> + ?if (insn_to_mem_ratio < PREFETCH_MIN_INSN_TO_MEM_RATIO)
>>> + ? ?return false;
>>> +
>>> + ?/* Profitability of prefetching is highly dependent on the trip
>>> count.
>>> + ? ? For a given AHEAD distance, the first AHEAD iterations do not
>>> benefit
>>> + ? ? from prefetching, and the last AHEAD iterations execute useless
>>> + ? ? prefetches. ?So, if the trip count is not large enough relative to
>>> AHEAD,
>>> + ? ? prefetching may cause serious performance degradation. ?To avoid
>>> this
>>> + ? ? problem when the trip count is not known at compile time, we
>>> + ? ? conservatively skip loops with high prefetching costs. ?For now,
>>> only
>>> + ? ? the I-cache cost is considered. ?The relative I-cache cost is
>>> estimated
>>> + ? ? by taking the ratio between the number of prefetches and the total
>>> + ? ? number of instructions. ?Since we are using integer arithmetic, we
>>> + ? ? compute the reciprocal of this ratio.
>>> + ? ? TODO: Account for loop unrolling, which may reduce the costs of
>>> + ? ? shorter stride prefetches. ?Note that not accounting for loop
>>> + ? ? unrolling over-estimates the cost and hence gives more
>>> conservative
>>> + ? ? results. ?*/
>>> + ?if (est_niter < 0)
>>> + ? ?{
>>> + ? ? ?insn_to_prefetch_ratio = ninsns / prefetch_count;
>>> + ? ? ?return insn_to_prefetch_ratio >= MIN_INSN_TO_PREFETCH_RATIO;
>>> + ? ?}
>>> +
>>> + ?if (est_niter <= (HOST_WIDE_INT) ahead)
>>> + ? ?{
>>> + ? ? ?if (dump_file && (dump_flags & TDF_DETAILS))
>>> + ? ? ? fprintf (dump_file,
>>> + ? ? ? ? ? ? ? ?"Not prefetching -- loop estimated to roll only %d
>>> times\n",
>>> + ? ? ? ? ? ? ? ?(int) est_niter);
>>> + ? ? ?return false;
>>> + ? ?}
>>> + ?return true;
>>> +}
>>> +
>>> +
>>> ?/* Issue prefetch instructions for array references in LOOP. ?Returns
>>> ? ?true if the LOOP was unrolled. ?*/
>>>
>>> @@ -1460,6 +1549,8 @@ loop_prefetch_arrays (struct loop *loop)
>>> ? HOST_WIDE_INT est_niter;
>>> ? struct tree_niter_desc desc;
>>> ? bool unrolled = false, no_other_refs;
>>> + ?unsigned prefetch_count;
>>> + ?unsigned mem_ref_count;
>>>
>>> ? if (optimize_loop_nest_for_size_p (loop))
>>> ? ? {
>>> @@ -1469,12 +1560,13 @@ loop_prefetch_arrays (struct loop *loop)
>>> ? ? }
>>>
>>> ? /* Step 1: gather the memory references. ?*/
>>> - ?refs = gather_memory_references (loop, &no_other_refs);
>>> + ?refs = gather_memory_references (loop, &no_other_refs,
>>> &mem_ref_count);
>>>
>>> ? /* Step 2: estimate the reuse effects. ?*/
>>> ? prune_by_reuse (refs);
>>>
>>> - ?if (!anything_to_prefetch_p (refs))
>>> + ?prefetch_count = estimate_prefetch_count (refs);
>>> + ?if (prefetch_count == 0)
>>> ? ? goto fail;
>>>
>>> ? determine_loop_nest_reuse (loop, refs, no_other_refs);
>>> @@ -1485,27 +1577,22 @@ loop_prefetch_arrays (struct loop *loop)
>>> ? ? ?the loop body. ?*/
>>> ? time = tree_num_loop_insns (loop, &eni_time_weights);
>>> ? ahead = (PREFETCH_LATENCY + time - 1) / time;
>>> - ?est_niter = estimated_loop_iterations_int (loop, false);
>>> -
>>> - ?/* The prefetches will run for AHEAD iterations of the original loop.
>>> Unless
>>> - ? ? the loop rolls at least AHEAD times, prefetching the references
>>> does not
>>> - ? ? make sense. ?*/
>>> - ?if (est_niter >= 0 && est_niter <= (HOST_WIDE_INT) ahead)
>>> - ? ?{
>>> - ? ? ?if (dump_file && (dump_flags & TDF_DETAILS))
>>> - ? ? ? fprintf (dump_file,
>>> - ? ? ? ? ? ? ? ?"Not prefetching -- loop estimated to roll only %d
>>> times\n",
>>> - ? ? ? ? ? ? ? ?(int) est_niter);
>>> - ? ? ?goto fail;
>>> - ? ?}
>>> -
>>> - ?mark_nontemporal_stores (loop, refs);
>>> + ?est_niter = estimated_loop_iterations_int (loop, false);
>>>
>>> ? ninsns = tree_num_loop_insns (loop, &eni_size_weights);
>>> ? unroll_factor = determine_unroll_factor (loop, refs, ninsns, &desc,
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? est_niter);
>>> ? if (dump_file && (dump_flags & TDF_DETAILS))
>>> - ? ?fprintf (dump_file, "Ahead %d, unroll factor %d\n", ahead,
>>> unroll_factor);
>>> + ? ?fprintf (dump_file, "Ahead %d, unroll factor %d, trip count %ld\n"
>>> + ? ? ? ? ? ?"insn count %d, mem ref count %d, prefetch count %d\n",
>>> + ? ? ? ? ? ?ahead, unroll_factor, est_niter, ninsns, mem_ref_count,
>>> + ? ? ? ? ? ?prefetch_count);
>>> +
>>> + ?if (!is_loop_prefetching_profitable (ahead, est_niter, ninsns,
>>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?prefetch_count, mem_ref_count))
>>> + ? ?goto fail;
>>> +
>>> + ?mark_nontemporal_stores (loop, refs);
>>>
>>> ? /* Step 4: what to prefetch? ?*/
>>> ? if (!schedule_prefetches (refs, unroll_factor, ahead))
>>> @@ -1556,7 +1643,11 @@ tree_ssa_prefetch_arrays (void)
>>> ? ? ? fprintf (dump_file, " ? ?L1 cache size: %d lines, %d kB\n",
>>> ? ? ? ? ? ? ? L1_CACHE_SIZE_BYTES / L1_CACHE_LINE_SIZE, L1_CACHE_SIZE);
>>> ? ? ? fprintf (dump_file, " ? ?L1 cache line size: %d\n",
>>> L1_CACHE_LINE_SIZE);
>>> - ? ? ?fprintf (dump_file, " ? ?L2 cache size: %d kB\n", L2_CACHE_SIZE);
>>> + ? ? ?fprintf (dump_file, " ? ?L2 cache size: %d kB\n", L2_CACHE_SIZE);
>>>
>>> + ? ? ?fprintf (dump_file, " ? ?min insn-to-prefetch ratio: %d \n",
>>> + ? ? ? ? ? ? ?MIN_INSN_TO_PREFETCH_RATIO);
>>> + ? ? ?fprintf (dump_file, " ? ?min insn-to-mem ratio: %d \n",
>>> + ? ? ? ? ? ? ?PREFETCH_MIN_INSN_TO_MEM_RATIO);
>>> ? ? ? fprintf (dump_file, "\n");
>>> ? ? }
>>>
>>> Index: params.def
>>> ===================================================================
>>> --- params.def ?(revision 148159)
>>> +++ params.def ?(working copy)
>>> @@ -741,6 +741,17 @@ DEFPARAM (PARAM_SLP_MAX_INSNS_IN_BB,
>>> ? ? ? ? ? "Maximum number of instructions in basic block to be
>>> considered for SLP vectorization",
>>> ? ? ? ? ? 1000, 0, 0)
>>>
>>> +DEFPARAM (PARAM_MIN_INSN_TO_PREFETCH_RATIO,
>>> + ? ? ? ? "min-insn-to-prefetch-ratio",
>>> + ? ? ? ? "min. ratio of insns to prefetches to enable prefetching for "
>>> + ? ? ? ? ?"a loop with an unknown trip count",
>>> + ? ? ? ? 10, 0, 0)
>>> +
>>> +DEFPARAM (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO,
>>> + ? ? ? ? "prefetch-min-insn-to-mem-ratio",
>>> + ? ? ? ? "min. ratio of insns to mem ops to enable prefetching in a
>>> loop",
>>> + ? ? ? ? 3, 0, 0)
>>> +
>>> ?/*
>>> ?Local variables:
>>> ?mode:c
>>>
>>> -----Original Message-----
>>> From: Shobaki, Ghassan
>>> Sent: Thursday, June 04, 2009 2:54 PM
>>> To: 'Richard Guenther'
>>> Cc: Zdenek Dvorak; gcc-patches@gcc.gnu.org
>>> Subject: RE: Patch to Avoid Bad Prefetching
>>>
>>>
>>> I meant the time estimate.
>>>
>>> For example, consider a loop with 2 cache missing memory ops and 4 CPU
>>> ops. Assuming that we set the min_insn_to_mem_ratio to 3, the current
>>> *imprecise* heuristic will calculate an insn-to-mem ratio of (2+4)/2=3
>>> and conclude that this loop will benefit from prfetching whether the 4
>>> CPU ops are integer adds or FP divides. A more precise calculation will
>>> go like this:
>>>
>>> Time for memory ops = 200 cycle (assuming a cache-miss latency of 200
>>> and that the machine can do 2 mem ops simultaneously)
>>>
>>> If the 4 CPU ops are integer arithmetic:
>>> Time for CPU ops = 4*1=4 (assuming that int arith ops take one cycle
>>> each)
>>> Max potential benefit from prefetching = 4/204 = 2% (probably not
>>> significant enough to pay off for the prefething cost)
>>>
>>> On the other hand,
>>> If the 4 CPU ops are FP divides:
>>> Time for CPU ops = 4*20=80 (assuming that FP divides take 20 cycles
>>> each)
>>> Max potential benefit from prefetching = 80/(200+80) = 29% (most likely
>>> significant enough to pay off for the prefething cost and deliver a good
>>> performance gain)
>>>
>>> As you can see, this is much more precise than simply looking at the
>>> ratio, but it requires a good time estimate for each operation. I assume
>>> that the function tree_num_loop_insns() internally computes such time
>>> estimates if we pass it time weights. Of course, I know that middle-end
>>> time estimates will not be very precise compared to backend estimates.
>>> BTW, are the middle-end time estimates machine dependent?
>>>
>>> Are you OK with checking in the patch with the ratio for now?
>>>
>>> Thanks
>>> -Ghassan
>>>
>>> -----Original Message-----
>>> From: Richard Guenther [mailto:richard.guenther@gmail.com]
>>> Sent: Thursday, June 04, 2009 1:57 PM
>>> To: Shobaki, Ghassan
>>> Cc: Zdenek Dvorak; gcc-patches@gcc.gnu.org
>>> Subject: Re: Patch to Avoid Bad Prefetching
>>>
>>> On Thu, Jun 4, 2009 at 9:24 PM, Shobaki, Ghassan
>>> <Ghassan.Shobaki@amd.com> wrote:
>>>>
>>>> I agree with Zdenek that the first heuristic is imprecise and needs
>>> improvement. At least it has to distinguish between simple integer
>>> arithmetic ops and expensive FP ops. I am planning on doing this very
>>> soon. As I explained in the previous email, that will require computing
>>> an instruction-by-instruction time estimate using the estimates that GCC
>>> already computes but adapting them to properly account for the costs of
>>> cache missing memory ops.
>>>
>>> GCC has both size and time estimates for code - I guess this is where
>>> you use
>>> the size estimate but really ment the time estimate, no?
>>>
>>> Richard.
>>>
>>>
>>>
>>
>>
>>
>



-- 
H.J.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]