[PATCH] x86-64: Add support for non temporal prefetches.
Nutan Singh
nutan.singh@noida.hcltech.com
Thu Aug 19 08:30:00 GMT 2004
Hi,
Attached is the file mentioned in the previous mail.
Nutan
> -----Original Message-----
> From: Nutan Singh
> Sent: Thursday, August 19, 2004 1:39 PM
> To: 'gcc-patches@gcc.gnu.org'
> Subject: [PATCH] x86-64: Add support for non temporal prefetches.
>
>
> Hi,
>
> The patch supports generation of PREFETCHNTA and MOVNTx
> (streaming stores) on the AMD 64-bit target for non-temporal
> data accesses. Data is considered non-temporal if its size
> exceeds the size of the L2 cache, or if it exceeds 64KB and is
> not accessed again any time soon.
>
> Also, at present loop unrolling is done in the second loop
> optimizer, whereas prefetching is done in the first loop
> optimizer. So once prefetch instructions have been generated for
> each iteration, the loop is unrolled, resulting in many
> prefetches per iteration for the same GIV, i.e. the same cache
> line being prefetched many times.
> The patch also overcomes this problem of fetching the same cache
> line more than once on the AMD 64-bit target.
>
> The SPEC CPU2000 benchmark shows some gain with this
> compiler over the previous one, especially for 168.wupwise,
> 173.applu, 183.equake, 200.sixtrack among the FP benchmarks and
> 164.gzip, 254.gap, 255.vortex, 256.bzip2, 300.twolf among the INT
> benchmarks. Attached is the file "C2000.asc" showing these
> values. Please observe the difference between the base (with
> -fprefetch-loop-arrays option) and peak (without
> -fprefetch-loop-arrays option) values.
> Comparing the new compiler with the old one,
> 173.applu, 183.equake, 200.sixtrack, 164.gzip and 254.gap show
> performance gains.
>
> When the loop iteration count is not known at the time of
> prefetching, the compiler assumes a very high value (0xFFFFFFFF)
> as the iteration count, making the data size larger than the L2
> cache and hence forcing generation of PREFETCHNTA. For some
> functions PREFETCHx instead of PREFETCHNTA is beneficial in this
> case, as the loop iteration count actually turns out to be
> small. Hence we need a way to specify a small value for the loop
> iteration count so as to generate PREFETCHx. Adding an
> assumed-loop-iteration option for --param does this.
>
> Regression tested on x86-64.
>
> Ok for 3.4 branch?
>
> Nutan
> ------------------------------------------------------------
> 2004-07-05 Nutan Singh <nutans@noida.hcltech.com>
>
> * loop.c (emit_prefetch_instructions): Emit non-temporal
> prefetches where givs are not reused.
> (non_temporal_store): New function for generating non-temporal
> stores.
>
> * common.opt: Enable guessing of loop iteration count.
>
> * opts.c: Set assumed-loop-iteration value.
>
> * params.def: Define parameter assumed-loop-iteration.
>
> * params.h: Define macro ASSUMED_LOOP_ITERATION with value
> assumed-loop-iteration.
>
> * config/i386/i386.c: For ATHLON or K8 targets make x86_cost
> value equal to k8_cost.
>
> * config/i386/i386.h (L2_CACHE_SIZE, SIZE_L2, SIZE_LESS_THAN_L2):
> Define new macros for AMD related cache sizes.
> (PREFETCH_BLOCKS_BEFORE_LOOP_MAX, PREFETCH_BLOCKS_BEFORE_LOOP_MIN):
> Define these macros with different values for AMD targets.
>
> * cfgloop.c (remove_redundant_prefetches): New function to remove
> redundant prefetches for a loop, to avoid fetching the same cache
> line repeatedly.
>
> * toplev.c: Call function remove_redundant_prefetches.
>
> Index: gcc/cfgloop.c
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/cfgloop.c,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 cfgloop.c
> --- gcc/cfgloop.c 2004/05/13 06:12:48 3.4.1.1
> +++ gcc/cfgloop.c 2004/08/04 11:59:19
> @@ -1286,3 +1286,56 @@ loop_preheader_edge (const struct loop *
>
> return e;
> }
> +
> +#ifdef HAVE_prefetch
> +void
> +remove_redundant_prefetches (struct loops *loops, FILE *rtl_dump_file)
> +{
> + basic_block *bbs;
> + rtx plist = 0, insn;
> + int i, j;
> + int loop_num = loops->num;
> +
> + for (i = 1; i < loop_num; i++)
> + {
> + struct loop *loop = loops->parray[i];
> +
> + if (!loop)
> + continue;
> +
> + bbs = get_loop_body (loop);
> + if (rtl_dump_file)
> + {
> + int i;
> + fprintf (rtl_dump_file, ";;\n;; Analyzing Loop:%d (For PREFETCH)\n",
> + loop->num);
> + fprintf (rtl_dump_file, ";; nodes:");
> + for (i = 0; i < (int) loop->num_nodes; i++)
> + fprintf (rtl_dump_file, " %d", bbs[i]->index);
> + fprintf (rtl_dump_file,
> + "\n--------------------------------------\n");
> + }
> + for (j = 0; j < (int) loop->num_nodes; j++)
> + {
> + for (insn = BB_HEAD (bbs[j]); insn != BB_END (bbs[j]);
> + insn = NEXT_INSN (insn))
> + {
> + if (INSN_P (insn) && GET_CODE (PATTERN (insn)) == PREFETCH)
> + {
> + rtx addr = XEXP (PATTERN (insn), 0);
> + rtx next;
> + if (plist)
> + for (next = plist; next; next = XEXP (next, 1))
> + if (rtx_equal_p (addr, XEXP (next, 0)))
> + {
> + insn = PREV_INSN (delete_insn (insn));
> + break;
> + }
> + plist = gen_rtx_EXPR_LIST (VOIDmode, addr, plist);
> + }
> + }
> + }
> + free (bbs);
> + }
> +}
> +#endif
> Index: gcc/cfgloop.h
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/cfgloop.h,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 cfgloop.h
> --- gcc/cfgloop.h 2004/05/13 06:12:48 3.4.1.1
> +++ gcc/cfgloop.h 2004/08/04 11:59:19
> @@ -325,6 +325,7 @@ extern edge split_loop_bb (basic_block,
> /* Loop optimizer initialization. */
> extern struct loops *loop_optimizer_init (FILE *);
> extern void loop_optimizer_finalize (struct loops *, FILE *);
> +extern void remove_redundant_prefetches (struct loops *, FILE *);
>
> /* Optimization passes. */
> extern void unswitch_loops (struct loops *);
> Index: gcc/common.opt
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/common.opt,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 common.opt
> --- gcc/common.opt 2004/05/13 06:12:49 3.4.1.1
> +++ gcc/common.opt 2004/08/04 11:59:19
> @@ -373,6 +373,10 @@ fguess-branch-probability
> Common
> Enable guessing of branch probabilities
>
> +fguess-loop-iteration
> +Common
> +Enable guessing of loop iteration count
> +
> fident
> Common
> Process #ident directives
> Index: gcc/loop.c
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/loop.c,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 loop.c
> --- gcc/loop.c 2004/05/13 06:12:52 3.4.1.1
> +++ gcc/loop.c 2004/08/04 11:59:19
> @@ -66,13 +66,18 @@ Software Foundation, 59 Temple Place - S
> #include "optabs.h"
> #include "cfgloop.h"
> #include "ggc.h"
> +#include "params.h"
>
> /* Not really meaningful values, but at least something. */
> #ifndef SIMULTANEOUS_PREFETCHES
> #define SIMULTANEOUS_PREFETCHES 3
> #endif
> +/* Number of cache lines ahead to prefetch. */
> +#ifndef PREFETCH_DISTANCE
> +#define PREFETCH_DISTANCE 5
> +#endif
> #ifndef PREFETCH_BLOCK
> -#define PREFETCH_BLOCK 32
> +#define PREFETCH_BLOCK 32
> #endif
> #ifndef HAVE_prefetch
> #define HAVE_prefetch 0
> @@ -86,10 +91,14 @@ Software Foundation, 59 Temple Place - S
> #define MAX_PREFETCHES 100
> /* The number of prefetch blocks that are beneficial to fetch at once before
> a loop with a known (and low) iteration count. */
> -#define PREFETCH_BLOCKS_BEFORE_LOOP_MAX 6
> +#ifndef PREFETCH_BLOCKS_BEFORE_LOOP_MAX
> +#define PREFETCH_BLOCKS_BEFORE_LOOP_MAX 6
> +#endif
> /* For very tiny loops it is not worthwhile to prefetch even before the loop,
> since it is likely that the data are already in the cache. */
> -#define PREFETCH_BLOCKS_BEFORE_LOOP_MIN 2
> +#ifndef PREFETCH_BLOCKS_BEFORE_LOOP_MIN
> +#define PREFETCH_BLOCKS_BEFORE_LOOP_MIN 2
> +#endif
>
> /* Parameterize some prefetch heuristics so they can be turned on and off
> easily for performance testing on new architectures. These can be
> @@ -146,7 +155,7 @@ Software Foundation, 59 Temple Place - S
>
> /* Do not handle reversed order prefetches (negative stride). */
> #ifndef PREFETCH_NO_REVERSE_ORDER
> -#define PREFETCH_NO_REVERSE_ORDER 1
> +#define PREFETCH_NO_REVERSE_ORDER 0
> #endif
>
> /* Prefetch even if the GIV is in conditional code. */
> @@ -3665,6 +3674,7 @@ struct prefetch_info
> int prefetch_in_loop; /* Number of prefetch insns in loop. */
> int prefetch_before_loop; /* Number of prefetch insns before loop. */
> unsigned int write : 1; /* 1 for read/write prefetches. */
> + int reused; /* 1 if this giv is reused in the loop. */
> };
>
> /* Data used by check_store function. */
> @@ -3852,7 +3862,7 @@ emit_prefetch_instructions (struct loop
> int i;
> struct iv_class *bl;
> struct induction *iv;
> - struct prefetch_info info[MAX_PREFETCHES];
> + struct prefetch_info info[MAX_PREFETCHES] = {0};
> struct loop_ivs *ivs = LOOP_IVS (loop);
>
> if (!HAVE_prefetch)
> @@ -4038,6 +4048,7 @@ emit_prefetch_instructions (struct loop
> info[i].class = bl;
> info[num_prefetches].base_address = address;
> add = 0;
> + info[i].reused = 1;
> break;
> }
>
> @@ -4047,6 +4058,7 @@ emit_prefetch_instructions (struct loop
> info[i].write |= d.mem_write;
> info[i].bytes_accessed += size;
> add = 0;
> + info[i].reused = 1;
> break;
> }
> }
> @@ -4084,7 +4096,7 @@ emit_prefetch_instructions (struct loop
> >= LOOP_INFO (loop)->n_iterations))
> info[i].total_bytes = info[i].stride * LOOP_INFO (loop)->n_iterations;
> else
> - info[i].total_bytes = 0xffffffff;
> + info[i].total_bytes = info[i].stride * ASSUMED_LOOP_ITERATION;
>
> density = info[i].bytes_accessed * 100 / info[i].stride;
>
> @@ -4136,9 +4148,9 @@ emit_prefetch_instructions (struct loop
> }
> /* We'll also use AHEAD to determine how many prefetch instructions to
> emit before a loop, so don't leave it zero. */
> - if (ahead == 0)
> - ahead = PREFETCH_BLOCKS_BEFORE_LOOP_MAX;
>
> + ahead = PREFETCH_BLOCKS_BEFORE_LOOP_MAX;
> +
> for (i = 0; i < num_prefetches; i++)
> {
> /* Update if we've decided not to prefetch anything
> within the loop. */
> @@ -4203,7 +4215,7 @@ emit_prefetch_instructions (struct loop
> {
> rtx loc = copy_rtx (*info[i].giv->location);
> rtx insn;
> - int bytes_ahead = PREFETCH_BLOCK * (ahead + y);
> + int bytes_ahead = PREFETCH_BLOCK * (ahead + y + PREFETCH_DISTANCE);
> rtx before_insn = info[i].giv->insn;
> rtx prev_insn = PREV_INSN (info[i].giv->insn);
> rtx seq;
> @@ -4226,8 +4238,23 @@ emit_prefetch_instructions (struct loop
> if (! (*insn_data[(int)CODE_FOR_prefetch].operand[0].predicate)
> (loc, insn_data[(int)CODE_FOR_prefetch].operand[0].mode))
> loc = force_reg (Pmode, loc);
> - emit_insn (gen_prefetch (loc, GEN_INT (info[i].write),
> - GEN_INT (3)));
> +
> +#ifdef L2_CACHE_SIZE
> + if (info[i].total_bytes > SIZE_L2
> + || ((info[i].total_bytes > SIZE_LESS_THAN_L2)
> + && (!info[i].reused)))
> + {
> +#ifdef NON_TEMPORAL_STORE
> + if (!info[i].write || !non_temporal_store (info[i].giv->insn))
> +#endif
> + emit_insn (gen_prefetch (loc, GEN_INT (0), GEN_INT(0)));
> + }
> + else
> + emit_insn (gen_prefetch (loc, GEN_INT (info[i].write), GEN_INT (3)));
> +#else
> + emit_insn (gen_prefetch (loc, GEN_INT (info[i].write), GEN_INT (3)));
> +#endif
> +
> seq = get_insns ();
> end_sequence ();
> emit_insn_before (seq, before_insn);
> @@ -4256,7 +4283,7 @@ emit_prefetch_instructions (struct loop
> rtx init_val = info[i].class->initial_value;
> rtx add_val = simplify_gen_binary (PLUS, Pmode,
> info[i].giv->add_val,
> - GEN_INT (y * PREFETCH_BLOCK));
> + GEN_INT ((y + PREFETCH_DISTANCE) * PREFETCH_BLOCK));
>
> /* Functions called by LOOP_IV_ADD_EMIT_BEFORE expect a
> non-constant INIT_VAL to have the same mode as REG, which
> @@ -4274,15 +4301,42 @@ emit_prefetch_instructions (struct loop
> loop_iv_add_mult_emit_before (loop, init_val,
> info[i].giv->mult_val,
> add_val, reg, 0, loop_start);
> - emit_insn_before (gen_prefetch (reg, GEN_INT (info[i].write),
> - GEN_INT (3)),
> - loop_start);
> +#ifdef L2_CACHE_SIZE
> + if (info[i].total_bytes > SIZE_L2
> + || ((info[i].total_bytes >= SIZE_LESS_THAN_L2) && (!info[i].reused)))
> + emit_insn_before (gen_prefetch (reg, GEN_INT (0), GEN_INT (0)), loop_start);
> + else
> + emit_insn_before (gen_prefetch (reg, GEN_INT (info[i].write), GEN_INT (3)), loop_start);
> +#else
> + emit_insn_before (gen_prefetch (reg, GEN_INT (info[i].write), GEN_INT (3)), loop_start);
> +#endif
> +
> }
> }
> }
>
> return;
> }
> +
> +#ifdef NON_TEMPORAL_STORE
> +/* See if a non-temporal store can be used
> + for insn. If valid, make the change and return non-zero. */
> +static int
> +non_temporal_store (rtx insn)
> +{
> + rtx set = single_set (insn);
> + rtx store;
> + int rval = 0;
> + if (set && GET_CODE (SET_DEST (set)) == MEM
> + && GET_CODE (SET_SRC (set)) == REG)
> + {
> + store = NON_TEMPORAL_STORE (SET_DEST (set), SET_SRC (set));
> + rval = validate_change (insn, &PATTERN (insn), store, 0);
> + }
> + return rval;
> +}
> +#endif
> +
>
> /* Communication with routines called via `note_stores'. */
>
> Index: gcc/loop.h
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/loop.h,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 loop.h
> --- gcc/loop.h 2004/05/13 06:12:52 3.4.1.1
> +++ gcc/loop.h 2004/08/04 11:59:19
> @@ -402,6 +402,7 @@ extern FILE *loop_dump_stream;
> /* Forward declarations for non-static functions declared in loop.c and
> unroll.c. */
> extern int loop_invariant_p (const struct loop *, rtx);
> +extern int reg_in_basic_block_p (rtx, rtx);
> extern rtx get_condition_for_loop (const struct loop *, rtx);
> extern void loop_iv_add_mult_hoist (const struct loop *, rtx, rtx, rtx, rtx);
> extern void loop_iv_add_mult_sink (const struct loop *, rtx, rtx, rtx, rtx);
> Index: gcc/opts.c
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/opts.c,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 opts.c
> --- gcc/opts.c 2004/05/13 06:12:52 3.4.1.1
> +++ gcc/opts.c 2004/08/04 11:59:19
> @@ -1071,6 +1071,10 @@ common_handle_option (size_t scode, cons
> set_param_value ("max-inline-insns-rtl", value);
> break;
>
> + case OPT_fguess_loop_iteration:
> + set_param_value ("assumed-loop-iteration", value);
> + break;
> +
> case OPT_finstrument_functions:
> flag_instrument_function_entry_exit = value;
> break;
> Index: gcc/params.def
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/params.def,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 params.def
> --- gcc/params.def 2004/05/13 06:12:52 3.4.1.1
> +++ gcc/params.def 2004/08/04 11:59:19
> @@ -215,6 +215,11 @@ DEFPARAM(TRACER_MAX_CODE_GROWTH,
> "tracer-max-code-growth",
> "Maximal code growth caused by tail duplication (in percent)",
> 100)
> +DEFPARAM(PARAM_ASSUMED_LOOP_ITERATION,
> + "assumed-loop-iteration",
> + "Assume loop iteration count when it is not known at the time of loop \
> +optimization",
> + 200)
> DEFPARAM(TRACER_MIN_BRANCH_RATIO,
> "tracer-min-branch-ratio",
> "Stop reverse growth if the reverse probability of best edge is less \
> Index: gcc/params.h
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/params.h,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 params.h
> --- gcc/params.h 2004/05/13 06:12:52 3.4.1.1
> +++ gcc/params.h 2004/08/04 11:59:19
> @@ -82,6 +82,8 @@ typedef enum compiler_param
> (compiler_params[(int) ENUM].value)
>
> /* Macros for the various parameters. */
> +#define ASSUMED_LOOP_ITERATION \
> + PARAM_VALUE (PARAM_ASSUMED_LOOP_ITERATION)
> #define MAX_INLINE_INSNS_SINGLE \
> PARAM_VALUE (PARAM_MAX_INLINE_INSNS_SINGLE)
> #define MAX_INLINE_INSNS \
> Index: gcc/toplev.c
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/toplev.c,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 toplev.c
> --- gcc/toplev.c 2004/05/13 06:12:53 3.4.1.1
> +++ gcc/toplev.c 2004/08/04 11:59:20
> @@ -3083,7 +3083,12 @@ rest_of_handle_loop2 (tree decl, rtx ins
> (flag_peel_loops ? UAP_PEEL : 0) |
> (flag_unroll_loops ? UAP_UNROLL : 0) |
> (flag_unroll_all_loops ? UAP_UNROLL_ALL : 0));
> -
> +#ifdef HAVE_prefetch
> + /* Remove redundant prefetch copies generated during loop
> + unrolling. */
> + if (flag_prefetch_loop_arrays)
> + remove_redundant_prefetches (loops, rtl_dump_file);
> +#endif
> loop_optimizer_finalize (loops, rtl_dump_file);
> }
>
> Index: gcc/config/i386/i386.c
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/config/i386/i386.c,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 i386.c
> --- gcc/config/i386/i386.c 2004/05/13 06:13:22 3.4.1.1
> +++ gcc/config/i386/i386.c 2004/08/04 11:59:21
> @@ -457,7 +457,7 @@ struct processor_costs pentium4_cost = {
> 43, /* cost of FSQRT instruction. */
> };
>
> -const struct processor_costs *ix86_cost = &pentium_cost;
> +
>
> /* Processor feature/optimization bitmasks. */
> #define m_386 (1<<PROCESSOR_I386)
> @@ -469,6 +469,12 @@ const struct processor_costs *ix86_cost
> #define m_PENT4 (1<<PROCESSOR_PENTIUM4)
> #define m_K8 (1<<PROCESSOR_K8)
> #define m_ATHLON_K8 (m_K8 | m_ATHLON)
> +
> +#ifdef m_ATHLON_K8
> +const struct processor_costs *ix86_cost = &k8_cost;
> +#else
> +const struct processor_costs *ix86_cost = &pentium_cost;
> +#endif
>
> const int x86_use_leave = m_386 | m_K6 | m_ATHLON_K8;
> const int x86_push_memory = m_386 | m_K6 | m_ATHLON_K8 | m_PENT4;
> Index: gcc/config/i386/i386.h
> ===================================================================
> RCS file: /home/gnu/cvs/gcc-3.4/gcc/gcc/config/i386/i386.h,v
> retrieving revision 3.4.1.1
> diff -u -p -r3.4.1.1 i386.h
> --- gcc/config/i386/i386.h 2004/05/13 06:13:22 3.4.1.1
> +++ gcc/config/i386/i386.h 2004/08/04 11:59:21
> @@ -2569,6 +2569,25 @@ enum ix86_builtins
> /* Number of prefetch operations that can be done in parallel. */
> #define SIMULTANEOUS_PREFETCHES ix86_cost->simultaneous_prefetches
>
> +/* Define the L2 data cache size. */
> +#define SIZE_L2 1048576
> +#define L2_CACHE_SIZE SIZE_L2
> +
> +#define NON_TEMPORAL_STORE(OP1, OP2) gen_sse2_movntsi ((OP1), (OP2))
> +
> +/* Minimum threshold to generate temporal prefetches, e.g. on AMD64
> + 64KB is much less than the L2 size and a temporal prefetch will always
> + be generated for sizes less than this size. */
> +#define SIZE_LESS_THAN_L2 65536
> +
> +#ifdef TARGET_ATHLON_K8
> +#define PREFETCH_BLOCKS_BEFORE_LOOP_MAX 2
> +#endif
> +
> +#ifdef TARGET_ATHLON_K8
> +#define PREFETCH_BLOCKS_BEFORE_LOOP_MIN 0
> +#endif
> +
> /* Max number of bytes we can move from memory to memory
> in one reasonably fast instruction. */
> #define MOVE_MAX 16
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: C2000.asc
Type: application/octet-stream
Size: 2660 bytes
Desc: C2000.asc
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20040819/1daff330/attachment.obj>