[patch] Nontemporal prefetches

Zdenek Dvorak <rakdver@atrey.karlin.mff.cuni.cz>
Thu Apr 12 23:09:00 GMT 2007


Hello,

this patch makes us issue nontemporal prefetch instructions (prefetchnta
on x86) for memory references that provably are not reused in the L2
cache.  We do this only for loop nests that process more data than fits
entirely in the L2 cache (since the reuse analysis is local to the loop
nest, we cannot otherwise tell whether the contents of the L2 cache are
reused later).
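
For illustration, the temporality ends up encoded in the third argument
of __builtin_prefetch; on x86, locality 3 (the default) expands to
prefetcht0 and locality 0 to prefetchnta.  A hand-written sketch of the
kind of code the pass now emits for a streaming loop (the prefetch
distance of 64 elements is just an example value, not what the pass
computes):

    void
    axpy (double *restrict x, const double *restrict y, double c,
          unsigned n)
    {
      unsigned i;

      for (i = 0; i < n; i++)
        {
          /* The data is not reused, so prefetch it for reading with
             locality 0 (prefetchnta) to avoid evicting more useful
             cache contents.  */
          __builtin_prefetch (&y[i + 64], 0, 0);
          x[i] += c * y[i];
        }
    }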

The patch implements a simple loop-nest-local reuse analysis based on
data dependence analysis -- we consider the distance vector of each
dependence, including the read-read ones, and multiply it by the amount
of data accessed in the corresponding loops.  If the resulting volume
between the two accesses exceeds the size of the L2 cache, we assume
that there is no reuse.  This is slightly imprecise, but it appears to
be good enough in practice, and it can be improved later if necessary.
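
To give a worked example of the volume computation, consider the second
loop from the testcase below (a is a 1000x1000 array of doubles, K is
900):

    for (i = 0; i < K; i++)
      for (j = 0; j < K; j++)
        sum += a[i][j] * a[i + 10][j];

The dependence between a[i][j] and a[i + 10][j] has distance 10 in the
outer loop, and one iteration of the outer loop accesses roughly
2 * 900 * 8 = 14400 bytes, so the volume between the two accesses is
about 144 kB -- well below the L2 cache size the patch assumes, so the
reference keeps an ordinary temporal prefetch.  Changing the distance
to 100 (the third loop of the testcase) raises the volume to about
1.4 MB, which exceeds it, and the reference is marked nontemporal.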

We handle self-reuse specially, since it often happens that although
we never access exactly the same location (so from the dependence
analysis point of view, there is no dependence), we repeatedly access
locations in the same cache line (see the example from the
self_reuse_distance function, motivated by the swim benchmark, sketched
below).
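
Concretely, in the example from the comment in self_reuse_distance,

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        use (a[j][i]);

the inner loop walks a column of the array, so the stride between
consecutive accesses is N * sizeof (double) and no cache line is hit
twice in a row; however, the stride of the reference with respect to
the outer loop is just 8 bytes, so the same line is hit again after one
full execution of the inner loop.  We therefore compute the stride of
the reference in each loop of the nest, and take the volume of the
innermost loop in which the stride is smaller than a cache line as the
self-reuse distance.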

Regarding the performance results, we (unsurprisingly) get a nice
improvement on the STREAM benchmark, measured on an Athlon -- about
6-8% higher bandwidth for Copy and Scale, and about 17% for Add and
Triad:

without prefetchnta:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1806.4899       0.0200       0.0177       0.0282
Scale:       1712.2297       0.0188       0.0187       0.0192
Add:         1844.2428       0.0270       0.0260       0.0328
Triad:       1845.6551       0.0267       0.0260       0.0308

with prefetchnta:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1918.5794       0.0169       0.0167       0.0173
Scale:       1857.2290       0.0174       0.0172       0.0176
Add:         2158.2799       0.0226       0.0222       0.0239
Triad:       2167.2333       0.0233       0.0221       0.0270

For SPEC2000 (on x86_64), we get a slight improvement on SPECfp and
basically neutral results on SPECint (peak is with the patch):

   164.gzip          1400   127      1104    *     1400   127      1105    *
   175.vpr           1400   136      1032    *     1400   135      1034    *
   176.gcc           1100    81.4    1351    *     1100    81.0    1358    *
   181.mcf           1800   285       631    *     1800   284       634    *
   186.crafty        1000    51.7    1935    *     1000    51.7    1935    *
   197.parser        1800   211       852    *     1800   211       853    *
   252.eon                                   X                             X
   253.perlbmk       1800   136      1322    *     1800   137      1318    *
   254.gap           1100    91.0    1208    *     1100    92.0    1195    *
   255.vortex                                X                             X
   256.bzip2         1500   141      1065    *     1500   141      1065    *
   300.twolf                                 X                             X
   Est. SPECint_base2000             1118    
   Est. SPECint2000                                                1118    

   168.wupwise       1600   118      1358    *     1600   118      1360    *
   171.swim          3100   228      1360    *     3100   226      1373    *
   172.mgrid         1800   282       639    *     1800   280       642    *
   173.applu         2100   209      1004    *     2100   208      1008    *
   177.mesa          1400    96.2    1456    *     1400    93.2    1502    *
   178.galgel                                X                             X
   179.art           2600   201      1294    *     2600   202      1284    *
   183.equake        1300   102      1268    *     1300   102      1270    *
   187.facerec       1900   185      1027    *     1900   185      1028    *
   188.ammp          2200   176      1248    *     2200   176      1253    *
   189.lucas         2000   148      1348    *     2000   148      1349    *
   191.fma3d         2100   201      1043    *     2100   201      1043    *
   200.sixtrack                              X                             X
   301.apsi          2600   213      1220    *     2600   213      1219    *
   Est. SPECfp_base2000              1165    
   Est. SPECfp2000                                                 1169    

The size of the L2 cache is hard-coded in the patch; I will send a
followup patch to make it controllable via a param and initialized to
the machine-specific value.

Bootstrapped & regtested on i686, x86_64 and ia64.  Although the
changes are quite simple, this definitely is new functionality, so I
need approval for the patch.

Zdenek

	* tree-vectorizer.h (DR_MISALIGNMENT): Cast aux to integer.
	(SET_DR_MISALIGNMENT): New.
	* tree-vect-analyze.c (vect_compute_data_ref_alignment,
	vect_update_misalignment_for_peel, vect_enhance_data_refs_alignment):
	Use SET_DR_MISALIGNMENT.

	* tree-data-ref.c (create_data_ref, compute_all_dependences,
	find_loop_nest): Export.
	* tree-data-ref.h (struct data_reference): Change aux field to pointer.
	(create_data_ref, compute_all_dependences, find_loop_nest): Declare.
	* tree-ssa-loop-prefetch.c: Include tree-data-ref.h.
	(L1_CACHE_SIZE_BYTES, L2_CACHE_SIZE_BYTES, NONTEMPORAL_FRACTION):
	New macros.
	(struct mem_ref): Add field reuse_distance.
	(find_or_create_group, record_ref): Use XNEW instead of xcalloc.
	Initialize reuse_distance field.
	(issue_prefetch_ref): Select temporality of prefetch according to
	reuse_distance.
	(volume_of_references, volume_of_dist_vector, add_subscript_strides,
	self_reuse_distance, determine_loop_nest_reuse): New functions.
	(loop_prefetch_arrays): Call determine_loop_nest_reuse.
	(tree_ssa_prefetch_arrays): Dump L2 cache size.
	* Makefile.in (tree-ssa-loop-prefetch.o): Add TREE_DATA_REF_H
	dependency.

	* gcc.dg/tree-ssa/prefetch-6.c: New test.

Index: testsuite/gcc.dg/tree-ssa/prefetch-6.c
===================================================================
*** testsuite/gcc.dg/tree-ssa/prefetch-6.c	(revision 0)
--- testsuite/gcc.dg/tree-ssa/prefetch-6.c	(revision 0)
***************
*** 0 ****
--- 1,54 ----
+ /* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+ /* { dg-require-effective-target ilp32 } */
+ /* { dg-options "-O2 -fprefetch-loop-arrays -march=athlon -msse2 -mfpmath=sse -fdump-tree-aprefetch-details" } */
+ 
+ #define N 1000
+ #define K 900
+ 
+ double a[N][N];
+ 
+ double test(void)
+ {
+   unsigned i, j;
+   double sum = 0;
+ 
+   /* Here, we should use non-temporal prefetch instruction.  */
+   for (i = 0; i < K; i++)
+     for (j = 0; j < K; j++)
+       sum += a[i][j];
+ 
+   /* Here, we should not use non-temporal prefetch instruction, since the
+      value of a[i+10][j] is reused in L2 cache.  */
+   for (i = 0; i < K; i++)
+     for (j = 0; j < K; j++)
+       sum += a[i][j] * a[i + 10][j];
+ 
+   /* Here, we should use non-temporal prefetch instruction, since the
+      value of a[i+100][j] is too far to be reused in L2 cache.  */
+   for (i = 0; i < K; i++)
+     for (j = 0; j < K; j++)
+       sum += a[i][j] * a[i + 100][j];
+ 
+   /* Here, temporal prefetches should be used, since the volume of the
+      memory accesses is smaller than L2 cache.  */
+   for (i = 0; i < 100; i++)
+     for (j = 0; j < 100; j++)
+       sum += a[i][j] * a[i + 100][j];
+ 
+   /* Temporal prefetches should be used here (even though the accesses to
+      a[j][i] are independent, the same cache line is almost always hit
+      every N iterations).  */
+   for (i = 0; i < N; i++)
+     for (j = 0; j < N; j++)
+       sum += a[j][i];
+ 
+   return sum;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "Issued prefetch" 5 "aprefetch" } } */
+ /* { dg-final { scan-tree-dump-times "Issued nontemporal prefetch" 3 "aprefetch" } } */
+ 
+ /* { dg-final { scan-assembler-times "prefetcht" 5 } } */
+ /* { dg-final { scan-assembler-times "prefetchnta" 3 } } */
+ 
+ /* { dg-final { cleanup-tree-dump "aprefetch" } } */
Index: tree-vectorizer.h
===================================================================
*** tree-vectorizer.h	(revision 123734)
--- tree-vectorizer.h	(working copy)
*************** is_pattern_stmt_p (stmt_vec_info stmt_in
*** 335,341 ****
  
  /* Reflects actual alignment of first access in the vectorized loop,
     taking into account peeling/versioning if applied.  */
! #define DR_MISALIGNMENT(DR)   (DR)->aux
  
  static inline bool
  aligned_access_p (struct data_reference *data_ref_info)
--- 335,342 ----
  
  /* Reflects actual alignment of first access in the vectorized loop,
     taking into account peeling/versioning if applied.  */
! #define DR_MISALIGNMENT(DR)   ((int) (size_t) (DR)->aux)
! #define SET_DR_MISALIGNMENT(DR, VAL)   ((DR)->aux = (void *) (size_t) (VAL))
  
  static inline bool
  aligned_access_p (struct data_reference *data_ref_info)
Index: tree-data-ref.c
===================================================================
*** tree-data-ref.c	(revision 123734)
--- tree-data-ref.c	(working copy)
*************** free_data_ref (data_reference_p dr)
*** 1907,1913 ****
     DR (returned value) - data_reference struct for MEMREF
  */
  
! static struct data_reference *
  create_data_ref (tree memref, tree stmt, bool is_read)
  {
    struct data_reference *dr = NULL;
--- 1907,1913 ----
     DR (returned value) - data_reference struct for MEMREF
  */
  
! struct data_reference *
  create_data_ref (tree memref, tree stmt, bool is_read)
  {
    struct data_reference *dr = NULL;
*************** compute_self_dependence (struct data_dep
*** 4916,4922 ****
     COMPUTE_SELF_AND_RR is FALSE, don't compute read-read and self
     relations.  */
  
! static void 
  compute_all_dependences (VEC (data_reference_p, heap) *datarefs,
  			 VEC (ddr_p, heap) **dependence_relations,
  			 VEC (loop_p, heap) *loop_nest,
--- 4916,4922 ----
     COMPUTE_SELF_AND_RR is FALSE, don't compute read-read and self
     relations.  */
  
! void 
  compute_all_dependences (VEC (data_reference_p, heap) *datarefs,
  			 VEC (ddr_p, heap) **dependence_relations,
  			 VEC (loop_p, heap) *loop_nest,
*************** find_loop_nest_1 (struct loop *loop, VEC
*** 5134,5140 ****
     contain the loops from the outermost to the innermost, as they will
     appear in the classic distance vector.  */
  
! static bool
  find_loop_nest (struct loop *loop, VEC (loop_p, heap) **loop_nest)
  {
    VEC_safe_push (loop_p, heap, *loop_nest, loop);
--- 5134,5140 ----
     contain the loops from the outermost to the innermost, as they will
     appear in the classic distance vector.  */
  
! bool
  find_loop_nest (struct loop *loop, VEC (loop_p, heap) **loop_nest)
  {
    VEC_safe_push (loop_p, heap, *loop_nest, loop);
Index: tree-data-ref.h
===================================================================
*** tree-data-ref.h	(revision 123734)
--- tree-data-ref.h	(working copy)
*************** struct data_reference
*** 79,85 ****
    tree ref;
  
    /* Auxiliary info specific to a pass.  */
!   int aux;
  
    /* True when the data reference is in RHS of a stmt.  */
    bool is_read;
--- 79,85 ----
    tree ref;
  
    /* Auxiliary info specific to a pass.  */
!   void *aux;
  
    /* True when the data reference is in RHS of a stmt.  */
    bool is_read;
*************** extern void dump_data_dependence_directi
*** 357,363 ****
  extern void free_dependence_relation (struct data_dependence_relation *);
  extern void free_dependence_relations (VEC (ddr_p, heap) *);
  extern void free_data_refs (VEC (data_reference_p, heap) *);
! 
  
  /* Return the index of the variable VAR in the LOOP_NEST array.  */
  
--- 357,366 ----
  extern void free_dependence_relation (struct data_dependence_relation *);
  extern void free_dependence_relations (VEC (ddr_p, heap) *);
  extern void free_data_refs (VEC (data_reference_p, heap) *);
! struct data_reference *create_data_ref (tree, tree, bool);
! bool find_loop_nest (struct loop *, VEC (loop_p, heap) **);
! void compute_all_dependences (VEC (data_reference_p, heap) *,
! 			      VEC (ddr_p, heap) **, VEC (loop_p, heap) *, bool);
  
  /* Return the index of the variable VAR in the LOOP_NEST array.  */
  
Index: tree-vect-analyze.c
===================================================================
*** tree-vect-analyze.c	(revision 123734)
--- tree-vect-analyze.c	(working copy)
*************** vect_compute_data_ref_alignment (struct 
*** 1111,1117 ****
      fprintf (vect_dump, "vect_compute_data_ref_alignment:");
  
    /* Initialize misalignment to unknown.  */
!   DR_MISALIGNMENT (dr) = -1;
  
    misalign = DR_OFFSET_MISALIGNMENT (dr);
    aligned_to = DR_ALIGNED_TO (dr);
--- 1111,1117 ----
      fprintf (vect_dump, "vect_compute_data_ref_alignment:");
  
    /* Initialize misalignment to unknown.  */
!   SET_DR_MISALIGNMENT (dr, -1);
  
    misalign = DR_OFFSET_MISALIGNMENT (dr);
    aligned_to = DR_ALIGNED_TO (dr);
*************** vect_compute_data_ref_alignment (struct 
*** 1182,1188 ****
        return false;
      }
  
!   DR_MISALIGNMENT (dr) = TREE_INT_CST_LOW (misalign);
  
    if (vect_print_dump_info (REPORT_DETAILS))
      {
--- 1182,1188 ----
        return false;
      }
  
!   SET_DR_MISALIGNMENT (dr, TREE_INT_CST_LOW (misalign));
  
    if (vect_print_dump_info (REPORT_DETAILS))
      {
*************** vect_update_misalignment_for_peel (struc
*** 1246,1252 ****
        && (DR_MISALIGNMENT (dr) / dr_size ==
            DR_MISALIGNMENT (dr_peel) / dr_peel_size))
      {
!       DR_MISALIGNMENT (dr) = 0;
        return;
      }
  
--- 1246,1252 ----
        && (DR_MISALIGNMENT (dr) / dr_size ==
            DR_MISALIGNMENT (dr_peel) / dr_peel_size))
      {
!       SET_DR_MISALIGNMENT (dr, 0);
        return;
      }
  
*************** vect_update_misalignment_for_peel (struc
*** 1260,1280 ****
          continue;
        gcc_assert (DR_MISALIGNMENT (dr) / dr_size ==
                    DR_MISALIGNMENT (dr_peel) / dr_peel_size);
!       DR_MISALIGNMENT (dr) = 0;
        return;
      }
  
    if (known_alignment_for_access_p (dr)
        && known_alignment_for_access_p (dr_peel))
      {
!       DR_MISALIGNMENT (dr) += npeel * dr_size;
!       DR_MISALIGNMENT (dr) %= UNITS_PER_SIMD_WORD;
        return;
      }
  
    if (vect_print_dump_info (REPORT_DETAILS))
      fprintf (vect_dump, "Setting misalignment to -1.");
!   DR_MISALIGNMENT (dr) = -1;
  }
  
  
--- 1260,1282 ----
          continue;
        gcc_assert (DR_MISALIGNMENT (dr) / dr_size ==
                    DR_MISALIGNMENT (dr_peel) / dr_peel_size);
!       SET_DR_MISALIGNMENT (dr, 0);
        return;
      }
  
    if (known_alignment_for_access_p (dr)
        && known_alignment_for_access_p (dr_peel))
      {
!       int misal = DR_MISALIGNMENT (dr);
!       misal += npeel * dr_size;
!       misal %= UNITS_PER_SIMD_WORD;
!       SET_DR_MISALIGNMENT (dr, misal);
        return;
      }
  
    if (vect_print_dump_info (REPORT_DETAILS))
      fprintf (vect_dump, "Setting misalignment to -1.");
!   SET_DR_MISALIGNMENT (dr, -1);
  }
  
  
*************** vect_enhance_data_refs_alignment (loop_v
*** 1563,1569 ****
  	  save_misalignment = DR_MISALIGNMENT (dr);
  	  vect_update_misalignment_for_peel (dr, dr0, npeel);
  	  supportable_dr_alignment = vect_supportable_dr_alignment (dr);
! 	  DR_MISALIGNMENT (dr) = save_misalignment;
  	  
  	  if (!supportable_dr_alignment)
  	    {
--- 1565,1571 ----
  	  save_misalignment = DR_MISALIGNMENT (dr);
  	  vect_update_misalignment_for_peel (dr, dr0, npeel);
  	  supportable_dr_alignment = vect_supportable_dr_alignment (dr);
! 	  SET_DR_MISALIGNMENT (dr, save_misalignment);
  	  
  	  if (!supportable_dr_alignment)
  	    {
*************** vect_enhance_data_refs_alignment (loop_v
*** 1587,1593 ****
  
            LOOP_VINFO_UNALIGNED_DR (loop_vinfo) = dr0;
            LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo) = DR_MISALIGNMENT (dr0);
!           DR_MISALIGNMENT (dr0) = 0;
  	  if (vect_print_dump_info (REPORT_ALIGNMENT))
              fprintf (vect_dump, "Alignment of access forced using peeling.");
  
--- 1589,1595 ----
  
            LOOP_VINFO_UNALIGNED_DR (loop_vinfo) = dr0;
            LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo) = DR_MISALIGNMENT (dr0);
! 	  SET_DR_MISALIGNMENT (dr0, 0);
  	  if (vect_print_dump_info (REPORT_ALIGNMENT))
              fprintf (vect_dump, "Alignment of access forced using peeling.");
  
*************** vect_enhance_data_refs_alignment (loop_v
*** 1688,1694 ****
          {
            stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
            dr = STMT_VINFO_DATA_REF (stmt_info);
!           DR_MISALIGNMENT (dr) = 0;
  	  if (vect_print_dump_info (REPORT_ALIGNMENT))
              fprintf (vect_dump, "Alignment of access forced using versioning.");
          }
--- 1690,1696 ----
          {
            stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
            dr = STMT_VINFO_DATA_REF (stmt_info);
! 	  SET_DR_MISALIGNMENT (dr, 0);
  	  if (vect_print_dump_info (REPORT_ALIGNMENT))
              fprintf (vect_dump, "Alignment of access forced using versioning.");
          }
Index: tree-ssa-loop-prefetch.c
===================================================================
*** tree-ssa-loop-prefetch.c	(revision 123734)
--- tree-ssa-loop-prefetch.c	(working copy)
*************** Software Foundation, 59 Temple Place - S
*** 46,51 ****
--- 46,52 ----
  #include "params.h"
  #include "langhooks.h"
  #include "tree-inline.h"
+ #include "tree-data-ref.h"
  
  /* This pass inserts prefetch instructions to optimize cache usage during
     accesses to arrays in loops.  It processes loops sequentially and:
*************** Software Foundation, 59 Temple Place - S
*** 82,87 ****
--- 83,92 ----
  	   7/32.
         (5) has PREFETCH_MOD 1 as well.
  
+       Additionally, we use data dependence analysis to determine for each
+       reference the distance till the first reuse; this information is used
+       to determine the temporality of the issued prefetch instruction.
+ 
     3) We determine how much ahead we need to prefetch.  The number of
        iterations needed is time to fetch / time spent in one iteration of
        the loop.  The problem is that we do not know either of these values,
*************** Software Foundation, 59 Temple Place - S
*** 161,166 ****
--- 166,182 ----
  #define HAVE_prefetch 0
  #endif
  
+ #define L1_CACHE_SIZE_BYTES ((unsigned) (L1_CACHE_SIZE * L1_CACHE_LINE_SIZE))
+ /* TODO:  Add parameter to specify L2 cache size.  */
+ #define L2_CACHE_SIZE_BYTES (8 * L1_CACHE_SIZE_BYTES)
+ 
+ /* We consider a memory access nontemporal if it is not reused sooner than
+    after L2_CACHE_SIZE_BYTES of memory are accessed.  However, we ignore
+    accesses closer than L1_CACHE_SIZE_BYTES / NONTEMPORAL_FRACTION,
+    so that we use nontemporal prefetches, e.g. if a single memory location
+    is accessed several times in a single iteration of the loop.  */
+ #define NONTEMPORAL_FRACTION 16
+ 
  /* The group of references between that reuse may occur.  */
  
  struct mem_ref_group
*************** struct mem_ref
*** 190,195 ****
--- 206,213 ----
    unsigned HOST_WIDE_INT prefetch_before;
  				/* Prefetch only first PREFETCH_BEFORE
  				   iterations.  */
+   unsigned reuse_distance;	/* The amount of data accessed before the first
+ 				   reuse of this value.  */
    bool issue_prefetch_p;	/* Should we really issue the prefetch?  */
    struct mem_ref *next;		/* The next reference in the group.  */
  };
*************** find_or_create_group (struct mem_ref_gro
*** 236,242 ****
  	break;
      }
  
!   group = xcalloc (1, sizeof (struct mem_ref_group));
    group->base = base;
    group->step = step;
    group->refs = NULL;
--- 254,260 ----
  	break;
      }
  
!   group = XNEW (struct mem_ref_group);
    group->base = base;
    group->step = step;
    group->refs = NULL;
*************** record_ref (struct mem_ref_group *group,
*** 273,285 ****
  	return;
      }
  
!   (*aref) = xcalloc (1, sizeof (struct mem_ref));
    (*aref)->stmt = stmt;
    (*aref)->mem = mem;
    (*aref)->delta = delta;
    (*aref)->write_p = write_p;
    (*aref)->prefetch_before = PREFETCH_ALL;
    (*aref)->prefetch_mod = 1;
    (*aref)->issue_prefetch_p = false;
    (*aref)->group = group;
    (*aref)->next = NULL;
--- 291,304 ----
  	return;
      }
  
!   (*aref) = XNEW (struct mem_ref);
    (*aref)->stmt = stmt;
    (*aref)->mem = mem;
    (*aref)->delta = delta;
    (*aref)->write_p = write_p;
    (*aref)->prefetch_before = PREFETCH_ALL;
    (*aref)->prefetch_mod = 1;
+   (*aref)->reuse_distance = 0;
    (*aref)->issue_prefetch_p = false;
    (*aref)->group = group;
    (*aref)->next = NULL;
*************** static void
*** 815,826 ****
  issue_prefetch_ref (struct mem_ref *ref, unsigned unroll_factor, unsigned ahead)
  {
    HOST_WIDE_INT delta;
!   tree addr, addr_base, prefetch, write_p;
    block_stmt_iterator bsi;
    unsigned n_prefetches, ap;
  
    if (dump_file && (dump_flags & TDF_DETAILS))
!     fprintf (dump_file, "Issued prefetch for %p.\n", (void *) ref);
  
    bsi = bsi_for_stmt (ref->stmt);
  
--- 834,848 ----
  issue_prefetch_ref (struct mem_ref *ref, unsigned unroll_factor, unsigned ahead)
  {
    HOST_WIDE_INT delta;
!   tree addr, addr_base, prefetch, write_p, local;
    block_stmt_iterator bsi;
    unsigned n_prefetches, ap;
+   bool nontemporal = ref->reuse_distance >= L2_CACHE_SIZE_BYTES;
  
    if (dump_file && (dump_flags & TDF_DETAILS))
!     fprintf (dump_file, "Issued%s prefetch for %p.\n",
! 	     nontemporal ? " nontemporal" : "",
! 	     (void *) ref);
  
    bsi = bsi_for_stmt (ref->stmt);
  
*************** issue_prefetch_ref (struct mem_ref *ref,
*** 829,834 ****
--- 851,857 ----
    addr_base = build_fold_addr_expr_with_type (ref->mem, ptr_type_node);
    addr_base = force_gimple_operand_bsi (&bsi, unshare_expr (addr_base), true, NULL);
    write_p = ref->write_p ? integer_one_node : integer_zero_node;
+   local = build_int_cst (integer_type_node, nontemporal ? 0 : 3);
  
    for (ap = 0; ap < n_prefetches; ap++)
      {
*************** issue_prefetch_ref (struct mem_ref *ref,
*** 840,846 ****
  
        /* Create the prefetch instruction.  */
        prefetch = build_call_expr (built_in_decls[BUILT_IN_PREFETCH],
! 				  2, addr, write_p);
        bsi_insert_before (&bsi, prefetch, BSI_SAME_STMT);
      }
  }
--- 863,869 ----
  
        /* Create the prefetch instruction.  */
        prefetch = build_call_expr (built_in_decls[BUILT_IN_PREFETCH],
! 				  3, addr, write_p, local);
        bsi_insert_before (&bsi, prefetch, BSI_SAME_STMT);
      }
  }
*************** determine_unroll_factor (struct loop *lo
*** 935,940 ****
--- 958,1268 ----
    return factor;
  }
  
+ /* Returns the total volume of the memory references REFS, taking into account
+    reuses in the innermost loop and cache line size.  TODO -- we should also
+    take into account reuses across the iterations of the loops in the loop
+    nest.  */
+ 
+ static unsigned
+ volume_of_references (struct mem_ref_group *refs)
+ {
+   unsigned volume = 0;
+   struct mem_ref_group *gr;
+   struct mem_ref *ref;
+ 
+   for (gr = refs; gr; gr = gr->next)
+     for (ref = gr->refs; ref; ref = ref->next)
+       {
+ 	/* Almost always reuses another value?  */
+ 	if (ref->prefetch_before != PREFETCH_ALL)
+ 	  continue;
+ 
+ 	/* If several iterations access the same cache line, use the size of
+ 	   the line divided by this number.  Otherwise, a cache line is
+ 	   accessed in each iteration.  TODO -- in the latter case, we should
+ 	   take the size of the reference into account, rounding it up on cache
+ 	   line size multiple.  */
+ 	volume += L1_CACHE_LINE_SIZE / ref->prefetch_mod;
+       }
+   return volume;
+ }
+ 
+ /* Returns the volume of memory references accessed across VEC iterations of
+    loops, whose sizes are described in the LOOP_SIZES array.  N is the number
+    of the loops in the nest (length of VEC and LOOP_SIZES vectors).  */
+ 
+ static unsigned
+ volume_of_dist_vector (lambda_vector vec, unsigned *loop_sizes, unsigned n)
+ {
+   unsigned i;
+ 
+   for (i = 0; i < n; i++)
+     if (vec[i] != 0)
+       break;
+ 
+   if (i == n)
+     return 0;
+ 
+   gcc_assert (vec[i] > 0);
+ 
+   /* We ignore the parts of the distance vector in subloops, since usually
+      the numbers of iterations are much smaller.  */
+   return loop_sizes[i] * vec[i];
+ }
+ 
+ /* Add the steps of ACCESS_FN multiplied by STRIDE to the array STRIDES
+    at the position corresponding to the loop of the step.  N is the depth
+    of the considered loop nest, and LOOP is its innermost loop.  */
+ 
+ static void
+ add_subscript_strides (tree access_fn, unsigned stride,
+ 		       HOST_WIDE_INT *strides, unsigned n, struct loop *loop)
+ {
+   struct loop *aloop;
+   tree step;
+   HOST_WIDE_INT astep;
+   unsigned min_depth = loop->depth - n;
+ 
+   while (TREE_CODE (access_fn) == POLYNOMIAL_CHREC)
+     {
+       aloop = get_chrec_loop (access_fn);
+       step = CHREC_RIGHT (access_fn);
+       access_fn = CHREC_LEFT (access_fn);
+ 
+       if ((unsigned) aloop->depth <= min_depth)
+ 	continue;
+ 
+       if (host_integerp (step, 0))
+ 	astep = tree_low_cst (step, 0);
+       else
+ 	astep = L1_CACHE_LINE_SIZE;
+ 
+       strides[n - 1 - loop->depth + aloop->depth] += astep * stride;
+ 
+     }
+ }
+ 
+ /* Returns the volume of memory references accessed between two consecutive
+    self-reuses of the reference DR.  We consider the subscripts of DR in N
+    loops, and LOOP_SIZES contains the volumes of accesses in each of the
+    loops.  LOOP is the innermost loop of the current loop nest.  */
+ 
+ static unsigned
+ self_reuse_distance (data_reference_p dr, unsigned *loop_sizes, unsigned n,
+ 		     struct loop *loop)
+ {
+   tree stride, access_fn;
+   HOST_WIDE_INT *strides, astride;
+   VEC (tree, heap) *access_fns;
+   tree ref = DR_REF (dr);
+   unsigned i, ret = ~0u;
+ 
+   /* In the following example:
+ 
+      for (i = 0; i < N; i++)
+        for (j = 0; j < N; j++)
+          use (a[j][i]);
+      the same cache line is accessed each N steps (except if the change from
+      i to i + 1 crosses the boundary of the cache line).  Thus, for self-reuse,
+      we cannot rely purely on the results of the data dependence analysis.
+ 
+      Instead, we compute the stride of the reference in each loop, and consider
+      the innermost loop in which the stride is less than the cache line size.  */
+ 
+   strides = XCNEWVEC (HOST_WIDE_INT, n);
+   access_fns = DR_ACCESS_FNS (dr);
+ 
+   for (i = 0; VEC_iterate (tree, access_fns, i, access_fn); i++)
+     {
+       /* Keep track of the reference corresponding to the subscript, so that we
+ 	 know its stride.  */
+       while (handled_component_p (ref) && TREE_CODE (ref) != ARRAY_REF)
+ 	ref = TREE_OPERAND (ref, 0);
+       
+       if (TREE_CODE (ref) == ARRAY_REF)
+ 	{
+ 	  stride = TYPE_SIZE_UNIT (TREE_TYPE (ref));
+ 	  if (host_integerp (stride, 1))
+ 	    astride = tree_low_cst (stride, 1);
+ 	  else
+ 	    astride = L1_CACHE_LINE_SIZE;
+ 
+ 	  ref = TREE_OPERAND (ref, 0);
+ 	}
+       else
+ 	astride = 1;
+ 
+       add_subscript_strides (access_fn, astride, strides, n, loop);
+     }
+ 
+   for (i = n; i-- > 0; )
+     {
+       unsigned HOST_WIDE_INT s;
+ 
+       s = strides[i] < 0 ?  -strides[i] : strides[i];
+ 
+       if (s < (unsigned) L1_CACHE_LINE_SIZE
+ 	  && (loop_sizes[i]
+ 	      > (unsigned) (L1_CACHE_SIZE_BYTES / NONTEMPORAL_FRACTION)))
+ 	{
+ 	  ret = loop_sizes[i];
+ 	  break;
+ 	}
+     }
+ 
+   free (strides);
+   return ret;
+ }
+ 
+ /* Determines the distance till the first reuse of each reference in REFS
+    in the loop nest of LOOP.  */
+ 
+ static void
+ determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs)
+ {
+   struct loop *nest, *aloop;
+   VEC (data_reference_p, heap) *datarefs = NULL;
+   VEC (ddr_p, heap) *dependences = NULL;
+   struct mem_ref_group *gr;
+   struct mem_ref *ref;
+   VEC (loop_p, heap) *vloops = NULL;
+   unsigned *loop_data_size;
+   unsigned i, j, n;
+   unsigned volume, dist, adist;
+   HOST_WIDE_INT vol;
+   data_reference_p dr;
+   ddr_p dep;
+ 
+   if (loop->inner)
+     return;
+ 
+   /* Find the outermost loop of the loop nest of LOOP (we require that
+      there are no sibling loops inside the nest).  */
+   nest = loop;
+   while (1)
+     {
+       aloop = nest->outer;
+ 
+       if (aloop == current_loops->tree_root
+ 	  || aloop->inner->next)
+ 	break;
+ 
+       nest = aloop;
+     }
+ 
+   /* For each loop, determine the amount of data accessed in each iteration.
+      We use this to estimate whether the reference is evicted from the
+      cache before its reuse.  */
+   find_loop_nest (nest, &vloops);
+   n = VEC_length (loop_p, vloops);
+   loop_data_size = XNEWVEC (unsigned, n);
+   volume = volume_of_references (refs);
+   i = n;
+   while (i-- != 0)
+     {
+       loop_data_size[i] = volume;
+       /* Bound the volume by the L2 cache size, since above this bound,
+ 	 all dependence distances are equivalent.  */
+       if (volume > L2_CACHE_SIZE_BYTES)
+ 	continue;
+ 
+       aloop = VEC_index (loop_p, vloops, i);
+       vol = estimated_loop_iterations_int (aloop, false);
+       if (vol < 0)
+ 	vol = expected_loop_iterations (aloop);
+       volume *= vol;
+     }
+ 
+   /* Prepare the references in the form suitable for data dependence
+      analysis.  We ignore unanalysable data references (the results
+      are used just as a heuristic to estimate the temporality of the
+      references, hence we do not need to worry about correctness).  */
+   for (gr = refs; gr; gr = gr->next)
+     for (ref = gr->refs; ref; ref = ref->next)
+       {
+ 	dr = create_data_ref (ref->mem, ref->stmt, !ref->write_p);
+ 
+ 	if (dr)
+ 	  {
+ 	    ref->reuse_distance = volume;
+ 	    dr->aux = ref;
+ 	    VEC_safe_push (data_reference_p, heap, datarefs, dr);
+ 	  }
+       }
+ 
+   for (i = 0; VEC_iterate (data_reference_p, datarefs, i, dr); i++)
+     {
+       dist = self_reuse_distance (dr, loop_data_size, n, loop);
+       ref = dr->aux;
+       if (ref->reuse_distance > dist)
+ 	ref->reuse_distance = dist;
+     }
+ 
+   compute_all_dependences (datarefs, &dependences, vloops, true);
+ 
+   for (i = 0; VEC_iterate (ddr_p, dependences, i, dep); i++)
+     {
+       if (DDR_ARE_DEPENDENT (dep) == chrec_known)
+ 	continue;
+ 
+       if (DDR_ARE_DEPENDENT (dep) == chrec_dont_know
+ 	  || DDR_NUM_DIST_VECTS (dep) == 0)
+ 	{
+ 	  /* If the dependence cannot be analysed, assume that there might be
+ 	     a reuse.  */
+ 	  dist = 0;
+ 	}
+       else
+ 	{
+ 	  /* The distance vectors are normalised to be always lexicographically
+ 	     positive, hence we cannot tell just from them whether DDR_A comes
+ 	     before DDR_B or vice versa.  However, it is not important,
+ 	     anyway -- if DDR_A is close to DDR_B, then it is either reused in
+ 	     DDR_B (and it is not nontemporal), or it reuses the value of DDR_B
+ 	     in cache (and marking it as nontemporal would not affect
+ 	     anything).  */
+ 
+ 	  dist = volume;
+ 	  for (j = 0; j < DDR_NUM_DIST_VECTS (dep); j++)
+ 	    {
+ 	      adist = volume_of_dist_vector (DDR_DIST_VECT (dep, j),
+ 					     loop_data_size, n);
+ 
+ 	      /* Ignore accesses closer than
+ 		 L1_CACHE_SIZE_BYTES / NONTEMPORAL_FRACTION,
+ 		 so that we use nontemporal prefetches, e.g. if a single memory
+ 		 location is accessed several times in a single iteration of
+ 		 the loop.  */
+ 	      if (adist < L1_CACHE_SIZE_BYTES / NONTEMPORAL_FRACTION)
+ 		continue;
+ 
+ 	      if (adist < dist)
+ 		dist = adist;
+ 	    }
+ 	}
+ 
+       ref = DDR_A (dep)->aux;
+       if (ref->reuse_distance > dist)
+ 	ref->reuse_distance = dist;
+       ref = DDR_B (dep)->aux;
+       if (ref->reuse_distance > dist)
+ 	ref->reuse_distance = dist;
+     }
+ 
+   free_dependence_relations (dependences);
+   free_data_refs (datarefs);
+   free (loop_data_size);
+ 
+   if (dump_file && (dump_flags & TDF_DETAILS))
+     {
+       fprintf (dump_file, "Reuse distances:\n");
+       for (gr = refs; gr; gr = gr->next)
+ 	for (ref = gr->refs; ref; ref = ref->next)
+ 	  fprintf (dump_file, " ref %p distance %u\n",
+ 		   (void *) ref, ref->reuse_distance);
+     }
+ }
+ 
  /* Issue prefetch instructions for array references in LOOP.  Returns
     true if the LOOP was unrolled.  */
  
*************** loop_prefetch_arrays (struct loop *loop)
*** 956,961 ****
--- 1284,1291 ----
    if (!anything_to_prefetch_p (refs))
      goto fail;
  
+   determine_loop_nest_reuse (loop, refs);
+ 
    /* Step 3: determine the ahead and unroll factor.  */
  
    /* FIXME: the time should be weighted by the probabilities of the blocks in
*************** tree_ssa_prefetch_arrays (void)
*** 1027,1036 ****
        fprintf (dump_file, "    simultaneous prefetches: %d\n",
  	       SIMULTANEOUS_PREFETCHES);
        fprintf (dump_file, "    prefetch latency: %d\n", PREFETCH_LATENCY);
-       fprintf (dump_file, "    L1 cache size: %d (%d bytes)\n",
- 	       L1_CACHE_SIZE, L1_CACHE_SIZE * L1_CACHE_LINE_SIZE);
-       fprintf (dump_file, "    L1 cache line size: %d\n", L1_CACHE_LINE_SIZE);
        fprintf (dump_file, "    prefetch block size: %d\n", PREFETCH_BLOCK);
        fprintf (dump_file, "\n");
      }
  
--- 1357,1367 ----
        fprintf (dump_file, "    simultaneous prefetches: %d\n",
  	       SIMULTANEOUS_PREFETCHES);
        fprintf (dump_file, "    prefetch latency: %d\n", PREFETCH_LATENCY);
        fprintf (dump_file, "    prefetch block size: %d\n", PREFETCH_BLOCK);
+       fprintf (dump_file, "    L1 cache size: %d lines, %d bytes\n",
+ 	       L1_CACHE_SIZE, L1_CACHE_SIZE_BYTES);
+       fprintf (dump_file, "    L1 cache line size: %d\n", L1_CACHE_LINE_SIZE);
+       fprintf (dump_file, "    L2 cache size: %d bytes\n", L2_CACHE_SIZE_BYTES);
        fprintf (dump_file, "\n");
      }
  
Index: Makefile.in
===================================================================
*** Makefile.in	(revision 123734)
--- Makefile.in	(working copy)
*************** tree-ssa-loop-prefetch.o: tree-ssa-loop-
*** 2134,2140 ****
     output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
     tree-pass.h $(GGC_H) $(RECOG_H) insn-config.h $(HASHTAB_H) $(SCEV_H) \
     $(CFGLOOP_H) $(PARAMS_H) langhooks.h $(BASIC_BLOCK_H) hard-reg-set.h \
!    tree-chrec.h toplev.h langhooks.h $(TREE_INLINE_H)
  tree-ssa-loop-ivopts.o : tree-ssa-loop-ivopts.c $(TREE_FLOW_H) $(CONFIG_H) \
     $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(CFGLOOP_H) $(EXPR_H) \
     output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
--- 2134,2140 ----
     output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
     tree-pass.h $(GGC_H) $(RECOG_H) insn-config.h $(HASHTAB_H) $(SCEV_H) \
     $(CFGLOOP_H) $(PARAMS_H) langhooks.h $(BASIC_BLOCK_H) hard-reg-set.h \
!    tree-chrec.h toplev.h langhooks.h $(TREE_INLINE_H) $(TREE_DATA_REF_H)
  tree-ssa-loop-ivopts.o : tree-ssa-loop-ivopts.c $(TREE_FLOW_H) $(CONFIG_H) \
     $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(CFGLOOP_H) $(EXPR_H) \
     output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \


