[PATCH, RFC] New target interface for vectorizer cost model

Richard Guenther rguenther@suse.de
Tue Jul 3 14:00:00 GMT 2012


On Tue, 3 Jul 2012, William J. Schmidt wrote:

> Hi,
> 
> This is the first in a series of patches to give targets a more global
> view of vectorized loops and basic blocks, to enable more accurate
> decisions about when to vectorize.  The purpose of this patch is to
> establish a new interface (init_cost, add_stmt_cost, finish_cost,
> destroy_cost_data) and to use that interface in parallel with the
> current cost model inside-cost calculations.  The patch in its current
> form assert()s that the two cost models provide the same results.  (The
> assert will have to be removed before the patch is committed.)
> 
> I ran into several probable-bugs with the current cost model
> implementation, which I've called out with comments in the code.  I'd
> like to get confirmation that others believe the existing code is wrong
> in these cases.
> 
>  * An inside-cost is computed and attached to PHI statements that
> represent induction operations, but that cost is never added into the
> inside cost of the loop (the final calculation only iterates over
> regular statements, not PHIs).  See vect_model_induction_cost.
> 
>  * Similarly, a lot of effort goes into determining inside and outside
> costs for loop peeling, but those costs are never factored into the
> vectorization decision.  See vect_enhance_data_refs_alignment.
> 
>  * When building SLP trees, a load of a single value sometimes feeds a
> group of instructions to be vectorized.  This is represented by an SLP
> node where the load is repeated N times.  A side effect of this is that
> its cost is calculated N times.  This is harmless (other than wasting
> resources) under the old model, but can't be tolerated with the new one.
> 
> The logic is complicated in some cases by the order of events in the
> vectorizer.  For instance, the number of copies of SLP instructions is
> determined quite late in vect_update_slp_costs_according_to_vf, so the
> passing of these instructions to the new cost model can't occur until
> their counts are known accurately.  Similarly, peeling is done by trying
> a variety of peeling factors (each with its cost calculated) and one is
> selected, so we have to be careful to only pass the costs from the
> selected peel factor to the cost model.  A stmt_vector_for_cost is used
> to save representations of instructions until they are ready to be
> presented to the target cost model.
> 
> As a reminder, here was the overall patch plan:
> 
> (1) This one.
> (1a) Split up cost hooks (one for data refs with misalign parm, one for
> vector_stmt with tree code, etc.).
> (2) Handle the SLP ordering problem (different order of analysis and
> transformation).
> (3) Handle outside costs in the target model.
> (4) Remove unnecessary cost fields and the calls that set them.
> 
> I changed the interface for add_stmt_cost to include a stmt_vec_info for
> the statement.  With this I think we can avoid (1a), since the target
> can extract the tree code from the statement in the stmt_vec_info.  This
> leaves the misalign parameter as a wart for statements that aren't data
> references, but that seems tolerable.
> 
> I also think we should dispense with (2).  After working with this, I
> can see that there's no way we'll be able to provide any ordering
> information on statements presented to the cost model.  Target
> heuristics will have to be designed based on the overall set of
> instructions, with no inference about their order.
> 
> So after this patch, we should only need an outside-costs patch and a
> cleanup patch, along with any follow-up heuristics for the targets.
> 
> For testing, I've built the regression testsuite and SPEC
> cpu2000/cpu2006 with no remaining cost model mismatches or regressions.
> I won't be surprised if there are still some mismatches that haven't
> been caught yet, but the major ones should be out of play.
> 
> This is a more complicated patch than I originally expected, so let me
> know whether you think this is still the right way to proceed.

Yes, this is still the way to proceed IMHO.  Comments below.

> Thanks,
> Bill
> 
> 
> 2012-07-03  Bill Schmidt  <wschmidt@linux.ibm.com>
> 
> 	* doc/tm.texi: Regenerate.
> 	* doc/tm.texi.in (TARGET_VECTORIZE_INIT_COST): New hook.
> 	(TARGET_VECTORIZE_ADD_STMT_COST): Likewise.
> 	(TARGET_VECTORIZE_FINISH_COST): Likewise.
> 	(TARGET_VECTORIZE_DESTROY_COST_DATA): Likewise.
> 	* targhooks.c (default_init_cost): New function.
> 	(default_add_stmt_cost): Likewise.
> 	(default_finish_cost): Likewise.
> 	(default_destroy_cost_data): Likewise.
> 	* targhooks.h (default_init_cost): New decl.
> 	(default_add_stmt_cost): Likewise.
> 	(default_finish_cost): Likewise.
> 	(default_destroy_cost_data): Likewise.
> 	* target.def (init_cost): New DEFHOOK.
> 	(add_stmt_cost): Likewise.
> 	(finish_cost): Likewise.
> 	(destroy_cost_data): Likewise.
> 	* target.h (struct _loop_vec_info): New extern decl.
> 	(struct _stmt_vec_info): Likewise.
> 	(stmt_vectype): Likewise.
> 	(stmt_in_inner_loop_p): Likewise.
> 	* tree-vectorizer.c (target_cost_data): New static var.
> 	* tree-vectorizer.h (target_cost_data): New extern decl.
> 	(stmt_info_for_cost): New struct/typedef.
> 	(stmt_vector_for_cost): New VEC/typedef.
> 	(add_stmt_info_to_vec): New function.
> 	(struct _slp_instance): Add stmt_cost_vec field.
> 	(SLP_INSTANCE_STMT_COST_VEC): New accessor macro.
> 	(struct _vect_peel_extended_info): Add stmt_cost_vec field.
> 	(init_cost): New function.
> 	(add_stmt_cost): Likewise.
> 	(finish_cost): Likewise.
> 	(destroy_cost_data): Likewise.
> 	(record_stmt_cost): Likewise.
> 	(vect_model_simple_cost): Change parameter list.
> 	(vect_model_store_cost): Likewise.
> 	(vect_model_load_cost): Likewise.
> 	(vect_get_load_cost): Likewise.
> 	(vect_get_store_cost): Likewise.
> 	* tree-vect-loop.c (vect_analyze_loop_operations): Add
> 	cost_data_released output parameter; set its value.
> 	(vect_analyze_loop_2): Call init_cost and destroy_cost_data; add
> 	argument to vect_analyze_loop_operations call.
> 	(vect_estimate_min_profitable_iters): Call finish_cost and verify
> 	its result matches vec_inside_cost.
> 	(vect_model_reduction_cost): Call add_stmt_cost.
> 	(vect_model_induction_cost): Call add_stmt_cost (but comment it out
> 	to match current buggy behavior for now).
> 	* tree-vect-data-refs.c (vect_get_data_access_cost): Change to
> 	return a stmt_vector_for_cost; modify calls to vect_get_load_cost
> 	and vect_get_store_cost to obtain the value to return.
> 	(vect_peeling_hash_get_lowest_cost): Obtain a stmt_cost_vec from
> 	vect_get_data_access_cost and store it in the minimum peeling
> 	structure.
> 	(vect_peeling_hash_choose_best_peeling): Change the parameter list
> 	to add a (stmt_vector_for_cost *) output parameter, and set its value.
> 	(vect_enhance_data_refs_alignment): Ignore the new return value from
> 	calls to vect_get_data_access_cost; obtain stmt_cost_vec from
> 	vect_peeling_hash_choose_best_peeling and pass its contents to the
> 	target cost model (but comment the latter out to match current buggy
> 	behavior for now).
> 	* tree-vect-stmts.c (stmt_vectype): New function.
> 	(stmt_in_inner_loop_p): Likewise.
> 	(vect_model_simple_cost): Add stmt_cost_vec parameter; call
> 	record_stmt_cost.
> 	(vect_model_promotion_demotion_cost): Call add_stmt_cost.
> 	(vect_model_store_cost): Add stmt_cost_vec parameter; call
> 	record_stmt_cost; add stmt_cost_vec parameter to
> 	vect_get_store_cost call.
> 	(vect_get_store_cost): Add stmt_cost_vec parameter; call
> 	record_stmt_cost.
> 	(vect_model_load_cost): Add stmt_cost_vec parameter; call
> 	record_stmt_cost; add stmt_cost_vec parameter to
> 	vect_get_load_cost call.
> 	(vect_get_load_cost): Add stmt_cost_vec parameter; call
> 	record_stmt_cost.
> 	(vectorizable_call): Add NULL parameter to vect_model_simple_cost call.
> 	(vectorizable_conversion): Likewise.
> 	(vectorizable_assignment): Likewise.
> 	(vectorizable_shift): Likewise.
> 	(vectorizable_operation): Likewise.
> 	(vectorizable_store): Add NULL parameter to vect_model_store_cost call.
> 	(vectorizable_load): Add NULL parameter to vect_model_load_cost call.
> 	* config/spu/spu.c (TARGET_VECTORIZE_INIT_COST): New macro def.
> 	(TARGET_VECTORIZE_ADD_STMT_COST): Likewise.
> 	(TARGET_VECTORIZE_FINISH_COST): Likewise.
> 	(TARGET_VECTORIZE_DESTROY_COST_DATA): Likewise.
> 	(spu_init_cost): New function.
> 	(spu_add_stmt_cost): Likewise.
> 	(spu_finish_cost): Likewise.
> 	(spu_destroy_cost_data): Likewise.
> 	* config/i386/i386.c (ix86_init_cost): New function.
> 	(ix86_add_stmt_cost): Likewise.
> 	(ix86_finish_cost): Likewise.
> 	(ix86_destroy_cost_data): Likewise.
> 	(TARGET_VECTORIZE_INIT_COST): New macro def.
> 	(TARGET_VECTORIZE_ADD_STMT_COST): Likewise.
> 	(TARGET_VECTORIZE_FINISH_COST): Likewise.
> 	(TARGET_VECTORIZE_DESTROY_COST_DATA): Likewise.
> 	* config/rs6000/rs6000.c (TARGET_VECTORIZE_INIT_COST): New macro def.
> 	(TARGET_VECTORIZE_ADD_STMT_COST): Likewise.
> 	(TARGET_VECTORIZE_FINISH_COST): Likewise.
> 	(TARGET_VECTORIZE_DESTROY_COST_DATA): Likewise.
> 	(rs6000_init_cost): New function.
> 	(rs6000_add_stmt_cost): Likewise.
> 	(rs6000_finish_cost): Likewise.
> 	(rs6000_destroy_cost_data): Likewise.
> 	* tree-vect-slp.c (vect_free_slp_instance): Free stmt_cost_vec.
> 	(vect_get_and_check_slp_defs): Add stmt_cost_vec parameter; add
> 	stmt_cost_vec parameter to vect_model_store_cost and
> 	vect_model_simple_cost calls.
> 	(vect_build_slp_tree): Add stmt_cost_vec parameter; add stmt_cost_vec
> 	parameter to vect_get_and_check_slp_defs, vect_model_load_cost, and
> 	recursive vect_build_slp_tree calls; prevent calculating cost more
> 	than once for loads; call record_stmt_cost.
> 	(vect_analyze_slp_instance): Allocate stmt_cost_vec and save it with
> 	the instance; free it on premature exit; add stmt_cost_vec parameter
> 	to vect_build_slp_tree call.
> 	(vect_bb_vectorization_profitable_p): Call add_stmt_cost for each
> 	statement recorded with an SLP instance; call finish_cost and verify
> 	its result matches vec_inside_cost.
> 	(vect_slp_analyze_bb_1): Call init_cost and destroy_cost_data.
> 	(vect_update_slp_costs_according_to_vf): Record statement costs from
> 	SLP instances, multiplying by the appropriate number of copies.
> 
> 
> Index: gcc/doc/tm.texi
> ===================================================================
> --- gcc/doc/tm.texi	(revision 189081)
> +++ gcc/doc/tm.texi	(working copy)
> @@ -5792,6 +5792,31 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} {void *} TARGET_VECTORIZE_INIT_COST (struct _loop_vec_info *@var{})
> +This hook should initialize target-specific data structures in preparation
> +for modeling the costs of vectorizing a loop or basic block.  The default
> +allocates an integer for accumulating a single cost.
> +@end deftypefn

Do we really want to expose struct _loop_vec_info (and its entries) to
the targets?  I'd have used struct loop * (and NULL if we are vectorizing
a basic-block).  We could also simply pass a flag to tell whether we
are doing loop or basic-block vectorization and leave passing the actual
loop/basic-block until we see the need for it.
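
Something like this untested sketch, i.e. a flag only (hook and
parameter names are illustrative):

  /* Default implementation if only a loop-vs-basic-block flag is
     passed; the default accumulator does not need it.  */
  void *
  default_init_cost (bool loop_p ATTRIBUTE_UNUSED)
  {
    unsigned int *cost = XNEW (unsigned int);
    *cost = 0;
    return cost;
  }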

> +@deftypefn {Target Hook} void TARGET_VECTORIZE_ADD_STMT_COST (void *@var{}, @var{int}, enum @var{vect_cost_for_stmt}, struct _stmt_vec_info *@var{}, @var{int})
> +This hook should update target-specific data structures in response to
> +adding a given number of copies of the given kind of statement to the
> +body of a loop or basic block.  The default adds the builtin vectorizer
> +cost for the copies of the statement to the accumulator.
> +@end deftypefn

Please mention the actual parameter when referring to it, thus
"to adding @var{n} number of copies ..." and give the parameters
names.

> +@deftypefn {Target Hook} int TARGET_VECTORIZE_FINISH_COST (void *@var{})
> +This hook should complete calculations of the cost of vectorizing a loop 
> +or basic block, and return that cost as an integer.  It should also release
> +any target-specific data structures allocated by TARGET_VECTORIZE_INIT_COST.
> +The default returns the value of the accumulator and releases it.
> +@end deftypefn

Should return unsigned int I think.

> +@deftypefn {Target Hook} void TARGET_VECTORIZE_DESTROY_COST_DATA (void *@var{})
> +This hook should release any target-specific data structures allocated by
> +TARGET_VECTORIZE_INIT_COST.  The default releases the accumulator.
> +@end deftypefn
> +

Any reason this is not unified into one?  finish also destroys the data,
so are you merely saving time in the not vectorized case?
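
If the two were unified, the not-vectorized paths would simply discard
the result, e.g. (sketch):

  /* Bailing out - compute and ignore the final cost, which would
     also release the target-specific data.  */
  (void) finish_cost (target_cost_data);
  return false;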

>  @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_TM_LOAD (tree)
>  This hook should return the built-in decl needed to load a vector of the given type within a transaction.
>  @end deftypefn
> Index: gcc/doc/tm.texi.in
> ===================================================================
> --- gcc/doc/tm.texi.in	(revision 189081)
> +++ gcc/doc/tm.texi.in	(working copy)
> @@ -5724,6 +5724,31 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>  
> +@hook TARGET_VECTORIZE_INIT_COST
> +This hook should initialize target-specific data structures in preparation
> +for modeling the costs of vectorizing a loop or basic block.  The default
> +allocates an integer for accumulating a single cost.
> +@end deftypefn

Please move documentation of new hooks to target.def.

> +@hook TARGET_VECTORIZE_ADD_STMT_COST
> +This hook should update target-specific data structures in response to
> +adding a given number of copies of the given kind of statement to the
> +body of a loop or basic block.  The default adds the builtin vectorizer
> +cost for the copies of the statement to the accumulator.
> +@end deftypefn
> +
> +@hook TARGET_VECTORIZE_FINISH_COST
> +This hook should complete calculations of the cost of vectorizing a loop 
> +or basic block, and return that cost as an integer.  It should also release
> +any target-specific data structures allocated by TARGET_VECTORIZE_INIT_COST.
> +The default returns the value of the accumulator and releases it.
> +@end deftypefn
> +
> +@hook TARGET_VECTORIZE_DESTROY_COST_DATA
> +This hook should release any target-specific data structures allocated by
> +TARGET_VECTORIZE_INIT_COST.  The default releases the accumulator.
> +@end deftypefn
> +
>  @hook TARGET_VECTORIZE_BUILTIN_TM_LOAD
>  
>  @hook TARGET_VECTORIZE_BUILTIN_TM_STORE
> Index: gcc/targhooks.c
> ===================================================================
> --- gcc/targhooks.c	(revision 189081)
> +++ gcc/targhooks.c	(working copy)
> @@ -996,6 +996,61 @@ default_autovectorize_vector_sizes (void)
>    return 0;
>  }
>  
> +/* By default, the cost model just accumulates the inside_loop costs for
> +   a vectorized loop or block.  So allocate an unsigned int, set it to
> +   zero, and return its address.  */
> +
> +void *
> +default_init_cost (struct _loop_vec_info *loop_vinfo ATTRIBUTE_UNUSED)
> +{
> +  int *cost = XNEW (int);

I'd use an unsigned type.

> +  *cost = 0;
> +  return cost;
> +}
> +
> +/* By default, the cost model looks up the cost of the given statement
> +   kind and mode, multiplies it by the occurrence count, and accumulates
> +   it into the cost.  */
> +
> +void
> +default_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
> +		       struct _stmt_vec_info *stmt_info, int misalign)
> +{
> +  int *cost = (int *) data;
> +  if (flag_vect_cost_model)
> +    {
> +      tree vectype = stmt_vectype (stmt_info);
> +      int stmt_cost = default_builtin_vectorization_cost (kind, vectype,
> +							  misalign);
> +      /* Statements in an inner loop relative to the loop being
> +	 vectorized are weighted more heavily.  The value here is
> +	 arbitrary and could potentially be improved with analysis.  */
> +      if (stmt_in_inner_loop_p (stmt_info))
> +	count *= 50;  /* FIXME.  */
> +
> +      *cost += count * stmt_cost;
> +    }
> +}
> +
> +/* By default, the cost model just returns the accumulated
> +   inside_loop cost.  */
> +
> +int
> +default_finish_cost (void *data)
> +{
> +  int retval = *((int *) data);
> +  free (data);
> +  return retval;
> +}
> +
> +/* Free the cost data.  */
> +
> +void
> +default_destroy_cost_data (void *data)
> +{
> +  free (data);
> +}
> +
>  /* Determine whether or not a pointer mode is valid. Assume defaults
>     of ptr_mode or Pmode - can be overridden.  */
>  bool
> Index: gcc/targhooks.h
> ===================================================================
> --- gcc/targhooks.h	(revision 189081)
> +++ gcc/targhooks.h	(working copy)
> @@ -90,6 +90,11 @@ default_builtin_support_vector_misalignment (enum
>  					     int, bool);
>  extern enum machine_mode default_preferred_simd_mode (enum machine_mode mode);
>  extern unsigned int default_autovectorize_vector_sizes (void);
> +extern void *default_init_cost (struct _loop_vec_info *);
> +extern void default_add_stmt_cost (void *, int, enum vect_cost_for_stmt,
> +				   struct _stmt_vec_info *, int);
> +extern int default_finish_cost (void *);
> +extern void default_destroy_cost_data (void *);
>  
>  /* These are here, and not in hooks.[ch], because not all users of
>     hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
> Index: gcc/target.def
> ===================================================================
> --- gcc/target.def	(revision 189081)
> +++ gcc/target.def	(working copy)
> @@ -1063,6 +1063,41 @@ DEFHOOK
>   (const_tree mem_vectype, const_tree index_type, int scale),
>   NULL)
>  
> +/* Target function to initialize the cost model for a loop or block.  */
> +DEFHOOK
> +(init_cost,
> + "",
> + void *,
> + (struct _loop_vec_info *),
> + default_init_cost)
> +
> +/* Target function to record N statements of the given kind using the
> +   given vector type within the cost model data for the current loop
> +   or block.  */
> +DEFHOOK
> +(add_stmt_cost,
> + "",
> + void,
> + (void *, int, enum vect_cost_for_stmt, struct _stmt_vec_info *, int),
> + default_add_stmt_cost)
> +
> +/* Target function to calculate the total cost of the current vectorized
> +   loop or block.  */
> +DEFHOOK
> +(finish_cost,
> + "",
> + int,
> + (void *),
> + default_finish_cost)
> +
> +/* Function to delete target-specific cost modeling data.  */
> +DEFHOOK
> +(destroy_cost_data,
> + "",
> + void,
> + (void *),
> + default_destroy_cost_data)
> +
>  HOOK_VECTOR_END (vectorize)
>  
>  #undef HOOK_PREFIX
> Index: gcc/target.h
> ===================================================================
> --- gcc/target.h	(revision 189081)
> +++ gcc/target.h	(working copy)
> @@ -120,6 +120,14 @@ struct loop;
>  /* This is defined in tree-ssa-alias.h.  */
>  struct ao_ref_s;
>  
> +/* These are defined in tree-vectorizer.h.  */
> +struct _loop_vec_info;
> +struct _stmt_vec_info;
> +
> +/* These are defined in tree-vect-stmts.c.  */
> +extern tree stmt_vectype (struct _stmt_vec_info *);
> +extern bool stmt_in_inner_loop_p (struct _stmt_vec_info *);
> +
>  /* Assembler instructions for creating various kinds of integer object.  */
>  
>  struct asm_int_op
> Index: gcc/tree-vectorizer.c
> ===================================================================
> --- gcc/tree-vectorizer.c	(revision 189081)
> +++ gcc/tree-vectorizer.c	(working copy)
> @@ -82,6 +82,8 @@ LOC vect_location;
>  /* Vector mapping GIMPLE stmt to stmt_vec_info. */
>  VEC(vec_void_p,heap) *stmt_vec_info_vec;
>  
> +/* Opaque pointer to target-specific cost model data.  */
> +void *target_cost_data;

Put that into _loop_vec_info / _bb_vec_info?
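
I.e. something like this (untested sketch, field and accessor names
just for illustration):

  /* In struct _loop_vec_info (and similarly in _bb_vec_info):  */
  /* Data used by the target cost model.  */
  void *target_cost_data;

  #define LOOP_VINFO_TARGET_COST_DATA(L)  (L)->target_cost_data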

>  
>  
>  /* Function vect_set_dump_settings.
> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h	(revision 189081)
> +++ gcc/tree-vectorizer.h	(working copy)
> @@ -71,6 +71,35 @@ enum vect_def_type {
>                                     || ((D) == vect_double_reduction_def) \
>                                     || ((D) == vect_nested_cycle))
>  
> +/* In tree-vectorizer.c.  */
> +extern void *target_cost_data;
> +
> +/* Structure to encapsulate information about a group of like
> +   instructions to be presented to the target cost model.  */
> +typedef struct _stmt_info_for_cost {
> +  int count;
> +  enum vect_cost_for_stmt kind;
> +  gimple stmt;
> +  int misalign;
> +} stmt_info_for_cost;
> +
> +DEF_VEC_O (stmt_info_for_cost);
> +DEF_VEC_ALLOC_O (stmt_info_for_cost, heap);
> +
> +typedef VEC(stmt_info_for_cost, heap) *stmt_vector_for_cost;
> +
> +static inline void
> +add_stmt_info_to_vec (stmt_vector_for_cost *stmt_cost_vec, int count,
> +		      enum vect_cost_for_stmt kind, gimple stmt, int misalign)
> +{
> +  stmt_info_for_cost si;
> +  si.count = count;
> +  si.kind = kind;
> +  si.stmt = stmt;
> +  si.misalign = misalign;
> +  VEC_safe_push (stmt_info_for_cost, heap, *stmt_cost_vec, &si);
> +}
> +
>  /************************************************************************
>    SLP
>   ************************************************************************/
> @@ -122,6 +151,9 @@ typedef struct _slp_instance {
>      int inside_of_loop;      /* Statements generated inside loop.  */
>    } cost;
>  
> +  /* Another view of inside costs, which will eventually replace the above.  */
> +  stmt_vector_for_cost stmt_cost_vec;
> +
>    /* Loads permutation relatively to the stores, NULL if there is no
>       permutation.  */
>    VEC (int, heap) *load_permutation;
> @@ -143,6 +175,7 @@ DEF_VEC_ALLOC_P(slp_instance, heap);
>  #define SLP_INSTANCE_UNROLLING_FACTOR(S)         (S)->unrolling_factor
>  #define SLP_INSTANCE_OUTSIDE_OF_LOOP_COST(S)     (S)->cost.outside_of_loop
>  #define SLP_INSTANCE_INSIDE_OF_LOOP_COST(S)      (S)->cost.inside_of_loop
> +#define SLP_INSTANCE_STMT_COST_VEC(S)            (S)->stmt_cost_vec
>  #define SLP_INSTANCE_LOAD_PERMUTATION(S)         (S)->load_permutation
>  #define SLP_INSTANCE_LOADS(S)                    (S)->loads
>  #define SLP_INSTANCE_FIRST_LOAD_STMT(S)          (S)->first_load
> @@ -186,6 +219,7 @@ typedef struct _vect_peel_extended_info
>    struct _vect_peel_info peel_info;
>    unsigned int inside_cost;
>    unsigned int outside_cost;
> +  stmt_vector_for_cost stmt_cost_vec;
>  } *vect_peel_extended_info;
>  
>  /*-----------------------------------------------------------------*/
> @@ -782,6 +816,55 @@ int vect_get_stmt_cost (enum vect_cost_for_stmt ty
>                                                         dummy_type, dummy);
>  }
>  
> +/* Alias targetm.vectorize.init_cost.  */
> +
> +static inline void *
> +init_cost (struct _loop_vec_info *loop_vinfo)
> +{
> +  return targetm.vectorize.init_cost (loop_vinfo);
> +}
> +
> +/* Alias targetm.vectorize.add_stmt_cost.  */
> +
> +static inline void
> +add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
> +	       stmt_vec_info stmt_info, int misalign)
> +{
> +  targetm.vectorize.add_stmt_cost (data, count, kind, stmt_info, misalign);
> +}
> +
> +/* Alias targetm.vectorize.finish_cost.  */
> +
> +static inline int
> +finish_cost (void *data)
> +{
> +  return targetm.vectorize.finish_cost (data);
> +}
> +
> +/* Alias targetm.vectorize.destroy_cost_data.  */
> +
> +static inline void
> +destroy_cost_data (void *data)
> +{
> +  targetm.vectorize.destroy_cost_data (data);
> +}
> +
> +/* Record the cost of a statement, either by directly informing the 
> +   target model or by saving it in a vector for later processing.  */
> +
> +static inline void
> +record_stmt_cost (stmt_vector_for_cost *stmt_cost_vec, int count,
> +		  enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
> +		  int misalign)
> +{
> +  if (stmt_cost_vec)
> +    add_stmt_info_to_vec (stmt_cost_vec, count, kind,
> +			  STMT_VINFO_STMT (stmt_info), misalign);
> +  else
> +    add_stmt_cost (target_cost_data, count, kind, stmt_info, misalign);
> +}
> +
> +
>  /*-----------------------------------------------------------------*/
>  /* Info on data references alignment.                              */
>  /*-----------------------------------------------------------------*/
> @@ -849,10 +932,12 @@ extern stmt_vec_info new_stmt_vec_info (gimple stm
>  extern void free_stmt_vec_info (gimple stmt);
>  extern tree vectorizable_function (gimple, tree, tree);
>  extern void vect_model_simple_cost (stmt_vec_info, int, enum vect_def_type *,
> -                                    slp_tree);
> +                                    slp_tree, stmt_vector_for_cost *);
>  extern void vect_model_store_cost (stmt_vec_info, int, bool,
> -				   enum vect_def_type, slp_tree);
> -extern void vect_model_load_cost (stmt_vec_info, int, bool, slp_tree);
> +				   enum vect_def_type, slp_tree,
> +				   stmt_vector_for_cost *);
> +extern void vect_model_load_cost (stmt_vec_info, int, bool, slp_tree,
> +				  stmt_vector_for_cost *);
>  extern void vect_finish_stmt_generation (gimple, gimple,
>                                           gimple_stmt_iterator *);
>  extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
> @@ -867,8 +952,10 @@ extern bool vect_analyze_stmt (gimple, bool *, slp
>  extern bool vectorizable_condition (gimple, gimple_stmt_iterator *, gimple *,
>                                      tree, int, slp_tree);
>  extern void vect_get_load_cost (struct data_reference *, int, bool,
> -                                unsigned int *, unsigned int *);
> -extern void vect_get_store_cost (struct data_reference *, int, unsigned int *);
> +				unsigned int *, unsigned int *,
> +				stmt_vector_for_cost *);
> +extern void vect_get_store_cost (struct data_reference *, int,
> +				 unsigned int *, stmt_vector_for_cost *);
>  extern bool vect_supportable_shift (enum tree_code, tree);
>  extern void vect_get_vec_defs (tree, tree, gimple, VEC (tree, heap) **,
>  			       VEC (tree, heap) **, slp_tree, int);
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c	(revision 189081)
> +++ gcc/tree-vect-loop.c	(working copy)
> @@ -1206,7 +1206,8 @@ vect_analyze_loop_form (struct loop *loop)
>     Scan the loop stmts and make sure they are all vectorizable.  */
>  
>  static bool
> -vect_analyze_loop_operations (loop_vec_info loop_vinfo, bool slp)
> +vect_analyze_loop_operations (loop_vec_info loop_vinfo, bool slp,
> +			      bool *cost_data_released)
>  {
>    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>    basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
> @@ -1362,7 +1363,7 @@ static bool
>                             "not vectorized: relevant phi not supported: ");
>                    print_gimple_stmt (vect_dump, phi, 0, TDF_SLIM);
>                  }
> -              return false;
> +	      return false;
>              }
>          }
>  
> @@ -1417,6 +1418,7 @@ static bool
>  
>    min_profitable_iters = vect_estimate_min_profitable_iters (loop_vinfo);
>    LOOP_VINFO_COST_MODEL_MIN_ITERS (loop_vinfo) = min_profitable_iters;
> +  *cost_data_released = true;

I'd simply allocate / destroy cost data along with the _bb_vec_info
or _loop_vec_info structs.  Thus, I suppose finish_cost should
not release the data but only destroy_cost_data would.
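
I.e. something like the following (sketch, also making the return
type unsigned as suggested above):

  unsigned int
  default_finish_cost (void *data)
  {
    /* Report the accumulated cost; releasing DATA is left to
       destroy_cost_data when the *_vec_info is destroyed.  */
    return *((unsigned int *) data);
  }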

>  
>    if (min_profitable_iters < 0)
>      {
> @@ -1490,6 +1492,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo)
>    bool ok, slp = false;
>    int max_vf = MAX_VECTORIZATION_FACTOR;
>    int min_vf = 2;
> +  bool cost_data_released = false;
>  
>    /* Find all data references in the loop (which correspond to vdefs/vuses)
>       and analyze their evolution in the loop.  Also adjust the minimal
> @@ -1585,6 +1588,9 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo)
>        return false;
>      }
>  
> +  /* Initialize the target cost model for the loop body.  */
> +  target_cost_data = init_cost (loop_vinfo);
> +
>    /* This pass will decide on using loop versioning and/or loop peeling in
>       order to enhance the alignment of data references in the loop.  */
>  
> @@ -1593,6 +1599,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo)
>      {
>        if (vect_print_dump_info (REPORT_DETAILS))
>          fprintf (vect_dump, "bad data alignment.");
> +      destroy_cost_data (target_cost_data);
>        return false;
>      }
>  
> @@ -1607,16 +1614,21 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo)
>        vect_detect_hybrid_slp (loop_vinfo);
>      }
>    else
> -    return false;
> +    {
> +      destroy_cost_data (target_cost_data);
> +      return false;
> +    }
>  
>    /* Scan all the operations in the loop and make sure they are
>       vectorizable.  */
>  
> -  ok = vect_analyze_loop_operations (loop_vinfo, slp);
> +  ok = vect_analyze_loop_operations (loop_vinfo, slp, &cost_data_released);
>    if (!ok)
>      {
>        if (vect_print_dump_info (REPORT_DETAILS))
>  	fprintf (vect_dump, "bad operation or unsupported loop bound.");
> +      if (!cost_data_released)
> +	destroy_cost_data (target_cost_data);
>        return false;
>      }
>  
> @@ -2490,6 +2502,7 @@ vect_estimate_min_profitable_iters (loop_vec_info
>    int peel_iters_epilogue;
>    int vec_inside_cost = 0;
>    int vec_outside_cost = 0;
> +  int target_model_inside_cost;
>    int scalar_single_iter_cost = 0;
>    int scalar_outside_cost = 0;
>    int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> @@ -2730,6 +2743,13 @@ vect_estimate_min_profitable_iters (loop_vec_info
>        vec_inside_cost += SLP_INSTANCE_INSIDE_OF_LOOP_COST (instance);
>      }
>  
> +  /* Complete the target-specific cost calculation for the inside-of-loop
> +     costs.  */
> +  target_model_inside_cost = finish_cost (target_cost_data);
> +  
> +  /* For now, the new target model cost should match the accumulated cost.  */
> +  gcc_assert (vec_inside_cost == target_model_inside_cost);
> +
>    /* Calculate number of iterations required to make the vector version
>       profitable, relative to the loop bodies only.  The following condition
>       must hold true:
> @@ -2830,6 +2850,7 @@ vect_model_reduction_cost (stmt_vec_info stmt_info
>    /* Cost of reduction op inside loop.  */
>    STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info) 
>      += ncopies * vect_get_stmt_cost (vector_stmt);
> +  add_stmt_cost (target_cost_data, ncopies, vector_stmt, stmt_info, 0);
>  
>    stmt = STMT_VINFO_STMT (stmt_info);
>  
> @@ -2932,6 +2953,11 @@ vect_model_induction_cost (stmt_vec_info stmt_info
>    /* loop cost for vec_loop.  */
>    STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info) 
>      = ncopies * vect_get_stmt_cost (vector_stmt);
> +  /*  This is not currently added to the cost in 
> +      vect_estimate_min_profitable_iters, which is almost certainly a bug.
> +  add_stmt_cost (target_cost_data, ncopies, vector_stmt, stmt_info, 0);
> +  */
> +
>    /* prologue cost for vec_init and vec_step.  */
>    STMT_VINFO_OUTSIDE_OF_LOOP_COST (stmt_info)  
>      = 2 * vect_get_stmt_cost (scalar_to_vec);
> Index: gcc/tree-vect-data-refs.c
> ===================================================================
> --- gcc/tree-vect-data-refs.c	(revision 189081)
> +++ gcc/tree-vect-data-refs.c	(working copy)
> @@ -1205,7 +1205,7 @@ vector_alignment_reachable_p (struct data_referenc
>  
>  /* Calculate the cost of the memory access represented by DR.  */
>  
> -static void
> +static stmt_vector_for_cost
>  vect_get_data_access_cost (struct data_reference *dr,
>                             unsigned int *inside_cost,
>                             unsigned int *outside_cost)
> @@ -1216,15 +1216,19 @@ vect_get_data_access_cost (struct data_reference *
>    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>    int vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    int ncopies = vf / nunits;
> +  stmt_vector_for_cost stmt_cost_vec = VEC_alloc (stmt_info_for_cost, heap, 2);
>  
>    if (DR_IS_READ (dr))
> -    vect_get_load_cost (dr, ncopies, true, inside_cost, outside_cost);
> +    vect_get_load_cost (dr, ncopies, true, inside_cost,
> +			outside_cost, &stmt_cost_vec);
>    else
> -    vect_get_store_cost (dr, ncopies, inside_cost);
> +    vect_get_store_cost (dr, ncopies, inside_cost, &stmt_cost_vec);
>  
>    if (vect_print_dump_info (REPORT_COST))
>      fprintf (vect_dump, "vect_get_data_access_cost: inside_cost = %d, "
>               "outside_cost = %d.", *inside_cost, *outside_cost);
> +
> +  return stmt_cost_vec;
>  }
>  
>  
> @@ -1317,6 +1321,7 @@ vect_peeling_hash_get_lowest_cost (void **slot, vo
>    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>    VEC (data_reference_p, heap) *datarefs = LOOP_VINFO_DATAREFS (loop_vinfo);
>    struct data_reference *dr;
> +  stmt_vector_for_cost stmt_cost_vec = NULL;
>  
>    FOR_EACH_VEC_ELT (data_reference_p, datarefs, i, dr)
>      {
> @@ -1330,7 +1335,8 @@ vect_peeling_hash_get_lowest_cost (void **slot, vo
>  
>        save_misalignment = DR_MISALIGNMENT (dr);
>        vect_update_misalignment_for_peel (dr, elem->dr, elem->npeel);
> -      vect_get_data_access_cost (dr, &inside_cost, &outside_cost);
> +      stmt_cost_vec = vect_get_data_access_cost (dr, &inside_cost,
> +						 &outside_cost);
>        SET_DR_MISALIGNMENT (dr, save_misalignment);
>      }
>  
> @@ -1342,6 +1348,7 @@ vect_peeling_hash_get_lowest_cost (void **slot, vo
>      {
>        min->inside_cost = inside_cost;
>        min->outside_cost = outside_cost;
> +      min->stmt_cost_vec = stmt_cost_vec;
>        min->peel_info.dr = elem->dr;
>        min->peel_info.npeel = elem->npeel;
>      }
> @@ -1356,11 +1363,13 @@ vect_peeling_hash_get_lowest_cost (void **slot, vo
>  
>  static struct data_reference *
>  vect_peeling_hash_choose_best_peeling (loop_vec_info loop_vinfo,
> -                                       unsigned int *npeel)
> +                                       unsigned int *npeel,
> +				       stmt_vector_for_cost *stmt_cost_vec)
>  {
>     struct _vect_peel_extended_info res;
>  
>     res.peel_info.dr = NULL;
> +   res.stmt_cost_vec = NULL;
>  
>     if (flag_vect_cost_model)
>       {
> @@ -1377,6 +1386,7 @@ vect_peeling_hash_choose_best_peeling (loop_vec_in
>       }
>  
>     *npeel = res.peel_info.npeel;
> +   *stmt_cost_vec = res.stmt_cost_vec;
>     return res.peel_info.dr;
>  }
>  
> @@ -1493,6 +1503,7 @@ vect_enhance_data_refs_alignment (loop_vec_info lo
>    unsigned possible_npeel_number = 1;
>    tree vectype;
>    unsigned int nelements, mis, same_align_drs_max = 0;
> +  stmt_vector_for_cost stmt_cost_vec = NULL;
>  
>    if (vect_print_dump_info (REPORT_DETAILS))
>      fprintf (vect_dump, "=== vect_enhance_data_refs_alignment ===");
> @@ -1697,10 +1708,10 @@ vect_enhance_data_refs_alignment (loop_vec_info lo
>            unsigned int load_inside_penalty = 0, load_outside_penalty = 0;
>            unsigned int store_inside_penalty = 0, store_outside_penalty = 0;
>  
> -          vect_get_data_access_cost (dr0, &load_inside_cost,
> -                                     &load_outside_cost);
> -          vect_get_data_access_cost (first_store, &store_inside_cost,
> -                                     &store_outside_cost);
> +          (void) vect_get_data_access_cost (dr0, &load_inside_cost,
> +					    &load_outside_cost);
> +          (void) vect_get_data_access_cost (first_store, &store_inside_cost,
> +					    &store_outside_cost);
>  
>            /* Calculate the penalty for leaving FIRST_STORE unaligned (by
>               aligning the load DR0).  */
> @@ -1764,7 +1775,8 @@ vect_enhance_data_refs_alignment (loop_vec_info lo
>        gcc_assert (!all_misalignments_unknown);
>  
>        /* Choose the best peeling from the hash table.  */
> -      dr0 = vect_peeling_hash_choose_best_peeling (loop_vinfo, &npeel);
> +      dr0 = vect_peeling_hash_choose_best_peeling (loop_vinfo, &npeel,
> +						   &stmt_cost_vec);
>        if (!dr0 || !npeel)
>          do_peeling = false;
>      }
> @@ -1848,6 +1860,10 @@ vect_enhance_data_refs_alignment (loop_vec_info lo
>  
>        if (do_peeling)
>          {
> +	  /*
> +	  stmt_info_for_cost *si;
> +	  */
> +
>            /* (1.2) Update the DR_MISALIGNMENT of each data reference DR_i.
>               If the misalignment of DR_i is identical to that of dr0 then set
>               DR_MISALIGNMENT (DR_i) to zero.  If the misalignment of DR_i and
> @@ -1871,6 +1887,21 @@ vect_enhance_data_refs_alignment (loop_vec_info lo
>            if (vect_print_dump_info (REPORT_DETAILS))
>              fprintf (vect_dump, "Peeling for alignment will be applied.");
>  
> +	  /* We've delayed passing the inside-loop peeling costs to the
> +	     target cost model until we were sure peeling would happen.
> +	     Do so now.  */
> +	  if (stmt_cost_vec)
> +	    {
> +	  /*  Peeling costs are apparently not currently counted in the
> +	      vectorization decision, which is almost certainly a bug.
> +
> +	      FOR_EACH_VEC_ELT (stmt_info_for_cost, stmt_cost_vec, i, si)
> +		add_stmt_cost (target_cost_data, si->count, si->kind,
> +			       vinfo_for_stmt (si->stmt), si->misalign);
> +	  */
> +	      VEC_free (stmt_info_for_cost, heap, stmt_cost_vec);
> +	    }
> +
>  	  stat = vect_verify_datarefs_alignment (loop_vinfo, NULL);
>  	  gcc_assert (stat);
>            return stat;
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c	(revision 189081)
> +++ gcc/tree-vect-stmts.c	(working copy)
> @@ -41,6 +41,32 @@ along with GCC; see the file COPYING3.  If not see
>  #include "langhooks.h"
>  
>  
> +/* Return the vectorized type for the given statement.  */
> +
> +tree
> +stmt_vectype (struct _stmt_vec_info *stmt_info)
> +{
> +  return STMT_VINFO_VECTYPE (stmt_info);
> +}
> +
> +/* Return TRUE iff the given statement is in an inner loop relative to
> +   the loop being vectorized.  */
> +bool
> +stmt_in_inner_loop_p (struct _stmt_vec_info *stmt_info)
> +{
> +  gimple stmt = STMT_VINFO_STMT (stmt_info);
> +  basic_block bb = gimple_bb (stmt);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> +  struct loop* loop;
> +
> +  if (!loop_vinfo)
> +    return false;
> +
> +  loop = LOOP_VINFO_LOOP (loop_vinfo);
> +
> +  return (bb->loop_father == loop->inner);
> +}
> +
>  /* Return a variable of type ELEM_TYPE[NELEMS].  */
>  
>  static tree
> @@ -735,7 +761,8 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info lo
>  
>  void
>  vect_model_simple_cost (stmt_vec_info stmt_info, int ncopies,
> -			enum vect_def_type *dt, slp_tree slp_node)
> +			enum vect_def_type *dt, slp_tree slp_node,
> +			stmt_vector_for_cost *stmt_cost_vec)
>  {
>    int i;
>    int inside_cost = 0, outside_cost = 0;
> @@ -760,6 +787,9 @@ vect_model_simple_cost (stmt_vec_info stmt_info, i
>    /* Set the costs either in STMT_INFO or SLP_NODE (if exists).  */
>    stmt_vinfo_set_inside_of_loop_cost (stmt_info, slp_node, inside_cost);
>    stmt_vinfo_set_outside_of_loop_cost (stmt_info, slp_node, outside_cost);
> +
> +  /* Pass the inside-of-loop statements to the target-specific cost model.  */
> +  record_stmt_cost (stmt_cost_vec, ncopies, vector_stmt, stmt_info, 0);
>  }
>  
>  
> @@ -785,6 +815,8 @@ vect_model_promotion_demotion_cost (stmt_vec_info
>        tmp = (STMT_VINFO_TYPE (stmt_info) == type_promotion_vec_info_type) ?
>  	(i + 1) : i;
>        inside_cost += vect_pow2 (tmp) * single_stmt_cost;
> +      add_stmt_cost (target_cost_data, vect_pow2 (tmp), vec_promote_demote,
> +		     stmt_info, 0);
>      }
>  
>    /* FORNOW: Assuming maximum 2 args per stmts.  */
> @@ -829,7 +861,7 @@ vect_cost_group_size (stmt_vec_info stmt_info)
>  void
>  vect_model_store_cost (stmt_vec_info stmt_info, int ncopies,
>  		       bool store_lanes_p, enum vect_def_type dt,
> -		       slp_tree slp_node)
> +		       slp_tree slp_node, stmt_vector_for_cost *stmt_cost_vec)
>  {
>    int group_size;
>    unsigned int inside_cost = 0, outside_cost = 0;
> @@ -873,8 +905,10 @@ vect_model_store_cost (stmt_vec_info stmt_info, in
>    if (!store_lanes_p && group_size > 1)
>      {
>        /* Uses a high and low interleave operation for each needed permute.  */
> -      inside_cost = ncopies * exact_log2(group_size) * group_size
> -        * vect_get_stmt_cost (vec_perm);
> +      
> +      int nstmts = ncopies * exact_log2 (group_size) * group_size;
> +      inside_cost = nstmts * vect_get_stmt_cost (vec_perm);
> +      record_stmt_cost (stmt_cost_vec, nstmts, vec_perm, stmt_info, 0);
>  
>        if (vect_print_dump_info (REPORT_COST))
>          fprintf (vect_dump, "vect_model_store_cost: strided group_size = %d .",
> @@ -882,7 +916,7 @@ vect_model_store_cost (stmt_vec_info stmt_info, in
>      }
>  
>    /* Costs of the stores.  */
> -  vect_get_store_cost (first_dr, ncopies, &inside_cost);
> +  vect_get_store_cost (first_dr, ncopies, &inside_cost, stmt_cost_vec);
>  
>    if (vect_print_dump_info (REPORT_COST))
>      fprintf (vect_dump, "vect_model_store_cost: inside_cost = %d, "
> @@ -897,15 +931,19 @@ vect_model_store_cost (stmt_vec_info stmt_info, in
>  /* Calculate cost of DR's memory access.  */
>  void
>  vect_get_store_cost (struct data_reference *dr, int ncopies,
> -                     unsigned int *inside_cost)
> +		     unsigned int *inside_cost,
> +		     stmt_vector_for_cost *stmt_cost_vec)
>  {
>    int alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
> +  gimple stmt = DR_STMT (dr);
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>  
>    switch (alignment_support_scheme)
>      {
>      case dr_aligned:
>        {
>          *inside_cost += ncopies * vect_get_stmt_cost (vector_store);
> +	record_stmt_cost (stmt_cost_vec, ncopies, vector_store, stmt_info, 0);
>  
>          if (vect_print_dump_info (REPORT_COST))
>            fprintf (vect_dump, "vect_model_store_cost: aligned.");
> @@ -915,14 +953,14 @@ vect_get_store_cost (struct data_reference *dr, in
>  
>      case dr_unaligned_supported:
>        {
> -        gimple stmt = DR_STMT (dr);
> -        stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>          tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>  
>          /* Here, we assign an additional cost for the unaligned store.  */
>          *inside_cost += ncopies
>            * targetm.vectorize.builtin_vectorization_cost (unaligned_store,
>                                   vectype, DR_MISALIGNMENT (dr));
> +	record_stmt_cost (stmt_cost_vec, ncopies, unaligned_store,
> +			  stmt_info, DR_MISALIGNMENT (dr));
>  
>          if (vect_print_dump_info (REPORT_COST))
>            fprintf (vect_dump, "vect_model_store_cost: unaligned supported by "
> @@ -956,7 +994,7 @@ vect_get_store_cost (struct data_reference *dr, in
>  
>  void
>  vect_model_load_cost (stmt_vec_info stmt_info, int ncopies, bool load_lanes_p,
> -		      slp_tree slp_node)
> +		      slp_tree slp_node, stmt_vector_for_cost *stmt_cost_vec)
>  {
>    int group_size;
>    gimple first_stmt;
> @@ -988,8 +1026,9 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>    if (!load_lanes_p && group_size > 1)
>      {
>        /* Uses an even and odd extract operations for each needed permute.  */
> -      inside_cost = ncopies * exact_log2(group_size) * group_size
> -	* vect_get_stmt_cost (vec_perm);
> +      int nstmts = ncopies * exact_log2 (group_size) * group_size;
> +      inside_cost = nstmts * vect_get_stmt_cost (vec_perm);
> +      record_stmt_cost (stmt_cost_vec, nstmts, vec_perm, stmt_info, 0);
>  
>        if (vect_print_dump_info (REPORT_COST))
>          fprintf (vect_dump, "vect_model_load_cost: strided group_size = %d .",
> @@ -1006,12 +1045,16 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>        inside_cost += ncopies
>  	* targetm.vectorize.builtin_vectorization_cost (vec_construct,
>  							vectype, 0);
> +      record_stmt_cost (stmt_cost_vec,
> +			ncopies * TYPE_VECTOR_SUBPARTS (vectype),
> +			scalar_load, stmt_info, 0);
> +      record_stmt_cost (stmt_cost_vec, ncopies, vec_construct, stmt_info, 0);
>      }
>    else
>      vect_get_load_cost (first_dr, ncopies,
>  			((!STMT_VINFO_GROUPED_ACCESS (stmt_info))
>  			 || group_size > 1 || slp_node),
> -			&inside_cost, &outside_cost);
> +			&inside_cost, &outside_cost, stmt_cost_vec);
>  
>    if (vect_print_dump_info (REPORT_COST))
>      fprintf (vect_dump, "vect_model_load_cost: inside_cost = %d, "
> @@ -1026,16 +1069,20 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>  /* Calculate cost of DR's memory access.  */
>  void
>  vect_get_load_cost (struct data_reference *dr, int ncopies,
> -                    bool add_realign_cost, unsigned int *inside_cost,
> -                    unsigned int *outside_cost)
> +		    bool add_realign_cost, unsigned int *inside_cost,
> +		    unsigned int *outside_cost,
> +		    stmt_vector_for_cost *stmt_cost_vec)
>  {
>    int alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
> +  gimple stmt = DR_STMT (dr);
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>  
>    switch (alignment_support_scheme)
>      {
>      case dr_aligned:
>        {
>          *inside_cost += ncopies * vect_get_stmt_cost (vector_load); 
> +	record_stmt_cost (stmt_cost_vec, ncopies, vector_load, stmt_info, 0);
>  
>          if (vect_print_dump_info (REPORT_COST))
>            fprintf (vect_dump, "vect_model_load_cost: aligned.");
> @@ -1044,14 +1091,15 @@ vect_get_load_cost (struct data_reference *dr, int
>        }
>      case dr_unaligned_supported:
>        {
> -        gimple stmt = DR_STMT (dr);
> -        stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>          tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>  
>          /* Here, we assign an additional cost for the unaligned load.  */
>          *inside_cost += ncopies
>            * targetm.vectorize.builtin_vectorization_cost (unaligned_load,
>                                             vectype, DR_MISALIGNMENT (dr));
> +	record_stmt_cost (stmt_cost_vec, ncopies, unaligned_load,
> +			  stmt_info, DR_MISALIGNMENT (dr));
> +
>          if (vect_print_dump_info (REPORT_COST))
>            fprintf (vect_dump, "vect_model_load_cost: unaligned supported by "
>                     "hardware.");
> @@ -1062,12 +1110,18 @@ vect_get_load_cost (struct data_reference *dr, int
>        {
>          *inside_cost += ncopies * (2 * vect_get_stmt_cost (vector_load)
>  				   + vect_get_stmt_cost (vec_perm));
> +	record_stmt_cost (stmt_cost_vec, ncopies * 2, vector_load,
> +			  stmt_info, 0);
> +	record_stmt_cost (stmt_cost_vec, ncopies, vec_perm, stmt_info, 0);
>  
>          /* FIXME: If the misalignment remains fixed across the iterations of
>             the containing loop, the following cost should be added to the
>             outside costs.  */
>          if (targetm.vectorize.builtin_mask_for_load)
> -          *inside_cost += vect_get_stmt_cost (vector_stmt);
> +	  {
> +	    *inside_cost += vect_get_stmt_cost (vector_stmt);
> +	    record_stmt_cost (stmt_cost_vec, 1, vector_stmt, stmt_info, 0);
> +	  }
>  
>          if (vect_print_dump_info (REPORT_COST))
>            fprintf (vect_dump, "vect_model_load_cost: explicit realign");
> @@ -1096,6 +1150,8 @@ vect_get_load_cost (struct data_reference *dr, int
>  
>          *inside_cost += ncopies * (vect_get_stmt_cost (vector_load)
>  				   + vect_get_stmt_cost (vec_perm));
> +	record_stmt_cost (stmt_cost_vec, ncopies, vector_load, stmt_info, 0);
> +	record_stmt_cost (stmt_cost_vec, ncopies, vec_perm, stmt_info, 0);
>  
>          if (vect_print_dump_info (REPORT_COST))
>            fprintf (vect_dump,
> @@ -1719,7 +1775,7 @@ vectorizable_call (gimple stmt, gimple_stmt_iterat
>        STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
>        if (vect_print_dump_info (REPORT_DETAILS))
>          fprintf (vect_dump, "=== vectorizable_call ===");
> -      vect_model_simple_cost (stmt_info, ncopies, dt, NULL);
> +      vect_model_simple_cost (stmt_info, ncopies, dt, NULL, NULL);
>        return true;
>      }
>  
> @@ -2433,7 +2489,7 @@ vectorizable_conversion (gimple stmt, gimple_stmt_
>        if (code == FIX_TRUNC_EXPR || code == FLOAT_EXPR)
>          {
>  	  STMT_VINFO_TYPE (stmt_info) = type_conversion_vec_info_type;
> -	  vect_model_simple_cost (stmt_info, ncopies, dt, NULL);
> +	  vect_model_simple_cost (stmt_info, ncopies, dt, NULL, NULL);
>  	}
>        else if (modifier == NARROW)
>  	{
> @@ -2841,7 +2897,7 @@ vectorizable_assignment (gimple stmt, gimple_stmt_
>        STMT_VINFO_TYPE (stmt_info) = assignment_vec_info_type;
>        if (vect_print_dump_info (REPORT_DETAILS))
>          fprintf (vect_dump, "=== vectorizable_assignment ===");
> -      vect_model_simple_cost (stmt_info, ncopies, dt, NULL);
> +      vect_model_simple_cost (stmt_info, ncopies, dt, NULL, NULL);
>        return true;
>      }
>  
> @@ -3187,7 +3243,7 @@ vectorizable_shift (gimple stmt, gimple_stmt_itera
>        STMT_VINFO_TYPE (stmt_info) = shift_vec_info_type;
>        if (vect_print_dump_info (REPORT_DETAILS))
>          fprintf (vect_dump, "=== vectorizable_shift ===");
> -      vect_model_simple_cost (stmt_info, ncopies, dt, NULL);
> +      vect_model_simple_cost (stmt_info, ncopies, dt, NULL, NULL);
>        return true;
>      }
>  
> @@ -3565,7 +3621,7 @@ vectorizable_operation (gimple stmt, gimple_stmt_i
>        STMT_VINFO_TYPE (stmt_info) = op_vec_info_type;
>        if (vect_print_dump_info (REPORT_DETAILS))
>          fprintf (vect_dump, "=== vectorizable_operation ===");
> -      vect_model_simple_cost (stmt_info, ncopies, dt, NULL);
> +      vect_model_simple_cost (stmt_info, ncopies, dt, NULL, NULL);
>        return true;
>      }
>  
> @@ -3938,7 +3994,7 @@ vectorizable_store (gimple stmt, gimple_stmt_itera
>    if (!vec_stmt) /* transformation not required.  */
>      {
>        STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
> -      vect_model_store_cost (stmt_info, ncopies, store_lanes_p, dt, NULL);
> +      vect_model_store_cost (stmt_info, ncopies, store_lanes_p, dt, NULL, NULL);
>        return true;
>      }
>  
> @@ -4494,7 +4550,7 @@ vectorizable_load (gimple stmt, gimple_stmt_iterat
>    if (!vec_stmt) /* transformation not required.  */
>      {
>        STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
> -      vect_model_load_cost (stmt_info, ncopies, load_lanes_p, NULL);
> +      vect_model_load_cost (stmt_info, ncopies, load_lanes_p, NULL, NULL);
>        return true;
>      }
>  
> Index: gcc/config/spu/spu.c
> ===================================================================
> --- gcc/config/spu/spu.c	(revision 189081)
> +++ gcc/config/spu/spu.c	(working copy)
> @@ -443,6 +443,18 @@ static void spu_setup_incoming_varargs (cumulative
>  #undef TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
>  #define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST spu_builtin_vectorization_cost
>  
> +#undef TARGET_VECTORIZE_INIT_COST
> +#define TARGET_VECTORIZE_INIT_COST spu_init_cost
> +
> +#undef TARGET_VECTORIZE_ADD_STMT_COST
> +#define TARGET_VECTORIZE_ADD_STMT_COST spu_add_stmt_cost
> +
> +#undef TARGET_VECTORIZE_FINISH_COST
> +#define TARGET_VECTORIZE_FINISH_COST spu_finish_cost
> +
> +#undef TARGET_VECTORIZE_DESTROY_COST_DATA
> +#define TARGET_VECTORIZE_DESTROY_COST_DATA spu_destroy_cost_data
> +
>  #undef TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE
>  #define TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE spu_vector_alignment_reachable
>  
> @@ -6947,6 +6959,56 @@ spu_builtin_vectorization_cost (enum vect_cost_for
>      }
>  }
>  
> +/* Implement targetm.vectorize.init_cost.  */
> +
> +void *
> +spu_init_cost (struct _loop_vec_info *loop_vinfo ATTRIBUTE_UNUSED)
> +{
> +  int *cost = XNEW (int);
> +  *cost = 0;
> +  return cost;
> +}
> +
> +/* Implement targetm.vectorize.add_stmt_cost.  */
> +
> +void
> +spu_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
> +		   struct _stmt_vec_info *stmt_info, int misalign)
> +{
> +  int *cost = (int *) data;
> +  if (flag_vect_cost_model)
> +    {
> +      tree vectype = stmt_vectype (stmt_info);
> +      int stmt_cost = spu_builtin_vectorization_cost (kind, vectype, misalign);
> +
> +      /* Statements in an inner loop relative to the loop being
> +	 vectorized are weighted more heavily.  The value here is
> +	 arbitrary and could potentially be improved with analysis.  */
> +      if (stmt_in_inner_loop_p (stmt_info))
> +	count *= 50;  /* FIXME.  */

That's too elaborate a default implementation - doesn't the vectorizer
try to account for this fact?  Does this even trigger?

So looking at the vectype would be the single use of stmt_info
for now, possibly extracting the compute code would be another,
or looking at the pattern.  So I'm not sure we shouldn't simply
pass down the vector mode (targets should care about the mode only)
and the operation code (later, when we split up the hook).
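
I.e. the hook could eventually look something like (rough sketch only,
exact parameters to be decided when we split up the hook):

  void
  default_add_stmt_cost (void *data, int count,
                         enum vect_cost_for_stmt kind,
                         enum machine_mode mode, enum tree_code code,
                         int misalign);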

> +      *cost += count * stmt_cost;
> +    }
> +}
> +
> +/* Implement targetm.vectorize.finish_cost.  */
> +
> +int
> +spu_finish_cost (void *data)
> +{
> +  int retval = *((int *) data);
> +  free (data);
> +  return retval;
> +}
> +
> +/* Implement targetm.vectorize.destroy_cost_data.  */
> +
> +void
> +spu_destroy_cost_data (void *data)
> +{
> +  free (data);
> +}
> +
>  /* Return true iff, data reference of TYPE can reach vector alignment (16)
>     after applying N number of iterations.  This routine does not determine
>     how may iterations are required to reach desired alignment.  */
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 189081)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -40122,6 +40122,56 @@ ix86_autovectorize_vector_sizes (void)
>    return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
>  }
>  
> +/* Implement targetm.vectorize.init_cost.  */
> +
> +void *
> +ix86_init_cost (struct _loop_vec_info *loop_vinfo ATTRIBUTE_UNUSED)
> +{
> +  int *cost = XNEW (int);
> +  *cost = 0;
> +  return cost;
> +}
> +
> +/* Implement targetm.vectorize.add_stmt_cost.  */
> +
> +void
> +ix86_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
> +		    struct _stmt_vec_info *stmt_info, int misalign)
> +{
> +  int *cost = (int *) data;
> +  if (flag_vect_cost_model)
> +    {
> +      tree vectype = stmt_vectype (stmt_info);
> +      int stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> +
> +      /* Statements in an inner loop relative to the loop being
> +	 vectorized are weighted more heavily.  The value here is
> +	 arbitrary and could potentially be improved with analysis.  */
> +      if (stmt_in_inner_loop_p (stmt_info))
> +	count *= 50;  /* FIXME.  */
> +
> +      *cost += count * stmt_cost;
> +    }
> +}
> +
> +/* Implement targetm.vectorize.finish_cost.  */
> +
> +int
> +ix86_finish_cost (void *data)
> +{
> +  int retval = *((int *) data);
> +  free (data);
> +  return retval;
> +}
> +
> +/* Implement targetm.vectorize.destroy_cost_data.  */
> +
> +void
> +ix86_destroy_cost_data (void *data)
> +{
> +  free (data);
> +}
> +
>  /* Validate target specific memory model bits in VAL. */
>  
>  static unsigned HOST_WIDE_INT
> @@ -40432,6 +40482,14 @@ ix86_memmodel_check (unsigned HOST_WIDE_INT val)
>  #undef TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES
>  #define TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES \
>    ix86_autovectorize_vector_sizes
> +#undef TARGET_VECTORIZE_INIT_COST
> +#define TARGET_VECTORIZE_INIT_COST ix86_init_cost
> +#undef TARGET_VECTORIZE_ADD_STMT_COST
> +#define TARGET_VECTORIZE_ADD_STMT_COST ix86_add_stmt_cost
> +#undef TARGET_VECTORIZE_FINISH_COST
> +#define TARGET_VECTORIZE_FINISH_COST ix86_finish_cost
> +#undef TARGET_VECTORIZE_DESTROY_COST_DATA
> +#define TARGET_VECTORIZE_DESTROY_COST_DATA ix86_destroy_cost_data
>  
>  #undef TARGET_SET_CURRENT_FUNCTION
>  #define TARGET_SET_CURRENT_FUNCTION ix86_set_current_function
> Index: gcc/config/rs6000/rs6000.c
> ===================================================================
> --- gcc/config/rs6000/rs6000.c	(revision 189081)
> +++ gcc/config/rs6000/rs6000.c	(working copy)
> @@ -1288,6 +1288,14 @@ static const struct attribute_spec rs6000_attribut
>  #undef TARGET_VECTORIZE_PREFERRED_SIMD_MODE
>  #define TARGET_VECTORIZE_PREFERRED_SIMD_MODE \
>    rs6000_preferred_simd_mode
> +#undef TARGET_VECTORIZE_INIT_COST
> +#define TARGET_VECTORIZE_INIT_COST rs6000_init_cost
> +#undef TARGET_VECTORIZE_ADD_STMT_COST
> +#define TARGET_VECTORIZE_ADD_STMT_COST rs6000_add_stmt_cost
> +#undef TARGET_VECTORIZE_FINISH_COST
> +#define TARGET_VECTORIZE_FINISH_COST rs6000_finish_cost
> +#undef TARGET_VECTORIZE_DESTROY_COST_DATA
> +#define TARGET_VECTORIZE_DESTROY_COST_DATA rs6000_destroy_cost_data
>  
>  #undef TARGET_INIT_BUILTINS
>  #define TARGET_INIT_BUILTINS rs6000_init_builtins
> @@ -3563,6 +3571,56 @@ rs6000_preferred_simd_mode (enum machine_mode mode
>    return word_mode;
>  }
>  
> +/* Implement targetm.vectorize.init_cost.  */
> +
> +void *
> +rs6000_init_cost (struct _loop_vec_info *loop_vinfo ATTRIBUTE_UNUSED)
> +{
> +  int *cost = XNEW (int);
> +  *cost = 0;
> +  return cost;
> +}
> +
> +/* Implement targetm.vectorize.add_stmt_cost.  */
> +
> +void
> +rs6000_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
> +		      struct _stmt_vec_info *stmt_info, int misalign)
> +{
> +  int *cost = (int *) data;
> +  if (flag_vect_cost_model)
> +    {
> +      tree vectype = stmt_vectype (stmt_info);
> +      int stmt_cost = rs6000_builtin_vectorization_cost (kind, vectype,
> +							 misalign);
> +      /* Statements in an inner loop relative to the loop being
> +	 vectorized are weighted more heavily.  The value here is
> +	 arbitrary and could potentially be improved with analysis.  */
> +      if (stmt_in_inner_loop_p (stmt_info))
> +	count *= 50;  /* FIXME.  */
> +
> +      *cost += count * stmt_cost;
> +    }
> +}
> +
> +/* Implement targetm.vectorize.finish_cost.  */
> +
> +int
> +rs6000_finish_cost (void *data)
> +{
> +  int retval = *((int *) data);
> +  free (data);
> +  return retval;
> +}
> +
> +/* Implement targetm.vectorize.destroy_cost_data.  */
> +
> +void
> +rs6000_destroy_cost_data (void *data)
> +{
> +  free (data);
> +}
> +
>  /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a
>     library with vectorized intrinsics.  */
>  
> Index: gcc/tree-vect-slp.c
> ===================================================================
> --- gcc/tree-vect-slp.c	(revision 189081)
> +++ gcc/tree-vect-slp.c	(working copy)
> @@ -94,6 +94,7 @@ vect_free_slp_instance (slp_instance instance)
>    vect_free_slp_tree (SLP_INSTANCE_TREE (instance));
>    VEC_free (int, heap, SLP_INSTANCE_LOAD_PERMUTATION (instance));
>    VEC_free (slp_tree, heap, SLP_INSTANCE_LOADS (instance));
> +  VEC_free (stmt_info_for_cost, heap, SLP_INSTANCE_STMT_COST_VEC (instance));
>  }
>  
>  
> @@ -179,7 +180,8 @@ static bool
>  vect_get_and_check_slp_defs (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo,
>                               slp_tree slp_node, gimple stmt,
>  			     int ncopies_for_cost, bool first,
> -                             VEC (slp_oprnd_info, heap) **oprnds_info)
> +                             VEC (slp_oprnd_info, heap) **oprnds_info,
> +			     stmt_vector_for_cost *stmt_cost_vec)
>  {
>    tree oprnd;
>    unsigned int i, number_of_oprnds;
> @@ -320,7 +322,7 @@ vect_get_and_check_slp_defs (loop_vec_info loop_vi
>  	      if (REFERENCE_CLASS_P (lhs))
>  		/* Store.  */
>                  vect_model_store_cost (stmt_info, ncopies_for_cost, false,
> -                                        dt, slp_node);
> +				       dt, slp_node, stmt_cost_vec);
>  	      else
>  		{
>  		  enum vect_def_type dts[2];
> @@ -329,7 +331,7 @@ vect_get_and_check_slp_defs (loop_vec_info loop_vi
>  		  /* Not memory operation (we don't call this function for
>  		     loads).  */
>  		  vect_model_simple_cost (stmt_info, ncopies_for_cost, dts,
> -					  slp_node);
> +					  slp_node, stmt_cost_vec);
>  		}
>  	    }
>  	}
> @@ -451,7 +453,8 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>                       int ncopies_for_cost, unsigned int *max_nunits,
>                       VEC (int, heap) **load_permutation,
>                       VEC (slp_tree, heap) **loads,
> -                     unsigned int vectorization_factor, bool *loads_permuted)
> +                     unsigned int vectorization_factor, bool *loads_permuted,
> +		     stmt_vector_for_cost *stmt_cost_vec)
>  {
>    unsigned int i;
>    VEC (gimple, heap) *stmts = SLP_TREE_SCALAR_STMTS (*node);
> @@ -470,7 +473,7 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>    HOST_WIDE_INT dummy;
>    bool permutation = false;
>    unsigned int load_place;
> -  gimple first_load, prev_first_load = NULL;
> +  gimple first_load = NULL, prev_first_load = NULL, old_first_load = NULL;
>    VEC (slp_oprnd_info, heap) *oprnds_info;
>    unsigned int nops;
>    slp_oprnd_info oprnd_info;
> @@ -711,7 +714,8 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>  	      /* Store.  */
>  	      if (!vect_get_and_check_slp_defs (loop_vinfo, bb_vinfo, *node,
>  						stmt, ncopies_for_cost,
> -						(i == 0), &oprnds_info))
> +						(i == 0), &oprnds_info,
> +						stmt_cost_vec))
>  		{
>  	  	  vect_free_oprnd_info (&oprnds_info);
>   		  return false;
> @@ -754,6 +758,7 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>                    return false;
>                  }
>  
> +	      old_first_load = first_load;
>                first_load = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt));
>                if (prev_first_load)
>                  {
> @@ -778,7 +783,9 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>                else
>                  prev_first_load = first_load;
>  
> -              if (first_load == stmt)
> +	      /* In some cases a group of loads is just the same load
> +		 repeated N times.  Only analyze its cost once.  */
> +              if (first_load == stmt && old_first_load != first_load)
>                  {
>                    first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
>                    if (vect_supportable_dr_alignment (first_dr, false)
> @@ -797,7 +804,8 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>  
>                    /* Analyze costs (for the first stmt in the group).  */
>                    vect_model_load_cost (vinfo_for_stmt (stmt),
> -                                        ncopies_for_cost, false, *node);
> +                                        ncopies_for_cost, false, *node,
> +					stmt_cost_vec);
>                  }
>  
>                /* Store the place of this load in the interleaving chain.  In
> @@ -871,7 +879,7 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>  	  /* Find the def-stmts.  */
>  	  if (!vect_get_and_check_slp_defs (loop_vinfo, bb_vinfo, *node, stmt,
>  					    ncopies_for_cost, (i == 0),
> -					    &oprnds_info))
> +					    &oprnds_info, stmt_cost_vec))
>  	    {
>  	      vect_free_oprnd_info (&oprnds_info);
>  	      return false;
> @@ -894,6 +902,8 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>            *inside_cost 
>              += targetm.vectorize.builtin_vectorization_cost (vec_perm, NULL, 0) 
>                 * group_size;
> +	  record_stmt_cost (stmt_cost_vec, group_size, vec_perm, 
> +			    vinfo_for_stmt (VEC_index (gimple, stmts, 0)), 0);
>          }
>        else
>          {
> @@ -919,9 +929,10 @@ vect_build_slp_tree (loop_vec_info loop_vinfo, bb_
>        child = vect_create_new_slp_node (oprnd_info->def_stmts);
>        if (!child
>            || !vect_build_slp_tree (loop_vinfo, bb_vinfo, &child, group_size,
> -				inside_cost, outside_cost, ncopies_for_cost,
> -				max_nunits, load_permutation, loads,
> -				vectorization_factor, loads_permuted))
> +				   inside_cost, outside_cost, ncopies_for_cost,
> +				   max_nunits, load_permutation, loads,
> +				   vectorization_factor, loads_permuted,
> +				   stmt_cost_vec))
>          {
>  	  if (child)
>  	    oprnd_info->def_stmts = NULL;
> @@ -1466,6 +1477,7 @@ vect_analyze_slp_instance (loop_vec_info loop_vinf
>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
>    bool loads_permuted = false;
>    VEC (gimple, heap) *scalar_stmts;
> +  stmt_vector_for_cost stmt_cost_vec;
>  
>    if (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
>      {
> @@ -1551,12 +1563,14 @@ vect_analyze_slp_instance (loop_vec_info loop_vinf
>  
>    load_permutation = VEC_alloc (int, heap, group_size * group_size);
>    loads = VEC_alloc (slp_tree, heap, group_size);
> +  stmt_cost_vec = VEC_alloc (stmt_info_for_cost, heap, 10);
>  
>    /* Build the tree for the SLP instance.  */
>    if (vect_build_slp_tree (loop_vinfo, bb_vinfo, &node, group_size,
>                             &inside_cost, &outside_cost, ncopies_for_cost,
>  			   &max_nunits, &load_permutation, &loads,
> -			   vectorization_factor, &loads_permuted))
> +			   vectorization_factor, &loads_permuted,
> +			   &stmt_cost_vec))
>      {
>        /* Calculate the unrolling factor based on the smallest type.  */
>        if (max_nunits > nunits)
> @@ -1568,6 +1582,7 @@ vect_analyze_slp_instance (loop_vec_info loop_vinf
>            if (vect_print_dump_info (REPORT_SLP))
>              fprintf (vect_dump, "Build SLP failed: unrolling required in basic"
>                                 " block SLP");
> +	  VEC_free (stmt_info_for_cost, heap, stmt_cost_vec);
>            return false;
>          }
>  
> @@ -1578,6 +1593,7 @@ vect_analyze_slp_instance (loop_vec_info loop_vinf
>        SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = unrolling_factor;
>        SLP_INSTANCE_OUTSIDE_OF_LOOP_COST (new_instance) = outside_cost;
>        SLP_INSTANCE_INSIDE_OF_LOOP_COST (new_instance) = inside_cost;
> +      SLP_INSTANCE_STMT_COST_VEC (new_instance) = stmt_cost_vec;
>        SLP_INSTANCE_LOADS (new_instance) = loads;
>        SLP_INSTANCE_FIRST_LOAD_STMT (new_instance) = NULL;
>        SLP_INSTANCE_LOAD_PERMUTATION (new_instance) = load_permutation;
> @@ -1617,6 +1633,8 @@ vect_analyze_slp_instance (loop_vec_info loop_vinf
>  
>        return true;
>      }
> +  else
> +    VEC_free (stmt_info_for_cost, heap, stmt_cost_vec);
>  
>    /* Failed to SLP.  */
>    /* Free the allocated memory.  */
> @@ -1918,8 +1936,9 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb
>  {
>    VEC (slp_instance, heap) *slp_instances = BB_VINFO_SLP_INSTANCES (bb_vinfo);
>    slp_instance instance;
> -  int i;
> +  int i, j;
>    unsigned int vec_outside_cost = 0, vec_inside_cost = 0, scalar_cost = 0;
> +  unsigned int target_model_inside_cost;
>    unsigned int stmt_cost;
>    gimple stmt;
>    gimple_stmt_iterator si;
> @@ -1927,12 +1946,19 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb
>    stmt_vec_info stmt_info = NULL;
>    tree dummy_type = NULL;
>    int dummy = 0;
> +  stmt_vector_for_cost stmt_cost_vec;
> +  stmt_info_for_cost *ci;
>  
>    /* Calculate vector costs.  */
>    FOR_EACH_VEC_ELT (slp_instance, slp_instances, i, instance)
>      {
>        vec_outside_cost += SLP_INSTANCE_OUTSIDE_OF_LOOP_COST (instance);
>        vec_inside_cost += SLP_INSTANCE_INSIDE_OF_LOOP_COST (instance);
> +      stmt_cost_vec = SLP_INSTANCE_STMT_COST_VEC (instance);
> +
> +      FOR_EACH_VEC_ELT (stmt_info_for_cost, stmt_cost_vec, j, ci)
> +	add_stmt_cost (target_cost_data, ci->count, ci->kind,
> +		       vinfo_for_stmt (ci->stmt), ci->misalign);
>      }
>  
>    /* Calculate scalar cost.  */
> @@ -1971,6 +1997,12 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb
>        fprintf (vect_dump, "  Scalar cost of basic block: %d", scalar_cost);
>      }
>  
> +  /* Complete the target-specific cost calculation.  */
> +  target_model_inside_cost = (unsigned) finish_cost (target_cost_data);
> +
> +  /* For now, the two inside-cost calculations should match.  */
> +  gcc_assert (vec_inside_cost == target_model_inside_cost);
> +
>    /* Vectorization is profitable if its cost is less than the cost of scalar
>       version.  */
>    if (vec_outside_cost + vec_inside_cost >= scalar_cost)
> @@ -2050,6 +2082,9 @@ vect_slp_analyze_bb_1 (basic_block bb)
>        return NULL;
>      }
>  
> +  /* Initialize the target-specific cost model.  */
> +  target_cost_data = init_cost (NULL);
> +
>    /* Check the SLP opportunities in the basic block, analyze and build SLP
>       trees.  */
>    if (!vect_analyze_slp (NULL, bb_vinfo))
> @@ -2059,6 +2094,7 @@ vect_slp_analyze_bb_1 (basic_block bb)
>                              "in basic block.\n");
>  
>        destroy_bb_vec_info (bb_vinfo);
> +      destroy_cost_data (target_cost_data);
>        return NULL;
>      }
>  
> @@ -2072,13 +2108,14 @@ vect_slp_analyze_bb_1 (basic_block bb)
>        vect_mark_slp_stmts_relevant (SLP_INSTANCE_TREE (instance));
>      }
>  
> -   if (!vect_verify_datarefs_alignment (NULL, bb_vinfo))
> +  if (!vect_verify_datarefs_alignment (NULL, bb_vinfo))
>      {
>        if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
>          fprintf (vect_dump, "not vectorized: unsupported alignment in basic "
>                              "block.\n");
>  
>        destroy_bb_vec_info (bb_vinfo);
> +      destroy_cost_data (target_cost_data);
>        return NULL;
>      }
>  
> @@ -2088,6 +2125,7 @@ vect_slp_analyze_bb_1 (basic_block bb)
>          fprintf (vect_dump, "not vectorized: bad operation in basic block.\n");
>  
>        destroy_bb_vec_info (bb_vinfo);
> +      destroy_cost_data (target_cost_data);
>        return NULL;
>      }
>  
> @@ -2175,17 +2213,30 @@ vect_slp_analyze_bb (basic_block bb)
>  void
>  vect_update_slp_costs_according_to_vf (loop_vec_info loop_vinfo)
>  {
> -  unsigned int i, vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  unsigned int i, j, vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    VEC (slp_instance, heap) *slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
>    slp_instance instance;
> +  stmt_vector_for_cost stmt_cost_vec;
> +  stmt_info_for_cost *si;
>  
>    if (vect_print_dump_info (REPORT_SLP))
>      fprintf (vect_dump, "=== vect_update_slp_costs_according_to_vf ===");
>  
>    FOR_EACH_VEC_ELT (slp_instance, slp_instances, i, instance)
> -    /* We assume that costs are linear in ncopies.  */
> -    SLP_INSTANCE_INSIDE_OF_LOOP_COST (instance) *= vf
> -      / SLP_INSTANCE_UNROLLING_FACTOR (instance);
> +    {
> +      /* We assume that costs are linear in ncopies.  */
> +      int ncopies = vf / SLP_INSTANCE_UNROLLING_FACTOR (instance);
> +      SLP_INSTANCE_INSIDE_OF_LOOP_COST (instance) *= ncopies;
> +
> +      /* Record the instance's instructions in the target cost model.
> +	 This was delayed until here because the count of instructions
> +	 isn't known beforehand.  */
> +      stmt_cost_vec = SLP_INSTANCE_STMT_COST_VEC (instance);
> +
> +      FOR_EACH_VEC_ELT (stmt_info_for_cost, stmt_cost_vec, j, si)
> +	add_stmt_cost (target_cost_data, si->count * ncopies, si->kind,
> +		       vinfo_for_stmt (si->stmt), si->misalign);
> +    }
>  }


