This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel


> On Tue, 17 Oct 2017, Jan Hubicka wrote:
> 
> > Hi,
> > gather/scatter loads tend to be expensive (at least on x86), while we
> > currently account them as plain vector loads/stores, which are cheap.
> > This patch adds vectorizer cost entries for them so they can be modelled
> > more realistically.
> > 
> > Bootstrapped/regtested x86_64-linux, OK?
> 
> Ok.  "gather" and "load" are somewhat redundant, likewise
> "scatter" and "store".  So you might want to change it to just
> vector_gather and vector_scatter.  Even vector_ is redundant...

Hehe, coming from outside the vectorizer world, I did not know what
scatter/gather was, so I wanted to keep load/store and vec in the names to
make them easier to google for those who will need to fill in the numbers
in the future :)
> 
> Best available implementations manage to hide the vector build
> cost and just expose the latency of the load(s).  I wonder what
> Zen does here ;)

According to Agner's tables, gathers range from 12 ops (vgatherdpd)
to 66 ops (vpgatherdd).  I assume the CPU needs to do the following:

1) transfer the offsets from the SSE unit to the ALU/AGU side for address
   generation (3 cycles each, 2 ops)
2) do the address calculation (2 ops, probably 4 ops because it does not
   map naturally to the AGU)
3) do the loads (7 cycles each, 2 ops)
4) merge the results (1 op)

so I get 2+2+2+1 = 7 ops; I am not sure what the remaining 5 do.

Agner's tables do not give latencies, but according to
http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt the
gather latency ranges from 14 cycles (vgatherdpd) to 20 cycles.  My estimate
above gives 3+1+7+1=12 cycles, so it seems to match reasonably well.

If you implement the gather by hand, you save the SSE->address-generation
transfer and thus can end up faster; a rough sketch is below.
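
Just to illustrate what I mean (a minimal sketch only, assuming AVX2
intrinsics; the function names are made up, and the hand-written variant
assumes the indices are already available on the integer side, so no
SSE->GPR transfer is needed):

#include <immintrin.h>

/* Hardware gather: a single vgatherdpd; the indices have to sit in an
   SSE register, so the CPU pays the SSE->AGU transfer internally.  */
static __m256d
gather_hw (const double *base, __m128i idx)
{
  return _mm256_i32gather_pd (base, idx, 8);
}

/* Hand-written gather: keep the indices in GPRs/memory, do four scalar
   loads and build the vector, avoiding that transfer.  */
static __m256d
gather_manual (const double *base, const int *idx)
{
  return _mm256_set_pd (base[idx[3]], base[idx[2]],
			base[idx[1]], base[idx[0]]);
}
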
> 
> Note the biggest source of imprecision in the cost model
> is vec_perm, because we lack information about the
> permutation mask, which means we can't distinguish between
> cross-lane and intra-lane permutes.

Besides that, we lack information about which operation we are costing
(an addition or a division?), which would be useful to pass down, especially
because we have the relevant data handy in the x86_cost tables.  So I am
thinking of adding an extra parameter to the hook describing the operation;
a rough sketch of what I mean is below.  What information would we need to
pass for permutations?
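
Very roughly, something like this (illustrative only; the extra OP argument,
the function name and the constants are made up, and the existing hook only
takes the cost kind, vectype and misalignment):

/* Hypothetical extension of the vectorization cost hook: OP says which
   operation is being costed, ERROR_MARK if no single operation applies.
   The numbers are placeholders, not real tuning data.  */
static int
example_vectorization_cost (enum vect_cost_for_stmt kind, tree vectype,
			    int misalign, enum tree_code op)
{
  if (kind == vector_stmt)
    switch (op)
      {
      case PLUS_EXPR:
      case MINUS_EXPR:
	return 1;	/* cheap ALU operation */
      case TRUNC_DIV_EXPR:
      case RDIV_EXPR:
	return 20;	/* division is far more expensive */
      default:
	break;
      }
  return 1;
}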

Honza
> 
> Richard.
> 
> > Honza
> > 
> > 2017-10-17  Jan Hubicka  <hubicka@ucw.cz>
> > 
> > 	* target.h (enum vect_cost_for_stmt): Add vector_gather_load and
> > 	vector_scatter_store.
> > 	* tree-vect-stmts.c (record_stmt_cost): Distinguish between normal
> > 	and gather/scatter ops.
> > 
> > 	* config/aarch64/aarch64.c (aarch64_builtin_vectorization_cost): Add
> > 	vector_gather_load and vector_scatter_store.
> > 	* config/arm/arm.c (arm_builtin_vectorization_cost): Likewise.
> > 	* config/powerpcspe/powerpcspe.c (rs6000_builtin_vectorization_cost):
> > 	Likewise.
> > 	* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Likewise.
> > 	* config/s390/s390.c (s390_builtin_vectorization_cost): Likewise.
> > 	* config/spu/spu.c (spu_builtin_vectorization_cost): Likewise.
> > 
> > Index: config/aarch64/aarch64.c
> > ===================================================================
> > --- config/aarch64/aarch64.c	(revision 253789)
> > +++ config/aarch64/aarch64.c	(working copy)
> > @@ -8547,9 +8547,11 @@ aarch64_builtin_vectorization_cost (enum
> >  	return fp ? costs->vec_fp_stmt_cost : costs->vec_int_stmt_cost;
> >  
> >        case vector_load:
> > +      case vector_gather_load:
> >  	return costs->vec_align_load_cost;
> >  
> >        case vector_store:
> > +      case vector_scatter_store:
> >  	return costs->vec_store_cost;
> >  
> >        case vec_to_scalar:
> > Index: config/arm/arm.c
> > ===================================================================
> > --- config/arm/arm.c	(revision 253789)
> > +++ config/arm/arm.c	(working copy)
> > @@ -11241,9 +11241,11 @@ arm_builtin_vectorization_cost (enum vec
> >          return current_tune->vec_costs->vec_stmt_cost;
> >  
> >        case vector_load:
> > +      case vector_gather_load:
> >          return current_tune->vec_costs->vec_align_load_cost;
> >  
> >        case vector_store:
> > +      case vector_scatter_store:
> >          return current_tune->vec_costs->vec_store_cost;
> >  
> >        case vec_to_scalar:
> > Index: config/powerpcspe/powerpcspe.c
> > ===================================================================
> > --- config/powerpcspe/powerpcspe.c	(revision 253789)
> > +++ config/powerpcspe/powerpcspe.c	(working copy)
> > @@ -5834,6 +5834,8 @@ rs6000_builtin_vectorization_cost (enum
> >        case vector_stmt:
> >        case vector_load:
> >        case vector_store:
> > +      case vector_gather_load:
> > +      case vector_scatter_store:
> >        case vec_to_scalar:
> >        case scalar_to_vec:
> >        case cond_branch_not_taken:
> > Index: config/rs6000/rs6000.c
> > ===================================================================
> > --- config/rs6000/rs6000.c	(revision 253789)
> > +++ config/rs6000/rs6000.c	(working copy)
> > @@ -5398,6 +5398,8 @@ rs6000_builtin_vectorization_cost (enum
> >        case vector_stmt:
> >        case vector_load:
> >        case vector_store:
> > +      case vector_gather_load:
> > +      case vector_scatter_store:
> >        case vec_to_scalar:
> >        case scalar_to_vec:
> >        case cond_branch_not_taken:
> > Index: config/s390/s390.c
> > ===================================================================
> > --- config/s390/s390.c	(revision 253789)
> > +++ config/s390/s390.c	(working copy)
> > @@ -3717,6 +3717,8 @@ s390_builtin_vectorization_cost (enum ve
> >        case vector_stmt:
> >        case vector_load:
> >        case vector_store:
> > +      case vector_gather_load:
> > +      case vector_scatter_store:
> >        case vec_to_scalar:
> >        case scalar_to_vec:
> >        case cond_branch_not_taken:
> > Index: config/spu/spu.c
> > ===================================================================
> > --- config/spu/spu.c	(revision 253789)
> > +++ config/spu/spu.c	(working copy)
> > @@ -6625,6 +6625,8 @@ spu_builtin_vectorization_cost (enum vec
> >        case vector_stmt:
> >        case vector_load:
> >        case vector_store:
> > +      case vector_gather_load:
> > +      case vector_scatter_store:
> >        case vec_to_scalar:
> >        case scalar_to_vec:
> >        case cond_branch_not_taken:
> > Index: target.h
> > ===================================================================
> > --- target.h	(revision 253789)
> > +++ target.h	(working copy)
> > @@ -171,9 +171,11 @@ enum vect_cost_for_stmt
> >    scalar_store,
> >    vector_stmt,
> >    vector_load,
> > +  vector_gather_load,
> >    unaligned_load,
> >    unaligned_store,
> >    vector_store,
> > +  vector_scatter_store,
> >    vec_to_scalar,
> >    scalar_to_vec,
> >    cond_branch_not_taken,
> > Index: tree-vect-stmts.c
> > ===================================================================
> > --- tree-vect-stmts.c	(revision 253789)
> > +++ tree-vect-stmts.c	(working copy)
> > @@ -95,6 +95,12 @@ record_stmt_cost (stmt_vector_for_cost *
> >  		  enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
> >  		  int misalign, enum vect_cost_model_location where)
> >  {
> > +  if ((kind == vector_load || kind == unaligned_load)
> > +      && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> > +    kind = vector_gather_load;
> > +  if ((kind == vector_store || kind == unaligned_store)
> > +      && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> > +    kind = vector_scatter_store;
> >    if (body_cost_vec)
> >      {
> >        tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
> > 
> > 
> 
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

