This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel

From: Richard Biener <rguenther at suse dot de>
To: Jan Hubicka <hubicka at ucw dot cz>
Cc: gcc-patches at gcc dot gnu dot org, Venkataramanan dot Kumar at amd dot com
Date: Wed, 18 Oct 2017 09:35:53 +0200 (CEST)
Subject: Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel
Authentication-results: sourceware.org; auth=none
References: <20171017133415.GC94155@kam.mff.cuni.cz> <alpine.LSU.2.20.1710171536550.5588@zhemvz.fhfr.qr> <20171017172217.GD94155@kam.mff.cuni.cz>

On Tue, 17 Oct 2017, Jan Hubicka wrote:

> > On Tue, 17 Oct 2017, Jan Hubicka wrote:
> > 
> > > Hi,
> > > gether/scatter loads tends to be expensive (at least for x86) while we now account them
> > > as vector loads/stores which are cheap.  This patch adds vectorizer cost entry for these
> > > so this can be modelled more realistically.
> > > 
> > > Bootstrapped/regtested x86_64-linux, OK?
> > 
> > Ok.  gather and load is somewhat redundant, likewise
> > scatter and store.  So you might want to change it to just
> > vector_gather and vector_scatter.  Even vector_ is redundant...
> 
> Hehe, comming from outside of vectorizer world, I did not know what
> scatter/gather is and thus I wanted to keep load/store and vec in so it will be
> easier to google for those who will need to fill in the numbers in future :)
> > 
> > Best available implementations manage to hide the vector build
> > cost and just expose the latency of the load(s).  I wonder what
> > Zen does here ;)
> 
> According to Agner's tables, gathers range from 12 ops (vgatherdpd)
> to 66 ops (vpgatherdd).  I assume that CPU needs to do following:
> 
> 1) transfer the offsets sse->ALU unit for address generation (3 cycles
>    each, 2 ops)
> 2) do the address calcualtion (2 ops, probably 4 ops because it does not map naturally
> 			       to AGU)
> 2) do the load (7 cycles each, 2 ops)
> 3) merge results (1 ops)
> 
> so I get 7 ops, not sure what remaining 5 do.
> 
> Agner does not account time, but According to
> http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt the
> gather time ranges from 14 cycles (vgatherpd) to 20 cycles.  Here I guess it is
> 3+1+7+1=12 so it seems to work.
> 
> If you implement gather by hand, you save the SSE->address caluclation path and
> thus you can get faster.

I see.  It looks to me Zen should disable gather/scatter then completely
and we should implement manual gather/scatter code-generation in the
vectorizer (or lower it in vector lowering).  It sounds like they
only implemented it to have "complete" AVX2 support (ISTR scatter
is only in AVX512f).

> > Note the most major source of impreciseness in the cost model
> > is from vec_perm because we lack the information of the
> > permutation mask which means we can't distinguish between
> > cross-lane and intra-lane permutes.
> 
> Besides that we lack information about what operation we do (addition
> or division?) which may be useful to pass down, especially because we do
> have relevant information handy in the x86_cost tables.  So I am thinking
> of adding extra parameter to the hook telling the operation.

Not sure.  The costs are all supposed to be relative to scalar cost
and I fear we get nearer to a GIGO syndrome when adding more information
here ;)

> What info we need to pass for permutations?

The full constant permutation vector ...

Note that I think this particular cost hook isn't the best one.  We've
added TARGET_VECTORIZE_ADD_STMT_COST and friends to be the more
"powerful" ones but unfortunately the vectorizer itself doesn't
always use that and it's somewhat too powerful and at the same time
the vectorizer doesn't provide all information in a convenient way
through the passed stmt_info but the hook would have to reverse-engineer
things (like a permutation mask).  At least the operation code is
readily available here.

I think it only needs some minor refactoring to make the vectorizer
always use the "proper" hook though.

Richard.

> Honza
> > 
> > Richard.
> > 
> > > Honza
> > > 
> > > 2017-10-17  Jan Hubicka  <hubicka@ucw.cz>
> > > 
> > > 	* target.h (enum vect_cost_for_stmt): Add vec_gather_load and
> > > 	vec_scatter_store
> > > 	* tree-vect-stmts.c (record_stmt_cost): Make difference between normal
> > > 	and scatter/gather ops.
> > > 
> > > 	* aarch64/aarch64.c (aarch64_builtin_vectorization_cost): Add
> > > 	vec_gather_load and vec_scatter_store.
> > > 	* arm/arm.c (arm_builtin_vectorization_cost): Likewise.
> > > 	* powerpcspe/powerpcspe.c (rs6000_builtin_vectorization_cost): Likewise.
> > > 	* rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Likewise.
> > > 	* s390/s390.c (s390_builtin_vectorization_cost): Likewise.
> > > 	* spu/spu.c (spu_builtin_vectorization_cost): Likewise.
> > > 
> > > Index: config/aarch64/aarch64.c
> > > ===================================================================
> > > --- config/aarch64/aarch64.c	(revision 253789)
> > > +++ config/aarch64/aarch64.c	(working copy)
> > > @@ -8547,9 +8547,10 @@ aarch64_builtin_vectorization_cost (enum
> > >  	return fp ? costs->vec_fp_stmt_cost : costs->vec_int_stmt_cost;
> > >  
> > >        case vector_load:
> > > +      case vector_gather_load:
> > >  	return costs->vec_align_load_cost;
> > >  
> > > -      case vector_store:
> > > +      case vector_scatter_store:
> > >  	return costs->vec_store_cost;
> > >  
> > >        case vec_to_scalar:
> > > Index: config/arm/arm.c
> > > ===================================================================
> > > --- config/arm/arm.c	(revision 253789)
> > > +++ config/arm/arm.c	(working copy)
> > > @@ -11241,9 +11241,11 @@ arm_builtin_vectorization_cost (enum vec
> > >          return current_tune->vec_costs->vec_stmt_cost;
> > >  
> > >        case vector_load:
> > > +      case vector_gather_load:
> > >          return current_tune->vec_costs->vec_align_load_cost;
> > >  
> > >        case vector_store:
> > > +      case vector_scatter_store:
> > >          return current_tune->vec_costs->vec_store_cost;
> > >  
> > >        case vec_to_scalar:
> > > Index: config/powerpcspe/powerpcspe.c
> > > ===================================================================
> > > --- config/powerpcspe/powerpcspe.c	(revision 253789)
> > > +++ config/powerpcspe/powerpcspe.c	(working copy)
> > > @@ -5834,6 +5834,8 @@ rs6000_builtin_vectorization_cost (enum
> > >        case vector_stmt:
> > >        case vector_load:
> > >        case vector_store:
> > > +      case vector_gather_load:
> > > +      case vector_scatter_store:
> > >        case vec_to_scalar:
> > >        case scalar_to_vec:
> > >        case cond_branch_not_taken:
> > > Index: config/rs6000/rs6000.c
> > > ===================================================================
> > > --- config/rs6000/rs6000.c	(revision 253789)
> > > +++ config/rs6000/rs6000.c	(working copy)
> > > @@ -5398,6 +5398,8 @@ rs6000_builtin_vectorization_cost (enum
> > >        case vector_stmt:
> > >        case vector_load:
> > >        case vector_store:
> > > +      case vector_gather_load:
> > > +      case vector_scatter_store:
> > >        case vec_to_scalar:
> > >        case scalar_to_vec:
> > >        case cond_branch_not_taken:
> > > Index: config/s390/s390.c
> > > ===================================================================
> > > --- config/s390/s390.c	(revision 253789)
> > > +++ config/s390/s390.c	(working copy)
> > > @@ -3717,6 +3717,8 @@ s390_builtin_vectorization_cost (enum ve
> > >        case vector_stmt:
> > >        case vector_load:
> > >        case vector_store:
> > > +      case vector_gather_load:
> > > +      case vector_scatter_store:
> > >        case vec_to_scalar:
> > >        case scalar_to_vec:
> > >        case cond_branch_not_taken:
> > > Index: config/spu/spu.c
> > > ===================================================================
> > > --- config/spu/spu.c	(revision 253789)
> > > +++ config/spu/spu.c	(working copy)
> > > @@ -6625,6 +6625,8 @@ spu_builtin_vectorization_cost (enum vec
> > >        case vector_stmt:
> > >        case vector_load:
> > >        case vector_store:
> > > +      case vector_gather_load:
> > > +      case vector_scatter_store:
> > >        case vec_to_scalar:
> > >        case scalar_to_vec:
> > >        case cond_branch_not_taken:
> > > Index: target.h
> > > ===================================================================
> > > --- target.h	(revision 253789)
> > > +++ target.h	(working copy)
> > > @@ -171,9 +171,11 @@ enum vect_cost_for_stmt
> > >    scalar_store,
> > >    vector_stmt,
> > >    vector_load,
> > > +  vector_gather_load,
> > >    unaligned_load,
> > >    unaligned_store,
> > >    vector_store,
> > > +  vector_scatter_store,
> > >    vec_to_scalar,
> > >    scalar_to_vec,
> > >    cond_branch_not_taken,
> > > Index: tree-vect-stmts.c
> > > ===================================================================
> > > --- tree-vect-stmts.c	(revision 253789)
> > > +++ tree-vect-stmts.c	(working copy)
> > > @@ -95,6 +95,12 @@ record_stmt_cost (stmt_vector_for_cost *
> > >  		  enum vect_cost_for_stmt kind, stmt_vec_info stmt_info,
> > >  		  int misalign, enum vect_cost_model_location where)
> > >  {
> > > +  if ((kind == vector_load || kind == unaligned_load)
> > > +      && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> > > +    kind = vector_gather_load;
> > > +  if ((kind == vector_store || kind == unaligned_store)
> > > +      && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> > > +    kind = vector_scatter_store;
> > >    if (body_cost_vec)
> > >      {
> > >        tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
> > > 
> > > 
> > 
> > -- 
> > Richard Biener <rguenther@suse.de>
> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

Follow-Ups:
- Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel
  - From: Jan Hubicka

References:
- [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel
  - From: Jan Hubicka
- Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel
  - From: Richard Biener
- Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel
  - From: Jan Hubicka

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]