Re: [PATCH][i386] Update Zen tuning for vector load cost


On Mon, 8 Oct 2018, Jan Hubicka wrote:

> > 
> > This adjusts the Zen AVX256 vector load cost to be twice as expensive
> > as the AVX128 vector load cost (twice via ix86_vec_cost keying on
> > TARGET_AVX128_OPTIMAL - it should rather use some
> > TARGET_VECTOR_IMPL_WIDTH or so).
> > 
> > Likely the current cost value was meant to make AVX256 loads _cheaper_
> > than two AVX128 ones, in which case we'd have to use 5 here.  Honza,
> > what was the intent here?  Maybe we should not call ix86_vec_cost for
> > loads or stores at all (given we model the different sizes
> > explicitly)?  The odd thing is that in some places you pass
> > ix86_vec_cost COSTS_N_INSNS (ix86_cost->...) and in others just
> > ix86_cost->....  That's probably caused by most entries in the cost
> > tables being scaled via COSTS_N_INSNS but not all, which makes them
> > not easily comparable... (I found a comment that says
> > "We assume COSTS_N_INSNS is defined as (N)*4")
> > 
> > That ix86_vec_cost adds a factor of two for AVX256 also throws
> > throughput into the mix, because latency-wise AVX256 behaves the
> > same as AVX128 AFAICS.  That is, the modeling isn't very precise...
> > (multiplying by two still makes sense IMHO - but, for example, stores
> > have only one 128-bit port, so those are the ones that should be
> > pessimized by more than a factor of two, if any)
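
For reference, the unit mismatch mentioned in the quoted message comes down
to the COSTS_N_INSNS scaling.  A minimal sketch - the macro definition
matches the comment quoted above, while the two entries are made-up values,
not actual znver1 costs:

  /* GCC's cost unit: N instructions scaled by 4, per the quoted
     "We assume COSTS_N_INSNS is defined as (N)*4" comment.  */
  #define COSTS_N_INSNS(N) ((N) * 4)

  /* Hypothetical entry stored pre-scaled, already in COSTS_N_INSNS units.  */
  static const int cost_a = COSTS_N_INSNS (6);   /* == 24 */

  /* Hypothetical entry stored as a raw count, scaled only at use sites.  */
  static const int cost_b = 6;                   /* == 6 */

  /* Comparing cost_a against cost_b directly is off by a factor of four,
     which is what makes mixed-convention tables "not easily comparable"
     unless every consumer knows which convention a given entry uses.  */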
> 
> Hmm, the intent was to make the tables latency based, so a 256-bit load
> should definitely be cheaper than two 128-bit loads.  I guess the current
> state is the result of the TARGET_AVX128_OPTIMAL adjustment becoming
> inconsistent.  It did not make much difference previously because
> TARGET_AVX128_OPTIMAL prevents us from using 256-bit instructions in most
> cases.

Yeah, I guess I'll propose splitting that TARGET_AVX128_OPTIMAL use in
ix86_vec_cost out into a separate flag/tunable (possibly just a cost-table
entry, because it really is closely tied to the costs we assign).
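
As a rough illustration of the doubling in question - a simplified sketch
with hypothetical names, not the actual ix86_vec_cost source:

  /* Sketch: scale a per-operation cost for wide vector modes when the
     tuning prefers 128-bit operations (TARGET_AVX128_OPTIMAL-style).  */
  static int
  vec_cost_sketch (unsigned mode_bits, int cost, int avx128_optimal)
  {
    /* A 256-bit operation is costed as two 128-bit halves.  */
    if (mode_bits > 128 && avx128_optimal)
      return cost * (mode_bits / 128);
    return cost;
  }

With the old table entry of 10, a 256-bit load came out as 10 * 2 == 20,
well above two 128-bit loads (6 + 6 == 12).  The patched entry of 6 gives
6 * 2 == 12, exactly two 128-bit loads; an entry of 5 (10 after doubling)
would have made the 256-bit load cheaper than two 128-bit ones, matching
the latency-based intent quoted above.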

> The vectorizer cost model is a bit of an odd mix of throughputs and
> latencies in general.  I wondered if it would make sense to compute the
> cost of a loop by producing the DAG of dependencies and computing the
> cost of the most expensive path through the DAG...

Yeah - in the office we threw around some thoughts about throwing
all stmts added via the add_stmt hook at the scheduler pipeline model
and computing that path, basically only modeling resources and dependences.
We'd need to identify instructions (do instruction selection) at the
granularity of the pipeline description, of course, which may be
"interesting".

> Patch looks OK to me (it makes the situation better than before for sure :)

Thanks, I've also done some benchmarking with this and only saw
positive effects.

Applied.

Richard.

> Honza
> > 
> > Anyhow, the patch removes one oddity when comparing the costs of
> > vectorized loops that have no lane-crossing operations (after the patch
> > such loops will cost the same with AVX128 and AVX256, even though
> > Zen's frontend will benefit from using AVX256).
> > 
> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> > 
> > OK?
> > 
> > Thanks,
> > Richard.
> > 
> > 2018-10-08  Richard Biener  <rguenther@suse.de>
> > 
> > 	* config/i386/x86-tune-costs.h (znver1_cost): Make AVX256 vector loads
> > 	cost the same as AVX128 ones.
> > 
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index 71a5854c09a..c7f3945d72c 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -1518,9 +1518,9 @@ struct processor_costs znver1_cost = {
> >    {8, 8},				/* cost of storing MMX registers
> >  					   in SImode and DImode.  */
> >    2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
> > -  {6, 6, 6, 10, 20},			/* cost of loading SSE registers
> > +  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit.  */
> > -  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
> > +  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
> >    {8, 8, 8, 8, 16},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit.  */
> >    {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

