This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel


> > According to Agner's tables, gathers range from 12 ops (vgatherdpd)
> > to 66 ops (vpgatherdd).  I assume that the CPU needs to do the following:
> > 
> > 1) transfer the offsets from the SSE unit to the ALU for address
> >    generation (3 cycles each, 2 ops)
> > 2) do the address calculation (2 ops, probably 4 ops because it does
> >    not map naturally to the AGU)
> > 3) do the load (7 cycles each, 2 ops)
> > 4) merge the results (1 op)
> > 
> > so I get 7 ops, not sure what the remaining 5 do.
> > 
> > Agner does not account for time, but according to
> > http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt the
> > gather time ranges from 14 cycles (vgatherpd) to 20 cycles.  Here I guess it is
> > 3+1+7+1=12 so it seems to work.
> > 
> > If you implement the gather by hand, you save the SSE->address calculation
> > path and thus you can get faster.
> 
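
The uop and latency tallies in the quoted estimate can be checked with a
small sketch.  The per-step numbers are the ones from the quoted list; the
struct and function names are hypothetical, and assigning the two 1-cycle
terms of 3+1+7+1 to the address calculation and the merge is an assumption,
since the list only gives cycle counts for the transfer and the load.

```c
/* Hypothetical names; the numbers are the estimates from the quoted list.  */
struct gather_step
{
  const char *name;
  int uops;    /* estimated uops for this step */
  int cycles;  /* estimated latency in cycles (1-cycle entries are assumed) */
};

static const struct gather_step gather_steps[] = {
  { "move offsets SSE->ALU", 2, 3 },
  { "address calculation",   2, 1 },
  { "load",                  2, 7 },
  { "merge results",         1, 1 },
};

enum { N_GATHER_STEPS = sizeof gather_steps / sizeof gather_steps[0] };

/* Sum of the per-step uop estimates.  */
int
gather_estimated_uops (void)
{
  int total = 0;
  for (int i = 0; i < N_GATHER_STEPS; i++)
    total += gather_steps[i].uops;
  return total;
}

/* Sum of the per-step latency estimates, assuming the steps are serialized.  */
int
gather_estimated_cycles (void)
{
  int total = 0;
  for (int i = 0; i < N_GATHER_STEPS; i++)
    total += gather_steps[i].cycles;
  return total;
}
```

With these inputs the uop sum is 7, which leaves 5 of the 12 uops reported
for vgatherdpd unexplained, and the latency sum is 12 cycles against the
measured minimum of 14.
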
> I see.  It looks to me like Zen should then disable gather/scatter
> completely and we should implement manual gather/scatter code generation
> in the vectorizer (or lower it in vector lowering).  It sounds like they
> only implemented it to have "complete" AVX2 support (ISTR scatter
> is only in AVX512f).
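
For illustration, a hand-written gather of the kind discussed here can be
sketched in portable C.  This is only a sketch of the idea, not what the
vectorizer emits: manual_gather4_f64 is a hypothetical name, and the fixed
four lanes with 32-bit indices are chosen to mirror the shape of a 256-bit
vgatherdpd.  The indices live in ordinary memory or integer registers, so
each address is formed on the scalar side and no SSE->ALU transfer is needed.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] = base[idx[i]] for four lanes, using four
   scalar loads instead of a single gather instruction.  */
void
manual_gather4_f64 (double dst[4], const double *base, const int idx[4])
{
  for (size_t i = 0; i < 4; i++)
    dst[i] = base[idx[i]];
}
```

A compiler can then combine the four loaded scalars into one vector
register with ordinary inserts or shuffles.
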

Those instructions seem similarly expensive in Intel's implementation.
http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeXeon9_InstLatX64.txt
lists latencies ranging from 18 to 32 cycles.

Of course it may also be the case that the utility is measuring gathers
incorrectly.  According to Agner's tables, Skylake has optimized gathers: they
used to be 12 to 34 uops on Haswell and are now 4 to 5.
> 
> > > Note the largest source of imprecision in the cost model
> > > is vec_perm, because we lack the information about the
> > > permutation mask, which means we can't distinguish between
> > > cross-lane and intra-lane permutes.
> > 
> > Besides that, we lack information about what operation we do (addition
> > or division?), which may be useful to pass down, especially because we do
> > have the relevant information handy in the x86_cost tables.  So I am
> > thinking of adding an extra parameter to the hook telling it the operation.
> 
> Not sure.  The costs are all supposed to be relative to scalar cost
> and I fear we get nearer to a GIGO syndrome when adding more information
> here ;)

Yep, however there is a setup cost (like loads/stores) which comes into play
as well.  I will see how far I can get by making x86 costs more "realistic".

Honza

