This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Disable autogeneration of gather instructions on Ryzen and generic
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, Martin Jambor <mjambor at suse dot cz>, sergey dot shalnov at intel dot com
- Date: Tue, 9 Jan 2018 12:30:37 +0100
- Subject: Re: Disable autogeneration of gather instructions on Ryzen and generic
- Authentication-results: sourceware.org; auth=none
- References: <20180109102606.GA59414@kam.mff.cuni.cz> <CAFiYyc1YVPMoTLLsR+7qUuZg7DgFXYhsGJ=r4eQqW9pbeKuvDg@mail.gmail.com> <20180109105828.GA80189@kam.mff.cuni.cz>
On Tue, Jan 9, 2018 at 11:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Hi,
>> > gather instructions are rather hard to implement in hardware and except for
>> > skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the
>> > degree I did not find real world loop where gather would help on Zen.
>> > This patch simply adds a knob to disable its autogeneration (builtin still
>> > works). I have considered two alternatives
>> > 1) tune this with x86-tune-costs because gather is still profitable than
>> > scalar code if we do very expensive operations (such as sequence of divides)
>> > on the values gathered/scattered
>> > 2) implement expansion of gathers into primitive instructions. This is faster
>> > as semantics of gather is bit weird and not fully needed for our vectorizer
>> >
>> > I did not have luck to get any good results out of 1 alone as cost model is not
>> > very realistic. 1+2 probably makes sense but we can do this incrementally as
>> > that would make most sense to be implemented generically in vectorizer on the
>> > top of this change.
>>
>> Note that the vectorizer gives up on loops with gathers with no target
>> support for
>> gathers. It could simply open-code the gather though (and properly cost that
>> open-coded variant), that's probably the way to go here.
>
> Agreed, that is whay i meant by 2. Generic code has all info to implement gathers/scatters
> by hand when they are not supported and estimate cost of the implementation.
> This is what I meant by the paragraph above. I tried to look into it about month
> ago but got bit lost in vectorizer internals :(
Yeah, I think the only bits that cannot be done open-coded in an
efficient way is
the masking.
Richard.
> Honza
>>
>> Richard.
>>
>> > Given that gather is problematic even on skylake+ as shown in the PR (which has
>> > most optimized implementation of it) it is good to have a knob to control its
>> > codegen at first place.
>> >
>> > I have also disabled gathers for generic. This is because its use causes
>> > some two-digit regression on zen for spec2k17 while there are no measurable
>> > benefits on Intel. Note that this affects only
>> > -march=<somethig supporting gather> -mtune=generic
>> > as by default we do not use AVX2.
>> >
>> > Bootstrapped/regtested x86_64-linux, plan to commit it later today if there
>> > are no complains.
>> >
>> > Honza
>> >
>> > PR target/81616
>> > * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
>> > * i386.h (TARGET_USE_GATHER): Define.
>> > * x86-tune.def (X86_TUNE_USE_GATHER): New.
>> > Index: config/i386/i386.c
>> > ===================================================================
>> > --- config/i386/i386.c (revision 256369)
>> > +++ config/i386/i386.c (working copy)
>> > @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre
>> > bool si;
>> > enum ix86_builtins code;
>> >
>> > - if (! TARGET_AVX2)
>> > + if (! TARGET_AVX2 || !TARGET_USE_GATHER)
>> > return NULL_TREE;
>> >
>> > if ((TREE_CODE (index_type) != INTEGER_TYPE
>> > Index: config/i386/i386.h
>> > ===================================================================
>> > --- config/i386/i386.h (revision 256369)
>> > +++ config/i386/i386.h (working copy)
>> > @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[
>> > ix86_tune_features[X86_TUNE_SLOW_PSHUFB]
>> > #define TARGET_AVOID_4BYTE_PREFIXES \
>> > ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES]
>> > +#define TARGET_USE_GATHER \
>> > + ix86_tune_features[X86_TUNE_USE_GATHER]
>> > #define TARGET_FUSE_CMP_AND_BRANCH_32 \
>> > ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
>> > #define TARGET_FUSE_CMP_AND_BRANCH_64 \
>> > Index: config/i386/x86-tune.def
>> > ===================================================================
>> > --- config/i386/x86-tune.def (revision 256369)
>> > +++ config/i386/x86-tune.def (working copy)
>> > @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps
>> > DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>> > m_SILVERMONT | m_INTEL)
>> >
>> > +/* X86_TUNE_USE_GATHER: Use gather instructions. */
>> > +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
>> > + ~(m_ZNVER1 | m_GENERIC))
>> > +
>> > /*****************************************************************************/
>> > /* AVX instruction selection tuning (some of SSE flags affects AVX, too) */
>> > /*****************************************************************************/