This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [11a/n] Avoid retrying with the same vector modes


Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Nov 6, 2019 at 12:02 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Wed, Nov 6, 2019 at 11:21 AM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Biener <richard.guenther@gmail.com> writes:
>> >> > On Tue, Nov 5, 2019 at 9:25 PM Richard Sandiford
>> >> > <richard.sandiford@arm.com> wrote:
>> >> >>
>> >> >> Patch 12/n makes the AArch64 port add four entries to
>> >> >> autovectorize_vector_modes.  Each entry describes a different
>> >> >> vector mode assignment for vector code that mixes 8-bit, 16-bit,
>> >> >> 32-bit and 64-bit elements.  But if (as usual) the vector code has
>> >> >> fewer element sizes than that, we could end up trying the same
>> >> >> combination of vector modes multiple times.  This patch adds a
>> >> >> check to prevent that.
>> >> >>
>> >> >> As before: each patch tested individually on aarch64-linux-gnu and the
>> >> >> series as a whole on x86_64-linux-gnu.
>> >> >>
>> >> >>
>> >> >> 2019-11-04  Richard Sandiford  <richard.sandiford@arm.com>
>> >> >>
>> >> >> gcc/
>> >> >>         * tree-vectorizer.h (vec_info::mode_set): New typedef.
>> >> >>         (vec_info::used_vector_mode): New member variable.
>> >> >>         (vect_chooses_same_modes_p): Declare.
>> >> >>         * tree-vect-stmts.c (get_vectype_for_scalar_type): Record each
>> >> >>         chosen vector mode in vec_info::used_vector_mode.
>> >> >>         (vect_chooses_same_modes_p): New function.
>> >> >>         * tree-vect-loop.c (vect_analyze_loop): Use it to avoid trying
>> >> >>         the same vector statements multiple times.
>> >> >>         * tree-vect-slp.c (vect_slp_bb_region): Likewise.
>> >> >>
>> >> >> Index: gcc/tree-vectorizer.h
>> >> >> ===================================================================
>> >> >> --- gcc/tree-vectorizer.h       2019-11-05 10:48:11.246092351 +0000
>> >> >> +++ gcc/tree-vectorizer.h       2019-11-05 10:57:41.662071145 +0000
>> >> >> @@ -298,6 +298,7 @@ typedef std::pair<tree, tree> vec_object
>> >> >>  /* Vectorizer state common between loop and basic-block vectorization.  */
>> >> >>  class vec_info {
>> >> >>  public:
>> >> >> +  typedef hash_set<int_hash<machine_mode, E_VOIDmode, E_BLKmode> > mode_set;
>> >> >>    enum vec_kind { bb, loop };
>> >> >>
>> >> >>    vec_info (vec_kind, void *, vec_info_shared *);
>> >> >> @@ -335,6 +336,9 @@ typedef std::pair<tree, tree> vec_object
>> >> >>    /* Cost data used by the target cost model.  */
>> >> >>    void *target_cost_data;
>> >> >>
>> >> >> +  /* The set of vector modes used in the vectorized region.  */
>> >> >> +  mode_set used_vector_modes;
>> >> >> +
>> >> >>    /* The argument we should pass to related_vector_mode when looking up
>> >> >>       the vector mode for a scalar mode, or VOIDmode if we haven't yet
>> >> >>       made any decisions about which vector modes to use.  */
>> >> >> @@ -1615,6 +1619,7 @@ extern tree get_related_vectype_for_scal
>> >> >>  extern tree get_vectype_for_scalar_type (vec_info *, tree);
>> >> >>  extern tree get_mask_type_for_scalar_type (vec_info *, tree);
>> >> >>  extern tree get_same_sized_vectype (tree, tree);
>> >> >> +extern bool vect_chooses_same_modes_p (vec_info *, machine_mode);
>> >> >>  extern bool vect_get_loop_mask_type (loop_vec_info);
>> >> >>  extern bool vect_is_simple_use (tree, vec_info *, enum vect_def_type *,
>> >> >>                                 stmt_vec_info * = NULL, gimple ** = NULL);
>> >> >> Index: gcc/tree-vect-stmts.c
>> >> >> ===================================================================
>> >> >> --- gcc/tree-vect-stmts.c       2019-11-05 10:48:11.242092379 +0000
>> >> >> +++ gcc/tree-vect-stmts.c       2019-11-05 10:57:41.662071145 +0000
>> >> >> @@ -11235,6 +11235,10 @@ get_vectype_for_scalar_type (vec_info *v
>> >> >>                                                       scalar_type);
>> >> >>    if (vectype && vinfo->vector_mode == VOIDmode)
>> >> >>      vinfo->vector_mode = TYPE_MODE (vectype);
>> >> >> +
>> >> >> +  if (vectype)
>> >> >> +    vinfo->used_vector_modes.add (TYPE_MODE (vectype));
>> >> >> +
>> >> >
>> >> > Do we actually end up _using_ all types returned by this function?
>> >>
>> >> No, not all of them, so it's a bit crude.  E.g. some types might end up
>> >> not being relevant after pattern recognition, or after we've made a
>> >> final decision about which parts of an address calculation to include
>> >> in a gather or scatter op.  So we can still end up retrying the same
>> >> thing even after the patch.
>> >>
>> >> The problem is that we're trying to avoid pointless retries on failure
>> >> as well as success, so we could end up stopping at arbitrary points.
>> >> I wasn't sure where else to handle this.
>> >
>> > Yeah, I think this "iterating" is somewhat bogus (crude) now.
>>
>> I think it was crude even before the series though. :-)  Not sure the
>> series is making things worse.
>>
>> The problem is that there's a chicken-and-egg problem between how
>> we decide to vectorise and which vector subarchitecture and VF we use.
>> E.g. if we have:
>>
>>   unsigned char *x, *y;
>>   ...
>>   x[i] = (unsigned short) (x[i] + y[i] + 1) >> 1;
>>
>> do we build the SLP graph on the assumption that we need to use short
>> elements, or on the assumption that we can use IFN_AVG_CEIL?  This
>> affects the VF we get out: using IFN_AVG_CEIL gives double the VF
>> relative to doing unsigned short arithmetic.
>>
>> And we need to know which vector subarchitecture we're targetting when
>> making that decision: e.g. Advanced SIMD and SVE2 have IFN_AVG_CEIL,
>> but SVE doesn't.  On the other hand, SVE supports things that Advanced
>> SIMD doesn't.  It's similar story of course for the x86 vector subarchs.
>>
>> For one pattern like this, we could simply try both ways.
>> But that becomes untenable if there are multiple potential patterns.
>> Iterating over the vector subarchs gives us a sensible way of reducing
>> the search space by only applying patterns that the subarch supports.
>>
>> So...
>>
>> > What we'd like to collect is for all defs the vector types we could
>> > use and then vectorizable_ defines constraints between input and
>> > output vector types.  From that we'd arrive at a (possibly quite
>> > large) set of "SLP graphs with vector types" we'd choose from.  I
>> > believe we'll never want to truly explore the whole space but guess we
>> > want to greedily compute those "SLP graphs with vector types" starting
>> > from what (grouped) datarefs tells us is possible (which is kind of
>> > what we do now).
>>
>> ...I don't think we can/should use the same SLP graph to pick vector
>> types for all subarchs, since the ideal graph depends on the subarch.
>> I'm also not sure the vectorizable_* routines could say anything that
>> isn't obvious from the input and output scalar types.  Won't it still be
>> the case that within an SLP instance, all scalars of type T will become
>> the same vector(N) T?
>
> Not necessarily.  I can see the SLP graph containing "reductions"
> (like those complex patterns proposed).  But yes, at the moment
> there's a single group-size per SLP instance.  Now, for the SLP _graph_
> we may have one instance with 4 and one with group size of 8 both
> for example sharing the same grouped load.  It may make sense to
> vectorize the load with v8si and for the group-size 4 SLP instance
> have a "reduction operation" that selects the appropriate part of the
> loaded vector.
>
> Now, vectorizable_* for say a conversion from int to double
> may be able to vectorize for a v4si directly to v4df or to two times v2df.
> With the size restriction relaxed vectorizable_* could opt to choose
> smaller/larger vectors specifically and thus also have different vector types
> for the same scalar type in the SLP graph.  I do expect this to be
> profitable on x86 for some loops due to some asymmetries in the ISA
> (and extra cost of lane-crossing operations for say AVX where using SSE
> is cheaper for some ops even though you now have 2 or more instructions).

Making this dependent on vectorizable_* seems to require a natural
conversion point in the original scalar code.  In your shared data-ref
example, the SLP instance with group size 4 could use 2 v2sis or a
single v4si.  The fact that the data ref is shared with a group size
of 8 shouldn't necessarily affect that choice and e.g. force v4si over
v2si.  So when taking the graph as a whole, it seems like we should be
able to combine 2 v2sis into a single v4si or split a v4si into 2 v2sis
at any point where that's worthwhile, independently of the surrounding
operations.

The same goes for the conversions.  If the target only supports
v4si->v2df conversions, it should still be possible to combine 2 v2dfs
into a single v4df as a separate, independent step if there's a benefit
to operating on v4df.  Same idea in reverse if the target only supports
v4si->v4df and an SLP instance wants to operate on v2df.  It would be
good if this splitting and combining wasn't dependent on having a
conversion.

So it seems like the natural point for inserting group size adjustments
is at sharing boundaries between SLP instances, with each SLP instance
using consistent choices internally (i.e. with the scalar type
determining the vector type).  And we should be able to insert those
group size adjustments based only on the types involved.  Whether
vectorizable_conversion can do a particular group size adjustment on
the fly seems more like a costing/pattern-matching decision rather than
something that should restrict the choice of types.

Thanks,
Richard


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]