This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

From: Sameera Deshpande []
Sent: 20 June 2016 11:37:58
To: Richard Biener
Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole;; Jaydeep Patil
Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande
> <> wrote:
>> On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:
>>> On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener
>>> <> wrote:
>>>> On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande
>>>> <> wrote:
>>>>> Hi Richard,
>>>>> This is with reference to our discussion at GNU Tools Cauldron 2015
>>>>> regarding my talk titled "Improving the effectiveness and generality of GCC
>>>>> auto-vectorization." Further to our prototype implementation of the concept,
>>>>> we have started implementing this concept in GCC.
>>>>> We are following incremental model to add language support in our
>>>>> front-end, and corresponding back-end (for auto-vectorizer) will be added
>>>>> for feature completion.
>>>>> Looking at the complexity and scale of the project, we have divided this
>>>>> project into subtasks listed below, for ease of implementation, testing and
>>>>> review.
>>>>> 0. Add new pass to perform autovectorization using unified
>>>>> representation - Current GCC framework does not give complete overview of
>>>>> the loop to be vectorized : it either breaks the loop across body, or across
>>>>> iterations. Because of which these data structures can not be reused for our
>>>>> approach which gathers all the information of loop body at one place using
>>>>> primitive permute operations. Hence, define new data structures and populate
>>>>> them.
>>>>> 1. Add support for vectorization of LOAD/STORE instructions
>>>>>       a. Create permute order tree for the loop with LOAD and STORE
>>>>> instructions for single or multi-dimensional arrays, aggregates within
>>>>> nested loops.
>>>>>       b. Basic transformation phase to generate vectorized code for the
>>>>> primitive reorder tree generated at stage 1a using tree tiling algorithm.
>>>>> This phase handles code generation for SCATTER, GATHER, stridded memory
>>>>> accesses etc. along with permute instruction generation.
>>>>> 2. Implementation of k-arity promotion/reduction : The permute nodes
>>>>> within primitive reorder tree generated from input program can have any
>>>>> arity. However, the target can support maximum of arity = 2 in most of the
>>>>> cases. Hence, we need to promote or reduce the arity of permute order tree
>>>>> to enable successful tree tiling.
>>>>> 3. Vector size reduction : Depending upon the vector size for target,
>>>>> reduce vector size per statement and adjust the loop count for vectorized
>>>>> loop accordingly.
>>>>> 4. Support simple arithmetic operations :
>>>>>       a. Add support for analyzing statements with simple arithmetic
>>>>> operations like +, -, *, / for vectorization, and create primitive reorder
>>>>> tree with compute_op.
>>>>>       b. Generate vector code for primitive reorder tree generated at
>>>>> stage 4a using tree tiling algorithm - here support for complex patterns
>>>>> like multiply-add should be checked and appropriate instruction to be
>>>>> generated.
>>>>> 5. Support reduction operation :
>>>>>       a. Add support for reduction operation analysis and primitive
>>>>> reorder tree generation. The reduction operation needs special handling, as
>>>>> the finish statement should COLLAPSE the temporary reduction vector TEMP_VAR
>>>>> into original reduction variable.
>>>>>       b. The code generation for primitive reorder tree does not need any
>>>>> handling - as reduction tree is same as tree generated in 4a, with only
>>>>> difference that in 4a, the destination is MEMREF (because of STORE
>>>>> operation) and for reduction it is TEMP_VAR. At this stage, generate code
>>>>> for COLLAPSE node in finish statements.
>>>>> 6. Support other vectorizable statements like complex arithmetic
>>>>> operations, bitwise operations, type conversions etc.
>>>>>       a. Add support for analysis and primitive reorder tree generation.
>>>>>       b. Vector code generation.
>>>>> 7. Cost effective tree tiling algorithm : Till now, the tree tiling is
>>>>> happening without considering cost of computation. However, there can be
>>>>> multiple target instructions covering the tree - hence, instead of picking
>>>>> first matched largest instruction cover, select the instruction cover based
>>>>> on cost of instruction given in .md for the target.
>>>>> 8. Optimizations on created primitive reorder tree : This stage is open
>>>>> ended, and depending upon perf analysis, the scope of it can be defined.
>>>>> The current patch I have attached herewith handles stage 0 and 1a : Adds
>>>>> new pass to perform autovectorization using unified representation, defines
>>>>> new data structures to cater to this requirement and creates primitive
>>>>> reorder tree for LOAD/STORE instructions within the loop.
>>>>> The whole loop is represented using the ITER_NODE, which have
>>>>> information about
>>>>> - The preparatory statements for vectorization to be executed before
>>>>> entering the loop (like initialization of vectors, prepping for reduction
>>>>> operations, peeling etc.)
>>>>> - Vectorizable loop body represented as PRIMOP_TREE (primitive
>>>>> reordering tree)
>>>>> - Final statements (For peeling, variable loop bound, COLLAPSE operation
>>>>> for reduction etc.)
>>>>> - Other loop attributes (loop bound, peeling needed, dependences, etc.)
>>>>> Memory accesses within a loop have definite repetitive pattern which can
>>>>> be captured using primitive permute operators which can be used to
>>>>> determine desired permute order for the vector computations.  The
>>>>> PRIMOP_TREE is AST which records all computations and permutations required
>>>>> to store  destination vector into continuous memory at the end of all
>>>>> iterations of the  loop. It can have INTERLEAVE, CONCAT, EXTRACT, SPLIT,
>>>>> ITER or any compute operation as intermediate node. Leaf nodes can either be
>>>>> memory reference, constant or vector of loop invariants. Depending upon the
>>>>> operation, PRIMOP_TREE holds appropriate information about the statement
>>>>> within the loop which is necessary for vectorization.
>>>>> At this stage, these data structures are populated by gathering all the
>>>>> information of the loop, statements within the loop and correlation of the
>>>>> statements within the loop. Moreover the loop body is analyzed to check if
>>>>> vectorization of each statement is possible. One has to note however that
>>>>> this analysis phase will give worst-case estimate of instruction selection,
>>>>> as it checks if specific named pattern is defined in .md for the target. It
>>>>> not necessarily  give optimal cover which is aim of the transformation phase
>>>>> using tree tiling algorithm - and can be invoked only once the loop body is
>>>>> represented using primitive reoder tree.
>>>>> At this stage, the focus is to create permute order tree for the loop
>>>>> with LOAD and STORE instructions only. The code we intend to compile is of
>>>>> the form
>>>>> FOR(i = 0; i < N; i + +)
>>>>> {
>>>>>     stmt 1 : D[k ∗ i + d 1 ] =S 1 [k ∗ i + c 11 ]
>>>>>     stmt 2 : D[k ∗ i + d 2 ] =S 1 [k ∗ i + c 21 ]
>>>>>     ...
>>>>>     stmt k : D[k ∗ i + d k ] =S 1 [k ∗ i + c k 1 ]
>>>>> }
>>>>> Here we are assuming that any data reference can be represented using
>>>>> base + k * index + offset (The data structure struct data_reference from GCC
>>>>> is used currently for this purpose). If not, the address is normalized to
>>>>> convert to such representation.
>>>>> We are looking forward to your suggestions and insight in this regard
>>>>> for better execution of this project.
>>>> I will look at the patch in detail this afternoon and will write up
>>>> some comments.
>>> Ok, so here we go.
>> Hi Richard,
>> Thanks for your detailed review. Please find my comments inlined.
>>> I see you copy quite some code from the existing vectorizer - rather than
>>> doing this and adding a "new" pass I'd make the
>>> flag_tree_loop_vectorize_unified
>>> flag guard code inside the existing vectorizer - thus share the pass.
>> I agree with you that I have copied lot of code from current implementation
>> of GCC, and some data structures seem redundant as they create same
>> information as that is already available. However, as per our discussion at
>> cauldron, I tried to generate ptrees after all the analysis phases were
>> done, using the information generated there. However, because of
>> overwhelming and scattered information in various statements, loops and
>> their respective info nodes, it did not yield much. Moreover, it was seen
>> that many functions or parts of the functions were not very useful, and some
>> stmt or loop related information needed different way of computation for our
>> scheme than GCC's framework. So, instead of tweaking GCC's codebase, and
>> corrupting the information, thereby making it unusable if our optimization
>> fails to vectorize by default vectorizer, we created new pass for the same.
>> I agree, that currently it looks like we are reusing most of the checks and
>> conditions for vectorization as in GCC - as for first phase, our aim is to
>> meet performance of GCC before adding different optimizations. However, we
>> plan to increase the scope further, thereby need to change the checks and
>> data generated accordingly.
>> e.g.: Peeling information won't be generated till transformation phase is
>> started, where it will be added in the ITER_node depending upon alignment
>> requirements, ITER_count and VEC_SIZE of target.
>> Or, scatter/gather information is not needed as tree tiling for vector
>> load/store has different tiles for those instructions if target supports
>> them.
>> Also, the way reduction operations are to be implemented in this scheme
>> makes all the categorization of reduction operation in current GCC
>> redundant.
>> However, if you still think it is good idea to reuse same data structures
>> and functions, I have older patch available, which I can clean up and update
>> to add updated ptree generation part in it, and share it.
> My comment was mostly about making patches smaller and thus easier to review.
> I do expect that little of the copied code will remain as-is.
Richard, the way we have phased out the project, the first patch is bigger one as we had to do all the initializations and checks relevant to
auto-vectorization in this patch. However, we expect following patches to be smaller, and dedicated to single functionality. It is difficult to factor
this patch, as the changes that we have added are needed for the functionality - to vectorize load/store operations.

>>> Similarly I'd re-use loop_vinfo and simply add fields to it for the
>>> new vectorizer
>>> so you can dispatch to existing vectorizer functions "easily".  Otherwise
>>> it's going to be hard to keep things in-sync.  Likewise for your stmt_attr
>>> and the existing stmt_vec_info.
>> The main reason behind creating new ITER_node instead of reusing loop_vinfo
>> is to capture whole loop as single entity, thereby allowing optimizations on
>> nested loops as well. The ITER_node, when implemented completely, can handle
>> vectorization of multiple nested loops, as similar to all permute nodes,
>> even ITER_node can be distributed over each element in ITER_node.stmts - and
>> can be propagated across compute_tree in permute order tree. So, for future
>> expansion, we are creating ITER_node with copy of some of loop_vinfo fields,
>> and do not compute the fields which are not needed for this scheme.
>> Similar is the case with stmt_vinfo - the information gathered in stmt_vinfo
>> is mostly usable for SLP optimizations, which we cannot use as is. So,
>> instead of generating redundant information, or alter generated information,
>> we chose to create new data structure.
>>> Why isn't the existing data_reference structure good enough and you need
>>> to invent a struct mem_ref?  Even if that might be leaner adding new
>>> concepts
>>> makes the code harder to understand.  You can always use dr->aux to
>>> add auxiliar data you need (like the existing vectorizer does).
>> We can use data_reference structure as is, however, each of it's component
>> has very specific meaning associated with it - whereas the memref has only 3
>> components which are important - stride, offset - which is less than stride,
>> and remaining base - for address of first element. So, again we thought
>> instead of overriding current semantics of components of data_ref, we
>> created new data structure.
> I was worried at the point you made an existing GCC function take data-ref
> pieces to make your mem-ref look like a data-ref.  If the data-ref doesn't
> give you the information you want and that is important it seems to me it
> is better to fix that.  It sounds to me you want to look at innermost
> where a part of DR_INIT is your 'offset' (I think it is what the
> current vectorizer
> computes as GROUPs and offset would be the offset inside the group,
> relative to the first element?)
I agree that the data structure mem-ref is redundant. Will eliminate it to use altered dataref. We can say that the computations to find mem-ref are
similar to those to form GROUPS - however, GROUP is collection of all the statements which access elements of aggregate with definite interval,
whereas mem-ref looks at small window of the aggregate with reference to stride, and the offset addressing is always relative to stride. Also, mem-ref
is not collection of the statements, but the place-holder to identify the continuous piece of aggregate within the memory of size = stride.
However, I can use dataref instead of mem-ref as the place holder.

>>> It helps if you follow the GCC coding-conventions from the start - a lot
>>> of the new functions miss comments.
>> Sorry about that. I will tidy up the patch.
>>> I realize the patch is probably still a "hack" and nowhere a final step
>>> 0/1a,
>>> but for example you copied the iteration scheme over vector sizes.  I hope
>>> the new vectorizer can do without that and instead decide on the vector
>>> size as part of the cost effective tiling algorithm.
>> I didn't get this point clearly. However, for my understanding, is the
>> objection with use of ITER_COUNT instead of VEC_SIZE for each statement?
>> Because, if that is the case, there is basic difference between current
>> GCC's implementation and this scheme - in case of GCC, we always look for
>> the standard pattern name to be matched for scalar instruction - for which
>> VEC_TYPE becomes crucial. Whereas, this scheme represents each statement
>> within the loop as a vector with element of type SCALAR_TYPE and
>> initial_vec_size = ITER_COUNT (as those many instances of the statement will
>> be executed). Then, depending upon the VEC_SIZE for target, VEC_SIZE
>> reduction will be applied to it, on top of which tiling algorithm functions.
>> So, for tiling, we will have to use VEC_SIZE. However, for all other
>> optimizations and transformations, we use each node of permute order tree to
>> have vec_size = ITER_COUNT to allow free movement of permute-nodes across
>> compute-nodes, and various optimizations on them.
> The comment was about you copying the loop iterating over target vector sizes
> and repeating vectorization attempts with that single vector size.  I hope
> the cost effective tiling will simply choose a single vector size based on
> cost and iteration count.  Your description above suggests so thus the
> copying of the vector size iteration seems spurious and if then should
> be done _after_ building the ptree (and sharing the ptree).
Yes, as I have mentioned in previous mail

 >> (I have kept the loop checking for
 >> different vector types in vect_analyze_loop_with_prim_tree() for now as I am
 >> still not very clear at what point in transformation phase will it play any
 >> role... Once, the end-to-end solution for it is developed for load/store
 >> chains, there will be some clarity - and the loop can be moved there or
 >> eliminated completely.)

So, this loop will be eliminated once the backend for tree tiling of load/store is implemented.

>>> As the main feature of the patch is creation of the ptree (no code
>>> generation
>>> seems to be done?) the biggest missing part is dumping the ptree in
>>> some nice form (like in DOT format and/or to the dump file).
>> Yes, the ptree dump was put on back burner because I was focusing on the
>> functionality. However, I will share the patch for dumping ptree in DOT
>> format shortly.
> Thanks!
>>> You do seem to check for target capabilities at ptree build time (looking
>>> at
>>> vectorizable_store).  I think that is a mistake as we'd ideally have the
>>> same
>>> ptree when considering different vector sizes (or even generic vectors).
>>> Target capabilities will differ between vector sizes.
>> That's right. However as long as ITER_COUNT > VEC_SIZE of target, there is
>> at least a cover available for the ptree that we are creating. Hence, this
>> check is just to check if certain instruction can be represented anyhow on
>> target architecture. This will actually be handled at transform phase after
>> vec_size reduction is performed. (I have kept the loop checking for
>> different vector types in vect_analyze_loop_with_prim_tree() for now as I am
>> still not very clear at what point in transformation phase will it play any
>> role... Once, the end-to-end solution for it is developed for load/store
>> chains, there will be some clarity - and the loop can be moved there or
>> eliminated completely.)
> Ok.
> One limit of the current vectorizer is that it wants to use a single
> vector size.
> In principle vectorizable_* could compute a vector size mask for each
> scalar statement (ptree node) noting the sizes the target has support for
> encoding the stmt.  Similar to permute nodes the ptree can have
> size-change nodes in case supported input/output size have no match.

We can use the macro TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES to get the mask, and perform cost-effective instruction selection for optimal vector
size, than considering availability of single vector size in get_vectype_for_scalar_type_and_size(). However, I will check if it is needed to be done
as early as vectorizable_*, or can be postponed till tree tiling algorithm hits.
> Thanks,
> Richard.
>>> That's all for now.
>>> Last but not least - at the Cauldron I suggested to incrementally rewrite
>>> the existing vectorizer by building the ptree at the point it is "ready"
>>> for code generation and perform the permute optimizations on it and then
>>> re-write the code generation routines to work off the ptree.
>>> Thanks,
>>> Richard.
>> - Thanks and regards,
>>    Sameera D.

> I will create new branch in GCC as you directed in other thread, and create incremental patches for the changes suggested here. Will that be fine?
> - Thanks and regards,
>   Sameera D.

Hi Richard,

Sorry for delay with this updated patch. Please find attached the patch with changes addressing the review comments.

Is it ok to put in the branch?

- Thanks and regards,
  Sameera D.

Attachment: autovectorization-load-store-ptree-create.patch
Description: autovectorization-load-store-ptree-create.patch

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]