This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: sameera <sameera dot deshpande at imgtec dot com>
- Cc: GCC Development <gcc at gcc dot gnu dot org>, Matthew Fortune <Matthew dot Fortune at imgtec dot com>, Prachi Godbole <Prachi dot Godbole at imgtec dot com>
- Date: Tue, 27 Oct 2015 14:39:15 +0100
- Subject: Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization
- Authentication-results: sourceware.org; auth=none
- References: <561F7DDD dot 6060003 at imgtec dot com> <CAFiYyc0yG=bqK1n2shLLLk1J+d_t-WJy_2SCtNAVaVCE_4rs7Q at mail dot gmail dot com> <562DC13A dot 4040004 at imgtec dot com>
On Mon, Oct 26, 2015 at 6:59 AM, sameera <firstname.lastname@example.org> wrote:
> On Wednesday 21 October 2015 05:23 PM, Richard Biener wrote:
>> On Thu, Oct 15, 2015 at 12:20 PM, sameera <email@example.com> wrote:
>>> Hi Richard,
>>> This is with reference to our discussion at GNU Tools Cauldron 2015
>>> regarding my talk titled "Improving the effectiveness and generality of
>>> auto-vectorization." We, at Imagination Technologies, have been working further
>>> on finalizing the algorithms for transformations to achieve efficient
>>> target-aware reordering instruction selection. We have implemented the
>>> prototype in Python to demonstrate the capabilities of the algorithm and
>>> verify the claims made in the presentation.
>>> I am attaching the prototype along with the documented algorithm for review
>>> purposes. We would be starting with the design and implementation of the same in
>>> GCC and would be glad to receive comments, feedback and suggestions.
>> So I finally sat down and skimmed over auto_vectorization.txt. The first
>> thing I notice is that you constrain your input language to sth you define
>> yourself. In the end you will need to work on GIMPLE in SSA form. Thus the
>> algorithm as described needs to construct its AST from that GIMPLE representation.
> Richard, we have defined the input language for convenience in the prototype
> implementation. However, we will be using GIMPLE as our IR. As per the grammar
> of our tree, a p-tree denotes the permute order associated with a statement,
> whereas a c-tree is actually the GIMPLE instruction which performs the compute
> operation. I tried looking at the structures used in SLP; however, they cannot
> be used as they are. The main difference between the current SLP implementation
> in GCC and our representation is that the permute order in SLP is part of
> the tree node, whereas in our representation the permute order is
> represented as an independent tree node. Hence, I have created a new tree
> structure for our pass, which creates p-tree nodes for the permute order
> and c-tree nodes which point to the appropriate GIMPLE statement.
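[Editor's illustration: a minimal Python sketch of the p-tree/c-tree split described above. The class and function names are hypothetical, not taken from the prototype; the point is only that the permute order lives in its own node rather than being an attribute of the compute node.]

```python
# Hypothetical sketch: permute order as an independent tree node (PTree),
# separate from compute nodes (CTree) that stand in for GIMPLE statements.

class CTree:
    """Compute node: stands in for a GIMPLE statement."""
    def __init__(self, stmt, children=()):
        self.stmt = stmt
        self.children = list(children)

class PTree:
    """Permute node: an explicit reordering applied to its child's lanes."""
    def __init__(self, order, child):
        self.order = order          # e.g. [0, 2, 1, 3]
        self.child = child

def evaluate(node, env):
    """Evaluate a tree over concrete lane vectors, for demonstration only."""
    if isinstance(node, PTree):
        lanes = evaluate(node.child, env)
        return [lanes[i] for i in node.order]
    # a CTree leaf just loads its named vector from the environment
    return env[node.stmt]

# A load of b followed by an explicit even/odd reordering:
tree = PTree([0, 2, 1, 3], CTree("load_b"))
print(evaluate(tree, {"load_b": [10, 11, 12, 13]}))  # [10, 12, 11, 13]
```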
Yes, that's the whole purpose - get the vectorizer (and SLP) a better
intermediate representation which is explicit about permutes.
>> Loop count and data reference analysis is provided by GCC and you need to
>> work with the way their result is presented.
> I am trying to figure out where and how an interleave pattern encapsulating a
> whole loop can be represented, as the interleave pattern not only has the
> loop-related information, but also the order in which the dest array is being
> written. The data reference analysis can be used nicely with the data
> structures we have designed - as the introduction of p-tree nodes does not alter
> the attributes of a c-tree (GIMPLE stmt).
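[Editor's illustration: a small Python sketch of what "the order in which the dest array is written" means for an interleave pattern. The function name is hypothetical; it only demonstrates that the write order itself is part of the pattern.]

```python
# Two source streams written to one destination in alternating order:
# the interleaved write order is information the pattern must carry,
# beyond the plain loop bounds.

def interleave(xs, ys):
    """Merge two equal-length streams lane by lane: x0, y0, x1, y1, ..."""
    out = []
    for x, y in zip(xs, ys):
        out += [x, y]
    return out

print(interleave([1, 2, 3], [10, 20, 30]))  # [1, 10, 2, 20, 3, 30]
```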
> I am also trying to identify relations between the chain of recurrences for each
> SSA variable and the vec_size associated with each tree node in our structure.
> Logically, both of them compute the same information, and I am seeing if it can
> be propagated in our tree.
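[Editor's illustration: a minimal sketch of the chain-of-recurrences notation mentioned above. A linear chrec {base, +, step} gives the value of an induction variable at iteration i; the stride it encodes is the same access-pattern information a per-node vec_size would be derived from. The helper name is hypothetical.]

```python
# Minimal chain-of-recurrences sketch: {base, +, step} evaluated at
# iteration i. An access like a[2*i + 1] has chrec {1, +, 2}.

def chrec_value(base, step, i):
    """Value of the linear chrec {base, +, step} at iteration i."""
    return base + step * i

# Lanes touched by four iterations of a[2*i + 1]:
lanes = [chrec_value(1, 2, i) for i in range(4)]
print(lanes)  # [1, 3, 5, 7]
```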
>> As with the presentation, the paper is mostly about optimizing the
>> code. That's fine in principle, but I suspect that the AST needs to get
>> support for "vectorized stmts" that perform type conversions (like type
>> conversions themselves or widening operations), that is, represent the fact
>> that for certain loops you need N vectorized stmts for stmt A and M vectorized
>> stmts for stmt B. This is an important observation once you get to
>> supporting targets with multiple vector sizes (and instructions like the
>> integer - double conversions which go from 128bit to 256bit vectors).
> Yes, we haven't given much thought to type conversions, because our
> assumption is that type conversions are order-preserving transformations
> (c-tree), not order-altering transformations (p-tree). Because of this,
> those instructions will be generated like any other compute instruction.
> And I see that GCC makes the same assumption, which is why it treats
> vec_perm_const patterns differently from vec_pack/unpack*
> patterns even though the instructions generated can be the same. And, as each
> statement can have its own vectorization count, the scenario that you are
> mentioning can be taken care of. However, I will look into it again to see whether we
> need to take additional care for type conversions.
>> I somewhat miss an idea on how to deal with irregular SLP cases - that is,
>> does the AST model each (current) SLP node as a single statement or
>> does it model individual vector elements (so you can insert no-op
>> code to make the operations match). Consider
>> a[i] = b[i] * 3;
>> a[i+1] = b[i+1];
>> which can be vectorized with SLP if you realize you can multiply b[i+1] by 1.
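[Editor's illustration: a scalar Python model of the irregular-SLP case quoted above. The function name is hypothetical; it only shows that inserting a no-op factor of 1 makes both lanes the same operation, so they could share one vector multiply.]

```python
# The second statement (a[i+1] = b[i+1]) has no multiply, but multiplying
# b[i+1] by 1 makes both lanes a multiply, matching the operations up.

def vectorized_pair(b, i):
    """Both lanes become multiplies once the no-op factor 1 is inserted."""
    factors = [3, 1]                 # lane 0: *3, lane 1: *1 (no-op)
    return [b[i + k] * factors[k] for k in range(2)]

b = [5, 7, 9, 11]
print(vectorized_pair(b, 0))  # [15, 7]  -> a[0] = b[0]*3, a[1] = b[1]
```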
> The pass structure we are having for our optimization is as follows:
> - New pass target-aware-loop-vect with following subpasses:
> - Loop structure identification
> - Restricted loop construction
> - Loop analysis (6 phases of target-aware reordering-instruction selection)
> - Loop transformation (Cost aware gimple code generation)
> We might need some optimizations to be performed on loop body before we
> start our optimization for reordering instruction selection, so that input
> program can be transformed to fit into our restricted loop structure. These
> transformations can be performed by restricted loop construction subpass.
> E.g., multi-dimensional arrays, nested loops, reduction chains etc. can be
> supported by transforming the input loop. The case that you have mentioned
> can be transformed at this level.
> Also, we have introduced a new operator "PERM" to support irregular permute
> orders which cannot be matched by vec_perm_const patterns, but can be
> generated using the vec_perm pattern.
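[Editor's illustration: a Python sketch of the generic permute operation discussed above, modelled on how a vec_perm selects output lanes from two input vectors by an index vector, with indices >= len(v0) selecting from the second input. This is a behavioral model, not GCC's implementation.]

```python
# Generic two-input permute: each selector index picks a lane from the
# concatenation of v0 and v1, with no restriction to fixed constant
# patterns (even/odd, interleave, etc.).

def vec_perm(v0, v1, sel):
    """Select output lanes from v0 ++ v1 by the index vector sel."""
    n = len(v0)
    return [(v0[s] if s < n else v1[s - n]) for s in sel]

# An irregular order that no fixed even/odd or interleave pattern matches:
print(vec_perm([0, 1, 2, 3], [4, 5, 6, 7], [3, 4, 0, 6]))  # [3, 4, 0, 6]
```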
>> As of implementing the whole thing in GCC, I recently spent some more time
>> thinking and came to the conclusion that the path to the fastest improvement
>> in GCC code-gen would be to build the new AST after the current analysis
>> phase has finished, do the optimization and drive code-generation off the new AST.
>> Thus keep things like data-ref and dependence analysis as is, as well as
>> group analysis and SLP tree construction. Build the AST from the vectorizer
>> "IL" at the point vect_transform_loop / vect_slp_transform_bb is called.
> The current pass structure that we have designed does exactly that. The only
> difference is that we are planning to use the transform phase as a
> co-process of our transform pass, rather than reconstruct an AST from our
> representation to pass to vect_transform_loop.
>> More and more of the rest of the vectorizer code can then be "slowly" converted
>> to work on the AST, and AST construction can be moved earlier.
> Thanks again for your inputs. I will share the complete design and our plan
> of action for implementation in GCC shortly.
> - Thanks and regards,
> Sameera D.
>>> - Thanks and regards,
>>> Sameera Deshpande