This is the mail archive of the
mailing list for the GCC project.
Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization
- From: Alan Lawrence <alan dot lawrence at foss dot arm dot com>
- To: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, Richard Biener <richard dot guenther at gmail dot com>, Sameera Deshpande <Sameera dot Deshpande at arm dot com>
- Cc: Matthew Fortune <Matthew dot Fortune at imgtec dot com>, Prachi dot Godbole at imgtec dot com
- Date: Fri, 8 Jan 2016 16:11:56 +0000
- Subject: Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization
- Authentication-results: sourceware.org; auth=none
On Tues, Oct 27, 2015 at 2:39 PM, Richard Biener <firstname.lastname@example.org> wrote:
On Mon, Oct 26, 2015 at 6:59 AM, sameera <email@example.com> wrote:
Richard, we have defined the input language for convenience in prototype
implementation. However, we will be using GIMPLE as our IR. As per grammar
of our tree, p-tree denote the permute order associated with the statement,
whereas c-tree is actually the GIMPLE instruction, which performs compute
operation. I tried looking at structures used in SLP, however they can not
be used as they are, as main difference between current SLP implementation
in GCC versus our representation is that, permute order in SLP is part of
the tree node in current GCC, whereas in our representation permute order is
represented as independent tree-node. Hence, I have created new tree
structure for our pass, which will create p-tree nodes for permute order,
and c-tree node which points to appropriate gimple statement.
Yes, that's the whole purpose - get the vectorizer (and SLP) a better
which is explicit about permutes.
I somewhat miss an idea on how to deal with irregular SLP cases - that is,
does the AST model each (current) SLP node as a single statement or
does it model individual vector elements (so you can insert no-op
code to make the operations match). Consider
a[i] = b[i] * 3;
a[i+1] = b[i+1];
which can be vectorized with SLP if you realize you can multiply b[i+1] by
The pass structure we are having for our optimization is as follows:
- New pass target-aware-loop-vect with following subpasses:
- Loop structure identification
- Restricted loop construction
- Loop analysis (6 phases of target-aware reordering-instruction selection
- Loop transformation (Cost aware gimple code generation)
We might need some optimizations to be performed on loop body before we
start our optimization for reordering instruction selection, so that input
program can be transformed to fit into our restricted loop structure. These
transformations can be performed by restricted loop construction subpass.
Eg.: multi-dimentional array, nested loops, reduction chains etc. can be
supported by transforming input loop. The case that you have mentioned, can
be transformed at this level.
As of implementing the whole thing in GCC I recently spent some more time
thinking and came to the conclusion that the path to the fastest
in GCC code-gen would be to build the new AST after the current analysis
phase finished, do the optimization and drive code-generation off the new
Thus keep things like data-ref and dependence analysis as is, as well as
group analysis and SLP tree construction. Build the AST from the
"IL" at the point vect_transform_loop / vect_slp_transform_bb is called.
Current pass structure that we have designed, does exactly that. The only
difference is that, we are planning to use the transform phase as a
co-process of our transform pass, than reconstruct an AST from our
representation to pass to vect_transform_loop.
More and more of the rest of the vectorizer code can then be "slowly"
to work on the AST and AST construction be moved earlier.
So I think an incremental approach is much more likely to get us some benefit,
and I'm wondering how much of the benefit here stems from which bit, and how
many of these changes can be broken off.
For example, making permutes into explicit operations on the SLP tree, and
adding a pass that tries to push them up/down the tree into each other, might be
done on the existing codebase, and would be a significant win in itself
(bringing some/much(?) of improvements in interleaving code of your proposal).
I'm not really sure how much of the benefit stems from what, though - e.g. does
reducing vector length down to fit the target machine only after the other
steps, bring any benefits besides raising abstraction? (Which certainly can be a
benefit!). AFAICT this could be done independently of other changes?
Another limitation of the present SLP structure is it's not understanding
sharing (until later CSE, hence costing wrongly); it would be good to fix this
(making the SLP tree into a DAG, and potentially reusing subtrees by adding
permutes). This is a step we'd like to take anyway, but might tie in with your
"Bottom-up commoning of p-node subtrees". And would help with (BB as well as
loop) cost calculations.
Are there any really fundamental differences that we cannot overcome in steps? (
/ are there any steps we shouldn't take in isolation, because we'd have to
'throw them away' later?)