This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: auto vectorization in gcc
- From: "Dorit Naishlos" <DORIT at il dot ibm dot com>
- To: Richard Henderson <rth at redhat dot com>
- Cc: dje at watson dot ibm dot com, dnovillo at redhat dot com, dberlin at dberlin dot org, aldyh at redhat dot com, law at redhat dot com, joern dot rennecke at superh dot com, gcc-mail at the-meissners dot org, gcc at gcc dot gnu dot org
- Date: Mon, 21 Jul 2003 14:52:56 +0300
- Subject: Re: auto vectorization in gcc
Thanks very much for the responsiveness!
> The tree level is *more* capable than the rtl level at representing
> vector types (and thus operations). I think all we need is some
> small amount of info from the target about vector widths and memory
> blocking, and then the transformation should happen at the tree level.
I wonder if the target info that you suggest exposing to the tree level
would suffice. In many cases, code sequences that are perfectly
parallelizable with respect to data dependences will not benefit from
vectorization. In order to avoid making really poor decisions, you want
to have at least the following information exposed:
1- (Does the target support SIMD operations...)
2- what is the vector width (to guide loop unrolling / blocking of loop
iterations / partial sums generation for reduction, etc).
3- which vector units and vector data types are supported by the target
(does it support multiplication of integer vectors? does it support
double precision floating point vectors?)
4- which vector capabilities does the target provide to support
if-conversion. For example, does the target provide means to vectorize
an 'if' statement, namely, vector select or predicated or masked
instructions? Maybe the target doesn't support any of the above, but
it does support, say, "subtract and saturate", which in certain cases
can replace the 'if' statement. Say you don't know what the loop bound
is, or the loop bound isn't divisible by your vector width; do you need
to generate a scalar epilog loop for the remaining iterations, or can
you get away with masked stores?
5- which additional vector functionalities are available (e.g., permute,
vsplat, etc.)
6- how many vector registers (of each type) do you have? Spilling such
wide registers can be *very* costly.
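To make the "subtract and saturate" case in item 4 concrete, here is a
minimal sketch in plain C. The function names are made up for
illustration, and sub_sat_u8 is only a scalar stand-in for one lane of
whatever saturating-subtract instruction a given target might provide.

```c
#include <stddef.h>
#include <stdint.h>

/* The loop as written: each iteration contains an 'if'. */
void diff_or_zero_branch(const uint8_t *a, const uint8_t *b,
                         uint8_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] > b[i])
            c[i] = (uint8_t)(a[i] - b[i]);
        else
            c[i] = 0;
    }
}

/* Hypothetical scalar model of one lane of an unsigned
   "subtract and saturate" instruction: clamps at zero. */
static uint8_t sub_sat_u8(uint8_t x, uint8_t y)
{
    return x > y ? (uint8_t)(x - y) : 0;
}

/* Same result with a single operation per element and no control
   flow in the loop body, which is the shape a vectorizer needs. */
void diff_or_zero_subsat(const uint8_t *a, const uint8_t *b,
                         uint8_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = sub_sat_u8(a[i], b[i]);
}
```

Whether the second form is actually a win depends on the target having
such an instruction for the element type in question, which is exactly
the kind of information the list above asks for.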
We don't want to move data back and forth between vector registers and
scalar registers (often through memory) every time we discover that some
operation within a vectorized code sequence is not supported. We want to
be able to make an informed decision whether to vectorize an entire code
sequence or not.
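As a rough sketch of what such an informed decision could look like, the
following C fragment models items 1-6 as a target description and checks
a loop's requirements against it before committing to vectorize. All
struct, field, and function names here are hypothetical; none of this is
an existing GCC interface.

```c
#include <stdbool.h>

/* Hypothetical target description, one field group per item above. */
struct vec_target_info {
    bool     has_simd;          /* 1: any SIMD support at all        */
    unsigned vector_bytes;      /* 2: vector width in bytes          */
    bool     has_int_mul;       /* 3: integer vector multiply        */
    bool     has_double_fp;     /* 3: double precision FP vectors    */
    bool     has_select;        /* 4: vector select                  */
    bool     has_predication;   /* 4: predicated instructions        */
    bool     has_masked_store;  /* 4: masked stores                  */
    bool     has_permute;       /* 5: permute, splat, etc.           */
    unsigned num_vector_regs;   /* 6: number of vector registers     */
};

/* Hypothetical summary of what one candidate loop would require. */
struct loop_needs {
    bool     int_mul;
    bool     double_fp;
    bool     conditional;       /* loop body contains an 'if'        */
    unsigned live_vectors;      /* estimated vector register demand  */
};

/* Decide for the whole code sequence at once, so that no single
   unsupported operation forces a round trip to scalar registers. */
bool worth_vectorizing(const struct vec_target_info *t,
                       const struct loop_needs *l)
{
    if (!t->has_simd)
        return false;
    if (l->int_mul && !t->has_int_mul)
        return false;
    if (l->double_fp && !t->has_double_fp)
        return false;
    if (l->conditional && !(t->has_select || t->has_predication))
        return false;
    if (l->live_vectors > t->num_vector_regs)
        return false;           /* would spill wide registers        */
    return true;
}

/* Is a scalar epilog loop needed for the leftover iterations? */
bool needs_scalar_epilog(const struct vec_target_info *t,
                         bool bound_known, unsigned long trip_count,
                         unsigned elem_bytes)
{
    unsigned lanes = elem_bytes ? t->vector_bytes / elem_bytes : 0;
    if (lanes == 0)
        return true;            /* no element fits; stay scalar      */
    if (bound_known && trip_count % lanes == 0)
        return false;           /* bound divides evenly              */
    return !t->has_masked_store;
}
```

A real cost model would weigh these factors rather than treat each as a
hard veto; the point is only that every test above needs target
information that goes beyond vector width.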
One could claim that it is always possible to reverse the vectorization
transformation, but that can actually entail a lot of nontrivial work.
Among the things you'll have to undo are:
- unrolling (along with prolog/epilog code),
- vector setup code (that includes packing/unpacking into/from vector
registers),
- alignment checks (including possibly loop peeling and runtime guard
code, along with multiple versions of the loop depending on runtime
tests),
- other run-time guard code (for example to check for overlap of pointers
if alias analysis cannot prove that they are disjoint),
- partial sums and reduction with its epilog code,
- constructs (like an 'if' statement) that were collapsed into a vector
tree (like a select/predicated/masked vector operation) that is not
supported by the scalar unit, forcing you to regenerate the original
construct.
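To give a feel for how much scaffolding the items above add around even
a trivial loop, here is a sketch in plain C. It is not compiler output:
the vectorization factor VF, the function names, and the inner "lane"
loops standing in for vector instructions are all assumptions made for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define VF 4  /* assumed vectorization factor: 4 elements per vector */

/* Run-time guard: do the two byte ranges overlap? */
static int ranges_overlap(const void *p, const void *q, size_t bytes)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    return a < b + bytes && b < a + bytes;
}

/* Original loop: for (i = 0; i < n; i++) dst[i] += src[i]; */
void add_arrays(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    if (!ranges_overlap(dst, src, n * sizeof *dst)) {
        /* Unrolled body; each inner loop models one vector add. */
        for (; i + VF <= n; i += VF)
            for (size_t lane = 0; lane < VF; lane++)
                dst[i + lane] += src[i + lane];
    }

    /* Scalar epilog for leftover iterations; it also serves as the
       fallback version when the overlap guard fails. */
    for (; i < n; i++)
        dst[i] += src[i];
}

/* Original loop: for (i = 0; i < n; i++) sum += a[i]; */
uint32_t sum_array(const uint32_t *a, size_t n)
{
    uint32_t partial[VF] = {0};   /* one partial sum per lane */
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)
            partial[lane] += a[i + lane];

    /* Reduction epilog: combine the partial sums. */
    for (size_t lane = 0; lane < VF; lane++)
        sum += partial[lane];

    /* Scalar epilog for the remaining n % VF elements. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Alignment peeling and vector setup code are left out, so the real
transformation would be larger still. Undoing it means recognizing and
removing each of these pieces.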
Maybe keeping the original (scalar) copy of the loop in place along with
the vectorized version can help, but even assuming that you are willing
to duplicate so many trees (and translate them to RTL etc.), it will not
help you if you don't want to resort to the scalar version entirely;
this is an "all or nothing" solution. In many cases, it's not what we
want.
Bottom line is, if there's not enough information at the tree level, too
much redundant code will be generated, requiring a lot of effort to
undo it. I think we should either expose all vital target specific
info to the tree level, or perform only (target independent) analysis at
the tree level, and do the actual vectorization transformation where we
know if and how to do it.
dorit
From: Richard Henderson <rth@redhat.com>
To: Dorit Naishlos/Haifa/IBM@IBMIL
cc: gcc@gcc.gnu.org, Diego Novillo <dnovillo@redhat.com>, dje@watson.ibm.com
Date: 17/07/2003 20:48
Subject: Re: auto vectorization in gcc
On Thu, Jul 17, 2003 at 03:45:47PM +0300, Dorit Naishlos wrote:
> (c) since the tree-level is machine independent, actual vectorization
> will take place in the RTL level
I would think this would be false. Just because the tree level
representation is machine independent doesn't mean we can't do
any target-specific transformations.
The tree level is *more* capable than the rtl level at representing
vector types (and thus operations). I think all we need is some
small amount of info from the target about vector widths and memory
blocking, and then the transformation should happen at the tree level.
r~