This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[tree-ssa] vectorizer related issues





Hi,

I went through the exercise of vectorizing a simple loop in tree-ssa,
skipping some analysis but going all the way through the transformation, in
order to get an idea of the problems and issues that lie ahead. This note
first describes what I did in general, then raises a list of open
problems/questions, and ends with a couple of tree-ssa dumps. You are
encouraged to read & respond, at least to the second part! :-)

thanks,
dorit

Part I. The preliminary implemented vectorizer

The vectorizer transforms the following simple loop:

loop1.c:
  short a[N]; short b[N]; short c[N]; int i;

  for (i=0; i<N; i++){
    a[i] = b[i] + c[i];
  }

as if it was manually vectorized by rewriting the source code into:

loop2.c:
  typedef int __attribute__((mode(V8HI))) v8hi;
  short a[N];  short b[N]; short c[N];   int i;
  v8hi *pa = (v8hi*)a, *pb = (v8hi*)b, *pc = (v8hi*)c;
  v8hi va, vb, vc;

  for (i=0; i<N/8; i++){
    vb = pb[i];
    vc = pc[i];
    va = vb + vc;
    pa[i] = va;
  }

The assembly generated by auto-vectorizing loop1.c is identical to that
generated by (manually vectorizing and) compiling loop2.c.

The transformation that loop1.c undergoes is demonstrated at the end of
this note, where you'll find a dump of the (SSA) trees before and after the
vectorizer (part III).

As mentioned, I still need to go back and fill-in the implementation of a
few missing analyses. The main purpose of this exercise was to expose as
many issues/problems early on, as follow (in part II).

I will try to put together a design doc with some preliminary code for a
basic vectorizer; The questions below already highlight a lot on the way
the vectorizer works.


Part II. Open problems and questions

(1) Array addressing using pointers.
      As illustrated in the example above (loop2.c), the basic approach for
handling array accesses is to generate a pointer to a vector type ('pb'),
have it point to the array base ('pb = (v8hi *)b'), and use that to access
the array ('vb = pb[i]'). I was hoping to get away with generating an
ARRAY_REF expression for pb[i], but the RTL expander didn't like that. So I
generated the following sequence of stmts instead, inspired by what the
compiler currently generates at the tree level for the C stmt 'vb = pb[i]':

pb = &b
T.1 = (<unnamed type>)i_1;
T.2 = T.1 * 16;
T.3 = pb + T.2;
vb = *T.3;

It's too bad that we loose the information that these accesses were simple
array references. Maybe there's a simple way to convince the expander to
digest the expression pb[i] (Q1.1: is there?). What's much more worrying is
that we loose this information even before vectorization takes place. This
goes back to the issue mentioned in
http://gcc.gnu.org/ml/gcc/2003-07/msg02013.html.  With a lot of effort the
vectorizer could deal with pointer arithmetic, but I think this is exactly
the kind of effort we should save, working at the tree level.

Is anyone planning to fix this situation? i.e., Q1.2: have the front-end
represent p[i] as an array_ref even when p is declared a pointer, and Q1.3:
have the RTL expander accept such array references.

This is probably the most critical infrastructure fix required for the
vectorizer to be applicable in the short term. If gcc continues to
represent array references this way, it would be pretty useless to
vectorize specifically for array accesses. Good handling of pointer
arithmetic and pointer aliasing (which have well-known limitations) will
then be vital.


(2) virtual defs/uses.
      This issue goes back to an action item on the tree-ssa todo list:
"SSA information for arrays
      The existing implementation treats arrays as an opaque object. A
definition to an array location is treated as a definition for the whole
array".

Q2.1: Is anyone planning to address this issue? (so that the vdefs/vuses of
arrays will be connected only to may/must aliased references?)

It's not a major obstacle at the moment because I'm simply ignoring the
virtual defs/uses, as they are currently not very informative. What I do
when I analyze cross-iteration def-use cycles, is examine the def-use cycle
related to each of the loop phi's. Virtual phi's are ignored, because I
don't know whether they represent a real def-use cycle or not; I rely on
other analysis passes to check the data references that are responsible for
virtual defs-uses. For now, the only form of non-scalar data access I allow
is array references. So the analysis of the loop scalar phi's together with
array data dependence tests, suffice. In the future, when other forms of
non-scalar accesses are allowed (structure, etc) - this workaround virtual
defs/uses would need to be revisited.

By the way, (Q2.2:) is there an API that provides a mapping between a
virtual def/use and the operand it corresponds to? I guess in GIMPLE
finding this mapping is pretty straightforward because the grammar doesn't
allow a lot of flexibility; For example, I want to verify that any
vuse/vdef found corresponds to an array_ref (and not structure or pointer
access). I currently do that as follows: if a (GIMPLE) stmt has any vdefs
(vuses), I check that it has exactly one vdef (vuse), I assume that it
corresponds to op0 (op1), and I expect that the stmt is of the form: op0 =
SSA_NAME/CONST (SSA_NAME = op1). Then I verify that op0 (op1) is an
ARRAY_REF. Is there a better way to do that?
Q3.3: I'm assuming that if a stmt has multiple vdefs/vuses then it's either
a non vectorizable stmt (CALL_EXPR, asm) or there are aliasing problems
which also prevent vectorization. Are there cases involving stmts with
multiple vdefs/vuses that are relevant for vectorization?


(3) loop invariants, dead code, redundant code:
      The general approach taken by the vectorizer when generating vector
stmts, is to insert the vector code sequence corresponding to a certain
(scalar) stmt X just before stmt X, and remove stmt X only if it has side
effects (like changing memory). If stmt X does not have side effects, we
rely on dead code elimination for removing it.

According to this approach, the vectorizer is currently doing the bare
minimum, relying on subsequent optimization phases to take care of any loop
invariants, dead code etc. This is indeed the case - currently taken care
of by RTL level opts. The resulting code is as efficient as could be.

Taking a look at the tree dumps below, you'll find that there are some
(temporary!) inefficiencies introduced by this approach:
(3.1) some loop invariant code is generated inside the loop body (lines 3,
9 and 17) instead of in the loop prolog.
(3.2) a few values that can be reused (like T.7_17, lines 1,2) are instead
recomputed (variables T.12_22 & T.7_28, lines 7,8 & 15,16).
(3.3) some of the original (scalar) code, which is now redundant, is left
in the loop (lines 6, 12, 14).

The vectorizer can take care of these inefficiency issues by itself. Q3.1:
should it? It's simply a design decision. Cleaning up may help subsequent
(tree-level) optimizations and may also save compilation time and memory.
In any case, the very first version of the vectorizer will probably
continue to take this approach. If we decide that the vectorizer should
cleanup after itself, I'll incorporate these optimizations into the next
versions of the vectorizer.


(4) naming conventions:
      The vectorizer creates many new variables; looking at the tree-dumps,
having more informative names for variables would be helpful; for example,
when creating a pointer which is used to point to array b, maybe use p_b as
a prefix. Q4.1: Are there any naming conventions?


(5) stmt_info data structure:
      During the analysis phases the vectorizer records some information
per stmt. For example, a pointer to the vectorized version of each stmt;
Say stmt1:'b=x[i]' is vectorized into vec_stmt1:'vb=px[T]'. When
vectorizing the following stmt2:'a=b', the vectorizer has to find the def
of 'vb'; it does so by first finding stmt1 (via 'SSA_NAME_DEF_STMT(b)'),
then finding vec_stmt1 from stmt1 (via 'stmt_info(stmt1)'), and finally
finding the def 'vb' from vec_stmt1.

This scheme works for operands that are SSA_NAMEs. Operands that aren't,
are currently limited to array references appearing in load/store
operations, and are handled differently.

I've actually used info-per-SSA_NAME so far, but I'm considering to switch
to an info-per-stmt scheme since (a) I can't use it for non SSA_NAME
operands (arrays), (b) I had to allocate an array of size
"next_ssa_version" which is a bit wasteful considering I'm only looking at
SSA_NAMEs that are used in certain loops, and (c) I expect to have more
information to record in the future per stmt. My questions are:
Q5.1: Is there an available field (in stmt/stmt_ann) for recording such
information? I didn't find a general field, like a pointer to void, that
each optimization pass could use for it's own purposes. Q5.2: Is it OK to
add such a field (assuming the answer to Q5.1 is "no")?
Q5.3: In GIMPLE, are there stmts that have more than one def, and I mean
vectorizable stmts, not CALL_EXPRs and such? are they conceivable, even if
not exist today?


(6) modeling the target vector support:
      One of the issues that I haven't addressed yet is the modelling of
the vector support that is available in the target machine. The approach I
took for now is to generate only vector operations which can be expressed
using existing scalar tree codes (PLUS, MULT etc), after verifying that
they are supported by checking the relevant optab at the relevant
machine_mode (e.g, add_optab->handlers[(int) V8HImode].insn_code). If the
value found is CODE_FOR_nothing, then there's no target support, and we
can't vectorize the stmt. Otherwise - the stmt is transformed. The only
target specific information I'm currently exposing is the size of the
vector (in bytes) (through a variable in defaults.h which each target can
define). In the case of loads/stores, for which there's no optab (Q6.1: is
there?), I assume that the target supports any loads/store of the vector
size (i.e., if the vector size is 16 bytes, it's possible to load/store 16
QIs, 8 HIs, 4 SIs/SFs, etc).

Q6.2: Sounds reasonable so far?

There are a lot of vector operations for which this approach is not
expressive enough (e.g., a dot-product operation and reduction operations
in general, operations that perform computations on complex numbers,
conditional operations, and more). Q6.3: Should we add new TREE_CODES (and
corresponding optabs?) for the new types of operations we want to express?
Or instead, Q6.4: should the vectorizer generate calls to target specific
builtin functions? in this case we'll need to decide (Q6.5: ) how to expose
them to the tree level, and (Q6.6: ) how to address the problem that the
compiler doesn't attempt to perform any optimizations around target
specific builtins, not even to SSA their arguments; Or, Q6.7: should we
introduce new generic builtins that are totally exposed to the compiler?

A related issue, is how much information to expose to the tree level. We
want the vectorizer to have enough information to decide whether it's worth
while to vectorize; Also, some patterns can be detected and transformed
much more easily at the tree level. Other should be left to the RTL level.


III. Here are tree dumps for the loop that is auto-vectorized.

The SSA trees for loop1.c just before the vectorizer:
========================================================
  int i;
  short int c[12];
  short int b[12];
  short int a[12];
  short int T.4;
  short int T.3;
  short int T.2;
  short int T.1;

  # BLOCK 0
  # PRED: ENTRY (fallthru,exec)
  goto <L1>;
  # SUCC: 2 (exec)

  # BLOCK 1
  # PRED: 2 (true,exec)
<L0>:;
  #   VUSE <b_5>;
  T.1_6 = b[i_1];
  #   VUSE <c_7>;
  T.2_8 = c[i_1];
  T.3_9 = T.1_6 + T.2_8;
  #   a_10 = VDEF <a_2>;
  a[i_1] = T.3_9;
  i_11 = i_1 + 1;
  # SUCC: 2 (fallthru,exec)

  # BLOCK 2
  # PRED: 1 (fallthru,exec) 0 (exec)
  # a_2 = PHI <a_4(0), a_10(1)>;
  # i_1 = PHI <0(0), i_11(1)>;
<L1>:;
  if (i_1 <= 11) goto <L0>;
   else goto <L2>;


The SSA trees for loop1.c just after the vectorizer:
I added the following things manually to the dump file:
- line numbering in the loop body
- the actual type where the dump just emitted "unnamed type"
- comments with variable names in an attempt to make it more readable
=====================================================================
  <unnamed type (v8hi)> * T.19;
  <unnamed type (v8hi)> * T.18;           # pa
  <unnamed type (unsigned int)> T.17;
  <unnamed type (unsigned int)> T.16;
  <unnamed type (v8hi)> T.15;             # va
  <unnamed type (v8hi)> * T.14;
  <unnamed type (v8hi)> * T.13;           # pc
  <unnamed type (unsigned int)> T.12;
  <unnamed type (unsigned int)> T.11;
  <unnamed type (v8hi)> T.10;             # vc
  <unnamed type (v8hi)> * T.9;
  <unnamed type (v8hi)> * T.8;            # pb
  <unnamed type (unsigned int)> T.7;
  <unnamed type (unsigned int)> T.6;
  <unnamed type (v8hi)> T.5;              # vb
  int i;
  short int c[12];
  short int b[12];
  short int a[12];
  short int T.4;
  short int T.3;
  short int T.2;
  short int T.1;

  # BLOCK 0
  # PRED: ENTRY (fallthru,exec)
  goto <L1>;
  # SUCC: 2 (exec)

  # BLOCK 1
  # PRED: 2 (true,exec)
<L0>:;
1)   T.6_16 = (<unnamed type> (unsigned int))i_1;
2)   T.7_17 = T.6_16 * 16;
3)   T.8_18 = &b;                   #pb = b
4)   T.9_19 = T.7_17 + T.8_18;      #pb' = pb + (i * 16)
     #   VUSE <b_5>;
5)   T.5_20 = *T.9_19;              #vb = (* pb')
     #   VUSE <b_5>;
6)   T.1_6 = b[i_1];                #original scalar code
7)   T.11_21 = (<unnamed type> (unsigned int))i_1;
8)   T.12_22 = T.11_21 * 16;
9)   T.13_23 = &c;                  #pc = c
10)  T.14_24 = T.12_22 + T.13_23;   #pc' = pc + (i * 16)
     #   VUSE <c_7>;
11)  T.10_25 = *T.14_24;            #vc = (* pc')
     #   VUSE <c_7>;
12)  T.2_8 = c[i_1];                #original scalar code
13)  T.15_26 = T.5_20 + T.10_25;    #va = vb + vc
14)  T.3_9 = T.1_6 + T.2_8;         #original scalar code
15)  T.16_27 = (<unnamed type> (unsigned int))i_1;
16)  T.17_28 = T.16_27 * 16;
17)  T.18_29 = &a;                  #pa = a
18)  T.19_30 = T.17_28 + T.18_29;   #pa' = pa + (i * 16)
     #   a_10 = VDEF <a_2>;
19)  *T.19_30 = T.15_26;            #(* pa') = va
20)  i_11 = i_1 + 1;
  # SUCC: 2 (fallthru,exec)

  # BLOCK 2
  # PRED: 1 (fallthru,exec) 0 (exec)
  # a_2 = PHI <a_4(0), a_10(1)>;
  # i_1 = PHI <0(0), i_11(1)>;
<L1>:;
  if (i_1 <= 11) goto <L0>;
  else goto <L2>;
# SUCC: 3 (false,exec) 1 (true,exec)


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]