Introduction
============

Current methods of auto-vectorization take the vectorization decision through a generic analysis that is almost target independent (except for target features such as vector sizes). However, the instruction selection mechanism is largely ad hoc, in the sense that the ideas used for instruction selection on one target may not carry over seamlessly to another. This has led to a proliferation of target-specific code in the generic parts of compiler frameworks such as GCC.

We propose a generic approach to reordering-instruction selection, based on the insight that effective data reordering for vectorization can be defined in terms of a set of well-defined primitive operations. Interestingly, the same set of operations can be used to define the reordering instructions of almost all targets that we know of. These operations thus form the basic building blocks of reorderings for effective vectorization.

Primitive operations
====================

Let V_N denote a vector of N scalar elements. A collection of k V_N vectors is denoted by V_N, ...k, V_N. We propose the following operations:

- Concatenation_k^N : V_N x ...k x V_N --> V_kN.
  CONCAT_k^N (V_N, ...k, V_N) concatenates k vectors of size N to produce a single vector of size kN.

- Split_k,i^kN : V_kN --> V_N.
  The split operation is the inverse of concatenation. SPLT_k,i^kN (V_kN) splits a vector of size kN into k vectors of size N each and returns the i-th vector.

- Interleave_k^N : V_N x ...k x V_N --> V_kN.
  ILV_k^N (V_N, ...k, V_N) creates a vector of size kN such that the i-th element of the output vector is the (i/k)-th element of the (i%k)-th input vector.

- Extract_k,i^kN : V_kN --> V_N.
  The extract operation is the inverse of interleave. EXTR_k,i^kN (V_kN) divides the input vector into N parts of k elements each, forms k vectors of size N by selecting the j-th element of each part to build the j-th output vector, and returns the i-th such vector.

- Permute^N : V_N x I_M --> V_N.
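The semantics of these primitives can be sketched on plain Python lists. This is an illustrative model only, not the project's implementation; the function names simply mirror the operator names above.

```python
def concat(*vecs):
    # CONCAT_k^N: join k vectors of size N into one vector of size k*N.
    out = []
    for v in vecs:
        out.extend(v)
    return out

def splt(vec, k, i):
    # SPLT_k,i^kN: cut a vector of size k*N into k chunks of size N
    # and return the i-th chunk.
    n = len(vec) // k
    return vec[i * n:(i + 1) * n]

def ilv(*vecs):
    # ILV_k^N: element i of the output is element i // k of input
    # vector i % k.
    k, n = len(vecs), len(vecs[0])
    return [vecs[i % k][i // k] for i in range(k * n)]

def extr(vec, k, i):
    # EXTR_k,i^kN: inverse of ILV; gather every k-th element,
    # starting at offset i.
    return vec[i::k]

def perm(vec, order):
    # PERM^N: permute each size-M subvector of vec by `order` (I_M).
    m = len(order)
    out = []
    for base in range(0, len(vec), m):
        out.extend(vec[base + j] for j in order)
    return out

# ILV and EXTR are inverses of each other:
a, b = [0, 2, 4, 6], [1, 3, 5, 7]
assert ilv(a, b) == [0, 1, 2, 3, 4, 5, 6, 7]
assert extr(ilv(a, b), 2, 0) == a and extr(ilv(a, b), 2, 1) == b
```

The closing assertions show the inverse relationships that the optimization phase later exploits as rewrite identities.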
The permute operation takes a vector of size N and a permute order I_M of size M. It divides the vector into N/M subvectors of size M and permutes each subvector according to I_M.

Phase structure of the algorithm
================================

The algorithm for target-aware reordering-instruction selection for efficient auto-vectorization is divided into the following phases:

Phase 1: Create tree for each loop per destination
         - Implemented by Sameera Deshpande
Phase 2: k-arity reduction/promotion
         - Implemented by Sameera Deshpande
Phase 3: Top-down distribution of p-node subtree
         - Implemented by Sameera Deshpande
Phase 4: Optimizations
         - Implemented by Sameera Deshpande
Phase 5: Bottom-up commoning of p-node subtree
         - Implemented by Sameera Deshpande
         - Cost computation of p-tree - Implemented by Sameera Deshpande
Phase 6: Vector-size reduction
         - Implemented by Sameera Deshpande
Phase 4 (again): Optimization
         - Implemented by Sameera Deshpande
Phase 7: Cost computation - Store min_cost_tree.
         - Implemented by Prachi Godbole
Phase 8: Target-specific code generation for min_cost_tree
         - Implemented by Prachi Godbole

Input loop
==========

To enable auto-vectorization using this algorithm, the loop structure should be as follows: a loop with k statements reading from S_1, S_2, ..., S_m and writing to destination D, where stmt_j is of the form

  D[k * i + d_j] = op_j (S_1[k_1 * i + c_j1], S_2[k_2 * i + c_j2], ..., S_m[k_m * i + c_jm])

such that
- All statements stmt_1, ..., stmt_k in the loop body are independent and vectorizable for a target with vector size VS.
- The iteration count N > VS.
- The induction variable i of the loop is not altered by stmt_1, stmt_2, ..., stmt_k.
- All of S_1, ..., S_m and D are of the same type.
- The indices of all S and D have the same multiplicative factor. If they differ, normalize by taking the LCM and duplicate the statements appropriately by adjusting the offsets.
- c_jp < k_p for all j, p.

FOR (i = 0 ... N - 1)
{
  stmt_1: D[k * i + d_1] = op_1 (S_1[k_1 * i + c_11], S_2[k_2 * i + c_12], ...
          S_m[k_m * i + c_1m])
  stmt_2: D[k * i + d_2] = op_2 (S_1[k_1 * i + c_21], S_2[k_2 * i + c_22], ... S_m[k_m * i + c_2m])
  ...
  stmt_k: D[k * i + d_k] = op_k (S_1[k_1 * i + c_k1], S_2[k_2 * i + c_k2], ... S_m[k_m * i + c_km])
}

For ease of implementation, the loop information is taken as input in a language defined as follows:

Input language
--------------

loop_begin     -> LOOP '(' NUMBER ')' dest_stmt_list
dest_stmt_list -> dest_stmt_list dest_stmt | dest_stmt
dest_stmt      -> VARIABLE '[' NUMBER ',' NUMBER ']' '=' cop_tree
cop_tree       -> VARIABLE '[' NUMBER ',' NUMBER ']' | c-op '(' cop_tree_list ')'
cop_tree_list  -> cop_tree_list ',' cop_tree | cop_tree
c-op           -> Any order-preserving compute operation.

So the loop given above can be represented as follows in the input language:

LOOP (N)
D[k,d_1] = op_1 (S_1[k_1, c_11], S_2[k_2, c_12], ... S_m[k_m, c_1m])
D[k,d_2] = op_2 (S_1[k_1, c_21], S_2[k_2, c_22], ... S_m[k_m, c_2m])
...
D[k,d_k] = op_k (S_1[k_1, c_k1], S_2[k_2, c_k2], ... S_m[k_m, c_km])

Algorithm
=========

Phase 1: Create AST for each loop per destination D
---------------------------------------------------

Input:
- - -
Program defined using the input language.

Procedure:
- - - - -
1. Identify the loop count N.
2. Identify the respective k for the destination array and the source arrays.
3. Create an ILV_k^N node of arity k which writes to destination array D. The vector size = loop count.
4. Create a vectorized AST for each statement within the loop. Each operation op_j in statement j is vectorized to vop_j depending upon the target architecture. If statement j accesses memory S_p[k_p * i + c_jp], create an EXTR_k_p,c_jp node on source S_p. If statement j writes to location D[k * i + d_j] in the iteration, attach the AST as the d_j-th operand of ILV_k^N.
5. For each destination array D, repeat from step 3.

Output:
- - - -
Output of phase 1 is an AST which can be defined using the following grammar:

tree   -> ptree | ctree | ITER (tree)
ptree  -> p-op_k (tree_1, tree_2, ...
, tree_k)
ctree  -> c-op_k (tree_1, tree_2, ... , tree_k) | MEM (src-name)
p-op_k -> Permute operation of arity k.
c-op_k -> Compute operation of arity k.

This phase is implemented in
- scanner.lex - lexical analyzer to scan the input
- parser.y    - syntax analyzer and AST generator
- header.h    - header file

Commands to execute:

To build phase 1:
  flex scanner.lex
  bison parser.y -d -v
  gcc parser.tab.c lex.yy.c -g -o parser.out

To run phase 1:
  ./parser.out < > dmp.txt

Phase 2: k-arity reduction/promotion
------------------------------------

Input:
- - -
AST constructed from dmp.txt, rooted at root_node.

Procedure:
- - - - -
IF root_node is a p-op:
  - Let m be the machine arity and k be the arity of root_node.
  - IF m < k: arity reduction
    - k % m must be 0; otherwise auto-vectorization is not possible.
    - IF the p-op of root_node is EXTR:

          EXTR_k,i^N          EXTR_m,(i/(k/m))%m^(Nm/k)
              |                          |
              T        =>      EXTR_k/m,i%(k/m)^N     <=== recursion on the newly created EXTR node
                                         |
                                         T

    - ELSE IF the p-op of root_node is ILV:

           ILV_k^N                          ILV_m^(Nk/m)
          /  |   \                    /          |         \
        T1  T2 ... Tk    =>    ILV_k/m^N    ILV_k/m^N ...  ILV_k/m^N   <=== recursion on all new ILV nodes
                                /  |  \      /  |  \        /  |  \
                               T1 Tm+1 ... T2 Tm+2 ...    Tm T2m ...
                                   Tk-m+1      Tk-m+2            Tk

  - ELSE IF m == k:
    - Already in the desired format. Hence, invoke phase 2 recursively on child(root_node).
  - ELSE IF m > k: arity promotion
    - m % k must be 0; otherwise auto-vectorization is not possible.
    - The algorithm for ILV-promotion introduces new EXTR nodes, whereas the algorithm for EXTR-promotion introduces ILV nodes. To avoid an infinite loop, only one of ILV-promotion and EXTR-promotion can be implemented. As phase 2 is designed as a top-down algorithm, ILV-promotion is implemented, with the expectation that the propagation of the newly created EXTR nodes creates opportunities for arity reduction. If this cannot be done, auto-vectorization is not possible. So,
    - IF the p-op of root_node is EXTR:
      - Do not perform auto-vectorization.
    - ELSE IF the p-op of root_node is ILV:

           ILV_k^N                              ILV_m^(Nk/m)
          /  |   \           =>          /           |   ...    \
        T1  T2 ... Tk       EXTR_m/k,0^N ... EXTR_m/k,0^N  EXTR_m/k,1^N ... EXTR_m/k,m/k-1^N
                                 |                |             |                  |
                                T1               Tk            T1                 Tk

ELSE  # root_node is a c-op, ITER or MEM node
  - Invoke phase 2 recursively on child(root_node).

Output:
- - - -
Output of phase 2 is the AST with the arity of all intermediate p-nodes adjusted to the arity of the target instructions.

This phase is implemented in Transformations.py and can be invoked through the driver function k_arity_reduction_promotion.

Phase 3: Top-down distribution of p-node subtree
------------------------------------------------

Input:
- - -
AST rooted at root_node, with p-node arity adjusted to the arity of the reordering instructions of the target.

Procedure:
- - - - -
This phase accumulates all reordering operations on the source operands, which makes effective instruction selection possible.

- Select a subtree T_N of the input expression tree such that all nodes within the subtree are p-ops and all n children of the subtree are the same c-op node, each c-op node having arity k:

              T_N
           /   |   \
      COP_k   ...   COP_k
       /|\           /|\
  C_0...C_k-1    C_kN-k...C_kN-1

- If such a subtree T_N can be found, construct a tree rooted at COP_k and propagate T_N onto each child of COP_k:

             COP_k
           /   |   \
        T_N   ...   T_N
        /|\          /|\
  C_0,C_k,...    C_k-1,...,C_kN-1
       C_Nk-k

Output:
- - - -
AST rooted at the newly created c-op node, or the original tree.

This phase is implemented in Transformations.py and can be invoked using top_down_distribution.

Phase 4: Optimization
---------------------

Input:
- - -
AST rooted at root_node, with reordering operations accumulated wherever possible.

Procedure:
- - - - -
The distribution of reordering operations (p-ops) can create opportunities for redundancy elimination. A few identities that can be eliminated are:

1. ILV_m (EXTR_0(S), EXTR_1(S), ... EXTR_m-1(S)) => S
2. EXTR_m,x (ILV_m(S1, S2, ... Sm)) => Sx
3.
CONCAT_m (SPLT_0(S), SPLT_1(S), ... SPLT_m-1(S)) => S
4. SPLT_m,x (CONCAT_m(S1, S2, ... Sm)) => Sx

This phase processes the tree nodes in post order, so that maximum redundancy elimination can be achieved in a single pass over the tree.

Output:
- - - -
AST rooted at root_node with identity elimination performed.

This phase is implemented in Transformations.py and can be invoked using optimizations.

Phase 5: Bottom-up commoning of p-node subtrees
-----------------------------------------------

Input:
- - -
Optimized tree rooted at root_node.

Procedure:
- - - - -
Each p-tree represents a set of permute instructions to be generated on the source of the p-tree. If a single reordering instruction can be generated for a tree rooted at a c-node, then commoning out a subpart of a child p-tree is always more expensive. Hence, commoning out a sub-p-tree is profitable only when a single instruction cannot be generated for each child of the c-node. This commoning gives better performance if, for a k-ary c-tree,
- each child p-tree has cost > 1, and
- each child p-tree has a common sub-p-tree with cost c; then k * c instructions can be saved if the sub-p-tree is commoned across all children, because the commoned sub-p-tree now operates on the result, so k * c instructions are replaced by c instructions.

- Select a common subtree T_N from all children of the root COP_k node, such that all nodes within the subtree are p-ops and T_N is not a subtree of a single-instruction match:

             COP_k
           /   |   \
        T_N   ...   T_N
        /|\          /|\
  C_0...C_Nk-k   C_k-1...C_kN-1

- If such a subtree T_N can be found, common out T_N from all children of COP_k and construct a tree rooted at T_N, with each child a tree rooted at COP_k performing the c-op on the appropriate children of the original tree:

              T_N
           /   |   \
      COP_k   ...   COP_k
       /|\           /|\
  C_0...C_k-1    C_kN-k...C_kN-1

Output:
- - - -
p-tree-commoned tree rooted at the new p-node, or the original tree.
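As a sanity check on the phase 4 identities above, the first two can be verified on a small list-based model of ILV and EXTR. This model is illustrative only; the function names are mine, not the implementation's.

```python
def ilv(*vecs):
    # ILV_m: element i of the result is element i // m of input
    # vector i % m.
    m, n = len(vecs), len(vecs[0])
    return [vecs[i % m][i // m] for i in range(m * n)]

def extr(vec, m, x):
    # EXTR_m,x: inverse of ILV; every m-th element starting at x.
    return vec[x::m]

S = list(range(12))
m = 3

# Identity 1: ILV_m(EXTR_0(S), EXTR_1(S), ... EXTR_m-1(S)) => S
assert ilv(*[extr(S, m, x) for x in range(m)]) == S

# Identity 2: EXTR_m,x(ILV_m(S1, S2, ... Sm)) => Sx
S1, S2, S3 = [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]
assert extr(ilv(S1, S2, S3), m, 1) == S2
```

Identities 3 and 4 follow the same pattern with CONCAT and SPLT in place of ILV and EXTR.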
This phase is implemented in Transformations.py and can be invoked using bottom_up_commoning.

Phase 6: Vector-size reduction
------------------------------

Input:
- - -
AST rooted at root_node.

Procedure:
- - - - -
- Take the GCD of the vec_size fields of all nodes in the tree.
- IF the GCD is divisible by the target's vec_size,
  - create an ITER node of GCD / target_vec_size.
- ELSE
  - do not vectorize.

Output:
- - - -
AST rooted at an ITER node, with the vector size adjusted in each node.

This phase is implemented in Transformations.py and can be invoked using vs_reduction.

Phase 7: Cost computation
-------------------------

This phase covers the entire expression tree with tiles of target instructions. It records the selected instructions along with their costs in a tuple at each node of the tree. For each target architecture, the reordering instructions of the target are expressed in terms of the primitive operations. A finite automaton is then constructed by creating one transition per p-op node of each instruction; in the final state of each reordering instruction, the tuple is populated. Along with the state transitions for each target instruction, we also generate states in the automaton that match a single ILV or EXTR node, whose final states give the tuple for a combination of instructions or a combination of inputs, respectively.

Input:
- - -
AST for cost computation, rooted at root_node.

Procedure for instruction selection:
- - - - - - - - - - - - - - - - - -
Traverse the tree top-down, and for each subtree:
- Transition over the generated automaton.
- If a final state is reached, add the populated cost tuple to the root of the matched subtree.

Output:
- - - -
Tree with cost-annotated nodes.

This phase is implemented in CostComputation.py and can be invoked using compute_cost.

Phase 8: Code generation
------------------------

Input:
- - -
AST rooted at root_node, with a cost-tuple annotation at each node.

Procedure:
- - - - -
- Create a DAG from the AST.
- Scan the DAG bottom-up, so that code for all children is generated before the parent node.
- Depending upon the vector size of the tree node and the input vector size expected by the instruction, tree_node(vec_size) / instruction(vec_size) instructions are generated.

Output:
- - - -
Assembly code or intermediate code.

This phase is implemented in CostComputation.py and can be invoked using generate_code.

TODOs
=====

- Implementation of ILV <-> CONCAT and EXTR <-> SPLT conversions.
- Implementation of PERM for target instructions of type vec_perm.

Currently, this algorithm can optimize loops which adhere to the input loop structure described earlier. However, some optimizations and loop transformations can help transform unacceptable loops into the acceptable structure:

- Loop interchange - for contiguous element access in multi-dimensional arrays.
- Loop peeling - to make loop count % target VS = 0.
- Loop fusion - to avoid holes in destination writes as well as to reuse sources better.
- Loop fission - to separate different destinations.
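As an illustration of the loop-peeling transformation mentioned above, the iteration space can be split so that the main loop's trip count becomes a multiple of the target vector size VS, leaving a scalar remainder loop. This is a minimal sketch with a hypothetical helper name; it is not part of the project's code.

```python
def peel(N, VS):
    # Split the iteration space [0, N) into a main range whose trip
    # count is a multiple of VS (vectorizable) and a scalar remainder.
    main = N - (N % VS)
    return range(0, main), range(main, N)

# With N = 10 and VS = 4, the main loop runs 8 iterations
# and the remainder loop runs 2.
main_iters, tail_iters = peel(10, 4)
assert (len(main_iters), len(tail_iters)) == (8, 2)
```

After peeling, the main loop satisfies the "loop count % target VS = 0" requirement of the input-loop structure, while the remainder iterations are executed as scalar code.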