Introduction
============

Current methods of auto-vectorization take the vectorization decision through a generic analysis that is almost target independent (except for target features such as vector sizes). However, the instruction selection mechanism is largely ad hoc, in the sense that the ideas used for instruction selection on one target may not carry over seamlessly to another. This has led to a proliferation of target-specific code in the generic parts of compiler frameworks such as GCC.

We propose a generic approach to reordering-instruction selection, based on the insight that effective data reordering for vectorization can be defined in terms of a set of well-defined primitive operations. Interestingly, the same set of operations can be used to define the reordering instructions of almost all targets that we know of. These operations thus form the basic building blocks of reorderings for effective vectorization.

Primitive operations
====================

Let V_N denote a vector of N scalar elements. A collection of k V_N vectors is denoted by V_N, ...k, V_N. We propose the following operations:

- Concatenation_k^N : V_N x ...k x V_N --> V_kN.
  CONCAT_k^N (V_N, ...k, V_N) concatenates k vectors of size N to produce a single vector of size kN.

- Split_k,i^kN : V_kN --> V_N.
  The split operation is the inverse of concatenation. SPLT_k,i^kN (V_kN) splits a vector of size kN into k vectors of size N each and returns the i-th vector.

- Interleave_k^N : V_N x ...k x V_N --> V_kN.
  ILV_k^N (V_N, ...k, V_N) creates a vector of size kN such that the i-th element of the output vector is the (i/k)-th element of the (i%k)-th input vector.

- Extract_k,i^kN : V_kN --> V_N.
  The extract operation is the inverse of interleave. EXTR_k,i^kN (V_kN) divides the input vector into N parts of k elements each, forms k vectors of size N by selecting the j-th element of each part to build the j-th output vector, and returns the i-th such vector.

- Permute^N : V_N x I_M --> V_N.
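The semantics of these primitives can be sketched on plain Python lists. This is an illustrative model only, not the project's implementation; the function names simply mirror the operator names above.

```python
def concat(*vecs):
    # CONCAT_k^N: join k vectors of size N into one vector of size k*N.
    out = []
    for v in vecs:
        out.extend(v)
    return out

def splt(vec, k, i):
    # SPLT_k,i^kN: cut a vector of size k*N into k chunks of size N
    # and return the i-th chunk.
    n = len(vec) // k
    return vec[i * n:(i + 1) * n]

def ilv(*vecs):
    # ILV_k^N: element i of the output is element i // k of input
    # vector i % k.
    k, n = len(vecs), len(vecs[0])
    return [vecs[i % k][i // k] for i in range(k * n)]

def extr(vec, k, i):
    # EXTR_k,i^kN: inverse of ILV; gather every k-th element,
    # starting at offset i.
    return vec[i::k]

def perm(vec, order):
    # PERM^N: permute each size-M subvector of vec by `order` (I_M).
    m = len(order)
    out = []
    for base in range(0, len(vec), m):
        out.extend(vec[base + j] for j in order)
    return out

# ILV and EXTR are inverses of each other:
a, b = [0, 2, 4, 6], [1, 3, 5, 7]
assert ilv(a, b) == [0, 1, 2, 3, 4, 5, 6, 7]
assert extr(ilv(a, b), 2, 0) == a and extr(ilv(a, b), 2, 1) == b
```

The closing assertions show the inverse relationships that the optimization phase later exploits as rewrite identities.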
The permute operation takes a vector of size N and a permute order I_M of size M. It divides the vector into N/M subvectors of size M and permutes each subvector according to I_M.

Phase structure of the algorithm
================================

The algorithm for target-aware reordering-instruction selection for efficient auto-vectorization is divided into the following phases:

Phase 1: Create tree for each loop per destination
         - Implemented by Sameera Deshpande
Phase 2: k-arity reduction/promotion
         - Implemented by Sameera Deshpande
Phase 3: Top-down distribution of p-node subtree
         - Implemented by Sameera Deshpande
Phase 4: Optimizations
         - Implemented by Sameera Deshpande
Phase 5: Bottom-up commoning of p-node subtree
         - Implemented by Sameera Deshpande
         - Cost computation of p-tree - Implemented by Sameera Deshpande
Phase 6: Vector-size reduction
         - Implemented by Sameera Deshpande
Phase 4 (again): Optimization
         - Implemented by Sameera Deshpande
Phase 7: Cost computation - Store min_cost_tree.
         - Implemented by Prachi Godbole
Phase 8: Target-specific code generation for min_cost_tree
         - Implemented by Prachi Godbole

Input loop
==========

To enable auto-vectorization using this algorithm, the loop structure should be as follows: a loop with k statements reading from S_1, S_2, ..., S_m and writing to destination D, where stmt_j is of the form

  D[k * i + d_j] = op_j (S_1[k_1 * i + c_j1], S_2[k_2 * i + c_j2], ..., S_m[k_m * i + c_jm])

such that
- All statements stmt_1, ..., stmt_k in the loop body are independent and vectorizable for a target with vector size VS.
- The iteration count N > VS.
- The induction variable i of the loop is not altered by stmt_1, stmt_2, ..., stmt_k.
- All of S_1, ..., S_m and D are of the same type.
- The indices of all S and D have the same multiplicative factor. If they differ, normalize by taking the LCM and duplicate the statements appropriately by adjusting the offsets.
- c_jp < k_p for all j, p.

FOR (i = 0 ... N - 1)
{
  stmt_1: D[k * i + d_1] = op_1 (S_1[k_1 * i + c_11], S_2[k_2 * i + c_12], ...
          S_m[k_m * i + c_1m])
  stmt_2: D[k * i + d_2] = op_2 (S_1[k_1 * i + c_21], S_2[k_2 * i + c_22], ... S_m[k_m * i + c_2m])
  ...
  stmt_k: D[k * i + d_k] = op_k (S_1[k_1 * i + c_k1], S_2[k_2 * i + c_k2], ... S_m[k_m * i + c_km])
}

For ease of implementation, the loop information is taken as input in a language defined as follows:

Input language
--------------

loop_begin     -> LOOP '(' NUMBER ')' dest_stmt_list
dest_stmt_list -> dest_stmt_list dest_stmt | dest_stmt
dest_stmt      -> VARIABLE '[' NUMBER ',' NUMBER ']' '=' cop_tree
cop_tree       -> VARIABLE '[' NUMBER ',' NUMBER ']' | c-op '(' cop_tree_list ')'
cop_tree_list  -> cop_tree_list ',' cop_tree | cop_tree
c-op           -> Any order-preserving compute operation.

So the loop given above can be represented as follows in the input language:

LOOP (N)
D[k,d_1] = op_1 (S_1[k_1, c_11], S_2[k_2, c_12], ... S_m[k_m, c_1m])
D[k,d_2] = op_2 (S_1[k_1, c_21], S_2[k_2, c_22], ... S_m[k_m, c_2m])
...
D[k,d_k] = op_k (S_1[k_1, c_k1], S_2[k_2, c_k2], ... S_m[k_m, c_km])

Algorithm
=========

Phase 1: Create AST for each loop per destination D
---------------------------------------------------

Input:
- - -
Program defined using the input language.

Procedure:
- - - - -
1. Identify the loop count N.
2. Identify the respective k for the destination array and the source arrays.
3. Create an ILV_k^N node of arity k which writes to destination array D. The vector size = loop count.
4. Create a vectorized AST for each statement within the loop. Each operation op_j in statement j is vectorized to vop_j depending upon the target architecture. If statement j accesses memory S_p[k_p * i + c_jp], create an EXTR_k_p,c_jp node on source S_p. If statement j writes to location D[k * i + d_j] in the iteration, attach the AST as the d_j-th operand of ILV_k^N.
5. For each destination array D, repeat from step 3.

Output:
- - - -
Output of phase 1 is an AST which can be defined using the following grammar:

tree   -> ptree | ctree | ITER (tree)
ptree  -> p-op_k (tree_1, tree_2, ...
, tree_k)
ctree  -> c-op_k (tree_1, tree_2, ... , tree_k) | MEM (src-name)
p-op_k -> Permute operation of arity k.
c-op_k -> Compute operation of arity k.

This phase is implemented in
- scanner.lex - lexical analyzer to scan the input
- parser.y    - syntax analyzer and AST generator
- header.h    - header file

Commands to execute:

To build phase 1:
  flex scanner.lex
  bison parser.y -d -v
  gcc parser.tab.c lex.yy.c -g -o parser.out

To run phase 1:
  ./parser.out < > dmp.txt

Phase 2: k-arity reduction/promotion
------------------------------------

Input:
- - -
AST constructed from dmp.txt, rooted at root_node.

Procedure:
- - - - -
IF root_node is a p-op:
  - Let m be the machine arity and k be the arity of root_node.
  - IF m < k: arity reduction
    - k % m must be 0; otherwise auto-vectorization is not possible.
    - IF the p-op of root_node is EXTR:

          EXTR_k,i^N          EXTR_m,(i/(k/m))%m^(Nm/k)
              |                          |
              T        =>      EXTR_k/m,i%(k/m)^N     <=== recursion on the newly created EXTR node
                                         |
                                         T

    - ELSE IF the p-op of root_node is ILV:

           ILV_k^N                          ILV_m^(Nk/m)
          /  |   \                    /          |         \
        T1  T2 ... Tk    =>    ILV_k/m^N    ILV_k/m^N ...  ILV_k/m^N   <=== recursion on all new ILV nodes
                                /  |  \      /  |  \        /  |  \
                               T1 Tm+1 ... T2 Tm+2 ...    Tm T2m ...
                                   Tk-m+1      Tk-m+2            Tk

  - ELSE IF m == k:
    - Already in the desired format. Hence, invoke phase 2 recursively on child(root_node).
  - ELSE IF m > k: arity promotion
    - m % k must be 0; otherwise auto-vectorization is not possible.
    - The algorithm for ILV-promotion introduces new EXTR nodes, whereas the algorithm for EXTR-promotion introduces ILV nodes. To avoid an infinite loop, only one of ILV-promotion and EXTR-promotion can be implemented. As phase 2 is designed as a top-down algorithm, ILV-promotion is implemented, with the expectation that the propagation of the newly created EXTR nodes creates opportunities for arity reduction. If this cannot be done, auto-vectorization is not possible. So,
    - IF the p-op of root_node is EXTR:
      - Do not perform auto-vectorization.
    - ELSE IF the p-op of root_node is ILV:

           ILV_k^N                              ILV_m^(Nk/m)
          /  |   \           =>          /           |   ...    \
        T1  T2 ... Tk       EXTR_m/k,0^N ... EXTR_m/k,0^N  EXTR_m/k,1^N ... EXTR_m/k,m/k-1^N
                                 |                |             |                  |
                                T1               Tk            T1                 Tk

ELSE  # root_node is a c-op, ITER or MEM node
  - Invoke phase 2 recursively on child(root_node).

Output:
- - - -
Output of phase 2 is the AST with the arity of all intermediate p-nodes adjusted to the arity of the target instructions.

This phase is implemented in Transformations.py and can be invoked through the driver function k_arity_reduction_promotion.

Phase 3: Top-down distribution of p-node subtree
------------------------------------------------

Input:
- - -
AST rooted at root_node, with p-node arity adjusted to the arity of the reordering instructions of the target.

Procedure:
- - - - -
This phase accumulates all reordering operations on the source operands, which makes effective instruction selection possible.

- Select a subtree T_N of the input expression tree such that all nodes within the subtree are p-ops and all n children of the subtree are the same c-op node, each c-op node having arity k:

              T_N
           /   |   \
      COP_k   ...   COP_k
       /|\           /|\
  C_0...C_k-1    C_kN-k...C_kN-1

- If such a subtree T_N can be found, construct a tree rooted at COP_k and propagate T_N onto each child of COP_k:

             COP_k
           /   |   \
        T_N   ...   T_N
        /|\          /|\
  C_0,C_k,...    C_k-1,...,C_kN-1
       C_Nk-k

Output:
- - - -
AST rooted at the newly created c-op node, or the original tree.

This phase is implemented in Transformations.py and can be invoked using top_down_distribution.

Phase 4: Optimization
---------------------

Input:
- - -
AST rooted at root_node, with reordering operations accumulated wherever possible.

Procedure:
- - - - -
The distribution of reordering operations (p-ops) can create opportunities for redundancy elimination. A few identities that can be eliminated are:

1. ILV_m (EXTR_0(S), EXTR_1(S), ... EXTR_m-1(S)) => S
2. EXTR_m,x (ILV_m(S1, S2, ... Sm)) => Sx
3.
CONCAT_m (SPLT_0(S), SPLT_1(S), ... SPLT_m-1(S)) => S
4. SPLT_m,x (CONCAT_m(S1, S2, ... Sm)) => Sx

This phase processes the tree nodes in post order, so that maximum redundancy elimination can be achieved in a single pass over the tree.

Output:
- - - -
AST rooted at root_node with identity elimination performed.

This phase is implemented in Transformations.py and can be invoked using optimizations.

Phase 5: Bottom-up commoning of p-node subtrees
-----------------------------------------------

Input:
- - -
Optimized tree rooted at root_node.

Procedure:
- - - - -
Each p-tree represents a set of permute instructions to be generated on the source of the p-tree. If a single reordering instruction can be generated for a tree rooted at a c-node, then commoning out a subpart of a child p-tree is always more expensive. Hence, commoning out a sub-p-tree is profitable only when a single instruction cannot be generated for each child of the c-node. This commoning gives better performance if, for a k-ary c-tree,
- each child p-tree has cost > 1, and
- each child p-tree has a common sub-p-tree with cost c; then k * c instructions can be saved if the sub-p-tree is commoned across all children, because the commoned sub-p-tree now operates on the result, so k * c instructions are replaced by c instructions.

- Select a common subtree T_N from all children of the root COP_k node, such that all nodes within the subtree are p-ops and T_N is not a subtree of a single-instruction match:

             COP_k
           /   |   \
        T_N   ...   T_N
        /|\          /|\
  C_0...C_Nk-k   C_k-1...C_kN-1

- If such a subtree T_N can be found, common out T_N from all children of COP_k and construct a tree rooted at T_N, with each child a tree rooted at COP_k performing the c-op on the appropriate children of the original tree:

              T_N
           /   |   \
      COP_k   ...   COP_k
       /|\           /|\
  C_0...C_k-1    C_kN-k...C_kN-1

Output:
- - - -
p-tree-commoned tree rooted at the new p-node, or the original tree.
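As a sanity check on the phase 4 identities above, the first two can be verified on a small list-based model of ILV and EXTR. This model is illustrative only; the function names are mine, not the implementation's.

```python
def ilv(*vecs):
    # ILV_m: element i of the result is element i // m of input
    # vector i % m.
    m, n = len(vecs), len(vecs[0])
    return [vecs[i % m][i // m] for i in range(m * n)]

def extr(vec, m, x):
    # EXTR_m,x: inverse of ILV; every m-th element starting at x.
    return vec[x::m]

S = list(range(12))
m = 3

# Identity 1: ILV_m(EXTR_0(S), EXTR_1(S), ... EXTR_m-1(S)) => S
assert ilv(*[extr(S, m, x) for x in range(m)]) == S

# Identity 2: EXTR_m,x(ILV_m(S1, S2, ... Sm)) => Sx
S1, S2, S3 = [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]
assert extr(ilv(S1, S2, S3), m, 1) == S2
```

Identities 3 and 4 follow the same pattern with CONCAT and SPLT in place of ILV and EXTR.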
This phase is implemented in Transformations.py and can be invoked using bottom_up_commoning.

Phase 6: Vector-size reduction
------------------------------

Input:
- - -
AST rooted at root_node.

Procedure:
- - - - -
- Take the GCD of the vec_size fields of all nodes in the tree.
- IF the GCD is divisible by the target's vec_size,
  - create an ITER node of GCD / target_vec_size.
- ELSE
  - do not vectorize.

Output:
- - - -
AST rooted at an ITER node, with the vector size adjusted in each node.

This phase is implemented in Transformations.py and can be invoked using vs_reduction.

Phase 7: Cost computation
-------------------------

This phase covers the entire expression tree with tiles of target instructions. It records the selected instructions along with their costs in a tuple at each node of the tree. For each target architecture, the reordering instructions of the target are expressed in terms of the primitive operations. A finite automaton is then constructed by creating one transition per p-op node of each instruction; in the final state of each reordering instruction, the tuple is populated. Along with the state transitions for each target instruction, we also generate states in the automaton that match a single ILV or EXTR node, whose final states give the tuple for a combination of instructions or a combination of inputs, respectively.

Input:
- - -
AST for cost computation, rooted at root_node.

Procedure for instruction selection:
- - - - - - - - - - - - - - - - - -
Traverse the tree top-down, and for each subtree:
- Transition over the generated automaton.
- If a final state is reached, add the populated cost tuple to the root of the matched subtree.

Output:
- - - -
Tree with cost-annotated nodes.

This phase is implemented in CostComputation.py and can be invoked using compute_cost.

Phase 8: Code generation
------------------------

Input:
- - -
AST rooted at root_node, with a cost-tuple annotation at each node.

Procedure:
- - - - -
- Create a DAG from the AST.
- Scan the DAG bottom-up, so that code for all children is generated before the parent node.
- Depending upon the vector size of the tree node and the input vector size expected by the instruction, tree_node(vec_size) / instruction(vec_size) instructions are generated.

Output:
- - - -
Assembly code or intermediate code.

This phase is implemented in CostComputation.py and can be invoked using generate_code.

TODOs
=====

- Implementation of ILV <-> CONCAT and EXTR <-> SPLT conversions.
- Implementation of PERM for target instructions of type vec_perm.

Currently, this algorithm can optimize loops which adhere to the input loop structure described earlier. However, some optimizations and loop transformations can help transform unacceptable loops into the acceptable structure:

- Loop interchange - for contiguous element access in multi-dimensional arrays.
- Loop peeling - to make loop count % target VS = 0.
- Loop fusion - to avoid holes in destination writes as well as to reuse sources better.
- Loop fission - to separate different destinations.
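As an illustration of the loop-peeling transformation mentioned above, the iteration space can be split so that the main loop's trip count becomes a multiple of the target vector size VS, leaving a scalar remainder loop. This is a minimal sketch with a hypothetical helper name; it is not part of the project's code.

```python
def peel(N, VS):
    # Split the iteration space [0, N) into a main range whose trip
    # count is a multiple of VS (vectorizable) and a scalar remainder.
    main = N - (N % VS)
    return range(0, main), range(main, N)

# With N = 10 and VS = 4, the main loop runs 8 iterations
# and the remainder loop runs 2.
main_iters, tail_iters = peel(10, 4)
assert (len(main_iters), len(tail_iters)) == (8, 2)
```

After peeling, the main loop satisfies the "loop count % target VS = 0" requirement of the input-loop structure, while the remainder iterations are executed as scalar code.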