The suggestions for the different tree-codes/optabs are organized according to representative operations that we want to express. For each such operation we suggest an optab, a tree (or a sequence of trees), and in some cases an example of the RTL that will be expanded from it.
In cases where we can use existing optabs and tree-codes we try to do so; in such cases, the operation represented by the tree-code is performed on each element of the vector. In cases where we need to introduce new optabs or tree-codes, these are marked in red, or in blue if they had already been suggested (for other purposes). New optabs/tree-codes that are optional appear in parenthesis. Some optabs/tree-codes that represent a "pure SIMD" operation - i.e, operation that has the same semantics when operated on scalars as when operated on each element of a vector - can be used both for vectors and scalars (e.g. addition). Other optabs/tree-codes are introduced for vectors only, as they have no meaning in the context of scalar; these are prefixed with vec_ (for optabs) or V (for tree-codes).
In some cases multiple alternatives are presented. Often, one alternative will introduce less new tree-codes, sometimes at the cost of difficulties to expand the trees to the most efficient RTL.
This is an RFC. Comments are welcome on any of the
following, and more:
1) is a new optab/tree-code introduced where an
existing one could have been reused
2) is an existing optab/tree-code used
with different semantics than its original semantics
3) which alternative
makes more sense
4) is there missing functionality (in the categories we have
addressed so far)
5) the names of the proposed new optabs/tree-codes are also
open for discussion
Notations:
t - represents a data-type (char, short, int,
etc.)
dt - represents a type whose size is double the size
of t (e.g, if t is of mode 'hi', then dt is of mode
'si').
ht - represents a type whose size is half the size of
t (e.g, if t is of mode 'hi', then ht is of mode 'qi').
v8t
- represents a vector that consists of 8 elements of type
t.
v4dt - represents a vector that consists of 4 elements if
type dt.
0/1_mask - a vector of 1s and
0s
v4_indx_mask - a vector of 4 elements; each element is an
index (typically between 0 to 7 for the 4 element mask
case).
s/u - appended to a tree-code if it will be used to
perform signed or unsigned arithmetic, depending on the type of the tree (or its
operands).
N - the number of bytes in a vector.
These include compare and select.
The main proposal tries to reuse existing tree-codes and optabs. Not all the compare forms below are required (e.g, if gt is available, we can do without le), but since optabs and tree-codes for them are already present in GCC, we present all the different forms here. An alternative proposal introduces a new tree-code (a 3 operand COMP_EXPR) that can be used for all the different compare forms. This one would make sense only if the existing tree-codes cannot be used for some reason. The CMP_EXPR can be used to operate on scalar as well as vectors.
Questions/Considerations/Comments:
01/_mask = NE_EXPR (vx, vy) vz = VSELECT (vx, vy, 01_mask)
We therefore suggest to introduce (at least) the vec_select_expr form that takes two operands (+ a mask) (i.e, "vselect2").
| operation | proposal | possible alternatives |
|---|---|---|
| compare_ne, compare_eq |
ne_optab,(eq_optab) | ne_optab,(eq_optab) |
0/1_mask = NE_EXPR (v4t_x, v4t_y) |
0/1_mask = CMP_EXPR (v4t_x, v4t_y, test)
test = ne/eq/gt/le/ge/lt, depending on the optab. | |
| compare_gt, compare_le (signed) |
gt_optab, le_optab | |
0/1_mask = GT_EXPR (v4t_x, v4t_y) (The sign/unsigned is hidden in the type of the tree) |
||
| compare_ge, compare_lt (signed) |
ge_optab, lt_optab | |
0/1_mask = GE_EXPR (v4t_x, v4t_y) |
||
| compare_ugt, compare_ule (unsigned) |
ugt_optab, (ule_optab) | |
0/1_mask = GT_EXPR (v4t_x, v4t_y) |
||
|
compare_uge, compare_ule (unsigned) |
uge_optab, (ult_optab) | |
0/1_mask = GE_EXPR (v4t_x, v4t_y) |
||
|
select |
vec_select_optab | vec_select_optab |
v4t_z = VSELECT_EXPR (v4t_x, v4t_y, 0/1_mask) |
v4t = VSELECT_EXPR (v8t, 0/1_mask) | |
| comments: or both: vec_select1_optab, VSELECT1_EXPR, vec_select2_optab, VSELECT2_EXPR Targets that don't directly support a select, |
These include permute operations, as well as higher-level operations (like
handling of misaligned accesses) that use them. Because type conversions between
vectors often require data permutation, we also address here type
conversions.
Some of the operations below change the order of the elements,
but maintain the same type and number of elements. e.g: v8t = vperm (v8t,
v8_indx_mask)
Some operations change the number of elements: e.g, v4t =
vperm_sel (v8t, v4_indx_mask), v4t = vperm_sel (v4t, v4t, v4_indx_mask)
Some
change the type: v8t = vpack (v4dt, v4dt)
And some change both type and
number of elements: v4t = vunpack (v8ht, v4_indx_mask)
Questions/Considerations/Comments:
| operation | proposal | possible alternatives |
|---|---|---|
|
Permute |
vec_permute_optab | |
v4t_z = VPERM_EXPR (v4t_x, v4_indx_mask) |
||
| comments: | comments: | |
| Merge two vectors into one by selecting a
permutation of half their elements or a more general definition: Select n elements of type t out of 2*n elements of the same type. (Similar functionality to vperm (VMX), SHUFPS (SSE) and vmrgh*,vmrgl* (VMX) UNPCKHPS,UNPCKLPS (SSE)). |
vec_merge_optab | vec_perm_select_optab |
v4t_z = VMERGE_EXPR (v4t_x, v4t_y, v4_indx_mask) |
v4t_z = VPERM_EXPR (v8t_x, v4_indx_mask) | |
| comments: | comments: or both alternatives | |
| Operation that involves permutation:
Misaligned load Assumption: a load fetches N bytes from "floor aligned" address. We use the VMERGE_EXPR to perform the following: |
misalign_lo_mask_optab | |
vx = MODIFY_EXPR (*addr) |
vx = MODIFY_EXPR (*addr) | |
| comments: rtl_expand - trivial (in VMX the MISALIGNED_LO_MASK_EXPR
will be expanded to lvsl/r for big/little-endian). |
comments: rtl_expand - detect pattern, or generate suboptimal code (or
rely on combine?). Q: OK to expose endianess at the tree-level ? | |
| Optimized Misaligned load
(Software-pipelined loads) |
misalign_hi_mask_optab | |
vx = MODIFY_EXPR (*addr) |
vx = MODIFY_EXPR (*addr) | |
| comments: rtl_expand - trivial (in VMX the MISALIGNED_HI_MASK_EXPR
will be expanded to lvsr/l for big/little-endian). |
comments: rtl_expand -detect pattern, or generate suboptimal code (or rely on combine?). | |
| Misaligned store | misalign_hi_mask_optab | |
vx = MODIFY_EXPR (*addr) |
vx = MODIFY_EXPR (*addr) | |
| comments: rtl_expand - trivial | comments: rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?). | |
| Demotion: Pack n elements of type t into a vector of n elements of type ht. The demotion flavors we propose here use modulo or (signed/unsigned) saturating arithmetic. example: vectorization of - |
1. vec_pack_mod_optab 2. vec_pack_sat_optab 3. vec_pack_usat_optab |
alternative1: same vec_pack_*_optabs alternative2: use scalar optabs: |
1. v8t_z = VPACK_MOD_EXPR (v4dt_x, v4dt_y) |
alternative1: | |
| comments: rtl_expand - trivial | comments: or both alternatives.
rtl_expand - w.r.t alternative1: if v8dt is not supported, but v8t is supported, involves emulating the operation using smaller types (which should be doable). | |
| Promotion. Unpack n elements of type t into n/2 elements of type dt. The promotion can use sign extension or zero extension. example: vectorization of |
vec_unpack_optab | to cast each element of a vector (maintaining the same number of elements) we can reuse the same tree-code used to cast scalars, i.e CONVERT_EXPR: |
v4t_z = VUNPACK_EXPR (v8ht_x, v4_indx_mask) |
split into perm_select followed by the casting: v4ht_z = VPERM_EXPR (v8ht_x, v4_indx_mask) (or VMERGE_EXPR) v4t = CONVERT_EXPR s/u (v4ht) | |
| comments: rtl_expand - trivial | comments: rtl_expand - problematic for targets that support a single vector size. | |
| Operation that involves promotion:
Fixed-Point Multiplication: example: vectorization of - The semantics of VSEL_MULT_EXPR (v8t_x, v8t_y, v4_m): Putting it all together during vectorization: if (smul_widen_optab[v8t]){
v8dt_z = MULT (v8t_x,v8t_y)
}
|
scenario1: smul_widen_optab, umul_widen_optab |
|
v8dt_z = MULT_EXPR (v8t_x, v8t_y) |
||
| comments: rtl_expand - trivial (if target support available) | comments: | |
| scenario2: smul_lo_optab, smul_hi_optab, umul_lo_optab, umul_hi_optab |
smul_lo_optab, smul_hi_optab, | |
v4dt_z1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3}) |
reusing already existing/proposed tree-codes: v8dt_z = MULT_EXPR (v8t_x, v8t_y) | |
| comments: rtl_expand - trivial | comments: rtl_expand - when expanding each of the vperms, need to analyze their uses and detect that it's a mult_hi/lo pattern. This should be doable. However, there's a problem to expand the MULT_EXPR (yet, its def will be removed by DCE since the vperms will be replaced by mult_hi/lo and will not use the def of the original mult). | |
| scenario3: smul_odd_optab, smul_even_optab, umul_odd_optab, umul_even_optab |
smul_odd_optab, smul_even_optab, | |
v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,2,4,6}) |
reusing already existing/proposed tree-codes: v8dt_z = MULT_EXPR (v8t_x, v8t_y) | |
| comments: rtl_expand - trivial
(in VMX will be expanded to: |
comments: rtl_expand - when expanding each of the vperms, need to analyze their uses and detect that it's a mult_odd/even pattern. This should be doable. However, there's a problem to expand the MULT_EXPR (yet, its def will be removed by DCE since the vperms will be replaced by mult_odd/even and will not use the def of the original mult). | |
| Fixed-Point Multiplication: take high part example: vectorization of - |
smul_highpart_optab, umul_highpart_optab |
1. if have s/umul_hi/lo_optab 2. if have s/umul_even/odd_optab |
v8t_z = MULT_HI_EXPR(v8t_x,v8t_y) |
1. v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3}) | |
| comments: rtl_expand - trivial. (in VMX will be expanded to: m = {0,8,2,10,4,12,6,14}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);) |
comments: rtl_expand - simple (may need to detect pattern). | |
| Fixed-Point Multiplication: take low part example: vectorization of - |
smul_optab, umul_optab | |
v8t_z = MULT_EXPR (v8t_x, v8t_y) |
||
| rtl_expand: trivial. (in VMX will be expanded to: m = {1,9,3,11,5,13,7,15}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);) |
||
| vector shift, in number of bytes Note: shift the entire vector, not element-wise; can't use the scalar shift tree-codes. (Similar functionality to vslo/vsro in VMX) |
vec_shift_left_optab vec_shft_right_optab |
|
v = VSHIFT_RIGHT_EXPR (v,scalar_shft) |
v = VPERM_EXPR (v, v_indx_mask) | |
| comments: rtl_expand - trivial | comments: rtl_expand - not trivial to detect that this is a shift pattern (impossible if x is unknown at compile time). This will result in suboptimal code (prepare mask, and permute). | |
| vector shift, in number of bits | TBD | |
| TBD | ||
Using ADD as an example; Similarl optabs/tree-codes are proposed for SUB.
Multiplication discussed in previous section.
Also need to consider
ADD_CARRY_EXPR.
| operation | proposal | possible alternatives |
|---|---|---|
| saturate | optab_sat (for scalars and vectors) | |
v4t_z = SAT_EXPR(v4dt_x) |
||
| short_z = short_x + short_y (modulo arithmetic) |
optab_add | |
| v8t_z = ADD_EXPR (v8t_x, v8t_y) | ||
| int_z = short_x + short_y | optab_add_widen | optab_add_widen
or if vec_unpack_optab |
v8dt_z = ADD_EXPR (v8t_x, v8t_y) |
v4dt_x1 = VUNPACK_EXPR (v8t_x, {0,1,2,3}) | |
| short_z = u/s_SAT (short_x + short_y) | optab_add_ssat optab_add_usat |
optab_add_ssat,
optab_add_usat or if optab_add_widen and optab_sat |
v8t_z = ADD_SAT_EXPR (v8t_x, v8t_y) |
v8dt_tmp = ADD_EXPR (v8t_x, v8t_y) |