# new tree-codes proposal for vector operations

## Topics

The suggestions for the different tree-codes/optabs are organized according to representative operations that we want to express. For each such operation we suggest an optab, a tree (or a sequence of trees), and in some cases an example of the RTL that will be expanded from it.

In cases where we can use existing optabs and tree-codes we try to do so; in such cases, the operation represented by the tree-code is performed on each element of the vector. In cases where we need to introduce new optabs or tree-codes, these are marked in red, or in blue if they had already been suggested (for other purposes). New optabs/tree-codes that are optional appear in parenthesis. Some optabs/tree-codes that represent a "pure SIMD" operation - i.e, operation that has the same semantics when operated on scalars as when operated on each element of a vector - can be used both for vectors and scalars (e.g. addition). Other optabs/tree-codes are introduced for vectors only, as they have no meaning in the context of scalar; these are prefixed with vec_ (for optabs) or V (for tree-codes).

In some cases multiple alternatives are presented. Often, one alternative will introduce less new tree-codes, sometimes at the cost of difficulties to expand the trees to the most efficient RTL.

This is an RFC. Comments are welcome on any of the following, and more:
1) is a new optab/tree-code introduced where an existing one could have been reused
2) is an existing optab/tree-code used with different semantics than its original semantics
3) which alternative makes more sense
4) is there missing functionality (in the categories we have addressed so far)
5) the names of the proposed new optabs/tree-codes are also open for discussion

Notations:

t - represents a data-type (char, short, int, etc.)
dt
- represents a type whose size is double the size of t (e.g, if t is of mode 'hi', then dt is of mode 'si').
ht
- represents a type whose size is half the size of t (e.g, if t is of mode 'hi', then ht is of mode 'qi').
v8t
- represents a vector that consists of 8 elements of type t.
v4dt
- represents a vector that consists of 4 elements if type dt.
- a vector of 1s and 0s
- a vector of 4 elements; each element is an index (typically between 0 to 7 for the 4 element mask case).
s/u - appended to a tree-code if it will be used to perform signed or unsigned arithmetic, depending on the type of the tree (or its operands).
N - the number of bytes in a vector.

## Conditional Operations

These include compare and select.

The main proposal tries to reuse existing tree-codes and optabs. Not all the compare forms below are required (e.g, if gt is available, we can do without le), but since optabs and tree-codes for them are already present in GCC, we present all the different forms here. An alternative proposal introduces a new tree-code (a 3 operand COMP_EXPR) that can be used for all the different compare forms. This one would make sense only if the existing tree-codes cannot be used for some reason. The CMP_EXPR can be used to operate on scalar as well as vectors.

• In some cases we need to express both signed and unsigned comparisons.
Q: Do the optabs below (gt, lt, etc), which are already present in GCC, represent signed or unsigned comparison?
A tree-code could represent both, depending on the type of the tree or its operands; but an optab would have to stand for either signed or unsigned. In this case, we should add the required optabs for the signed (or unsigned?) case. The proposal below duplicates the optabs for the signed and unsigned case, but uses one set of tree-codes assuming that the signed/unsigned will be determined according to the type of the tree/operands (this is similar to how signed/unsigned multiplication is represented).

• Compare insns (ne/gt/ge...) compare two operands and produce a single 0/1 result. In the vector form, it compares each pair of elements, and produce a 0/1 def for each. The result in a 0/1 mask (a "pure SIMD").

• Does it make sense to try to use the existing cmp_optab, ucmp_optab here?

• The vector select operation selects n elements out of 2*n elements, organized in two vectors, according to a 0/1 mask (which is a result of a compare operation). We'll refer to this representation as "vselect2". Another possible representation that could be useful for targets that have multiple vector sizes is to have the 2*n input elements organized in a single vector. We'll refer to this representation as "vselect1". This is presented here mostly for "completeness", to be consistent with proposals we make later for data-permutation operations which also select n elements out of 2*n elements (be it organized in two vectors or in one). In a way, "vselect1" is more general; however, in practice, the two operand form (+ a mask) is more natural in the case of the select operation, as usually there really would be two vectors from which elements are selected; e.g:

```01/_mask = NE_EXPR (vx, vy)
vz = VSELECT (vx, vy, 01_mask)```

We therefore suggest to introduce (at least) the vec_select_expr form that takes two operands (+ a mask) (i.e, "vselect2").

operation proposal possible alternatives
compare_ne,
compare_eq
ne_optab,(eq_optab) ne_optab,(eq_optab)
`0/1_mask = NE_EXPR (v4t_x, v4t_y)0/1_mask = EQ_EXPR (v4t_x, v4t_y)`
`0/1_mask = CMP_EXPR (v4t_x, v4t_y, test)`

test = ne/eq/gt/le/ge/lt, depending on the optab.

compare_gt,
compare_le (signed)
gt_optab, le_optab
`0/1_mask = GT_EXPR (v4t_x, v4t_y)0/1_mask = LE_EXPR (v4t_x, v4t_y)`

(The sign/unsigned is hidden in the type of the tree)

compare_ge,
compare_lt (signed)
ge_optab, lt_optab
`0/1_mask = GE_EXPR (v4t_x, v4t_y)0/1_mask = LT_EXPR (v4t_x, v4t_Y)`
compare_ugt,
compare_ule (unsigned)
ugt_optab, (ule_optab)
`0/1_mask = GT_EXPR (v4t_x, v4t_y)0/1_mask = LE_EXPR (v4t_x, v4t_y)`
compare_uge,
compare_ule (unsigned)
uge_optab, (ult_optab)
`0/1_mask = GE_EXPR (v4t_x, v4t_y)0/1_mask = LT_EXPR (v4t_x, v4t_y)`

select

vec_select_optab vec_select_optab
`v4t_z = VSELECT_EXPR (v4t_x, v4t_y, 0/1_mask)`
`v4t = VSELECT_EXPR (v8t, 0/1_mask)`
vec_select1_optab, VSELECT1_EXPR,
vec_select2_optab, VSELECT2_EXPR

Targets that don't directly support a select,
can define the following expansion for it
(to allow vectorization using VSELECTS):
(va & vc) | (vb &! vc).

## Data Permutations

These include permute operations, as well as higher-level operations (like handling of misaligned accesses) that use them. Because type conversions between vectors often require data permutation, we also address here type conversions.
Some of the operations below change the order of the elements, but maintain the same type and number of elements. e.g: v8t = vperm (v8t, v8_indx_mask)
Some operations change the number of elements: e.g, v4t = vperm_sel (v8t, v4_indx_mask), v4t = vperm_sel (v4t, v4t, v4_indx_mask)
Some change the type: v8t = vpack (v4dt, v4dt)
And some change both type and number of elements: v4t = vunpack (v8ht, v4_indx_mask)

• Some operations are used to select a permutation of n elements out of 2*n elements. The permutation is specified in a vector (indx_mask) that holds n indices whose values range from 0 to (2*n-1). Typically, this operation is used when the 2*n elements are placed in two vectors (e.g, combining values from two vectors for handling misaligned accesses), in which case it can be viewed as an operation that merges two vectors into one. Targets that support multiple sizes may benefit from allowing also a form in which the 2*n elements are organized in a single vector (can SSE* enjoy this flexibility?).

• Casting/multiple-data-types:
Promotion of data types comes in two flavors - signed or unsigned.
Demotion of data types can come in different flavors - e.g, modulo arithmetic, unsigned saturation, and signed saturation. For the modulo case, we try to reuse the same tree-codes used when casting scalars, i.e CONVERT_EXPR.

• Multiplication in which the result is wider than the inputs:
We propose several solutions, as different targets may have different mechanism to support such multiplication: (1) direct support (for targets that support multiple vector sizes), (2) multiply the n/2 odd/even elements, (3) multiply the n/2 hi/lo elements, or (4) no support - in which case the input will be promoted before multiplication can be applied.

• Other arithmetic operations in which the result is wider than the inputs - see next section.

• VPERM_EXPR is proposed for two purposes:
1. vec_permute_optab, in which case the number of elements in the result vector is equal to the number of elements in the input vector.
2. vec_perm_select_optab, in which the number of elements in the result vector is half the number of elements in the input vector.

operation proposal possible alternatives

Permute

vec_permute_optab
`v4t_z = VPERM_EXPR (v4t_x, v4_indx_mask)`
Merge two vectors into one by selecting a permutation of half their elements
or a more general definition: Select n elements of type t out of 2*n elements of the same type.
(Similar functionality to vperm (VMX), SHUFPS (SSE) and vmrgh*,vmrgl* (VMX) UNPCKHPS,UNPCKLPS (SSE)).
vec_merge_optab vec_perm_select_optab
`v4t_z = VMERGE_EXPR (v4t_x, v4t_y, v4_indx_mask)(alternative names: VPERM2_EXPR, VPERM_SEL_EXPR)`
`v4t_z = VPERM_EXPR (v8t_x, v4_indx_mask)`
Operation that involves permutation:

We use the VMERGE_EXPR to perform the following:
vz = (vx << mis | vy >> (N-mis))

`vx = MODIFY_EXPR (*addr)vy = MODIFY_EXPR (*(addr+N-1))indx_mask = MISALIGN_LO_MASK_EXPR (addr)vz = VMERGE_EXPR (vx, vy, indx_mask)`
`vx = MODIFY_EXPR (*addr)vy = MODIFY_EXPR (*(addr+N-1))mis = addr & (N-1)shft = BIG_ENDIAN? mis : N-mis indx_mask = (shft,shft+1,...shft+N-1)vz = VMERGE_EXPR (vx, vy, indx_mask)`
comments: rtl_expand - trivial (in VMX the MISALIGNED_LO_MASK_EXPR will be expanded to lvsl/r for big/little-endian).
comments: rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?).
Q: OK to expose endianess at the tree-level ?

`vx = MODIFY_EXPR (*addr)addr += (N-1)indx_mask = MISALIGN_HI_MASK_EXPR (-addr)loop_begin:vy = MODIFY_EXPR (*addr)vz = VMERGE_EXPR (vx, vy, indx_mask) vx = vyaddr +=16loop_end`
`vx = MODIFY_EXPR (*addr)addr += (N-1)mis = addr & (N-1)shft = BIG_ENDIAN? N-mis : mis indx_mask = (shft,shft+1,...shft+N-1)loop_begin:vy = MODIFY_EXPR (*addr)vz = VMERGE_EXPR (vx, vy, indx_mask) vx = vyaddr +=16loop_end`
comments: rtl_expand - trivial (in VMX the MISALIGNED_HI_MASK_EXPR will be expanded to lvsr/l for big/little-endian).
comments: rtl_expand -detect pattern, or generate suboptimal code (or rely on combine?).
`vx = MODIFY_EXPR (*addr)vy = MODIFY_EXPR (*(addr+N))indx_mask = MISALIGN_HI_MASK_EXPR (addr)0/1_mask = VMERGE_EXPR (vzeros, vones, indx_mask) vz1 = VSELECT_EXPR (vx, vsrc, 0/1_mask)vz2 = VSELECT_EXPR (vsrc, vy, 0/1_mask)(*addr) = MODIFY_EXPR (vz1)(*(addr+16)) = MODIFY_EXPR (vz2)`
`vx = MODIFY_EXPR (*addr)vy = MODIFY_EXPR (*(addr+N))mis = addr & (N-1)indx_mask1 = BIG_ENDIAN? {mis, mis+1, ...} :...indx_mask2 = BIG_ENDIAN? {N-mis,N-mis+1,...}:...vz1 = VMERGE_EXPR (vx, vsrc, indx_mask)vz2 = VMERGE_EXPR (vsrc, vy, indx_mask)(*addr) = MODIFY_EXPR (vz1)(*(addr+16)) = MODIFY_EXPR (vz2)`
comments: rtl_expand - trivial comments: rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?).
Demotion:
Pack n elements of type t into a vector of n elements of type ht. The demotion flavors we propose here use modulo or (signed/unsigned) saturating arithmetic.

example: vectorization of -
short_z = (short) int_x
(Similar functionality to vpck* in VMX)

1. vec_pack_mod_optab
2. vec_pack_sat_optab

3. vec_pack_usat_optab
alternative1: same vec_pack_*_optabs

alternative2: use scalar optabs:
1. convert_optab (is this equivalent to casting?)
2. sat_optab

3. usat_optab

`1. v8t_z = VPACK_MOD_EXPR (v4dt_x, v4dt_y)2,3. v8t_z = VPACK_SAT_EXPR (v4dt_x, v4dt_y)`
```alternative1:1. v8t_z = VPACK_MOD_EXPR (v8dt_x)2,3. v8t_z = VPACK_SAT_EXPR s/u (v8dt_x)
alternative2:1. v8t_z = CONVERT_EXPR (v8dt_x)2,3. v8t_z = SAT_EXPR (v8dt_x)(SAT_EXPR saturates according to the type of the tree/result).```

rtl_expand - w.r.t alternative1: if v8dt is not supported, but v8t is supported, involves emulating the operation using smaller types (which should be doable).

Promotion.
Unpack n elements of type t into n/2 elements of type dt. The promotion can use sign extension or zero extension.

example: vectorization of
int_z = (int) short_x
(Similar functionality to vupk* in VMX)

vec_unpack_optab to cast each element of a vector (maintaining the same number of elements) we can reuse the same tree-code used to cast scalars, i.e CONVERT_EXPR:
`v4t_z = VUNPACK_EXPR (v8ht_x, v4_indx_mask)`
split into perm_select followed by the casting:
(or VMERGE_EXPR)
v4t = CONVERT_EXPR s/u (v4ht)
comments: rtl_expand - trivial comments: rtl_expand - problematic for targets that support
a single vector size.
Operation that involves promotion:

Fixed-Point Multiplication:
widen product
.

example: vectorization of -
int_z = short_x * short_y

The semantics of VSEL_MULT_EXPR (v8t_x, v8t_y, v4_m):
v4dt_x' - consists of elements from x, at indices specified in vm, promoted to type dt;
v4dt_y' - similarly for y;
v4dt_z = mult (v4dt_x', v4dt_y').

Putting it all together during vectorization:

```if (smul_widen_optab[v8t]){
v8dt_z = MULT (v8t_x,v8t_y)
} else if (smul_lo_optab[v8t]
&& smul_hi_optab[v8t]) { v4dt_z1 = VSEL_MULT (...)
v4dt_z2 = VSEL_MULT (...)}else if (smul_odd_optab[v8t]
&& smul_even_optab[v8t]) { v4dt_tmp1 = VSEL_MULT (...)
v4dt_tmp2 = VSEL_MULT (...) v4dt_z1 = VMERGE (...)
v4dt_z2 = VMERGE (...)}else if (vec_unpack_optab[v8t]
&& smul_optab[v8t]){ v4dt_x1 = VUNPACK (v8t_x, {0,1,2,3})
v4dt_x2 = VUNPACK (v8t_x, {4,5,6,7})
v4dt_y1 = VUNPACK (v8t_y, {0,1,2,3})
v4dt_y2 = VUNPACK (v8t_y, {4,5,6,7})
v4dt_z1 = MULT (v4dt_x1, v4dt_y1)
v4dt_z2 = MULT (v4dt_x2, v4dt_y2)
}else{ don't vectorize.
}```

scenario1:
smul_widen_optab, umul_widen_optab
`v8dt_z = MULT_EXPR (v8t_x, v8t_y)`
scenario2:
smul_lo_optab, smul_hi_optab,

umul_lo_optab, umul_hi_optab

smul_lo_optab, smul_hi_optab,
umul_lo_optab, umul_hi_optab

`v4dt_z1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3})v4dt_z2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {4,5,6,7})`
`v8dt_z = MULT_EXPR (v8t_x, v8t_y)v4dt_z1 = VPERM_EXPR (v8dt_z, {0,1,2,3})v4dt_z2 = VPERM_EXPR (v8dt_z, {4,5,6,7}) (or VMERGE_EXPR)`
comments: rtl_expand - trivial comments: rtl_expand - when expanding each of the vperms, need to analyze their uses and detect that it's a mult_hi/lo pattern. This should be doable. However, there's a problem to expand the MULT_EXPR (yet, its def will be removed by DCE since the vperms will be replaced by mult_hi/lo and will not use the def of the original mult).
scenario3:
smul_odd_optab, smul_even_optab,

umul_odd_optab, umul_even_optab

smul_odd_optab, smul_even_optab,
umul_odd_optab, umul_even_optab

`v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,2,4,6})v4dt_tmp2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {1,3,5,7})v4dt_z1 = VMERGE_EXPR (tmp1, tmp2, {0,4,1,5})v4dt_z2 = VMERGE_EXPR (tmp1, tmp2, {2,6,3,7})`
`v8dt_z = MULT_EXPR (v8t_x, v8t_y)v4dt_tmp1 = VPERM_EXPR (v8dt_z, {0,2,4,6})v4dt_tmp2 = VPERM_EXPR (v8dt_z, {1,3,5,7}) (or VMERGE_EXPR)v4dt_z1 = VMERGE_EXPR (tmp1, tmp2, {0,4,1,5})v4dt_z2 = VMERGE_EXPR (tmp1, tmp2, {2,6,3,7})`

(in VMX will be expanded to:
tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vmrghi(tmp1,tmp2); z2 = vmrglo(tmp1,tmp2);
or:
m1 = {0,4,1,5); m2 = {2,6,3,7}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2.m1); z2 = vperm(tmp1,tmp2,m2);)

comments: rtl_expand - when expanding each of the vperms, need to analyze their uses and detect that it's a mult_odd/even pattern. This should be doable. However, there's a problem to expand the MULT_EXPR (yet, its def will be removed by DCE since the vperms will be replaced by mult_odd/even and will not use the def of the original mult).
Fixed-Point Multiplication:
take high part

example: vectorization of -
short_z =((int) (short_x * short_y))>>16

smul_highpart_optab,
umul_highpart_optab
1. if have s/umul_hi/lo_optab
2. if have s/umul_even/odd_optab
`v8t_z = MULT_HI_EXPR(v8t_x,v8t_y)`
`1. v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3})   v4dt_tmp2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {4,5,6,7})   v8t_z = VMERGE_EXPR (tmp1, tmp2, {0,2,4,6,8,10,12,14})2. v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,2,4,6})   v4dt_tmp2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {1,3,5,7})   v8t_z = VMERGE_EXPR (tmp1, tmp2, {0,8,2,10,4,12,6,14})`
(in VMX will be expanded to:
m = {0,8,2,10,4,12,6,14}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);)
comments: rtl_expand - simple (may need to detect pattern).
Fixed-Point Multiplication:
take low part

example: vectorization of -
short_z = (short) (short_x * short_y)

smul_optab, umul_optab
`v8t_z = MULT_EXPR (v8t_x, v8t_y)`
rtl_expand: trivial.
(in VMX will be expanded to:
m = {1,9,3,11,5,13,7,15}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);)
vector shift, in number of bytes
Note: shift the entire vector, not element-wise; can't use the scalar shift tree-codes.

(Similar functionality to vslo/vsro in VMX)

vec_shift_left_optab
vec_shft_right_optab
`v = VSHIFT_RIGHT_EXPR (v,scalar_shft)v = VSHIFT_LEFT_EPXR (v,scalar_shft)`
`v = VPERM_EXPR (v, v_indx_mask)for shift left by x bytes:vN_indx_mask = (x, x+1,...,N-1,0,1,..,x-1)for shift right by x bytes:vN_indx_mask = (N-x,N-x+1,..,N-1,0,1,..,N-x-1)`
comments: rtl_expand - trivial comments: rtl_expand - not trivial to detect that this is a shift pattern (impossible if x is unknown at compile time). This will result in suboptimal code (prepare mask, and permute).
vector shift, in number of bits TBD
TBD

## Saturation (and friends)

Using ADD as an example; Similarl optabs/tree-codes are proposed for SUB. Multiplication discussed in previous section.

operation proposal possible alternatives
saturate optab_sat (for scalars and vectors)
`v4t_z = SAT_EXPR(v4dt_x)`
short_z = short_x + short_y
(modulo arithmetic)
`v8dt_z = ADD_EXPR (v8t_x, v8t_y)`
`v4dt_x1 = VUNPACK_EXPR (v8t_x, {0,1,2,3})v4dt_x2 = VUNPACK_EXPR (v8t_x, {4,5,6,7})v4dt_y1 = VUNPACK_EXPR (v8t_y, {0,1,2,3})v4dt_y2 = VUNPACK_EXPR (v8t_y, {4,5,6,7})v4dt_z1 = ADD_EXPR (v4dt_x1, v4dt_y1)v4dt_z2 = ADD_EXPR (v4dt_x2, v4dt_y2)`
`v8t_z = ADD_SAT_EXPR (v8t_x, v8t_y)`
`v8dt_tmp = ADD_EXPR (v8t_x, v8t_y)v8t_z = SAT_EXPR (v8dt_tmp)`