new tree-codes proposal for vector operations

Topics

General Comments

The suggestions for the different tree-codes/optabs are organized according to representative operations that we want to express. For each such operation we suggest an optab, a tree (or a sequence of trees), and in some cases an example of the RTL that will be expanded from it.

In cases where we can use existing optabs and tree-codes we try to do so; in such cases, the operation represented by the tree-code is performed on each element of the vector. Where we need to introduce new optabs or tree-codes, these are marked in red, or in blue if they have already been suggested (for other purposes). New optabs/tree-codes that are optional appear in parentheses. Some optabs/tree-codes represent a "pure SIMD" operation - i.e., an operation that has the same semantics on scalars as on each element of a vector - and can be used both for vectors and scalars (e.g., addition). Other optabs/tree-codes are introduced for vectors only, as they have no meaning in a scalar context; these are prefixed with vec_ (for optabs) or V (for tree-codes).

In some cases multiple alternatives are presented. Often, one alternative will introduce fewer new tree-codes, sometimes at the cost of making it harder to expand the trees to the most efficient RTL.

This is an RFC. Comments are welcome on any of the following, and more:
1) is a new optab/tree-code introduced where an existing one could have been reused?
2) is an existing optab/tree-code used with semantics different from its original semantics?
3) which alternative makes more sense?
4) is there missing functionality (in the categories we have addressed so far)?
5) the names of the proposed new optabs/tree-codes are also open for discussion

Notations:

t - represents a data-type (char, short, int, etc.)
dt - represents a type whose size is double the size of t (e.g., if t is of mode 'hi', then dt is of mode 'si').
ht - represents a type whose size is half the size of t (e.g., if t is of mode 'hi', then ht is of mode 'qi').
v8t - represents a vector that consists of 8 elements of type t.
v4dt - represents a vector that consists of 4 elements of type dt.
0/1_mask - a vector of 1s and 0s.
v4_indx_mask - a vector of 4 elements; each element is an index (typically between 0 and 7 for the 4-element mask case).
s/u - appended to a tree-code if it will be used to perform signed or unsigned arithmetic, depending on the type of the tree (or its operands).
N - the number of bytes in a vector.

Conditional Operations

These include compare and select.

The main proposal tries to reuse existing tree-codes and optabs. Not all the compare forms below are required (e.g., if gt is available, we can do without le), but since optabs and tree-codes for them are already present in GCC, we present all the different forms here. An alternative proposal introduces a new tree-code (a 3-operand CMP_EXPR) that can be used for all the different compare forms; it would make sense only if the existing tree-codes cannot be used for some reason. The CMP_EXPR can operate on scalars as well as vectors.

Questions/Considerations/Comments:

operation: compare_ne, compare_eq
proposal: ne_optab, (eq_optab)
0/1_mask = NE_EXPR (v4t_x, v4t_y)
0/1_mask = EQ_EXPR (v4t_x, v4t_y)
alternative: ne_optab, (eq_optab)
0/1_mask = CMP_EXPR (v4t_x, v4t_y, test)
test = ne/eq/gt/le/ge/lt, depending on the optab.

operation: compare_gt, compare_le (signed)
proposal: gt_optab, le_optab
0/1_mask = GT_EXPR (v4t_x, v4t_y)
0/1_mask = LE_EXPR (v4t_x, v4t_y)
(The signed/unsigned distinction is hidden in the type of the tree.)

operation: compare_ge, compare_lt (signed)
proposal: ge_optab, lt_optab
0/1_mask = GE_EXPR (v4t_x, v4t_y)
0/1_mask = LT_EXPR (v4t_x, v4t_y)

operation: compare_ugt, compare_ule (unsigned)
proposal: ugt_optab, (ule_optab)
0/1_mask = GT_EXPR (v4t_x, v4t_y)
0/1_mask = LE_EXPR (v4t_x, v4t_y)

operation: compare_uge, compare_ult (unsigned)
proposal: uge_optab, (ult_optab)
0/1_mask = GE_EXPR (v4t_x, v4t_y)
0/1_mask = LT_EXPR (v4t_x, v4t_y)

operation: select
proposal: vec_select_optab
v4t_z = VSELECT_EXPR (v4t_x, v4t_y, 0/1_mask)
alternative: vec_select_optab
v4t = VSELECT_EXPR (v8t, 0/1_mask)
or both:
vec_select1_optab, VSELECT1_EXPR,
vec_select2_optab, VSELECT2_EXPR

Targets that don't directly support a select can define the following expansion for it (to allow vectorization using VSELECT):
(va & vc) | (vb & ~vc)
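The fallback expansion can be sketched in C, treating each vector as an array of elements; the function name is ours, and we assume mask elements are all-ones or all-zeros, as an element-wise compare would produce:

```c
#include <stdint.h>

/* Hypothetical fallback expansion of VSELECT_EXPR for targets without a
   native select: z = (a & mask) | (b & ~mask), applied per element.
   Mask elements are assumed to be all-ones or all-zeros. */
static void vselect_fallback(const uint32_t *a, const uint32_t *b,
                             const uint32_t *mask, uint32_t *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = (a[i] & mask[i]) | (b[i] & ~mask[i]);
}
```

With an all-ones/all-zeros mask this picks each result element from either a or b, which is exactly the VSELECT semantics.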

Data Permutations

These include permute operations, as well as higher-level operations (like handling of misaligned accesses) that use them. Because type conversions between vectors often require data permutation, we also address type conversions here.
Some of the operations below change the order of the elements but maintain the same type and number of elements, e.g.: v8t = vperm (v8t, v8_indx_mask).
Some operations change the number of elements, e.g.: v4t = vperm_sel (v8t, v4_indx_mask), v4t = vperm_sel (v4t, v4t, v4_indx_mask).
Some change the type: v8t = vpack (v4dt, v4dt).
And some change both type and number of elements: v4t = vunpack (v8ht, v4_indx_mask).

Questions/Considerations/Comments:

operation: permute
proposal: vec_permute_optab
v4t_z = VPERM_EXPR (v4t_x, v4_indx_mask)

operation: merge two vectors into one by selecting a permutation of half their elements, or, in a more general definition: select n elements of type t out of 2*n elements of the same type.
(Similar functionality to vperm (VMX), SHUFPS (SSE), and vmrgh*/vmrgl* (VMX), UNPCKHPS/UNPCKLPS (SSE).)
proposal: vec_merge_optab
v4t_z = VMERGE_EXPR (v4t_x, v4t_y, v4_indx_mask)
(alternative names: VPERM2_EXPR, VPERM_SEL_EXPR)
alternative: vec_perm_select_optab
v4t_z = VPERM_EXPR (v8t_x, v4_indx_mask)
or both alternatives.
Operation that involves permutation:

Misaligned load

Assumption: a load fetches N bytes from a "floor-aligned" address.

We use VMERGE_EXPR to perform: vz = (vx << mis) | (vy >> (N-mis))

proposal: misalign_lo_mask_optab
vx = MODIFY_EXPR (*addr)
vy = MODIFY_EXPR (*(addr+N-1))
indx_mask = MISALIGN_LO_MASK_EXPR (addr)
vz = VMERGE_EXPR (vx, vy, indx_mask)
rtl_expand - trivial (in VMX the MISALIGN_LO_MASK_EXPR will be expanded to lvsl/lvsr for big/little-endian).

alternative:
vx = MODIFY_EXPR (*addr)
vy = MODIFY_EXPR (*(addr+N-1))
mis = addr & (N-1)
shft = BIG_ENDIAN ? mis : N-mis
indx_mask = (shft, shft+1, ..., shft+N-1)
vz = VMERGE_EXPR (vx, vy, indx_mask)
rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?).
Q: OK to expose endianness at the tree level?
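The realignment scheme can be illustrated in C for N == 4, with the address expressed as an offset into a buffer (names and the big-endian-style merge direction are our assumptions):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a misaligned load realigned from two floor-aligned chunks:
   load the chunk containing addr and the next one, then pick the N
   bytes starting at the misalignment offset - the job of VMERGE_EXPR
   with index mask (mis, mis+1, ..., mis+N-1). */
enum { N = 4 };

static void misaligned_load(const uint8_t *mem, uintptr_t addr, uint8_t *vz)
{
    uintptr_t mis = addr & (N - 1);
    const uint8_t *lo = mem + (addr & ~(uintptr_t)(N - 1)); /* floor-aligned */
    uint8_t both[2 * N];
    memcpy(both, lo, N);          /* vx */
    memcpy(both + N, lo + N, N);  /* vy */
    for (int i = 0; i < N; i++)
        vz[i] = both[mis + i];    /* VMERGE with mask (mis, mis+1, ...) */
}
```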
Optimized misaligned load

(software-pipelined loads)

proposal: misalign_hi_mask_optab
vx = MODIFY_EXPR (*addr)
addr += (N-1)
indx_mask = MISALIGN_HI_MASK_EXPR (-addr)
loop_begin:
  vy = MODIFY_EXPR (*addr)
  vz = VMERGE_EXPR (vx, vy, indx_mask)
  vx = vy
  addr += N
loop_end
rtl_expand - trivial (in VMX the MISALIGN_HI_MASK_EXPR will be expanded to lvsr/lvsl for big/little-endian).

alternative:
vx = MODIFY_EXPR (*addr)
addr += (N-1)
mis = addr & (N-1)
shft = BIG_ENDIAN ? N-mis : mis
indx_mask = (shft, shft+1, ..., shft+N-1)
loop_begin:
  vy = MODIFY_EXPR (*addr)
  vz = VMERGE_EXPR (vx, vy, indx_mask)
  vx = vy
  addr += N
loop_end
rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?).
Misaligned store

proposal: misalign_hi_mask_optab
vx = MODIFY_EXPR (*addr)
vy = MODIFY_EXPR (*(addr+N))
indx_mask = MISALIGN_HI_MASK_EXPR (addr)
0/1_mask = VMERGE_EXPR (vzeros, vones, indx_mask)
vz1 = VSELECT_EXPR (vx, vsrc, 0/1_mask)
vz2 = VSELECT_EXPR (vsrc, vy, 0/1_mask)
(*addr) = MODIFY_EXPR (vz1)
(*(addr+N)) = MODIFY_EXPR (vz2)
rtl_expand - trivial.

alternative:
vx = MODIFY_EXPR (*addr)
vy = MODIFY_EXPR (*(addr+N))
mis = addr & (N-1)
indx_mask1 = BIG_ENDIAN ? {mis, mis+1, ...} : ...
indx_mask2 = BIG_ENDIAN ? {N-mis, N-mis+1, ...} : ...
vz1 = VMERGE_EXPR (vx, vsrc, indx_mask1)
vz2 = VMERGE_EXPR (vsrc, vy, indx_mask2)
(*addr) = MODIFY_EXPR (vz1)
(*(addr+N)) = MODIFY_EXPR (vz2)
rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?).
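The store scheme can be sketched in C for N == 4, again with the address as a buffer offset (names are ours; the two VSELECTs collapse into one byte-copy into the combined chunks):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a misaligned store: load the two aligned chunks the store
   straddles, replace the covered bytes with the source vector (what
   the two VSELECTs do), and store both chunks back. */
enum { NS = 4 };

static void misaligned_store(uint8_t *mem, uintptr_t addr, const uint8_t *vsrc)
{
    uintptr_t mis = addr & (NS - 1);
    uint8_t *lo = mem + (addr & ~(uintptr_t)(NS - 1)); /* floor-aligned */
    uint8_t both[2 * NS];
    memcpy(both, lo, 2 * NS);       /* vx and vy                  */
    memcpy(both + mis, vsrc, NS);   /* the two VSELECTs, combined */
    memcpy(lo, both, 2 * NS);       /* store vz1 and vz2          */
}
```

Note that, as in the proposal, bytes outside the stored vector are read and written back unchanged, which is unsafe if another thread may touch them concurrently.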
Demotion:
Pack n elements of type t into a vector of n elements of type ht. The demotion flavors we propose here use modulo or (signed/unsigned) saturating arithmetic.

example: vectorization of -
short_z = (short) int_x
(Similar functionality to vpck* in VMX)

proposal:
1. vec_pack_mod_optab
2. vec_pack_sat_optab
3. vec_pack_usat_optab
1. v8t_z = VPACK_MOD_EXPR (v4dt_x, v4dt_y)
2,3. v8t_z = VPACK_SAT_EXPR (v4dt_x, v4dt_y)
rtl_expand - trivial.

alternative1: same vec_pack_*_optabs
1. v8t_z = VPACK_MOD_EXPR (v8dt_x)
2,3. v8t_z = VPACK_SAT_EXPR s/u (v8dt_x)

alternative2: use scalar optabs:
1. convert_optab (is this equivalent to casting?)
2. sat_optab
3. usat_optab
1. v8t_z = CONVERT_EXPR (v8dt_x)
2,3. v8t_z = SAT_EXPR (v8dt_x)
(SAT_EXPR saturates according to the type of the tree/result.)
or both alternatives.

rtl_expand - w.r.t. alternative1: if v8dt is not supported but v8t is, this involves emulating the operation using smaller types (which should be doable).
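The saturating-pack semantics can be illustrated in C with t = short and dt = int (function names are ours; the tree-code names come from the proposal):

```c
#include <stdint.h>

/* Clamp a 32-bit value into the signed 16-bit range (SAT semantics). */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Illustration of VPACK_SAT_EXPR: pack two vectors of n ints into one
   vector of 2*n shorts, saturating values that do not fit. */
static void vpack_sat(const int32_t *x, const int32_t *y, int16_t *z, int n)
{
    for (int i = 0; i < n; i++) {
        z[i]     = sat16(x[i]);
        z[n + i] = sat16(y[i]);
    }
}
```

VPACK_MOD_EXPR would instead truncate each element modulo 2^16, i.e. a plain `(int16_t)` cast.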

Promotion.
Unpack n elements of type t into n/2 elements of type dt. The promotion can use sign extension or zero extension.

example: vectorization of
int_z = (int) short_x
(Similar functionality to vupk* in VMX)

proposal: vec_unpack_optab
v4t_z = VUNPACK_EXPR (v8ht_x, v4_indx_mask)
rtl_expand - trivial.

alternative: to cast each element of a vector (maintaining the same number of elements) we can reuse the tree-code used to cast scalars, i.e., CONVERT_EXPR. Split into a perm_select followed by the cast:
v4ht_z = VPERM_EXPR (v8ht_x, v4_indx_mask)
(or VMERGE_EXPR)
v4t_z = CONVERT_EXPR s/u (v4ht_z)
rtl_expand - problematic for targets that support a single vector size.
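The signed flavor of VUNPACK_EXPR can be sketched in C with ht = short and t = int (the function name is ours):

```c
#include <stdint.h>

/* Sketch of signed VUNPACK_EXPR: select the elements named by the index
   mask from a vector of shorts and widen each to int.
   Mask {0,1,2,3} yields the low half, {4,5,6,7} the high half. */
static void vunpack_s(const int16_t *x, const int *indx_mask,
                      int32_t *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = (int32_t)x[indx_mask[i]];  /* sign-extending conversion */
}
```

The unsigned flavor would zero-extend instead, i.e. convert through `uint16_t`.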
Operation that involves promotion:

Fixed-Point Multiplication:
widening product

example: vectorization of -
int_z = short_x * short_y

The semantics of VSEL_MULT_EXPR (v8t_x, v8t_y, v4_m):
v4dt_x' - consists of the elements of x at the indices specified in v4_m, promoted to type dt;
v4dt_y' - similarly for y;
v4dt_z = mult (v4dt_x', v4dt_y').

Putting it all together during vectorization:

if (smul_widen_optab[v8t]) {
  v8dt_z = MULT (v8t_x, v8t_y)
}
else if (smul_lo_optab[v8t] && smul_hi_optab[v8t]) {
  v4dt_z1 = VSEL_MULT (...)
  v4dt_z2 = VSEL_MULT (...)
}
else if (smul_odd_optab[v8t] && smul_even_optab[v8t]) {
  v4dt_tmp1 = VSEL_MULT (...)
  v4dt_tmp2 = VSEL_MULT (...)
  v4dt_z1 = VMERGE (...)
  v4dt_z2 = VMERGE (...)
}
else if (vec_unpack_optab[v8t] && smul_optab[v8t]) {
  v4dt_x1 = VUNPACK (v8t_x, {0,1,2,3})
  v4dt_x2 = VUNPACK (v8t_x, {4,5,6,7})
  v4dt_y1 = VUNPACK (v8t_y, {0,1,2,3})
  v4dt_y2 = VUNPACK (v8t_y, {4,5,6,7})
  v4dt_z1 = MULT (v4dt_x1, v4dt_y1)
  v4dt_z2 = MULT (v4dt_x2, v4dt_y2)
}
else {
  don't vectorize.
}
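The last fallback above (vec_unpack followed by a regular multiply) can be written out in C for 8 shorts; the function name is ours:

```c
#include <stdint.h>

/* Widening multiply via unpack: widen each half of x and y to int,
   then multiply element-wise, producing two vectors of 4 full
   32-bit products. */
static void widen_mult(const int16_t *x, const int16_t *y,
                       int32_t *z1, int32_t *z2)
{
    for (int i = 0; i < 4; i++) {
        z1[i] = (int32_t)x[i] * (int32_t)y[i];         /* low half  */
        z2[i] = (int32_t)x[4 + i] * (int32_t)y[4 + i]; /* high half */
    }
}
```

The products are computed in the wide type, so e.g. 1000 * 1000 does not wrap as it would in 16 bits.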

scenario1:
proposal: smul_widen_optab, umul_widen_optab
v8dt_z = MULT_EXPR (v8t_x, v8t_y)
rtl_expand - trivial (if target support is available).

scenario2:
proposal: smul_lo_optab, smul_hi_optab, umul_lo_optab, umul_hi_optab
v4dt_z1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3})
v4dt_z2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {4,5,6,7})
rtl_expand - trivial.

alternative (same optabs, reusing already existing/proposed tree-codes):
v8dt_z = MULT_EXPR (v8t_x, v8t_y)
v4dt_z1 = VPERM_EXPR (v8dt_z, {0,1,2,3})
v4dt_z2 = VPERM_EXPR (v8dt_z, {4,5,6,7})
(or VMERGE_EXPR)
rtl_expand - when expanding each of the vperms, we need to analyze their uses and detect that this is a mult_hi/lo pattern. This should be doable. However, expanding the MULT_EXPR itself is a problem (though its def will be removed by DCE, since the vperms will be replaced by mult_hi/lo and will no longer use the def of the original mult).
scenario3:
proposal: smul_odd_optab, smul_even_optab, umul_odd_optab, umul_even_optab
v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,2,4,6})
v4dt_tmp2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {1,3,5,7})
v4dt_z1 = VMERGE_EXPR (tmp1, tmp2, {0,4,1,5})
v4dt_z2 = VMERGE_EXPR (tmp1, tmp2, {2,6,3,7})
rtl_expand - trivial.

(in VMX will be expanded to:
tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vmrghi(tmp1,tmp2); z2 = vmrglo(tmp1,tmp2);
or:
m1 = {0,4,1,5}; m2 = {2,6,3,7}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m1); z2 = vperm(tmp1,tmp2,m2);)

alternative (same optabs, reusing already existing/proposed tree-codes):
v8dt_z = MULT_EXPR (v8t_x, v8t_y)
v4dt_tmp1 = VPERM_EXPR (v8dt_z, {0,2,4,6})
v4dt_tmp2 = VPERM_EXPR (v8dt_z, {1,3,5,7})
(or VMERGE_EXPR)
v4dt_z1 = VMERGE_EXPR (tmp1, tmp2, {0,4,1,5})
v4dt_z2 = VMERGE_EXPR (tmp1, tmp2, {2,6,3,7})
rtl_expand - when expanding each of the vperms, we need to analyze their uses and detect that this is a mult_odd/even pattern. This should be doable. However, expanding the MULT_EXPR itself is a problem (though its def will be removed by DCE, since the vperms will be replaced by mult_odd/even and will no longer use the def of the original mult).
Fixed-Point Multiplication:
take high part

example: vectorization of -
short_z = ((int) (short_x * short_y)) >> 16

proposal: smul_highpart_optab, umul_highpart_optab
v8t_z = MULT_HI_EXPR (v8t_x, v8t_y)
rtl_expand - trivial.
(in VMX will be expanded to:
m = {0,8,2,10,4,12,6,14}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);)

alternatives:
1. if s/umul_hi/lo_optab are available:
v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3})
v4dt_tmp2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {4,5,6,7})
v8t_z = VMERGE_EXPR (tmp1, tmp2, {0,2,4,6,8,10,12,14})
2. if s/umul_even/odd_optab are available:
v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,2,4,6})
v4dt_tmp2 = VSEL_MULT_EXPR (v8t_x, v8t_y, {1,3,5,7})
v8t_z = VMERGE_EXPR (tmp1, tmp2, {0,8,2,10,4,12,6,14})
rtl_expand - simple (may need to detect pattern).
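The element-wise semantics of MULT_HI_EXPR match the scalar pattern in the example, as this C sketch shows (the function name is ours):

```c
#include <stdint.h>

/* MULT_HI_EXPR on 16-bit elements: form the full 32-bit product and
   keep its high 16 bits - the scalar pattern
   short_z = ((int)(short_x * short_y)) >> 16, per element. */
static void vmult_hi(const int16_t *x, const int16_t *y, int16_t *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = (int16_t)(((int32_t)x[i] * (int32_t)y[i]) >> 16);
}
```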
Fixed-Point Multiplication:
take low part

example: vectorization of -
short_z = (short) (short_x * short_y)

proposal: smul_optab, umul_optab
v8t_z = MULT_EXPR (v8t_x, v8t_y)
rtl_expand - trivial.
(in VMX will be expanded to:
m = {1,9,3,11,5,13,7,15}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);)
vector shift, in number of bytes
Note: this shifts the entire vector, not element-wise; we can't use the scalar shift tree-codes.

(Similar functionality to vslo/vsro in VMX)

proposal: vec_shift_left_optab, vec_shift_right_optab
v = VSHIFT_RIGHT_EXPR (v, scalar_shft)
v = VSHIFT_LEFT_EXPR (v, scalar_shft)
rtl_expand - trivial.

alternative:
v = VPERM_EXPR (v, v_indx_mask)
for shift left by x bytes:
vN_indx_mask = (x, x+1, ..., N-1, 0, 1, ..., x-1)
for shift right by x bytes:
vN_indx_mask = (N-x, N-x+1, ..., N-1, 0, 1, ..., N-x-1)
rtl_expand - not trivial to detect that this is a shift pattern (impossible if x is unknown at compile time). This will result in suboptimal code (prepare mask, and permute).
vector shift, in number of bits: TBD

Saturation (and friends)

Using ADD as an example; similar optabs/tree-codes are proposed for SUB. Multiplication is discussed in the previous section.
We also need to consider ADD_CARRY_EXPR.

operation: saturate
proposal: optab_sat (for scalars and vectors)
v4t_z = SAT_EXPR (v4dt_x)

operation: short_z = short_x + short_y (modulo arithmetic)
proposal: optab_add
v8t_z = ADD_EXPR (v8t_x, v8t_y)

operation: int_z = short_x + short_y
proposal: optab_add_widen
v8dt_z = ADD_EXPR (v8t_x, v8t_y)
alternative: optab_add_widen, or, if vec_unpack_optab is available:
v4dt_x1 = VUNPACK_EXPR (v8t_x, {0,1,2,3})
v4dt_x2 = VUNPACK_EXPR (v8t_x, {4,5,6,7})
v4dt_y1 = VUNPACK_EXPR (v8t_y, {0,1,2,3})
v4dt_y2 = VUNPACK_EXPR (v8t_y, {4,5,6,7})
v4dt_z1 = ADD_EXPR (v4dt_x1, v4dt_y1)
v4dt_z2 = ADD_EXPR (v4dt_x2, v4dt_y2)

operation: short_z = u/s_SAT (short_x + short_y)
proposal: optab_add_ssat, optab_add_usat
v8t_z = ADD_SAT_EXPR (v8t_x, v8t_y)
alternative: if optab_add_widen and optab_sat are available:
v8dt_tmp = ADD_EXPR (v8t_x, v8t_y)
v8t_z = SAT_EXPR (v8dt_tmp)
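The widen-then-saturate alternative can be sketched in C for the signed short case (the function name is ours):

```c
#include <stdint.h>

/* Saturating add without a direct saturating-add optab: add in a type
   twice as wide (the widening ADD_EXPR), then clamp the result back
   into the narrow range (the SAT_EXPR). */
static void vadd_ssat(const int16_t *x, const int16_t *y, int16_t *z, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t tmp = (int32_t)x[i] + (int32_t)y[i]; /* widened add */
        if (tmp > INT16_MAX) tmp = INT16_MAX;        /* SAT_EXPR    */
        if (tmp < INT16_MIN) tmp = INT16_MIN;
        z[i] = (int16_t)tmp;
    }
}
```

The unsigned flavor would widen to `uint32_t` and clamp to [0, UINT16_MAX].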