The suggestions for the different treecodes/optabs are organized according to representative operations that we want to express. For each such operation we suggest an optab, a tree (or a sequence of trees), and in some cases an example of the RTL that will be expanded from it.

In cases where we can use existing optabs and treecodes we try to do so; in such cases, the operation represented by the treecode is performed on each element of the vector. Where we need to introduce new optabs or treecodes, these are marked in red, or in blue if they have already been suggested (for other purposes). New optabs/treecodes that are optional appear in parentheses. Some optabs/treecodes represent a "pure SIMD" operation - i.e., an operation that has the same semantics when applied to scalars as when applied to each element of a vector - and can be used both for vectors and scalars (e.g., addition). Other optabs/treecodes are introduced for vectors only, as they have no meaning in the context of scalars; these are prefixed with vec_ (for optabs) or V (for treecodes).

In some cases multiple alternatives are presented. Often, one alternative will introduce fewer new treecodes, sometimes at the cost of making it harder to expand the trees to the most efficient RTL.
This is an RFC. Comments are welcome on any of the following, and more:
1) Is a new optab/treecode introduced where an existing one could have been reused?
2) Is an existing optab/treecode used with semantics different from its original semantics?
3) Which alternative makes more sense?
4) Is there missing functionality (in the categories we have addressed so far)?
5) The names of the proposed new optabs/treecodes are also open for discussion.
Notations:
t - represents a datatype (char, short, int, etc.)
dt - represents a type whose size is double the size of t (e.g., if t is of mode 'hi', then dt is of mode 'si').
ht - represents a type whose size is half the size of t (e.g., if t is of mode 'hi', then ht is of mode 'qi').
v8t - represents a vector that consists of 8 elements of type t.
v4dt - represents a vector that consists of 4 elements of type dt.
0/1_mask - a vector of 1s and 0s.
v4_indx_mask - a vector of 4 elements; each element is an index (typically between 0 and 7 for the 4-element mask case).
s/u - appended to a treecode if it will be used to perform signed or unsigned arithmetic, depending on the type of the tree (or its operands).
N - the number of bytes in a vector.
These include compare and select. The main proposal tries to reuse existing treecodes and optabs. Not all the compare forms below are required (e.g., if gt is available, we can do without le), but since optabs and treecodes for them are already present in GCC, we present all the different forms here. An alternative proposal introduces a new treecode (a 3-operand CMP_EXPR) that can be used for all the different compare forms. This would make sense only if the existing treecodes cannot be used for some reason. The CMP_EXPR can be used to operate on scalars as well as vectors.
Questions/Considerations/Comments:
0/1_mask = NE_EXPR (vx, vy)
vz = VSELECT (vx, vy, 0/1_mask)
We therefore suggest introducing (at least) the vec_select_expr form that takes two operands (+ a mask) (i.e., "vselect2").
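As a scalar model of these semantics, the sketch below shows a per-element NE compare producing an all-ones/all-zeros mask, followed by the two-operand select. The helper names are ours, not GCC internals; the bitwise form of the select also illustrates how a target without a native select instruction could emulate it with and/andnot/or.

```c
#include <assert.h>
#include <stdint.h>

enum { VL = 4 };

/* NE compare: each mask element is all-ones (-1) where x[i] != y[i],
   all-zeros otherwise. */
static void vec_ne(const int32_t *x, const int32_t *y, int32_t *mask)
{
    for (int i = 0; i < VL; i++)
        mask[i] = (x[i] != y[i]) ? -1 : 0;
}

/* Two-operand select: take x[i] where the mask is set, else y[i]. */
static void vec_select(const int32_t *x, const int32_t *y,
                       const int32_t *mask, int32_t *z)
{
    for (int i = 0; i < VL; i++)
        z[i] = (x[i] & mask[i]) | (y[i] & ~mask[i]);
}
```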
operation: compare_ne, compare_eq
proposal: ne_optab, (eq_optab)
  0/1_mask = NE_EXPR (v4t_x, v4t_y)
alternative: ne_optab, (eq_optab)
  0/1_mask = CMP_EXPR (v4t_x, v4t_y, test)
  test = ne/eq/gt/le/ge/lt, depending on the optab.

operation: compare_gt, compare_le (signed)
proposal: gt_optab, le_optab
  0/1_mask = GT_EXPR (v4t_x, v4t_y) (the signed/unsigned distinction is hidden in the type of the tree)

operation: compare_ge, compare_lt (signed)
proposal: ge_optab, lt_optab
  0/1_mask = GE_EXPR (v4t_x, v4t_y)

operation: compare_ugt, compare_ule (unsigned)
proposal: ugt_optab, (ule_optab)
  0/1_mask = GT_EXPR (v4t_x, v4t_y)

operation: compare_uge, compare_ult (unsigned)
proposal: uge_optab, (ult_optab)
  0/1_mask = GE_EXPR (v4t_x, v4t_y)

operation: select
proposal: vec_select_optab
  v4t_z = VSELECT_EXPR (v4t_x, v4t_y, 0/1_mask)
alternative: vec_select_optab
  v4t = VSELECT_EXPR (v8t, 0/1_mask)
comments: or both: vec_select1_optab/VSELECT1_EXPR and vec_select2_optab/VSELECT2_EXPR. Targets that don't directly support a select,
These include permute operations, as well as higher-level operations (like handling of misaligned accesses) that use them. Because type conversions between vectors often require data permutation, we also address type conversions here.

Some of the operations below change the order of the elements, but maintain the same type and number of elements, e.g.: v8t = vperm (v8t, v8_indx_mask)
Some operations change the number of elements, e.g.: v4t = vperm_sel (v8t, v4_indx_mask), v4t = vperm_sel (v4t, v4t, v4_indx_mask)
Some change the type: v8t = vpack (v4dt, v4dt)
And some change both type and number of elements: v4t = vunpack (v8ht, v4_indx_mask)
Questions/Considerations/Comments:
operation: Permute
proposal: vec_permute_optab
  v4t_z = VPERM_EXPR (v4t_x, v4_indx_mask)

operation: Merge two vectors into one by selecting a permutation of half their elements; or, a more general definition: select n elements of type t out of 2*n elements of the same type. (Similar functionality to vperm (VMX) and SHUFPS (SSE), or vmrgh*/vmrgl* (VMX) and UNPCKHPS/UNPCKLPS (SSE).)
proposal: vec_merge_optab
  v4t_z = VMERGE_EXPR (v4t_x, v4t_y, v4_indx_mask)
alternative: vec_perm_select_optab
  v4t_z = VPERM_EXPR (v8t_x, v4_indx_mask)
comments: or both alternatives
operation: Operation that involves permutation: Misaligned load. Assumption: a load fetches N bytes from a "floor aligned" address. We use the VMERGE_EXPR to perform the following:
proposal: misalign_lo_mask_optab
  vx = MODIFY_EXPR (*addr)
  comments: rtl_expand - trivial (in VMX the MISALIGNED_LO_MASK_EXPR will be expanded to lvsl/r for big/little-endian).
alternative:
  vx = MODIFY_EXPR (*addr)
  comments: rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?). Q: OK to expose endianness at the tree level?

operation: Optimized misaligned load (software-pipelined loads)
proposal: misalign_hi_mask_optab
  vx = MODIFY_EXPR (*addr)
  comments: rtl_expand - trivial (in VMX the MISALIGNED_HI_MASK_EXPR will be expanded to lvsr/l for big/little-endian).
alternative:
  vx = MODIFY_EXPR (*addr)
  comments: rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?).

operation: Misaligned store
proposal: misalign_hi_mask_optab
  vx = MODIFY_EXPR (*addr)
  comments: rtl_expand - trivial
alternative:
  vx = MODIFY_EXPR (*addr)
  comments: rtl_expand - detect pattern, or generate suboptimal code (or rely on combine?).
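The realignment scheme assumed above can be sketched on plain byte arrays: since a load fetches N bytes from the floor-aligned address, a misaligned vector is rebuilt from two aligned loads plus a permute whose byte-index mask encodes the misalignment (this is the role of the proposed lo-mask; on VMX, lvsl + vperm). The helper below is an illustrative model, not GCC or VMX code.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 16 };  /* vector size in bytes, as in VMX */

static void realign_load(const uint8_t *addr, uint8_t *out)
{
    uintptr_t off = (uintptr_t)addr % N;
    const uint8_t *lo = addr - off;   /* first aligned load */
    const uint8_t *hi = lo + N;       /* second aligned load */
    for (unsigned i = 0; i < N; i++) {
        unsigned idx = off + i;       /* lo-mask indices: off .. off+N-1 */
        out[i] = (idx < N) ? lo[idx] : hi[idx - N];
    }
}
```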
operation: Demotion: pack n elements of type t into a vector of n elements of type ht. The demotion flavors we propose here use modulo or (signed/unsigned) saturating arithmetic. Example: vectorization of
proposal: 1. vec_pack_mod_optab 2. vec_pack_sat_optab 3. vec_pack_usat_optab
  1. v8t_z = VPACK_MOD_EXPR (v4dt_x, v4dt_y)
  comments: rtl_expand - trivial
alternative1: same vec_pack_*_optabs
alternative2: use scalar optabs:
comments: or both alternatives. rtl_expand - w.r.t. alternative1: if v8dt is not supported, but v8t is supported, this involves emulating the operation using smaller types (which should be doable).
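On scalars, the three proposed pack flavors behave as sketched below (t = int16_t demoted to ht = int8_t here; names are illustrative): modulo keeps the low bits, signed saturation clamps to [-128, 127], unsigned saturation clamps to [0, 255].

```c
#include <assert.h>
#include <stdint.h>

static int8_t pack_mod(int16_t v)
{
    return (int8_t)(uint8_t)v;   /* wraps modulo 2^8 (gcc-style wrap assumed) */
}

static int8_t pack_sat(int16_t v)
{
    return v > 127 ? 127 : v < -128 ? -128 : (int8_t)v;
}

static uint8_t pack_usat(int16_t v)
{
    return v > 255 ? 255 : v < 0 ? 0 : (uint8_t)v;
}
```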
operation: Promotion: unpack n elements of type t into n/2 elements of type dt. The promotion can use sign extension or zero extension. Example: vectorization of
proposal: vec_unpack_optab
  v4t_z = VUNPACK_EXPR (v8ht_x, v4_indx_mask)
  comments: rtl_expand - trivial
alternative: to cast each element of a vector (maintaining the same number of elements) we can reuse the same treecode used to cast scalars, i.e., CONVERT_EXPR:
  split into perm_select followed by the cast: v4ht_z = VPERM_EXPR (v8ht_x, v4_indx_mask) (or VMERGE_EXPR); v4t = CONVERT_EXPR s/u (v4ht)
  comments: rtl_expand - problematic for targets that support a single vector size.
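The alternative above can be modeled on arrays: a perm_select first picks 4 of the 8 narrow elements via the index mask, then each selected element is converted to the wide type. The sketch assumes ht = int8_t and t = int16_t and shows the signed (sign-extending) flavor; zero extension would start from uint8_t. Names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Select 4 of 8 narrow elements by index, then widen each (CONVERT_EXPR s). */
static void unpack_s(const int8_t *x8, const int *indx, int16_t *z4)
{
    for (int i = 0; i < 4; i++)
        z4[i] = (int16_t)x8[indx[i]];   /* sign-extends each element */
}
```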
operation: Operation that involves promotion: Fixed-point multiplication. Example: vectorization of
The semantics of VSEL_MULT_EXPR (v8t_x, v8t_y, v4_m):
Putting it all together during vectorization:
  if (smul_widen_optab[v8t]) { v8dt_z = MULT (v8t_x, v8t_y) }

scenario1:
proposal: smul_widen_optab, umul_widen_optab
  v8dt_z = MULT_EXPR (v8t_x, v8t_y)
  comments: rtl_expand - trivial (if target support is available)

scenario2:
proposal: smul_lo_optab, smul_hi_optab, umul_lo_optab, umul_hi_optab
  v4dt_z1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3})
  comments: rtl_expand - trivial
alternative: smul_lo_optab, smul_hi_optab; reusing already existing/proposed treecodes: v8dt_z = MULT_EXPR (v8t_x, v8t_y)
  comments: rtl_expand - when expanding each of the vperms, we need to analyze their uses and detect that it's a mult_hi/lo pattern. This should be doable. However, there's a problem expanding the MULT_EXPR (yet its def will be removed by DCE, since the vperms will be replaced by mult_hi/lo and will not use the def of the original mult).

scenario3:
proposal: smul_odd_optab, smul_even_optab, umul_odd_optab, umul_even_optab
  v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,2,4,6})
  comments: rtl_expand - trivial (in VMX will be expanded to:
alternative: smul_odd_optab, smul_even_optab; reusing already existing/proposed treecodes: v8dt_z = MULT_EXPR (v8t_x, v8t_y)
  comments: rtl_expand - when expanding each of the vperms, we need to analyze their uses and detect that it's a mult_odd/even pattern. This should be doable. However, there's a problem expanding the MULT_EXPR (yet its def will be removed by DCE, since the vperms will be replaced by mult_odd/even and will not use the def of the original mult).
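Scenario 3 can be sketched on scalars as follows: mul_even produces the 4 wide products of elements 0,2,4,6 and mul_odd those of elements 1,3,5,7; together they cover the full widening multiply of 8 narrow elements. The sketch assumes t = int16_t and dt = int32_t; the helper names are illustrative, not VMX or GCC identifiers.

```c
#include <assert.h>
#include <stdint.h>

static void mul_even(const int16_t *x, const int16_t *y, int32_t *z)
{
    for (int i = 0; i < 4; i++)
        z[i] = (int32_t)x[2 * i] * y[2 * i];       /* elements 0,2,4,6 */
}

static void mul_odd(const int16_t *x, const int16_t *y, int32_t *z)
{
    for (int i = 0; i < 4; i++)
        z[i] = (int32_t)x[2 * i + 1] * y[2 * i + 1]; /* elements 1,3,5,7 */
}
```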
operation: Fixed-point multiplication, take high part. Example: vectorization of
proposal: smul_highpart_optab, umul_highpart_optab
  v8t_z = MULT_HI_EXPR (v8t_x, v8t_y)
  comments: rtl_expand - trivial (in VMX will be expanded to: m = {0,8,2,10,4,12,6,14}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);)
alternative: 1. if we have s/umul_hi/lo_optab 2. if we have s/umul_even/odd_optab
  1. v4dt_tmp1 = VSEL_MULT_EXPR (v8t_x, v8t_y, {0,1,2,3})
  comments: rtl_expand - simple (may need to detect pattern).
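Per element, MULT_HI_EXPR computes the upper half of the double-width product; on scalars (t = int16_t, dt = int32_t assumed; the helper name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* High half of the widening product (arithmetic right shift assumed
   for the signed case, as on gcc). */
static int16_t mul_highpart_s(int16_t x, int16_t y)
{
    return (int16_t)(((int32_t)x * y) >> 16);
}
```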
operation: Fixed-point multiplication, take low part. Example: vectorization of
proposal: smul_optab, umul_optab
  v8t_z = MULT_EXPR (v8t_x, v8t_y)
  comments: rtl_expand - trivial (in VMX will be expanded to: m = {1,9,3,11,5,13,7,15}; tmp1 = vmule(x,y); tmp2 = vmulo(x,y); z1 = vperm(tmp1,tmp2,m);)
operation: Vector shift, in number of bytes. Note: this shifts the entire vector, not element-wise, so we can't use the scalar shift treecodes. (Similar functionality to vslo/vsro in VMX.)
proposal: vec_shift_left_optab, vec_shift_right_optab
  v = VSHIFT_RIGHT_EXPR (v, scalar_shft)
  comments: rtl_expand - trivial
alternative:
  v = VPERM_EXPR (v, v_indx_mask)
  comments: rtl_expand - not trivial to detect that this is a shift pattern (impossible if the shift amount is unknown at compile time). This will result in suboptimal code (prepare mask, and permute).
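Modeled on a byte array, the whole-vector shift moves every byte by the shift amount and zero-fills the vacated slots; equivalently, it is a VPERM_EXPR whose index mask is computed from the shift amount, which is why the permute alternative works but is costly when the amount is only known at run time. A sketch (the direction convention relative to memory order depends on endianness; N = 16 and the helper name are assumptions):

```c
#include <assert.h>
#include <stdint.h>

enum { N = 16 };

static void vshift_bytes(const uint8_t *v, unsigned shift, uint8_t *z)
{
    for (unsigned i = 0; i < N; i++)
        z[i] = (i + shift < N) ? v[i + shift] : 0;  /* zero fill */
}
```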
operation: Vector shift, in number of bits
proposal: TBD
alternative: TBD
Using ADD as an example; similar optabs/treecodes are proposed for SUB. Multiplication was discussed in the previous section. We also need to consider ADD_CARRY_EXPR.
operation: saturate
proposal: optab_sat (for scalars and vectors)
  v4t_z = SAT_EXPR (v4dt_x)

operation: short_z = short_x + short_y (modulo arithmetic)
proposal: optab_add
  v8t_z = ADD_EXPR (v8t_x, v8t_y)

operation: int_z = short_x + short_y
proposal: optab_add_widen
  v8dt_z = ADD_EXPR (v8t_x, v8t_y)
alternative: optab_add_widen, or if vec_unpack_optab
  v4dt_x1 = VUNPACK_EXPR (v8t_x, {0,1,2,3})

operation: short_z = u/s_SAT (short_x + short_y)
proposal: optab_add_ssat, optab_add_usat
  v8t_z = ADD_SAT_EXPR (v8t_x, v8t_y)
alternative: optab_add_ssat, optab_add_usat, or if optab_add_widen and optab_sat
  v8dt_tmp = ADD_EXPR (v8t_x, v8t_y)
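The last alternative (widen, add in the wide type, then saturate back) can be sketched per element as follows, assuming t = int16_t and dt = int32_t; the helper name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

static int16_t add_ssat(int16_t x, int16_t y)
{
    int32_t wide = (int32_t)x + y;        /* ADD_EXPR on the wide type */
    if (wide > INT16_MAX) wide = INT16_MAX;  /* SAT_EXPR back to t */
    if (wide < INT16_MIN) wide = INT16_MIN;
    return (int16_t)wide;
}
```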