This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC] optabs and tree-codes for vector operations


Thanks very much for your comments, Richard, Jim.
I'm sorry I didn't get to respond earlier.

I accept your proposal for handling conditional operations using
VEC_COND_EXPR instead of VSELECT.

I think we're generally in agreement with respect to most of the operations
- multiplication, saturation, demotion. Promotion we'll have to think
about.

I'd like to focus on the permutation and misalignment operations. The
terminology I used for the suggested tree-code/optabs created some
confusion due to conflicts with RTL names. Jim, thanks for pointing that
out. Lets consider the following:

<1> extraction of elements from two vectors according to a 0/1 mask:
vnt_a = vmerge (vnt_x, vnt_y, vn_0/1_mask) (the equivalent of the RTL
"vec_merge")
* optab (new): vec_merge_optab
* tree-code (new): VEC_MERGE_EXPR

<2> permutation of elements extracted from a vector, using a permutation
mask:
vmt_a = vperm1 (vnt_x, vm_indx_mask) (in which m <= n; the equivalent of
the RTL "vec_select")
* optab (new): vec_perm_optab
* tree-code (new): VEC_PERM_EXPR

<3> same as above, but the elements are extracted from two vectors:
vmt_a = vperm2 (vmt_x, vmt_y, vm_indx_mask)

<4> concatenate two vectors:
v8t_a = concat (v4t_x, v4t_y)

Between options <2> and <3>: <2> is more general, but most of the cases in
which we're going to need the permute operation are cases where we want to
shuffle elements from two input vectors (e.g, misaligned loads, fixed-point
multiplication). This means that most of the time we'll end up generating a
sequence of two tree stmts {concat + perm1}, and rely on RTL level combine
to detect the pattern.

We could consider to introduce both forms. In this case, we could name them
vperm1 and vperm2, or name one vperm and the other vshuffle. If we have to
choose, it would probably make more sense to introduce <2> and <4>, and
consider adding <3> later on if needed.

Actually, for the sake of the misalignment support, we'd actually want to
introduce two optabs for permute:

<3.1> a permute that takes only INTEGER_CST as the elements of the
permutation mask (vperm_imm_optab/vperm_cst_optab).
<3.2> a permute that takes a register as the mask
(vperm_optab/vperm_parametric_optab)

(and similarly for <2>).

Jim, you're right that it is not required to have all of the above
operations. If we have <1> we don't need <2>/<3>, and vice versa. But in
some cases this will not allow to generate the most optimal code (e.g,
misaligned stores case), and anyway these operations already exist in the
RTL. I feel we should introduce both <1> and <2>/<3>.


Now, with respect to misaligned loads. Thanks Richard for the survey of the
forms of misaligned loads on the different platforms. I think we can
classify the different loads as follows:

<5> load from an aligned address. I think we can assume that all platforms
have aligned loads, and just use a MODIFY_EXPR to represent those.

<6> load from an unaligned address (e.g, movdqu in sse):
* optab (new): load_u_optab
* tree-code (new): LOAD_U_EXPR

<7> load from a "floor aligned" address (i.e, drop low address bits, as in
ldvx (altivec), movdqa (sse), ldq_u (alpha)):
* optab (new): load_u_floor_optab / load_u_low_addr_optab / better name?
* tree-code (new): LOAD_U_FLOOR_EXPR

<8> load low/high (e.g. ldl/ldr (mips), ldqu+extql/ldqu+extqh (alpha)).
* optabs (new): load_u_low_optab + load_u_hi_optab
* tree-codes (new): LOAD_U_LOW_EXPR + LOAD_U_HI_EXPR

Also, we'd need to introduce:

<9> compute from an address the input to a vperm instruction (as in
altivec's lvsl/lvsr):
* optabs (new): misalign_lo_mask_optab + misalign_hi_mask_optab
* tree-code (new): MISALIGN_LO_MASK_EXPR + MISALIGN_HI_MASK_EXPR

<10> compute from an address the input to a vor instruction (as in alpha's
extql/extqh):
* optabs (new): misalign_lo_part_optab + misalign_hi_part_optab
* tree-code (new): MISALIGN_LO_PART_EXPR + MISALIGN_HI_PART_EXPR

We could represent <9> and <10> using a sequence of existing trees, but
then RTL combine will have to be able to detect the pattern, which may not
always be possible.

Given these optabs/tree-codes, we may be able avoid resorting to target
hooks:

case 1: an address is known to be aligned: trivial.

case 2: an address is known to be misaligned (and the misalignment amount
is a known constant):
For simple scheme: choose between using <6>, or <7> + <3.1>, or <8> (if
available)
For software-pipelined scheme: choose between using <7> + <3.1>, or <7> +
<10> (if available)

case 3: the alignment of the address is unknown:
For simple scheme: choose between using <6>, or <7> + <3.2>, or <8> (if
available)
For software-pipelined scheme: use <7> + <3.2> (if available)

Makes sense?

thanks

dorit



|---------+---------------------------->
|         |           Richard Henderson|
|         |           <rth@redhat.com> |
|         |                            |
|         |           15/08/2004 09:57 |
|---------+---------------------------->
  >---------------------------------------------------------------------------------------------------------------|
  |                                                                                                               |
  |       To:       James E Wilson <wilson@specifixinc.com>                                                       |
  |       cc:       Dorit Naishlos/Haifa/IBM@IBMIL, Devang Patel <dpatel@apple.com>, gcc@gcc.gnu.org              |
  |       Subject:  Re: [RFC] optabs and tree-codes for vector operations                                         |
  >---------------------------------------------------------------------------------------------------------------|




Jim writes:
> First a general comment, that we need better definitions of what the
> existing vector operations do.  They are ambiguous as to endianness, and
> some targets defining them differently for different endians, and some
> don't.

Indeed.  As for the specific question of endianness, we re-vamped
SUBREG to follow the memory layout.  It seems a definite mistake
to not do the same with vectors.

Dorit writes:
>      * In some cases we need to express both signed and unsigned
comparisons.
>        Q: Do the optabs below (gt, lt, etc), which are already present in
GCC,
>        represent signed or unsigned comparison?

The rtl code GT is signed, GTU is unsigned.

Dorit writes:
>      * Compare insns (ne/gt/ge...) compare two operands and produce a
single
>        0/1 result. In the vector form, it compares each pair of elements,
and
>        produce a 0/1 def for each. The result in a 0/1 mask (a "pure
SIMD").

Defining a specific result for a compare insn is going to be a mistake.
We have cases:

             (1) Altivec, x86, ia64 results in a vector of -1 for true, so

                         { 1,2,1,2 } < { 2,1,2,1 } = { -1, 0, -1, 0 }

             (2) alpha results in a packed set of bits, so

                         { 1,2,1,2,1,2,1,2 } < { 2,1,2,1,2,1,2,1 } = 0xAA

             (3) mips, apparently, results in a series of flags registers.

Which are different enough that I don't think we should represent the
result of the vector comparison at the tree level at all.  Fortunately
I think we may be able to get around this.

Dorit writes:
>      * Does it make sense to try to use the existing cmp_optab,
ucmp_optab
>        here?

Maybe.  See later.

Dorit writes:
>      * The vector select operation selects n elements out of 2*n
elements,
>        organized in two vectors, according to a 0/1 mask (which is a
result
>        of a compare operation).

As Jim says, "select" is probably a bad name for this.  More descriptive,
I think, would be VEC_COND_EXPR.  If COND_EXPR is a conditional move of
one value, VEC_COND_EXPR would be a conditional move of several values.

So we'd have

  a = VEC_COND_EXPR <(y < z), b, c>

Assume a[i] accesses vector element i.  Then this would mean

  forall i: a[i] = y[i] < z[i] ? b[i] : c[i]

The important consideration here is that we don't care how the target
implments the comparison, because we never expose it.

As for the rtl-level representation, see how mov<mode>cc is
implemented.  I think that's a better model (as it uses only
one named pattern) than conditional branches (which use hordes
of named patterns).

Dorit writes:
>    Targets that don't directly support a select,
>    can define the following expansion for it
>    (to allow vectorization using VSELECTS):
>    (va & vc) | (vb &! vc).

Note that this definition required -1 masks, as x86 generates, not
the 0/1 value that you described.

Dorit writes:
> Data Permutations
>
>    These include permute operations, as well as higher-level operations
(like
>    handling of misaligned accesses) that use them. Because type
conversions
>    between vectors often require data permutation, we also address here
type
>    conversions.
>    Some of the operations below change the order of the elements, but
maintain
>    the same type and number of elements. e.g: v8t = vperm (v8t,
v8_indx_mask)
>    Some operations change the number of elements: e.g, v4t = vperm_sel
(v8t,
>    v4_indx_mask), v4t = vperm_sel (v4t, v4t, v4_indx_mask)
>    Some change the type: v8t = vpack (v4dt, v4dt)
>    And some change both type and number of elements: v4t = vunpack (v8ht,
>    v4_indx_mask)

The most important thing to notice about this section is to notice
that a good many of these operations require immediate values to the
instructions.  Of the implementations with which I am familiar, only
the Altivec vperm allows arbitrary, variable, permutation.  For x86,
mips and spe, we either have different instructions for "merge high
and low parts", or something akin to the Altivec vperm, but accepts
only an immediate constant in the instruction.

In practice I don't think this is going to make that much difference
to you, since I expect the forms of permutations desired by the
vectorization code to be quite limited.

You've brought up two different operations you want to perform here;
let me treat them separately.

Dorit writes:
>    Misaligned load
>
>    Assumption: a load fetches N bytes from "floor aligned" address.

Bad assumption.  Altivec does this, but not x86 or MIPS.  Alpha
has one instruction that does and one that generates a bus error.

Jim wrote:
> misalignment can already be handled with shifts or rotate.  It isn't
> clear that we need a special operator for this.

Actully, Jim, it can't, since at least x86 doesn't have a shift that
applies to the entire 128-bit quantity.  Only shifts that apply to
the sub-elements.  Further, at least Altivec -- and probably most
others -- can do much better than shift/rotate if they're given
enough context to know that this is indeed an unaligned load.

This alignment thing is a particularly sticky wicket.

For Altivec, you already know the optimial (only?) way to do the
unaligned load, as that's clearly the model you're espousing in
your document.  But for everyone else, the "lvsl" instruction
computes, from the address, an input to the "vperm" instruction
that performs the byte permutation required to extract the proper
aligned value.

For x86 SSE, there doesn't appear to be a particularly good way to
software-pipeline the misaligned load.  Or even implement it at all,
as there's no whole-vector-shift or shuffle that takes a register
rather than an immediate.  On the other hand, SSE *does* have a
special unaligned move instruction, which will perform a complete
128-bit load or store from an unaligned address.  Intel seems lothe
to document what the performance impact of using "movdqu" (unaligned)
vs "movdqa" (aligned).  I suppose if you knew that the array were
misaligned to a constant offset mod 16 you could generate some
special-case code to shuffle around values and avoid the unaligned
load, but you couldn't do it without that known constant offset,
and you certainly wouldn't want to generate 4 or 8 code path variants
for each of the possibilities.

For Alpha, unaligned loads are performed by using a special load
that does drop the low 3 bits (as with Altivec), but then the
merge operation is performed by using special mask-and-shift
instructions that take the address as input.  E.g. an unaligned
v4hi load would be

             ldq_u             t1, 0(addr)
             ldq_u             t2, 7(addr)
             extql             t1, addr, t1
             extqh             t2, addr, t2
             or          t1, t2, result

Note that you cannot reduce this to one load per loop iteration
unless you know for certain that ADDR % 8 != 0.  In the case that
ADDR happens to be aligned, T1 == T2, the EXTQ[LH] instructions
do nothing, so the OR turns into a copy.  Other workarounds to move
the first load out of the loop require adding extra instructions
which would simply increase the latency of the entire sequence and
gain nothing.  Of course, if we *know* that the aligned case won't
happen, because we've handled it in another code path, then certainly
we'd like to software-pipeline the two loads.

I'm not sure where to find the MIPS SIMD stuff that's been talked
about on the list recently, but I suspect that there are no new
unaligned access instructions over the base architecture, and so
an unaligned load would be

             ldl         tmp, 0(addr)
             ldr         tmp, 0(addr)

into a general register, and then a move to the approprate simd
register.  This is never pipelinable, since there's no overlap
in the bytes accessed.  Of course, you could do the shift/mask
thing here instead; I have no idea if that's more efficient for
current mips implementations.

For ia64, there appears to be no special unaligned load support.
You're supposed to do the shift/mask thing yourself, I guess.
There is a shift-register-pair, but the count has to be immediate,
and thus the misalignment must be known beforehand.

I'm not sure what the best thing to do here is.  I really can't
think of anything that works well generically.  Perhaps we resort
to a target hook to generate code sequences...

Dorit writes:
>    Demotion:
>    Pack  n elements of type t into a vector of n elements of type ht. The
>    demotion flavors we propose here use modulo or (signed/unsigned)
>    saturating arithmetic.
...
>    alternative2: use scalar optabs:
>    1. convert_optab (is this equivalent to casting?)
>    2. sat_optab
>    3. usat_optab
> 1. v8t_z = VPACK_MOD_EXPR (v4dt_x, v4dt_y)
> 2,3. v8t_z = VPACK_SAT_EXPR (v4dt_x, v4dt_y)

This is the alternative I prefer.  It seems to match the available
instructions the best.  Note that not all saturation/modulo alternatives
are available on all systems, so expanding these will be entertaining.

For instance, Alpha has modulo packing only.  x86 does not have
modulo packing, but does have signed and unsigned saturating packing.
ia64 doesn't have modulo packing, has signed and unsigned HI->QI
packing, but only signed SI->HI packing.  Ho hum...

Dorit writes:
>    comments: rtl_expand - trivial comments: or both alternatives.
...
>    example: vectorization of
>    int_z = (int) short_x
>    (Similar functionality to vupk* in VMX)
>    vec_unpack_optab to cast each element of a vector (maintaining the
same
>    number of elements) we can reuse the same tree-code used to cast
scalars,
>    i.e CONVERT_EXPR:
> v4t_z = VUNPACK_EXPR (v8ht_x, v4_indx_mask)

For this... I dunno.  Alpha only unpacks from the low component.  x86 and
ia64 can unpack from either the high or the low half of the register (and
in an odd way, I think, wherein you're supposed to do your sign or zero
extension yourself, I'm not sure).  I see that Altivec has the high/low
thing as well, though apparently only sign extension.

I think it might be best to have VUNPACK_{HIGH,LOW}_EXPR.  That leaves
optimizers less free to screw up and give indx_mask something other than
the constants zero or one.

Dorit wrote:
>    Fixed-Point Multiplication:
>    widen product.

Another tricky one.

Altivec has widening even/odd signed/unsigned QI/HI multiplies.

x86 has signed HI->SI widening multiply, with add pair across (!),
unsigned SI->DI widening multiply, unsigned HI multiply highpart,
signed HI multiply lowpart, signed HI multiply highpart.  An odd
mix of widening and highpart/lowpart.

ia64 has signed/unsigned HI multiply high/lowpart.

Jim wrote:
> It isn't clear that we need MULT_HI_EXPR.

I do like the idea of MULT_HIGH_EXPR and MULT_WIDE_EXPR that
return the highpart or widened result, respectively.  I think we
should have access to these as scalar operations as well.  We
like to be able to use them in implementing double-word arithmetic
(e.g. longlong.h).  At present we resort to inline assembly.  It
would be much cooler to have builtins that expand to these tree
codes right away.

Jim wrote:
> I think VSEL_MULT_EXPR is an unnecessary complication.  Similarly the
> lo/hi/odd/even stuff.  This can all be represented with a permutation
> and a multiply.

What do you mean?  Certainly you could represent the operations
at the tree level with nested variants of these tree codes.  I
think you'd be hard-pressed to generate the available hardware
instructions if you represented the operations this way.

Dorit writes:
> Saturation (and friends)
...
>    short_z = u/s_SAT (short_x + short_y) optab_add_ssat
>    optab_add_usat optab_add_ssat, optab_add_usat
>    or if optab_add_widen and optab_sat
> v8t_z = ADD_SAT_EXPR (v8t_x, v8t_y)

I think this makes the most sense for the systems that support
the combined operation.  As noted, fall back to VECT_UNPACK
plus ADD_EXPR plus VECT_SAT_PACK if those are all supported.

I know we've got some clever bit manipulation things to
implement a vector ADD_EXPR without an actual vector addition
instruction; I don't know if there's similar cleverness that
can implement add-with-saturation.


Comments?



r~




Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]