This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: [RFC] optabs and tree-codes for vector operations


On Fri, Aug 20, 2004 at 08:27:49AM +0300, Dorit Naishlos wrote:
> <2> permutation of elements extracted from a vector, using a permutation
> mask:
> vmt_a = vperm1 (vnt_x, vm_indx_mask) (in which m <= n; the equivalent of
> the RTL "vec_select")
> * optab (new): vec_perm_optab
> * tree-code (new): VEC_PERM_EXPR
> 
> <3> same as above, but the elements are extracted from two vectors:
> vmt_a = vperm2 (vmt_x, vmt_y, vm_indx_mask)
> 
> <4> concatenate two vectors:
> v8t_a = concat (v4t_x, v4t_y)

I would prefer <3> over <2>/<4>.  <2> is implementable as
vperm2(x,dontcare,m) with the mask m restricted to index only the
first input.  <4> is generally not available as a single instruction
on any CPU, which makes it dangerous to leave as a separate operation.
It could be mistakenly CSEd or something, leaving us with something
difficult to emulate when it comes time to generate rtl.

> ... but most of the cases in which we're going to need the permute
> operation are cases where we want to shuffle elements from two input
> vectors ...

More argument for <3>.

> Actually, for the sake of the misalignment support, we'd actually want to
> introduce two optabs for permute:
> 
> <3.1> a permute that takes only INTEGER_CST as the elements of the
> permutation mask (vperm_imm_optab/vperm_cst_optab).
> <3.2> a permute that takes a register as the mask
> (vperm_optab/vperm_parametric_optab)

Normally we don't put the same operation in two different optabs.
I can tell, however, that you want to use the different optabs to
decide how to implement misalignment.  Lemme think about this...

> <5> load from an aligned address. I think we can assume that all platforms
> have aligned loads, and just use a MODIFY_EXPR to represent those.
> 
> <6> load from an unaligned address (e.g, movdqu in sse):
> * optab (new): load_u_optab
> * tree-code (new): LOAD_U_EXPR

If dannyb's alignment data can somehow make it through to the rtl
level, we can use MODIFY_EXPR for this case too.  The rtl expander
can simply look at the known alignment and choose the correct 
instruction to emit.

> <7> load from a "floor aligned" address (i.e, drop low address bits, as in
> ldvx (altivec), movdqa (sse), ldq_u (alpha)):
> * optab (new): load_u_floor_optab / load_u_low_addr_optab / better name?
> * tree-code (new): LOAD_U_FLOOR_EXPR

Actually, the sse movdqa does not drop low bits -- it traps.

Problem here is that we'll need something for store as well.
I wonder if a better solution is an 'r' class reference that
indicates that low bits should be dropped.  So we build

	MODIFY_EXPR<ALIGN_REF<obj[index]>, value>

for a store, and the opposite for a load.  Alternately, something
more restricted like ALIGN_INDIRECT_REF, which only takes a 
pointer, rather than an entire complex lvalue.

> <8> load low/high (e.g. ldl/ldr (mips), ldqu+extql/ldqu+extqh (alpha)).
> * optabs (new): load_u_low_optab + load_u_hi_optab
> * tree-codes (new): LOAD_U_LOW_EXPR + LOAD_U_HI_EXPR

I would not consider this a separate case -- the two operations
are not really separable, so I'd just consider them as <6>.  But
it turns out that MIPS *did* add new instructions, putting it in
class <7>.  So I think we can ignore this option entirely.

> Also, we'd need to introduce:
> 
> <9> compute from an address the input to a vperm instruction (as in
> altivec's lvsl/lvsr):
> * optabs (new): misalign_lo_mask_optab + misalign_hi_mask_optab
> * tree-code (new): MISALIGN_LO_MASK_EXPR + MISALIGN_HI_MASK_EXPR
> 
> <10> compute from an address the input to a vor instruction (as in alpha's
> extql/extqh):
> * optabs (new): misalign_lo_part_optab + misalign_hi_part_optab
> * tree-code (new): MISALIGN_LO_PART_EXPR + MISALIGN_HI_PART_EXPR

I'm not sure that we should add all of these.  Especially wrt the
"part" versions, you won't ever use them separately.  Nor, since
we're still at such a high level, do you care about representing
each and every instruction for the scheduler.

I'm also thinking that perhaps the "mask" versions should not use
an optab at all, but rather a builtin function.  The reason here
is that the form of the mask differs between systems.  For Altivec,
the mask is a complete 16-byte vector.  For SPE, the mask is a CCmode
value.  If we have this as a builtin, then the target can easily
influence the return type of the function, and thus the type of the
variable that we create for the loop.

Back to the "we're only going to use these together" argument, I
think perhaps a better representation might be

	vectr = REALIGN_LOAD_EXPR < vect1, vect2, magic >

where

	vect1, vect2:	Sequential values read from the array, however
			we decided to read them.

	magic:		Two cases, depending on whether or not the
			target supplies a mask creation builtin.

			If the builtin exists, this operand receives
			the value it generated from the initial address.

			If the builtin doesn't exist, this operand
			receives an address related to either vect1
			or vect2.  We assume that only the low bits
			matter here, so any value handy should work.
			Our object is to not increase register pressure
			more than necessary.

We'll need a similar REALIGN_STORE_EXPR.




r~

