This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [rfc] and [autovect patch] Vectorize reduction.


On Wed, Feb 23, 2005 at 03:17:59PM +0200, Dorit Naishlos wrote:
> (a) There are several options for expressing the reduction epilog code:
> 
> option 1: single stmt (new tree-code) that reduces the vector of partial
> results into a scalar register. This scheme exposes the least details at
> the tree-level and is probably the most general:
> [Example3]
>       a_5 = reduc_op <va_3>
> 
> option 2: one stmt (new tree-code) that reduces the vector of partial
> results into a scalar result that is placed in a vector register (at the
> first/last index of the vector); an additional stmt extracts the scalar
> result from the vector register into a scalar register. This scheme exposes
> more implementation details, and is less general (doesn't fit targets that
> directly support vector-reg-input-to-scalar-reg-output in one insn; but
> this rarely exists I think). Also this scheme implies that we need to
> assume in which index the scalar result is placed, or let each target tell
> us (via target hook?). It was easy to implement this one (both from the
> tree-level end and the target-specific end), which is the main reason why
> it is the one used now:
> [Example4]
>       va_4 = reduc_op <va_3>
>       a_5 = bit_field_ref_expr <va_4, offset>

I think that it's reasonable to assume that the value gets placed in 
the low order position.  That appears to be compatible with the 
operations in altivec, sse3, and mips3d.

One thing, though -- sse3 and mips3d both have 3 operand reductions
for addition.  That is

	hadds x, y, z
	->	z[0] = x[0] + x[1]
		z[1] = x[2] + x[3]
		z[2] = y[0] + y[1]
		z[3] = y[2] + y[3]

Now, implementing REDUC_PLUS_EXPR on these targets is easy enough,
with two sequential hadds instructions, but we're losing half of the
work done.  To get maximal performance, we need to perform your 
vectorization step, then unroll once, and then reduce *both* 
iterations at once.  That is,

	for (i = 0; i < n; i += 8)
	  vt1 += a[i], vt2 += a[i+4]
	t = reduc_plus(vt1, vt2)

and then we use three sequential hadds instructions to perform the
reduction.
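
As a scalar sketch of what those three hadds compute (just an
illustration of the arithmetic, not target intrinsics; the function
name is made up):

	/* Reduce two 4-element accumulators with three pairwise "hadd"
	   steps; each commented line mirrors one hadds as defined above.  */
	float
	hadd_reduce_2 (const float x[4], const float y[4])
	{
	  float t[4], u[4];
	  /* hadds t, x, y */
	  t[0] = x[0] + x[1];  t[1] = x[2] + x[3];
	  t[2] = y[0] + y[1];  t[3] = y[2] + y[3];
	  /* hadds u, t, t */
	  u[0] = t[0] + t[1];  u[1] = t[2] + t[3];
	  u[2] = t[0] + t[1];  u[3] = t[2] + t[3];
	  /* hadds r, u, u -- the low element now holds the full sum */
	  return u[0] + u[1];
	}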

The question, I suppose, is whether we want to define reduc_plus as
a one or two operand expression.  With your current code, we could
just make one operand a constant zero, which could be special-cased
in the rtl expansion.
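
That is, something like (sketch only, same notation as the examples
above; the zero vector operand would be folded away at expansion time):

	two accumulators:   t_6 = reduc_plus <vt1_4, vt2_5>
	one accumulator:    t_6 = reduc_plus <vt1_4, {0,0,0,0}>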

> option 4: Can always store the vector of partial results to memory, and
> then sequentially compute the reduction on the scalar elements.

No reason to go to memory here.  Just use bit_field_ref_expr to extract
the elements and then perform scalar operations.
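
For a four-element vector that epilog would be something like (sketch
only; offsets in bits, assuming 32-bit elements):

	a_6  = bit_field_ref_expr <va_3, 0>
	a_7  = bit_field_ref_expr <va_3, 32>
	a_8  = bit_field_ref_expr <va_3, 64>
	a_9  = bit_field_ref_expr <va_3, 96>
	a_10 = a_6 + a_7
	a_11 = a_8 + a_9
	a_5  = a_10 + a_11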

> option 5: all of the above:
>       if (support vector-reg-to-scalar-reg-in-one-stmt)
>         generate option 1
>       else if (support vector-reg-to-vector-reg-reduction)
>         generate option 2
>       else if (support whole-vector-shift)
>         generate option 3
>       else
>         generate option 4

I wouldn't do option 1, but otherwise, sure.

I suppose the big question is whether we should do this at the tree level,
or just leave this to optabs.c when we find reduc_plus isn't available.
It all depends on whether we expect to be able to do anything interesting
wrt optimization of the expansion at the tree level.

> (b) w.r.t how we use bit_field_ref_expr: the offset where the scalar result
> is expected to be placed is currently hard-coded (it is expected to be at
> offset 0), but probably should be left target specific, via a target hook?

Not until we need to do so.

> (c) vectorization of reduction changes the order of the computations, which
> in some cases may result in changing the semantics of the program.
> Currently we vectorize reduction only in the following cases:
> - case 1: we're operating on floats, and unsafe-math-optimizations is set.
> (ok? other flags that should be used?)

Correct.

> - case 2: we're operating on unsigned integers. (we assume wrapping
> arithmetic here. ok?)

Correct.

> - case 3: we're operating on signed integers, and wrapv is set. (ok? other
> flags that should be checked?)

Correct.

> - case 4 [experimental...]: we're operating on signed integers, and wrapv
> is not set, but the target supports saturating reduction. this is of course
> unsafe, and should be guarded by a flag ...

I don't think we should do this at all.  I know you're looking for how
to use the altivec vsumsws instruction, but I just don't think we should
be doing this.

You *can* use this instruction when the real data is 16 bits or less.
E.g.

	x			A B C D E F G H
	vupkhsh	 t1, x		  A   B   C   D
	vupklsh  t2, x		  E   F   G   H
	vxor     t3, t3, t3	              0
	vsumsws	 t3, t2, t3	           EFGH
	vsumsws	 r, t1, t3	       ABCDEFGH

because it's certain that reducing the half-words can't overflow the
word result, and thus we never saturate.
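
In scalar terms that sequence amounts to (illustrative model only, not
GCC code; the function name is made up):

	/* Widen each 16-bit element to 32 bits, then sum.  The total is
	   bounded by 8 * 32768 in magnitude, so the saturating word add
	   above can never actually saturate.  */
	int
	reduc_plus_hi_model (const short a[8])
	{
	  int i, sum = 0;
	  for (i = 0; i < 8; i++)
	    sum += a[i];
	  return sum;
	}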

You can fairly easily define patterns in the rs6000 backend that can
advertize "reduce plus hi" and "reduce plus qi" operations to take
advantage of this.

> We could use saturating reduction only when
> the original computation explicitly uses saturating arithmetic.

But signed saturating addition isn't associative, so we can't reorder
these operations.  There is no existing option that would allow this,
and I'm not sure that we should add one.  I suppose we could do so for
unsigned saturating addition.
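
For example, with signed 8-bit saturating addition (values clamped to
-128..127):

	(100 +sat 50) +sat -50  =  127 +sat -50  =  77
	100 +sat (50 +sat -50)  =  100 +sat 0    =  100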

But all that said, there's no way at present for the user to even write
a scalar saturating addition operation in any supported language.  And
I don't think we're likely to add one either, so the whole point is moot.

> (1) I'd like to be able to vectorize reduction on signed integers under
> no-wrapv,

This isn't a problem.  Without wrapv or trapv, the optimizer is free
to assume any behaviour that is most convenient.  Since we *know* that
all targets that do vector integer arithmetic actually perform modulo
arithmetic, we know that the reordered results will be the same as the
original.

> comment: another option, if we can't change the order of the computation,
> and the target supports a reduction operation that is guaranteed to
> preserve the order of the computation (does there exist such functionality
> in any target?),

Not that I'm aware of.  As mentioned above, we have horizontal 
additions that add adjacent pairs, but in no case do we first 
add with the accumulator.

> (d) So far I added REDUC_PLUS_EXPR as a new tree-code (and
> SAT_REDUC_PLUS_EXPR as described above), with the intention to add
> additional codes such as REDUC_MIN_EXPR, REDUC_MAX_EXPR etc. ok names?

Reasonable.

> (e) I added reduc_plus_optab (and sat_reduc_plus_optab). We may need
> separate optabs for the signed and unsigned cases (sreduc_plus_optab,
> ureduc_plus_optab ?).

You'd only need separate signed/unsigned optabs for saturating 
arithmetic.  But as described above, I don't think we should bother
with those at all.

> (f) I'm using replace_immediate_uses to replace uses of a_3 in [Example1]
> with uses of a_5 [Example2]. Is there a better way around this? (is there a
> way to replace the defining stmt of an ssa-name with a new defining stmt
> (for the same ssa-name), removing the original defining stmt?)

I'm not sure, but wouldn't it just be modifying SSA_NAME_DEF_STMT?  Diego?
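
Something along these lines, perhaps (rough, untested sketch; the
iterator variable is hypothetical):

	/* Point the SSA name at its new defining statement ...  */
	SSA_NAME_DEF_STMT (a_5) = new_stmt;
	/* ... and remove the old definition from its block.  */
	bsi_remove (&bsi_at_old_stmt);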


r~

