This is the mail archive of the gcc-patches mailing list for the GCC project.
Re: [rfc] and [autovect patch] Vectorize reduction.
- From: Dorit Naishlos <DORIT at il dot ibm dot com>
- To: Richard Henderson <rth at redhat dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Mon, 28 Feb 2005 14:36:03 +0200
- Subject: Re: [rfc] and [autovect patch] Vectorize reduction.
Richard Henderson <firstname.lastname@example.org> wrote on 23/02/2005 22:45:20:
> On Wed, Feb 23, 2005 at 03:17:59PM +0200, Dorit Naishlos wrote:
> > (a) There are several options for expressing the reduction epilog code:
> > option 1: single stmt (new tree-code) that reduces the vector of
> > results into a scalar register. This scheme exposes the least details to
> > the tree-level and is probably the most general:
> > [Example3]
> > a_5 = reduc_op <va_3>
> > option 2: one stmt (new tree-code) that reduces the vector of partial
> > results into a scalar result that is placed in a vector register (at
> > first/last index of the vector); an additional stmt extracts the scalar
> > result from the vector register into a scalar register. This scheme
> > exposes more implementation details, and is less general (doesn't fit
> > targets that directly support vector-reg-input-to-scalar-reg-output in one
> > insn; but this rarely exists I think). Also this scheme implies that we
> > need to assume in which index the scalar result is placed, or let each
> > target tell us (via target hook?). It was easy to implement this one (both
> > from the tree-level end and the target-specific end), which is the main
> > reason it is the one used now:
> > [Example4]
> > va_4 = reduc_op <va_3>
> > a_5 = bit_field_ref_expr <va_4, offset>
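(for reference, here is a scalar C model of what this two-stmt epilog computes, assuming a 4-element vector and the scalar result at index 0; the names are illustrative only, not actual compiler code:)

```c
#include <stddef.h>

/* Scalar model of the option-2 epilog: reduc_op sums the vector of
   partial results, and bit_field_ref extracts the scalar from a
   fixed index (index 0 assumed here).  Illustrative only.  */
static int reduc_plus_model(const int va[4])
{
  int sum = 0;
  for (size_t i = 0; i < 4; i++)
    sum += va[i];        /* va_4 = reduc_op <va_3>        */
  return sum;            /* a_5 = bit_field_ref <va_4, 0> */
}
```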
> I think that it's reasonable to assume that the value gets placed in
> the low order position. That appears to be compatible with the
> operations in altivec, sse3, and mips3d.
except for altivec's vsumsws, which places the result in the high-order position.
> One thing, though -- sse3 and mips3d both have 3 operand reductions
> for addition. That is
> hadds x, y, z
> -> z1 = x1 + x2
>    z2 = x3 + x4
>    z3 = y1 + y2
>    z4 = y3 + y4
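(to spell those semantics out, a C model of a two-operand 4-wide horizontal add; names here are illustrative only:)

```c
/* C model of a two-operand horizontal add (sse3 haddps style):
   adjacent pairs within each input are summed, x filling the low
   half of the result and y the high half.  Illustrative only.  */
static void hadd4(const float x[4], const float y[4], float z[4])
{
  z[0] = x[0] + x[1];
  z[1] = x[2] + x[3];
  z[2] = y[0] + y[1];
  z[3] = y[2] + y[3];
}
```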
> Now, implementing REDUC_PLUS_EXPR on these targets is easy enough,
> with two sequential hadds instructions, but we're losing half of the
> work done. To get maximal performance, we need to perform your
> vectorization step, then unroll once, and then reduce *both*
> iterations at once. That is,
> for (i = 0; i < n; i += 8)
> vt1 += a[i], vt2 += a[i+4]
> t = reduc_plus(vt1, vt2)
by the way - this is unrolling + modulo-variable-expansion (the
optimization contributed by Revital Eres a few months ago; it is not
enabled by default). unrolling alone would use only one accumulator, rather
than two. using both -funroll-loops and -fvariable-expansion-in-unroller we
get:
for (i = 0; i < n; i += 8)
vt1 += a[i], vt2 += a[i+4]
vt1 = vt1 + vt2
t = reduc_plus(vt1)
(can also use --param max-variable-expansions-in-unroller=n to use n
accumulators rather than just two).
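(the scalar analogue of the two-accumulator idea, as a sketch; it assumes n is even, and the function name is made up for illustration:)

```c
/* Scalar analogue of unrolling + modulo variable expansion: two
   accumulators take alternating elements and are combined once
   after the loop, like vt1/vt2 above.  Assumes n is even.  */
static int sum_two_accumulators(const int *a, int n)
{
  int t1 = 0, t2 = 0;            /* the two "expanded" accumulators */
  for (int i = 0; i < n; i += 2) {
    t1 += a[i];
    t2 += a[i + 1];
  }
  return t1 + t2;                /* combine accumulators at the end */
}
```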
> and then we use three sequential hadds instructions to perform the reduction.
yes, if we detect (during combine?) that the following sequence -
vt1 = vt1 + vt2
t = reduc_plus(vt1,0)
is equivalent to:
t = reduc_plus(vt1,vt2)
(is it better than using a vector-add followed by two sequential hadds?)
> The question, I suppose, is whether we want to define reduc_plus as
> a one or two operand expression. With your current code, we could
> just make one operand a constant zero, which could be special-cased
> in the rtl expansion.
yes. I guess we can leave that up to each target to decide?
> > option 4: Can always store the vector of partial results to memory, and
> > then sequentially compute the reduction on the scalar elements.
> No reason to go to memory here. Just use bit_field_ref_expr to extract
> the elements and then perform scalar operations.
> > option 5: all of the above:
> > if (support vector-reg-to-scalar-reg-in-one-stmt)
> > generate option 1
> > else if (support vector-reg-to-vector-reg-reduction)
> > generate option 2
> > else if (support whole-vector-shift)
> > generate option 3
> > else
> > generate option 4
> I wouldn't do option 1, but otherwise, sure.
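(for completeness, a scalar sketch of what option 3's whole-vector-shift epilog would compute, assuming a 4-element vector: log2(VF) shift-and-add steps, with the result landing in element 0; a model only, not the actual implementation:)

```c
/* Model of a whole-vector-shift reduction (option 3): add the
   vector to a copy of itself shifted by half the width, repeating
   log2(VF) times; the scalar result ends up in element 0.  */
static int reduc_by_shifts(int v[4])
{
  v[0] += v[2];   /* shift by 2 elements, then add */
  v[1] += v[3];
  v[0] += v[1];   /* shift by 1 element, then add  */
  return v[0];
}
```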
> I suppose the big question is whether we should do this at the tree level
> or just leave this to optabs.c when we find reduce_plus isn't available.
> It all depends on whether we expect to be able to do anything interesting
> wrt optimization of the expansion at the tree level.
> > (b) w.r.t how we use bit_field_ref_expr: the offset where the scalar result
> > is expected to be placed is currently hard-coded (it is expected to be at
> > offset 0), but probably should be left target specific, via a target hook.
> Not until we need to do so.
looks like I'll need it if I use vsumsws (to implement "reduce plus hi" and
"reduce plus qi", as you suggest below).
> > (c) vectorization of reduction changes the order of the computations, which
> > in some cases may result in changing the semantics of the program.
> > Currently we vectorize reduction only in the following cases:
> > - case 1: we're operating on floats, and unsafe-math-optimizations is set.
> > (ok? other flags that should be used?)
> > - case 2: we're operating on unsigned integers. (we assume wrapping
> > arithmetic here. ok?)
> > - case 3: we're operating on signed integers, and wrapv is set. (ok? other
> > flags that should be checked?)
> > - case 4 [experimental...]: we're operating on signed integers, and wrapv
> > is not set, but the target supports saturating reduction. this is of course
> > unsafe, and should be guarded by a flag ...
> I don't think we should do this at all. I know you're looking for how
> to use the altivec vsumsws instruction, but I just don't think we should
> be doing this.
> You *can* use this instruction when the real data is 16 bits or less.
> x                      = A B C D E F G H
> vupkhsh t1, x       t1 = A B C D
> vupklsh t2, x       t2 = E F G H
> vxor    t3, t3, t3  t3 = 0
> vsumsws t3, t2, t3  t3 = E+F+G+H
> vsumsws r,  t1, t3  r  = A+B+C+D+E+F+G+H
> because it's certain that reducing the half-words can't overflow the
> word result, and thus we never saturate.
> You can fairly easily define patterns in the rs6000 backend that can
> advertize "reduce plus hi" and "reduce plus qi" operations to take
> advantage of this.
thanks for the tip.
I guess I could also use:
t3 = vxor(t3,t3)
t1 = vsum4shs(x,t3)
r = vsumsws(t1,t3)
(for signed "reduce plus hi"; I'll have to use something different for the other cases.)
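(in C terms, both sequences compute a widened sum that cannot saturate; here is a model of that, with an illustrative name:)

```c
#include <stdint.h>

/* Model of "reduce plus hi": sign-extend eight 16-bit lanes to 32
   bits and sum them.  The 32-bit accumulator cannot saturate,
   since 8 * 32767 = 262136 is far below INT32_MAX.  Illustrative
   only; the real thing is the unpack + vsumsws sequence above.  */
static int32_t reduc_plus_hi_model(const int16_t x[8])
{
  int32_t sum = 0;
  for (int i = 0; i < 8; i++)
    sum += x[i];       /* widened add: no saturation possible */
  return sum;
}
```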
> > We could use saturating reduction only when
> > the original computation explicitly uses saturating arithmetic.
> But signed saturating addition isn't associative, so we can't reorder
> these operations. There is no existing option that would allow this,
> and I'm not sure that we should add one. I suppose we could do so for
> unsigned saturating addition.
ok, let's wait with introducing saturating reduction until we find an
important application where it makes sense to use it.
> But all that said, there's no way at present for the user to even write
> a scalar saturating addition operation in any supported language. And
> I don't think we're likely to add one either, so the whole point is moot.
why? a user could write something like:
int s = a + b;
short r = (s > 32767) ? 32767 :
((s < -32768) ? -32768 : s);
which we could recognize as saturating addition, and vectorize.
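(the same idiom as a complete function, with the clamp bounds written via the limits.h constants; a sketch only, the name is made up:)

```c
#include <limits.h>

/* Source-level signed saturating addition on 16-bit values: widen
   to int, then clamp to [SHRT_MIN, SHRT_MAX].  A vectorizer could
   pattern-match this shape as a saturating add.  Sketch only.  */
static short sat_add_hi(short a, short b)
{
  int s = (int) a + (int) b;     /* widened add, no overflow   */
  if (s > SHRT_MAX) return SHRT_MAX;
  if (s < SHRT_MIN) return SHRT_MIN;
  return (short) s;
}
```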
> > (1) I'd like to be able to vectorize reduction on signed integers under
> > no-wrapv,
> This isn't a problem. Without wrapv or trapv, the optimizer is free
> to assume any behaviour that is most convenient. Since we *know* that
> all targets that do vector integer arithmetic actually perform modulo
> arithmetic, we know that the reordered results will be the same as the original.
so is the following correct for signed integers?
1) wrapv=0 trapv=0 can safely vectorize
2) wrapv=0 trapv=1 don't vectorize
or, vectorize if unsafe-math-optimizations is set?
3) wrapv=1 trapv=0 can safely vectorize
4) wrapv=1 trapv=1 can safely vectorize
> > comment: another option, if we can't change the order of the computations,
> > and the target supports a reduction operation that is guaranteed to
> > preserve the order of the computation (does there exist such an operation
> > in any target?),
> Not that I'm aware of. As mentioned above, we have horizontal
> additions that add adjacent pairs, but in no case do we first
> add with the accumulator.
altivec's vsumsws seems to provide this functionality (but it's available
only for signed ints, and it also saturates).
> > (d) So far I added REDUC_PLUS_EXPR as a new tree-code (and
> > SAT_REDUC_PLUS_EXPR as described above), with the intention to add
> > additional codes such as REDUC_MIN_EXPR, REDUC_MAX_EXPR etc. ok names?
> > (e) I added reduc_plus_optab (and sat_reduc_plus_optab). We may need
> > separate optabs for the signed and unsigned cases (sreduc_plus_optab,
> > ureduc_plus_optab ?).
> You'd only need separate signed/unsigned optabs for saturating
> arithmetic. But as described above, I don't think we should bother
> with those at all.
isn't the "reduce plus hi" using vsum4shs as mentioned above a case where I
need separate signed/unsigned optabs?
> > (f) I'm using replace_immediate_uses to replace uses of a_3
> > with uses of a_5 [Example2]. Is there a better way around this? (is there
> > a way to replace the defining stmt of an ssa-name with a new defining stmt
> > (for the same ssa-name), removing the original defining stmt?)
> I'm not sure, but wouldn't it just be modifying SSA_NAME_DEF_STMT?
yes. the problem I had is that the original defining stmt is a phi-node,
and when I call remove_phi_node it releases the ssa_name that the phi-node
defines (so it can be reused). So what I need is a version of
remove_phi_node that doesn't release the ssa_name.