This is the mail archive of the mailing list for the GCC project.


Re: [rfc] and [autovect patch] Vectorize reduction.

Richard Henderson <> wrote on 23/02/2005 22:45:20:

> On Wed, Feb 23, 2005 at 03:17:59PM +0200, Dorit Naishlos wrote:
> > (a) There are several options for expressing the reduction epilog code:
> >
> > option 1: single stmt (new tree-code) that reduces the vector of
> > results into a scalar register. This scheme exposes the least details
> > to the tree level and is probably the most general:
> > [Example3]
> >       a_5 = reduc_op <va_3>
> >
> > option 2: one stmt (new tree-code) that reduces the vector of partial
> > results into a scalar result that is placed in a vector register (at
> > the first/last index of the vector); an additional stmt extracts the
> > scalar result from the vector register into a scalar register. This
> > scheme exposes more implementation details, and is less general (it
> > doesn't fit targets that directly support
> > vector-reg-input-to-scalar-reg-output in one insn; but such an insn
> > rarely exists, I think). Also this scheme implies that we need to
> > assume which index the scalar result is placed at, or let each target
> > tell us (via a target hook?). It was easy to implement this one (both
> > from the tree-level end and the target-specific end), which is the
> > main reason it is the one used now:
> > [Example4]
> >       va_4 = reduc_op <va_3>
> >       a_5 = bit_field_ref_expr <va_4, offset>
> I think that it's reasonable to assume that the value gets placed in
> the low order position.  That appears to be compatible with the
> operations in altivec, sse3, and mips3d.

except for altivec's vsumsws, which places the result in the high order
position.

> One thing, though -- sse3 and mips3d both have 3 operand reductions
> for addition.  That is
>    hadds x, y, z
>    ->   z[0] = x[0] + x[1]
>         z[1] = x[2] + x[3]
>         z[2] = y[0] + y[1]
>         z[3] = y[2] + y[3]
> Now, implementing REDUC_PLUS_EXPR on these targets is easy enough,
> with two sequential hadds instructions, but we're losing half of the
> work done.  To get maximal performance, we need to perform your
> vectorization step, then unroll once, and then reduce *both*
> iterations at once.  That is,
>    for (i = 0; i < n; i += 8)
>      vt1 += a[i], vt2 += a[i+4]
>    t = reduc_plus(vt1, vt2)

by the way - this is unrolling + modulo-variable-expansion (the
optimization contributed by Revital Eres a few months ago; it is not
enabled by default). unrolling alone would use only one accumulator,
rather than two. using both -funroll-loops and
-fvariable-expansion-in-unroller we currently get:

  for (i = 0; i < n; i += 8)
    vt1 += a[i], vt2 += a[i+4]
  vt1 = vt1 + vt2
  t = reduc_plus(vt1)

(can also use --param max-variable-expansions-in-unroller=n to use n
accumulators rather than just two).

> and then we use three sequential hadds instructions to perform the
> reduction.

yes, if we detect (during combine?) that the following sequence -
  vt1 = vt1 + vt2
  t = reduc_plus(vt1,0)
is equivalent to:
  t = reduc_plus(vt1,vt2)

(is it better than using a vector-add followed by two sequential hadds?)

> The question, I suppose, is whether we want to define reduc_plus as
> a one or two operand expression.  With your current code, we could
> just make one operand a constant zero, which could be special-cased
> in the rtl expansion.

yes. I guess we can leave that up to each target to decide?

> > option 4: Can always store the vector of partial results to memory, and
> > then sequentially compute the reduction on the scalar elements.
> No reason to go to memory here.  Just use bit_field_ref_expr to extract
> the elements and then perform scalar operations.

ah, right.

> > option 5: all of the above:
> >       if (support vector-reg-to-scalar-reg-in-one-stmt)
> >         generate option 1
> >       else if (support vector-reg-to-vector-reg-reduction)
> >         generate option 2
> >       else if (support whole-vector-shift)
> >         generate option 3
> >       else
> >         generate option 4
> I wouldn't do option 1, but otherwise, sure.
> I suppose the big question is whether we should do this at the tree level
> or just leave this to optabs.c when we find reduce_plus isn't available.
> It all depends on whether we expect to be able to do anything interesting
> wrt optimization of the expansion at the tree level.
> > (b) w.r.t. how we use bit_field_ref_expr: the offset where the scalar
> > result is expected to be placed is currently hard-coded (it is
> > expected to be at offset 0), but probably should be left target
> > specific, via a target hook.
> Not until we need to do so.

looks like I'll need it if I use vsumsws (to implement "reduce plus hi" and
"reduce plus qi", as you suggest below).

> > (c) vectorization of reduction changes the order of the computations,
> > which in some cases may result in changing the semantics of the
> > program. Currently we vectorize reduction only in the following cases:
> > - case 1: we're operating on floats, and unsafe-math-optimizations is
> > set. (ok? other flags that should be used?)
> Correct.
> > - case 2: we're operating on unsigned integers. (we assume wrapping
> > arithmetic here. ok?)
> Correct.
> > - case 3: we're operating on signed integers, and wrapv is set. (ok?
> > other flags that should be checked?)
> Correct.
> > - case 4 [experimental...]: we're operating on signed integers, and
> > wrapv is not set, but the target supports saturating reduction. this
> > is of course unsafe, and should be guarded by a flag ...
> I don't think we should do this at all.  I know you're looking for how
> to use the altivec vsumsws instruction, but I just don't think we should
> be doing this.
> You *can* use this instruction when the real data is 16 bits or less.
> E.g.
>    x         A B C D E F G H
>    x                      A B C D E F G H
>    vupkhsh  t1, x         A   B   C   D
>    vupklsh  t2, x         E   F   G   H
>    vxor     t3, t3                     0
>    vsumsws  t3, t2, t3              EFGH
>    vsumsws  r, t1, t3           ABCDEFGH
> because it's certain that reducing the half-words can't overflow the
> word result, and thus we never saturate.
> You can fairly easily define patterns in the rs6000 backend that can
> advertise "reduce plus hi" and "reduce plus qi" operations to take
> advantage of this.

thanks for the tip.
I guess I could also use:
 t3 = vxor(t3,t3)
 t1 = vsum4shs(x,t3)
 r = vsumsws(t1,t3)
(for signed "reduce plus hi". I'll have to use something different for the
unsigned case).

> > We could use saturating reduction only when
> > the original computation explicitly uses saturating arithmetic.
> But signed saturating addition isn't associative, so we can't reorder
> these operations.  There is no existing option that would allow this,
> and I'm not sure that we should add one.  I suppose we could do so for
> unsigned saturating addition.

ok, let's hold off on introducing saturating reduction until we find an
important application where it makes sense to use it.

> But all that said, there's no way at present for the user to even write
> a scalar saturating addition operation in any supported language. And
> I don't think we're likely to add one either, so the whole point is moot.

why? a user could write something like:
  short a, b;
  int s = a + b;
  short r = (s > 32767) ? 32767 :
                      ((s < -32768) ? -32768 : s);
which we could recognize as saturating addition, and vectorize.

> > (1) I'd like to be able to vectorize reduction on signed integers under
> > no-wrapv,
> This isn't a problem.  Without wrapv or trapv, the optimizer is free
> to assume any behaviour that is most convenient.  Since we *know* that
> all targets that do vector integer arithmetic actually perform modulo
> arithmetic, we know that the reordered results will be the same as the
> original.

so is the following correct for signed integers?
1) wrapv=0 trapv=0      can safely vectorize
2) wrapv=0 trapv=1      don't vectorize
                        (or, vectorize if unsafe-math-optimizations is set?)
3) wrapv=1 trapv=0      can safely vectorize
4) wrapv=1 trapv=1      can safely vectorize

> > comment: another option, if we can't change the order of the
> > computations, and the target supports a reduction operation that is
> > guaranteed to preserve the order of the computation (does there exist
> > such an operation in any target?),
> Not that I'm aware of.  As mentioned above, we have horizontal
> additions that add adjacent pairs, but in no case do we first
> add with the accumulator.

altivec's vsumsws seems to provide this functionality (but it's available
only for signed ints, and it also saturates).

> > (d) So far I added REDUC_PLUS_EXPR as a new tree-code (and
> > SAT_REDUC_PLUS_EXPR as described above), with the intention to add
> > additional codes such as REDUC_MIN_EXPR, REDUC_MAX_EXPR etc. ok names?
> Reasonable.
> > (e) I added reduc_plus_optab (and sat_reduc_plus_optab). We may need
> > separate optabs for the signed and unsigned cases (sreduc_plus_optab,
> > ureduc_plus_optab ?).
> You'd only need separate signed/unsigned optabs for saturating
> arithmetic.  But as described above, I don't think we should bother
> with those at all.

isn't the "reduce plus hi" using vsum4shs as mentioned above a case where I
need separate signed/unsigned optabs?

> > (f) I'm using replace_immediate_uses to replace uses of a_3
> > with uses of a_5 [Example2]. Is there a better way around this? (is
> > there a way to replace the defining stmt of an ssa-name with a new
> > defining stmt (for the same ssa-name), removing the original defining
> > stmt?)
> I'm not sure, but wouldn't it just be modifying SSA_NAME_DEF_STMT?

yes. the problem I had is that the original defining stmt is a phi-node,
and when I call remove_phi_node it releases the ssa_name that the phi-node
defines (so it can be reused). So what I need is a version of
remove_phi_node that doesn't release the ssa_name.

much thanks,


> r~
