[PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

Tue Nov 5 10:03:00 GMT 2013

On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> >> index 2a5a2e1..8f5d39a 100644
> >> --- a/gcc/doc/md.texi
> >> +++ b/gcc/doc/md.texi
> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
> >> Operand 3 is of a mode equal or
> >>  wider than the mode of the product. The result is placed in operand 0, which
> >>  is of the same mode as operand 3.
> >>
> >> +@cindex @code{ssad@var{m}} instruction pattern
> >> +@item @samp{ssad@var{m}}
> >> +@cindex @code{usad@var{m}} instruction pattern
> >> +@item @samp{usad@var{m}}
> >> +Compute the sum of absolute differences of two signed/unsigned elements.
> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
> >> +equal or wider than the mode of the absolute difference. The result is placed
> >> +in operand 0, which is of the same mode as operand 3.
> >> +
> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
> >>  @item @samp{ssum_widen@var{m3}}
> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
> >> diff --git a/gcc/expr.c b/gcc/expr.c
> >> index 4975a64..1db8a49 100644
> >
> > I'm not sure I follow, and if I do - I don't think it matches what
> > you have implemented for i386.
> >
> > From your text description I would guess the series of operations to be:
> >
> >   v1 = widen (operands[1])
> >   v2 = widen (operands[2])
> >   v3 = abs (v1 - v2)
> >   operands[0] = v3 + operands[3]
> >
> > But if I understand the behaviour of PSADBW correctly, what you have
> > actually implemented is:
> >
> >   v1 = widen (operands[1])
> >   v2 = widen (operands[2])
> >   v3 = abs (v1 - v2)
> >   v4 = reduce_plus (v3)
> >   operands[0] = v4 + operands[3]
> >
> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
> > who do not get this for free with their Absolute Difference step. Imagine a
> > simple loop where we have synthesized the reduce_plus, we compute partial
> > sums each loop iteration, though we would be better to leave the reduce_plus
> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
> > Tree code for this.
> 
> What do you mean when you use "synthesizing" here? For each pattern,
> the only synthesized operation is the one being returned from the
> pattern recognizer. In this case, it is USAD_EXPR. The recognition of
> reduce sum is necessary as we need corresponding prolog and epilog for
> reductions, which is already done before pattern recognition. Note
> that reduction is not a pattern but is a type of vector definition. A
> vectorization pattern can still be a reduction operation as long as
> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
> can check the other two reduction patterns: widen_sum_pattern and
> dot_prod_pattern for reference.

My apologies for not being clear. What I mean is, for a target which does
not have a dedicated PSADBW instruction, the individual steps of
'usad<m>' must be "synthesized" in such a way as to match the expected
behaviour of the tree code.

So, I must expand 'usadm' to a series of equivalent instructions
as USAD_EXPR expects.

If USAD_EXPR requires me to emit a reduction on each loop iteration,
I think that will be inefficient compared to performing the reduction
after the loop body.

To a first approximation on ARM, I would expect from your description
of 'usad<m>' that generating,

     VABAL   ops[3], ops[1], ops[2]
     (Vector widening Absolute Difference and Accumulate)

would fulfil the requirements.

But to match the behaviour you have implemented in the i386
backend I would be required to generate:

    VABAL   ops[3], ops[1], ops[2]
    VPADD   ops[3], ops[3], ops[3] (add one set of pairs)
    VPADD   ops[3], ops[3], ops[3] (and the other)
    VAND    ops[0], ops[3], MASK   (clear high lanes)

Which additionally performs the (redundant) vector reduction
and high lane zeroing step on each loop iteration.

My comment is that your documentation and implementation are
inconsistent so I am not sure which behaviour you intend for USAD_EXPR.

Additionally, I think it would be more generic to choose the first
behaviour, rather than requiring a wasteful decomposition to match
a very particular i386 opcode.

Thanks,
James