This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.


Now is this patch OK for the trunk? Thank you!



thanks,
Cong


On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <congh@google.com> wrote:
> Thank you for your detailed explanation.
>
> Once GCC detects a reduction operation, it will automatically
> accumulate all elements in the vector after the loop. In the loop the
> reduction variable is always a vector whose elements are reductions of
> corresponding values from other vectors. Therefore in your case the
> only instruction you need to generate is:
>
>     VABAL   ops[3], ops[1], ops[2]
>
> It is OK if you accumulate the elements into one in the vector inside
> of the loop (if one instruction can do this), but you have to make
> sure other elements in the vector should remain zero so that the final
> result is correct.
>
> If you are confused about the documentation, check the one for
> udot_prod (just above usad in md.texi), as it has very similar
> behavior as usad. Actually I copied the text from there and did some
> changes. As those two instruction patterns are both for vectorization,
> their behavior should not be difficult to explain.
>
> If you have more questions or think that the documentation is still
> improper please let me know.
>
> Thank you very much!
>
>
> Cong
>
>
> On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
>> On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
>>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>>> <james.greenhalgh@arm.com> wrote:
>>> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>> >> index 2a5a2e1..8f5d39a 100644
>>> >> --- a/gcc/doc/md.texi
>>> >> +++ b/gcc/doc/md.texi
>>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>>> >> Operand 3 is of a mode equal or
>>> >>  wider than the mode of the product. The result is placed in operand 0, which
>>> >>  is of the same mode as operand 3.
>>> >>
>>> >> +@cindex @code{ssad@var{m}} instruction pattern
>>> >> +@item @samp{ssad@var{m}}
>>> >> +@cindex @code{usad@var{m}} instruction pattern
>>> >> +@item @samp{usad@var{m}}
>>> >> +Compute the sum of absolute differences of two signed/unsigned elements.
>>> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, which
>>> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode
>>> >> +equal or wider than the mode of the absolute difference. The result is placed
>>> >> +in operand 0, which is of the same mode as operand 3.
>>> >> +
>>> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>>> >>  @item @samp{ssum_widen@var{m3}}
>>> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
>>> >> diff --git a/gcc/expr.c b/gcc/expr.c
>>> >> index 4975a64..1db8a49 100644
>>> >
>>> > I'm not sure I follow, and if I do - I don't think it matches what
>>> > you have implemented for i386.
>>> >
>>> > From your text description I would guess the series of operations to be:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   operands[0] = v3 + operands[3]
>>> >
>>> > But if I understand the behaviour of PSADBW correctly, what you have
>>> > actually implemented is:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   v4 = reduce_plus (v3)
>>> >   operands[0] = v4 + operands[3]
>>> >
>>> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
>>> > who do not get this for free with their Absolute Difference step. Imagine a
>>> > simple loop where we have synthesized the reduce_plus, we compute partial
>>> > sums each loop iteration, though we would be better to leave the reduce_plus
>>> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
>>> > Tree code for this.
>>>
>>> What do you mean when you use "synthesizing" here? For each pattern,
>>> the only synthesized operation is the one being returned from the
>>> pattern recognizer. In this case, it is USAD_EXPR. The recognition of
>>> reduce sum is necessary as we need corresponding prolog and epilog for
>>> reductions, which is already done before pattern recognition. Note
>>> that reduction is not a pattern but is a type of vector definition. A
>>> vectorization pattern can still be a reduction operation as long as
>>> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
>>> can check the other two reduction patterns: widen_sum_pattern and
>>> dot_prod_pattern for reference.
>>
>> My apologies for not being clear. What I mean is, for a target which does
>> not have a dedicated PSADBW instruction, the individual steps of
>> 'usad<m>' must be "synthesized" in such a way as to match the expected
>> behaviour of the tree code.
>>
>> So, I must expand 'usadm' to a series of equivalent instructions
>> as USAD_EXPR expects.
>>
>> If USAD_EXPR requires me to emit a reduction on each loop iteration,
>> I think that will be inefficient compared to performing the reduction
>> after the loop body.
>>
>> To a first approximation on ARM, I would expect from your description
>> of 'usad<m>' that generating,
>>
>>      VABAL   ops[3], ops[1], ops[2]
>>      (Vector widening Absolute Difference and Accumulate)
>>
>> would fulfil the requirements.
>>
>> But to match the behaviour you have implemented in the i386
>> backend I would be required to generate:
>>
>>     VABAL   ops[3], ops[1], ops[2]
>>     VPADD   ops[3], ops[3], ops[3] (add one set of pairs)
>>     VPADD   ops[3], ops[3], ops[3] (and the other)
>>     VAND    ops[0], ops[3], MASK   (clear high lanes)
>>
>> Which additionally performs the (redundant) vector reduction
>> and high lane zeroing step on each loop iteration.
>>
>> My comment is that your documentation and implementation are
>> inconsistent so I am not sure which behaviour you intend for USAD_EXPR.
>>
>> Additionally, I think it would be more generic to choose the first
>> behaviour, rather than requiring a wasteful decomposition to match
>> a very particular i386 opcode.
>>
>> Thanks,
>> James
>>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]