This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: Semantics of SAD_EXPR and usad/ssad optabs


On May 10, 2018 10:53:19 AM GMT+02:00, Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>Hi Richard,
>
>On 09/05/18 19:37, Richard Biener wrote:
>> On May 9, 2018 6:19:47 PM GMT+02:00, Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> I'm looking into implementing the usad/ssad optabs for aarch64 to
>>> catch code like in PR 85693, and I'm a bit lost as to what the
>>> midend expects the optabs to produce.
>>> The documentation for them says that the addend operand (op 3) is of
>>> a mode equal to or wider than the mode of the product (and
>>> consequently of operands 1 and 2), with the result operand 0 being
>>> of the same mode as operand 3.
>>>
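
For reference, the kind of scalar reduction loop these optabs are meant to catch looks roughly like the following (illustrative C only, not tied to the PR's exact testcase):

  /* Sum-of-absolute-differences reduction over byte elements,
     accumulated in a wider type -- the shape of computation the
     [us]sad optabs describe.  */
  unsigned int
  sad_loop (const unsigned char *a, const unsigned char *b, int n)
  {
    unsigned int sum = 0;
    for (int i = 0; i < n; i++)
      sum += __builtin_abs (a[i] - b[i]);
    return sum;
  }
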
>>> The x86 implementation for usadv16qi (for example) takes a V16QI
>>> vector and returns a V4SI vector.
>>> I'm confused as to what reduction logic is expected by the midend.
>>> The PSADBW instruction that x86 uses in that case accumulates the
>>> two V8QI halves of the input into two 16-bit values (you don't need
>>> any more bits to represent a sum of 8 byte differences, I believe):
>>> one placed at bit 0 and the other placed at bit 64. The bit ranges
>>> [16-63] and [80-127] are left as zeroes.
>>> So it produces a V2DI result in essence.
>>>
>>> If the input V16QI vectors look like:
>>> { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
>>> { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
>>>
>>> then the result V4SI view (before being added into operand 3) is:
>>> { SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }   (1)
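
In intrinsics terms, the x86 expansion of layout (1) is essentially the following (a rough sketch only, not the actual sse.md pattern):

  #include <immintrin.h>

  /* Rough equivalent of the x86 usadv16qi expansion: PSADBW produces the
     two partial sums of layout (1) in the low 16 bits of each 64-bit
     half, and the result is then added into the V4SI addend (op 3).  */
  __m128i
  usadv16qi_sketch (__m128i addend, __m128i a, __m128i b)
  {
    __m128i sad = _mm_sad_epu8 (a, b);   /* { SUM (|a0-b0|..|a7-b7|), 0,
                                              SUM (|a8-b8|..|a15-b15|), 0 }
                                            viewed as V4SI.  */
    return _mm_add_epi32 (sad, addend);
  }
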
>>>
>>> whereas a normal widening reduction of V16QI -> V4SI to me would
>>> look more like:
>>>
>>> { SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])),
>>>   SUM (ABS (a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }   (2)
>>>
>>> My question is, does the vectoriser depend on the semantics of
>>> [us]sad producing the result in (1)?
>> No, it doesn't. It is required that any association of the embedded
>> reduction is correct, and thus this requires the appropriate
>> -ffast-math flags. Note it's also the reason why we do not implement
>> constant folding of SAD.
>
>At the moment I'm looking at the integer modes, so I guess
>reassociation and -ffast-math don't come into play, but I'll keep
>that in mind.
>
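
For integer element types the two layouts are indeed interchangeable once a full reduction is applied, since they merely group the same absolute differences differently; a quick check along these lines illustrates that (plain C, purely for illustration):

  #include <assert.h>

  static unsigned int
  sad (const unsigned char *a, const unsigned char *b, int n)
  {
    unsigned int s = 0;
    for (int i = 0; i < n; i++)
      s += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return s;
  }

  void
  check (const unsigned char a[16], const unsigned char b[16])
  {
    /* Variant (1): x86 PSADBW-style layout.  */
    unsigned int v1[4] = { sad (a, b, 8), 0, sad (a + 8, b + 8, 8), 0 };
    /* Variant (2): even spread across the four SI lanes.  */
    unsigned int v2[4] = { sad (a, b, 4), sad (a + 4, b + 4, 4),
                           sad (a + 8, b + 8, 4), sad (a + 12, b + 12, 4) };
    /* Unsigned addition is associative, so the full reductions agree.  */
    assert (v1[0] + v1[1] + v1[2] + v1[3]
            == v2[0] + v2[1] + v2[2] + v2[3]);
  }
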
>>> If so, do you think it's worth clarifying in the documentation?
>> Probably yes - but I'm not sure the current state of affairs is
>> best... Do other targets implement the same reduction order as x86?
>> Other similar reduction ops have high/low or even/odd variants, but
>> they also do not reduce the outputs.
>
>AFAICS only x86 and powerpc implement this so far. The powerpc
>implementation synthesises the V16QI -> V4SI reduction using multiple
>instructions.
>The result it produces is variant (2) in my original post. So the two
>ports differ.
>
>From a purely target implementation perspective it is convenient not to
>impose any particular reduction strategy.
>If we say that the only requirement on the [us]sad optabs is that the
>result vector should be suitable for a full V4SI -> SI reduction,
>without relying on any particular lane layout, then each target can
>provide its optimal sequence.
>
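
Put differently, under that contract the midend would only ever consume the result like this (a rough sketch; usadv16qi_step () is a hypothetical stand-in for whatever sequence the target expands the optab to):

  extern void usadv16qi_step (unsigned int acc[4],
                              const unsigned char *a,
                              const unsigned char *b);

  unsigned int
  sad_reduction (const unsigned char *a, const unsigned char *b, int n)
  {
    /* The V4SI result is fed back in as the next iteration's addend
       (operand 3); the lanes are only combined after the loop.  */
    unsigned int acc[4] = { 0, 0, 0, 0 };
    for (int i = 0; i < n; i += 16)
      usadv16qi_step (acc, a + i, b + i);
    return acc[0] + acc[1] + acc[2] + acc[3];
  }
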
>For example, an aarch64 implementation I'm experimenting with now would
>compute the V16QI -> V16QI absolute differences vector, reduce that
>into a single HImode value (there is a full widening reduction
>instruction in aarch64 for that) and then do a widening add of that
>value into element zero of the result V4SI vector. Following the
>notation above, this would produce from:
>
>{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
>{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
>
>the V4SI result:
>
>{ SUM (ABS (a[0-15] - b[0-15])), 0, 0, 0 }
>
>Matching the x86 or powerpc strategy would require a more costly
>sequence on aarch64, but of course this would only be safe if we had
>some guarantees that the midend won't rely on any particular reduction
>strategy and just treat it as a vector on which to perform a full
>reduction at the end of a loop.
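
For concreteness, the sequence described above corresponds roughly to the following arm_neon.h sketch (intrinsics are only used here for illustration; the actual implementation would be a pattern in the aarch64 backend):

  #include <arm_neon.h>

  /* Rough sketch of the proposed aarch64 usadv16qi expansion: UABD for
     the byte-wise absolute differences, UADDLV for the full widening
     reduction to a 16-bit scalar, then add that scalar into element 0
     of the V4SI addend.  */
  uint32x4_t
  usad_v16qi_sketch (uint32x4_t addend, uint8x16_t a, uint8x16_t b)
  {
    uint8x16_t absdiff = vabdq_u8 (a, b);
    uint16_t sum = vaddlvq_u8 (absdiff);
    return vsetq_lane_u32 (vgetq_lane_u32 (addend, 0) + sum, addend, 0);
  }
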

OK, sounds reasonable. BTW, in another context I needed a very specific reduction order because the result was not used in a reduction. For that purpose we'd then need different optabs.

Richard. 


>Thanks,
>Kyrill
>
>> Note DOT_PROD has the very same issue.
>>
>> Richard.
>>
>>> Thanks,
>>> Kyrill

