[PATCH][AArch64] Implement usadv16qi and ssadv16qi standard names

Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
Tue May 15 08:20:00 GMT 2018


I realised I had forgotten to copy the maintainers...

https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00613.html

Thanks,
Kyrill

On 14/05/18 14:38, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on gcc@gcc.gnu.org [1] for background.
>
> The V16QImode variant is important to get right as it is the most commonly used pattern:
> reducing vectors of bytes into an int.
> The midend expects the optab to compute the absolute differences of operands 1 and 2, reduce
> them while widening along the way up to SImode, and accumulate the result into operand 3.
> So the input vectors are V16QImode, the accumulator is V4SImode and the output is V4SImode.
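> 
> As a scalar model (just a sketch to pin down the semantics; the function name and
> array-based signature here are mine, not part of the patch), the unsigned variant
> behaves like:
> 
> /* usadv16qi reference: accumulate the widened absolute differences of
>    op1 and op2 into the four SImode lanes of op3.  Which byte ends up in
>    which lane is not fixed; only the overall sum matters (see [1]).  */
> void
> usadv16qi_ref (unsigned int op0[4], const unsigned char op1[16],
>                const unsigned char op2[16], const unsigned int op3[4])
> {
>   for (int i = 0; i < 4; i++)
>     op0[i] = op3[i];
>   for (int i = 0; i < 16; i++)
>     op0[i / 4] += __builtin_abs (op1[i] - op2[i]);
> }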
>
> I've tried out a few different strategies for this; the one I settled on is to emit:
> UABDL2    tmp.8h, op1.16b, op2.16b
> UABAL     tmp.8h, op1.8b, op2.8b
> UADALP    op3.4s, tmp.8h
>
> To work through the semantics, let's say the operands are:
> op1 { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
> op2 { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
> op3 { c0, c1, c2, c3 }
>
> The UABDL2 takes the upper V8QI elements, computes their absolute differences, widens them and stores them into the V8HImode tmp:
>
> tmp { ABS(a[8]-b[8]),   ABS(a[9]-b[9]),   ABS(a[10]-b[10]), ABS(a[11]-b[11]),
>       ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }
>
> The UABAL after that takes the lower V8QI elements, computes their absolute differences, widens them and accumulates them into the V8HImode tmp from the previous step:
>
> tmp { ABS(a[8]-b[8])+ABS(a[0]-b[0]),   ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> Finally the UADALP does a pairwise widening reduction and accumulation into the V4SImode op3:
> op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> (sorry for the text dump)
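> 
> Putting the three steps together, a C model of the whole sequence (a sketch only;
> a, b and c stand in for operands 1, 2 and 3) is:
> 
> static void
> sad_model (const unsigned char a[16], const unsigned char b[16],
>            unsigned int c[4])
> {
>   unsigned short tmp[8];
> 
>   /* UABDL2: absolute differences of the high halves, widened to 16 bits.  */
>   for (int i = 0; i < 8; i++)
>     tmp[i] = __builtin_abs (a[i + 8] - b[i + 8]);
> 
>   /* UABAL: absolute differences of the low halves, widened and accumulated.  */
>   for (int i = 0; i < 8; i++)
>     tmp[i] += __builtin_abs (a[i] - b[i]);
> 
>   /* UADALP: pairwise widening add of tmp, accumulated into the SImode lanes.  */
>   for (int i = 0; i < 4; i++)
>     c[i] += tmp[2 * i] + tmp[2 * i + 1];
> }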
>
> Remember, according to [1] the exact reduction sequence doesn't matter (for integer arithmetic at least).
> I've considered other sequences as well (thanks Wilco), for example
> * UABD + UADDLP + UADALP
> * UABDL2 + UABDL + UADALP + UADALP
>
> I ended up settling on the sequence in this patch as it's short (3 instructions) and, in the future, we can potentially
> look at optimising multiple occurrences of it into something even faster (for example, accumulating into H registers for longer
> before doing a single UADALP at the end to accumulate into the final S register).
>
> If your microarchitecture has some strong preferences for a particular sequence, please let me know or, even better, propose a patch
> to parametrise the generation sequence by code (or the appropriate RTX cost).
>
>
> This expansion allows the vectoriser to avoid unpacking the bytes in two steps and performing V4SI arithmetic on them.
> So, for the code:
>
> #define N 16
>
> unsigned char pix1[N], pix2[N];
>
> int foo (void)
> {
>   int i_sum = 0;
>   int i;
>
>   for (i = 0; i < 16; i++)
>     i_sum += __builtin_abs (pix1[i] - pix2[i]);
>
>   return i_sum;
> }
>
> we now generate on aarch64:
> foo:
>         adrp    x1, pix1
>         add     x1, x1, :lo12:pix1
>         movi    v0.4s, 0
>         adrp    x0, pix2
>         add     x0, x0, :lo12:pix2
>         ldr     q2, [x1]
>         ldr     q3, [x0]
>         uabdl2  v1.8h, v2.16b, v3.16b
>         uabal   v1.8h, v2.8b, v3.8b
>         uadalp  v0.4s, v1.8h
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
>
> instead of:
> foo:
>         adrp    x1, pix1
>         adrp    x0, pix2
>         add     x1, x1, :lo12:pix1
>         add     x0, x0, :lo12:pix2
>         ldr     q0, [x1]
>         ldr     q4, [x0]
>         ushll   v1.8h, v0.8b, 0
>         ushll2  v0.8h, v0.16b, 0
>         ushll   v2.8h, v4.8b, 0
>         ushll2  v4.8h, v4.16b, 0
>         usubl   v3.4s, v1.4h, v2.4h
>         usubl2  v1.4s, v1.8h, v2.8h
>         usubl   v2.4s, v0.4h, v4.4h
>         usubl2  v0.4s, v0.8h, v4.8h
>         abs     v3.4s, v3.4s
>         abs     v1.4s, v1.4s
>         abs     v2.4s, v2.4s
>         abs     v0.4s, v0.4s
>         add     v1.4s, v3.4s, v1.4s
>         add     v1.4s, v2.4s, v1.4s
>         add     v0.4s, v0.4s, v1.4s
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> So I expect this new expansion to be better than the status quo in any case.
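> 
> (For completeness, the new sequence corresponds to what one would write by hand
> with ACLE intrinsics, roughly as below; this is purely illustrative, the expander
> emits the instructions directly rather than going through intrinsics.)
> 
> #include <arm_neon.h>
> 
> uint32x4_t
> usad_block (uint32x4_t acc, uint8x16_t p1, uint8x16_t p2)
> {
>   uint16x8_t t = vabdl_high_u8 (p1, p2);                 /* uabdl2 */
>   t = vabal_u8 (t, vget_low_u8 (p1), vget_low_u8 (p2));  /* uabal  */
>   return vpadalq_u16 (acc, t);                           /* uadalp */
> }
> 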
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives an improvement of about 8% on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> [1] https://gcc.gnu.org/ml/gcc/2018-05/msg00070.html
>
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
>     UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
>     UNSPEC_UADALP values.
>     * config/aarch64/iterators.md (ABAL): New int iterator.
>     (ABDL2): Likewise.
>     (ADALP): Likewise.
>     (sur): Add mappings for the above.
>     * config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
>     New define_insn.
>     (aarch64_<sur>abal<mode>_4): Likewise.
>     (aarch64_<sur>adalp<mode>_3): Likewise.
>     (<sur>sadv16qi): New define_expand.
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.c-torture/execute/ssad-run.c: New test.
>     * gcc.c-torture/execute/usad-run.c: Likewise.
>     * gcc.target/aarch64/ssadv16qi.c: Likewise.
>     * gcc.target/aarch64/usadv16qi.c: Likewise.


