[PATCH][AArch64] Implement usadv16qi and ssadv16qi standard names
Kyrill Tkachov
kyrylo.tkachov@foss.arm.com
Tue May 15 08:20:00 GMT 2018
I realised I had forgotten to copy the maintainers...
https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00613.html
Thanks,
Kyrill
On 14/05/18 14:38, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on gcc@gcc.gnu.org [1] for background.
>
> The V16QImode variant is important to get right as it is the most commonly used pattern:
> reducing vectors of bytes into an int.
> The midend expects the optab to compute the absolute differences of operands 1 and 2 and
> reduce them while widening along the way up to SImode. So the inputs are V16QImode and
> the output is V4SImode.
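> As a scalar sketch, the required semantics look like the following (a hypothetical reference model, not the actual expander; the grouping of byte lanes onto the four accumulators shown here is just one valid choice, since [1] leaves the reduction order unspecified):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of usadv16qi: accumulate the absolute
   differences of sixteen byte lanes into four 32-bit accumulators.
   Which byte lanes feed which accumulator is one valid choice here;
   the midend does not mandate a particular reduction order.  */
void
usadv16qi_model (uint32_t op3[4], const uint8_t op1[16],
                 const uint8_t op2[16])
{
  for (int i = 0; i < 16; i++)
    op3[i / 4] += (uint32_t) abs (op1[i] - op2[i]);
}
```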
>
> I've tried out a few different strategies for that, the one I settled with is to emit:
> UABDL2 tmp.8h, op1.16b, op2.16b
> UABAL tmp.8h, op1.16b, op2.16b
> UADALP op3.4s, tmp.8h
>
> To work through the semantics let's say operands 1 and 2 are:
> op1 { a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7], a[8], a[9], a[10], a[11], a[12], a[13], a[14], a[15] }
> op2 { b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7], b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15] }
> op3 { c0, c1, c2, c3 }
>
> The UABDL2 takes the upper V8QI elements, computes their absolute differences, widens them and stores them into the V8HImode tmp:
>
> tmp { ABS(a[8]-b[8]), ABS(a[9]-b[9]), ABS(a[10]-b[10]), ABS(a[11]-b[11]), ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }
>
> The UABAL after that takes the lower V8QI elements, computes their absolute differences, widens them and accumulates them into the V8HImode tmp from the previous step:
>
> tmp { ABS(a[8]-b[8])+ABS(a[0]-b[0]), ABS(a[9]-b[9])+ABS(a[1]-b[1]), ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]), ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]), ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> Finally the UADALP does a pairwise widening reduction and accumulation into the V4SImode op3:
> op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]), c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]), c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]), c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> (sorry for the text dump)
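> The walkthrough above can be modelled in scalar C (a sketch of the lane arithmetic only, following the email's lane numbering; it is not a reference for the architectural instruction definitions):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of the UABDL2 + UABAL + UADALP sequence.
   a/b are the 16 byte lanes of operands 1 and 2; op3 holds the four
   32-bit accumulators (c0..c3) on entry and the result on exit.  */
void
sad_sequence (uint32_t op3[4], const uint8_t a[16], const uint8_t b[16])
{
  uint16_t tmp[8];

  /* UABDL2: absolute differences of the upper eight byte lanes,
     widened to 16 bits.  */
  for (int i = 0; i < 8; i++)
    tmp[i] = (uint16_t) abs (a[i + 8] - b[i + 8]);

  /* UABAL: absolute differences of the lower eight byte lanes,
     widened and accumulated into tmp.  */
  for (int i = 0; i < 8; i++)
    tmp[i] += (uint16_t) abs (a[i] - b[i]);

  /* UADALP: pairwise widening add of tmp, accumulated into op3.  */
  for (int i = 0; i < 4; i++)
    op3[i] += (uint32_t) tmp[2 * i] + tmp[2 * i + 1];
}
```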
>
> Remember, according to [1], the exact reduction sequence doesn't matter (for integer arithmetic at least).
> I've considered other sequences as well (thanks Wilco), for example
> * UABD + UADDLP + UADALP
> * UABDL2 + UABDL + UADALP + UADALP
>
> I ended up settling on the sequence in this patch as it's short (3 instructions) and in the future we can potentially
> look to optimise multiple occurrences of these into something even faster (for example, accumulating into H registers for longer
> before doing a single UADALP at the end to accumulate into the final S register).
>
> If your microarchitecture has some strong preferences for a particular sequence, please let me know or, even better, propose a patch
> to parametrise the generation sequence by code (or the appropriate RTX cost).
>
>
> This expansion allows the vectoriser to avoid unpacking the bytes in two steps and performing V4SI arithmetic on them.
> So, for the code:
>
> unsigned char pix1[N], pix2[N];
>
> int foo (void)
> {
>   int i_sum = 0;
>   int i;
>
>   for (i = 0; i < 16; i++)
>     i_sum += __builtin_abs (pix1[i] - pix2[i]);
>
>   return i_sum;
> }
>
> we now generate on aarch64:
> foo:
>         adrp    x1, pix1
>         add     x1, x1, :lo12:pix1
>         movi    v0.4s, 0
>         adrp    x0, pix2
>         add     x0, x0, :lo12:pix2
>         ldr     q2, [x1]
>         ldr     q3, [x0]
>         uabdl2  v1.8h, v2.16b, v3.16b
>         uabal   v1.8h, v2.8b, v3.8b
>         uadalp  v0.4s, v1.8h
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
>
> instead of:
> foo:
>         adrp    x1, pix1
>         adrp    x0, pix2
>         add     x1, x1, :lo12:pix1
>         add     x0, x0, :lo12:pix2
>         ldr     q0, [x1]
>         ldr     q4, [x0]
>         ushll   v1.8h, v0.8b, 0
>         ushll2  v0.8h, v0.16b, 0
>         ushll   v2.8h, v4.8b, 0
>         ushll2  v4.8h, v4.16b, 0
>         usubl   v3.4s, v1.4h, v2.4h
>         usubl2  v1.4s, v1.8h, v2.8h
>         usubl   v2.4s, v0.4h, v4.4h
>         usubl2  v0.4s, v0.8h, v4.8h
>         abs     v3.4s, v3.4s
>         abs     v1.4s, v1.4s
>         abs     v2.4s, v2.4s
>         abs     v0.4s, v0.4s
>         add     v1.4s, v3.4s, v1.4s
>         add     v1.4s, v2.4s, v1.4s
>         add     v0.4s, v0.4s, v1.4s
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> So I expect this new expansion to be better than the status quo in any case.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives an improvement of about 8% on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> [1] https://gcc.gnu.org/ml/gcc/2018-05/msg00070.html
>
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>         * config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
>         UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
>         UNSPEC_UADALP values.
>         * config/aarch64/iterators.md (ABAL): New int iterator.
>         (ABDL2): Likewise.
>         (ADALP): Likewise.
>         (sur): Add mappings for the above.
>         * config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
>         New define_insn.
>         (aarch64_<sur>abal<mode>_4): Likewise.
>         (aarch64_<sur>adalp<mode>_3): Likewise.
>         (<sur>sadv16qi): New define_expand.
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>         * gcc.c-torture/execute/ssad-run.c: New test.
>         * gcc.c-torture/execute/usad-run.c: Likewise.
>         * gcc.target/aarch64/ssadv16qi.c: Likewise.
>         * gcc.target/aarch64/usadv16qi.c: Likewise.