From 75add2d039de20a128af1d9e4ee54a8758db9837 Mon Sep 17 00:00:00 2001
From: Kyrylo Tkachov
Date: Mon, 21 May 2018 11:21:07 +0000
Subject: [PATCH] [AArch64] Implement usadv16qi and ssadv16qi standard names

This patch implements the usadv16qi and ssadv16qi standard names.
See the thread on gcc@gcc.gnu.org [1] for background.

The V16QImode variant is important to get right as it is the most
commonly used pattern: reducing vectors of bytes into an int.
The midend expects the optab to compute the absolute differences of
operands 1 and 2 and reduce them while widening along the way up to
SImode.  So the inputs are V16QImode and the output is V4SImode.

I've tried out a few different strategies for that; the one I settled
on is to emit:

UABDL2	tmp.8h, op1.16b, op2.16b
UABAL	tmp.8h, op1.16b, op2.16b
UADALP	op3.4s, tmp.8h

To work through the semantics, let's say operands 1 and 2 are:

op1 { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
op2 { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
op3 { c0, c1, c2, c3 }

The UABDL2 takes the upper V8QI elements, computes their absolute
differences, widens them and stores them into the V8HImode tmp:

tmp { ABS(a[8]-b[8]), ABS(a[9]-b[9]), ABS(a[10]-b[10]), ABS(a[11]-b[11]),
      ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }

The UABAL after that takes the lower V8QI elements, computes their
absolute differences, widens them and accumulates them into the V8HImode
tmp from the previous step:

tmp { ABS(a[8]-b[8])+ABS(a[0]-b[0]), ABS(a[9]-b[9])+ABS(a[1]-b[1]),
      ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]),
      ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]),
      ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }

Finally the UADALP does a pairwise widening reduction and accumulation
into the V4SImode op3:

op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]),
      c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]),
      c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]),
      c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }

(sorry for the text dump)

Remember, according to [1] the exact reduction sequence doesn't matter
(for integer arithmetic at least).

I've considered other sequences as well (thanks Wilco), for example:
* UABD + UADDLP + UADALP
* UABDL2 + UABDL + UADALP + UADALP

I ended up settling on the sequence in this patch as it's short
(3 instructions) and in the future we can potentially look to optimise
multiple occurrences of these into something even faster (for example
accumulating into H registers for longer before doing a single UADALP
at the end to accumulate into the final S register).
If your microarchitecture has some strong preferences for a particular
sequence, please let me know or, even better, propose a patch to
parametrise the generation sequence by code (or the appropriate RTX cost).

This expansion allows the vectoriser to avoid unpacking the bytes in
two steps and performing V4SI arithmetic on them.
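For illustration, the chosen sequence maps directly onto ACLE intrinsics
from arm_neon.h.  The sketch below is exposition only and not part of the
patch; the helper name usadv16qi_model is made up, but the intrinsics are
the standard ones for UABDL2, UABAL and UADALP:

#include <arm_neon.h>

/* Model of the emitted sequence; the helper name is hypothetical.  */
uint32x4_t
usadv16qi_model (uint32x4_t op3, uint8x16_t op1, uint8x16_t op2)
{
  /* UABDL2: widening absolute differences of the high halves.  */
  uint16x8_t tmp = vabdl_high_u8 (op1, op2);
  /* UABAL: accumulate widening absolute differences of the low halves.  */
  tmp = vabal_u8 (tmp, vget_low_u8 (op1), vget_low_u8 (op2));
  /* UADALP: pairwise widen to 32 bits and accumulate into the op3 lanes.  */
  return vpadalq_u16 (op3, tmp);
}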
So, for the code:

unsigned char pix1[N], pix2[N];

int foo (void)
{
  int i_sum = 0;
  int i;

  for (i = 0; i < 16; i++)
    i_sum += __builtin_abs (pix1[i] - pix2[i]);

  return i_sum;
}

we now generate on aarch64:
foo:
	adrp	x1, pix1
	add	x1, x1, :lo12:pix1
	movi	v0.4s, 0
	adrp	x0, pix2
	add	x0, x0, :lo12:pix2
	ldr	q2, [x1]
	ldr	q3, [x0]
	uabdl2	v1.8h, v2.16b, v3.16b
	uabal	v1.8h, v2.8b, v3.8b
	uadalp	v0.4s, v1.8h
	addv	s0, v0.4s
	umov	w0, v0.s[0]
	ret

instead of:
foo:
	adrp	x1, pix1
	adrp	x0, pix2
	add	x1, x1, :lo12:pix1
	add	x0, x0, :lo12:pix2
	ldr	q0, [x1]
	ldr	q4, [x0]
	ushll	v1.8h, v0.8b, 0
	ushll2	v0.8h, v0.16b, 0
	ushll	v2.8h, v4.8b, 0
	ushll2	v4.8h, v4.16b, 0
	usubl	v3.4s, v1.4h, v2.4h
	usubl2	v1.4s, v1.8h, v2.8h
	usubl	v2.4s, v0.4h, v4.4h
	usubl2	v0.4s, v0.8h, v4.8h
	abs	v3.4s, v3.4s
	abs	v1.4s, v1.4s
	abs	v2.4s, v2.4s
	abs	v0.4s, v0.4s
	add	v1.4s, v3.4s, v1.4s
	add	v1.4s, v2.4s, v1.4s
	add	v0.4s, v0.4s, v1.4s
	addv	s0, v0.4s
	umov	w0, v0.s[0]
	ret

So I expect this new expansion to be better than the status quo in any case.
Bootstrapped and tested on aarch64-none-linux-gnu.
This gives about an 8% improvement on 525.x264_r from SPEC2017 on a
Cortex-A72.

	* config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
	UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
	UNSPEC_UADALP values.
	* config/aarch64/iterators.md (ABAL): New int iterator.
	(ABDL2): Likewise.
	(ADALP): Likewise.
	(sur): Add mappings for the above.
	* config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
	New define_insn.
	(aarch64_<sur>abal<mode>_4): Likewise.
	(aarch64_<sur>adalp<mode>_3): Likewise.
	(<sur>sadv16qi): New define_expand.

	* gcc.c-torture/execute/ssad-run.c: New test.
	* gcc.c-torture/execute/usad-run.c: Likewise.
	* gcc.target/aarch64/ssadv16qi.c: Likewise.
	* gcc.target/aarch64/usadv16qi.c: Likewise.
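As a scalar reference for the optab contract described earlier (my reading
of [1]; a sketch, not part of the patch, and usadv16qi_ref is a made-up
name), the expansion must be equivalent to:

/* op0 = op3 + the sum of absolute differences of op1 and op2, computed
   in SImode.  Which op0 lane receives which partial sum is unspecified;
   the i / 4 split below is just one valid association.  */
void
usadv16qi_ref (unsigned int op0[4], const unsigned char op1[16],
	       const unsigned char op2[16], const unsigned int op3[4])
{
  for (int i = 0; i < 4; i++)
    op0[i] = op3[i];
  for (int i = 0; i < 16; i++)
    {
      int d = op1[i] - op2[i];
      op0[i / 4] += d < 0 ? -d : d;
    }
}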
From-SVN: r260437
---
 gcc/ChangeLog                                | 15 +++++
 gcc/config/aarch64/aarch64-simd.md           | 61 +++++++++++++++++++
 gcc/config/aarch64/aarch64.md                |  6 ++
 gcc/config/aarch64/iterators.md              | 13 ++++
 gcc/testsuite/ChangeLog                      |  7 +++
 .../gcc.c-torture/execute/ssad-run.c         | 49 +++++++++++++++
 .../gcc.c-torture/execute/usad-run.c         | 49 +++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ssadv16qi.c | 27 ++++++++
 gcc/testsuite/gcc.target/aarch64/usadv16qi.c | 27 ++++++++
 9 files changed, 254 insertions(+)
 create mode 100644 gcc/testsuite/gcc.c-torture/execute/ssad-run.c
 create mode 100644 gcc/testsuite/gcc.c-torture/execute/usad-run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/usadv16qi.c

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index b1a9017d0c15..b247c1fd28db 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,18 @@
+2018-05-21  Kyrylo Tkachov
+
+	* config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
+	UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
+	UNSPEC_UADALP values.
+	* config/aarch64/iterators.md (ABAL): New int iterator.
+	(ABDL2): Likewise.
+	(ADALP): Likewise.
+	(sur): Add mappings for the above.
+	* config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
+	New define_insn.
+	(aarch64_<sur>abal<mode>_4): Likewise.
+	(aarch64_<sur>adalp<mode>_3): Likewise.
+	(<sur>sadv16qi): New define_expand.
+
 2018-05-21  Alexander Nesterovskiy
 
 	* config/i386/i386.md (*movsf_internal): AVX falsedep fix.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index c53a774f0053..fd971bf5c6de 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -612,6 +612,67 @@
   [(set_attr "type" "neon_abd<q>")]
 )
 
+(define_insn "aarch64_<sur>abdl2<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+	(unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+			 (match_operand:VDQV_S 2 "register_operand" "w")]
+	ABDL2))]
+  "TARGET_SIMD"
+  "<sur>abdl2\t%0.<Vwtype>, %1.<Vtype>, %2.<Vtype>"
+  [(set_attr "type" "neon_abd<q>")]
+)
+
+(define_insn "aarch64_<sur>abal<mode>_4"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+	(unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+			 (match_operand:VDQV_S 2 "register_operand" "w")
+			 (match_operand:<VDBLW> 3 "register_operand" "0")]
+	ABAL))]
+  "TARGET_SIMD"
+  "<sur>abal\t%0.<Vwtype>, %1.<Vhalftype>, %2.<Vhalftype>"
+  [(set_attr "type" "neon_arith_acc<q>")]
+)
+
+(define_insn "aarch64_<sur>adalp<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+	(unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+			 (match_operand:<VDBLW> 2 "register_operand" "0")]
+	ADALP))]
+  "TARGET_SIMD"
+  "<sur>adalp\t%0.<Vwtype>, %1.<Vtype>"
+  [(set_attr "type" "neon_reduc_add<q>")]
+)
+
+;; Emit a sequence to produce a sum-of-absolute-differences of the V16QI
+;; inputs in operands 1 and 2.  The sequence also has to perform a widening
+;; reduction of the difference into a V4SI vector and accumulate that into
+;; operand 3 before copying that into the result operand 0.
+;; Perform that with a sequence of:
+;; UABDL2	tmp.8h, op1.16b, op2.16b
+;; UABAL	tmp.8h, op1.16b, op2.16b
+;; UADALP	op3.4s, tmp.8h
+;; MOV		op0, op3 // should be eliminated in later passes.
+;; The signed version just uses the signed variants of the above instructions.
+
+(define_expand "<sur>sadv16qi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (unspec:V16QI [(use (match_operand:V16QI 1 "register_operand"))
+		  (use (match_operand:V16QI 2 "register_operand"))] ABAL)
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    rtx reduc = gen_reg_rtx (V8HImode);
+    emit_insn (gen_aarch64_<sur>abdl2v16qi_3 (reduc, operands[1],
+					      operands[2]));
+    emit_insn (gen_aarch64_<sur>abalv16qi_4 (reduc, operands[1],
+					     operands[2], reduc));
+    emit_insn (gen_aarch64_<sur>adalpv8hi_3 (operands[3], reduc,
+					     operands[3]));
+    emit_move_insn (operands[0], operands[3]);
+    DONE;
+  }
+)
+
 (define_insn "aba<mode>_3"
   [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
 	(plus:VDQ_BHSI (abs:VDQ_BHSI (minus:VDQ_BHSI
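The define_expand above can be sanity-checked against the scalar semantics
with a standalone program along these lines (illustrative only, not part of
the patch; it assumes an AArch64 host with arm_neon.h):

#include <arm_neon.h>
#include <stdlib.h>

int
main (void)
{
  unsigned char a[16], b[16];
  unsigned int expected = 0;

  srand (42);
  for (int i = 0; i < 16; i++)
    {
      a[i] = rand () & 0xff;
      b[i] = rand () & 0xff;
      expected += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    }

  /* UABDL2 + UABAL + UADALP, starting from a zero accumulator.  */
  uint8x16_t va = vld1q_u8 (a), vb = vld1q_u8 (b);
  uint16x8_t tmp = vabdl_high_u8 (va, vb);
  tmp = vabal_u8 (tmp, vget_low_u8 (va), vget_low_u8 (vb));
  uint32x4_t acc = vpadalq_u16 (vdupq_n_u32 (0), tmp);

  /* Only the total across the four lanes is specified (the ADDV step).  */
  if (vaddvq_u32 (acc) != expected)
    abort ();
  return 0;
}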
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 6556303ad4d3..7437971bdd89 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -141,6 +141,9 @@
     UNSPEC_PRLG_STK
     UNSPEC_REV
     UNSPEC_RBIT
+    UNSPEC_SABAL
+    UNSPEC_SABDL2
+    UNSPEC_SADALP
     UNSPEC_SCVTF
     UNSPEC_SISD_NEG
     UNSPEC_SISD_SSHL
@@ -159,6 +162,9 @@
     UNSPEC_TLSLE24
     UNSPEC_TLSLE32
     UNSPEC_TLSLE48
+    UNSPEC_UABAL
+    UNSPEC_UABDL2
+    UNSPEC_UADALP
     UNSPEC_UCVTF
     UNSPEC_USHL_2S
     UNSPEC_VSTRUCTDUMMY
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index ae4ec9d1565c..bf01044bc702 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1389,6 +1389,16 @@
 ;; -------------------------------------------------------------------
 ;; Int Iterators.
 ;; -------------------------------------------------------------------
+
+;; The unspec codes for the SABAL, UABAL AdvancedSIMD instructions.
+(define_int_iterator ABAL [UNSPEC_SABAL UNSPEC_UABAL])
+
+;; The unspec codes for the SABDL2, UABDL2 AdvancedSIMD instructions.
+(define_int_iterator ABDL2 [UNSPEC_SABDL2 UNSPEC_UABDL2])
+
+;; The unspec codes for the SADALP, UADALP AdvancedSIMD instructions.
+(define_int_iterator ADALP [UNSPEC_SADALP UNSPEC_UADALP])
+
 (define_int_iterator MAXMINV [UNSPEC_UMAXV UNSPEC_UMINV
			      UNSPEC_SMAXV UNSPEC_SMINV])
@@ -1596,6 +1606,9 @@
 		      (UNSPEC_SHSUB "s") (UNSPEC_UHSUB "u")
 		      (UNSPEC_SRHSUB "sr") (UNSPEC_URHSUB "ur")
 		      (UNSPEC_ADDHN "") (UNSPEC_RADDHN "r")
+		      (UNSPEC_SABAL "s") (UNSPEC_UABAL "u")
+		      (UNSPEC_SABDL2 "s") (UNSPEC_UABDL2 "u")
+		      (UNSPEC_SADALP "s") (UNSPEC_UADALP "u")
 		      (UNSPEC_SUBHN "") (UNSPEC_RSUBHN "r")
 		      (UNSPEC_ADDHN2 "") (UNSPEC_RADDHN2 "r")
 		      (UNSPEC_SUBHN2 "") (UNSPEC_RSUBHN2 "r")
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 6f917cbe36e7..4d10e0a9f7ea 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,10 @@
+2018-05-21  Kyrylo Tkachov
+
+	* gcc.c-torture/execute/ssad-run.c: New test.
+	* gcc.c-torture/execute/usad-run.c: Likewise.
+	* gcc.target/aarch64/ssadv16qi.c: Likewise.
+	* gcc.target/aarch64/usadv16qi.c: Likewise.
+
 2018-05-21  Tamar Christina
 
 	* gcc.target/gcc.target/aarch64/sha3.h (veor3q_u8, veor3q_u32,
diff --git a/gcc/testsuite/gcc.c-torture/execute/ssad-run.c b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
new file mode 100644
index 000000000000..f15f85f57537
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
@@ -0,0 +1,49 @@
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (signed char *w, int i, signed char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (signed char *w, signed char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  signed char m[256];
+  signed char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+	m[i] = (i % 8) * 2 + 1;
+	n[i] = -(i % 8);
+      }
+    else
+      {
+	m[i] = -((i % 8) * 2 + 2);
+	n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 2368)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.c-torture/execute/usad-run.c b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
new file mode 100644
index 000000000000..904a634a4976
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
@@ -0,0 +1,49 @@
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  unsigned char m[256];
+  unsigned char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+	m[i] = (i % 8) * 2 + 1;
+	n[i] = -(i % 8);
+      }
+    else
+      {
+	m[i] = -((i % 8) * 2 + 2);
+	n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 32384)
+    abort ();
+
+  return 0;
+}
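The execute tests above cover both signedness variants; for reference, the
signed expansion mirrors the unsigned sequence shown earlier.  In intrinsics
terms (again a sketch with a made-up helper name, not part of the patch):

#include <arm_neon.h>

/* SABDL2 + SABAL + SADALP, the signed counterpart of the sequence.  */
int32x4_t
ssadv16qi_model (int32x4_t op3, int8x16_t op1, int8x16_t op2)
{
  int16x8_t tmp = vabdl_high_s8 (op1, op2);
  tmp = vabal_s8 (tmp, vget_low_s8 (op1), vget_low_s8 (op2));
  return vpadalq_s16 (op3, tmp);
}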
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
new file mode 100644
index 000000000000..bab759929868
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+signed char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tsshll\t} } } */
+/* { dg-final { scan-assembler-not {\tsshll2\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tsabdl2\t} } } */
+/* { dg-final { scan-assembler {\tsabal\t} } } */
+/* { dg-final { scan-assembler {\tsadalp\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/usadv16qi.c b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
new file mode 100644
index 000000000000..b7c08ee1e118
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+unsigned char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tushll\t} } } */
+/* { dg-final { scan-assembler-not {\tushll2\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tuabdl2\t} } } */
+/* { dg-final { scan-assembler {\tuabal\t} } } */
+/* { dg-final { scan-assembler {\tuadalp\t} } } */
-- 
2.43.5