[PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions

Thu Aug 29 14:17:00 GMT 2019

This patch enables more efficient SVE2 vectorization of Multiply High with Round and Scale (MULHRS) patterns, using widening multiplies followed by rounding narrowing shifts.

The example snippet:

    uint16_t a[N], b[N], c[N];

    void foo_round (void)
    {
        for (int i = 0; i < N; i++)
            a[i] = ((((int32_t)b[i] * (int32_t)c[i]) >> 14) + 1) >> 1;
    }

... previously vectorized to:

    foo_round:
        ...
        ptrue   p0.s
        whilelo p1.h, wzr, w2
        ld1h    {z2.h}, p1/z, [x4, x0, lsl #1]
        ld1h    {z0.h}, p1/z, [x3, x0, lsl #1]
        uunpklo z3.s, z2.h                      //
        uunpklo z1.s, z0.h                      //
        uunpkhi z2.s, z2.h                      //
        uunpkhi z0.s, z0.h                      //
        mul     z1.s, p0/m, z1.s, z3.s          //
        mul     z0.s, p0/m, z0.s, z2.s          //
        asr     z1.s, z1.s, #14                 //
        asr     z0.s, z0.s, #14                 //
        add     z1.s, z1.s, #1                  //
        add     z0.s, z0.s, #1                  //
        asr     z1.s, z1.s, #1                  //
        asr     z0.s, z0.s, #1                  //
        uzp1    z0.h, z1.h, z0.h                //
        st1h    {z0.h}, p1, [x1, x0, lsl #1]
        inch    x0
        whilelo p1.h, w0, w2
        b.ne    28
        ret

... and now vectorizes to:

    foo_round:
        ...
        whilelo p0.h, wzr, w2
        nop
        ld1h    {z1.h}, p0/z, [x4, x0, lsl #1]
        ld1h    {z2.h}, p0/z, [x3, x0, lsl #1]
        umullb  z0.s, z1.h, z2.h                //
        umullt  z1.s, z1.h, z2.h                //
        rshrnb  z0.h, z0.s, #15                 //
        rshrnt  z0.h, z1.s, #15                 //
        st1h    {z0.h}, p0, [x1, x0, lsl #1]
        inch    x0
        whilelo p0.h, w0, w2
        b.ne    28
        ret
        nop
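
To illustrate why the new sequence is equivalent, here is a minimal scalar sketch in C of the four changed instructions. It is not part of the patch. The helper names, the fixed 8-lane vector and the function foo_round_model are assumptions made for illustration only; real SVE vector lengths are implementation-defined. The instruction models follow the architectural descriptions: UMULLB/UMULLT widen and multiply the even/odd .h elements, and RSHRNB/RSHRNT add 1 << (shift - 1), shift right, and write the narrowed result to the even/odd .h elements.

```c
#include <assert.h>
#include <stdint.h>

enum { H_LANES = 8, S_LANES = H_LANES / 2 };  /* illustrative vector size */

/* Scalar reference: the loop body of foo_round.  The caller must keep
   b * c within INT32_MAX, since the int32_t multiply would otherwise
   overflow in C.  */
static uint16_t mulhrs_ref(uint16_t b, uint16_t c)
{
    return (uint16_t)(((((int32_t)b * (int32_t)c) >> 14) + 1) >> 1);
}

/* UMULLB: widening multiply of the even-numbered (bottom) .h elements.  */
static void umullb(uint32_t zd[S_LANES], const uint16_t zn[H_LANES],
                   const uint16_t zm[H_LANES])
{
    for (int i = 0; i < S_LANES; i++)
        zd[i] = (uint32_t)zn[2 * i] * zm[2 * i];
}

/* UMULLT: widening multiply of the odd-numbered (top) .h elements.  */
static void umullt(uint32_t zd[S_LANES], const uint16_t zn[H_LANES],
                   const uint16_t zm[H_LANES])
{
    for (int i = 0; i < S_LANES; i++)
        zd[i] = (uint32_t)zn[2 * i + 1] * zm[2 * i + 1];
}

/* RSHRNB: rounding shift right, narrow, write even elements, zero odd.  */
static void rshrnb(uint16_t zd[H_LANES], const uint32_t zn[S_LANES],
                   unsigned shift)
{
    assert(shift >= 1 && shift <= 16);
    for (int i = 0; i < S_LANES; i++) {
        uint64_t rounded = ((uint64_t)zn[i] + (1ull << (shift - 1))) >> shift;
        zd[2 * i] = (uint16_t)rounded;
        zd[2 * i + 1] = 0;
    }
}

/* RSHRNT: as RSHRNB, but write odd elements and keep even ones.  */
static void rshrnt(uint16_t zd[H_LANES], const uint32_t zn[S_LANES],
                   unsigned shift)
{
    assert(shift >= 1 && shift <= 16);
    for (int i = 0; i < S_LANES; i++) {
        uint64_t rounded = ((uint64_t)zn[i] + (1ull << (shift - 1))) >> shift;
        zd[2 * i + 1] = (uint16_t)rounded;
    }
}

/* One vector iteration of the new foo_round sequence.  */
static void foo_round_model(uint16_t a[H_LANES], const uint16_t b[H_LANES],
                            const uint16_t c[H_LANES])
{
    uint32_t bottom[S_LANES], top[S_LANES];
    umullb(bottom, b, c);
    umullt(top, b, c);
    rshrnb(a, bottom, 15);
    rshrnt(a, top, 15);
}
```

The key identity is that ((x >> 14) + 1) >> 1 and (x + (1 << 14)) >> 15 agree in their low 16 bits, so a single rounding shift by 15 replaces the shift, add and shift in the old sequence. The bottom/top pairing also leaves the results in their original lane order, which is why no UZP1 is needed.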

Also supported are:

* Non-rounding cases

    The equivalent example snippet:

        void foo_trunc (void)
        {
            for (int i = 0; i < N; i++)
                a[i] = ((int32_t)b[i] * (int32_t)c[i]) >> 15;
        }

    ... vectorizes with SHRNT/SHRNB in place of RSHRNT/RSHRNB

* 32-bit and 8-bit input/output types

* Signed output types

    SMULLT/SMULLB are generated instead of UMULLT/UMULLB
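
As a rough illustration of how the non-rounding and rounding forms differ for signed 16-bit inputs, the following scalar C sketch evaluates both expressions. The function names are hypothetical and do not appear in the patch. The sketch assumes arithmetic right shift of negative values and modulo conversion to int16_t, which is what GCC documents but which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Non-rounding form, as in foo_trunc but with signed 16-bit types.  */
static int16_t mulhs_trunc_s16(int16_t b, int16_t c)
{
    return (int16_t)(((int32_t)b * (int32_t)c) >> 15);
}

/* Rounding form, as in foo_round but with signed 16-bit types.  */
static int16_t mulhrs_round_s16(int16_t b, int16_t c)
{
    return (int16_t)(((((int32_t)b * (int32_t)c) >> 14) + 1) >> 1);
}
```

For example, with b = 3 and c = 8192 the product is 24576, i.e. 0.75 in units of 1 << 15: the truncating form gives 0 and the rounding form gives 1. With b = -1 and c = 8192 the truncating form gives -1 (the shift rounds toward negative infinity) and the rounding form gives 0.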

SQRDMULH was considered as a potential single-instruction optimization, but it saturates the intermediate value instead of truncating it, so its result can differ from that of the pattern.
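
The difference can be seen with a small scalar C sketch. It is only a model, not code from the patch: sqrdmulh_s16 follows the architectural description of one 16-bit SQRDMULH lane (double the product, add the rounding constant, shift right by 16, saturate), and mulhrs_s16 is a hypothetical name for the signed form of the rounding pattern. The conversion to int16_t in mulhrs_s16 assumes GCC's modulo behaviour, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Model of one signed 16-bit SQRDMULH lane:
   sat16((2 * a * b + (1 << 15)) >> 16).  */
static int16_t sqrdmulh_s16(int16_t a, int16_t b)
{
    int64_t r = (2 * (int64_t)a * (int64_t)b + (1 << 15)) >> 16;
    if (r > INT16_MAX)
        r = INT16_MAX;
    if (r < INT16_MIN)
        r = INT16_MIN;
    return (int16_t)r;
}

/* The rounding pattern with signed 16-bit types: the result is
   truncated to 16 bits rather than saturated.  */
static int16_t mulhrs_s16(int16_t a, int16_t b)
{
    return (int16_t)(((((int32_t)a * (int32_t)b) >> 14) + 1) >> 1);
}
```

The two agree for ordinary inputs, for example both give 61 for 1000 and 2000. They differ when both inputs are INT16_MIN: the intermediate value is 32768, which SQRDMULH saturates to 32767 while the pattern truncates it to -32768.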

Best Regards,
Yuliang Wang

ChangeLog:

2019-08-22  Yuliang Wang  <yuliang.wang@arm.com>

        * config/aarch64/aarch64-sve2.md: Add support for SVE2 instructions [S/U]MULL[T/B] + [R]SHRN[T/B] and MULHRS pattern variants.
        * config/aarch64/iterators.md: Add iterators and attributes for above.
        * internal-fn.def: Add internal functions for MULH[R]S patterns.
        * optabs.def: Add optabs definitions for above and sign variants.
        * tree-vect-patterns.c (vect_recog_multhi_pattern): Add pattern recognition function for MULHRS.
        * gcc.target/aarch64/sve2/mulhrs_1.c: New test for all variants.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rb11655.patch
Type: application/octet-stream
Size: 20235 bytes
Desc: rb11655.patch
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20190829/e634c3a0/attachment.obj>