This is the mail archive of the mailing list for the GCC project.


[PATCH][AArch64] Vectorize MULH(R)S patterns with SVE2 instructions

This patch allows for more efficient SVE2 vectorization of Multiply High with Round and Scale (MULHRS) patterns.

The example snippet:

    uint16_t a[N], b[N], c[N];

    void foo_round (void)
    {
        for (int i = 0; i < N; i++)
            a[i] = ((((int32_t)b[i] * (int32_t)c[i]) >> 14) + 1) >> 1;
    }

... previously vectorized to:

        ptrue   p0.s
        whilelo p1.h, wzr, w2
        ld1h    {z2.h}, p1/z, [x4, x0, lsl #1]
        ld1h    {z0.h}, p1/z, [x3, x0, lsl #1]
        uunpklo z3.s, z2.h                      //
        uunpklo z1.s, z0.h                      //
        uunpkhi z2.s, z2.h                      //
        uunpkhi z0.s, z0.h                      //
        mul     z1.s, p0/m, z1.s, z3.s          //
        mul     z0.s, p0/m, z0.s, z2.s          //
        asr     z1.s, z1.s, #14                 //
        asr     z0.s, z0.s, #14                 //
        add     z1.s, z1.s, #1                  //
        add     z0.s, z0.s, #1                  //
        asr     z1.s, z1.s, #1                  //
        asr     z0.s, z0.s, #1                  //
        uzp1    z0.h, z1.h, z0.h                //
        st1h    {z0.h}, p1, [x1, x0, lsl #1]
        inch    x0
        whilelo p1.h, w0, w2

... and now vectorizes to:

        whilelo p0.h, wzr, w2
        ld1h    {z1.h}, p0/z, [x4, x0, lsl #1]
        ld1h    {z2.h}, p0/z, [x3, x0, lsl #1]
        umullb  z0.s, z1.h, z2.h                //
        umullt  z1.s, z1.h, z2.h                //
        rshrnb  z0.h, z0.s, #15                 //
        rshrnt  z0.h, z1.s, #15                 //
        st1h    {z0.h}, p0, [x1, x0, lsl #1]
        inch    x0
        whilelo p0.h, w0, w2
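
The new sequence replaces the shift by 14, add 1, shift by 1 steps with a single rounding shift by 15. The following is a minimal scalar sketch of why the two forms should agree, not code from the patch. The helper names are hypothetical, and the sketch uses unsigned arithmetic (an assumption made here so that large products of uint16_t values cannot overflow a signed int):

```c
#include <assert.h>
#include <stdint.h>

/* Two-step form from foo_round: shift by 14, add 1, shift by 1.
   Assumption of this sketch: unsigned 32-bit arithmetic is used so the
   product of two uint16_t values never overflows. */
static uint16_t mulhrs_two_step(uint16_t b, uint16_t c)
{
    uint32_t prod = (uint32_t)b * (uint32_t)c;
    return (uint16_t)(((prod >> 14) + 1) >> 1);
}

/* Single rounding shift, modelling a widening multiply followed by a
   rounding narrowing shift by 15: add half of the final unit (1 << 14),
   shift right by 15, keep the low 16 bits. */
static uint16_t mulhrs_round_shift(uint16_t b, uint16_t c)
{
    uint64_t prod = (uint64_t)b * (uint64_t)c;
    return (uint16_t)((prod + (1u << 14)) >> 15);
}

/* Exhaustive check for one value of b: aborts if the forms ever differ. */
static void check_forms_agree(uint16_t b)
{
    for (uint32_t c = 0; c <= UINT16_MAX; c++)
        assert(mulhrs_two_step(b, (uint16_t)c)
               == mulhrs_round_shift(b, (uint16_t)c));
}
```

Calling check_forms_agree for a handful of b values (for example 0, 3 and 0xffff) exercises every c for each of them. The intuition is that adding 1 after shifting by 14 is the same as adding 1 << 14 before shifting by 15, since the 14 bits discarded by the first shift can never carry into the result.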

Also supported are:

* Non-rounding cases

    The equivalent example snippet:

        void foo_trunc (void)
        {
            for (int i = 0; i < N; i++)
                a[i] = ((int32_t)b[i] * (int32_t)c[i]) >> 15;
        }

    ... vectorizes with SHRNT/SHRNB

* 32-bit and 8-bit input/output types

* Signed output types

    SMULLT/SMULLB are generated instead
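
To illustrate how the bottom/top instruction pairs fit together without unpack or uzp1 steps, here is a rough scalar model of the signed, non-rounding variant. It reflects my reading of the SVE2 instruction definitions (the B forms work on even-numbered elements, the T forms on odd-numbered ones) and is not taken from the patch. LANES and the function names are hypothetical, and the reference function assumes that >> on a negative int32_t is an arithmetic shift, as it is with GCC:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* hypothetical vector length, in 16-bit elements */

/* Scalar reference: the foo_trunc loop body on signed 16-bit data,
   returned as a 16-bit pattern. Assumes arithmetic >> on negative values. */
static uint16_t trunc_ref(int16_t b, int16_t c)
{
    int32_t prod = (int32_t)b * (int32_t)c;
    return (uint16_t)(prod >> 15);
}

/* Model of SMULLB/SMULLT followed by SHRNB/SHRNT #15. */
static void trunc_bottom_top(const int16_t *b, const int16_t *c, uint16_t *out)
{
    uint32_t wide_bottom[LANES / 2], wide_top[LANES / 2];

    for (size_t i = 0; i < LANES / 2; i++) {
        /* SMULLB: widening multiply of the even-numbered elements. */
        wide_bottom[i] = (uint32_t)((int32_t)b[2 * i] * (int32_t)c[2 * i]);
        /* SMULLT: widening multiply of the odd-numbered elements. */
        wide_top[i] = (uint32_t)((int32_t)b[2 * i + 1] * (int32_t)c[2 * i + 1]);
    }
    for (size_t i = 0; i < LANES / 2; i++) {
        /* SHRNB: narrow into the even elements, zeroing the odd ones. */
        out[2 * i] = (uint16_t)(wide_bottom[i] >> 15);
        out[2 * i + 1] = 0;
    }
    for (size_t i = 0; i < LANES / 2; i++) {
        /* SHRNT: narrow into the odd elements, leaving the even ones alone. */
        out[2 * i + 1] = (uint16_t)(wide_top[i] >> 15);
    }
}

/* Returns 1 if the lane model reproduces the scalar loop for every lane. */
static int model_matches_reference(const int16_t *b, const int16_t *c)
{
    uint16_t out[LANES];
    trunc_bottom_top(b, c, out);
    for (size_t i = 0; i < LANES; i++)
        if (out[i] != trunc_ref(b[i], c[i]))
            return 0;
    return 1;
}
```

Because the T forms write results back into the odd-numbered elements, the narrowed values land in their original order, which is why the new sequence needs no separate interleaving step.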

SQRDMULH was considered as a potential single-instruction optimization, but it saturates the intermediate value instead of truncating it, so it does not compute the same result as these patterns for all inputs.
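
As a worked illustration of the difference, the sketch below compares a scalar model of one 16-bit SQRDMULH lane with the foo_round computation on signed inputs. The function names are hypothetical, and the SQRDMULH model is my own reading of the instruction (double the product, round, take the high half, saturate), not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret the low 16 bits of x as a signed value (two's-complement
   wrap), written out explicitly to avoid implementation-defined casts. */
static int16_t wrap_i16(int32_t x)
{
    uint16_t u = (uint16_t)x;  /* well-defined: reduces modulo 2^16 */
    return (u >= 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* The foo_round pattern on signed 16-bit inputs, with a truncating store. */
static int16_t mulhrs_pattern(int16_t b, int16_t c)
{
    int32_t prod = (int32_t)b * (int32_t)c;
    return wrap_i16(((prod >> 14) + 1) >> 1);
}

/* Assumed model of one SQRDMULH lane: double, round, take the high half,
   then saturate to the int16_t range. */
static int16_t sqrdmulh_model(int16_t b, int16_t c)
{
    int64_t doubled = 2 * (int64_t)b * (int64_t)c;
    int64_t res = (doubled + (1 << 15)) >> 16;
    if (res > INT16_MAX) res = INT16_MAX;
    if (res < INT16_MIN) res = INT16_MIN;
    return (int16_t)res;
}
```

Under this model the two agree for ordinary inputs such as 12345 and 6789, but for INT16_MIN times INT16_MIN the exact result is 32768: the pattern wraps it to INT16_MIN, while the SQRDMULH model clamps it to INT16_MAX.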

Best Regards,
Yuliang Wang


2019-08-22  Yuliang Wang  <>

        * config/aarch64/ support for SVE2 instructions [S/U]MULL[T/B] + [R]SHRN[T/B] and MULHRS pattern variants
        * config/aarch64/ iterators and attributes for above
        * internal-fn.def: internal functions for MULH[R]S patterns
        * optabs.def: optabs definitions for above and sign variants
        * tree-vect-patterns.c (vect_recog_multhi_pattern): pattern recognition function for MULHRS
        * new test for all variants

Attachment: rb11655.patch
Description: rb11655.patch
