[PATCH] x86_64: Implement V1TI mode shifts/rotates by a constant

Uros Bizjak <ubizjak@gmail.com>
Mon Oct 25 18:57:12 GMT 2021


On Mon, Oct 25, 2021 at 4:16 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> I believe the proposed sequences should be dramatically faster than LLVM's
> implementation(s), due to the large latencies required to move values between
> the vector and scalar parts on modern x86_64 microarchitectures.  All of the
> SSE2 instructions used in the sequences proposed by my patch have single
> cycle latencies, so each sequence has a total latency of at most 5 cycles,
> and thanks to multiple issue typically requires only between 1 and 3 cycles,
> depending upon the sequence.
>
> Moving between units is significantly slower; according to Agner Fog's tables,
> the pinsrq/pextrq instructions you suggest have latencies up to 7 cycles on the
> Silvermont architecture.  Let's take the LLVM code you've provided and
> annotate it with cycle counts for a recent Intel (cascadelake) CPU and a
> recent AMD (zen2) CPU.
>
> movq    %xmm0, %rax             ; 2-3 cycles
> pshufd  $78, %xmm0, %xmm0       ; 1 cycle
> movq    %xmm0, %rcx             ; 2-3 cycles
> shldq   $8, %rax, %rcx          ; 3 cycles
> shlq    $8, %rax                ; 1 cycle
> movq    %rcx, %xmm1             ; 2-3 cycles
> movq    %rax, %xmm0             ; 2-3 cycles
> punpcklqdq      %xmm1, %xmm0    ; 1 cycle
>
> This 8-instruction sequence has a total latency of 14 cycles on CascadeLake and
> 18 cycles on Zen2, but a scheduled cycle count of 9 and 11 cycles, respectively.
>
> The same left shift by 8 as implemented by the proposed patch is:
>
> pslldq  $1, %xmm0               ; 1 cycle
>
> And for reference, the code currently generated by GCC is:
>
> movaps  %xmm0, -24(%rsp)        ; 3 cycles
> movq    -24(%rsp), %rax         ; 2 cycles
> movq    -16(%rsp), %rdx         ; 2 cycles
> shldq   $8, %rax, %rdx          ; 3 cycles
> salq    $8, %rax                ; 1 cycle
> movq    %rax, -24(%rsp)         ; 2 cycles
> movq    %rdx, -16(%rsp)         ; 2 cycles
> movdqa  -24(%rsp), %xmm0        ; 2 cycles
>
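> (For reference, the shift above can be exercised from C via GCC's vector
> extensions; this is just a sketch with an illustrative typedef name, not
> code taken from the patch itself.)
>
> typedef unsigned __int128 uv1ti __attribute__ ((vector_size (16)));
>
> /* Shift the single 128-bit element left by 8 bits; with this patch
>    such a function compiles to the single pslldq shown above.  */
> uv1ti ashl_8 (uv1ti x) { return x << 8; }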
>
> The very worst case timing of my patches is the five instruction rotate:
> pshufd  $78, %xmm0, %xmm1       ; 1 cycle
> pshufd  $57, %xmm0, %xmm0       ; 1 cycle
> pslld   $1, %xmm1               ; 1 cycle
> psrld   $31, %xmm0              ; 1 cycle
> por     %xmm1, %xmm0            ; 1 cycle
>
> which has a total latency of 5 cycles, but can complete in 3 cycles when
> suitably scheduled, as the two pshufd instructions can execute concurrently,
> as can the two shifts, followed finally by the por.
>
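> (At the source level, that rotate corresponds to the usual rotate idiom on
> the single V1TI element, e.g. something like the sketch below, assuming the
> idiom is recognised for vector types; again the typedef name is illustrative.)
>
> typedef unsigned __int128 uv1ti __attribute__ ((vector_size (16)));
>
> /* Rotate the 128-bit element right by 63 bits.  */
> uv1ti rotr_63 (uv1ti x) { return (x >> 63) | (x << 65); }
>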
> Perhaps I'm missing something, but I'd expect this patch to be three or
> four times faster, on recent hardware, than the code generated by LLVM.
>
> Let me know if you'd like me to run microbenchmarks, but the documented
> timings are such a dramatic improvement that I'm a little surprised you've
> asked about performance.  My patch is also a code size win with -Os
> (ashl_8 is currently 39 bytes, and shrinks to 5 bytes with this patch).

I was a bit worried about latencies, but as shown above in great
detail, this worry was not justified. Yes, taking into account that
V1TI lives natively in XMM registers, we should keep it there as much
as possible, and even if the sequences look complicated at first
sight, they win in all cases.

So, the patch is OK.

Thanks,
Uros.

> Please let me know what you think.
> Roger
> --
>
> -----Original Message-----
> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: 25 October 2021 09:02
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [PATCH] x86_64: Implement V1TI mode shifts/rotates by a constant
>
> On Sun, Oct 24, 2021 at 6:34 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> >
> > This patch provides RTL expanders to implement logical shifts and
> > rotates of 128-bit values (stored in vector integer registers) by
> > constant bit counts.  Previously, GCC would transfer these values to a
> > pair of scalar registers (TImode) via memory to perform the operation,
> > then transfer the result back via memory.  Instead these operations
> > are now expanded using (between 1 and 5) SSE2 vector instructions.
>
> Hm, instead of using memory (without store-to-load forwarding for general ->
> XMM moves!) these should use something similar to what clang produces (or
> use pextrq/pinsrq, at least with SSE4.1):
>
>        movq    %xmm0, %rax
>        pshufd  $78, %xmm0, %xmm0
>        movq    %xmm0, %rcx
>        shldq   $8, %rax, %rcx
>        shlq    $8, %rax
>        movq    %rcx, %xmm1
>        movq    %rax, %xmm0
>        punpcklqdq      %xmm1, %xmm0
>
> > Logical shifts by multiples of 8 can be implemented using x86_64's
> > pslldq/psrldq instructions:
> > ashl_8: pslldq  $1, %xmm0
> >         ret
> > lshr_32:
> >         psrldq  $4, %xmm0
> >         ret
> >
> > Logical shifts by greater than 64 can use pslldq/psrldq $8, followed
> > by a psllq/psrlq for the remaining bits:
> > ashl_111:
> >         pslldq  $8, %xmm0
> >         psllq   $47, %xmm0
> >         ret
> > lshr_127:
> >         psrldq  $8, %xmm0
> >         psrlq   $63, %xmm0
> >         ret
> >
> > The remaining logical shifts make use of the following idiom:
> > ashl_1:
> >         movdqa  %xmm0, %xmm1
> >         psllq   $1, %xmm0
> >         pslldq  $8, %xmm1
> >         psrlq   $63, %xmm1
> >         por     %xmm1, %xmm0
> >         ret
> > lshr_15:
> >         movdqa  %xmm0, %xmm1
> >         psrlq   $15, %xmm0
> >         psrldq  $8, %xmm1
> >         psllq   $49, %xmm1
> >         por     %xmm1, %xmm0
> >         ret
> >
> > Rotates by multiples of 32 can use x86_64's pshufd:
> > rotr_32:
> >         pshufd  $57, %xmm0, %xmm0
> >         ret
> > rotr_64:
> >         pshufd  $78, %xmm0, %xmm0
> >         ret
> > rotr_96:
> >         pshufd  $147, %xmm0, %xmm0
> >         ret
> >
> > Rotates by multiples of 8 (other than multiples of 32) can make use of
> > both pslldq and psrldq, followed by por:
> > rotr_8:
> >         movdqa  %xmm0, %xmm1
> >         psrldq  $1, %xmm0
> >         pslldq  $15, %xmm1
> >         por     %xmm1, %xmm0
> >         ret
> > rotr_112:
> >         movdqa  %xmm0, %xmm1
> >         psrldq  $14, %xmm0
> >         pslldq  $2, %xmm1
> >         por     %xmm1, %xmm0
> >         ret
> >
> > And the remaining rotates use one or two pshufd, followed by a
> > psrld/pslld/por sequence:
> > rotr_1:
> >         movdqa  %xmm0, %xmm1
> >         pshufd  $57, %xmm0, %xmm0
> >         psrld   $1, %xmm1
> >         pslld   $31, %xmm0
> >         por     %xmm1, %xmm0
> >         ret
> > rotr_63:
> >         pshufd  $78, %xmm0, %xmm1
> >         pshufd  $57, %xmm0, %xmm0
> >         pslld   $1, %xmm1
> >         psrld   $31, %xmm0
> >         por     %xmm1, %xmm0
> >         ret
> > rotr_111:
> >         pshufd  $147, %xmm0, %xmm1
> >         pslld   $17, %xmm0
> >         psrld   $15, %xmm1
> >         por     %xmm1, %xmm0
> >         ret
> >
> > The new test case, sse2-v1ti-shift.c, is a run-time check to confirm
> > that the results of V1TImode shifts/rotates by constants exactly
> > match the expected results of TImode operations for various input test vectors.
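> >
> > In outline, each check has roughly the following shape (a sketch, not the
> > testsuite file verbatim; the real test covers many shift/rotate counts and
> > input vectors):
> >
> > typedef unsigned __int128 uv1ti __attribute__ ((vector_size (16)));
> >
> > void test_ashl_8 (unsigned __int128 i)
> > {
> >   uv1ti v = { i };
> >   uv1ti r = v << 8;               /* V1TImode shift, via the new expander */
> >   unsigned __int128 e = i << 8;   /* TImode reference result */
> >   if (r[0] != e)
> >     __builtin_abort ();
> > }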
>
> Is the sequence of 4+ SSE instructions really faster than pinsrq/pextrq (and
> two movq insns) + two operations on integer registers?
>
> Uros.
>
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check with no new failures.  Ok for mainline?
> >
> >
> > 2021-10-24  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386-expand.c (ix86_expand_v1ti_shift): New helper
> >         function to expand V1TI mode logical shifts by integer constants.
> >         (ix86_expand_v1ti_rotate): New helper function to expand V1TI
> >         mode rotations by integer constants.
> >         * config/i386/i386-protos.h (ix86_expand_v1ti_shift,
> >         ix86_expand_v1ti_rotate): Prototype new functions here.
> >         * config/i386/sse.md (ashlv1ti3, lshrv1ti3, rotlv1ti3, rotrv1ti3):
> >         New TARGET_SSE2 expanders to implement V1TI shifts and rotations.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/sse2-v1ti-shift.c: New test case.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >
>

