[x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

H.J. Lu hjl.tools@gmail.com
Sun Jul 10 23:56:03 GMT 2022


On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi HJ,
>
> I believe this should now be handled by the post-reload (CSE) pass.
> Consider the simple test case:
>
> __int128 a, b, c;
> void foo()
> {
>   a = 0;
>   b = 0;
>   c = 0;
> }
>
> Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
>         movq    $0, a(%rip)
>         movq    $0, a+8(%rip)
>         movq    $0, b(%rip)
>         movq    $0, b+8(%rip)
>         movq    $0, c(%rip)
>         movq    $0, c+8(%rip)
>         ret
>
> But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         movaps  %xmm0, b(%rip)
>         movaps  %xmm0, c(%rip)
>         ret
>
> You're quite right internally the STV actually generates the equivalent of:
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         ret
>
> And currently because STV run before cse2 and combine, the const0_rtx
> gets CSE'd be the cse2 pass to produce the code we see.  However, if you
> specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see we
> continue to generate the same optimized code, as the same const0_rtx
> gets CSE'd in postreload.
>
> I can't be certain until I try the experiment, but I believe that the postreload
> CSE will clean-up, all of the same common subexpressions.  Hence, it should
> be safe to perform all STV at the same point (after combine), which for a few
> additional optimizations.
>
> Does this make sense?  Do you have a test case, -fno-rerun-cse-after-loop
> produces different/inferior code for TImode STV chains?
>
> My guess is that the RTL passes have changed so much in the last six or
> seven years, that some of the original motivation no longer applies.
> Certainly we now try to keep TI mode operations visible longer, and
> then allow STV to behave like a pre-reload pass to decide which set of
> registers to use (vector V1TI or scalar doubleword DI).  Any CSE opportunities
> that cse2 finds with V1TI mode, could/should equally well be found for
> TI mode (mostly).

You are probably right.  If there are no regressions in GCC testsuite,
my original motivation is no longer valid.

Thanks.

> Cheers,
> Roger
> --
>
> > -----Original Message-----
> > From: H.J. Lu <hjl.tools@gmail.com>
> > Sent: 10 July 2022 20:15
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: Uros Bizjak <ubizjak@gmail.com>; GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for
> > TImode to V1TImode.
> >
> > On Sun, Jul 10, 2022 at 11:36 AM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > Hi Uros,
> > > Yes, I agree.  I think it makes sense to have a single STV pass (after
> > > combine and before reload).  Let's hear what HJ thinks, but I'm happy
> > > to investigate a follow-up patch that unifies the STV passes.
> > > But it'll be easier to confirm there are no "code generation" changes
> > > if those modifications are pushed independently of these ones.
> > > Time to look into the (git) history of multiple STV passes...
> > >
> > > Thanks for the review.  I'll wait for HJ's thoughts.
> >
> > The TImode STV pass is run before the CSE pass so that instructions changed or
> > generated by the STV pass can be CSEed.
> >
> > > Cheers,
> > > Roger
> > > --
> > >
> > > > -----Original Message-----
> > > > From: Uros Bizjak <ubizjak@gmail.com>
> > > > Sent: 10 July 2022 19:06
> > > > To: Roger Sayle <roger@nextmovesoftware.com>
> > > > Cc: gcc-patches@gcc.gnu.org; H. J. Lu <hjl.tools@gmail.com>
> > > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support
> > > > for TImode to V1TImode.
> > > >
> > > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle
> > > > <roger@nextmovesoftware.com>
> > > > wrote:
> > > > >
> > > > >
> > > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more
> > > > > aggressively transform 128-bit scalar TImode operations into
> > > > > vector V1TImode operations performed on SSE registers.  TImode
> > > > > functionality already exists in STV, but only for move operations,
> > > > > this changes brings support for logical operations (AND, IOR, XOR,
> > > > > NOT and ANDN) and comparisons.
> > > > >
> > > > > The effect of these changes are conveniently demonstrated by the
> > > > > new sse4_1-stv-5.c test case:
> > > > >
> > > > > __int128 a[16];
> > > > > __int128 b[16];
> > > > > __int128 c[16];
> > > > >
> > > > > void foo()
> > > > > {
> > > > >   for (unsigned int i=0; i<16; i++)
> > > > >     a[i] = b[i] & ~c[i];
> > > > > }
> > > > >
> > > > > which when currently compiled on mainline wtih -O2 -msse4 produces:
> > > > >
> > > > > foo:    xorl    %eax, %eax
> > > > > .L2:    movq    c(%rax), %rsi
> > > > >         movq    c+8(%rax), %rdi
> > > > >         addq    $16, %rax
> > > > >         notq    %rsi
> > > > >         notq    %rdi
> > > > >         andq    b-16(%rax), %rsi
> > > > >         andq    b-8(%rax), %rdi
> > > > >         movq    %rsi, a-16(%rax)
> > > > >         movq    %rdi, a-8(%rax)
> > > > >         cmpq    $256, %rax
> > > > >         jne     .L2
> > > > >         ret
> > > > >
> > > > > but with this patch now produces:
> > > > >
> > > > > foo:    xorl    %eax, %eax
> > > > > .L2:    movdqa  c(%rax), %xmm0
> > > > >         pandn   b(%rax), %xmm0
> > > > >         addq    $16, %rax
> > > > >         movaps  %xmm0, a-16(%rax)
> > > > >         cmpq    $256, %rax
> > > > >         jne     .L2
> > > > >         ret
> > > > >
> > > > > Technically, the STV pass is implemented by three C++ classes, a
> > > > > common abstract base class "scalar_chain" that contains common
> > > > > functionality, and two derived classes: general_scalar_chain
> > > > > (which handles SI and DI modes) and timode_scalar_chain (which
> > > > > handles TI modes).  As mentioned previously, because only TI mode
> > > > > moves were handled the two worker classes behaved significantly
> > differently.
> > > > > These changes bring the functionality of these two classes closer
> > > > > together, which is reflected by refactoring more shared code from
> > > > > general_scalar_chain to the parent scalar_chain and reusing it
> > > > > from timode.  There still remain significant differences (and
> > > > > simplifications) so the existing division of classes (as
> > > > > specializations) continues
> > > > to make sense.
> > > >
> > > > Please note that there are in fact two STV passes, one before
> > > > combine and the other after combine. The TImode pass that previously
> > > > handled only loads and stores is positioned before combine (there
> > > > was a reason for this decision, but I don't remember the details -
> > > > let's ask HJ...). However, DImode STV pass transforms much more
> > > > instructions and the reason it was positioned after the combine pass
> > > > was that STV pass transforms optimized insn stream where forward
> > propagation was already performed.
> > > >
> > > > What is not clear to me from the above explanation is: is the new
> > > > TImode STV pass positioned after the combine pass, and if this is
> > > > the case, how the change affects current load/store TImode STV pass.
> > > > I must admit, I don't like two separate STV passess, so if TImode is
> > > > now similar to DImode, I suggest we abandon STV1 pass and do
> > > > everything concerning TImode after the combine pass. HJ, what is your
> > opinion on this?
> > > >
> > > > Other than the above, the patch LGTM to me.
> > > >
> > > > Uros.
> > > >
> > > > > Obviously, there are more changes to come (shifts and rotates),
> > > > > and compute_convert_gain doesn't yet have its final (tuned) form,
> > > > > but is already an improvement over the "return 1;" used previously.
> > > > >
> > > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > > boostrap and make -k check, both with and without
> > > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > > >
> > > > >
> > > > > 2022-07-09  Roger Sayle  <roger@nextmovesoftware.com>
> > > > >
> > > > > gcc/ChangeLog
> > > > >         * config/i386/i386-features.h (scalar_chain): Add fields
> > > > >         insns_conv, n_sse_to_integer and n_integer_to_sse to this
> > > > >         parent class, moved from general_scalar_chain.
> > > > >         (scalar_chain::convert_compare): Protected method moved
> > > > >         from general_scalar_chain.
> > > > >         (mark_dual_mode_def): Make protected, not private virtual.
> > > > >         (scalar_chain:convert_op): New private virtual method.
> > > > >
> > > > >         (general_scalar_chain::general_scalar_chain): Simplify constructor.
> > > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > > >         (general_scalar_chain): Move insns_conv, n_sse_to_integer and
> > > > >         n_integer_to_sse fields to parent class, scalar_chain.
> > > > >         (general_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > > >         (general_scalar_chain::convert_compare): Delete prototype.
> > > > >
> > > > >         (timode_scalar_chain::compute_convert_gain): Remove simplistic
> > > > >         implementation, convert to a method prototype.
> > > > >         (timode_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > > >         (timode_scalar_chain::convert_op): Prototype new virtual method.
> > > > >
> > > > >         * config/i386/i386-features.cc (scalar_chain::scalar_chain):
> > > > >         Allocate insns_conv and initialize n_sse_to_integer and
> > > > >         n_integer_to_sse fields in constructor.
> > > > >         (scalar_chain::scalar_chain): Free insns_conv in destructor.
> > > > >
> > > > >         (general_scalar_chain::general_scalar_chain): Delete
> > > > >         constructor, now defined in the class declaration.
> > > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > > >
> > > > >         (scalar_chain::mark_dual_mode_def): Renamed from
> > > > >         general_scalar_chain::mark_dual_mode_def.
> > > > >         (timode_scalar_chain::mark_dual_mode_def): Delete.
> > > > >         (scalar_chain::convert_compare): Renamed from
> > > > >         general_scalar_chain::convert_compare.
> > > > >
> > > > >         (timode_scalar_chain::compute_convert_gain): New method to
> > > > >         determine the gain from converting a TImode chain to V1TImode.
> > > > >         (timode_scalar_chain::convert_op): New method to convert an
> > > > >         operand from TImode to V1TImode.
> > > > >
> > > > >         (timode_scalar_chain::convert_insn) <case REG>: Only PUT_MODE
> > > > >         on REG_EQUAL notes that were originally TImode (not CONST_INT).
> > > > >         Handle AND, ANDN, XOR, IOR, NOT and COMPARE.
> > > > >         (timode_mem_p): Helper predicate to check where operand is
> > > > >         memory reference with sufficient alignment for TImode STV.
> > > > >         (timode_scalar_to_vector_candidate_p): Use
> > convertible_comparison_p
> > > > >         to check whether COMPARE is convertible.  Handle SET_DESTs that
> > > > >         that are REG_P or MEM_P and SET_SRCs that are REG, CONST_INT,
> > > > >         CONST_WIDE_INT, MEM, AND, ANDN, IOR, XOR or NOT.
> > > > >
> > > > > gcc/testsuite/ChangeLog
> > > > >         * gcc.target/i386/sse4_1-stv-2.c: New test case, pand.
> > > > >         * gcc.target/i386/sse4_1-stv-3.c: New test case, por.
> > > > >         * gcc.target/i386/sse4_1-stv-4.c: New test case, pxor.
> > > > >         * gcc.target/i386/sse4_1-stv-5.c: New test case, pandn.
> > > > >         * gcc.target/i386/sse4_1-stv-6.c: New test case, ptest.
> > > > >
> > > > > Roger
> > > > > --
> > > > >
> > >
> >
> >
> > --
> > H.J.
>


-- 
H.J.


More information about the Gcc-patches mailing list