This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [RFC PATCH, i386]: Improve STV pass by correcting the cost of moves to/from XMM reg


On Tue, 27 Aug 2019, Uros Bizjak wrote:

> On Tue, Aug 27, 2019 at 9:55 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> > > > > > > > > This is currently a heads-up patch that removes the minimum limit
> > > > > > > > > on the cost of moves to/from XMM regs. The immediate benefit is
> > > > > > > > > the removal of mismatched spills caused by subreg usage.
> > > > > > > > >
> > > > > > > > > *If* the patch proves to be beneficial (as in "doesn't regress
> > > > > > > > > important benchmarks"), then we should be able to un-hide the
> > > > > > > > > inter-regset moves from RA and allow it to collapse some moves. As an
> > > > > > > > > example, the patched compiler removes a movd in gcc.target/i386/minmax-6.c
> > > > > > > > > and still avoids mismatched spill.
> > > > > > > > >
> > > > > > > > > 2019-08-23  Uroš Bizjak  <ubizjak@gmail.com>
> > > > > > > > >
> > > > > > > > >     * config/i386/i386.c (ix86_register_move_cost): Do not
> > > > > > > > >     limit the cost of moves to/from XMM register to minimum 8.
> > > > > > > > >     * config/i386/i386-features.c
> > > > > > > > >     (general_scalar_chain::make_vector_copies): Do not generate
> > > > > > > > >     zeroing move from GPR to XMM register, use gen_move_insn
> > > > > > > > >     instead of gen_gpr_to_xmm_move_src.
> > > > > > > > >     (general_scalar_chain::convert_op): Ditto.
> > > > > > > > >     (gen_gpr_to_xmm_move_src): Remove.
> > > > > > > > >
> > > > > > > > > The patch was bootstrapped and regression tested on x86_64-linux-gnu
> > > > > > > > > {,-m32}, configured with and without --with-arch=ivybridge.
> > > > > > > > >
> > > > > > > > > The patch regresses the PR80481 scan-asm-not test (where the compiler
> > > > > > > > > generates an unrelated XMM spill on register-starved x86_32). However,
> > > > > > > > > during the analysis, I found that the original issue is not fixed and
> > > > > > > > > is still visible without -funroll-loops [1].
> > > > > > > > >
> > > > > > > > > [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80481#c10
> > > > > > > > >
> > > > > > > > > So, I'd wait for HJ's benchmark results for the to/from-XMM cost
> > > > > > > > > change before proceeding with the patch.
> > > > > > > >
> > > > > > > > Otherwise this looks good to me; it might clash (whitespace-wise)
> > > > > > > > with my STV "rewrite" though.
> > > > > > > >
> > > > > > > > We might want to adjust ix86_register_move_cost separately from
> > > > > > > > the STV change to use regular moves, though?  Just to make bisection
> > > > > > > > point to either of the two - the STV "fallout" is probably minor
> > > > > > > > compared to the fallout elsewhere...
> > > > > > >
> > > > > > > Yes, this is also my plan.
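
(For reference, the clamp in question sits in ix86_register_move_cost
and looks roughly like this -- a sketch only, the exact cost-table
field names in i386.c may differ:

  /* Moves between SSE and integer units; before the patch the cost
     was clamped to a minimum of 8 (field names approximate).  */
  if (SSE_CLASS_P (class1) != SSE_CLASS_P (class2))
    return MAX (8, SSE_CLASS_P (class1)
                   ? ix86_cost->sse_to_integer
                   : ix86_cost->integer_to_sse);

The patch drops the MAX (8, ...) wrapper and returns the table cost
directly.)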
> > > > > >
> > > > > > Btw, when testing w/ the costing disabled I ran into
> > > > > >
> > > > > > (insn 31 7 8 2 (set (subreg:V4SI (reg:SI 107) 0)
> > > > > >         (vec_merge:V4SI (vec_duplicate:V4SI (mem/c:SI (const:DI (plus:DI (symbol_ref:DI ("peakbuf") [flags 0x2]  <var_decl 0x7efe8dbfeab0 peakbuf>)
> > > > > >                             (const_int 400020000 [0x17d7d220]))) [1 peakbuf.PeaksInBuf+0 S4 A64]))
> > > > > >             (const_vector:V4SI [
> > > > > >                     (const_int 0 [0]) repeated x4
> > > > > >                 ])
> > > > > >             (const_int 1 [0x1])))
> > > > > >
> > > > > > not being recognized (for the large immediate, I guess).  So when
> > > > > > just doing
> > > > >
> > > > > Hard to say without a testcase, but I don't think this is a valid
> > > > > memory address. Can you see in the dumps which insn is getting
> > > > > converted here?
> > > >
> > > > Ah, we can use movabsdi in this case.
> > >
> > > Yes, this is simply STVing such a single load... (quite pointless
> > > of course).
> > >
> > > > > >   (set (reg:SI 107) (mem:SI ...))
> > > > > >
> > > > > > here we expect IRA to be able to allocate 107 into xmm, realizing
> > > > > > it needs to reload the RHS first?
> > > > > >
> > > > > > For current code, is testing x86_64_general_operand like in the
> > > > > > following the correct thing to do?
> > > > >
> > > > > x86_64_general_operand will only limit immediates using
> > > > > x86_64_immediate_operand; it will allow all memory_operands.
> > > >
> > > > Yes, I guess it is OK; an invalid embedded address can be loaded using movabsq.
> > >
> > > OK, combining this with your patch for extended testing now.
> >
> > Please note that this new code will be removed by the above patch.  The
> > failing pattern will be replaced with a simple move to a SUBREG, where
> > the RA is able to correctly reload the operands.
> 
> BTW: A testcase would come in handy here to check whether the above
> assumption really stands...

int arr[1L<<31];
int max;

void foo ()
{
  max = arr[1L<<30 + 1] < arr[1L<<30 + 2] ? arr[1L<<30 + 2] : arr[1L<<30 + 1];
}

which has two loads (note that + binds tighter than <<, so the indices
are 1L<<31 and 1L<<32, giving the byte offsets 0x200000000 and
0x400000000 seen below):

(insn 11 10 12 2 (set (reg:SI 92 [ arr+8589934592 ])
        (mem:SI (const:DI (plus:DI (symbol_ref:DI ("arr") [flags 0x2]  <var_decl 0x7f35f6a5cab0 arr>)
                    (const_int 8589934592 [0x200000000]))) [1 arr+8589934592 S4 A128])) "t.c":6:61 76 {*movabssi_2}
     (nil))
(insn 12 11 13 2 (parallel [
            (set (reg:SI 91)
                (smax:SI (reg:SI 92 [ arr+8589934592 ])
                    (mem:SI (reg/f:DI 90) [1 arr+17179869184 S4 A128])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":6:61 973 {*smaxsi3_1}


in the chain which are then converted via convert_op as follows with
your suggested patch:

(insn 16 10 11 2 (set (reg:SI 93)
        (mem:SI (const:DI (plus:DI (symbol_ref:DI ("arr") [flags 0x2]  <var_decl 0x7f35f6a5cab0 arr>)
                    (const_int 8589934592 [0x200000000]))) [1 arr+8589934592 S4 A128])) "t.c":6:61 -1
     (nil))
(insn 11 16 17 2 (set (subreg:V4SI (reg:SI 92 [ arr+8589934592 ]) 0)
        (subreg:V4SI (reg:SI 93) 0)) "t.c":6:61 1250 {movv4si_internal}
     (nil))
(insn 17 11 12 2 (set (reg:SI 94)
        (mem:SI (reg/f:DI 90) [1 arr+17179869184 S4 A128])) "t.c":6:61 -1
     (nil))
(insn 12 17 13 2 (set (subreg:V4SI (reg:SI 91) 0)
        (smax:V4SI (subreg:V4SI (reg:SI 92 [ arr+8589934592 ]) 0)
            (subreg:V4SI (reg:SI 94) 0))) "t.c":6:61 3657 {*sse4_1_smaxv4si3}

and allocated without spilling.  Note the insn 11 conversion
is likely a "bug" in my earlier patch since convert_insn has

  switch (GET_CODE (src))
    {
...
    case MEM:
      /* For a plain load into a bare REG leave the memory operand
         alone; convert_op is only applied when the destination is
         not a plain register.  */
      if (!REG_P (dst))
        convert_op (&src, insn);
      break;

supposedly leaving plain loads as-is but now we have already
replaced dst with a subreg and the above doesn't match anymore.
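
That is, the MEM case would have to become something like this (a
sketch; the SUBREG_P test is the adjustment, not what is in the tree):

    case MEM:
      if (!REG_P (dst) && !SUBREG_P (dst))
        convert_op (&src, insn);
      break;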
But with that adjusted guard we'd end up with

t.c: In function ‘foo’:
t.c:7:1: error: unrecognizable insn:
    7 | }
      | ^
(insn 11 10 12 2 (set (subreg:V4SI (reg:SI 92 [ arr+8589934592 ]) 0)
        (mem:SI (const:DI (plus:DI (symbol_ref:DI ("arr") [flags 0x2]  <var_decl 0x7f5ca8513ab0 arr>)
                    (const_int 8589934592 [0x200000000]))) [1 arr+8589934592 S4 A128])) "t.c":6:61 -1
     (nil))

So eventually it's broken with earlier GCC already.  OTOH, with -m32
the large addresses cannot happen, and the TImode chains are likely
too simplistic to matter; or rather, we handle the large offsets just
fine:

(insn 11 10 12 2 (set (reg:V1TI 89 [ arr+68719476736 ])
        (mem:V1TI (reg/f:DI 88) [1 arr+68719476736 S16 A128])) "t.c":5:19 1214 {movv1ti_internal}
     (expr_list:REG_DEAD (reg/f:DI 88)
        (nil)))

where we split out the address computation completely for some reason.

So we do seem to need the extra copy anyway for a move to a
subreg:V4SI.
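
That is, schematically (paraphrasing insns 16 and 11 from the dump
above; "tmp" and "chain" are stand-ins for the fresh and the chain
pseudo):

  (set (reg:SI tmp) (mem:SI <large-offset address>))  ;; plain load, movabs-able
  (set (subreg:V4SI (reg:SI chain) 0)
       (subreg:V4SI (reg:SI tmp) 0))                  ;; vector move the RA can reload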

Richard.
