This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs


On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>
> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > >
> > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > >>>> necessary even when going the STV route.  The actual regression
> > > >>>> for the testcase could also be solved by turing the smaxsi3
> > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > >>>> sched2 on x86 it seems).
> > > >>>>
> > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> > > >>>> case STV doesn't end up doing any transform?
> > > >>>
> > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > >>> the insn back to compare+cmove.
> > > >>
> > > >> OK, that would work.  But there's no way to force a jumpy sequence then
> > > >> which we know is faster than compare+cmove because later RTL
> > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > >> sequence.
> > > >>
> > > >>> However, considering the SImode move
> > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > >>> be tuned so that STV always converts smaxsi3 pattern.
> > > >>
> > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > >> sequence... (for the loop in hmmer which is the only one I see
> > > >> any effect of any of my patches).  So identifying chains that
> > > >> start/end in memory is important for cost reasons.
> > > >
> > > > Please note that the cost function also considers the cost of move
> > > > from/to xmm. So, the cost of the whole chain would disable the
> > > > transformation.
> > > >
> > > >> So I think the splitting has to happen after the last if-conversion
> > > >> pass (and thus we may need to allocate a scratch register for this
> > > >> purpose?)
> > > >
> > > > I really hope that the underlying issue will be solved by a machine
> > > > dependant pass inserted somewhere after the pre-reload split. This
> > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > would handle jcc and cmove instructions. Until then... yes your
> > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > although sometimes we would like to split to cmove instead.
> > >
> > > So the following makes STV also consider SImode chains, re-using the
> > > DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
> > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > quite off for TARGET_64BIT.  With this I get the expected conversion
> > > for the testcase derived from hmmer.
> > >
> > > No further testing sofar.
> > >
> > > Is it OK to re-use the DImode chain code this way?  I'll clean things
> > > up some more of course.
> >
> > Yes, the approach looks OK to me. It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> > > Still need help with the actual patterns for minmax and how the splitters
> > > should look like.
> >
> > Please look at the attached patch. Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
>
> Thanks.  The attached patch makes the patch cleaner and it survives
> "some" barebone testing.  It also touches the cost function to
> avoid being too overly trigger-happy.  I've also ended up using
> ix86_cost->sse_op instead of COSTS_N_INSN-based magic.  In
> particular we estimated GPR reg-reg move as COST_N_INSNS(2) while
> move costs shouldn't be wrapped in COST_N_INSNS.
> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> At least with the current code every reg-reg move biases in favor of
> SSE...
>
> And we're simply adding move and non-move costs in 'gain', somewhat
> mixing apples and oranges?  We could separate those and require
> both to be a net positive win?
>
> Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> costs as apples and oranges... (so it never triggers for Bulldozer)
>
> I now run into
>
> /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> error: unrecognizable insn:
> (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>         (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
>             (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
> -1
>      (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> during RTL pass: stv
>
> where even with -mavx2 we do not have s{min,max}v2di3.  We do have
> an expander here but it seems only AVX512F has the DImode min/max
> ops.  I have adjusted dimode_scalar_to_vector_candidate_p
> accordingly.
>
> I'm considering to rename the
> dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> functions to drop the dimode_ prefix - is that OK or do you
> prefer some other prefix?

No, please just drop the prefix.

> So - bootstrap with --with-arch=skylake in progress.
>
> It detects quite a few chains (unsurprisingly) so I guess we need
> to address compile-time issues in the pass before enabling this
> enhancement (maybe as followup?).
>
> Further comments on the actual patch welcome, I consider it
> "finished" if testing reveals no issues.  ChangeLog still needs
> to be written and testcases to be added.

I'll look at the patch later today from the x86 target PoV, maybe an
opinion of the RTL expert would also come in hand here.

Uros.

>
> Thanks,
> Richard.
>
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c     (revision 274111)
> +++ gcc/config/i386/i386-features.c     (working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>
>  /* Initialize new chain.  */
>
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
>  {
> +  smode = smode_;
> +  vmode = vmode_;
> +
>    chain_id = ++max_id;
>
>     if (dump_file)
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
>        && !HARD_REGISTER_P (SET_DEST (def_set)))
>      bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>
> +  /* ???  The following is quadratic since analyze_register_chain
> +     iterates over all refs to look for dual-mode regs.  Instead this
> +     should be done separately for all regs mentioned in the chain once.  */
>    df_ref ref;
>    df_ref def;
>    for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost (
>  {
>    gcc_assert (CONST_INT_P (exp));
>
> -  if (standard_sse_constant_p (exp, V2DImode))
> -    return COSTS_N_INSNS (1);
> -  return ix86_cost->sse_load[1];
> +  if (standard_sse_constant_p (exp, vmode))
> +    return ix86_cost->sse_op;
> +  /* We have separate costs for SImode and DImode, use SImode costs
> +     for smaller modes.  */
> +  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
>  }
>
>  /* Compute a gain for chain conversion.  */
> @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
>    if (dump_file)
>      fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>
> +  /* SSE costs distinguish between SImode and DImode loads/stores, for
> +     int costs factor in the number of GPRs involved.  When supporting
> +     smaller modes than SImode the int load/store costs need to be
> +     adjusted as well.  */
> +  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> +  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
>        rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      int igain = 0;
>
>        if (REG_P (src) && REG_P (dst))
> -       gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> +       igain += 2 * m - ix86_cost->xmm_move;
>        else if (REG_P (src) && MEM_P (dst))
> -       gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> +       igain
> +         += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
>        else if (MEM_P (src) && REG_P (dst))
> -       gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> +       igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
>        else if (GET_CODE (src) == ASHIFT
>                || GET_CODE (src) == ASHIFTRT
>                || GET_CODE (src) == LSHIFTRT)
>         {
>           if (CONST_INT_P (XEXP (src, 0)))
> -           gain -= vector_const_cost (XEXP (src, 0));
> -         gain += ix86_cost->shift_const;
> +           igain -= vector_const_cost (XEXP (src, 0));
> +         igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
>           if (INTVAL (XEXP (src, 1)) >= 32)
> -           gain -= COSTS_N_INSNS (1);
> +           igain -= COSTS_N_INSNS (1);
>         }
>        else if (GET_CODE (src) == PLUS
>                || GET_CODE (src) == MINUS
> @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
>                || GET_CODE (src) == XOR
>                || GET_CODE (src) == AND)
>         {
> -         gain += ix86_cost->add;
> +         igain += m * ix86_cost->add - ix86_cost->sse_op;
>           /* Additional gain for andnot for targets without BMI.  */
>           if (GET_CODE (XEXP (src, 0)) == NOT
>               && !TARGET_BMI)
> -           gain += 2 * ix86_cost->add;
> +           igain += m * ix86_cost->add;
>
>           if (CONST_INT_P (XEXP (src, 0)))
> -           gain -= vector_const_cost (XEXP (src, 0));
> +           igain -= vector_const_cost (XEXP (src, 0));
>           if (CONST_INT_P (XEXP (src, 1)))
> -           gain -= vector_const_cost (XEXP (src, 1));
> +           igain -= vector_const_cost (XEXP (src, 1));
>         }
>        else if (GET_CODE (src) == NEG
>                || GET_CODE (src) == NOT)
> -       gain += ix86_cost->add - COSTS_N_INSNS (1);
> +       igain += m * ix86_cost->add - ix86_cost->sse_op;
> +      else if (GET_CODE (src) == SMAX
> +              || GET_CODE (src) == SMIN
> +              || GET_CODE (src) == UMAX
> +              || GET_CODE (src) == UMIN)
> +       {
> +         /* We do not have any conditional move cost, estimate it as a
> +            reg-reg move.  Comparisons are costed as adds.  */
> +         igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> +         /* Integer SSE ops are all costed the same.  */
> +         igain -= ix86_cost->sse_op;
> +       }
>        else if (GET_CODE (src) == COMPARE)
>         {
>           /* Assume comparison cost is the same.  */
> @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
>        else if (CONST_INT_P (src))
>         {
>           if (REG_P (dst))
> -           gain += COSTS_N_INSNS (2);
> +           /* DImode can be immediate for TARGET_64BIT and SImode always.  */
> +           igain += COSTS_N_INSNS (m);
>           else if (MEM_P (dst))
> -           gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> -         gain -= vector_const_cost (src);
> +           igain += (m * ix86_cost->int_store[2]
> +                    - ix86_cost->sse_store[sse_cost_idx]);
> +         igain -= vector_const_cost (src);
>         }
>        else
>         gcc_unreachable ();
> +
> +      if (igain != 0 && dump_file)
> +       {
> +         fprintf (dump_file, "  Instruction gain %d for ", igain);
> +         dump_insn_slim (dump_file, insn);
> +       }
> +      gain += igain;
>      }
>
>    if (dump_file)
>      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
>
> +  /* ???  What about integer to SSE?  */
>    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
>      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>
> @@ -573,7 +611,7 @@ rtx
>  dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
>  {
>    if (x == reg)
> -    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> +    return gen_rtx_SUBREG (vmode, new_reg, 0);
>
>    const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
>    int i, j;
> @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
>         start_sequence ();
>         if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
>           {
> -           rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> -           emit_move_insn (adjust_address (tmp, SImode, 0),
> -                           gen_rtx_SUBREG (SImode, reg, 0));
> -           emit_move_insn (adjust_address (tmp, SImode, 4),
> -                           gen_rtx_SUBREG (SImode, reg, 4));
> +           rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> +           if (smode == DImode && !TARGET_64BIT)
> +             {
> +               emit_move_insn (adjust_address (tmp, SImode, 0),
> +                               gen_rtx_SUBREG (SImode, reg, 0));
> +               emit_move_insn (adjust_address (tmp, SImode, 4),
> +                               gen_rtx_SUBREG (SImode, reg, 4));
> +             }
> +           else
> +             emit_move_insn (tmp, reg);
>             emit_move_insn (vreg, tmp);
>           }
> -       else if (TARGET_SSE4_1)
> +       else if (!TARGET_64BIT && smode == DImode)
>           {
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                         gen_rtx_SUBREG (SImode, reg, 4),
> -                                         GEN_INT (2)));
> +           if (TARGET_SSE4_1)
> +             {
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                             gen_rtx_SUBREG (SImode, reg, 4),
> +                                             GEN_INT (2)));
> +             }
> +           else
> +             {
> +               rtx tmp = gen_reg_rtx (DImode);
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 0)));
> +               emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> +                                           CONST0_RTX (V4SImode),
> +                                           gen_rtx_SUBREG (SImode, reg, 4)));
> +               emit_insn (gen_vec_interleave_lowv4si
> +                          (gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, vreg, 0),
> +                           gen_rtx_SUBREG (V4SImode, tmp, 0)));
> +             }
>           }
>         else
> -         {
> -           rtx tmp = gen_reg_rtx (DImode);
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 0)));
> -           emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> -                                       CONST0_RTX (V4SImode),
> -                                       gen_rtx_SUBREG (SImode, reg, 4)));
> -           emit_insn (gen_vec_interleave_lowv4si
> -                      (gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, vreg, 0),
> -                       gen_rtx_SUBREG (V4SImode, tmp, 0)));
> -         }
> +         emit_move_insn (gen_lowpart (smode, vreg), reg);
>         rtx_insn *seq = get_insns ();
>         end_sequence ();
>         rtx_insn *insn = DF_REF_INSN (ref);
> @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
>    bitmap_copy (conv, insns);
>
>    if (scalar_copy)
> -    scopy = gen_reg_rtx (DImode);
> +    scopy = gen_reg_rtx (smode);
>
>    for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
>      {
> @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
>           start_sequence ();
>           if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
>             {
> -             rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> +             rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
>               emit_move_insn (tmp, reg);
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             adjust_address (tmp, SImode, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             adjust_address (tmp, SImode, 4));
> +             if (!TARGET_64BIT && smode == DImode)
> +               {
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 adjust_address (tmp, SImode, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 adjust_address (tmp, SImode, 4));
> +               }
> +             else
> +               emit_move_insn (scopy, tmp);
>             }
> -         else if (TARGET_SSE4_1)
> +         else if (!TARGET_64BIT && smode == DImode)
>             {
> -             rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 0),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> -             tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> -             emit_insn
> -               (gen_rtx_SET
> -                (gen_rtx_SUBREG (SImode, scopy, 4),
> -                 gen_rtx_VEC_SELECT (SImode,
> -                                     gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> +             if (TARGET_SSE4_1)
> +               {
> +                 rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> +                                             gen_rtvec (1, const0_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 0),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +
> +                 tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> +                 emit_insn
> +                   (gen_rtx_SET
> +                      (gen_rtx_SUBREG (SImode, scopy, 4),
> +                       gen_rtx_VEC_SELECT (SImode,
> +                                           gen_rtx_SUBREG (V4SImode, reg, 0),
> +                                           tmp)));
> +               }
> +             else
> +               {
> +                 rtx vcopy = gen_reg_rtx (V2DImode);
> +                 emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +                 emit_move_insn (vcopy,
> +                                 gen_rtx_LSHIFTRT (V2DImode,
> +                                                   vcopy, GEN_INT (32)));
> +                 emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> +                                 gen_rtx_SUBREG (SImode, vcopy, 0));
> +               }
>             }
>           else
> -           {
> -             rtx vcopy = gen_reg_rtx (V2DImode);
> -             emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -             emit_move_insn (vcopy,
> -                             gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> -             emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> -                             gen_rtx_SUBREG (SImode, vcopy, 0));
> -           }
> +           emit_move_insn (scopy, reg);
> +
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_conversion_insns (seq, insn);
> @@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op
>    if (GET_CODE (*op) == NOT)
>      {
>        convert_op (&XEXP (*op, 0), insn);
> -      PUT_MODE (*op, V2DImode);
> +      PUT_MODE (*op, vmode);
>      }
>    else if (MEM_P (*op))
>      {
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (*op));
>
>        emit_insn_before (gen_move_insn (tmp, *op), insn);
> -      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      *op = gen_rtx_SUBREG (vmode, tmp, 0);
>
>        if (dump_file)
>         fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
>             gcc_assert (!DF_REF_CHAIN (ref));
>             break;
>           }
> -      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> +      *op = gen_rtx_SUBREG (vmode, *op, 0);
>      }
>    else if (CONST_INT_P (*op))
>      {
>        rtx vec_cst;
> -      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> +      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>
>        /* Prefer all ones vector in case of -1.  */
>        if (constm1_operand (*op, GET_MODE (*op)))
> -       vec_cst = CONSTM1_RTX (V2DImode);
> +       vec_cst = CONSTM1_RTX (vmode);
>        else
> -       vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> -                                       gen_rtvec (2, *op, const0_rtx));
> +       {
> +         unsigned n = GET_MODE_NUNITS (vmode);
> +         rtx *v = XALLOCAVEC (rtx, n);
> +         v[0] = *op;
> +         for (unsigned i = 1; i < n; ++i)
> +           v[i] = const0_rtx;
> +         vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> +       }
>
> -      if (!standard_sse_constant_p (vec_cst, V2DImode))
> +      if (!standard_sse_constant_p (vec_cst, vmode))
>         {
>           start_sequence ();
> -         vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> +         vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
>           rtx_insn *seq = get_insns ();
>           end_sequence ();
>           emit_insn_before (seq, insn);
> @@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op
>    else
>      {
>        gcc_assert (SUBREG_P (*op));
> -      gcc_assert (GET_MODE (*op) == V2DImode);
> +      gcc_assert (GET_MODE (*op) == vmode);
>      }
>  }
>
> @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>      {
>        /* There are no scalar integer instructions and therefore
>          temporary register usage is required.  */
> -      rtx tmp = gen_reg_rtx (DImode);
> +      rtx tmp = gen_reg_rtx (GET_MODE (dst));
>        emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> -      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> +      dst = gen_rtx_SUBREG (vmode, tmp, 0);
>      }
>
>    switch (GET_CODE (src))
> @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case ASHIFTRT:
>      case LSHIFTRT:
>        convert_op (&XEXP (src, 0), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case PLUS:
> @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
>      case IOR:
>      case XOR:
>      case AND:
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
>        convert_op (&XEXP (src, 0), insn);
>        convert_op (&XEXP (src, 1), insn);
> -      PUT_MODE (src, V2DImode);
> +      PUT_MODE (src, vmode);
>        break;
>
>      case NEG:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> -      src = gen_rtx_MINUS (V2DImode, subreg, src);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> +      src = gen_rtx_MINUS (vmode, subreg, src);
>        break;
>
>      case NOT:
>        src = XEXP (src, 0);
>        convert_op (&src, insn);
> -      subreg = gen_reg_rtx (V2DImode);
> -      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> -      src = gen_rtx_XOR (V2DImode, src, subreg);
> +      subreg = gen_reg_rtx (vmode);
> +      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> +      src = gen_rtx_XOR (vmode, src, subreg);
>        break;
>
>      case MEM:
> @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
>        break;
>
>      case SUBREG:
> -      gcc_assert (GET_MODE (src) == V2DImode);
> +      gcc_assert (GET_MODE (src) == vmode);
>        break;
>
>      case COMPARE:
>        src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>
> -      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> -                 || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> +      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> +                 || (SUBREG_P (src) && GET_MODE (src) == vmode));
>
>        if (REG_P (src))
> -       subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> +       subreg = gen_rtx_SUBREG (vmode, src, 0);
>        else
>         subreg = copy_rtx_if_shared (src);
>        emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
>    PATTERN (insn) = def_set;
>
>    INSN_CODE (insn) = -1;
> -  recog_memoized (insn);
> +  int patt = recog_memoized (insn);
> +  if  (patt == -1)
> +    fatal_insn_not_found (insn);
>    df_insn_rescan (insn);
>  }
>
> @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
>                      (const_int 0 [0])))  */
>
>  static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    if (!TARGET_SSE4_1)
>      return false;
> @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (!SUBREG_P (op1)
>        || !SUBREG_P (op2)
> -      || GET_MODE (op1) != SImode
> -      || GET_MODE (op2) != SImode
> +      || GET_MODE (op1) != mode
> +      || GET_MODE (op2) != mode
>        || ((SUBREG_BYTE (op1) != 0
> -          || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> +          || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
>           && (SUBREG_BYTE (op2) != 0
> -             || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> +             || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
>      return false;
>
>    op1 = SUBREG_REG (op1);
> @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
>
>    if (op1 != op2
>        || !REG_P (op1)
> -      || GET_MODE (op1) != DImode)
> +      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
>      return false;
>
>    return true;
> @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
>  /* The DImode version of scalar_to_vector_candidate_p.  */
>
>  static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
>  {
>    rtx def_set = single_set (insn);
>
> @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
>    rtx dst = SET_DEST (def_set);
>
>    if (GET_CODE (src) == COMPARE)
> -    return convertible_comparison_p (insn);
> +    return convertible_comparison_p (insn, mode);
>
>    /* We are interested in DImode promotion only.  */
> -  if ((GET_MODE (src) != DImode
> +  if ((GET_MODE (src) != mode
>         && !CONST_INT_P (src))
> -      || GET_MODE (dst) != DImode)
> +      || GET_MODE (dst) != mode)
>      return false;
>
>    if (!REG_P (dst) && !MEM_P (dst))
> @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
>         return false;
>        break;
>
> +    case SMAX:
> +    case SMIN:
> +    case UMAX:
> +    case UMIN:
> +      if ((mode == DImode && !TARGET_AVX512F)
> +         || (mode == SImode && !TARGET_SSE4_1))
> +       return false;
> +      /* Fallthru.  */
> +
>      case PLUS:
>      case MINUS:
>      case IOR:
> @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>
> -      if (GET_MODE (XEXP (src, 1)) != DImode
> +      if (GET_MODE (XEXP (src, 1)) != mode
>           && !CONST_INT_P (XEXP (src, 1)))
>         return false;
>        break;
> @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
>           || !REG_P (XEXP (XEXP (src, 0), 0))))
>        return false;
>
> -  if (GET_MODE (XEXP (src, 0)) != DImode
> +  if (GET_MODE (XEXP (src, 0)) != mode
>        && !CONST_INT_P (XEXP (src, 0)))
>      return false;
>
> @@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx
>    return false;
>  }
>
> -/* Return 1 if INSN may be converted into vector
> -   instruction.  */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> -  if (TARGET_64BIT)
> -    return timode_scalar_to_vector_candidate_p (insn);
> -  else
> -    return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs scans all instruction and
> +   remove insn from CANDIDATES in case it has both convertible
> +   and not convertible definitions.
>
> -/* The DImode version of remove_non_convertible_regs.  */
> +   All insns in a bitmap are conversion candidates according to
> +   scalar_to_vector_candidate_p.  Currently it implies all insns
> +   are single_set.  */
>
>  static void
>  dimode_remove_non_convertible_regs (bitmap candidates)
> @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
>    BITMAP_FREE (regs);
>  }
>
> -/* For a given bitmap of insn UIDs scans all instruction and
> -   remove insn from CANDIDATES in case it has both convertible
> -   and not convertible definitions.
> -
> -   All insns in a bitmap are conversion candidates according to
> -   scalar_to_vector_candidate_p.  Currently it implies all insns
> -   are single_set.  */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> -  if (TARGET_64BIT)
> -    timode_remove_non_convertible_regs (candidates);
> -  else
> -    dimode_remove_non_convertible_regs (candidates);
> -}
> -
>  /* Main STV pass function.  Find and convert scalar
>     instructions into vector mode when profitable.  */
>
> @@ -1577,11 +1638,14 @@ static unsigned int
>  convert_scalars_to_vector ()
>  {
>    basic_block bb;
> -  bitmap candidates;
>    int converted_insns = 0;
>
>    bitmap_obstack_initialize (NULL);
> -  candidates = BITMAP_ALLOC (NULL);
> +  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> +  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> +  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
> +  for (unsigned i = 0; i < 3; ++i)
> +    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>
>    calculate_dominance_info (CDI_DOMINATORS);
>    df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
>      {
>        rtx_insn *insn;
>        FOR_BB_INSNS (bb, insn)
> -       if (scalar_to_vector_candidate_p (insn))
> +       if (TARGET_64BIT
> +           && timode_scalar_to_vector_candidate_p (insn))
>           {
>             if (dump_file)
> -             fprintf (dump_file, "  insn %d is marked as a candidate\n",
> +             fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
>                        INSN_UID (insn));
>
> -           bitmap_set_bit (candidates, INSN_UID (insn));
> +           bitmap_set_bit (&candidates[2], INSN_UID (insn));
> +         }
> +       else
> +         {
> +           /* Check {SI,DI}mode.  */
> +           for (unsigned i = 0; i <= 1; ++i)
> +             if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> +               {
> +                 if (dump_file)
> +                   fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
> +                            INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> +                 bitmap_set_bit (&candidates[i], INSN_UID (insn));
> +                 break;
> +               }
>           }
>      }
>
> -  remove_non_convertible_regs (candidates);
> +  if (TARGET_64BIT)
> +    timode_remove_non_convertible_regs (&candidates[2]);
> +  for (unsigned i = 0; i <= 1; ++i)
> +    dimode_remove_non_convertible_regs (&candidates[i]);
>
> -  if (bitmap_empty_p (candidates))
> -    if (dump_file)
> +  for (unsigned i = 0; i <= 2; ++i)
> +    if (!bitmap_empty_p (&candidates[i]))
> +      break;
> +    else if (i == 2 && dump_file)
>        fprintf (dump_file, "There are no candidates for optimization.\n");
>
> -  while (!bitmap_empty_p (candidates))
> -    {
> -      unsigned uid = bitmap_first_set_bit (candidates);
> -      scalar_chain *chain;
> +  for (unsigned i = 0; i <= 2; ++i)
> +    while (!bitmap_empty_p (&candidates[i]))
> +      {
> +       unsigned uid = bitmap_first_set_bit (&candidates[i]);
> +       scalar_chain *chain;
>
> -      if (TARGET_64BIT)
> -       chain = new timode_scalar_chain;
> -      else
> -       chain = new dimode_scalar_chain;
> +       if (cand_mode[i] == TImode)
> +         chain = new timode_scalar_chain;
> +       else
> +         chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]);
>
> -      /* Find instructions chain we want to convert to vector mode.
> -        Check all uses and definitions to estimate all required
> -        conversions.  */
> -      chain->build (candidates, uid);
> +       /* Find instructions chain we want to convert to vector mode.
> +          Check all uses and definitions to estimate all required
> +          conversions.  */
> +       chain->build (&candidates[i], uid);
>
> -      if (chain->compute_convert_gain () > 0)
> -       converted_insns += chain->convert ();
> -      else
> -       if (dump_file)
> -         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> -                  chain->chain_id);
> +       if (chain->compute_convert_gain () > 0)
> +         converted_insns += chain->convert ();
> +       else
> +         if (dump_file)
> +           fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> +                    chain->chain_id);
>
> -      delete chain;
> -    }
> +       delete chain;
> +      }
>
>    if (dump_file)
>      fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>
> -  BITMAP_FREE (candidates);
> +  for (unsigned i = 0; i <= 2; ++i)
> +    bitmap_release (&candidates[i]);
>    bitmap_obstack_release (NULL);
>    df_process_deferred_rescans ();
>
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h     (revision 274111)
> +++ gcc/config/i386/i386-features.h     (working copy)
> @@ -127,11 +127,16 @@ namespace {
>  class scalar_chain
>  {
>   public:
> -  scalar_chain ();
> +  scalar_chain (enum machine_mode, enum machine_mode);
>    virtual ~scalar_chain ();
>
>    static unsigned max_id;
>
> +  /* Scalar mode.  */
> +  enum machine_mode smode;
> +  /* Vector mode.  */
> +  enum machine_mode vmode;
> +
>    /* ID of a chain.  */
>    unsigned int chain_id;
>    /* A queue of instructions to be included into a chain.  */
> @@ -162,6 +167,8 @@ class scalar_chain
>  class dimode_scalar_chain : public scalar_chain
>  {
>   public:
> +  dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> +    : scalar_chain (smode_, vmode_) {}
>    int compute_convert_gain ();
>   private:
>    void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
>  class timode_scalar_chain : public scalar_chain
>  {
>   public:
> +  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
>    /* Convert from TImode to V1TImode is always faster.  */
>    int compute_convert_gain () { return 1; }
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md     (revision 274111)
> +++ gcc/config/i386/i386.md     (working copy)
> @@ -17721,6 +17721,27 @@ (define_peephole2
>      std::swap (operands[4], operands[5]);
>  })
>
> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> +  [(set (match_operand:SWI48 0 "register_operand")
> +       (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> +                      (match_operand:SWI48 2 "register_operand")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_STV && TARGET_SSE4_1
> +   && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCGC FLAGS_REG)
> +       (compare:CCGC (match_dup 1)(match_dup 2)))
> +   (set (match_dup 0)
> +       (if_then_else:SWI48
> +         (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> +         (match_dup 1)
> +         (match_dup 2)))])
> +
>  ;; Conditional addition patterns
>  (define_expand "add<mode>cc"
>    [(match_operand:SWI 0 "register_operand")


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]