This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs


On Fri, 9 Aug 2019, Richard Biener wrote:

> On Fri, 9 Aug 2019, Uros Bizjak wrote:
> 
> > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> > 
> > > > > > > > (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")])
> > > > > > > >
> > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > >
> > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > >
> > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > to force use of %zmmN?
> > > > >
> > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > >
> > > >     case SMAX:
> > > >     case SMIN:
> > > >     case UMAX:
> > > >     case UMIN:
> > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > >         return false;
> > > >
> > > > so there's no way to use AVX512VL for 32bit?
> > >
> > > There is a way, but on 32bit targets, we need to split DImode
> > > operation to a sequence of SImode operations for unconverted pattern.
> > > This is of course doable, but somewhat more complex than simply
> > > emitting a DImode compare + DImode cmove, which is what current
> > > splitter does. So, a follow-up task.
> > 
> > Please find attached the complete .md part that enables SImode for
> > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32bit and 64bit
> > targets. The patterns also allow a memory operand 2, so STV has a
> > chance to create the vector pattern with implicit load. In case STV
> > fails, the memory operand 2 is loaded to the register first;  operand
> > 2 is used in compare and cmove instruction, so pre-loading of the
> > operand should be beneficial.
> 
> Thanks.
> 
> > Also note that splitting should happen rarely. Due to the cost
> > function, STV should effectively always convert minmax to a vector
> > insn.
> 
> I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> this kind of "simple" conversion:
> 
>   5.50 │1d0:   test   %esi,%es
>   0.07 │       mov    $0x0,%ex
>        │       cmovs  %eax,%es
>   5.84 │       imul   %r8d,%es
> 
> to
> 
>   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
>   0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
>  40.45 │       vmovd  %xmm0,%eax
>   2.45 │       imul   %r8d,%eax
> 
> which looks like an RA artifact in the end.  We spill %esi only
> with -mstv here as STV introduces a (subreg:V4SI ...) use
> of a pseudo ultimately set from di.  STV creates an additional
> pseudo for this (copy-in) but it places that copy next to the
> original def rather than next to the start of the chain it
> converts, which is probably why we spill.  And this
> is because it inserts those at each definition of the pseudo
> rather than just at the reaching definition(s) or at the
> uses of the pseudo in the chain (that is because there may be
> defs of that pseudo in the chain itself).  Note that STV emits
> such "conversion" copies as simple reg-reg moves:
> 
> (insn 1094 3 4 2 (set (reg:SI 777)
>         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
>      (nil))
> 
> but those do not survive very long (this one gets removed by CSE2).
> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> and computes
> 
>     r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
>     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> 
> so I wonder if STV shouldn't instead emit gpr->xmm moves
> here (but I guess nothing again prevents RTL optimizers from
> combining that with the single-use in the max instruction...).
> 
> So this boils down to STV splitting live-ranges but other
> passes undoing that and then RA not considering splitting
> live-ranges here, arriving at a suboptimal allocation.
> 
> A testcase showing this issue is (simplified from 464.h264ref
> UMVLine16Y_11):
> 
> unsigned short
> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> {
>   if (y != width)
>     {
>       y = y < 0 ? 0 : y;
>       return Pic[y * width];
>     }
>   return Pic[y];
> }
> 
> where the condition and the Pic[y] load mimic the other use of y.
> Different, even worse spilling is generated by
> 
> unsigned short
> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> {
>   y = y < 0 ? 0 : y;
>   return Pic[y * width] + y;
> }
> 
> I guess this all shows that STV's "trick" of simply wrapping
> integer mode pseudos in (subreg:vector-mode ...) is bad?
> 
> I've added a (failing) testcase to reflect the above.
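
To check what the simplified testcase computes independently of any
code generation question, a minimal sketch in plain C follows.  The
first function is the testcase quoted above, unchanged.  The second
function, umv_reference, and any inputs used to call it are assumptions
introduced only for illustration; they are not part of the patch or of
the added testcase.

```c
/* Testcase from the message: clamp y to be non-negative (a signed max
   with zero, which STV may turn into a vector max), then index Pic.  */
unsigned short
UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
{
  if (y != width)
    {
      y = y < 0 ? 0 : y;
      return Pic[y * width];
    }
  return Pic[y];
}

/* Hypothetical reference: the same computation with the max spelled
   out as a branch, so the two can be compared on sample inputs.  */
unsigned short
umv_reference (const unsigned short *pic, int y, int width)
{
  if (y == width)
    return pic[y];
  if (y < 0)
    y = 0;
  return pic[y * width];
}
```

Calling both functions on the same small array with a negative y, a
positive y and y == width should give identical results whether or not
-mstv is in effect; only the generated code is expected to differ.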

Experimenting a bit with using V4SImode pseudos just for the
conversion insns, we end up preserving those moves (but I
do have to use a lowpart set, using reg:V4SI = subreg:V4SI SImode-reg
ends up using movv4si_internal which only leaves us with
memory for the SImode operand) _plus_ moving the move next
to the actual use has an effect.  Not necessarily a good one
though:

        vpxor   %xmm0, %xmm0, %xmm0
        vmovaps %xmm0, -16(%rsp)
        movl    %esi, -16(%rsp)
        vpmaxsd -16(%rsp), %xmm0, %xmm0
        vmovd   %xmm0, %eax

eh?  I guess the lowpart set is not good (my patch has this
as well, but I got saved by never having vector modes to subset...).
Using

    (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) "t3.c":5:10 -1

for the move ends up with

        vpxor   %xmm1, %xmm1, %xmm1
        vpinsrd $0, %esi, %xmm1, %xmm0

eh?  LRA chooses the correct alternative here but somehow
postreload CSE CSEs the zero with the xmm1 clearing, leading
to the vpinsrd...  (I guess a general issue, not sure if really
worse - definitely a larger instruction).  Unfortunately
postreload-cse doesn't add a reg-equal note.  This happens only
when emitting the reg move before the use; not doing that emits
a vmovd as expected.

At least the spilling is gone here.
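
The data flow of the vec_merge form above can be modelled in plain C,
as a rough sketch of the semantics only.  Everything here is an
assumption made for illustration: v4si_model stands in for a V4SI
register, and the model_* functions mimic the copy-in, the vector max
and the move back to a GPR.  None of these names exist in GCC.

```c
#include <stdint.h>

/* Plain-C stand-in for a V4SI value: four 32-bit signed lanes.  */
typedef struct { int32_t lane[4]; } v4si_model;

/* Models (vec_merge (vec_duplicate x) (const_vector 0) (const_int 1)):
   mask bit 0 selects lane 0 from the broadcast, all other lanes take
   the zero vector.  */
v4si_model
model_scalar_to_vec (int32_t x)
{
  v4si_model v = { { x, 0, 0, 0 } };
  return v;
}

/* Models a lane-wise signed maximum, as vpmaxsd computes.  */
v4si_model
model_smax (v4si_model a, v4si_model b)
{
  v4si_model r;
  for (int i = 0; i < 4; ++i)
    r.lane[i] = a.lane[i] > b.lane[i] ? a.lane[i] : b.lane[i];
  return r;
}

/* Models the vmovd back to a GPR: only lane 0 is observed.  */
int32_t
model_vec_to_scalar (v4si_model v)
{
  return v.lane[0];
}

/* y < 0 ? 0 : y computed through the modelled vector path.  */
int32_t
model_clamp_nonneg (int32_t y)
{
  v4si_model zero = { { 0, 0, 0, 0 } };
  return model_vec_to_scalar (model_smax (model_scalar_to_vec (y), zero));
}
```

Since only lane 0 is read back, the contents of the upper lanes do not
affect the scalar result; zeroing them just keeps the vector value well
defined, unlike the lowpart set tried first.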

I am re-testing as follows, the main change is that
general_scalar_chain::make_vector_copies now generates a
vector pseudo as destination (and I've fixed up the code
to not generate (subreg:V4SI (reg:V4SI 1234) 0)).

Hope this fixes the observed slowdowns (it fixes the new testcase).

Richard.

mccas.F:twotff_ for 416.gamess
refbuf.c:UMVLine16Y_11 for 464.h264ref

2019-08-07  Richard Biener  <rguenther@suse.de>

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode
	chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
	gain if not zero.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	Elide the subreg if the reg is already vector.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.  Use a vector-mode pseudo as
	destination.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.  Elide the
	subreg if the reg is already vector.
	(general_scalar_chain::convert_insn): Likewise.  Add
	fatal_insn_not_found if the result is not recognized.
	(convertible_comparison_p): Pass in the scalar mode and use that.
	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
	(scalar_to_vector_candidate_p): Remove by inlining into single
	caller.
	(general_remove_non_convertible_regs): Rename from
	dimode_remove_non_convertible_regs.
	(remove_non_convertible_regs): Remove by inlining into single caller.
	(convert_scalars_to_vector): Handle SImode and DImode chains
	in addition to TImode chains.
	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.

	* gcc.target/i386/pr91154.c: New testcase.
	* gcc.target/i386/minmax-3.c: Likewise.
	* gcc.target/i386/minmax-4.c: Likewise.
	* gcc.target/i386/minmax-5.c: Likewise.
	* gcc.target/i386/minmax-6.c: Likewise.
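
For orientation, the per-instruction gain that compute_convert_gain
assigns to {S,U}{MIN,MAX} in the patch below can be sketched as a
standalone C function.  This is only a model: struct model_costs and
the model_* functions are hypothetical names, and any cost values fed
to them are made up rather than taken from a real ix86_cost table.
COSTS_N_INSNS is reproduced with the definition GCC's rtl.h uses.

```c
/* GCC's rtl.h defines COSTS_N_INSNS (N) as N * 4; reproduced here so
   the sketch is self-contained.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical cost table holding only the two fields used below.  */
struct model_costs
{
  int add;     /* scalar add, also used for the compare */
  int sse_op;  /* any integer SSE operation */
};

/* Number of GPR operations one scalar operation expands to: a DImode
   operation needs two on a 32-bit target, everything else one.  */
int
model_gpr_factor (int is_dimode, int target_64bit)
{
  return is_dimode ? (target_64bit ? 1 : 2) : 1;
}

/* Gain for one {S,U}{MIN,MAX}: the scalar side is a compare (costed as
   an add) plus a conditional move (estimated as a reg-reg move), the
   vector side is a single SSE op.  */
int
model_minmax_gain (const struct model_costs *c, int is_dimode,
		   int target_64bit)
{
  int m = model_gpr_factor (is_dimode, target_64bit);
  return m * (COSTS_N_INSNS (2) + c->add) - c->sse_op;
}
```

With add and sse_op both set to COSTS_N_INSNS (1), an assumed value,
the model gives a gain of 8 for SImode and 20 for DImode on a 32-bit
target, which is why the conversion is expected to be chosen almost
always for min/max.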

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c	(revision 274111)
+++ gcc/config/i386/i386-features.c	(working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
 
 /* Initialize new chain.  */
 
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
 {
+  smode = smode_;
+  vmode = vmode_;
+
   chain_id = ++max_id;
 
    if (dump_file)
@@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins
    conversion.  */
 
 void
-dimode_scalar_chain::mark_dual_mode_def (df_ref def)
+general_scalar_chain::mark_dual_mode_def (df_ref def)
 {
   gcc_assert (DF_REF_REG_DEF_P (def));
 
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
       && !HARD_REGISTER_P (SET_DEST (def_set)))
     bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
 
+  /* ???  The following is quadratic since analyze_register_chain
+     iterates over all refs to look for dual-mode regs.  Instead this
+     should be done separately for all regs mentioned in the chain once.  */
   df_ref ref;
   df_ref def;
   for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates,
    instead of using a scalar one.  */
 
 int
-dimode_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp)
 {
   gcc_assert (CONST_INT_P (exp));
 
-  if (standard_sse_constant_p (exp, V2DImode))
-    return COSTS_N_INSNS (1);
-  return ix86_cost->sse_load[1];
+  if (standard_sse_constant_p (exp, vmode))
+    return ix86_cost->sse_op;
+  /* We have separate costs for SImode and DImode, use SImode costs
+     for smaller modes.  */
+  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
 }
 
 /* Compute a gain for chain conversion.  */
 
 int
-dimode_scalar_chain::compute_convert_gain ()
+general_scalar_chain::compute_convert_gain ()
 {
   bitmap_iterator bi;
   unsigned insn_uid;
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
   if (dump_file)
     fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
 
+  /* SSE costs distinguish between SImode and DImode loads/stores, for
+     int costs factor in the number of GPRs involved.  When supporting
+     smaller modes than SImode the int load/store costs need to be
+     adjusted as well.  */
+  unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+	igain += 2 * m - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	igain
+	  += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
 	{
     	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
-	  gain += ix86_cost->shift_const;
+	    igain -= vector_const_cost (XEXP (src, 0));
+	  igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
 	  if (INTVAL (XEXP (src, 1)) >= 32)
-	    gain -= COSTS_N_INSNS (1);
+	    igain -= COSTS_N_INSNS (1);
 	}
       else if (GET_CODE (src) == PLUS
 	       || GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
 	       || GET_CODE (src) == XOR
 	       || GET_CODE (src) == AND)
 	{
-	  gain += ix86_cost->add;
+	  igain += m * ix86_cost->add - ix86_cost->sse_op;
 	  /* Additional gain for andnot for targets without BMI.  */
 	  if (GET_CODE (XEXP (src, 0)) == NOT
 	      && !TARGET_BMI)
-	    gain += 2 * ix86_cost->add;
+	    igain += m * ix86_cost->add;
 
 	  if (CONST_INT_P (XEXP (src, 0)))
-	    gain -= vector_const_cost (XEXP (src, 0));
+	    igain -= vector_const_cost (XEXP (src, 0));
 	  if (CONST_INT_P (XEXP (src, 1)))
-	    gain -= vector_const_cost (XEXP (src, 1));
+	    igain -= vector_const_cost (XEXP (src, 1));
 	}
       else if (GET_CODE (src) == NEG
 	       || GET_CODE (src) == NOT)
-	gain += ix86_cost->add - COSTS_N_INSNS (1);
+	igain += m * ix86_cost->add - ix86_cost->sse_op;
+      else if (GET_CODE (src) == SMAX
+	       || GET_CODE (src) == SMIN
+	       || GET_CODE (src) == UMAX
+	       || GET_CODE (src) == UMIN)
+	{
+	  /* We do not have any conditional move cost, estimate it as a
+	     reg-reg move.  Comparisons are costed as adds.  */
+	  igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+	  /* Integer SSE ops are all costed the same.  */
+	  igain -= ix86_cost->sse_op;
+	}
       else if (GET_CODE (src) == COMPARE)
 	{
 	  /* Assume comparison cost is the same.  */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
       else if (CONST_INT_P (src))
 	{
 	  if (REG_P (dst))
-	    gain += COSTS_N_INSNS (2);
+	    /* DImode can be immediate for TARGET_64BIT and SImode always.  */
+	    igain += COSTS_N_INSNS (m);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
-	  gain -= vector_const_cost (src);
+	    igain += (m * ix86_cost->int_store[2]
+		     - ix86_cost->sse_store[sse_cost_idx]);
+	  igain -= vector_const_cost (src);
 	}
       else
 	gcc_unreachable ();
+
+      if (igain != 0 && dump_file)
+	{
+	  fprintf (dump_file, "  Instruction gain %d for ", igain);
+	  dump_insn_slim (dump_file, insn);
+	}
+      gain += igain;
     }
 
   if (dump_file)
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
+  /* ???  What about integer to SSE?  */
   EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
     cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
 
@@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai
 /* Replace REG in X with a V2DI subreg of NEW_REG.  */
 
 rtx
-dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
+general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
 {
   if (x == reg)
-    return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+    return (GET_MODE (new_reg) == vmode
+	    ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0));
 
   const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
   int i, j;
@@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg
 /* Replace REG in INSN with a V2DI subreg of NEW_REG.  */
 
 void
-dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
+general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn,
 						  rtx reg, rtx new_reg)
 {
   replace_with_subreg (single_set (insn), reg, new_reg);
@@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx
    and replace its uses in a chain.  */
 
 void
-dimode_scalar_chain::make_vector_copies (unsigned regno)
+general_scalar_chain::make_vector_copies (unsigned regno)
 {
   rtx reg = regno_reg_rtx[regno];
-  rtx vreg = gen_reg_rtx (DImode);
+  rtx vreg = gen_reg_rtx (vmode);
   df_ref ref;
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
@@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies
 	start_sequence ();
 	if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
 	  {
-	    rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
-	    emit_move_insn (adjust_address (tmp, SImode, 0),
-			    gen_rtx_SUBREG (SImode, reg, 0));
-	    emit_move_insn (adjust_address (tmp, SImode, 4),
-			    gen_rtx_SUBREG (SImode, reg, 4));
-	    emit_move_insn (vreg, tmp);
+	    rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+	    if (smode == DImode && !TARGET_64BIT)
+	      {
+		emit_move_insn (adjust_address (tmp, SImode, 0),
+				gen_rtx_SUBREG (SImode, reg, 0));
+		emit_move_insn (adjust_address (tmp, SImode, 4),
+				gen_rtx_SUBREG (SImode, reg, 4));
+	      }
+	    else
+	      emit_move_insn (tmp, reg);
+	    emit_move_insn (vreg,
+			    gen_rtx_VEC_MERGE (vmode,
+					       gen_rtx_VEC_DUPLICATE (vmode,
+								      tmp),
+					       CONST0_RTX (vmode),
+					       GEN_INT (HOST_WIDE_INT_1U)));
+
 	  }
-	else if (TARGET_SSE4_1)
+	else if (!TARGET_64BIT && smode == DImode)
 	  {
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (V4SImode, vreg, 0),
-					  gen_rtx_SUBREG (SImode, reg, 4),
-					  GEN_INT (2)));
+	    if (TARGET_SSE4_1)
+	      {
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (V4SImode, vreg, 0),
+					      gen_rtx_SUBREG (SImode, reg, 4),
+					      GEN_INT (2)));
+	      }
+	    else
+	      {
+		rtx tmp = gen_reg_rtx (DImode);
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 0)));
+		emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+					    CONST0_RTX (V4SImode),
+					    gen_rtx_SUBREG (SImode, reg, 4)));
+		emit_insn (gen_vec_interleave_lowv4si
+			   (gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, vreg, 0),
+			    gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	      }
 	  }
 	else
 	  {
-	    rtx tmp = gen_reg_rtx (DImode);
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 0)));
-	    emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
-					CONST0_RTX (V4SImode),
-					gen_rtx_SUBREG (SImode, reg, 4)));
-	    emit_insn (gen_vec_interleave_lowv4si
-		       (gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, vreg, 0),
-			gen_rtx_SUBREG (V4SImode, tmp, 0)));
+	    emit_move_insn (vreg,
+			    gen_rtx_VEC_MERGE (vmode,
+					       gen_rtx_VEC_DUPLICATE (vmode,
+								      reg),
+					       CONST0_RTX (vmode),
+					       GEN_INT (HOST_WIDE_INT_1U)));
 	  }
 	rtx_insn *seq = get_insns ();
 	end_sequence ();
@@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies
    in case register is used in not convertible insn.  */
 
 void
-dimode_scalar_chain::convert_reg (unsigned regno)
+general_scalar_chain::convert_reg (unsigned regno)
 {
   bool scalar_copy = bitmap_bit_p (defs_conv, regno);
   rtx reg = regno_reg_rtx[regno];
@@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign
   bitmap_copy (conv, insns);
 
   if (scalar_copy)
-    scopy = gen_reg_rtx (DImode);
+    scopy = gen_reg_rtx (smode);
 
   for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
     {
@@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign
 	  start_sequence ();
 	  if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
 	    {
-	      rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+	      rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
 	      emit_move_insn (tmp, reg);
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      adjust_address (tmp, SImode, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      adjust_address (tmp, SImode, 4));
+	      if (!TARGET_64BIT && smode == DImode)
+		{
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  adjust_address (tmp, SImode, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  adjust_address (tmp, SImode, 4));
+		}
+	      else
+		emit_move_insn (scopy, tmp);
 	    }
-	  else if (TARGET_SSE4_1)
+	  else if (!TARGET_64BIT && smode == DImode)
 	    {
-	      rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 0),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
-	      tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
-	      emit_insn
-		(gen_rtx_SET
-		 (gen_rtx_SUBREG (SImode, scopy, 4),
-		  gen_rtx_VEC_SELECT (SImode,
-				      gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+	      if (TARGET_SSE4_1)
+		{
+		  rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+					      gen_rtvec (1, const0_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 0),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+
+		  tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+		  emit_insn
+		    (gen_rtx_SET
+		       (gen_rtx_SUBREG (SImode, scopy, 4),
+			gen_rtx_VEC_SELECT (SImode,
+					    gen_rtx_SUBREG (V4SImode, reg, 0),
+					    tmp)));
+		}
+	      else
+		{
+		  rtx vcopy = gen_reg_rtx (V2DImode);
+		  emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		  emit_move_insn (vcopy,
+				  gen_rtx_LSHIFTRT (V2DImode,
+						    vcopy, GEN_INT (32)));
+		  emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+				  gen_rtx_SUBREG (SImode, vcopy, 0));
+		}
 	    }
 	  else
-	    {
-	      rtx vcopy = gen_reg_rtx (V2DImode);
-	      emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	      emit_move_insn (vcopy,
-			      gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
-	      emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
-			      gen_rtx_SUBREG (SImode, vcopy, 0));
-	    }
+	    emit_move_insn (scopy, reg);
+
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_conversion_insns (seq, insn);
@@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign
    registers conversion.  */
 
 void
-dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
+general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
 {
   *op = copy_rtx_if_shared (*op);
 
   if (GET_CODE (*op) == NOT)
     {
       convert_op (&XEXP (*op, 0), insn);
-      PUT_MODE (*op, V2DImode);
+      PUT_MODE (*op, vmode);
     }
   else if (MEM_P (*op))
     {
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (*op));
 
       emit_insn_before (gen_move_insn (tmp, *op), insn);
-      *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      *op = gen_rtx_SUBREG (vmode, tmp, 0);
 
       if (dump_file)
 	fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
@@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op
 	    gcc_assert (!DF_REF_CHAIN (ref));
 	    break;
 	  }
-      *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+      if (GET_MODE (*op) != vmode)
+	*op = gen_rtx_SUBREG (vmode, *op, 0);
     }
   else if (CONST_INT_P (*op))
     {
       rtx vec_cst;
-      rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+      rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
 
       /* Prefer all ones vector in case of -1.  */
       if (constm1_operand (*op, GET_MODE (*op)))
-	vec_cst = CONSTM1_RTX (V2DImode);
+	vec_cst = CONSTM1_RTX (vmode);
       else
-	vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
-					gen_rtvec (2, *op, const0_rtx));
+	{
+	  unsigned n = GET_MODE_NUNITS (vmode);
+	  rtx *v = XALLOCAVEC (rtx, n);
+	  v[0] = *op;
+	  for (unsigned i = 1; i < n; ++i)
+	    v[i] = const0_rtx;
+	  vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+	}
 
-      if (!standard_sse_constant_p (vec_cst, V2DImode))
+      if (!standard_sse_constant_p (vec_cst, vmode))
 	{
 	  start_sequence ();
-	  vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+	  vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
 	  rtx_insn *seq = get_insns ();
 	  end_sequence ();
 	  emit_insn_before (seq, insn);
@@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op
   else
     {
       gcc_assert (SUBREG_P (*op));
-      gcc_assert (GET_MODE (*op) == V2DImode);
+      gcc_assert (GET_MODE (*op) == vmode);
     }
 }
 
 /* Convert INSN to vector mode.  */
 
 void
-dimode_scalar_chain::convert_insn (rtx_insn *insn)
+general_scalar_chain::convert_insn (rtx_insn *insn)
 {
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
@@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i
     {
       /* There are no scalar integer instructions and therefore
 	 temporary register usage is required.  */
-      rtx tmp = gen_reg_rtx (DImode);
+      rtx tmp = gen_reg_rtx (GET_MODE (dst));
       emit_conversion_insns (gen_move_insn (dst, tmp), insn);
-      dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+      dst = gen_rtx_SUBREG (vmode, tmp, 0);
     }
 
   switch (GET_CODE (src))
@@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i
     case ASHIFTRT:
     case LSHIFTRT:
       convert_op (&XEXP (src, 0), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case PLUS:
@@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
-      PUT_MODE (src, V2DImode);
+      PUT_MODE (src, vmode);
       break;
 
     case NEG:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
-      src = gen_rtx_MINUS (V2DImode, subreg, src);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+      src = gen_rtx_MINUS (vmode, subreg, src);
       break;
 
     case NOT:
       src = XEXP (src, 0);
       convert_op (&src, insn);
-      subreg = gen_reg_rtx (V2DImode);
-      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
-      src = gen_rtx_XOR (V2DImode, src, subreg);
+      subreg = gen_reg_rtx (vmode);
+      emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+      src = gen_rtx_XOR (vmode, src, subreg);
       break;
 
     case MEM:
@@ -939,17 +1027,17 @@ dimode_scalar_chain::convert_insn (rtx_i
       break;
 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;
 
     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
 
-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-		  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+		  || (SUBREG_P (src) && GET_MODE (src) == vmode));
 
       if (REG_P (src))
-	subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+	subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
 	subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1065,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
 
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if  (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }
 
@@ -1116,7 +1206,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }
 
 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1186,7 +1276,7 @@ has_non_address_hard_reg (rtx_insn *insn
 		     (const_int 0 [0])))  */
 
 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1309,12 @@ convertible_comparison_p (rtx_insn *insn
 
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+	   || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
 	  && (SUBREG_BYTE (op2) != 0
-	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+	      || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;
 
   op1 = SUBREG_REG (op1);
@@ -1232,7 +1322,7 @@ convertible_comparison_p (rtx_insn *insn
 
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;
 
   return true;
@@ -1241,7 +1331,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */
 
 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
 
@@ -1255,12 +1345,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);
 
   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);
 
   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;
 
   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1370,15 @@ dimode_scalar_to_vector_candidate_p (rtx
 	return false;
       break;
 
+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+	  || (mode == SImode && !TARGET_SSE4_1))
+	return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1389,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
 
-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
 	  && !CONST_INT_P (XEXP (src, 1)))
 	return false;
       break;
@@ -1319,7 +1418,7 @@ dimode_scalar_to_vector_candidate_p (rtx
 	  || !REG_P (XEXP (XEXP (src, 0), 0))))
       return false;
 
-  if (GET_MODE (XEXP (src, 0)) != DImode
+  if (GET_MODE (XEXP (src, 0)) != mode
       && !CONST_INT_P (XEXP (src, 0)))
     return false;
 
@@ -1383,22 +1482,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }
 
-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
 
-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1646,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }
 
-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */
 
@@ -1577,11 +1653,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;
 
   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);
 
   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1676,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-	if (scalar_to_vector_candidate_p (insn))
+	if (TARGET_64BIT
+	    && timode_scalar_to_vector_candidate_p (insn))
 	  {
 	    if (dump_file)
-	      fprintf (dump_file, "  insn %d is marked as a candidate\n",
+	      fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
 		       INSN_UID (insn));
 
-	    bitmap_set_bit (candidates, INSN_UID (insn));
+	    bitmap_set_bit (&candidates[2], INSN_UID (insn));
+	  }
+	else
+	  {
+	    /* Check {SI,DI}mode.  */
+	    for (unsigned i = 0; i <= 1; ++i)
+	      if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+			     INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+		  bitmap_set_bit (&candidates[i], INSN_UID (insn));
+		  break;
+		}
 	  }
     }
 
-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);
 
-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");
 
-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+	unsigned uid = bitmap_first_set_bit (&candidates[i]);
+	scalar_chain *chain;
 
-      if (TARGET_64BIT)
-	chain = new timode_scalar_chain;
-      else
-	chain = new dimode_scalar_chain;
+	if (cand_mode[i] == TImode)
+	  chain = new timode_scalar_chain;
+	else
+	  chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);
 
-      /* Find instructions chain we want to convert to vector mode.
-	 Check all uses and definitions to estimate all required
-	 conversions.  */
-      chain->build (candidates, uid);
+	/* Find instructions chain we want to convert to vector mode.
+	   Check all uses and definitions to estimate all required
+	   conversions.  */
+	chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	if (chain->compute_convert_gain () > 0)
+	  converted_insns += chain->convert ();
+	else
+	  if (dump_file)
+	    fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		     chain->chain_id);
 
-      delete chain;
-    }
+	delete chain;
+      }
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
 
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
  private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
 
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17729,6 +17729,110 @@ (define_expand "add<mode>cc"
    (match_operand:SWI 3 "const_int_operand")]
   ""
   "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
+
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
+
+(define_expand "<code><mode>3"
+  [(parallel
+    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	  (maxmin:MAXMIN_IMODE
+	    (match_operand:MAXMIN_IMODE 1 "register_operand")
+	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+     (clobber (reg:CC FLAGS_REG))])]
+  "TARGET_STV")
+
+(define_insn_and_split "*<code><mode>3_1"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	(maxmin:MAXMIN_IMODE
+	  (match_operand:MAXMIN_IMODE 1 "register_operand")
+	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:MAXMIN_IMODE (match_dup 3)
+	  (match_dup 1)
+	  (match_dup 2)))]
+{
+  machine_mode mode = <MODE>mode;
+
+  if (!register_operand (operands[2], mode))
+    operands[2] = force_reg (mode, operands[2]);
+
+  enum rtx_code code = <maxmin_rel>;
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
+
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  emit_insn (gen_rtx_SET (flags, tmp));
+
+  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+})
+
+(define_insn_and_split "*<code>di3_doubleword"
+  [(set (match_operand:DI 0 "register_operand")
+	(maxmin:DI (match_operand:DI 1 "register_operand")
+		   (match_operand:DI 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 1)
+	  (match_dup 2)))
+   (set (match_dup 3)
+	(if_then_else:SI (match_dup 6)
+	  (match_dup 4)
+	  (match_dup 5)))]
+{
+  if (!register_operand (operands[2], DImode))
+    operands[2] = force_reg (DImode, operands[2]);
+
+  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+
+  rtx cmplo[2] = { operands[1], operands[2] };
+  rtx cmphi[2] = { operands[4], operands[5] };
+
+  enum rtx_code code = <maxmin_rel>;
+
+  switch (code)
+    {
+    case LE: case LEU:
+      std::swap (cmplo[0], cmplo[1]);
+      std::swap (cmphi[0], cmphi[1]);
+      code = swap_condition (code);
+      /* FALLTHRU */
+
+    case GE: case GEU:
+      {
+	bool uns = (code == GEU);
+	rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
+	  = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
+
+	emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
+
+	rtx tmp = gen_rtx_SCRATCH (SImode);
+	emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
+
+	rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
+	operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+
+	break;
+      }
+
+    default:
+      gcc_unreachable ();
+    }
+})
 
 ;; Misc patterns (?)
 
Index: gcc/testsuite/gcc.target/i386/minmax-3.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int ssi[1024];
+unsigned int usi[1024];
+long long sdi[1024];
+unsigned long long udi[1024];
+
+#define CHECK(FN, VARIANT) \
+void \
+FN ## VARIANT (void) \
+{ \
+  for (int i = 1; i < 1024; ++i) \
+    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
+}
+
+CHECK(max, ssi);
+CHECK(min, ssi);
+CHECK(max, usi);
+CHECK(min, usi);
+CHECK(max, sdi);
+CHECK(min, sdi);
+CHECK(max, udi);
+CHECK(min, udi);
Index: gcc/testsuite/gcc.target/i386/minmax-4.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -msse4.1" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "pminsd" 1 } } */
+/* { dg-final { scan-assembler-times "pminud" 1 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-6.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-6.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-6.c	(working copy)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=haswell" } */
+
+unsigned short
+UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
+{
+  if (y != width)
+    {
+      y = y < 0 ? 0 : y;
+      return Pic[y * width];
+    }
+  return Pic[y];
+}
+
+/* We do not want the RA to spill %esi for its dual-use but using
+   pmaxsd is OK.  */
+/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
+/* { dg-final { scan-assembler "pmaxsd" } } */
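
For anyone who wants to check the *<code>di3_doubleword splitter by hand, here is a small C model of what the emitted cmp/sbb sequence decides on a 32-bit target. It is only a sketch and not part of the patch: the helper names (ge_unsigned, ge_signed, umax64, ...) are made up for illustration, and it assumes that the CCCmode GEU test means "carry clear" and that the CCGZmode GE test means "SF == OF" after the sbb on the high halves. The min variants mirror the operand swap the splitter does for LE/LEU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of "cmp alo, blo; sbb ahi, bhi" followed by a GEU test
   (carry clear).  Returns true when a >= b as unsigned 64-bit.  */
static bool
ge_unsigned (uint64_t a, uint64_t b)
{
  uint32_t alo = (uint32_t) a, ahi = (uint32_t) (a >> 32);
  uint32_t blo = (uint32_t) b, bhi = (uint32_t) (b >> 32);

  /* CF after comparing the low halves.  */
  bool borrow = alo < blo;
  /* CF after the sbb: borrow out of ahi - bhi - borrow.  */
  bool cf = (uint64_t) ahi < (uint64_t) bhi + borrow;
  return !cf;
}

/* Same sequence followed by a GE test (SF == OF).  The exact value of
   ahi - bhi - borrow, computed in wider arithmetic, is non-negative
   exactly when SF == OF holds for the truncated 32-bit result.  */
static bool
ge_signed (int64_t a, int64_t b)
{
  uint32_t alo = (uint32_t) a, blo = (uint32_t) b;
  /* Conversion to int32_t relies on two's complement wraparound,
     which GCC guarantees.  */
  int32_t ahi = (int32_t) (uint32_t) ((uint64_t) a >> 32);
  int32_t bhi = (int32_t) (uint32_t) ((uint64_t) b >> 32);

  bool borrow = alo < blo;
  int64_t wide = (int64_t) ahi - (int64_t) bhi - borrow;
  return wide >= 0;
}

/* max selects operand 1 when op1 >= op2.  */
static uint64_t umax64 (uint64_t a, uint64_t b) { return ge_unsigned (a, b) ? a : b; }
static int64_t smax64 (int64_t a, int64_t b) { return ge_signed (a, b) ? a : b; }

/* min swaps the compare operands (LE -> GE) but keeps the select arms.  */
static uint64_t umin64 (uint64_t a, uint64_t b) { return ge_unsigned (b, a) ? a : b; }
static int64_t smin64 (int64_t a, int64_t b) { return ge_signed (b, a) ? a : b; }
```

The point of the model is that only the flags from the high-half sbb are consumed, so a single compare/borrow chain is enough to drive both SImode conditional moves.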
