This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
On Sun, 4 Aug 2019, Uros Bizjak wrote:
> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> >
> > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > >>>> necessary even when going the STV route. The actual regression
> > >>>> for the testcase could also be solved by turing the smaxsi3
> > >>>> back into a compare and jump rather than a conditional move sequence.
> > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > >>>> after pass_split_after_reload and I'm not sure we can split
> > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > >>>> sched2 on x86 it seems).
> > >>>>
> > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> > >>>> case STV doesn't end up doing any transform?
> > >>>
> > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > >>> the insn back to compare+cmove.
> > >>
> > >> OK, that would work. But there's no way to force a jumpy sequence then
> > >> which we know is faster than compare+cmove because later RTL
> > >> if-conversion passes happily re-discover the smax (or conditional move)
> > >> sequence.
> > >>
> > >>> However, considering the SImode move
> > >>> from/to int/xmm register is relatively cheap, the cost function should
> > >>> be tuned so that STV always converts smaxsi3 pattern.
> > >>
> > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > >> sequence... (for the loop in hmmer which is the only one I see
> > >> any effect of any of my patches). So identifying chains that
> > >> start/end in memory is important for cost reasons.
> > >
> > > Please note that the cost function also considers the cost of move
> > > from/to xmm. So, the cost of the whole chain would disable the
> > > transformation.
> > >
> > >> So I think the splitting has to happen after the last if-conversion
> > >> pass (and thus we may need to allocate a scratch register for this
> > >> purpose?)
> > >
> > > I really hope that the underlying issue will be solved by a machine
> > > dependant pass inserted somewhere after the pre-reload split. This
> > > way, we can split unconverted smax to the cmove, and this later pass
> > > would handle jcc and cmove instructions. Until then... yes your
> > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > although sometimes we would like to split to cmove instead.
> >
> > So the following makes STV also consider SImode chains, re-using the
> > DImode chain code. I've kept a simple incomplete smaxsi3 pattern
> > and also did not alter the {SI,DI}mode chain cost function - it's
> > quite off for TARGET_64BIT. With this I get the expected conversion
> > for the testcase derived from hmmer.
> >
> > No further testing sofar.
> >
> > Is it OK to re-use the DImode chain code this way? I'll clean things
> > up some more of course.
>
> Yes, the approach looks OK to me. It makes chain building mode
> agnostic, and the chain building can be used for
> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> minmax and surrounding SImode operations)
> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> DImode operations)
>
> > Still need help with the actual patterns for minmax and how the splitters
> > should look like.
>
> Please look at the attached patch. Maybe we can add memory_operand as
> operand 1 and operand 2 predicate, but let's keep things simple for
> now.
Thanks. The attached patch makes the patch cleaner and it survives
"some" barebone testing. It also touches the cost function to
avoid being too overly trigger-happy. I've also ended up using
ix86_cost->sse_op instead of COSTS_N_INSN-based magic. In
particular we estimated GPR reg-reg move as COST_N_INSNS(2) while
move costs shouldn't be wrapped in COST_N_INSNS.
IMHO we should probably disregard any reg-reg moves for costing pre-RA.
At least with the current code every reg-reg move biases in favor of
SSE...
And we're simply adding move and non-move costs in 'gain', somewhat
mixing apples and oranges? We could separate those and require
both to be a net positive win?
Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
costs as apples and oranges... (so it never triggers for Bulldozer)
I now run into
/space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
error: unrecognizable insn:
(insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
(smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
(subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
-1
(expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil))))
during RTL pass: stv
where even with -mavx2 we do not have s{min,max}v2di3. We do have
an expander here but it seems only AVX512F has the DImode min/max
ops. I have adjusted dimode_scalar_to_vector_candidate_p
accordingly.
I'm considering to rename the
dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
functions to drop the dimode_ prefix - is that OK or do you
prefer some other prefix?
So - bootstrap with --with-arch=skylake in progress.
It detects quite a few chains (unsurprisingly) so I guess we need
to address compile-time issues in the pass before enabling this
enhancement (maybe as followup?).
Further comments on the actual patch welcome, I consider it
"finished" if testing reveals no issues. ChangeLog still needs
to be written and testcases to be added.
Thanks,
Richard.
Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c (revision 274111)
+++ gcc/config/i386/i386-features.c (working copy)
@@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
/* Initialize new chain. */
-scalar_chain::scalar_chain ()
+scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
{
+ smode = smode_;
+ vmode = vmode_;
+
chain_id = ++max_id;
if (dump_file)
@@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
&& !HARD_REGISTER_P (SET_DEST (def_set)))
bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
+ /* ??? The following is quadratic since analyze_register_chain
+ iterates over all refs to look for dual-mode regs. Instead this
+ should be done separately for all regs mentioned in the chain once. */
df_ref ref;
df_ref def;
for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost (
{
gcc_assert (CONST_INT_P (exp));
- if (standard_sse_constant_p (exp, V2DImode))
- return COSTS_N_INSNS (1);
- return ix86_cost->sse_load[1];
+ if (standard_sse_constant_p (exp, vmode))
+ return ix86_cost->sse_op;
+ /* We have separate costs for SImode and DImode, use SImode costs
+ for smaller modes. */
+ return ix86_cost->sse_load[smode == DImode ? 1 : 0];
}
/* Compute a gain for chain conversion. */
@@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
if (dump_file)
fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
+ /* SSE costs distinguish between SImode and DImode loads/stores, for
+ int costs factor in the number of GPRs involved. When supporting
+ smaller modes than SImode the int load/store costs need to be
+ adjusted as well. */
+ unsigned sse_cost_idx = smode == DImode ? 1 : 0;
+ unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+
EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
{
rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
rtx def_set = single_set (insn);
rtx src = SET_SRC (def_set);
rtx dst = SET_DEST (def_set);
+ int igain = 0;
if (REG_P (src) && REG_P (dst))
- gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
+ igain += 2 * m - ix86_cost->xmm_move;
else if (REG_P (src) && MEM_P (dst))
- gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+ igain
+ += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
else if (MEM_P (src) && REG_P (dst))
- gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+ igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
else if (GET_CODE (src) == ASHIFT
|| GET_CODE (src) == ASHIFTRT
|| GET_CODE (src) == LSHIFTRT)
{
if (CONST_INT_P (XEXP (src, 0)))
- gain -= vector_const_cost (XEXP (src, 0));
- gain += ix86_cost->shift_const;
+ igain -= vector_const_cost (XEXP (src, 0));
+ igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
if (INTVAL (XEXP (src, 1)) >= 32)
- gain -= COSTS_N_INSNS (1);
+ igain -= COSTS_N_INSNS (1);
}
else if (GET_CODE (src) == PLUS
|| GET_CODE (src) == MINUS
@@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
|| GET_CODE (src) == XOR
|| GET_CODE (src) == AND)
{
- gain += ix86_cost->add;
+ igain += m * ix86_cost->add - ix86_cost->sse_op;
/* Additional gain for andnot for targets without BMI. */
if (GET_CODE (XEXP (src, 0)) == NOT
&& !TARGET_BMI)
- gain += 2 * ix86_cost->add;
+ igain += m * ix86_cost->add;
if (CONST_INT_P (XEXP (src, 0)))
- gain -= vector_const_cost (XEXP (src, 0));
+ igain -= vector_const_cost (XEXP (src, 0));
if (CONST_INT_P (XEXP (src, 1)))
- gain -= vector_const_cost (XEXP (src, 1));
+ igain -= vector_const_cost (XEXP (src, 1));
}
else if (GET_CODE (src) == NEG
|| GET_CODE (src) == NOT)
- gain += ix86_cost->add - COSTS_N_INSNS (1);
+ igain += m * ix86_cost->add - ix86_cost->sse_op;
+ else if (GET_CODE (src) == SMAX
+ || GET_CODE (src) == SMIN
+ || GET_CODE (src) == UMAX
+ || GET_CODE (src) == UMIN)
+ {
+ /* We do not have any conditional move cost, estimate it as a
+ reg-reg move. Comparisons are costed as adds. */
+ igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
+ /* Integer SSE ops are all costed the same. */
+ igain -= ix86_cost->sse_op;
+ }
else if (GET_CODE (src) == COMPARE)
{
/* Assume comparison cost is the same. */
@@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
else if (CONST_INT_P (src))
{
if (REG_P (dst))
- gain += COSTS_N_INSNS (2);
+ /* DImode can be immediate for TARGET_64BIT and SImode always. */
+ igain += COSTS_N_INSNS (m);
else if (MEM_P (dst))
- gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
- gain -= vector_const_cost (src);
+ igain += (m * ix86_cost->int_store[2]
+ - ix86_cost->sse_store[sse_cost_idx]);
+ igain -= vector_const_cost (src);
}
else
gcc_unreachable ();
+
+ if (igain != 0 && dump_file)
+ {
+ fprintf (dump_file, " Instruction gain %d for ", igain);
+ dump_insn_slim (dump_file, insn);
+ }
+ gain += igain;
}
if (dump_file)
fprintf (dump_file, " Instruction conversion gain: %d\n", gain);
+ /* ??? What about integer to SSE? */
EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
@@ -573,7 +611,7 @@ rtx
dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
{
if (x == reg)
- return gen_rtx_SUBREG (V2DImode, new_reg, 0);
+ return gen_rtx_SUBREG (vmode, new_reg, 0);
const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
int i, j;
@@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
start_sequence ();
if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
{
- rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
- emit_move_insn (adjust_address (tmp, SImode, 0),
- gen_rtx_SUBREG (SImode, reg, 0));
- emit_move_insn (adjust_address (tmp, SImode, 4),
- gen_rtx_SUBREG (SImode, reg, 4));
+ rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
+ if (smode == DImode && !TARGET_64BIT)
+ {
+ emit_move_insn (adjust_address (tmp, SImode, 0),
+ gen_rtx_SUBREG (SImode, reg, 0));
+ emit_move_insn (adjust_address (tmp, SImode, 4),
+ gen_rtx_SUBREG (SImode, reg, 4));
+ }
+ else
+ emit_move_insn (tmp, reg);
emit_move_insn (vreg, tmp);
}
- else if (TARGET_SSE4_1)
+ else if (!TARGET_64BIT && smode == DImode)
{
- emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
- CONST0_RTX (V4SImode),
- gen_rtx_SUBREG (SImode, reg, 0)));
- emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
- gen_rtx_SUBREG (V4SImode, vreg, 0),
- gen_rtx_SUBREG (SImode, reg, 4),
- GEN_INT (2)));
+ if (TARGET_SSE4_1)
+ {
+ emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+ CONST0_RTX (V4SImode),
+ gen_rtx_SUBREG (SImode, reg, 0)));
+ emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
+ gen_rtx_SUBREG (V4SImode, vreg, 0),
+ gen_rtx_SUBREG (SImode, reg, 4),
+ GEN_INT (2)));
+ }
+ else
+ {
+ rtx tmp = gen_reg_rtx (DImode);
+ emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
+ CONST0_RTX (V4SImode),
+ gen_rtx_SUBREG (SImode, reg, 0)));
+ emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
+ CONST0_RTX (V4SImode),
+ gen_rtx_SUBREG (SImode, reg, 4)));
+ emit_insn (gen_vec_interleave_lowv4si
+ (gen_rtx_SUBREG (V4SImode, vreg, 0),
+ gen_rtx_SUBREG (V4SImode, vreg, 0),
+ gen_rtx_SUBREG (V4SImode, tmp, 0)));
+ }
}
else
- {
- rtx tmp = gen_reg_rtx (DImode);
- emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
- CONST0_RTX (V4SImode),
- gen_rtx_SUBREG (SImode, reg, 0)));
- emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
- CONST0_RTX (V4SImode),
- gen_rtx_SUBREG (SImode, reg, 4)));
- emit_insn (gen_vec_interleave_lowv4si
- (gen_rtx_SUBREG (V4SImode, vreg, 0),
- gen_rtx_SUBREG (V4SImode, vreg, 0),
- gen_rtx_SUBREG (V4SImode, tmp, 0)));
- }
+ emit_move_insn (gen_lowpart (smode, vreg), reg);
rtx_insn *seq = get_insns ();
end_sequence ();
rtx_insn *insn = DF_REF_INSN (ref);
@@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
bitmap_copy (conv, insns);
if (scalar_copy)
- scopy = gen_reg_rtx (DImode);
+ scopy = gen_reg_rtx (smode);
for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
{
@@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
start_sequence ();
if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
{
- rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
+ rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
emit_move_insn (tmp, reg);
- emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
- adjust_address (tmp, SImode, 0));
- emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
- adjust_address (tmp, SImode, 4));
+ if (!TARGET_64BIT && smode == DImode)
+ {
+ emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+ adjust_address (tmp, SImode, 0));
+ emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+ adjust_address (tmp, SImode, 4));
+ }
+ else
+ emit_move_insn (scopy, tmp);
}
- else if (TARGET_SSE4_1)
+ else if (!TARGET_64BIT && smode == DImode)
{
- rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
- emit_insn
- (gen_rtx_SET
- (gen_rtx_SUBREG (SImode, scopy, 0),
- gen_rtx_VEC_SELECT (SImode,
- gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
-
- tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
- emit_insn
- (gen_rtx_SET
- (gen_rtx_SUBREG (SImode, scopy, 4),
- gen_rtx_VEC_SELECT (SImode,
- gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
+ if (TARGET_SSE4_1)
+ {
+ rtx tmp = gen_rtx_PARALLEL (VOIDmode,
+ gen_rtvec (1, const0_rtx));
+ emit_insn
+ (gen_rtx_SET
+ (gen_rtx_SUBREG (SImode, scopy, 0),
+ gen_rtx_VEC_SELECT (SImode,
+ gen_rtx_SUBREG (V4SImode, reg, 0),
+ tmp)));
+
+ tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
+ emit_insn
+ (gen_rtx_SET
+ (gen_rtx_SUBREG (SImode, scopy, 4),
+ gen_rtx_VEC_SELECT (SImode,
+ gen_rtx_SUBREG (V4SImode, reg, 0),
+ tmp)));
+ }
+ else
+ {
+ rtx vcopy = gen_reg_rtx (V2DImode);
+ emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
+ emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
+ gen_rtx_SUBREG (SImode, vcopy, 0));
+ emit_move_insn (vcopy,
+ gen_rtx_LSHIFTRT (V2DImode,
+ vcopy, GEN_INT (32)));
+ emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
+ gen_rtx_SUBREG (SImode, vcopy, 0));
+ }
}
else
- {
- rtx vcopy = gen_reg_rtx (V2DImode);
- emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
- emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
- gen_rtx_SUBREG (SImode, vcopy, 0));
- emit_move_insn (vcopy,
- gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
- emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
- gen_rtx_SUBREG (SImode, vcopy, 0));
- }
+ emit_move_insn (scopy, reg);
+
rtx_insn *seq = get_insns ();
end_sequence ();
emit_conversion_insns (seq, insn);
@@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op
if (GET_CODE (*op) == NOT)
{
convert_op (&XEXP (*op, 0), insn);
- PUT_MODE (*op, V2DImode);
+ PUT_MODE (*op, vmode);
}
else if (MEM_P (*op))
{
- rtx tmp = gen_reg_rtx (DImode);
+ rtx tmp = gen_reg_rtx (GET_MODE (*op));
emit_insn_before (gen_move_insn (tmp, *op), insn);
- *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
+ *op = gen_rtx_SUBREG (vmode, tmp, 0);
if (dump_file)
fprintf (dump_file, " Preloading operand for insn %d into r%d\n",
@@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
gcc_assert (!DF_REF_CHAIN (ref));
break;
}
- *op = gen_rtx_SUBREG (V2DImode, *op, 0);
+ *op = gen_rtx_SUBREG (vmode, *op, 0);
}
else if (CONST_INT_P (*op))
{
rtx vec_cst;
- rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
+ rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
/* Prefer all ones vector in case of -1. */
if (constm1_operand (*op, GET_MODE (*op)))
- vec_cst = CONSTM1_RTX (V2DImode);
+ vec_cst = CONSTM1_RTX (vmode);
else
- vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
- gen_rtvec (2, *op, const0_rtx));
+ {
+ unsigned n = GET_MODE_NUNITS (vmode);
+ rtx *v = XALLOCAVEC (rtx, n);
+ v[0] = *op;
+ for (unsigned i = 1; i < n; ++i)
+ v[i] = const0_rtx;
+ vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+ }
- if (!standard_sse_constant_p (vec_cst, V2DImode))
+ if (!standard_sse_constant_p (vec_cst, vmode))
{
start_sequence ();
- vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
+ vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
rtx_insn *seq = get_insns ();
end_sequence ();
emit_insn_before (seq, insn);
@@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op
else
{
gcc_assert (SUBREG_P (*op));
- gcc_assert (GET_MODE (*op) == V2DImode);
+ gcc_assert (GET_MODE (*op) == vmode);
}
}
@@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
{
/* There are no scalar integer instructions and therefore
temporary register usage is required. */
- rtx tmp = gen_reg_rtx (DImode);
+ rtx tmp = gen_reg_rtx (GET_MODE (dst));
emit_conversion_insns (gen_move_insn (dst, tmp), insn);
- dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
+ dst = gen_rtx_SUBREG (vmode, tmp, 0);
}
switch (GET_CODE (src))
@@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
case ASHIFTRT:
case LSHIFTRT:
convert_op (&XEXP (src, 0), insn);
- PUT_MODE (src, V2DImode);
+ PUT_MODE (src, vmode);
break;
case PLUS:
@@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
case IOR:
case XOR:
case AND:
+ case SMAX:
+ case SMIN:
+ case UMAX:
+ case UMIN:
convert_op (&XEXP (src, 0), insn);
convert_op (&XEXP (src, 1), insn);
- PUT_MODE (src, V2DImode);
+ PUT_MODE (src, vmode);
break;
case NEG:
src = XEXP (src, 0);
convert_op (&src, insn);
- subreg = gen_reg_rtx (V2DImode);
- emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
- src = gen_rtx_MINUS (V2DImode, subreg, src);
+ subreg = gen_reg_rtx (vmode);
+ emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
+ src = gen_rtx_MINUS (vmode, subreg, src);
break;
case NOT:
src = XEXP (src, 0);
convert_op (&src, insn);
- subreg = gen_reg_rtx (V2DImode);
- emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
- src = gen_rtx_XOR (V2DImode, src, subreg);
+ subreg = gen_reg_rtx (vmode);
+ emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
+ src = gen_rtx_XOR (vmode, src, subreg);
break;
case MEM:
@@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
break;
case SUBREG:
- gcc_assert (GET_MODE (src) == V2DImode);
+ gcc_assert (GET_MODE (src) == vmode);
break;
case COMPARE:
src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
- gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
- || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+ gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+ || (SUBREG_P (src) && GET_MODE (src) == vmode));
if (REG_P (src))
- subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+ subreg = gen_rtx_SUBREG (vmode, src, 0);
else
subreg = copy_rtx_if_shared (src);
emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
PATTERN (insn) = def_set;
INSN_CODE (insn) = -1;
- recog_memoized (insn);
+ int patt = recog_memoized (insn);
+ if (patt == -1)
+ fatal_insn_not_found (insn);
df_insn_rescan (insn);
}
@@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
(const_int 0 [0]))) */
static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
{
if (!TARGET_SSE4_1)
return false;
@@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
if (!SUBREG_P (op1)
|| !SUBREG_P (op2)
- || GET_MODE (op1) != SImode
- || GET_MODE (op2) != SImode
+ || GET_MODE (op1) != mode
+ || GET_MODE (op2) != mode
|| ((SUBREG_BYTE (op1) != 0
- || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+ || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
&& (SUBREG_BYTE (op2) != 0
- || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+ || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
return false;
op1 = SUBREG_REG (op1);
@@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
if (op1 != op2
|| !REG_P (op1)
- || GET_MODE (op1) != DImode)
+ || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
return false;
return true;
@@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
/* The DImode version of scalar_to_vector_candidate_p. */
static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
{
rtx def_set = single_set (insn);
@@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
rtx dst = SET_DEST (def_set);
if (GET_CODE (src) == COMPARE)
- return convertible_comparison_p (insn);
+ return convertible_comparison_p (insn, mode);
/* We are interested in DImode promotion only. */
- if ((GET_MODE (src) != DImode
+ if ((GET_MODE (src) != mode
&& !CONST_INT_P (src))
- || GET_MODE (dst) != DImode)
+ || GET_MODE (dst) != mode)
return false;
if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
return false;
break;
+ case SMAX:
+ case SMIN:
+ case UMAX:
+ case UMIN:
+ if ((mode == DImode && !TARGET_AVX512F)
+ || (mode == SImode && !TARGET_SSE4_1))
+ return false;
+ /* Fallthru. */
+
case PLUS:
case MINUS:
case IOR:
@@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
&& !CONST_INT_P (XEXP (src, 1)))
return false;
- if (GET_MODE (XEXP (src, 1)) != DImode
+ if (GET_MODE (XEXP (src, 1)) != mode
&& !CONST_INT_P (XEXP (src, 1)))
return false;
break;
@@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
|| !REG_P (XEXP (XEXP (src, 0), 0))))
return false;
- if (GET_MODE (XEXP (src, 0)) != DImode
+ if (GET_MODE (XEXP (src, 0)) != mode
&& !CONST_INT_P (XEXP (src, 0)))
return false;
@@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx
return false;
}
-/* Return 1 if INSN may be converted into vector
- instruction. */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
- if (TARGET_64BIT)
- return timode_scalar_to_vector_candidate_p (insn);
- else
- return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+ remove insn from CANDIDATES in case it has both convertible
+ and not convertible definitions.
-/* The DImode version of remove_non_convertible_regs. */
+ All insns in a bitmap are conversion candidates according to
+ scalar_to_vector_candidate_p. Currently it implies all insns
+ are single_set. */
static void
dimode_remove_non_convertible_regs (bitmap candidates)
@@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
BITMAP_FREE (regs);
}
-/* For a given bitmap of insn UIDs scans all instruction and
- remove insn from CANDIDATES in case it has both convertible
- and not convertible definitions.
-
- All insns in a bitmap are conversion candidates according to
- scalar_to_vector_candidate_p. Currently it implies all insns
- are single_set. */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
- if (TARGET_64BIT)
- timode_remove_non_convertible_regs (candidates);
- else
- dimode_remove_non_convertible_regs (candidates);
-}
-
/* Main STV pass function. Find and convert scalar
instructions into vector mode when profitable. */
@@ -1577,11 +1638,14 @@ static unsigned int
convert_scalars_to_vector ()
{
basic_block bb;
- bitmap candidates;
int converted_insns = 0;
bitmap_obstack_initialize (NULL);
- candidates = BITMAP_ALLOC (NULL);
+ const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+ const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+ bitmap_head candidates[3]; /* { SImode, DImode, TImode } */
+ for (unsigned i = 0; i < 3; ++i)
+ bitmap_initialize (&candidates[i], &bitmap_default_obstack);
calculate_dominance_info (CDI_DOMINATORS);
df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
{
rtx_insn *insn;
FOR_BB_INSNS (bb, insn)
- if (scalar_to_vector_candidate_p (insn))
+ if (TARGET_64BIT
+ && timode_scalar_to_vector_candidate_p (insn))
{
if (dump_file)
- fprintf (dump_file, " insn %d is marked as a candidate\n",
+ fprintf (dump_file, " insn %d is marked as a TImode candidate\n",
INSN_UID (insn));
- bitmap_set_bit (candidates, INSN_UID (insn));
+ bitmap_set_bit (&candidates[2], INSN_UID (insn));
+ }
+ else
+ {
+ /* Check {SI,DI}mode. */
+ for (unsigned i = 0; i <= 1; ++i)
+ if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+ {
+ if (dump_file)
+ fprintf (dump_file, " insn %d is marked as a %s candidate\n",
+ INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+ bitmap_set_bit (&candidates[i], INSN_UID (insn));
+ break;
+ }
}
}
- remove_non_convertible_regs (candidates);
+ if (TARGET_64BIT)
+ timode_remove_non_convertible_regs (&candidates[2]);
+ for (unsigned i = 0; i <= 1; ++i)
+ dimode_remove_non_convertible_regs (&candidates[i]);
- if (bitmap_empty_p (candidates))
- if (dump_file)
+ for (unsigned i = 0; i <= 2; ++i)
+ if (!bitmap_empty_p (&candidates[i]))
+ break;
+ else if (i == 2 && dump_file)
fprintf (dump_file, "There are no candidates for optimization.\n");
- while (!bitmap_empty_p (candidates))
- {
- unsigned uid = bitmap_first_set_bit (candidates);
- scalar_chain *chain;
+ for (unsigned i = 0; i <= 2; ++i)
+ while (!bitmap_empty_p (&candidates[i]))
+ {
+ unsigned uid = bitmap_first_set_bit (&candidates[i]);
+ scalar_chain *chain;
- if (TARGET_64BIT)
- chain = new timode_scalar_chain;
- else
- chain = new dimode_scalar_chain;
+ if (cand_mode[i] == TImode)
+ chain = new timode_scalar_chain;
+ else
+ chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]);
- /* Find instructions chain we want to convert to vector mode.
- Check all uses and definitions to estimate all required
- conversions. */
- chain->build (candidates, uid);
+ /* Find instructions chain we want to convert to vector mode.
+ Check all uses and definitions to estimate all required
+ conversions. */
+ chain->build (&candidates[i], uid);
- if (chain->compute_convert_gain () > 0)
- converted_insns += chain->convert ();
- else
- if (dump_file)
- fprintf (dump_file, "Chain #%d conversion is not profitable\n",
- chain->chain_id);
+ if (chain->compute_convert_gain () > 0)
+ converted_insns += chain->convert ();
+ else
+ if (dump_file)
+ fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+ chain->chain_id);
- delete chain;
- }
+ delete chain;
+ }
if (dump_file)
fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
- BITMAP_FREE (candidates);
+ for (unsigned i = 0; i <= 2; ++i)
+ bitmap_release (&candidates[i]);
bitmap_obstack_release (NULL);
df_process_deferred_rescans ();
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h (revision 274111)
+++ gcc/config/i386/i386-features.h (working copy)
@@ -127,11 +127,16 @@ namespace {
class scalar_chain
{
public:
- scalar_chain ();
+ scalar_chain (enum machine_mode, enum machine_mode);
virtual ~scalar_chain ();
static unsigned max_id;
+ /* Scalar mode. */
+ enum machine_mode smode;
+ /* Vector mode. */
+ enum machine_mode vmode;
+
/* ID of a chain. */
unsigned int chain_id;
/* A queue of instructions to be included into a chain. */
@@ -162,6 +167,8 @@ class scalar_chain
class dimode_scalar_chain : public scalar_chain
{
public:
+ dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+ : scalar_chain (smode_, vmode_) {}
int compute_convert_gain ();
private:
void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
class timode_scalar_chain : public scalar_chain
{
public:
+ timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
/* Convert from TImode to V1TImode is always faster. */
int compute_convert_gain () { return 1; }
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md (revision 274111)
+++ gcc/config/i386/i386.md (working copy)
@@ -17721,6 +17721,27 @@ (define_peephole2
std::swap (operands[4], operands[5]);
})
+;; min/max patterns
+
+(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
+
+(define_insn_and_split "<code><mode>3"
+ [(set (match_operand:SWI48 0 "register_operand")
+ (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+ (match_operand:SWI48 2 "register_operand")))
+ (clobber (reg:CC FLAGS_REG))]
+ "TARGET_STV && TARGET_SSE4_1
+ && can_create_pseudo_p ()"
+ "#"
+ "&& 1"
+ [(set (reg:CCGC FLAGS_REG)
+ (compare:CCGC (match_dup 1)(match_dup 2)))
+ (set (match_dup 0)
+ (if_then_else:SWI48
+ (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
+ (match_dup 1)
+ (match_dup 2)))])
+
;; Conditional addition patterns
(define_expand "add<mode>cc"
[(match_operand:SWI 0 "register_operand")
- References:
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs