[PATCH] Reintroduce vec_shl_optab and use it for #pragma omp scan inclusive

Jakub Jelinek jakub@redhat.com
Wed Jun 19 08:55:00 GMT 2019


Hi!

When VEC_[LR]SHIFT_EXPR was replaced with VEC_PERM_EXPR, vec_shl_optab
was removed as unused, because only vec_shr_optab was being used, for the
reductions.
Without this patch the vect-simd-*.c tests can be vectorized just fine
for SSE4 and above, but can't be with SSE2.  As the comment in
tree-vect-stmts.c tries to explain, for the inclusive scan operation we
want (when using V8SImode vectors):
       _30 = MEM <vector(8) int> [(int *)&D.2043];
       _31 = MEM <vector(8) int> [(int *)&D.2042];
       _32 = VEC_PERM_EXPR <_31, _40, { 8, 0, 1, 2, 3, 4, 5, 6 }>;
       _33 = _31 + _32;
       // _33 = { _31[0], _31[0]+_31[1], _31[1]+_31[2], ..., _31[6]+_31[7] };
       _34 = VEC_PERM_EXPR <_33, _40, { 8, 9, 0, 1, 2, 3, 4, 5 }>;
       _35 = _33 + _34;
       // _35 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2], _31[0]+.._31[3],
       //         _31[1]+.._31[4], ... _31[4]+.._31[7] };
       _36 = VEC_PERM_EXPR <_35, _40, { 8, 9, 10, 11, 0, 1, 2, 3 }>;
       _37 = _35 + _36;
       // _37 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2], _31[0]+.._31[3],
       //         _31[0]+.._31[4], ... _31[0]+.._31[7] };
       _38 = _30 + _37;
       _39 = VEC_PERM_EXPR <_38, _38, { 7, 7, 7, 7, 7, 7, 7, 7 }>;
       MEM <vector(8) int> [(int *)&D.2043] = _39;
       MEM <vector(8) int> [(int *)&D.2042] = _38;
For V4SImode vectors that would be VEC_PERM_EXPR <x, init, { 4, 0, 1, 2 }>,
VEC_PERM_EXPR <x2, init, { 4, 5, 0, 1 }> and
VEC_PERM_EXPR <x3, init, { 3, 3, 3, 3 }> etc.
Unfortunately, SSE2 can't do the VEC_PERM_EXPR <x, init, { 4, 0, 1, 2 }>
permutation (the other two it can do).  Well, to be precise, it can do it
using the whole vector left shift that has been removed as unused, provided
that init is initializer_zerop (the shift fills the vacated lanes with zeros).
init usually is all zeros: that is the neutral element of additive
reductions and of a couple of others too.  In the unlikely case that some
other reduction is used with scan (multiplication, minimum, maximum, bitwise
and), we can use a VEC_COND_EXPR with a constant first argument, i.e. a
blend or and/or, to put the neutral element back into those lanes.

So, this patch reintroduces vec_shl_optab (most backends actually have those
patterns already) and handles its expansion and generic vector lowering
similarly to vec_shr_optab, i.e. as a VEC_PERM_EXPR where the first
operand is initializer_zerop and the third operand starts with a few numbers
smaller than the number of elements (it doesn't matter which ones, as all
elements of the first operand are the same, zero), followed by nelts,
nelts+1, nelts+2, ...
Unlike vec_shr_optab, which has zero as the second operand, this one has it
as the first operand, because VEC_PERM_EXPR canonicalization wants the
first selector element to be smaller than the number of elements.  And unlike
vec_shr_optab, where we also have a fallback in have_whole_vector_shift
using normal permutations, this one doesn't need one; that "fallback" is
tried first, before vec_shl_optab.

For the vec_shl_optab checks, the patch tests only vectors with a constant
number of elements; I'm not really sure whether our VECTOR_CST encoding can
express these left shifts at all, nor whether SVE supports them (I see
aarch64 has vec_shl_insert, but that is just a fixed shift by element bits
and it shifts in a scalar rather than zeros).

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2019-06-19  Jakub Jelinek  <jakub@redhat.com>

	* doc/md.texi: Document vec_shl_<mode> pattern.
	* optabs.def (vec_shl_optab): New optab.
	* optabs.c (shift_amt_for_vec_perm_mask): Add shift_optab
	argument, if == vec_shl_optab, check for left whole vector shift
	pattern rather than right shift.
	(expand_vec_perm_const): Add vec_shl_optab support.
	* optabs-query.c (can_vec_perm_var_p): Mention also vec_shl optab
	in the comment.
	* tree-vect-generic.c (lower_vec_perm): Support permutations which
	can be handled by vec_shl_optab.
	* tree-vect-stmts.c (scan_store_can_perm_p): New function.
	(check_scan_store): Use it.
	(vectorizable_scan_store): If target can't do normal permutations,
	try to use whole vector left shifts and if needed a VEC_COND_EXPR
	after it.
	* config/i386/sse.md (vec_shl_<mode>): New expander.

	* gcc.dg/vect/vect-simd-8.c: If main is defined, don't include
	tree-vect.h nor call check_vect.
	* gcc.dg/vect/vect-simd-9.c: Likewise.
	* gcc.dg/vect/vect-simd-10.c: New test.
	* gcc.target/i386/sse2-vect-simd-8.c: New test.
	* gcc.target/i386/sse2-vect-simd-9.c: New test.
	* gcc.target/i386/sse2-vect-simd-10.c: New test.
	* gcc.target/i386/avx2-vect-simd-8.c: New test.
	* gcc.target/i386/avx2-vect-simd-9.c: New test.
	* gcc.target/i386/avx2-vect-simd-10.c: New test.
	* gcc.target/i386/avx512f-vect-simd-8.c: New test.
	* gcc.target/i386/avx512f-vect-simd-9.c: New test.
	* gcc.target/i386/avx512f-vect-simd-10.c: New test.

--- gcc/doc/md.texi.jj	2019-06-13 00:35:43.518942525 +0200
+++ gcc/doc/md.texi	2019-06-18 15:32:38.496629946 +0200
@@ -5454,6 +5454,14 @@ in operand 2.  Store the result in vecto
 0 and 1 have mode @var{m} and operand 2 has the mode appropriate for
 one element of @var{m}.
 
+@cindex @code{vec_shl_@var{m}} instruction pattern
+@item @samp{vec_shl_@var{m}}
+Whole vector left shift in bits, i.e.@: away from element 0.
+Operand 1 is a vector to be shifted.
+Operand 2 is an integer shift amount in bits.
+Operand 0 is where the resulting shifted vector is stored.
+The output and input vectors should have the same modes.
+
 @cindex @code{vec_shr_@var{m}} instruction pattern
 @item @samp{vec_shr_@var{m}}
 Whole vector right shift in bits, i.e.@: towards element 0.
--- gcc/optabs.def.jj	2019-02-11 11:38:08.263617017 +0100
+++ gcc/optabs.def	2019-06-18 14:56:57.934971410 +0200
@@ -348,6 +348,7 @@ OPTAB_D (vec_packu_float_optab, "vec_pac
 OPTAB_D (vec_perm_optab, "vec_perm$a")
 OPTAB_D (vec_realign_load_optab, "vec_realign_load_$a")
 OPTAB_D (vec_set_optab, "vec_set$a")
+OPTAB_D (vec_shl_optab, "vec_shl_$a")
 OPTAB_D (vec_shr_optab, "vec_shr_$a")
 OPTAB_D (vec_unpack_sfix_trunc_hi_optab, "vec_unpack_sfix_trunc_hi_$a")
 OPTAB_D (vec_unpack_sfix_trunc_lo_optab, "vec_unpack_sfix_trunc_lo_$a")
--- gcc/optabs.c.jj	2019-02-13 13:11:47.927612362 +0100
+++ gcc/optabs.c	2019-06-18 16:45:29.347895585 +0200
@@ -5444,19 +5444,45 @@ vector_compare_rtx (machine_mode cmp_mod
 }
 
 /* Check if vec_perm mask SEL is a constant equivalent to a shift of
-   the first vec_perm operand, assuming the second operand is a constant
-   vector of zeros.  Return the shift distance in bits if so, or NULL_RTX
-   if the vec_perm is not a shift.  MODE is the mode of the value being
-   shifted.  */
+   the first vec_perm operand, assuming the second operand (for left shift
+   first operand) is a constant vector of zeros.  Return the shift distance
+   in bits if so, or NULL_RTX if the vec_perm is not a shift.  MODE is the
+   mode of the value being shifted.  SHIFT_OPTAB is vec_shr_optab for right
+   shift or vec_shl_optab for left shift.  */
 static rtx
-shift_amt_for_vec_perm_mask (machine_mode mode, const vec_perm_indices &sel)
+shift_amt_for_vec_perm_mask (machine_mode mode, const vec_perm_indices &sel,
+			     optab shift_optab)
 {
   unsigned int bitsize = GET_MODE_UNIT_BITSIZE (mode);
   poly_int64 first = sel[0];
   if (maybe_ge (sel[0], GET_MODE_NUNITS (mode)))
     return NULL_RTX;
 
-  if (!sel.series_p (0, 1, first, 1))
+  if (shift_optab == vec_shl_optab)
+    {
+      unsigned int nelt;
+      if (!GET_MODE_NUNITS (mode).is_constant (&nelt))
+	return NULL_RTX;
+      unsigned firstidx = 0;
+      for (unsigned int i = 0; i < nelt; i++)
+	{
+	  if (known_eq (sel[i], nelt))
+	    {
+	      if (i == 0 || firstidx)
+		return NULL_RTX;
+	      firstidx = i;
+	    }
+	  else if (firstidx
+		   ? maybe_ne (sel[i], nelt + i - firstidx)
+		   : maybe_ge (sel[i], nelt))
+	    return NULL_RTX;
+	}
+
+      if (firstidx == 0)
+	return NULL_RTX;
+      first = firstidx;
+    }
+  else if (!sel.series_p (0, 1, first, 1))
     {
       unsigned int nelt;
       if (!GET_MODE_NUNITS (mode).is_constant (&nelt))
@@ -5544,25 +5570,37 @@ expand_vec_perm_const (machine_mode mode
      target instruction.  */
   vec_perm_indices indices (sel, 2, GET_MODE_NUNITS (mode));
 
-  /* See if this can be handled with a vec_shr.  We only do this if the
-     second vector is all zeroes.  */
-  insn_code shift_code = optab_handler (vec_shr_optab, mode);
-  insn_code shift_code_qi = ((qimode != VOIDmode && qimode != mode)
-			     ? optab_handler (vec_shr_optab, qimode)
-			     : CODE_FOR_nothing);
-
-  if (v1 == CONST0_RTX (GET_MODE (v1))
-      && (shift_code != CODE_FOR_nothing
-	  || shift_code_qi != CODE_FOR_nothing))
+  /* See if this can be handled with a vec_shr or vec_shl.  We only do this
+     if the second (for vec_shr) or first (for vec_shl) vector is all
+     zeroes.  */
+  insn_code shift_code = CODE_FOR_nothing;
+  insn_code shift_code_qi = CODE_FOR_nothing;
+  optab shift_optab = unknown_optab;
+  rtx v2 = v0;
+  if (v1 == CONST0_RTX (GET_MODE (v1)))
+    shift_optab = vec_shr_optab;
+  else if (v0 == CONST0_RTX (GET_MODE (v0)))
+    {
+      shift_optab = vec_shl_optab;
+      v2 = v1;
+    }
+  if (shift_optab != unknown_optab)
+    {
+      shift_code = optab_handler (shift_optab, mode);
+      shift_code_qi = ((qimode != VOIDmode && qimode != mode)
+		       ? optab_handler (shift_optab, qimode)
+		       : CODE_FOR_nothing);
+    }
+  if (shift_code != CODE_FOR_nothing || shift_code_qi != CODE_FOR_nothing)
     {
-      rtx shift_amt = shift_amt_for_vec_perm_mask (mode, indices);
+      rtx shift_amt = shift_amt_for_vec_perm_mask (mode, indices, shift_optab);
       if (shift_amt)
 	{
 	  struct expand_operand ops[3];
 	  if (shift_code != CODE_FOR_nothing)
 	    {
 	      create_output_operand (&ops[0], target, mode);
-	      create_input_operand (&ops[1], v0, mode);
+	      create_input_operand (&ops[1], v2, mode);
 	      create_convert_operand_from_type (&ops[2], shift_amt, sizetype);
 	      if (maybe_expand_insn (shift_code, 3, ops))
 		return ops[0].value;
@@ -5571,7 +5609,7 @@ expand_vec_perm_const (machine_mode mode
 	    {
 	      rtx tmp = gen_reg_rtx (qimode);
 	      create_output_operand (&ops[0], tmp, qimode);
-	      create_input_operand (&ops[1], gen_lowpart (qimode, v0), qimode);
+	      create_input_operand (&ops[1], gen_lowpart (qimode, v2), qimode);
 	      create_convert_operand_from_type (&ops[2], shift_amt, sizetype);
 	      if (maybe_expand_insn (shift_code_qi, 3, ops))
 		return gen_lowpart (mode, ops[0].value);
--- gcc/optabs-query.c.jj	2019-05-20 11:40:16.691121967 +0200
+++ gcc/optabs-query.c	2019-06-18 15:26:53.028980804 +0200
@@ -415,8 +415,9 @@ can_vec_perm_var_p (machine_mode mode)
    permute (if the target supports that).
 
    Note that additional permutations representing whole-vector shifts may
-   also be handled via the vec_shr optab, but only where the second input
-   vector is entirely constant zeroes; this case is not dealt with here.  */
+   also be handled via the vec_shr or vec_shl optab, but only where the
+   second input vector is entirely constant zeroes; this case is not dealt
+   with here.  */
 
 bool
 can_vec_perm_const_p (machine_mode mode, const vec_perm_indices &sel,
--- gcc/tree-vect-generic.c.jj	2019-01-07 09:47:32.988518893 +0100
+++ gcc/tree-vect-generic.c	2019-06-18 16:35:29.033319526 +0200
@@ -1367,6 +1367,32 @@ lower_vec_perm (gimple_stmt_iterator *gs
 	      return;
 	    }
 	}
+      /* And similarly vec_shl pattern.  */
+      if (optab_handler (vec_shl_optab, TYPE_MODE (vect_type))
+	  != CODE_FOR_nothing
+	  && TREE_CODE (vec0) == VECTOR_CST
+	  && initializer_zerop (vec0))
+	{
+	  unsigned int first = 0;
+	  for (i = 0; i < elements; ++i)
+	    if (known_eq (poly_uint64 (indices[i]), elements))
+	      {
+		if (i == 0 || first)
+		  break;
+		first = i;
+	      }
+	    else if (first
+		     ? maybe_ne (poly_uint64 (indices[i]),
+					      elements + i - first)
+		     : maybe_ge (poly_uint64 (indices[i]), elements))
+	      break;
+	  if (i == elements)
+	    {
+	      gimple_assign_set_rhs3 (stmt, mask);
+	      update_stmt (stmt);
+	      return;
+	    }
+	}
     }
   else if (can_vec_perm_var_p (TYPE_MODE (vect_type)))
     return;
--- gcc/tree-vect-stmts.c.jj	2019-06-17 23:18:53.620850072 +0200
+++ gcc/tree-vect-stmts.c	2019-06-18 17:43:27.484350807 +0200
@@ -6356,6 +6356,71 @@ scan_operand_equal_p (tree ref1, tree re
 
 /* Function check_scan_store.
 
+   Verify if we can perform the needed permutations or whole vector shifts.
+   Return -1 on failure, otherwise exact log2 of vectype's nunits.  */
+
+static int
+scan_store_can_perm_p (tree vectype, tree init, int *use_whole_vector_p = NULL)
+{
+  enum machine_mode vec_mode = TYPE_MODE (vectype);
+  unsigned HOST_WIDE_INT nunits;
+  if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits))
+    return -1;
+  int units_log2 = exact_log2 (nunits);
+  if (units_log2 <= 0)
+    return -1;
+
+  int i;
+  for (i = 0; i <= units_log2; ++i)
+    {
+      unsigned HOST_WIDE_INT j, k;
+      vec_perm_builder sel (nunits, nunits, 1);
+      sel.quick_grow (nunits);
+      if (i == 0)
+	{
+	  for (j = 0; j < nunits; ++j)
+	    sel[j] = nunits - 1;
+	}
+      else
+	{
+	  for (j = 0; j < (HOST_WIDE_INT_1U << (i - 1)); ++j)
+	    sel[j] = j;
+	  for (k = 0; j < nunits; ++j, ++k)
+	    sel[j] = nunits + k;
+	}
+      vec_perm_indices indices (sel, i == 0 ? 1 : 2, nunits);
+      if (!can_vec_perm_const_p (vec_mode, indices))
+	break;
+    }
+
+  if (i == 0)
+    return -1;
+
+  if (i <= units_log2)
+    {
+      if (optab_handler (vec_shl_optab, vec_mode) == CODE_FOR_nothing)
+	return -1;
+      int kind = 1;
+      /* Whole vector shifts shift in zeros, so if init is all zero constant,
+	 there is no need to do anything further.  */
+      if ((TREE_CODE (init) != INTEGER_CST
+	   && TREE_CODE (init) != REAL_CST)
+	  || !initializer_zerop (init))
+	{
+	  tree masktype = build_same_sized_truth_vector_type (vectype);
+	  if (!expand_vec_cond_expr_p (vectype, masktype, VECTOR_CST))
+	    return -1;
+	  kind = 2;
+	}
+      if (use_whole_vector_p)
+	*use_whole_vector_p = kind;
+    }
+  return units_log2;
+}
+
+
+/* Function check_scan_store.
+
    Check magic stores for #pragma omp scan {in,ex}clusive reductions.  */
 
 static bool
@@ -6596,34 +6661,9 @@ check_scan_store (stmt_vec_info stmt_inf
   if (!optab || optab_handler (optab, vec_mode) == CODE_FOR_nothing)
     goto fail;
 
-  unsigned HOST_WIDE_INT nunits;
-  if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits))
+  int units_log2 = scan_store_can_perm_p (vectype, *init);
+  if (units_log2 == -1)
     goto fail;
-  int units_log2 = exact_log2 (nunits);
-  if (units_log2 <= 0)
-    goto fail;
-
-  for (int i = 0; i <= units_log2; ++i)
-    {
-      unsigned HOST_WIDE_INT j, k;
-      vec_perm_builder sel (nunits, nunits, 1);
-      sel.quick_grow (nunits);
-      if (i == units_log2)
-	{
-	  for (j = 0; j < nunits; ++j)
-	    sel[j] = nunits - 1;
-	}
-      else
-	{
-	  for (j = 0; j < (HOST_WIDE_INT_1U << i); ++j)
-	    sel[j] = nunits + j;
-	  for (k = 0; j < nunits; ++j, ++k)
-	    sel[j] = k;
-	}
-      vec_perm_indices indices (sel, i == units_log2 ? 1 : 2, nunits);
-      if (!can_vec_perm_const_p (vec_mode, indices))
-	goto fail;
-    }
 
   return true;
 }
@@ -6686,7 +6726,8 @@ vectorizable_scan_store (stmt_vec_info s
   unsigned HOST_WIDE_INT nunits;
   if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits))
     gcc_unreachable ();
-  int units_log2 = exact_log2 (nunits);
+  int use_whole_vector_p = 0;
+  int units_log2 = scan_store_can_perm_p (vectype, *init, &use_whole_vector_p);
   gcc_assert (units_log2 > 0);
   auto_vec<tree, 16> perms;
   perms.quick_grow (units_log2 + 1);
@@ -6696,21 +6737,25 @@ vectorizable_scan_store (stmt_vec_info s
       vec_perm_builder sel (nunits, nunits, 1);
       sel.quick_grow (nunits);
       if (i == units_log2)
-	{
-	  for (j = 0; j < nunits; ++j)
-	    sel[j] = nunits - 1;
-	}
-      else
-	{
-	  for (j = 0; j < (HOST_WIDE_INT_1U << i); ++j)
-	    sel[j] = nunits + j;
-	  for (k = 0; j < nunits; ++j, ++k)
-	    sel[j] = k;
-	}
+	for (j = 0; j < nunits; ++j)
+	  sel[j] = nunits - 1;
+	else
+	  {
+	    for (j = 0; j < (HOST_WIDE_INT_1U << i); ++j)
+	      sel[j] = j;
+	    for (k = 0; j < nunits; ++j, ++k)
+	      sel[j] = nunits + k;
+	  }
       vec_perm_indices indices (sel, i == units_log2 ? 1 : 2, nunits);
-      perms[i] = vect_gen_perm_mask_checked (vectype, indices);
+      if (use_whole_vector_p && i < units_log2)
+	perms[i] = vect_gen_perm_mask_any (vectype, indices);
+      else
+	perms[i] = vect_gen_perm_mask_checked (vectype, indices);
     }
 
+  tree zero_vec = use_whole_vector_p ? build_zero_cst (vectype) : NULL_TREE;
+  tree masktype = (use_whole_vector_p == 2
+		   ? build_same_sized_truth_vector_type (vectype) : NULL_TREE);
   stmt_vec_info prev_stmt_info = NULL;
   tree vec_oprnd1 = NULL_TREE;
   tree vec_oprnd2 = NULL_TREE;
@@ -6742,8 +6787,9 @@ vectorizable_scan_store (stmt_vec_info s
       for (int i = 0; i < units_log2; ++i)
 	{
 	  tree new_temp = make_ssa_name (vectype);
-	  gimple *g = gimple_build_assign (new_temp, VEC_PERM_EXPR, v,
-					   vec_oprnd1, perms[i]);
+	  gimple *g = gimple_build_assign (new_temp, VEC_PERM_EXPR,
+					   zero_vec ? zero_vec : vec_oprnd1, v,
+					   perms[i]);
 	  new_stmt_info = vect_finish_stmt_generation (stmt_info, g, gsi);
 	  if (prev_stmt_info == NULL)
 	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt_info;
@@ -6751,6 +6797,25 @@ vectorizable_scan_store (stmt_vec_info s
 	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt_info;
 	  prev_stmt_info = new_stmt_info;
 
+	  if (use_whole_vector_p == 2)
+	    {
+	      /* Whole vector shift shifted in zero bits, but if *init
+		 is not initializer_zerop, we need to replace those elements
+		 with elements from vec_oprnd1.  */
+	      tree_vector_builder vb (masktype, nunits, 1);
+	      for (unsigned HOST_WIDE_INT k = 0; k < nunits; ++k)
+		vb.quick_push (k < (HOST_WIDE_INT_1U << i)
+			       ? boolean_false_node : boolean_true_node);
+
+	      tree new_temp2 = make_ssa_name (vectype);
+	      g = gimple_build_assign (new_temp2, VEC_COND_EXPR, vb.build (),
+				       new_temp, vec_oprnd1);
+	      new_stmt_info = vect_finish_stmt_generation (stmt_info, g, gsi);
+	      STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt_info;
+	      prev_stmt_info = new_stmt_info;
+	      new_temp = new_temp2;
+	    }
+
 	  tree new_temp2 = make_ssa_name (vectype);
 	  g = gimple_build_assign (new_temp2, code, v, new_temp);
 	  new_stmt_info = vect_finish_stmt_generation (stmt_info, g, gsi);
--- gcc/config/i386/sse.md.jj	2019-06-17 23:18:26.821267440 +0200
+++ gcc/config/i386/sse.md	2019-06-18 15:37:28.342043528 +0200
@@ -11758,6 +11758,19 @@ (define_insn "<shift_insn><mode>3<mask_n
    (set_attr "mode" "<sseinsnmode>")])
 
 
+(define_expand "vec_shl_<mode>"
+  [(set (match_dup 3)
+	(ashift:V1TI
+	 (match_operand:VI_128 1 "register_operand")
+	 (match_operand:SI 2 "const_0_to_255_mul_8_operand")))
+   (set (match_operand:VI_128 0 "register_operand") (match_dup 4))]
+  "TARGET_SSE2"
+{
+  operands[1] = gen_lowpart (V1TImode, operands[1]);
+  operands[3] = gen_reg_rtx (V1TImode);
+  operands[4] = gen_lowpart (<MODE>mode, operands[3]);
+})
+
 (define_expand "vec_shr_<mode>"
   [(set (match_dup 3)
 	(lshiftrt:V1TI
--- gcc/testsuite/gcc.dg/vect/vect-simd-8.c.jj	2019-06-17 23:18:53.621850057 +0200
+++ gcc/testsuite/gcc.dg/vect/vect-simd-8.c	2019-06-18 18:02:09.428798006 +0200
@@ -3,7 +3,9 @@
 /* { dg-additional-options "-mavx" { target avx_runtime } } */
 /* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" { target i?86-*-* x86_64-*-* } } } */
 
+#ifndef main
 #include "tree-vect.h"
+#endif
 
 int r, a[1024], b[1024];
 
@@ -63,7 +65,9 @@ int
 main ()
 {
   int s = 0;
+#ifndef main
   check_vect ();
+#endif
   for (int i = 0; i < 1024; ++i)
     {
       a[i] = i;
--- gcc/testsuite/gcc.dg/vect/vect-simd-9.c.jj	2019-06-17 23:18:53.621850057 +0200
+++ gcc/testsuite/gcc.dg/vect/vect-simd-9.c	2019-06-18 18:02:34.649406773 +0200
@@ -3,7 +3,9 @@
 /* { dg-additional-options "-mavx" { target avx_runtime } } */
 /* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" { target i?86-*-* x86_64-*-* } } } */
 
+#ifndef main
 #include "tree-vect.h"
+#endif
 
 int r, a[1024], b[1024];
 
@@ -65,7 +67,9 @@ int
 main ()
 {
   int s = 0;
+#ifndef main
   check_vect ();
+#endif
   for (int i = 0; i < 1024; ++i)
     {
       a[i] = i;
--- gcc/testsuite/gcc.dg/vect/vect-simd-10.c.jj	2019-06-18 18:37:30.742838613 +0200
+++ gcc/testsuite/gcc.dg/vect/vect-simd-10.c	2019-06-18 19:44:20.614082076 +0200
@@ -0,0 +1,96 @@
+/* { dg-require-effective-target size32plus } */
+/* { dg-additional-options "-fopenmp-simd" } */
+/* { dg-additional-options "-mavx" { target avx_runtime } } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" { target i?86-*-* x86_64-*-* } } } */
+
+#ifndef main
+#include "tree-vect.h"
+#endif
+
+float r = 1.0f, a[1024], b[1024];
+
+__attribute__((noipa)) void
+foo (float *a, float *b)
+{
+  #pragma omp simd reduction (inscan, *:r)
+  for (int i = 0; i < 1024; i++)
+    {
+      r *= a[i];
+      #pragma omp scan inclusive(r)
+      b[i] = r;
+    }
+}
+
+__attribute__((noipa)) float
+bar (void)
+{
+  float s = -__builtin_inff ();
+  #pragma omp simd reduction (inscan, max:s)
+  for (int i = 0; i < 1024; i++)
+    {
+      s = s > a[i] ? s : a[i];
+      #pragma omp scan inclusive(s)
+      b[i] = s;
+    }
+  return s;
+}
+
+int
+main ()
+{
+  float s = 1.0f;
+#ifndef main
+  check_vect ();
+#endif
+  for (int i = 0; i < 1024; ++i)
+    {
+      if (i < 80)
+	a[i] = (i & 1) ? 0.25f : 0.5f;
+      else if (i < 200)
+	a[i] = (i % 3) == 0 ? 2.0f : (i % 3) == 1 ? 4.0f : 1.0f;
+      else if (i < 280)
+	a[i] = (i & 1) ? 0.25f : 0.5f;
+      else if (i < 380)
+	a[i] = (i % 3) == 0 ? 2.0f : (i % 3) == 1 ? 4.0f : 1.0f;
+      else
+	switch (i % 6)
+	  {
+	  case 0: a[i] = 0.25f; break;
+	  case 1: a[i] = 2.0f; break;
+	  case 2: a[i] = -1.0f; break;
+	  case 3: a[i] = -4.0f; break;
+	  case 4: a[i] = 0.5f; break;
+	  case 5: a[i] = 1.0f; break;
+	  default: a[i] = 0.0f; break;
+	  }
+      b[i] = -19.0f;
+      asm ("" : "+g" (i));
+    }
+  foo (a, b);
+  if (r * 16384.0f != 0.125f)
+    abort ();
+  float m = -175.25f;
+  for (int i = 0; i < 1024; ++i)
+    {
+      s *= a[i];
+      if (b[i] != s)
+	abort ();
+      else
+	{
+	  a[i] = m - ((i % 3) == 1 ? 2.0f : (i % 3) == 2 ? 4.0f : 0.0f);
+	  b[i] = -231.75f;
+	  m += 0.75f;
+	}
+    }
+  if (bar () != 592.0f)
+    abort ();
+  s = -__builtin_inff ();
+  for (int i = 0; i < 1024; ++i)
+    {
+      if (s < a[i])
+	s = a[i];
+      if (b[i] != s)
+	abort ();
+    }
+  return 0;
+}
--- gcc/testsuite/gcc.target/i386/sse2-vect-simd-8.c.jj	2019-06-18 17:59:27.182314827 +0200
+++ gcc/testsuite/gcc.target/i386/sse2-vect-simd-8.c	2019-06-18 18:19:48.417341734 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -msse2 -mno-sse3 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target sse2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "sse2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-8.c"
+
+static void
+sse2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/sse2-vect-simd-9.c.jj	2019-06-18 18:03:30.174545446 +0200
+++ gcc/testsuite/gcc.target/i386/sse2-vect-simd-9.c	2019-06-18 18:20:05.770072628 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -msse2 -mno-sse3 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target sse2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "sse2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-9.c"
+
+static void
+sse2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/sse2-vect-simd-10.c.jj	2019-06-18 19:46:09.015410603 +0200
+++ gcc/testsuite/gcc.target/i386/sse2-vect-simd-10.c	2019-06-18 19:50:31.621361409 +0200
@@ -0,0 +1,15 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -msse2 -mno-sse3 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target sse2 } */
+
+#include "sse2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-10.c"
+
+static void
+sse2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-vect-simd-8.c.jj	2019-06-18 17:59:27.182314827 +0200
+++ gcc/testsuite/gcc.target/i386/avx2-vect-simd-8.c	2019-06-18 18:19:40.310467451 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx2 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-8.c"
+
+static void
+avx2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-vect-simd-9.c.jj	2019-06-18 18:03:30.174545446 +0200
+++ gcc/testsuite/gcc.target/i386/avx2-vect-simd-9.c	2019-06-18 18:19:56.479216712 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx2 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-9.c"
+
+static void
+avx2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-vect-simd-10.c.jj	2019-06-18 19:50:47.692113611 +0200
+++ gcc/testsuite/gcc.target/i386/avx2-vect-simd-10.c	2019-06-18 19:50:56.180982721 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx2 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-10.c"
+
+static void
+avx2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx512f-vect-simd-8.c.jj	2019-06-18 17:59:27.182314827 +0200
+++ gcc/testsuite/gcc.target/i386/avx512f-vect-simd-8.c	2019-06-18 18:19:44.364404586 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx512f -mprefer-vector-width=512 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx512f } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx512f-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-8.c"
+
+static void
+avx512f_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx512f-vect-simd-9.c.jj	2019-06-18 18:03:30.174545446 +0200
+++ gcc/testsuite/gcc.target/i386/avx512f-vect-simd-9.c	2019-06-18 18:20:00.884148400 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx512f -mprefer-vector-width=512 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx512f } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx512f-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-9.c"
+
+static void
+avx512f_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx512f-vect-simd-10.c.jj	2019-06-18 19:51:12.309734025 +0200
+++ gcc/testsuite/gcc.target/i386/avx512f-vect-simd-10.c	2019-06-18 19:51:18.285641883 +0200
@@ -0,0 +1,16 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx512f -mprefer-vector-width=512 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx512f } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx512f-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-10.c"
+
+static void
+avx512f_test (void)
+{
+  do_main ();
+}

	Jakub
