Bug 91940 - __builtin_bswap16 loop optimization
Summary: __builtin_bswap16 loop optimization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 9.2.1
Importance: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2019-09-30 16:22 UTC by Matwey V. Kornilov
Modified: 2024-08-08 08:12 UTC
CC: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-09-30 00:00:00


Attachments
code snippet (545 bytes, text/x-csrc)
2019-09-30 16:22 UTC, Matwey V. Kornilov
gcc10-pr91940.patch (2.46 KB, patch)
2019-10-01 10:52 UTC, Jakub Jelinek

Description Matwey V. Kornilov 2019-09-30 16:22:11 UTC
Created attachment 46984 [details]
code snippet

Hello,

I am using "gcc (SUSE Linux) 9.2.1 20190903 [gcc-9-branch revision 275330]" and I see the following performance issue with __builtin_bswap16() on the x86_64 platform.

Attached is a code sample implementing byte swapping for arrays of 2-byte words.
I see that the following code (when compiled with -O3)

inline void swab_bi(const void* from, void* to, std::size_t size) {
        const auto begin = reinterpret_cast<const std::uint16_t*>(from);
        const auto end = reinterpret_cast<const std::uint16_t*>(reinterpret_cast<const std::uint8_t*>(from) + size);
        auto out = reinterpret_cast<std::uint16_t*>(to);

        for(auto it = begin; it != end; ++it) {
                *(out++) = __builtin_bswap16(*it);
        }
}

takes 0.023 sec. on average to execute on my hardware (Intel Core-i5).
While the following code

inline void swab(const void* from, void* to, std::size_t size) {
        const auto begin = reinterpret_cast<const std::uint16_t*>(from);
        const auto end = reinterpret_cast<const std::uint16_t*>(reinterpret_cast<const std::uint8_t*>(from) + size);
        auto out = reinterpret_cast<std::uint16_t*>(to);

        for(auto it = begin; it != end; ++it) {
                *(out++) = ((*it & 0xFF) << 8) | ((*it & 0xFF00) >> 8);
        }
}

is *more* efficient. It takes only 0.011 sec.

When I dump the assembler output for both functions, I see that packed instructions are used for the latter case:

        movdqu  0(%rbp,%rax), %xmm0
        movdqa  %xmm0, %xmm1
        psllw   $8, %xmm0
        psrlw   $8, %xmm1
        por     %xmm1, %xmm0
        movups  %xmm0, (%r12,%rax)
        addq    $16, %rax

while rolw is used for the former case:

        movzwl  0(%rbp,%rax), %edx
        rolw    $8, %dx
        movw    %dx, (%r12,%rax)
        addq    $2, %rax
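
The attached snippet itself is not reproduced here; a minimal driver along the following lines is enough to compare the two functions (buffer size and repetition count are assumptions, not the attachment's actual values):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
        // Assumed buffer size and repetition count; the attachment may use different values.
        std::vector<std::uint16_t> in(1 << 20), out(1 << 20);
        for (std::size_t i = 0; i < in.size(); ++i)
                in[i] = static_cast<std::uint16_t>(i);

        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100; ++rep)
                swab_bi(in.data(), out.data(), in.size() * sizeof(std::uint16_t));  // swab_bi from the snippet above
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%g sec\n", std::chrono::duration<double>(t1 - t0).count());
}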
Comment 1 Jakub Jelinek 2019-09-30 17:17:09 UTC
The loop with the rotate is vectorized, while the one with __builtin_bswap16 is not.  For rotates, if the ISA doesn't have vector support for rotates, we use vect_recog_rotate_pattern to undo the matching of a hand-written rotate into a rotate, by breaking it up again into shifts + blend.
For __builtin_bswap* we have vectorizable_bswap support, but it only works if there is no type promotion in the call argument; in that case it is handled not with rotates etc., but as a permutation of the vector elements (if supported).
Unfortunately, for __builtin_bswap16 the argument is promoted.
So, the options are to look through the argument promotion for vectorizable_bswap, or to pattern-match in tree-vect-patterns.c the __builtin_bswap16 on a promoted integer into a call with a non-promoted argument, and optionally check whether the permutation would be supported, perhaps falling back to the rotate that vect_recog_rotate_pattern can produce.
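
For reference, the equivalence the rotate path relies on is that a 16-bit byte swap is the same as a rotate by 8.  A small self-check of that equivalence (not part of any proposed patch):

#include <cassert>
#include <cstdint>

int main() {
        for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
                std::uint16_t x = static_cast<std::uint16_t>(v);
                // Rotate the 16-bit value by 8 bits, i.e. swap its two bytes.
                std::uint16_t rot = static_cast<std::uint16_t>((x << 8) | (x >> 8));
                assert(__builtin_bswap16(x) == rot);
        }
}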
Comment 2 Richard Biener 2019-10-01 07:13:20 UTC
Another option is to elide the promotion?

int foo (unsigned short x)
{
  return __builtin_bswap16 (x);
}

becomes

  return (int) __builtin_bswap16 ((int) x);

but BUILT_IN_BSWAP16 is BT_FN_UINT16_UINT16, not sure why there's
a truncation missing?!  Is that the frontend "promote according to ABI"
thing?  Yeah, I guess so :/
Comment 3 Jakub Jelinek 2019-10-01 09:40:10 UTC
Untested WIP patch that does both.
If it finds vectorizable_bswap will work (the corresponding permutation is supported), it will just undo the promotion; otherwise, if the target supports vector rotates, it will use a vector rotate; and if the target supports vector shifts, it will use those + ior.  For the first case, e.g. -mavx2 now vectorizes using a permutation what wasn't vectorized before; for the second case I don't have any particular target in mind ATM (e.g. -mxop supports rotates, but it will also support the permutations); and the last case is e.g. SSE2.
--- gcc/tree-vect-patterns.c.jj	2019-09-20 12:25:48.186387075 +0200
+++ gcc/tree-vect-patterns.c	2019-10-01 11:29:18.229215895 +0200
@@ -46,6 +46,8 @@ along with GCC; see the file COPYING3.
 #include "cgraph.h"
 #include "omp-simd-clone.h"
 #include "predict.h"
+#include "tree-vector-builder.h"
+#include "vec-perm-indices.h"
 
 /* Return true if we have a useful VR_RANGE range for VAR, storing it
    in *MIN_VALUE and *MAX_VALUE if so.  Note the range in the dump files.  */
@@ -2168,24 +2170,107 @@ vect_recog_rotate_pattern (stmt_vec_info
   enum vect_def_type dt;
   optab optab1, optab2;
   edge ext_def = NULL;
+  bool bswap16_p = false;
 
-  if (!is_gimple_assign (last_stmt))
-    return NULL;
-
-  rhs_code = gimple_assign_rhs_code (last_stmt);
-  switch (rhs_code)
+  if (is_gimple_assign (last_stmt))
     {
-    case LROTATE_EXPR:
-    case RROTATE_EXPR:
-      break;
-    default:
-      return NULL;
+      rhs_code = gimple_assign_rhs_code (last_stmt);
+      switch (rhs_code)
+	{
+	case LROTATE_EXPR:
+	case RROTATE_EXPR:
+	  break;
+	default:
+	  return NULL;
+	}
+
+      lhs = gimple_assign_lhs (last_stmt);
+      oprnd0 = gimple_assign_rhs1 (last_stmt);
+      type = TREE_TYPE (oprnd0);
+      oprnd1 = gimple_assign_rhs2 (last_stmt);
+    }
+  else if (gimple_call_builtin_p (last_stmt, BUILT_IN_BSWAP16))
+    {
+      /* __builtin_bswap16 (x) is another form of x r>> 8.
+	 The vectorizer has bswap support, but only if the argument isn't
+	 promoted.  */
+      lhs = gimple_call_lhs (last_stmt);
+      oprnd0 = gimple_call_arg (last_stmt, 0);
+      type = TREE_TYPE (oprnd0);
+      if (TYPE_PRECISION (TREE_TYPE (lhs)) != 16
+	  || TYPE_PRECISION (type) <= 16
+	  || TREE_CODE (oprnd0) != SSA_NAME
+	  || BITS_PER_UNIT != 8
+	  || !TYPE_UNSIGNED (TREE_TYPE (lhs)))
+	return NULL;
+
+      stmt_vec_info def_stmt_info;
+      if (!vect_is_simple_use (oprnd0, vinfo, &dt, &def_stmt_info, &def_stmt))
+	return NULL;
+
+      if (dt != vect_internal_def)
+	return NULL;
+
+      if (gimple_assign_cast_p (def_stmt))
+	{
+	  def = gimple_assign_rhs1 (def_stmt);
+	  if (INTEGRAL_TYPE_P (TREE_TYPE (def))
+	      && TYPE_PRECISION (TREE_TYPE (def)) == 16)
+	    oprnd0 = def;
+	}
+
+      type = TREE_TYPE (lhs);
+      vectype = get_vectype_for_scalar_type (type);
+      if (vectype == NULL_TREE)
+	return NULL;
+
+      if (tree char_vectype = get_same_sized_vectype (char_type_node, vectype))
+	{
+	  /* The encoding uses one stepped pattern for each byte in the
+	     16-bit word.  */
+	  vec_perm_builder elts (TYPE_VECTOR_SUBPARTS (char_vectype), 2, 3);
+	  for (unsigned i = 0; i < 3; ++i)
+	    for (unsigned j = 0; j < 2; ++j)
+	      elts.quick_push ((i + 1) * 2 - j - 1);
+
+	  vec_perm_indices indices (elts, 1,
+				    TYPE_VECTOR_SUBPARTS (char_vectype));
+	  if (can_vec_perm_const_p (TYPE_MODE (char_vectype), indices))
+	    {
+	      /* vectorizable_bswap can handle the __builtin_bswap16 if we
+		 undo the argument promotion.  */
+	      if (!useless_type_conversion_p (type, TREE_TYPE (oprnd0)))
+		{
+		  def = vect_recog_temp_ssa_var (type, NULL);
+		  def_stmt = gimple_build_assign (def, NOP_EXPR, oprnd0);
+		  append_pattern_def_seq (stmt_vinfo, def_stmt);
+		  oprnd0 = def;
+		}
+
+	      /* Pattern detected.  */
+	      vect_pattern_detected ("vect_recog_rotate_pattern", last_stmt);
+
+	      *type_out = vectype;
+
+	      /* Pattern supported.  Create a stmt to be used to replace the
+		 pattern, with the unpromoted argument.  */
+	      var = vect_recog_temp_ssa_var (type, NULL);
+	      pattern_stmt = gimple_build_call (gimple_call_fndecl (last_stmt),
+						1, oprnd0);
+	      gimple_call_set_lhs (pattern_stmt, var);
+	      gimple_call_set_fntype (as_a <gcall *> (pattern_stmt),
+				      gimple_call_fntype (last_stmt));
+	      return pattern_stmt;
+	    }
+	}
+
+      oprnd1 = build_int_cst (integer_type_node, 8);
+      rhs_code = LROTATE_EXPR;
+      bswap16_p = true;
     }
+  else
+    return NULL;
 
-  lhs = gimple_assign_lhs (last_stmt);
-  oprnd0 = gimple_assign_rhs1 (last_stmt);
-  type = TREE_TYPE (oprnd0);
-  oprnd1 = gimple_assign_rhs2 (last_stmt);
   if (TREE_CODE (oprnd0) != SSA_NAME
       || TYPE_PRECISION (TREE_TYPE (lhs)) != TYPE_PRECISION (type)
       || !INTEGRAL_TYPE_P (type)
@@ -2210,14 +2295,39 @@ vect_recog_rotate_pattern (stmt_vec_info
   optab1 = optab_for_tree_code (rhs_code, vectype, optab_vector);
   if (optab1
       && optab_handler (optab1, TYPE_MODE (vectype)) != CODE_FOR_nothing)
-    return NULL;
+    {
+     use_rotate:
+      if (bswap16_p)
+	{
+	  if (!useless_type_conversion_p (type, TREE_TYPE (oprnd0)))
+	    {
+	      def = vect_recog_temp_ssa_var (type, NULL);
+	      def_stmt = gimple_build_assign (def, NOP_EXPR, oprnd0);
+	      append_pattern_def_seq (stmt_vinfo, def_stmt);
+	      oprnd0 = def;
+	    }
+
+	  /* Pattern detected.  */
+	  vect_pattern_detected ("vect_recog_rotate_pattern", last_stmt);
+
+	  *type_out = vectype;
+
+	  /* Pattern supported.  Create a stmt to be used to replace the
+	     pattern.  */
+	  var = vect_recog_temp_ssa_var (type, NULL);
+	  pattern_stmt = gimple_build_assign (var, LROTATE_EXPR, oprnd0,
+					      oprnd1);
+	  return pattern_stmt;
+	}
+      return NULL;
+    }
 
   if (is_a <bb_vec_info> (vinfo) || dt != vect_internal_def)
     {
       optab2 = optab_for_tree_code (rhs_code, vectype, optab_scalar);
       if (optab2
 	  && optab_handler (optab2, TYPE_MODE (vectype)) != CODE_FOR_nothing)
-	return NULL;
+	goto use_rotate;
     }
 
   /* If vector/vector or vector/scalar shifts aren't supported by the target,
@@ -2242,6 +2352,14 @@ vect_recog_rotate_pattern (stmt_vec_info
 
   *type_out = vectype;
 
+  if (bswap16_p && !useless_type_conversion_p (type, TREE_TYPE (oprnd0)))
+    {
+      def = vect_recog_temp_ssa_var (type, NULL);
+      def_stmt = gimple_build_assign (def, NOP_EXPR, oprnd0);
+      append_pattern_def_seq (stmt_vinfo, def_stmt);
+      oprnd0 = def;
+    }
+
   if (dt == vect_external_def
       && TREE_CODE (oprnd1) == SSA_NAME)
     ext_def = vect_get_external_def_edge (vinfo, oprnd1);
Comment 4 Richard Biener 2019-10-01 10:14:33 UTC
Looks good from a quick look.
Comment 5 Jakub Jelinek 2019-10-01 10:52:14 UTC
Created attachment 46985 [details]
gcc10-pr91940.patch

Full untested patch.
Comment 6 Jakub Jelinek 2019-10-02 10:19:21 UTC
Author: jakub
Date: Wed Oct  2 10:18:50 2019
New Revision: 276442

URL: https://gcc.gnu.org/viewcvs?rev=276442&root=gcc&view=rev
Log:
	PR tree-optimization/91940
	* tree-vect-patterns.c: Include tree-vector-builder.h and
	vec-perm-indices.h.
	(vect_recog_rotate_pattern): Also handle __builtin_bswap16, either by
	unpromoting the argument back to uint16_t, or by converting into a
	rotate, or into shifts plus ior.

	* gcc.dg/vect/vect-bswap16.c: Add -msse4 on x86, run on all targets,
	expect vectorized 1 loops message on both vect_bswap and sse4_runtime
	targets.
	* gcc.dg/vect/vect-bswap16a.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/vect-bswap16a.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/vect-bswap16.c
    trunk/gcc/tree-vect-patterns.c
Comment 7 Marc Glisse 2019-10-02 10:21:37 UTC
(In reply to Jakub Jelinek from comment #1)
> The loop with the rotate is vectorized, while the one with __builtin_bswap16
> is not.

It is a bit surprising that we do not canonicalize one to the other somewhere in the middle-end.
Comment 8 Jakub Jelinek 2019-10-21 08:55:25 UTC
Fixed.
Comment 9 Jorn Wolfgang Rennecke 2024-08-08 03:20:09 UTC
	  || TYPE_PRECISION (type) <= 16

Why do you have that test?  I find that, for a 32-bit target without bswap but with V2HI shifts, allowing 16 there is pivotal to enabling vectorization for gcc.dg/vect/vect-bswap16a.c.
Comment 10 Jorn Wolfgang Rennecke 2024-08-08 03:57:48 UTC
Even if you add support for V2HI bswap, it won't help vectorization without support for V4QI vectors and permutations, because vectorizable_bswap won't recognize the bswap capability of the target any other way.
Comment 11 Jakub Jelinek 2024-08-08 07:03:44 UTC
Only a 16-bit byteswap has the r>> 8 canonical form; larger byteswaps don't.
Larger byteswaps certainly aren't rotates, but just permutes.  So, if the vectorizer doesn't try that already, it should try to vectorize bswap32, bswap64 etc. as a permutation.
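
As an illustration only, using GCC's generic vector extensions rather than anything the vectorizer itself emits, a 32-bit byte swap over four lanes can be written as a byte permutation:

#include <cstdint>

typedef std::uint8_t  v16qi __attribute__ ((vector_size (16)));
typedef std::uint32_t v4si  __attribute__ ((vector_size (16)));

/* Byte-swap each 32-bit lane of a 128-bit vector purely as a byte permutation.  */
v4si bswap32_v4si (v4si x)
{
  v16qi bytes = (v16qi) x;
  const v16qi mask = { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 };
  return (v4si) __builtin_shuffle (bytes, mask);
}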
Comment 12 Jorn Wolfgang Rennecke 2024-08-08 08:12:53 UTC
(In reply to Jakub Jelinek from comment #11)

But the condition I quoted rejects the recognition of a bswap16 with a non-promoted argument.
vectorizable_bswap doesn't do anything for processors that don't have byte vectors, even if they have support for shifts, ior and bswap16 for V2HI vectors.
As it is, the test just fails.

If I change the test in vect_recog_rotate_pattern to:

         || TYPE_PRECISION (type) < 16

I get an expansion with shifts and or, which a combiner splitter pattern can
transform into a bswapv2hi2.
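
A hedged sketch of that shifts + or expansion, written with GCC's generic vector extensions (the actual combine splitter is of course target-specific):

#include <cstdint>

typedef std::uint16_t v2hi __attribute__ ((vector_size (4)));

/* Shifts + ior per 16-bit lane; on a target with V2HI shifts a combine
   splitter could turn this back into a single bswapv2hi2.  */
v2hi bswap16_v2hi (v2hi x)
{
  return (x << 8) | (x >> 8);
}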