Created attachment 46984 [details]
code snippet

Hello, I am using "gcc (SUSE Linux) 9.2.1 20190903 [gcc-9-branch revision 275330]" and I see the following performance issue with __builtin_bswap16() on the x86_64 platform.

Attached is a code sample implementing byte swapping for arrays of 2-byte words.  I see that the following code (when compiled with -O3)

inline void swab_bi(const void* from, void* to, std::size_t size)
{
    const auto begin = reinterpret_cast<const std::uint16_t*>(from);
    const auto end = reinterpret_cast<const std::uint16_t*>(
        reinterpret_cast<const std::uint8_t*>(from) + size);
    auto out = reinterpret_cast<std::uint16_t*>(to);

    for(auto it = begin; it != end; ++it)
    {
        *(out++) = __builtin_bswap16(*it);
    }
}

takes 0.023 sec on average to execute on my hardware (Intel Core i5), while the following code

inline void swab(const void* from, void* to, std::size_t size)
{
    const auto begin = reinterpret_cast<const std::uint16_t*>(from);
    const auto end = reinterpret_cast<const std::uint16_t*>(
        reinterpret_cast<const std::uint8_t*>(from) + size);
    auto out = reinterpret_cast<std::uint16_t*>(to);

    for(auto it = begin; it != end; ++it)
    {
        *(out++) = ((*it & 0xFF) << 8) | ((*it & 0xFF00) >> 8);
    }
}

is *more* efficient: it takes only 0.011 sec.

When I dump the assembler output for both functions, I see that packed instructions are used for the latter case:

        movdqu  0(%rbp,%rax), %xmm0
        movdqa  %xmm0, %xmm1
        psllw   $8, %xmm0
        psrlw   $8, %xmm1
        por     %xmm1, %xmm0
        movups  %xmm0, (%r12,%rax)
        addq    $16, %rax

while rolw is used for the former case:

        movzwl  0(%rbp,%rax), %edx
        rolw    $8, %dx
        movw    %dx, (%r12,%rax)
        addq    $2, %rax
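The attached snippet is the actual test case; purely for illustration, a minimal driver along these lines (the buffer size and repetition count are my own assumptions, and it reuses the swab_bi/swab definitions above) shows how such timings might be reproduced:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    constexpr std::size_t size = 1 << 20;          // 1 MiB buffer (512Ki 16-bit words)
    std::vector<std::uint8_t> in(size), out(size);
    for (std::size_t i = 0; i < size; ++i)
        in[i] = static_cast<std::uint8_t>(i);

    const auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)
        swab_bi(in.data(), out.data(), size);      // or swab(...)
    const auto t1 = std::chrono::steady_clock::now();

    std::printf("%f sec per call\n",
                std::chrono::duration<double>(t1 - t0).count() / 100);
    return static_cast<int>(out[0]);               // keep the result live
}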
The loop with the rotate is vectorized, while the one with __builtin_bswap16 is not.

For rotates, if the ISA doesn't have vector support for them, we use vect_recog_rotate_pattern to undo the matching of a hand-written rotate, breaking it up again into shifts + ior.  For __builtin_bswap* we have vectorizable_bswap support, but it only works if there is no type promotion in the call argument; in that case it is not handled using rotates etc., but as a permutation of the vector elements (if supported).  Unfortunately, for __builtin_bswap16 the argument is promoted.

So the options are either to look through the argument promotion in vectorizable_bswap, or, in tree-vect-patterns.c, to pattern match the __builtin_bswap16 on a promoted integer into a call with a non-promoted argument, optionally checking whether the permutation would be supported, and perhaps falling back to the rotate that vect_recog_rotate_pattern can produce.
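For illustration only (a scalar sketch with made-up names, not code from the attached test case): the builtin and the hand-written rotate compute the same value on a 16-bit operand, and the difference the vectorizer sees is just the promoted call argument.

#include <cstdint>

// Hand-written form, i.e. x r>> 8 for a 16-bit value; this is the shape
// vect_recog_rotate_pattern deals with (shifts + ior if the target has no
// vector rotate).
inline std::uint16_t swap_rotate(std::uint16_t x)
{
    return static_cast<std::uint16_t>((x << 8) | (x >> 8));
}

// Builtin form; semantically identical, but the call argument is promoted,
// which is what keeps vectorizable_bswap from matching it.
inline std::uint16_t swap_builtin(std::uint16_t x)
{
    return __builtin_bswap16(x);
}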
Another option is to elide the promotion?  For

int foo (unsigned short x)
{
  return __builtin_bswap16 (x);
}

we get

  return (int) __builtin_bswap16 ((int) x);

but BUILT_IN_BSWAP16 is BT_FN_UINT16_UINT16, so I'm not sure why there's a truncation missing?!  Is that the frontend "promote according to ABI" thing?  Yeah, I guess so :/
Untested WIP patch that does both.  If it finds that vectorizable_bswap will work (the corresponding permutation is supported), it just undoes the promotion; if the target supports vector rotates, it uses a vector rotate; and if the target supports vector shifts, it uses those + ior.  For the first case, e.g. -mavx2 now vectorizes using a permutation what wasn't vectorized before; for the second case I don't have any particular target in mind ATM (e.g. -mxop supports rotates, but it also supports the permutations); and the last case is e.g. SSE2.

--- gcc/tree-vect-patterns.c.jj	2019-09-20 12:25:48.186387075 +0200
+++ gcc/tree-vect-patterns.c	2019-10-01 11:29:18.229215895 +0200
@@ -46,6 +46,8 @@ along with GCC; see the file COPYING3.
 #include "cgraph.h"
 #include "omp-simd-clone.h"
 #include "predict.h"
+#include "tree-vector-builder.h"
+#include "vec-perm-indices.h"
 
 /* Return true if we have a useful VR_RANGE range for VAR, storing it in
    *MIN_VALUE and *MAX_VALUE if so.  Note the range in the dump files.  */
@@ -2168,24 +2170,107 @@ vect_recog_rotate_pattern (stmt_vec_info
   enum vect_def_type dt;
   optab optab1, optab2;
   edge ext_def = NULL;
+  bool bswap16_p = false;
 
-  if (!is_gimple_assign (last_stmt))
-    return NULL;
-
-  rhs_code = gimple_assign_rhs_code (last_stmt);
-  switch (rhs_code)
+  if (is_gimple_assign (last_stmt))
     {
-    case LROTATE_EXPR:
-    case RROTATE_EXPR:
-      break;
-    default:
-      return NULL;
+      rhs_code = gimple_assign_rhs_code (last_stmt);
+      switch (rhs_code)
+        {
+        case LROTATE_EXPR:
+        case RROTATE_EXPR:
+          break;
+        default:
+          return NULL;
+        }
+
+      lhs = gimple_assign_lhs (last_stmt);
+      oprnd0 = gimple_assign_rhs1 (last_stmt);
+      type = TREE_TYPE (oprnd0);
+      oprnd1 = gimple_assign_rhs2 (last_stmt);
+    }
+  else if (gimple_call_builtin_p (last_stmt, BUILT_IN_BSWAP16))
+    {
+      /* __builtin_bswap16 (x) is another form of x r>> 8.
+         The vectorizer has bswap support, but only if the argument isn't
+         promoted.  */
+      lhs = gimple_call_lhs (last_stmt);
+      oprnd0 = gimple_call_arg (last_stmt, 0);
+      type = TREE_TYPE (oprnd0);
+      if (TYPE_PRECISION (TREE_TYPE (lhs)) != 16
+          || TYPE_PRECISION (type) <= 16
+          || TREE_CODE (oprnd0) != SSA_NAME
+          || BITS_PER_UNIT != 8
+          || !TYPE_UNSIGNED (TREE_TYPE (lhs)))
+        return NULL;
+
+      stmt_vec_info def_stmt_info;
+      if (!vect_is_simple_use (oprnd0, vinfo, &dt, &def_stmt_info, &def_stmt))
+        return NULL;
+
+      if (dt != vect_internal_def)
+        return NULL;
+
+      if (gimple_assign_cast_p (def_stmt))
+        {
+          def = gimple_assign_rhs1 (def_stmt);
+          if (INTEGRAL_TYPE_P (TREE_TYPE (def))
+              && TYPE_PRECISION (TREE_TYPE (def)) == 16)
+            oprnd0 = def;
+        }
+
+      type = TREE_TYPE (lhs);
+      vectype = get_vectype_for_scalar_type (type);
+      if (vectype == NULL_TREE)
+        return NULL;
+
+      if (tree char_vectype = get_same_sized_vectype (char_type_node, vectype))
+        {
+          /* The encoding uses one stepped pattern for each byte in the
+             16-bit word.  */
+          vec_perm_builder elts (TYPE_VECTOR_SUBPARTS (char_vectype), 2, 3);
+          for (unsigned i = 0; i < 3; ++i)
+            for (unsigned j = 0; j < 2; ++j)
+              elts.quick_push ((i + 1) * 2 - j - 1);
+
+          vec_perm_indices indices (elts, 1,
+                                    TYPE_VECTOR_SUBPARTS (char_vectype));
+          if (can_vec_perm_const_p (TYPE_MODE (char_vectype), indices))
+            {
+              /* vectorizable_bswap can handle the __builtin_bswap16 if we
+                 undo the argument promotion.  */
+              if (!useless_type_conversion_p (type, TREE_TYPE (oprnd0)))
+                {
+                  def = vect_recog_temp_ssa_var (type, NULL);
+                  def_stmt = gimple_build_assign (def, NOP_EXPR, oprnd0);
+                  append_pattern_def_seq (stmt_vinfo, def_stmt);
+                  oprnd0 = def;
+                }
+
+              /* Pattern detected.  */
+              vect_pattern_detected ("vect_recog_rotate_pattern", last_stmt);
+
+              *type_out = vectype;
+
+              /* Pattern supported.  Create a stmt to be used to replace the
+                 pattern, with the unpromoted argument.  */
+              var = vect_recog_temp_ssa_var (type, NULL);
+              pattern_stmt = gimple_build_call (gimple_call_fndecl (last_stmt),
+                                                1, oprnd0);
+              gimple_call_set_lhs (pattern_stmt, var);
+              gimple_call_set_fntype (as_a <gcall *> (pattern_stmt),
                                      gimple_call_fntype (last_stmt));
+              return pattern_stmt;
+            }
+        }
+
+      oprnd1 = build_int_cst (integer_type_node, 8);
+      rhs_code = LROTATE_EXPR;
+      bswap16_p = true;
     }
+  else
+    return NULL;
 
-  lhs = gimple_assign_lhs (last_stmt);
-  oprnd0 = gimple_assign_rhs1 (last_stmt);
-  type = TREE_TYPE (oprnd0);
-  oprnd1 = gimple_assign_rhs2 (last_stmt);
   if (TREE_CODE (oprnd0) != SSA_NAME
       || TYPE_PRECISION (TREE_TYPE (lhs)) != TYPE_PRECISION (type)
      || !INTEGRAL_TYPE_P (type)
@@ -2210,14 +2295,39 @@ vect_recog_rotate_pattern (stmt_vec_info
   optab1 = optab_for_tree_code (rhs_code, vectype, optab_vector);
   if (optab1
       && optab_handler (optab1, TYPE_MODE (vectype)) != CODE_FOR_nothing)
-    return NULL;
+    {
+     use_rotate:
+      if (bswap16_p)
+        {
+          if (!useless_type_conversion_p (type, TREE_TYPE (oprnd0)))
+            {
+              def = vect_recog_temp_ssa_var (type, NULL);
+              def_stmt = gimple_build_assign (def, NOP_EXPR, oprnd0);
+              append_pattern_def_seq (stmt_vinfo, def_stmt);
+              oprnd0 = def;
+            }
+
+          /* Pattern detected.  */
+          vect_pattern_detected ("vect_recog_rotate_pattern", last_stmt);
+
+          *type_out = vectype;
+
+          /* Pattern supported.  Create a stmt to be used to replace the
+             pattern.  */
+          var = vect_recog_temp_ssa_var (type, NULL);
+          pattern_stmt = gimple_build_assign (var, LROTATE_EXPR, oprnd0,
                                              oprnd1);
+          return pattern_stmt;
+        }
+      return NULL;
+    }
 
   if (is_a <bb_vec_info> (vinfo) || dt != vect_internal_def)
     {
       optab2 = optab_for_tree_code (rhs_code, vectype, optab_scalar);
       if (optab2
          && optab_handler (optab2, TYPE_MODE (vectype)) != CODE_FOR_nothing)
-        return NULL;
+        goto use_rotate;
     }
 
   /* If vector/vector or vector/scalar shifts aren't supported by the target,
@@ -2242,6 +2352,14 @@ vect_recog_rotate_pattern (stmt_vec_info
 
   *type_out = vectype;
 
+  if (bswap16_p && !useless_type_conversion_p (type, TREE_TYPE (oprnd0)))
+    {
+      def = vect_recog_temp_ssa_var (type, NULL);
+      def_stmt = gimple_build_assign (def, NOP_EXPR, oprnd0);
+      append_pattern_def_seq (stmt_vinfo, def_stmt);
+      oprnd0 = def;
+    }
+
   if (dt == vect_external_def
       && TREE_CODE (oprnd1) == SSA_NAME)
     ext_def = vect_get_external_def_edge (vinfo, oprnd1);
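To make the permutation built by the patch concrete: the elts.quick_push ((i + 1) * 2 - j - 1) loop encodes a selector that swaps the two bytes of every 16-bit lane.  The following stand-alone sketch (illustration only, not part of the patch; it expands the stepped encoding to a full 16-byte selector) prints the indices for a V16QI vector:

#include <cstdio>

int main()
{
    // Element k of the selector says which input byte lands in output
    // byte k, so { 1, 0, 3, 2, ... } byte-swaps each 16-bit word.
    unsigned sel[16];
    for (unsigned i = 0; i < 8; ++i)       // one 16-bit word per iteration
        for (unsigned j = 0; j < 2; ++j)
            sel[i * 2 + j] = (i + 1) * 2 - j - 1;

    for (unsigned k = 0; k < 16; ++k)
        std::printf("%u ", sel[k]);
    std::printf("\n");   // prints: 1 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14
    return 0;
}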
Looks good from a quick look.
Created attachment 46985 [details]
gcc10-pr91940.patch

Full untested patch.
Author: jakub
Date: Wed Oct 2 10:18:50 2019
New Revision: 276442

URL: https://gcc.gnu.org/viewcvs?rev=276442&root=gcc&view=rev
Log:
	PR tree-optimization/91940
	* tree-vect-patterns.c: Include tree-vector-builder.h and
	vec-perm-indices.h.
	(vect_recog_rotate_pattern): Also handle __builtin_bswap16, either by
	unpromoting the argument back to uint16_t, or by converting into a
	rotate, or into shifts plus ior.

	* gcc.dg/vect/vect-bswap16.c: Add -msse4 on x86, run on all targets,
	expect vectorized 1 loops message on both vect_bswap and sse4_runtime
	targets.
	* gcc.dg/vect/vect-bswap16a.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/vect-bswap16a.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/vect-bswap16.c
    trunk/gcc/tree-vect-patterns.c
(In reply to Jakub Jelinek from comment #1)
> The loop with the rotate is vectorized, while the one with __builtin_bswap16
> is not.

It is a bit surprising that we do not canonicalize one to the other somewhere in the middle-end.
Fixed.
      || TYPE_PRECISION (type) <= 16

Why do you have that test?  For a 32-bit target without bswap but with V2HI shifts, I find that allowing a precision of 16 there is pivotal to enabling vectorization of gcc.dg/vect/vect-bswap16a.c.
Even if you add support for V2HI bswap, it won't help vectorization without support for V4QI vectors and permutations, because vectorizable_bswap won't recognize the bswap capability of the target any other way.
Only the 16-bit byteswap has the r>> 8 canonical form; larger byteswaps don't.  Larger byteswaps certainly aren't rotates, just permutes.  So, if the vectorizer doesn't try that already, it should try to vectorize bswap32, bswap64 etc. as a permutation.
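For illustration (my own sketch, not from this report): a 32-bit byteswap reverses all four bytes, which is a byte permutation but not expressible as a single rotate, unlike the 16-bit case.

#include <cstdint>

// Byte reversal written out as an explicit permutation of the four bytes;
// vectorized, this corresponds to shuffling each 32-bit lane's bytes as
// { 3, 2, 1, 0 }.
inline std::uint32_t bswap32_as_permute(std::uint32_t x)
{
    return ((x & 0x000000ffu) << 24)
         | ((x & 0x0000ff00u) << 8)
         | ((x & 0x00ff0000u) >> 8)
         | ((x & 0xff000000u) >> 24);
}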
(In reply to Jakub Jelinek from comment #11)

But the condition I quoted rejects the recognition of a bswap16 with non-promoted arguments.  vectorizable_bswap doesn't do anything for processors that don't have byte vectors, even if they have support for shifts, ior and bswap16 on V2HI vectors.  As it is, the test just fails.

If I change the test in vect_recog_rotate_pattern to

      || TYPE_PRECISION (type) < 16

I get an expansion with shifts and ior, which a combiner splitter pattern can transform into bswapv2hi2.