GCC 7 and newer optimize `std::floor(float)` into `vroundss` with -O2 and -march=skylake, which is great. However, I can see in Compiler Explorer that the following example: ``` #include <cmath> struct TVec { float x, y; }; struct TKey { int i, j; }; class TDom { private: static int Floor(float const x) { return static_cast<int>(std::floor(x)); } public: TKey CalcKey(TVec const &) const; }; TKey TDom::CalcKey(TVec const &r) const { return {Floor(r.x), Floor(r.y)}; } ``` produces: ``` vxorps %xmm1, %xmm1, %xmm1 vroundss $9, (%rsi), %xmm1, %xmm0 vroundss $9, 4(%rsi), %xmm1, %xmm1 vunpcklps %xmm1, %xmm0, %xmm0 vcvttps2dq %xmm0, %xmm2 vmovq %xmm2, %rax ret ``` Couldn’t the pair of `vroundss` have been merged into a single `vroundps`?
On x86_64: /app/example.cpp:19:35: missed: function is not vectorizable. /opt/compiler-explorer/gcc-trunk-20220912/include/c++/13.0.0/cmath:261:28: missed: not vectorized: relevant stmt not supported: _9 = __builtin_floorf (_1); While on aarch64, GCC can handle the SLP just fine: vect__1.7_12 = MEM <const vector(2) float> [(float *)r_4(D)]; vect__9.8_13 = .FLOOR (vect__1.7_12); vect__10.9_14 = (vector(2) int) vect__9.8_13; MEM <vector(2) int> [(int *)&D.25164] = vect__10.9_14; So this is a target backend improvement that is needed.
Probably missing patterns for V2SFmode here. Hmm, we don't seem to have any vector mode patterns here but possibly rely on ix86_builtin_vectorized_function which indeed doesn't have any V2SFmode support. The vectorizer would go the direct internal fn way for those, querying the floor optab but the x86 backend only has scalar modes supported for the rounding optabs. The backend should modernize itself, get rid of the ix86_builtin_vectorized_function parts for those functions and instead rely on define_expands with vector modes.
> The backend should modernize itself, get rid of the > ix86_builtin_vectorized_function parts for those functions and instead rely > on define_expands with vector modes. Indeed, let me do it.
> The vectorizer would go the direct internal fn way for those, querying the > floor optab but the x86 backend only has scalar modes supported for the > rounding optabs. For CFN_BUILT_IN_ICEIL, the modifier is narrow, and currently vectorizable_call require op_in and op_out to be simple_integer_narrowing, which means it fails to go the direct internal fn way. ---------cut from vectorizable_call----------- tree_code convert_code = ERROR_MARK; if (cfn != CFN_LAST && (modifier == NONE || (modifier == NARROW && simple_integer_narrowing (vectype_out, vectype_in, &convert_code)))) ifn = vectorizable_internal_function (cfn, callee, vectype_out, vectype_in); -----------cut end---------------------- Similar for CFN_BUILT_IN_LCEIL under 32-bit target. For 64-bit target CFN_BUILT_IN_LCEIL, CFN_BUILT_IN_LLCEIL will go the direct internal fn way, add lceilmn2 expanders works. Not sure whether vectorizable_call should be extended or just leave CFN_BUILT_IN_ICEIL/IFLOOR/IRINT/IROUND part in ix86_builtin_vectorized_function.
On Thu, 15 Sep 2022, crazylht at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106910 > > --- Comment #4 from Hongtao.liu <crazylht at gmail dot com> --- > > > The vectorizer would go the direct internal fn way for those, querying the > > floor optab but the x86 backend only has scalar modes supported for the > > rounding optabs. > For CFN_BUILT_IN_ICEIL, the modifier is narrow, and currently vectorizable_call > require op_in and op_out to be simple_integer_narrowing, which means it fails > to go the direct internal fn way. > > ---------cut from vectorizable_call----------- > tree_code convert_code = ERROR_MARK; > if (cfn != CFN_LAST > && (modifier == NONE > || (modifier == NARROW > && simple_integer_narrowing (vectype_out, vectype_in, > &convert_code)))) > ifn = vectorizable_internal_function (cfn, callee, vectype_out, > vectype_in); > -----------cut end---------------------- > > Similar for CFN_BUILT_IN_LCEIL under 32-bit target. > For 64-bit target CFN_BUILT_IN_LCEIL, CFN_BUILT_IN_LLCEIL will go the direct > internal fn way, add lceilmn2 expanders works. > > Not sure whether vectorizable_call should be extended or just leave > CFN_BUILT_IN_ICEIL/IFLOOR/IRINT/IROUND part in > ix86_builtin_vectorized_function. I think this is a known issue (we may even have a bugreport) so I'd leave handling of those cases in ix86_builtin_vectorized_function.
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:3e8c4b925a9825fdb8c81f47b621f63108894362 commit r13-2694-g3e8c4b925a9825fdb8c81f47b621f63108894362 Author: liuhongt <hongtao.liu@intel.com> Date: Thu Sep 15 18:43:16 2022 +0800 Modernize ix86_builtin_vectorized_function with corresponding expanders. For ifloor/lfloor/iceil/lceil/irint/lrint/iround/lround when size of in_mode is not equal out_mode, vectorizer doesn't go to internal fn way,still left that part in the ix86_builtin_vectorized_function. Remove others builtins and add corresponding expanders. gcc/ChangeLog: PR target/106910 * config/i386/i386-builtins.cc (ix86_builtin_vectorized_function): Modernized with corresponding expanders. * config/i386/sse.md (lrint<mode><sseintvecmodelower>2): New expander. (floor<mode>2): Ditto. (lfloor<mode><sseintvecmodelower>2): Ditto. (ceil<mode>2): Ditto. (lceil<mode><sseintvecmodelower>2): Ditto. (btrunc<mode>2): Ditto. (lround<mode><sseintvecmodelower>2): Ditto. (exp2<mode>2): Ditto.
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:d0c73b6c85677e6755b60fa02d79a5c5e1a8eacd commit r13-2730-gd0c73b6c85677e6755b60fa02d79a5c5e1a8eacd Author: liuhongt <hongtao.liu@intel.com> Date: Fri Sep 16 14:28:34 2022 +0800 Support 64-bit vectorization for single-precision floating rounding operation. Here's list the patch supported. rint/nearbyint/ceil/floor/trunc/lrint/lceil/lfloor/round/lround. gcc/ChangeLog: PR target/106910 * config/i386/mmx.md (nearbyintv2sf2): New expander. (rintv2sf2): Ditto. (ceilv2sf2): Ditto. (lceilv2sfv2si2): Ditto. (floorv2sf2): Ditto. (lfloorv2sfv2si2): Ditto. (btruncv2sf2): Ditto. (lrintv2sfv2si2): Ditto. (roundv2sf2): Ditto. (lroundv2sfv2si2): Ditto. (*mmx_roundv2sf2): New define_insn. gcc/testsuite/ChangeLog: * gcc.target/i386/pr106910-1.c: New test.
Fixed in GCC13.
Thans so much for the fix. I have tested last night (20220920) snapshot and I can confirm the testcase is vectorized. However, for the vectorization to kick in I had to use -fno-trapping-math. I assume that's intentional. Could you please explain why it is necessary?
Because the round insn does not trap on denormals.
(In reply to Hongtao.liu from comment #10) > Because the round insn does not trap on denormals. Note for scalar version, gcc generates roundss $9, %xmm0, %xmm0 which doesn't raise exception because it's allowed by the below option. ----- ffp-int-builtin-inexact Common Var(flag_fp_int_builtin_inexact) Init(1) Optimization Allow built-in functions ceil, floor, round, trunc to raise \"inexact\" exceptions. -----
Wouldn't it make sense for scalar and vector versions to be affected by the same options?
(In reply to Pilar Latiesa from comment #12) > Wouldn't it make sense for scalar and vector versions to be affected by the > same options? I guess the reason is we don't have __builtin_floorf for vector version, and that option is only apply to direct call for builtin function. I'm not sure should we apply that to internal fn either(to make vector version behave the same as scalar version).