Bug 106910 - roundss not vectorized
Summary: roundss not vectorized
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 13.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2022-09-12 16:15 UTC by Pilar Latiesa
Modified: 2022-09-22 07:52 UTC (History)
3 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-09-12 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Pilar Latiesa 2022-09-12 16:15:16 UTC
GCC 7 and newer optimize `std::floor(float)` into `vroundss` with -O2 and -march=skylake, which is great.

However, I can see in Compiler Explorer that the following example:

```
#include <cmath>

struct TVec { float x, y; };

struct TKey { int i, j; };

class TDom
{
private:
  static int Floor(float const x)
    { return static_cast<int>(std::floor(x)); }

public:
  TKey CalcKey(TVec const &) const;
};

TKey TDom::CalcKey(TVec const &r) const
  { return {Floor(r.x), Floor(r.y)}; }
```

produces:

```

vxorps      %xmm1, %xmm1, %xmm1
vroundss    $9, (%rsi), %xmm1, %xmm0
vroundss    $9, 4(%rsi), %xmm1, %xmm1
vunpcklps   %xmm1, %xmm0, %xmm0
vcvttps2dq  %xmm0, %xmm2
vmovq       %xmm2, %rax
ret
```

Couldn’t the pair of `vroundss` have been merged into a single `vroundps`?
Comment 1 Drea Pinski 2022-09-12 16:35:53 UTC
On x86_64:
/app/example.cpp:19:35: missed:   function is not vectorizable.
/opt/compiler-explorer/gcc-trunk-20220912/include/c++/13.0.0/cmath:261:28: missed:   not vectorized: relevant stmt not supported: _9 = __builtin_floorf (_1);

While on aarch64, GCC can handle the SLP just fine:

  vect__1.7_12 = MEM <const vector(2) float> [(float *)r_4(D)];
  vect__9.8_13 = .FLOOR (vect__1.7_12);
  vect__10.9_14 = (vector(2) int) vect__9.8_13;
  MEM <vector(2) int> [(int *)&D.25164] = vect__10.9_14;

So this is a target backend improvement that is needed.
Comment 2 Richard Biener 2022-09-13 06:31:16 UTC
Probably missing patterns for V2SFmode here.  Hmm, we don't seem to have any vector mode patterns here but possibly rely on ix86_builtin_vectorized_function
which indeed doesn't have any V2SFmode support.

The vectorizer would go the direct internal fn way for those, querying the
floor optab but the x86 backend only has scalar modes supported for the
rounding optabs.

The backend should modernize itself, get rid of the ix86_builtin_vectorized_function parts for those functions and instead rely
on define_expands with vector modes.
Comment 3 Hongtao.liu 2022-09-13 07:30:44 UTC
> The backend should modernize itself, get rid of the
> ix86_builtin_vectorized_function parts for those functions and instead rely
> on define_expands with vector modes.

Indeed, let me do it.
Comment 4 Hongtao.liu 2022-09-15 09:35:34 UTC
> The vectorizer would go the direct internal fn way for those, querying the
> floor optab but the x86 backend only has scalar modes supported for the
> rounding optabs.
For CFN_BUILT_IN_ICEIL, the modifier is narrow, and currently vectorizable_call require op_in and op_out to be simple_integer_narrowing, which means it fails to go the direct internal fn way.

---------cut from vectorizable_call-----------
  tree_code convert_code = ERROR_MARK;
  if (cfn != CFN_LAST
      && (modifier == NONE
	  || (modifier == NARROW
	      && simple_integer_narrowing (vectype_out, vectype_in,
					   &convert_code))))
    ifn = vectorizable_internal_function (cfn, callee, vectype_out,
					  vectype_in);
-----------cut end----------------------

Similar for CFN_BUILT_IN_LCEIL under 32-bit target.
For 64-bit target CFN_BUILT_IN_LCEIL, CFN_BUILT_IN_LLCEIL will go the direct internal fn way, add lceilmn2 expanders works.

Not sure whether vectorizable_call should be extended or just leave CFN_BUILT_IN_ICEIL/IFLOOR/IRINT/IROUND part in ix86_builtin_vectorized_function.
Comment 5 rguenther@suse.de 2022-09-15 11:40:28 UTC
On Thu, 15 Sep 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106910
> 
> --- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > The vectorizer would go the direct internal fn way for those, querying the
> > floor optab but the x86 backend only has scalar modes supported for the
> > rounding optabs.
> For CFN_BUILT_IN_ICEIL, the modifier is narrow, and currently vectorizable_call
> require op_in and op_out to be simple_integer_narrowing, which means it fails
> to go the direct internal fn way.
> 
> ---------cut from vectorizable_call-----------
>   tree_code convert_code = ERROR_MARK;
>   if (cfn != CFN_LAST
>       && (modifier == NONE
>           || (modifier == NARROW
>               && simple_integer_narrowing (vectype_out, vectype_in,
>                                            &convert_code))))
>     ifn = vectorizable_internal_function (cfn, callee, vectype_out,
>                                           vectype_in);
> -----------cut end----------------------
> 
> Similar for CFN_BUILT_IN_LCEIL under 32-bit target.
> For 64-bit target CFN_BUILT_IN_LCEIL, CFN_BUILT_IN_LLCEIL will go the direct
> internal fn way, add lceilmn2 expanders works.
> 
> Not sure whether vectorizable_call should be extended or just leave
> CFN_BUILT_IN_ICEIL/IFLOOR/IRINT/IROUND part in
> ix86_builtin_vectorized_function.

I think this is a known issue (we may even have a bugreport) so I'd
leave handling of those cases in ix86_builtin_vectorized_function.
Comment 6 GCC Commits 2022-09-16 07:45:09 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:3e8c4b925a9825fdb8c81f47b621f63108894362

commit r13-2694-g3e8c4b925a9825fdb8c81f47b621f63108894362
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Sep 15 18:43:16 2022 +0800

    Modernize ix86_builtin_vectorized_function with corresponding expanders.
    
    For ifloor/lfloor/iceil/lceil/irint/lrint/iround/lround when size of
    in_mode is not equal out_mode, vectorizer doesn't go to internal fn
    way,still left that part in the ix86_builtin_vectorized_function.
    
    Remove others builtins and add corresponding expanders.
    
    gcc/ChangeLog:
    
            PR target/106910
            * config/i386/i386-builtins.cc
            (ix86_builtin_vectorized_function): Modernized with
            corresponding expanders.
            * config/i386/sse.md (lrint<mode><sseintvecmodelower>2): New
            expander.
            (floor<mode>2): Ditto.
            (lfloor<mode><sseintvecmodelower>2): Ditto.
            (ceil<mode>2): Ditto.
            (lceil<mode><sseintvecmodelower>2): Ditto.
            (btrunc<mode>2): Ditto.
            (lround<mode><sseintvecmodelower>2): Ditto.
            (exp2<mode>2): Ditto.
Comment 7 GCC Commits 2022-09-20 06:54:27 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:d0c73b6c85677e6755b60fa02d79a5c5e1a8eacd

commit r13-2730-gd0c73b6c85677e6755b60fa02d79a5c5e1a8eacd
Author: liuhongt <hongtao.liu@intel.com>
Date:   Fri Sep 16 14:28:34 2022 +0800

    Support 64-bit vectorization for single-precision floating rounding operation.
    
    Here's list the patch supported.
    rint/nearbyint/ceil/floor/trunc/lrint/lceil/lfloor/round/lround.
    
    gcc/ChangeLog:
    
            PR target/106910
            * config/i386/mmx.md (nearbyintv2sf2): New expander.
            (rintv2sf2): Ditto.
            (ceilv2sf2): Ditto.
            (lceilv2sfv2si2): Ditto.
            (floorv2sf2): Ditto.
            (lfloorv2sfv2si2): Ditto.
            (btruncv2sf2): Ditto.
            (lrintv2sfv2si2): Ditto.
            (roundv2sf2): Ditto.
            (lroundv2sfv2si2): Ditto.
            (*mmx_roundv2sf2): New define_insn.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/i386/pr106910-1.c: New test.
Comment 8 Hongtao.liu 2022-09-20 07:00:58 UTC
Fixed in GCC13.
Comment 9 Pilar Latiesa 2022-09-21 10:44:14 UTC
Thans so much for the fix.

I have tested last night (20220920) snapshot and I can confirm the testcase is vectorized.

However, for the vectorization to kick in I had to use -fno-trapping-math. I assume that's intentional. Could you please explain why it is necessary?
Comment 10 Hongtao.liu 2022-09-22 01:09:46 UTC
Because the round insn does not trap on denormals.
Comment 11 Hongtao.liu 2022-09-22 01:27:04 UTC
(In reply to Hongtao.liu from comment #10)
> Because the round insn does not trap on denormals.

Note for scalar version, gcc generates roundss $9, %xmm0, %xmm0 which doesn't raise exception because it's allowed by the below option.
-----
ffp-int-builtin-inexact
Common Var(flag_fp_int_builtin_inexact) Init(1) Optimization
Allow built-in functions ceil, floor, round, trunc to raise \"inexact\" exceptions.
-----
Comment 12 Pilar Latiesa 2022-09-22 07:45:05 UTC
Wouldn't it make sense for scalar and vector versions to be affected by the same options?
Comment 13 Hongtao.liu 2022-09-22 07:52:48 UTC
(In reply to Pilar Latiesa from comment #12)
> Wouldn't it make sense for scalar and vector versions to be affected by the
> same options?

I guess the reason is we don't have __builtin_floorf for vector version, and that option is only apply to direct call for builtin function.

I'm not sure should we apply that to internal fn either(to make vector version behave the same as scalar version).