Bug 93039 - Fails to use SSE bitwise ops for float-as-int manipulations
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
 
Reported: 2019-12-21 15:05 UTC by Alexander Monakov
Modified: 2020-01-09 08:34 UTC

Target: x86_64-*-*
Last reconfirmed: 2020-01-08 00:00:00


Description Alexander Monakov 2019-12-21 15:05:58 UTC
(the non-regression part of PR 92905)

libm functions need to manipulate individual bits of float/double representations efficiently, but on x86 GCC typically performs these operations on general-purpose registers, even when that results in an SSE-GPR-SSE move chain:

float foo(float x)
{
    union {float f; unsigned i;} u = {x};
    u.i &= ~0x80000000;
    return u.f;
}

foo:
        movd    eax, xmm0
        and     eax, 2147483647
        movd    xmm0, eax
        ret

It's good to use bitwise ops on general registers if the source or destination needs to be in a general register, but for cases like the above, creating such a round trip is not desirable.

(GCC gets this example right on aarch64; LLVM on x86 compiles this to an SSE/AVX bitwise 'and', taking the mask operand from memory)
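
For reference, the memory-operand SSE form would look roughly like this (a sketch in the style of the output above; the .LC0 constant-pool label and layout are illustrative, not actual LLVM output):

foo:
        andps   xmm0, XMMWORD PTR .LC0[rip]
        ret
        .align  16
.LC0:
        .long   2147483647, 2147483647, 2147483647, 2147483647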
Comment 1 Marc Glisse 2019-12-21 16:17:04 UTC
This looks related to Bug 54716 (which was restricted to vectors).
Comment 2 Richard Biener 2020-01-08 14:58:07 UTC
STV doesn't recognize

(insn 7 6 11 2 (parallel [
            (set (subreg:SI (reg:SF 84 [ <retval> ]) 0)
                (and:SI (subreg:SI (reg:SF 88) 0)
                    (const_int 2147483647 [0x7fffffff])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":5:13 444 {*andsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SF 88)
            (nil))))

it has

  if (!REG_P (XEXP (src, 0))
      && !MEM_P (XEXP (src, 0))
      && !CONST_INT_P (XEXP (src, 0))
      /* Check for andnot case.  */
      && (GET_CODE (src) != AND
          || GET_CODE (XEXP (src, 0)) != NOT
          || !REG_P (XEXP (XEXP (src, 0), 0))))
      return false;

and thus doesn't allow punning subregs.  OTOH I wonder why the above
isn't matched by a SImode SSE op ... (yeah, well, we don't have that).

If I "fix" STV with

Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c     (revision 280006)
+++ gcc/config/i386/i386-features.c     (working copy)
@@ -1365,7 +1365,7 @@ general_scalar_to_vector_candidate_p (rt
       || GET_MODE (dst) != mode)
     return false;
 
-  if (!REG_P (dst) && !MEM_P (dst))
+  if (!REG_P (dst) && !SUBREG_P (dst) && !MEM_P (dst))
     return false;
 
   switch (GET_CODE (src))
@@ -1422,6 +1422,7 @@ general_scalar_to_vector_candidate_p (rt
     }
 
   if (!REG_P (XEXP (src, 0))
+      && !SUBREG_P (XEXP (src, 0))
       && !MEM_P (XEXP (src, 0))
       && !CONST_INT_P (XEXP (src, 0))
       /* Check for andnot case.  */

I see

Building chain #1...
  Adding insn 7 to chain #1
  r84 use in insn 11 isn't convertible
  Mark r84 def in insn 7 as requiring both modes in chain #1
  r88 def in insn 14 isn't convertible
  Mark r88 def in insn 14 as requiring both modes in chain #1
Collected chain #1...
  insns: 7
  defs to convert: r84, r88
Computing gain for chain #1...
  Instruction gain -6 for     7: {r84:SF#0=r88:SF#0&0x7fffffff;clobber flags:CC;}
      REG_UNUSED flags:CC
      REG_DEAD r88:SF
  Instruction conversion gain: -6
  Registers conversion cost: 12
  Total gain: -18
Chain #1 conversion is not profitable

so besides STV not handling the subregs correctly for costing, the cost computed
for the actual instruction is negative as well (likely because of the cost of
loading the constant).  STV also doesn't compute any "gain" for the case where
an existing conversion becomes unnecessary.

The question is for which CPUs is it actually faster to use SSE?
Comment 3 Alexander Monakov 2020-01-08 15:34:40 UTC
> The question is for which CPUs is it actually faster to use SSE?

In the context of chains where the source and the destination need to be SSE registers, pretty much all CPUs? Inter-unit moves typically have some latency; e.g., recent AMD (since Zen) and Intel (Skylake) have latency 3 for SSE<->GPR moves (surprisingly, the four generations prior to Skylake had latency 1). Older AMDs with a shared FPU had even worse latencies. At the same time, SSE integer ops have latencies and throughput comparable to GPR ones, so moving a chain to SSE ops generally doesn't make it slower. Plus it helps with register pressure.

When either the source or the destination of a chain is bound to a general register or memory, it's ok to continue doing it on general regs.
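
A hypothetical contrast case (illustrative, not from the report) where the GPR form is the right choice, because the result must end up in a general register anyway:

unsigned sign_cleared_bits(float x)
{
    union {float f; unsigned i;} u = {x};
    /* The return value lives in a GPR, so one movd plus an integer
       'and' involves no SSE round trip here.  */
    return u.i & 0x7fffffff;
}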
Comment 4 rguenther@suse.de 2020-01-08 16:09:30 UTC
> When either the source or the destination of a chain is bound to a general
> register or memory, it's ok to continue doing it on general regs.

But we need an extra load for the constant operand with an SSE op.
Comment 5 Alexander Monakov 2020-01-09 08:34:07 UTC
Ah, in that sense. The extra load is problematic in cold code, where it's likely a TLB miss. For hot code, the load does not depend on any previous computations and so does not lengthen dependency chains, so it's fine from a latency point of view. From a throughput point of view there's a tradeoff: one extra load per chain may be OK, but if every other instruction in a chain needs a different load, that's probably excessive. So it needs to be costed somehow.

That said, sufficiently simple constants can be synthesized in-place with SSE instructions, without loading them from memory, for example the constant in the opening example:

  pcmpeqd %xmm1, %xmm1  // xmm1 = ~0
  pslld   $31, %xmm1    // xmm1 <<= 31

(again, if we need to synthesize just one constant per chain, that's preferable; if we need many, the extra work would need to be costed against the latency improvement of keeping the chain on SSE)
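
Putting those two instructions to use, a sketch of the opening example with no constant load at all (register choices illustrative; since what gets synthesized is the sign mask, i.e. the complement of 0x7fffffff, an andnot is used):

foo:
        pcmpeqd %xmm1, %xmm1   // xmm1 = ~0
        pslld   $31, %xmm1     // xmm1 = 0x80000000 in each lane
        andnps  %xmm0, %xmm1   // xmm1 = ~xmm1 & x = x & 0x7fffffff
        movaps  %xmm1, %xmm0
        ret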