Bug 68211 - Free __m128d subreg of double
Summary: Free __m128d subreg of double
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 6.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: genvector 88918
 
Reported: 2015-11-04 16:40 UTC by Marc Glisse
Modified: 2021-09-05 08:13 UTC
CC List: 3 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-03-04 00:00:00


Description Marc Glisse 2015-11-04 16:40:58 UTC
Hello,

we still seem to be missing some way of passing a double to intrinsics that take a __m128d argument (see below for an example) without any overhead when we do not care about the high part.

__m128d to __m256d has an intrinsic, although its implementation is not optimal (see PR 50829). But Intel apparently "forgot" to add a similar one for double to __m128d.
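For reference, a minimal sketch of what is available today (assuming only the documented intrinsics): the 128-to-256 widening can use _mm256_castpd128_pd256, whose upper bits are undefined, whereas the closest standard way to go from double to __m128d is _mm_set_sd, which is specified to zero the upper element and therefore cannot be free:

#include <x86intrin.h>

/* Requires -mavx; the upper 128 bits of the result are undefined,
   so in principle this costs no instruction (modulo PR 50829). */
__m256d widen(__m128d v) {
  return _mm256_castpd128_pd256(v);
}

/* The upper element is specified to be zero, so the compiler has to
   emit something to clear it. */
__m128d from_double(double x) {
  return _mm_set_sd(x);
}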

Say I want to use the new AVX512 _mm_sqrt_round_sd to compute the square root of a double rounded towards +infinity. Using -mavx512f, I can try:

#include <x86intrin.h>

double sqrt_up(double x){
  __m128d y = { x, 0 };
  return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y, _MM_FROUND_TO_POS_INF|_MM_FROUND_NO_EXC));
}

which generates

	vmovsd	%xmm0, -16(%rsp)
	vmovsd	-16(%rsp), %xmm0
	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0

I get the exact same code with

  double d = d;          /* deliberately left undefined: the value is a don't-care */
  __m128d y = { x, d };

or

  __m128d y = y;         /* likewise, only element 0 is ever initialized */
  y[0] = x;

I can shorten it to

	vmovddup	%xmm0, %xmm0
	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0

using

  __m128d y = { x, x };

I am forced to use inline asm

  __m128d y;
  asm("":"=x"(y):"0"(x));

to get what I wanted, i.e. only vsqrtsd without any extra instruction. But that makes the code non-portable, and I might as well write the vsqrtsd instruction myself in the asm. It probably also has similar drawbacks to the unspec in PR 50829.
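For completeness, the whole workaround looks roughly like this (a sketch only, compiled with -mavx512f as above; the empty asm merely tells GCC that x already sits in the low half of an xmm register and that the upper bits are don't-care):

#include <x86intrin.h>

double sqrt_up_asm(double x) {
  __m128d y;
  /* Non-portable: relies on GCC extended asm and the "x" (SSE register)
     constraint; the upper 64 bits of y are whatever was in the register. */
  asm("" : "=x"(y) : "0"(x));
  return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y,
      _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC));
}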
Comment 1 H.J. Lu 2016-04-18 15:38:03 UTC
With the patch for PR 70708 applied, I got

	vmovq	%xmm0, %xmm0
	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0
	ret

*** This bug has been marked as a duplicate of bug 70708 ***
Comment 2 Marc Glisse 2016-04-18 16:40:52 UTC
(In reply to H.J. Lu from comment #1)
> With the patch for PR 70708 applied, I got
> 
> 	vmovq	%xmm0, %xmm0
> 	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0
> 	ret

Note that the goal of both PRs is to get rid of that movq (I agree that movq looks like an improvement compared to what we have currently, thanks).
Comment 3 H.J. Lu 2016-04-18 17:01:09 UTC
(In reply to Marc Glisse from comment #2)
> (In reply to H.J. Lu from comment #1)
> > With the patch for PR 70708 applied, I got
> > 
> > 	vmovq	%xmm0, %xmm0
> > 	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0
> > 	ret
> 
> Note that the goal of both PRs is to get rid of that movq (I agree that movq
> looks like an improvement compared to what we have currently, thanks).

It depends on how the upper bits should be treated.
Comment 4 Jakub Jelinek 2016-04-18 17:18:13 UTC
If the upper bits of the register can contain arbitrary garbage, then keeping it there might result in e.g. floating point exceptions being raised (it could even be an sNaN).  Of course, it is a different matter if we can prove what is in those upper bits and be sure it does no harm, or if the operations on it later on are masked.
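To illustrate that point with plain SSE2 (my own sketch, not part of the testcase; link with -lm for fetestexcept on older glibc, and compile without optimization so the square roots are not folded away): a scalar operation only reads element 0, so junk in the upper lane is harmless, while a packed operation on the same register raises FE_INVALID:

#include <fenv.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
  /* Put "garbage" (-1.0, think sNaN) in the upper lane, a valid 4.0 below. */
  __m128d v = _mm_set_pd(-1.0, 4.0);

  feclearexcept(FE_INVALID);
  __m128d s = _mm_sqrt_sd(v, v);   /* scalar: only element 0 is rooted */
  printf("scalar: %g invalid=%d\n", _mm_cvtsd_f64(s),
         fetestexcept(FE_INVALID) != 0);

  feclearexcept(FE_INVALID);
  __m128d p = _mm_sqrt_pd(v);      /* packed: the upper lane raises FE_INVALID */
  printf("packed: %g invalid=%d\n", _mm_cvtsd_f64(p),
         fetestexcept(FE_INVALID) != 0);
  return 0;
}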
Comment 5 H.J. Lu 2016-04-19 01:57:16 UTC
There is

(insn 10 9 11 2 (parallel [
            (set (reg:V2DF 92) 
                (vec_merge:V2DF (sqrt:V2DF (reg:V2DF 93))
                    (reg:V2DF 94) 
                    (const_int 1 [0x1])))
            (unspec [
                    (const_int 10 [0xa])
                ] UNSPEC_EMBEDDED_ROUNDING)
        ]) /export/build/gnu/gcc/build-x86_64-linux/gcc/include/avx512fintrin.h:1736 1448 {sse2_vmsqrtv2df2_round}
     (nil))

Is it possible to change it so that the sqrt applies only to the
first 64 bits of the input operand?
Comment 6 Marc Glisse 2017-05-23 11:49:25 UTC
(In reply to Jakub Jelinek from comment #4)
> If the upper bits of the register can contain arbitrary garbage, then
> keeping it there might result in e.g. floating point exceptions being raised
> (it could even be an sNaN).  Of course, it is a different matter if we can
> prove what is in those upper bits and be sure it does no harm, or if the
> operations on it later on are masked.

For
  double f(double x){return __builtin_sqrt(x);}
we generate
	vsqrtsd	%xmm0, %xmm0, %xmm0
with -O -fno-math-errno.
I don't see what makes the rounded version different from the non-rounded version.
Comment 7 Steven Bosscher 2019-03-04 14:36:28 UTC
"g++ (Compiler-Explorer-Build) 9.0.1 20190303 (experimental)":

#include <x86intrin.h>

double sqrt_up(double x){
  __m128d y = { x, 0 };
  return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y, _MM_FROUND_TO_POS_INF|_MM_FROUND_NO_EXC));
}

double f(double x)
{
  return __builtin_sqrt(x);
}


sqrt_up(double):
        vmovq   %xmm0, %xmm0
        vsqrtsd {ru-sae}, %xmm0, %xmm0, %xmm0
        ret
f(double):
        vsqrtsd %xmm0, %xmm0, %xmm0
        ret
Comment 8 Marc Glisse 2019-03-05 02:25:39 UTC
(In reply to Steven Bosscher from comment #7)
>   __m128d y = { x, 0 };
>   return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y,
> _MM_FROUND_TO_POS_INF|_MM_FROUND_NO_EXC));

I don't necessarily advocate for optimizing out an existing explicit mov. Maybe I should, but there could be cases where mov makes the code faster, and I haven't experimented enough. I am only asking for a way to explicitly skip it if I believe I know what I am doing. Of course, if we start optimizing out the mov, it makes my request useless.
Comment 9 Andrew Pinski 2021-09-05 08:13:18 UTC
I thought I saw a bug like this recently too.