Hello, we still seem to be missing some way of passing a double to intrinsics that take a __m128d argument (see below for an example) without any overhead when we do not care about the high part. __m128d to __m256d has an intrinsic, although its implementation is not optimal (see PR 50829). But Intel apparently "forgot" to add a similar one for double to __m128d.

Say I want to use the new AVX512 _mm_sqrt_round_sd to compute the square root of a double rounded towards +infinity. Using -mavx512f, I can try:

#include <x86intrin.h>
double sqrt_up(double x){
  __m128d y = { x, 0 };
  return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y, _MM_FROUND_TO_POS_INF|_MM_FROUND_NO_EXC));
}

which generates

	vmovsd	%xmm0, -16(%rsp)
	vmovsd	-16(%rsp), %xmm0
	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0

I get the exact same code with

  double d = d;
  __m128d y = { x, d };

or

  __m128d y = y;
  y[0] = x;

I can shorten it to

	vmovddup	%xmm0, %xmm0
	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0

using

  __m128d y = { x, x };

I am forced to use inline asm

  __m128d y;
  asm("" : "=x"(y) : "0"(x));

to get what I wanted, i.e. only vsqrtsd without any extra instruction. But that makes the code non-portable, and I might as well write the vsqrtsd instruction myself in the asm. It probably also has similar drawbacks to the unspec in PR 50829.
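For reference, here is the asm workaround assembled into a complete function (sqrt_up_asm is just a name for this sketch; the empty asm statement merely retypes x as __m128d and leaves the upper 64 bits unspecified):

#include <x86intrin.h>
double sqrt_up_asm(double x){
  __m128d y;
  asm("" : "=x"(y) : "0"(x));  /* low lane = x, upper lane = whatever was there */
  return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y,
         _MM_FROUND_TO_POS_INF|_MM_FROUND_NO_EXC));
}

With -O -mavx512f this compiles to just the vsqrtsd, as described above.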

With the patch for PR 70708, I got

	vmovq	%xmm0, %xmm0
	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0
	ret

*** This bug has been marked as a duplicate of bug 70708 ***

(In reply to H.J. Lu from comment #1)
> With the patch for PR 70708, I got
>
> 	vmovq	%xmm0, %xmm0
> 	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0
> 	ret

Note that the goal of both PRs is to get rid of that movq (I agree that movq looks like an improvement compared to what we have currently, thanks).

(In reply to Marc Glisse from comment #2)
> (In reply to H.J. Lu from comment #1)
> > With the patch for PR 70708, I got
> >
> > 	vmovq	%xmm0, %xmm0
> > 	vsqrtsd	{ru-sae}, %xmm0, %xmm0, %xmm0
> > 	ret
>
> Note that the goal of both PRs is to get rid of that movq (I agree that
> movq looks like an improvement compared to what we have currently, thanks).

It depends on how the upper bits should be treated.

If the upper bits of the register can contain arbitrary garbage, then keeping it there might result in e.g. floating point exceptions being raised (it could even be an sNaN). Of course, it is a different matter if we can prove what is in those upper bits and be sure it does no harm, or if the operations on it later on are masked.
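To make the hazard concrete, here is a small sketch (my own example, not from the testcase; it assumes <fenv.h> exception inspection works on the target, needs -lm on glibc, and is best built at -O0 so nothing gets constant-folded): a packed operation reads both lanes, so an sNaN left in the upper lane raises FE_INVALID, while the scalar form only reads the low lane.

#include <fenv.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

int main(void){
  /* Build a signaling NaN by bit pattern. */
  unsigned long long bits = 0x7ff0000000000001ULL;
  double snan;
  memcpy(&snan, &bits, sizeof snan);

  __m128d v = _mm_set_pd(snan, 4.0);   /* upper lane = sNaN, low lane = 4.0 */

  feclearexcept(FE_ALL_EXCEPT);
  __m128d s = _mm_sqrt_sd(v, v);       /* vsqrtsd: only the low lane is computed */
  printf("scalar raised invalid: %d\n", fetestexcept(FE_INVALID) != 0);  /* expect 0 */

  feclearexcept(FE_ALL_EXCEPT);
  __m128d p = _mm_sqrt_pd(v);          /* vsqrtpd: both lanes are computed */
  printf("packed raised invalid: %d\n", fetestexcept(FE_INVALID) != 0);  /* expect 1 */

  return (int)(_mm_cvtsd_f64(s) + _mm_cvtsd_f64(p));
}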

There is

(insn 10 9 11 2 (parallel [
            (set (reg:V2DF 92)
                (vec_merge:V2DF (sqrt:V2DF (reg:V2DF 93))
                    (reg:V2DF 94)
                    (const_int 1 [0x1])))
            (unspec [
                    (const_int 10 [0xa])
                ] UNSPEC_EMBEDDED_ROUNDING)
        ]) /export/build/gnu/gcc/build-x86_64-linux/gcc/include/avx512fintrin.h:1736 1448 {sse2_vmsqrtv2df2_round}
     (nil))

Is it possible to change it so that sqrt applies only to the first 64 bits of the input operand?

(In reply to Jakub Jelinek from comment #4)
> If the upper bits of the register can contain arbitrary garbage, then
> keeping it there might result in e.g. floating point exceptions being
> raised (it could even be an sNaN). Of course, it is a different matter if
> we can prove what is in those upper bits and be sure it does no harm, or
> if the operations on it later on are masked.

For

double f(double x){ return __builtin_sqrt(x); }

we generate

	vsqrtsd	%xmm0, %xmm0, %xmm0

with -O -fno-math-errno. I don't see what makes the rounded version different from the non-rounded version.

"g++ (Compiler-Explorer-Build) 9.0.1 20190303 (experimental)": #include <x86intrin.h> double sqrt_up(double x){ __m128d y = { x, 0 }; return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y, _MM_FROUND_TO_POS_INF|_MM_FROUND_NO_EXC)); } double f(double x) { return __builtin_sqrt(x); } sqrt_up(double): vmovq %xmm0, %xmm0 vsqrtsd {ru-sae}, %xmm0, %xmm0, %xmm0 ret f(double): vsqrtsd %xmm0, %xmm0, %xmm0 ret

(In reply to Steven Bosscher from comment #7)
> __m128d y = { x, 0 };
> return _mm_cvtsd_f64(_mm_sqrt_round_sd(y, y,
>        _MM_FROUND_TO_POS_INF|_MM_FROUND_NO_EXC));

I don't necessarily advocate optimizing out an existing explicit mov. Maybe I should, but there could be cases where the mov makes the code faster, and I haven't experimented enough. I am only asking for a way to explicitly skip it if I believe I know what I am doing. Of course, if we start optimizing out the mov, that makes my request moot.
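To make the request concrete, the kind of helper I have in mind could be wrapped like this (a sketch; set_sd_keep_upper is a hypothetical name, the GNU branch is the same non-portable asm trick from comment #0, and the portable fallback pays for zeroing the upper lane):

#include <x86intrin.h>
static inline __m128d set_sd_keep_upper(double x){
#ifdef __GNUC__
  __m128d y;
  asm("" : "=x"(y) : "0"(x));  /* low lane = x, upper lane unspecified */
  return y;
#else
  return _mm_set_sd(x);        /* portable fallback: zeros the upper lane */
#endif
}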

I thought I had seen a bug like this recently too.