is there any reason why rsqrtss and rcpss are not used for scalar code while rsqrtps and rcpps are used for loops? cat scalar.cc #include<cmath> void scalar(float& a, float& b) { a = std::sqrt(a); b = 1.f/b; } float v[1024]; float w[1024]; void vector() { for(int i=0;i!=1024;++i) { v[i] = std::sqrt(v[i]); w[i] = 1.f/w[i]; } } c++ -std=c++11 -Ofast -march=corei7 -S scalar.cc -ftree-vectorizer-verbose=1 -ftree-loop-if-convert-stores; cat scalar.s | c++filt scalar(float&, float&): LFB221: sqrtss (%rdi), %xmm0 movss %xmm0, (%rdi) movss LC0(%rip), %xmm0 divss (%rsi), %xmm0 movss %xmm0, (%rsi) ret LFE221: .align 4,0x90 .globl vector() vector(): LFB222: movaps LC1(%rip), %xmm5 leaq void(%rip), %rax xorps %xmm3, %xmm3 movaps LC2(%rip), %xmm4 leaq wchar_t(%rip), %rdx leaq 4096+void(%rip), %rcx .align 4,0x90 L4: movaps (%rax), %xmm1 movaps %xmm3, %xmm2 addq $16, %rax addq $16, %rdx rsqrtps %xmm1, %xmm0 cmpneqps %xmm1, %xmm2 andps %xmm2, %xmm0 mulps %xmm0, %xmm1 mulps %xmm1, %xmm0 mulps %xmm4, %xmm1 addps %xmm5, %xmm0 mulps %xmm1, %xmm0 movaps %xmm0, -16(%rax) movaps -16(%rdx), %xmm1 rcpps %xmm1, %xmm0 mulps %xmm0, %xmm1 mulps %xmm0, %xmm1 addps %xmm0, %xmm0 subps %xmm1, %xmm0 movaps %xmm0, -16(%rdx) cmpq %rcx, %rax jne L4 rep; ret
Use -mrecip. It's otherwise not safe for SPEC CPU 2006 which is why it is not enabled by default for -ffast-math.
Thanks. not safe meaning producing incorrect results? Is it documented?
(In reply to comment #2) > Thanks. > not safe meaning producing incorrect results? Yes. > Is it documented? See the documentation for -mrecip: ... Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994). ...
> is there any reason why rsqrtss and rcpss are not used for scalar code while > rsqrtps and rcpps are used for loops? Yep! I don't have the patience to dig the bugzilla archive right now, but the main reason is related to a loss of accuracy (especially 1/2.0 != 0.5) leading to problems in some codes (see gas_dyn.f90 in the polyhedron tests). You can pass options to force the use of rsqrtss and rcpss for scalars: -mrecip This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -finite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994). Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn't need -mrecip. Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn't need -mrecip. -mrecip=opt This option controls which reciprocal estimate instructions may be used. opt is a comma-separated list of options, which may be preceded by a `!' to invert the option: `all' Enable all estimate instructions. `default' Enable the default instructions, equivalent to -mrecip. `none' Disable all estimate instructions, equivalent to -mno-recip. `div' Enable the approximation for scalar division. `vec-div' Enable the approximation for vectorized division. `sqrt' Enable the approximation for scalar square root. `vec-sqrt' Enable the approximation for vectorized square root. So, for example, -mrecip=all,!sqrt enables all of the reciprocal approximations, except for square root.
we just got "hit" by this great type of code (copysign is unknown to scientists) most probably gcc could optimize it for -Ofast to return copysignf(1.f,x); (x/x is optimized in 1) cat one.cc;c++ -Ofast -mrecip -S one.cc; cat one.s #include<cmath> int one(float x) { return x/std::abs(x); } .text .align 4,0x90 .globl __Z3onef __Z3onef: LFB86: movss LC0(%rip), %xmm2 andps %xmm0, %xmm2 rcpss %xmm2, %xmm1 mulss %xmm1, %xmm2 mulss %xmm1, %xmm2 addss %xmm1, %xmm1 subss %xmm2, %xmm1 mulss %xmm0, %xmm1 cvttss2si %xmm1, %eax ret
(In reply to comment #5) > we just got "hit" by this great type of code (copysign is unknown to > scientists) > > most probably gcc could optimize it for -Ofast to return copysignf(1.f,x); (x/x > is optimized in 1) > > > cat one.cc;c++ -Ofast -mrecip -S one.cc; cat one.s > #include<cmath> > int one(float x) { > return x/std::abs(x); > } That looks like a completely different issue than this PR, I think you should open a different PR if you don't want it to get lost. It seems easy to add a few lines to fold_binary_loc about it (not the best place, but that's where the others are) near the place that optimizes A / A to 1.0. You could try writing the patch, I don't foresee any trap.
See PR 47989 for the reason why this option is not enabled for scalar code and why it was only enabled for vectorized code.