55760 – scalar code non using rsqrtss and rcpss

Bug 55760 - scalar code non using rsqrtss and rcpss

Summary: scalar code non using rsqrtss and rcpss

Status:	RESOLVED WONTFIX

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.8.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	documentation

Depends on:
Blocks:

Reported:	2012-12-20 15:49 UTC by vincenzo Innocente
Modified:	2021-08-07 22:59 UTC (History)
CC List:	0 users

See Also:	47989
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description vincenzo Innocente 2012-12-20 15:49:04 UTC

is there any reason why rsqrtss and rcpss are not used for scalar code while
rsqrtps and rcpps are used for loops?

cat scalar.cc
#include<cmath>
void scalar(float& a, float& b) {
  a = std::sqrt(a);
  b = 1.f/b;
}

float v[1024];
float w[1024];

void vector() {
  for(int i=0;i!=1024;++i) {
    v[i] = std::sqrt(v[i]);
    w[i] = 1.f/w[i];
  }
}

c++ -std=c++11 -Ofast -march=corei7 -S scalar.cc -ftree-vectorizer-verbose=1  -ftree-loop-if-convert-stores; cat scalar.s | c++filt


scalar(float&, float&):
LFB221:
	sqrtss	(%rdi), %xmm0
	movss	%xmm0, (%rdi)
	movss	LC0(%rip), %xmm0
	divss	(%rsi), %xmm0
	movss	%xmm0, (%rsi)
	ret
LFE221:
	.align 4,0x90
	.globl vector()
vector():
LFB222:
	movaps	LC1(%rip), %xmm5
	leaq	void(%rip), %rax
	xorps	%xmm3, %xmm3
	movaps	LC2(%rip), %xmm4
	leaq	wchar_t(%rip), %rdx
	leaq	4096+void(%rip), %rcx
	.align 4,0x90
L4:
	movaps	(%rax), %xmm1
	movaps	%xmm3, %xmm2
	addq	$16, %rax
	addq	$16, %rdx
	rsqrtps	%xmm1, %xmm0
	cmpneqps	%xmm1, %xmm2
	andps	%xmm2, %xmm0
	mulps	%xmm0, %xmm1
	mulps	%xmm1, %xmm0
	mulps	%xmm4, %xmm1
	addps	%xmm5, %xmm0
	mulps	%xmm1, %xmm0
	movaps	%xmm0, -16(%rax)
	movaps	-16(%rdx), %xmm1
	rcpps	%xmm1, %xmm0
	mulps	%xmm0, %xmm1
	mulps	%xmm0, %xmm1
	addps	%xmm0, %xmm0
	subps	%xmm1, %xmm0
	movaps	%xmm0, -16(%rdx)
	cmpq	%rcx, %rax
	jne	L4
	rep; ret

Comment 1 Richard Biener 2012-12-20 15:52:31 UTC

Use -mrecip.  It's otherwise not safe for SPEC CPU 2006 which is why it is not
enabled by default for -ffast-math.

Comment 2 vincenzo Innocente 2012-12-20 15:55:03 UTC

Thanks.
not safe meaning producing incorrect results?
Is it documented?

Comment 3 Richard Biener 2012-12-20 15:58:55 UTC

(In reply to comment #2)
> Thanks.
> not safe meaning producing incorrect results?

Yes.

> Is it documented?

See the documentation for -mrecip:

...

Note that while the throughput of the sequence is higher than the throughput
of the non-reciprocal instruction, the precision of the sequence can be
decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).

...

Comment 4 Dominique d'Humieres 2012-12-20 16:07:11 UTC

> is there any reason why rsqrtss and rcpss are not used for scalar code while
> rsqrtps and rcpps are used for loops?

Yep! I don't have the patience to dig the bugzilla archive right now, but the main reason is related to a loss of accuracy (especially 1/2.0 != 0.5) leading to problems in some codes (see gas_dyn.f90 in the polyhedron tests). You can pass options to force the use of rsqrtss and rcpss for scalars:

-mrecip
This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -finite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).
Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn't need -mrecip.

Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn't need -mrecip. 

-mrecip=opt
This option controls which reciprocal estimate instructions may be used. opt is a comma-separated list of options, which may be preceded by a `!' to invert the option:
`all'
Enable all estimate instructions. 
`default'
Enable the default instructions, equivalent to -mrecip. 
`none'
Disable all estimate instructions, equivalent to -mno-recip. 
`div'
Enable the approximation for scalar division. 
`vec-div'
Enable the approximation for vectorized division. 
`sqrt'
Enable the approximation for scalar square root. 
`vec-sqrt'
Enable the approximation for vectorized square root.
So, for example, -mrecip=all,!sqrt enables all of the reciprocal approximations, except for square root.

Comment 5 vincenzo Innocente 2013-01-08 15:29:18 UTC

we just got "hit" by this great type of code (copysign is unknown to scientists)

most probably gcc could optimize it for -Ofast to return copysignf(1.f,x); (x/x is optimized in 1)


cat one.cc;c++ -Ofast -mrecip -S one.cc; cat one.s
#include<cmath>
int one(float x) {
  return x/std::abs(x);
}

	.text
	.align 4,0x90
	.globl __Z3onef
__Z3onef:
LFB86:
	movss	LC0(%rip), %xmm2
	andps	%xmm0, %xmm2
	rcpss	%xmm2, %xmm1
	mulss	%xmm1, %xmm2
	mulss	%xmm1, %xmm2
	addss	%xmm1, %xmm1
	subss	%xmm2, %xmm1
	mulss	%xmm0, %xmm1
	cvttss2si	%xmm1, %eax
	ret

Comment 6 Marc Glisse 2013-01-08 23:55:18 UTC

(In reply to comment #5)
> we just got "hit" by this great type of code (copysign is unknown to
> scientists)
> 
> most probably gcc could optimize it for -Ofast to return copysignf(1.f,x); (x/x
> is optimized in 1)
> 
> 
> cat one.cc;c++ -Ofast -mrecip -S one.cc; cat one.s
> #include<cmath>
> int one(float x) {
>   return x/std::abs(x);
> }

That looks like a completely different issue than this PR, I think you should open a different PR if you don't want it to get lost. It seems easy to add a few lines to fold_binary_loc about it (not the best place, but that's where the others are) near the place that optimizes A / A to 1.0. You could try writing the patch, I don't foresee any trap.

Comment 7 Andrew Pinski 2021-08-07 22:59:23 UTC

See PR 47989 for the reason why this option is not enabled for scalar code and why it was only enabled for vectorized code.