31249 – pseudo-optimization with sincos/cexpi

Bug 31249 - pseudo-optimization with sincos/cexpi

Summary: pseudo-optimization with sincos/cexpi

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	4.3.0

Importance:	P3 minor
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:	46926
Blocks:
	Show dependency tree / graph

Reported:	2007-03-17 21:11 UTC by Dominique d'Humieres
Modified:	2022-03-08 16:20 UTC (History)
CC List:	4 users (show)

See Also:	30969 30980 31161
Host:
Target:	powerpc-apple-darwin7
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dominique d'Humieres 2007-03-17 21:11:36 UTC

With gfortran and g++ the computation of cos(x) and sin(x) is "optimized" by taking
the real and imaginary parts of cexpi(x) (at least it is what I understand). This is working
if and only if the computation of cexpi(x) is faster than the sum of the separate computations 
of cos(x) and sin(x). 

Now consider the following code:

integer, parameter :: n=1000000
integer :: i
real(8) :: pi, ss, sc, t, dt
pi = acos(-1.0d0)
dt=pi/n
sc=0
ss=0
t=0
do i= 1, 100*n
  sc = sc + cos(t-dt)
  ss = ss + sin(t)
  t = t+dt
end do
print *, sc, ss
end

the result is (G5 1.8Ghz, OSX 10.3.9):

[karma] bug/timing% gfc -O3 sincos.f90 
[karma] bug/timing% time a.out 
 -6.324121638644320E-002 -2.934958087315009E-003
13.020u 0.050s 0:13.59 96.1%    0+0k 0+2io 0pf+0w

It is easy to see that I have fooled the optimizer with the line

  sc = sc + cos(t-dt)

If I replace it by:

  sc = sc + cos(t)

the result is now (over a 50% increase of the CPU time):

[karma] bug/timing% gfc -O3 sincos_o.f90
[karma] bug/timing% time a.out
 -6.324121573032526E-002 -2.934958087315009E-003
21.740u 0.080s 0:22.18 98.3%    0+0k 0+2io 0pf+0w

to be compared with the result of the code:

integer, parameter :: n=1000000
integer :: i
real(8) :: pi, ss, sc, t, dt
complex(8) :: z, dz
pi = acos(-1.0d0)
dt=pi/n
dz=cmplx(0.0d0,dt,8)
sc=0
ss=0
z=0
do i= 1, 100*n
  sc = sc + real(exp(z))
  ss = ss + aimag(exp(z))
  z = z+dz
end do
print *, sc, ss
end

is

[karma] bug/timing% gfc -O3 cexp.f90
[karma] bug/timing% time a.out
 -6.324121573032526E-002 -2.934958087315009E-003
20.850u 0.110s 0:21.45 97.7%    0+0k 0+2io 0pf+0w

Following the comments in PR #30969, 30980, and 31161, I have understood that
on OSX cexpi "fallback" to cexp in perfect agreement with the above timings.

So it would probably nice to disable the sincos "optimisation" on platforms that
do not support fast cexpi such as OSX (as presently configured).

Note that on Sat, 30 Sep 2006 in

http://gcc.gnu.org/ml/fortran/2006-09/msg00454.html

I have reported (in vain) a timing regression for the fatigue.f90 polyhedron test case.
Is this related to this pseudo-optimization or to another change?

Comment 1 Andrew Pinski 2007-03-18 09:49:54 UTC

The only reason why cexp is slow on PPC darwin is because the ABI is stupid.  Complex float arguments are passed via the GPR and returned also the same way instead of via the FPRs.  So you will get a transfer of registers.  This is also true of PPC64 darwin, why they made the same mistake twice I have no idea, guess they did not expect people to use complex that much.

Comment 2 Dominique d'Humieres 2007-03-18 10:20:06 UTC

Andrew,

Thanks for the answer. Additional timings for AMD Opteron(tm) Processor 250, 2.4Ghz:

Target: x86_64-unknown-linux-gnu
...
gcc version 4.3.0 20061231 (experimental)

[tocata] test/fortran> gfc -O3 sincos.f90 
[tocata] test/fortran> time a.out 
 -6.324121691031215E-002 -2.934957388823078E-003
19.847u 0.001s 0:20.41 97.2%    0+0k 0+0io 0pf+0w

[tocata] test/fortran> gfc -O3 sincos_o.f90
[tocata] test/fortran> time a.out
 -6.324121619598655E-002 -2.934957388823078E-003
19.793u 0.000s 0:19.80 99.9%    0+0k 0+0io 0pf+0w

[tocata] test/fortran> gfc -O3 cexp.f90 
[tocata] test/fortran> time a.out
 -6.324121619598655E-002 -2.934957388823078E-003
15.613u 0.000s 0:15.63 99.8%    0+0k 0+0io 0pf+0w

sin+cos is not optimized as cexpi.

Target: i386-pc-linux-gnu
...
gcc version 4.3.0 20070225 (experimental)

[tocata] test/fortran> gfc32 -Wa,-32 -O3 -fdump-tree-optimized sincos.f90
[tocata] test/fortran> time a.out
 -6.324122144403047E-002 -2.934963088285132E-003
10.757u 0.000s 0:10.76 99.9%    0+0k 0+0io 0pf+0w

[tocata] test/fortran> gfc32 -Wa,-32 -O3 -fdump-tree-optimized sincos_o.f90
tocata] test/fortran> time a.out 
 -6.324122124732012E-002 -2.934963117388848E-003
7.291u 0.001s 0:07.47 97.5%     0+0k 0+0io 4pf+0w

[tocata] test/fortran> gfc32 -Wa,-32 -O3 -fdump-tree-optimized cexp.f90
[tocata] test/fortran> time a.out
 -6.324122124732012E-002 -2.934963117388848E-003
11.412u 0.000s 0:11.41 100.0%   0+0k 0+0io 0pf+0w

sin+cos is optimized as cexpi which is faster than cexp -> real optimization!
The i386 code is almost twice as fast as the x86_64 one.

Comment 3 Dominique d'Humieres 2007-03-19 09:28:13 UTC

BTW, did I miss an option to turn this optimization off?

Comment 4 Richard Biener 2007-03-19 10:43:32 UTC

There is no option to turn it off.  But for !TARGET_C99_FUNCTIONS and !TARGET_HAS_SINCOS targets it's off.  Usually (in fact, for every libm I looked
into), cexp is implemented as

complex double cexp (complex double x)
{
  double cos = cos (imag(x));
  double sin = sin (imag(x));
  double e = 1;
  if (real(x) != 0)
    e = exp (real(x));
...

possibly computing cos and sin in an efficient way (using sincos).  So
cexp () should be never slower than calling sin () and cos ().  If the ABI
were not stupid of course ;)

Does darwin have a sincos() library function?

Comment 5 Dominique d'Humieres 2007-03-19 12:43:49 UTC

> There is no option to turn it off.  But for !TARGET_C99_FUNCTIONS and
> !TARGET_HAS_SINCOS targets it's off.  

From my understanding of the thread

http://gcc.gnu.org/ml/gcc/2007-03/msg00639.html

if !TARGET_64BIT, then TARGET_C99_FUNCTIONS depends on 
darwin_macosx_version_min which seems presently default to 10.1.
So TARGET_C99_FUNCTIONS seems to be set to 0 at least for a G5 under 
OSX 10.3 and a G4 under 10.4.

> Does darwin have a sincos() library function?

I don't know. If there is no answer in this PR, I can ask the 
question on gcc@gcc.gnu.org.

From the behavior reported in PR30969, PR30980, and PR31161, it seems
that the optimization is off for gcc, but on for g++ and gfortran,
though I cannot figure out why.

> If the ABI were not stupid of course ;)

Since sin() and cos() are non trivial functions, I am very surprised 
that a wrong API makes a 50% difference.

If the API is so time consuming, why not inline sin() and cos()? 
In addition a decent optimizer should be able to eliminate redundant
part of the codes, making the use of a sincos() function not necessary,
or am I too naive?

What is the best way to collect data on the different platform
to evaluate how this optimization is really working?
I have only access to OSX and Intel or AMD64 under Linux.

Comment 6 pinskia@gmail.com 2007-03-19 17:52:20 UTC

Subject: Re:  pseudo-optimzation with sincos/cexpi

On 19 Mar 2007 12:43:49 -0000, dominiq at lps dot ens dot fr
<gcc-bugzilla@gcc.gnu.org> wrote:
>
> Since sin() and cos() are non trivial functions, I am very surprised
> that a wrong API makes a 50% difference.

Well Here is how it can make a 50% difference (at least on the Cell,
the 970 has less of a restriction and only the dispatch group is
rejected).  Modern PowerPC processors like not to store stuff to the
stack and then load it again with in a number of cycles (cell is
around 50 cycles while the 970 is just within a dispatch group).
Transfering between the integer register set and the floating point
register set can only be done via memory so you will get a LHS or a
LRU reject (depending on what processor you are on).  This can either
cause a 50 cycle delay or reject of the dispatch group (the later can
cause multiple rejects).  The number of cycles used up by this issue
can add up with both sides of the function having this hazard.

Thanks,
Andrew Pinski

Comment 7 Richard Biener 2007-03-20 11:04:09 UTC

I agree it's surprising to get user-visible effects with the TARGET_C99_FUNCTIONS
difference between the frontends, but they are (supposed to) providing C99
runtime completion by their runtime libraries.  And they rely on full C99 support.

Comment 8 Dominique d'Humieres 2007-03-20 13:57:00 UTC

> The only reason why cexp is slow on PPC darwin is because the ABI is stupid. 
> Complex float arguments are passed via the GPR and returned also the same way
> instead of via the FPRs.  So you will get a transfer of registers.  This is
> also true of PPC64 darwin, why they made the same mistake twice I have no idea,
> guess they did not expect people to use complex that much.

Is this also true for complex double on 32 bit architectures (i.e., 4 GPRs)
or do you mean the GPR is used to pass a pointer?

> ... .  The number of cycles used up by this issue
> can add up with both sides of the function having this hazard.

You are comforting my prejudice against using procedures in critical loops.

Now if you cannot convince darwin people to fix the problem, I cannot how I could.  My short term interests are:

(1) to understand how this reverse optimization is triggered.

(2) to know what are the non-Linux platform that are affected beside
Darwin.

(3) to get a work around less hackish that what I did in my first example.

Comment 9 Dominique d'Humieres 2007-03-20 14:03:07 UTC

> I agree it's surprising to get user-visible effects with the
> TARGET_C99_FUNCTIONS difference between the frontends, 
> but they are (supposed to) providing C99 runtime completion 
> by their runtime libraries.  And they rely on full C99 support.

Do you mean that g++ and gfortran set TARGET_C99_FUNCTIONS on
their own?

If yes, the cexpi optimization should probably another condition:
what is the point to replace sin+cos by a call to a function
calling sin+cos?

Comment 10 Richard Biener 2007-03-20 14:26:32 UTC

That sin+cos is practically sincos (so you get one for free).  Just not every
library exports that sincos.

Comment 11 Dominique d'Humieres 2007-03-20 14:57:52 UTC

Subject: Re:  pseudo-optimzation with sincos/cexpi

> That sin+cos is practically sincos (so you get one for free).  Just not every
> library exports that sincos.

Does not this assume that it exists a real sincos(x) and not a faked one
thorugh cexp((1,x))?

Comment 12 Richard Biener 2007-03-20 15:06:57 UTC

Depends on how you name it ;)  You can propose that we only enable sincos transformation if TARGET_HAS_SINCOS is set, I wouldn't necessarily object to that.
(The targets I care for have a sincos)

Comment 13 Dominique d'Humieres 2007-03-20 16:08:50 UTC

> You can propose that we only enable sincos transformation 
> if TARGET_HAS_SINCOS is set, I wouldn't necessarily object to
> that. (The targets I care for have a sincos)

Sound reasonable: replacing:

  return (TARGET_HAS_SINCOS
          || TARGET_C99_FUNCTIONS)
         && optimize;

by

  return TARGET_HAS_SINCOS
         && optimize;

in gcc/tree-ssa-math-opts.c, isn't it?

I can even do a preliminary test to check that it
does not break anything.

What's bother me is that i suspect the problem is present
for all non-Linux platforms and to have no feedback from them.
If you have some idea about the way to trigger their interest,
it would be nice.

Comment 14 Richard Biener 2007-03-20 16:12:04 UTC

The recommended way is to post a message to gcc@gcc.gnu.org or gcc-patches@gcc.gnu.org

Comment 15 Andrew Pinski 2007-03-20 16:14:37 UTC

> Is this also true for complex double on 32 bit architectures (i.e., 4 GPRs)
> or do you mean the GPR is used to pass a pointer?

4 GPRS

Yes this was a stupid decission on Apple's part for not looking at fixing GCC before setting an ABI.

And really this problem is only with PPC no other target has this stupid ABI issue.

Comment 16 Dominique d'Humieres 2007-03-21 15:36:32 UTC

> The recommended way is to post a message to gcc@gcc.gnu.org or
> gcc-patches@gcc.gnu.org

I'll follow your advice, but before I'ld like some feedback about what follows.

I have applied the following patch

--- gcc-4.3-20070316/gcc/tree-ssa-math-opts.c   Thu Mar  8 20:02:51 2007
+++ gcc-4.3-20070317/gcc/tree-ssa-math-opts.c   Tue Mar 20 17:21:16 2007
@@ -704,9 +704,7 @@
 gate_cse_sincos (void)
 {
   /* Make sure we have either sincos or cexp.  */
-  return (TARGET_HAS_SINCOS
-         || TARGET_C99_FUNCTIONS)
-        && optimize;
+  return TARGET_HAS_SINCOS && optimize;
 }
 
 struct tree_opt_pass pass_cse_sincos =

I have regtested it with no change in the reports for gcc, g++, gfortran, 
and objc. The timings before and after are

			      before		       after
			-O0	-O1	 %	-O0	-O1	 %

g++-4 sincos_o.c	6.2	9.6	+55	6.3	5.6	-11
gfc sincos_o.f90	6.3	9.6	+52	6.4	5.5	-14

for the following C and Fotran tests:

[karma] bug/timing% cat sincos_o.c
#include <math.h>
#include <stdio.h>

int main()
{

  long    n = 1000000;
  long    i;
  double mo = -1.0;
  double pi = acos(mo);
  double sc = 0.0;
  double ss = 0.0;
  double  t = 0.0;
  double dt = pi/n;

  printf("%.17g \n", pi);
  printf("%.17g \n", dt);
  for (i=0; i< 40*n; i++) {
    sc += cos(t);
    ss += sin(t);
    t += dt;
  }
  printf("%.17g %.17g \n", sc, ss);
}

[karma] bug/timing% cat sincos_o.f90 
integer, parameter :: n=1000000
integer :: i
real(8) :: pi, ss, sc, t, dt
pi = acos(-1.0d0)
dt=pi/n
sc=0
ss=0
t=0
do i= 1, 40*n
  sc = sc + cos(t)
  ss = ss + sin(t)
  t = t+dt
end do
print *, sc, ss
end

So from the PPC Darwin point of view, everything is working as expected.

I have done a search on the regtest list based on

FAIL: gcc.dg/builtins-59.c scan-tree-dump __builtin_cexpi

assuming that platforms that do not pass it, are likely to have not
__builtin_cexpi, thus are exposed to the same bad optimization.
I have found the following list tested on a regular basis:

powerpc-apple-darwin8.5.0
hppa2.0w-hp-hpux11.11
v850-unknown-elf
sparc-unknown-elf
sh-unknown-elf
powerpc-unknown-eabisim
mips-unknown-elf
m32r-unknown-elf
m32c-unknown-elf
avr-unknown-none
frv-unknown-elf
arm-unknown-elf
cris-axis-elf
arm-none-eabi

As far as I can tell, only the first two are tested against g++ and 
gfortran.

Note that the list does not include powerpc64-apple-darwin8.8.0.
So it seems that it has __builtin_cexpi.

I don't know what will be the final decision about the proposed patch,
but there is no "emergency" since I can use it for my coming weekly builds.
I would prefer to have some feedback from the listed platforms before
seeing the patch applied.

Comment 17 Richard Biener 2007-03-21 15:57:35 UTC

It would be nice to know whether darwin does not implement cexp in an optimal
way (special casing zero real part and dispatching to a sincos equivalent for
the imaginary part) or if the issue is only a bad ABI for complex values.  Otherwise the proposal looks ok, those other targets are mostly embedded ones
and as such don't care too much about optimized sincos probably.

Comment 18 Dominique d'Humieres 2007-03-21 16:09:08 UTC

> It would be nice to know whether darwin does not implement cexp in an optimal way ...

I have forgotten to mention that I did some (quick) profiling: cexp seens trivially implemented.
It calls sin, cos and exp + the ABI problem mentionned by Andrew.  It seems that any 
inplementation of cexp calling sin and cos would at best lead to a draw and more likely to
a regression in timing.  Now I cannot rule out that on some platforms cexp is implemented
in a clever way, using some kind of sincos alrgorithm (this is what I would like to know).

Comment 19 Dominique d'Humieres 2011-01-14 13:04:28 UTC

Still there on powerpc-apple-darwin9 [trunk revision 168762]:

gfc sincos_o.f90        4.230s  8.32s  7.98s
gfc -m64 sincos_o.f90    4.60s 7.98s  7.07s
gcc46 sincos_o.c      4.17s 7.95s  7.63s
gcc46 -m64 sincos_o.c  4.55s  7.55s  7.03s

Comment 20 Dominique d'Humieres 2011-01-14 13:27:24 UTC

I have forgotten to give the optimization levels:
                                                               -O1      -O3
> gfc sincos_o.f90                4.23s    8.32s    7.98s
> gfc -m64 sincos_o.f90    4.60s   7.98s    7.07s
> gcc46 sincos_o.c              4.17s   7.95s    7.63s
> gcc46 -m64 sincos_o.c  4.55s    7.55s    7.03s

Comment 21 Richard Biener 2011-01-14 13:33:20 UTC

See also PR46926.  The flags controlling whether we can emit a call to
sincos need some work (but it's not exactly clear how it should work).

Comment 22 Bill Schmidt 2011-06-01 16:59:15 UTC

It seems we should give targets like darwin a permanent workaround for this issue.  Since config can't correctly determine in all cases whether an implementation has a "fast enough" sincos/cexp, the few targets where the current flags don't work should be able to disable the transformation.

I can create a cse-sincos option flag, set on by default for most platforms.  The darwin platform maintainer (and others, if necessary) could add -fno-cse-sincos to the platform defaults.

Any concerns with this approach?

Comment 23 rguenther@suse.de 2011-06-01 19:02:03 UTC

On Wed, 1 Jun 2011, wschmidt at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31249
> 
> William J. Schmidt <wschmidt at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |wschmidt at gcc dot gnu.org
> 
> --- Comment #22 from William J. Schmidt <wschmidt at gcc dot gnu.org> 2011-06-01 16:59:15 UTC ---
> It seems we should give targets like darwin a permanent workaround for this
> issue.  Since config can't correctly determine in all cases whether an
> implementation has a "fast enough" sincos/cexp, the few targets where the
> current flags don't work should be able to disable the transformation.
> 
> I can create a cse-sincos option flag, set on by default for most platforms. 
> The darwin platform maintainer (and others, if necessary) could add
> -fno-cse-sincos to the platform defaults.
> 
> Any concerns with this approach?

The concern is that we might want to vectorize sincos and that might
have a fast implementation.  I think rather than a flag disabling
sincos a target with slow sincos should expand that to calling
sin and cos (they can define an optab for sincos for example).

That would also make user code that uses sincos fast.