With gfortran and g++ the computation of cos(x) and sin(x) is "optimized" by taking the real and imaginary parts of cexpi(x) (at least it is what I understand). This is working if and only if the computation of cexpi(x) is faster than the sum of the separate computations of cos(x) and sin(x). Now consider the following code: integer, parameter :: n=1000000 integer :: i real(8) :: pi, ss, sc, t, dt pi = acos(-1.0d0) dt=pi/n sc=0 ss=0 t=0 do i= 1, 100*n sc = sc + cos(t-dt) ss = ss + sin(t) t = t+dt end do print *, sc, ss end the result is (G5 1.8Ghz, OSX 10.3.9): [karma] bug/timing% gfc -O3 sincos.f90 [karma] bug/timing% time a.out -6.324121638644320E-002 -2.934958087315009E-003 13.020u 0.050s 0:13.59 96.1% 0+0k 0+2io 0pf+0w It is easy to see that I have fooled the optimizer with the line sc = sc + cos(t-dt) If I replace it by: sc = sc + cos(t) the result is now (over a 50% increase of the CPU time): [karma] bug/timing% gfc -O3 sincos_o.f90 [karma] bug/timing% time a.out -6.324121573032526E-002 -2.934958087315009E-003 21.740u 0.080s 0:22.18 98.3% 0+0k 0+2io 0pf+0w to be compared with the result of the code: integer, parameter :: n=1000000 integer :: i real(8) :: pi, ss, sc, t, dt complex(8) :: z, dz pi = acos(-1.0d0) dt=pi/n dz=cmplx(0.0d0,dt,8) sc=0 ss=0 z=0 do i= 1, 100*n sc = sc + real(exp(z)) ss = ss + aimag(exp(z)) z = z+dz end do print *, sc, ss end is [karma] bug/timing% gfc -O3 cexp.f90 [karma] bug/timing% time a.out -6.324121573032526E-002 -2.934958087315009E-003 20.850u 0.110s 0:21.45 97.7% 0+0k 0+2io 0pf+0w Following the comments in PR #30969, 30980, and 31161, I have understood that on OSX cexpi "fallback" to cexp in perfect agreement with the above timings. So it would probably nice to disable the sincos "optimisation" on platforms that do not support fast cexpi such as OSX (as presently configured). Note that on Sat, 30 Sep 2006 in http://gcc.gnu.org/ml/fortran/2006-09/msg00454.html I have reported (in vain) a timing regression for the fatigue.f90 polyhedron test case. Is this related to this pseudo-optimization or to another change?
The only reason why cexp is slow on PPC darwin is because the ABI is stupid. Complex float arguments are passed via the GPR and returned also the same way instead of via the FPRs. So you will get a transfer of registers. This is also true of PPC64 darwin, why they made the same mistake twice I have no idea, guess they did not expect people to use complex that much.
Andrew, Thanks for the answer. Additional timings for AMD Opteron(tm) Processor 250, 2.4Ghz: Target: x86_64-unknown-linux-gnu ... gcc version 4.3.0 20061231 (experimental) [tocata] test/fortran> gfc -O3 sincos.f90 [tocata] test/fortran> time a.out -6.324121691031215E-002 -2.934957388823078E-003 19.847u 0.001s 0:20.41 97.2% 0+0k 0+0io 0pf+0w [tocata] test/fortran> gfc -O3 sincos_o.f90 [tocata] test/fortran> time a.out -6.324121619598655E-002 -2.934957388823078E-003 19.793u 0.000s 0:19.80 99.9% 0+0k 0+0io 0pf+0w [tocata] test/fortran> gfc -O3 cexp.f90 [tocata] test/fortran> time a.out -6.324121619598655E-002 -2.934957388823078E-003 15.613u 0.000s 0:15.63 99.8% 0+0k 0+0io 0pf+0w sin+cos is not optimized as cexpi. Target: i386-pc-linux-gnu ... gcc version 4.3.0 20070225 (experimental) [tocata] test/fortran> gfc32 -Wa,-32 -O3 -fdump-tree-optimized sincos.f90 [tocata] test/fortran> time a.out -6.324122144403047E-002 -2.934963088285132E-003 10.757u 0.000s 0:10.76 99.9% 0+0k 0+0io 0pf+0w [tocata] test/fortran> gfc32 -Wa,-32 -O3 -fdump-tree-optimized sincos_o.f90 tocata] test/fortran> time a.out -6.324122124732012E-002 -2.934963117388848E-003 7.291u 0.001s 0:07.47 97.5% 0+0k 0+0io 4pf+0w [tocata] test/fortran> gfc32 -Wa,-32 -O3 -fdump-tree-optimized cexp.f90 [tocata] test/fortran> time a.out -6.324122124732012E-002 -2.934963117388848E-003 11.412u 0.000s 0:11.41 100.0% 0+0k 0+0io 0pf+0w sin+cos is optimized as cexpi which is faster than cexp -> real optimization! The i386 code is almost twice as fast as the x86_64 one.
BTW, did I miss an option to turn this optimization off?
There is no option to turn it off. But for !TARGET_C99_FUNCTIONS and !TARGET_HAS_SINCOS targets it's off. Usually (in fact, for every libm I looked into), cexp is implemented as complex double cexp (complex double x) { double cos = cos (imag(x)); double sin = sin (imag(x)); double e = 1; if (real(x) != 0) e = exp (real(x)); ... possibly computing cos and sin in an efficient way (using sincos). So cexp () should be never slower than calling sin () and cos (). If the ABI were not stupid of course ;) Does darwin have a sincos() library function?
> There is no option to turn it off. But for !TARGET_C99_FUNCTIONS and > !TARGET_HAS_SINCOS targets it's off. From my understanding of the thread http://gcc.gnu.org/ml/gcc/2007-03/msg00639.html if !TARGET_64BIT, then TARGET_C99_FUNCTIONS depends on darwin_macosx_version_min which seems presently default to 10.1. So TARGET_C99_FUNCTIONS seems to be set to 0 at least for a G5 under OSX 10.3 and a G4 under 10.4. > Does darwin have a sincos() library function? I don't know. If there is no answer in this PR, I can ask the question on gcc@gcc.gnu.org. From the behavior reported in PR30969, PR30980, and PR31161, it seems that the optimization is off for gcc, but on for g++ and gfortran, though I cannot figure out why. > If the ABI were not stupid of course ;) Since sin() and cos() are non trivial functions, I am very surprised that a wrong API makes a 50% difference. If the API is so time consuming, why not inline sin() and cos()? In addition a decent optimizer should be able to eliminate redundant part of the codes, making the use of a sincos() function not necessary, or am I too naive? What is the best way to collect data on the different platform to evaluate how this optimization is really working? I have only access to OSX and Intel or AMD64 under Linux.
Subject: Re: pseudo-optimzation with sincos/cexpi On 19 Mar 2007 12:43:49 -0000, dominiq at lps dot ens dot fr <gcc-bugzilla@gcc.gnu.org> wrote: > > Since sin() and cos() are non trivial functions, I am very surprised > that a wrong API makes a 50% difference. Well Here is how it can make a 50% difference (at least on the Cell, the 970 has less of a restriction and only the dispatch group is rejected). Modern PowerPC processors like not to store stuff to the stack and then load it again with in a number of cycles (cell is around 50 cycles while the 970 is just within a dispatch group). Transfering between the integer register set and the floating point register set can only be done via memory so you will get a LHS or a LRU reject (depending on what processor you are on). This can either cause a 50 cycle delay or reject of the dispatch group (the later can cause multiple rejects). The number of cycles used up by this issue can add up with both sides of the function having this hazard. Thanks, Andrew Pinski
I agree it's surprising to get user-visible effects with the TARGET_C99_FUNCTIONS difference between the frontends, but they are (supposed to) providing C99 runtime completion by their runtime libraries. And they rely on full C99 support.
> The only reason why cexp is slow on PPC darwin is because the ABI is stupid. > Complex float arguments are passed via the GPR and returned also the same way > instead of via the FPRs. So you will get a transfer of registers. This is > also true of PPC64 darwin, why they made the same mistake twice I have no idea, > guess they did not expect people to use complex that much. Is this also true for complex double on 32 bit architectures (i.e., 4 GPRs) or do you mean the GPR is used to pass a pointer? > ... . The number of cycles used up by this issue > can add up with both sides of the function having this hazard. You are comforting my prejudice against using procedures in critical loops. Now if you cannot convince darwin people to fix the problem, I cannot how I could. My short term interests are: (1) to understand how this reverse optimization is triggered. (2) to know what are the non-Linux platform that are affected beside Darwin. (3) to get a work around less hackish that what I did in my first example.
> I agree it's surprising to get user-visible effects with the > TARGET_C99_FUNCTIONS difference between the frontends, > but they are (supposed to) providing C99 runtime completion > by their runtime libraries. And they rely on full C99 support. Do you mean that g++ and gfortran set TARGET_C99_FUNCTIONS on their own? If yes, the cexpi optimization should probably another condition: what is the point to replace sin+cos by a call to a function calling sin+cos?
That sin+cos is practically sincos (so you get one for free). Just not every library exports that sincos.
Subject: Re: pseudo-optimzation with sincos/cexpi > That sin+cos is practically sincos (so you get one for free). Just not every > library exports that sincos. Does not this assume that it exists a real sincos(x) and not a faked one thorugh cexp((1,x))?
Depends on how you name it ;) You can propose that we only enable sincos transformation if TARGET_HAS_SINCOS is set, I wouldn't necessarily object to that. (The targets I care for have a sincos)
> You can propose that we only enable sincos transformation > if TARGET_HAS_SINCOS is set, I wouldn't necessarily object to > that. (The targets I care for have a sincos) Sound reasonable: replacing: return (TARGET_HAS_SINCOS || TARGET_C99_FUNCTIONS) && optimize; by return TARGET_HAS_SINCOS && optimize; in gcc/tree-ssa-math-opts.c, isn't it? I can even do a preliminary test to check that it does not break anything. What's bother me is that i suspect the problem is present for all non-Linux platforms and to have no feedback from them. If you have some idea about the way to trigger their interest, it would be nice.
The recommended way is to post a message to gcc@gcc.gnu.org or gcc-patches@gcc.gnu.org
> Is this also true for complex double on 32 bit architectures (i.e., 4 GPRs) > or do you mean the GPR is used to pass a pointer? 4 GPRS Yes this was a stupid decission on Apple's part for not looking at fixing GCC before setting an ABI. And really this problem is only with PPC no other target has this stupid ABI issue.
> The recommended way is to post a message to gcc@gcc.gnu.org or > gcc-patches@gcc.gnu.org I'll follow your advice, but before I'ld like some feedback about what follows. I have applied the following patch --- gcc-4.3-20070316/gcc/tree-ssa-math-opts.c Thu Mar 8 20:02:51 2007 +++ gcc-4.3-20070317/gcc/tree-ssa-math-opts.c Tue Mar 20 17:21:16 2007 @@ -704,9 +704,7 @@ gate_cse_sincos (void) { /* Make sure we have either sincos or cexp. */ - return (TARGET_HAS_SINCOS - || TARGET_C99_FUNCTIONS) - && optimize; + return TARGET_HAS_SINCOS && optimize; } struct tree_opt_pass pass_cse_sincos = I have regtested it with no change in the reports for gcc, g++, gfortran, and objc. The timings before and after are before after -O0 -O1 % -O0 -O1 % g++-4 sincos_o.c 6.2 9.6 +55 6.3 5.6 -11 gfc sincos_o.f90 6.3 9.6 +52 6.4 5.5 -14 for the following C and Fotran tests: [karma] bug/timing% cat sincos_o.c #include <math.h> #include <stdio.h> int main() { long n = 1000000; long i; double mo = -1.0; double pi = acos(mo); double sc = 0.0; double ss = 0.0; double t = 0.0; double dt = pi/n; printf("%.17g \n", pi); printf("%.17g \n", dt); for (i=0; i< 40*n; i++) { sc += cos(t); ss += sin(t); t += dt; } printf("%.17g %.17g \n", sc, ss); } [karma] bug/timing% cat sincos_o.f90 integer, parameter :: n=1000000 integer :: i real(8) :: pi, ss, sc, t, dt pi = acos(-1.0d0) dt=pi/n sc=0 ss=0 t=0 do i= 1, 40*n sc = sc + cos(t) ss = ss + sin(t) t = t+dt end do print *, sc, ss end So from the PPC Darwin point of view, everything is working as expected. I have done a search on the regtest list based on FAIL: gcc.dg/builtins-59.c scan-tree-dump __builtin_cexpi assuming that platforms that do not pass it, are likely to have not __builtin_cexpi, thus are exposed to the same bad optimization. I have found the following list tested on a regular basis: powerpc-apple-darwin8.5.0 hppa2.0w-hp-hpux11.11 v850-unknown-elf sparc-unknown-elf sh-unknown-elf powerpc-unknown-eabisim mips-unknown-elf m32r-unknown-elf m32c-unknown-elf avr-unknown-none frv-unknown-elf arm-unknown-elf cris-axis-elf arm-none-eabi As far as I can tell, only the first two are tested against g++ and gfortran. Note that the list does not include powerpc64-apple-darwin8.8.0. So it seems that it has __builtin_cexpi. I don't know what will be the final decision about the proposed patch, but there is no "emergency" since I can use it for my coming weekly builds. I would prefer to have some feedback from the listed platforms before seeing the patch applied.
It would be nice to know whether darwin does not implement cexp in an optimal way (special casing zero real part and dispatching to a sincos equivalent for the imaginary part) or if the issue is only a bad ABI for complex values. Otherwise the proposal looks ok, those other targets are mostly embedded ones and as such don't care too much about optimized sincos probably.
> It would be nice to know whether darwin does not implement cexp in an optimal way ... I have forgotten to mention that I did some (quick) profiling: cexp seens trivially implemented. It calls sin, cos and exp + the ABI problem mentionned by Andrew. It seems that any inplementation of cexp calling sin and cos would at best lead to a draw and more likely to a regression in timing. Now I cannot rule out that on some platforms cexp is implemented in a clever way, using some kind of sincos alrgorithm (this is what I would like to know).
Still there on powerpc-apple-darwin9 [trunk revision 168762]: gfc sincos_o.f90 4.230s 8.32s 7.98s gfc -m64 sincos_o.f90 4.60s 7.98s 7.07s gcc46 sincos_o.c 4.17s 7.95s 7.63s gcc46 -m64 sincos_o.c 4.55s 7.55s 7.03s
I have forgotten to give the optimization levels: -O1 -O3 > gfc sincos_o.f90 4.23s 8.32s 7.98s > gfc -m64 sincos_o.f90 4.60s 7.98s 7.07s > gcc46 sincos_o.c 4.17s 7.95s 7.63s > gcc46 -m64 sincos_o.c 4.55s 7.55s 7.03s
See also PR46926. The flags controlling whether we can emit a call to sincos need some work (but it's not exactly clear how it should work).
It seems we should give targets like darwin a permanent workaround for this issue. Since config can't correctly determine in all cases whether an implementation has a "fast enough" sincos/cexp, the few targets where the current flags don't work should be able to disable the transformation. I can create a cse-sincos option flag, set on by default for most platforms. The darwin platform maintainer (and others, if necessary) could add -fno-cse-sincos to the platform defaults. Any concerns with this approach?
On Wed, 1 Jun 2011, wschmidt at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31249 > > William J. Schmidt <wschmidt at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |wschmidt at gcc dot gnu.org > > --- Comment #22 from William J. Schmidt <wschmidt at gcc dot gnu.org> 2011-06-01 16:59:15 UTC --- > It seems we should give targets like darwin a permanent workaround for this > issue. Since config can't correctly determine in all cases whether an > implementation has a "fast enough" sincos/cexp, the few targets where the > current flags don't work should be able to disable the transformation. > > I can create a cse-sincos option flag, set on by default for most platforms. > The darwin platform maintainer (and others, if necessary) could add > -fno-cse-sincos to the platform defaults. > > Any concerns with this approach? The concern is that we might want to vectorize sincos and that might have a fast implementation. I think rather than a flag disabling sincos a target with slow sincos should expand that to calling sin and cos (they can define an optab for sincos for example). That would also make user code that uses sincos fast.