Bug 33780 - different results between O3 and O0
Summary: different results between O3 and O0
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.3.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2007-10-15 14:49 UTC by Joost VandeVondele
Modified: 2008-02-11 11:25 UTC (History)
1 user (show)

See Also:
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2007-10-15 15:01:08


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Joost VandeVondele 2007-10-15 14:49:41 UTC
The following program:

FUNCTION F(r,a,e)
 REAL*8 :: a(2:15)
 REAL*8 :: r,f,e
 f=0
 DO i = 2, 15
    f = f + a(i)/(r**(i-1)*REAL(i-1,8))
 END DO
 f=f-e
END FUNCTION F

PROGRAM TEST

  REAL*8 :: a(2:15)=(/-195.771601327700D0,  15343.7861339500D0, &
                      -530864.458651600D0,  10707934.3905800D0, &
                      -140099704.789000D0,  1250943273.78500D0, &
                      -7795458330.67600D0,  33955897217.3100D0, &
                      -101135640744.000D0,  193107995718.700D0, &
                      -193440560940.000D0, -4224406093.91800D0, &
                       217192386506.500D0, -157581228915.500D0/)

  REAL*8 :: r=4.51556282882533D0
  REAL*8 :: e=-2.21199635966809471000D0
  REAL*8 :: f

  write(6,*)  f(r,a,e)

END PROGRAM TEST

generates different results with gfortran 4.3.0 at O0 and O3

> gfortran -O0 -march=k8 -msse  test.f90
> ./a.out
   0.00000000000000

> gfortran -O3 -march=k8 -msse  test.f90
> ./a.out
 -1.143973804573761E-012

notice that this is without -ffast-math and with sse.

gfortran 4.1 gives 0.0 in all cases. 

I suspect that the reason is that 
__powidf2
is replaced by inline code at -O3 that gives slightly different results.

If this is expected behavior, maybe this substitution should be guarded by -ffast-math ?
Comment 1 Richard Biener 2007-10-15 15:01:07 UTC
Confirmed.  It changes for me -O2 vs. -O2 -fpeel-loops which causes the loop
in f to be completely unrolled (which at the first sight doesn't look wrong).
And we expand __powidf2 inline because the exponent is now a constant for all
calls.  This indeed might explain the difference as __powidf2 is implemented
as

TYPE
NAME (TYPE x, int m)
{
  unsigned int n = m < 0 ? -m : m;
  TYPE y = n % 2 ? x : 1;
  while (n >>= 1)
    {
      x = x * x;
      if (n % 2)
        y = y * x;
    }
  return m < 0 ? 1/y : y;
}  

while we expand a constant integral power to an optimal (as of the count of
multiplications / additions) series of multiplications and additions.

As Fortran allows open-coding of integral powers this is not a bug.  But
I'll leave final closing as INVALID to more fortran-knowing people.
Comment 2 Joost VandeVondele 2007-10-15 15:30:01 UTC
(In reply to comment #1)
> As Fortran allows open-coding of integral powers this is not a bug.  But
> I'll leave final closing as INVALID to more fortran-knowing people.

I agree that Fortran-wise this is not a bug. However, gcc tends to do some of the things that Fortran allows by default (x/y -> x*(1/y) ; x+Y+z -> (x=z)+y ; ..) only at -ffast-math. AFAIK, this seems to be the only optimisation in gcc that causes numerical results with CP2K to change going from -O0 to -O3. 
Comment 3 Dominique d'Humieres 2007-10-15 15:51:19 UTC
> that causes numerical results with CP2K to change going from -O0 to -O3. 

If you do expect that optimization optimizes your computation, you should expect some change of the numerical results, so put some tolerance in the comparisons.
How small can be the tolerance depends on your problem. I have modified your code to compute the polynomial in three different ways (yours, plus two others, the third one being the recommended one for "numerical stability", if I did not make some errors):

FUNCTION F(r,a,e,pf,qf)
 REAL*8 :: a(2:15)
 REAL*8 :: r,f,e, pr,pf,qf,rr
 f=0
 pf=0
 pr=r
 rr=1.0d0/r
 DO i = 2, 15
    f = f + a(i)/(r**(i-1)*REAL(i-1,8))
    pf = pf + a(i)/(pr*REAL(i-1,8))
    pr = r*pr
 END DO
 f=f-e
 pf=pf-e
 qf=a(15)/REAL(14,8)
 do i = 14, 2, -1
    qf=rr*qf+a(i)/REAL(i-1,8)
 end do
 qf=rr*qf-e
END FUNCTION F

PROGRAM TEST

  REAL*8 :: a(2:15)=(/-195.771601327700D0,  15343.7861339500D0, &
                      -530864.458651600D0,  10707934.3905800D0, &
                      -140099704.789000D0,  1250943273.78500D0, &
                      -7795458330.67600D0,  33955897217.3100D0, &
                      -101135640744.000D0,  193107995718.700D0, &
                      -193440560940.000D0, -4224406093.91800D0, &
                       217192386506.500D0, -157581228915.500D0/)

  REAL*8 :: r=4.51556282882533D0
  REAL*8 :: e=-2.21199635966809471000D0
  REAL*8 :: pf, qf
  REAL*8 :: f

  write(6,*)  f(r,a,e,pf,qf)
  write(6,*)  pf, qf

END PROGRAM TEST

[karma] f90/bug% gfc pr33780_red.f90
[karma] f90/bug% a.out
   0.0000000000000000     
  1.36957112317759311E-012  7.66053886991358013E-012
[karma] f90/bug% gfc -O3 pr33780_red.f90
[karma] f90/bug% a.out
 -1.14397380457376130E-012
  1.36957112317759311E-012  1.04861186749364478E-011

So for your problem the tolerance should be ~1.0E-11, and since pf and qf don't use powers, this is not a question of inlining, but how the operations are shuffled to "optimize" your computation.

Comment 4 Dominique d'Humieres 2007-10-18 15:23:19 UTC
> while we expand a constant integral power to an optimal (as of the count of
> multiplications / additions) series of multiplications and additions.

It seems the difference is coming from something else. I'll bet that with -O3 the powers are computed at compile time with some extra precision. At least it is what I conclude from the following code:

  integer :: ia(16) = (/ 1, 1, 2, 2, 3, 3, 4, 4, 6, 5, 6, 6, 10, 7, 9, 8 /)
  real(8) :: a(16), b(16), c(16), d(16)
  real(16) :: e(16)
  real(8), parameter :: r=4.51556282882533D0
  a(1) = r
  a(2) = a(1)*a(1)
  a(3:4) = a(2)*a(1:2)
  a(5:8) = a(4)*a(1:4)
  a(9:16) = a(8)*a(1:8)
  b(1) = r
  do i = 2, 16
     b(i) = b(ia(i))*b(i-ia(i))
  end do
  e(1) = r
  e(2) = e(1)*e(1)
  e(3:4) = e(2)*e(1:2)
  e(5:8) = e(4)*e(1:4)
  e(9:16) = e(8)*e(1:8)
  d = real(e,8)
  do i=1, 16
     c(i) = r**i
  end do
  do i=1, 16
     print *, a(i)/d(i)-1.0d0, b(i)/d(i)-1.0d0, c(i)/d(i)-1.0d0
  end do
  print *
  print *, sum(abs(a/d-1.0d0)), sum(abs(b/d-1.0d0)), sum(abs(c/d-1.0d0)), epsilon(r)
end

a stores the powers computed from the usual shift, b those computed from the "optimal" tree (from gcc/builtins.c, c those computed from the power, and d the values rounded from values with higher precision. At -O2 the output is:

   0.0000000000000000        0.0000000000000000        0.0000000000000000     
   0.0000000000000000        0.0000000000000000        0.0000000000000000     
   0.0000000000000000        0.0000000000000000        0.0000000000000000     
  2.22044604925031308E-016  2.22044604925031308E-016  2.22044604925031308E-016
  2.22044604925031308E-016  2.22044604925031308E-016  2.22044604925031308E-016
  2.22044604925031308E-016   0.0000000000000000       2.22044604925031308E-016
  2.22044604925031308E-016  2.22044604925031308E-016  2.22044604925031308E-016
  4.44089209850062616E-016  4.44089209850062616E-016  4.44089209850062616E-016
  2.22044604925031308E-016   0.0000000000000000       2.22044604925031308E-016
  4.44089209850062616E-016  2.22044604925031308E-016  4.44089209850062616E-016
  4.44089209850062616E-016  2.22044604925031308E-016  4.44089209850062616E-016
  4.44089209850062616E-016   0.0000000000000000       4.44089209850062616E-016
  4.44089209850062616E-016  4.44089209850062616E-016  4.44089209850062616E-016
  4.44089209850062616E-016  2.22044604925031308E-016  4.44089209850062616E-016
  6.66133814775093924E-016   0.0000000000000000       6.66133814775093924E-016
  6.66133814775093924E-016  6.66133814775093924E-016  6.66133814775093924E-016

  5.10702591327572009E-015  2.88657986402540701E-015  5.10702591327572009E-015  2.22044604925031308E-016

a and c are equal, while at -O3 the output is

   0.0000000000000000        0.0000000000000000        0.0000000000000000     
   0.0000000000000000        0.0000000000000000        0.0000000000000000     
   0.0000000000000000        0.0000000000000000        0.0000000000000000     
  2.22044604925031308E-016  2.22044604925031308E-016   0.0000000000000000     
  2.22044604925031308E-016  2.22044604925031308E-016   0.0000000000000000     
  2.22044604925031308E-016   0.0000000000000000        0.0000000000000000     
  2.22044604925031308E-016  2.22044604925031308E-016   0.0000000000000000     
  4.44089209850062616E-016  4.44089209850062616E-016   0.0000000000000000     
  2.22044604925031308E-016   0.0000000000000000        0.0000000000000000     
  4.44089209850062616E-016  2.22044604925031308E-016   0.0000000000000000     
  4.44089209850062616E-016  2.22044604925031308E-016   0.0000000000000000     
  4.44089209850062616E-016   0.0000000000000000        0.0000000000000000     
  4.44089209850062616E-016  4.44089209850062616E-016   0.0000000000000000     
  4.44089209850062616E-016  2.22044604925031308E-016   0.0000000000000000     
  6.66133814775093924E-016   0.0000000000000000        0.0000000000000000     
  6.66133814775093924E-016  6.66133814775093924E-016   0.0000000000000000     

  5.10702591327572009E-015  2.88657986402540701E-015   0.0000000000000000       2.22044604925031308E-016

c and d are equal. If the difference were coming only from the "optimal" tree, c should be equal to b.

Comment 5 Richard Biener 2008-02-05 11:08:36 UTC
The difference with your last testcase is indeed that we unroll the loop
setting c[] and constant-fold the result by computing the integral powers
with 0.5ulp precision (and a single rounding(!)).  But all your other
computations involve intermediate roundings, so this is still invalid IMHO.

Or are you complaining that we constant fold integer powers as if they were
non-integer powers (not using the fact that Fortran would allow expansion
to multiple FLOPs)?
Comment 6 Joost VandeVondele 2008-02-11 11:25:54 UTC
(In reply to comment #5)
> Or are you complaining that we constant fold integer powers as if they were
> non-integer powers

Richard,

I think I'm suggesting more a enhancement than a bug fix. Some good quality compiler like IBM's xlf90 guarantee bit-wise identical results at all optimisation levels, unless one specifies -qno-strict (which is included by default). My idea was/is that optimization independent results are (on the right hardware, i.e. not x87) also provided by gcc, unless one specifies -ffast-math. The above testcase is the only one I found in our 500kLOC simulation package that changed results as I changed optimisation levels, so I think it is valuable to make the underlying optimisation depending on a compile time flag (which is included in -ffast-math).

From a Fortran language point of view the compiler is free to use any implementation to compute a**i, including one that varies with optimisation level or context.