Bug 82004 - [8 Regression] SPEC CPU2017 628.pop2_s miscompare
Summary: [8 Regression] SPEC CPU2017 628.pop2_s miscompare
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 8.0
: P1 normal
Target Milestone: 8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks: spec 84613
  Show dependency treegraph
 
Reported: 2017-08-28 09:51 UTC by Andrey Guskov
Modified: 2020-08-11 14:17 UTC (History)
6 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-09-05 00:00:00


Attachments
gcc8-pr82004.patch (1.01 KB, patch)
2018-02-19 17:03 UTC, Jakub Jelinek
Details | Diff
Dump as requested in #c31 (21.46 KB, application/x-7z-compressed)
2018-03-26 14:54 UTC, Andrey Guskov
Details
pr82004_dumps.tar.xz (221.58 KB, application/x-xz)
2018-03-26 17:50 UTC, Jakub Jelinek
Details
gcc8-pr82004.patch (2.32 KB, patch)
2018-03-27 16:12 UTC, Jakub Jelinek
Details | Diff
gcc8-pr82004.patch (1.62 KB, patch)
2018-03-27 17:08 UTC, Jakub Jelinek
Details | Diff
More testing 7-aug-2020 (1.12 KB, text/plain)
2020-08-07 22:24 UTC, john henning
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andrey Guskov 2017-08-28 09:51:27 UTC
I`m seeing a miscompare in 628.pop2_s compiled with the following option set:

-m64 -march=core-avx2 -mfpmath=sse -Ofast -funroll-loops -fconvert=big-endian -fopenmp

The guilty revision is r251230. Despite exp(log(c)*x) being faster, it seems to lack the precision of pow(c,x).
Comment 1 Richard Biener 2017-08-28 10:47:07 UTC
Can you see what 'c' is in the case of 628.pop2_s?
Comment 2 Andrey Guskov 2017-08-28 13:41:01 UTC
Approximately a hundred different variables and function calls, the majority of which are raised to the second or third power, sometimes fourth or fifth. As 628.pop2_s is mostly written in Fortran 90, all of the cases look like c**x. I`m not sure if this gets transformed into pow() when x is constant, though.

As for the cases where x is either not constant or non-integer, there are 22 of them:

co2calc.F90:          x1(i) = c10 ** (-phhi(i))
co2calc.F90:          x2(i) = c10 ** (-phlo(i))

ecosys_mod.F90:       Tfunc = Q_10**(((TEMP + T0_Kelvin)-(Tref + T0_Kelvin))/c10)
ecosys_mod.F90:       zoo_loss = z_mort2 * Zprime**1.4_r8 + z_mort * Zprime
ecosys_mod.F90:       TfuncS = 2.5_r8**(((TEMP + T0_Kelvin) - (Tref + T0_Kelvin)) / c10)

hmix_del4.F90:        AMF(:,:,iblock) = (UAREA(:,:,iblock)/uarea_equator)**1.5
hmix_del4.F90:        AHF(:,:,iblock) = (TAREA(:,:,iblock)/uarea_equator)**1.5

hmix_gm.F90:          BOLUS_SP(:,:,bid) = 50.0_r8 * sqrt(U_ISOP**c2 + V_ISOP**c2)

overflows.F90:        phi = c1-Fgeo**(-c2/c3)

seq_flux_mct.F90:     anidr = (.026_R8/(cosz**1.7_R8 + 0.065_R8))
shr_flux_mod.F90:     temp = (zkB*zkB)**(1.0_R8/a2) ! note: zkB < 0, zkB*zkB > 0
shr_flux_mod.F90:     temp = (zeta*zeta)**(1.0_R8/a2) ! note: zeta < 0, zeta*zeta > 0

state_mod.F90:        st15 = salt**(1.5_r8)

sw_absorption.F90:    chlamnt = 10**(logchl)

time_management.F90:  ifact = 10**(string_length - n)

vmix_kpp.F90:         cg = cstar*vonkar*(c_s*vonkar*epssfc)**p33
vmix_kpp.F90:         bckgrnd_vdc_psis= bckgrnd_vdc_psim*exp(-(0.4_r8*(tlatd+28.9_r8))**c2)
vmix_kpp.F90:         bckgrnd_vdc_psin= bckgrnd_vdc_psim*exp(-(0.4_r8*(tlatd-28.9_r8))**c2)
vmix_kpp.F90:         bckgrnd_vdc1 * (tlatd/10.0_r8)**c2
vmix_kpp.F90:         WM(i,j) = vonkar*USTAR(i,j)*(c1 - c16*ZETA(i,j))**p25
vmix_kpp.F90:         WM(i,j) = vonkar*(a_m*(USTAR(i,j)**3)-c_m*ZETAH(i,j))**p33
vmix_kpp.F90:         WS(i,j) = vonkar*(a_s*(USTAR(i,j)**3)-c_s*ZETAH(i,j))**p33


...Something tells me I didn`t get the question.
Comment 3 Richard Biener 2017-08-29 08:08:49 UTC
(In reply to Andrey Guskov from comment #2)
> Approximately a hundred different variables and function calls, the majority
> of which are raised to the second or third power, sometimes fourth or fifth.
> As 628.pop2_s is mostly written in Fortran 90, all of the cases look like
> c**x. I`m not sure if this gets transformed into pow() when x is constant,
> though.
> 
> As for the cases where x is either not constant or non-integer, there are 22
> of them:
> 
> co2calc.F90:          x1(i) = c10 ** (-phhi(i))
> co2calc.F90:          x2(i) = c10 ** (-phlo(i))
> 
> ecosys_mod.F90:       Tfunc = Q_10**(((TEMP + T0_Kelvin)-(Tref +
> T0_Kelvin))/c10)
> ecosys_mod.F90:       zoo_loss = z_mort2 * Zprime**1.4_r8 + z_mort * Zprime
> ecosys_mod.F90:       TfuncS = 2.5_r8**(((TEMP + T0_Kelvin) - (Tref +
> T0_Kelvin)) / c10)
> 
> hmix_del4.F90:        AMF(:,:,iblock) =
> (UAREA(:,:,iblock)/uarea_equator)**1.5
> hmix_del4.F90:        AHF(:,:,iblock) =
> (TAREA(:,:,iblock)/uarea_equator)**1.5
> 
> hmix_gm.F90:          BOLUS_SP(:,:,bid) = 50.0_r8 * sqrt(U_ISOP**c2 +
> V_ISOP**c2)
> 
> overflows.F90:        phi = c1-Fgeo**(-c2/c3)
> 
> seq_flux_mct.F90:     anidr = (.026_R8/(cosz**1.7_R8 + 0.065_R8))
> shr_flux_mod.F90:     temp = (zkB*zkB)**(1.0_R8/a2) ! note: zkB < 0, zkB*zkB
> > 0
> shr_flux_mod.F90:     temp = (zeta*zeta)**(1.0_R8/a2) ! note: zeta < 0,
> zeta*zeta > 0
> 
> state_mod.F90:        st15 = salt**(1.5_r8)
> 
> sw_absorption.F90:    chlamnt = 10**(logchl)
> 
> time_management.F90:  ifact = 10**(string_length - n)
> 
> vmix_kpp.F90:         cg = cstar*vonkar*(c_s*vonkar*epssfc)**p33
> vmix_kpp.F90:         bckgrnd_vdc_psis=
> bckgrnd_vdc_psim*exp(-(0.4_r8*(tlatd+28.9_r8))**c2)
> vmix_kpp.F90:         bckgrnd_vdc_psin=
> bckgrnd_vdc_psim*exp(-(0.4_r8*(tlatd-28.9_r8))**c2)
> vmix_kpp.F90:         bckgrnd_vdc1 * (tlatd/10.0_r8)**c2
> vmix_kpp.F90:         WM(i,j) = vonkar*USTAR(i,j)*(c1 - c16*ZETA(i,j))**p25
> vmix_kpp.F90:         WM(i,j) =
> vonkar*(a_m*(USTAR(i,j)**3)-c_m*ZETAH(i,j))**p33
> vmix_kpp.F90:         WS(i,j) =
> vonkar*(a_s*(USTAR(i,j)**3)-c_s*ZETAH(i,j))**p33
> 
> 
> ...Something tells me I didn`t get the question.

;)  We're doing the transform only for constant 'c' -- a constant may be exposed
through optimization but it is more likely constant from the start.

I spot 10 and 2.5_r8 above unless some constant params appear.  The 10
is probably powi and thus not relevant.  That would leave

> ecosys_mod.F90:       TfuncS = 2.5_r8**(((TEMP + T0_Kelvin) - (Tref +
> T0_Kelvin)) / c10)

can you verify by compiling ecosys_mod.F90 with -O3 instead of -Ofast?
Or add -fno-unsafe-math-optimizations for this file.  Is 'c10' a parameter?
Comment 4 Andrey Guskov 2017-09-04 15:33:21 UTC
Okay, finally I`ve got a minimal reproducer.
All this miscompare boils down to the following:

program r628
real(8) :: l = -3
print *, merge("PASSED", "FAILED", 1D-3 .le. 10**l)
end program r628

Normally, 10^-3 should equal 10^-3, but not after the optimization in question.

$ gfortran -m64 -O3 -o r628 r628.f90 && ./r628
 PASSED
$ gfortran -m64 -Ofast -o r628 r628.f90 && ./r628
 FAILED
Comment 5 Andrey Guskov 2017-09-04 15:37:14 UTC
This is the actual guilty line:

sw_absorption.F90:    chlamnt = 10**(logchl)

It computes 'chlamnt' and then compares it to the list of acceptable values.
When -Ofast is enabled, CPU2017::628 considers 'chlamnt' unacceptable.
Comment 6 Richard Biener 2017-09-05 07:49:39 UTC
(In reply to Andrey Guskov from comment #5)
> This is the actual guilty line:
> 
> sw_absorption.F90:    chlamnt = 10**(logchl)
> 
> It computes 'chlamnt' and then compares it to the list of acceptable values.
> When -Ofast is enabled, CPU2017::628 considers 'chlamnt' unacceptable.

It actually computes which interval it falls in:

     chlamnt = 10**(logchl)
     do m=1,nchl-1
       if (chlcnc(m) .le. chlamnt .and. &
           chlamnt .le. chlcnc(m+1) ) then
      mc = m
      goto 30

but the list of values is

   real (r8), parameter, dimension(nchl) :: &!chl for table look-up
     chlcnc = (/                            &
     .001_r8, .005_r8, .01_r8,  .02_r8,     &
     .03_r8,  .05_r8,  .10_r8,  .15_r8,     &

so in case logchl is -3 and we get sth like 0.009999 it doesn't find anything.

Maybe we just need to make sure to compute log with round away from zero.

Or, instead of

+ /* pow(C,x) -> exp(log(C)*x) if C > 0.  */
+ (for pows (POW)
+      exps (EXP)
+      logs (LOG)
+  (simplify
+   (pows REAL_CST@0 @1)
+    (if (real_compare (GT_EXPR, TREE_REAL_CST_PTR (@0), &dconst0)
+        && real_isfinite (TREE_REAL_CST_PTR (@0)))
+     (exps (mult (logs @0) @1)))))

compute the log and the multiplication with extra precision.  On x86
this would mean using x87 math here, on other targets we might have to
give up for double.  Or we need to restrict @0 to the case where log (@0)
is exactly representable.

I wonder if for the case of pow (10., x) using exp10 (x) would work
in practice for pop2.  Note that simply using exp10 in a pattern like
above likely wouldn't work given fortran/mathbuiltins.def doesn't include
exp10 (it's a GNU extension so including it wouldn't be ok I guess).


Note cases like this are unfortunate but within -ffast-math / -Ofast they
are to be expected.  Still our intention for -Ofast is that it doesn't
cause SPEC miscompares.

I suppose reporting this to SPEC folks and requesting a chlcnc table
change to .009999_r8 isn't going to fly ;)
Comment 7 Thomas Koenig 2017-09-05 08:43:08 UTC
(In reply to Richard Biener from comment #6)

> I suppose reporting this to SPEC folks and requesting a chlcnc table
> change to .009999_r8 isn't going to fly ;)

I think reporting this to be a good idea.

All the SPEC people should have read "What Every Computer Scientist
Should Know About Floating-Point Arithmetic", right? :-)
Comment 8 Jakub Jelinek 2017-09-05 10:27:57 UTC
(In reply to Thomas Koenig from comment #7)
> (In reply to Richard Biener from comment #6)
> 
> > I suppose reporting this to SPEC folks and requesting a chlcnc table
> > change to .009999_r8 isn't going to fly ;)
> 
> I think reporting this to be a good idea.
> 
> All the SPEC people should have read "What Every Computer Scientist
> Should Know About Floating-Point Arithmetic", right? :-)

They often don't care though, something that happened e.g. with out of bound array accesses in SPEC2k6.  But maybe they will when SPEC2k17 is new.
Comment 9 rguenther@suse.de 2017-09-05 10:53:56 UTC
On Tue, 5 Sep 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82004
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org
> 
> --- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Thomas Koenig from comment #7)
> > (In reply to Richard Biener from comment #6)
> > 
> > > I suppose reporting this to SPEC folks and requesting a chlcnc table
> > > change to .009999_r8 isn't going to fly ;)
> > 
> > I think reporting this to be a good idea.
> > 
> > All the SPEC people should have read "What Every Computer Scientist
> > Should Know About Floating-Point Arithmetic", right? :-)
> 
> They often don't care though, something that happened e.g. with out of bound
> array accesses in SPEC2k6.  But maybe they will when SPEC2k17 is new.

Note in this case we have some really weird code anyway:

   integer (int_kind), parameter :: &
      nsub = 400,  &! total number chlorophyll concentrations in look up 
table
      ksol = 2*km, &! (0:ksol) transmission levels (layer interface + 
mid-point)
      nchl = 31     ! number of chlorophyll absorption coefficients

...
   real (r8), parameter, dimension(nchl) :: &!chl for table look-up
     chlcnc = (/                            &
     .001_r8, .005_r8, .01_r8,  .02_r8,     &
     .03_r8,  .05_r8,  .10_r8,  .15_r8,     &
     .20_r8,  .25_r8,  .30_r8,  .35_r8,     &
     .40_r8,  .45_r8,  .50_r8,  .60_r8,     &
     .70_r8,  .80_r8,  .90_r8, 1.00_r8,     &
     1.50_r8, 2.00_r8, 2.50_r8, 3.00_r8,    &
     4.00_r8, 5.00_r8, 6.00_r8, 7.00_r8,    &
     8.00_r8, 9.00_r8,10.00_r8  /)

...
   chlmin  = chlcnc(1)
   chlmax  = chlcnc(nchl)
   dlogchl = (log10(chlmax)-log10(chlmin))/real(nsub)
...
   logchl = log10(chlmin) - dlogchl
   do n=0,nsub
     logchl = logchl + dlogchl
     chlamnt = 10**(logchl)
     do m=1,nchl-1
       if (chlcnc(m) .le. chlamnt .and. &
           chlamnt .le. chlcnc(m+1) ) then
      mc = m

so we start with 10**-3.  We'll eventually even see that in PRE
but likely exactly evaluate the starting FP constant.

What is fishy here is that the code compares

chlcnc(m) <= chlamnt && chlamnt <= chlcnc(m+1)

that is, both <=.  This is likely _just_ to make the first entry
match.  And the code relies on log10 and 10** exactly
cancelling for the .001_r8 entry.

So yes, I think a fix is warranted.

As a last resort we can always choose to not touch 10**x
but only transform it to exp10 when available (similar for
2**x and e**x -- I doubt log(e) computes 1 "exactly enough"
though I hope math.h has M_E and M_El that when fed to
mpfr log will compute 1.0).
Comment 10 Wilco 2017-09-11 12:43:16 UTC
Using latest GLIBC, exp(log(10),-3) is only 3ULP different from 0.001. Since (double)0.001 is rounded up, any <1ULP POW implementation could round down and fail the test. So yes the first element in chlcnc really needs to be fixed to 0.00099999999999999 to allow for a few ULP of error in POW.

> As a last resort we can always choose to not touch 10**x

Unfortunately almost all useful cases are 10**x... So it would be great if we can allow use of exp10 and exp2 to get more accurate results. This requires a real implementation in GLIBC, and a way to disable it. Would it be feasible to add a exp10 symbol to libgcc/libgfortran in case the math libraries don't support it?
Comment 11 rguenther@suse.de 2017-09-12 12:44:32 UTC
On Mon, 11 Sep 2017, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82004
> 
> Wilco <wilco at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |wilco at gcc dot gnu.org
> 
> --- Comment #10 from Wilco <wilco at gcc dot gnu.org> ---
> Using latest GLIBC, exp(log(10),-3) is only 3ULP different from 0.001. Since
> (double)0.001 is rounded up, any <1ULP POW implementation could round down and
> fail the test. So yes the first element in chlcnc really needs to be fixed to
> 0.00099999999999999 to allow for a few ULP of error in POW.
> 
> > As a last resort we can always choose to not touch 10**x
> 
> Unfortunately almost all useful cases are 10**x... So it would be great if we
> can allow use of exp10 and exp2 to get more accurate results. This requires a
> real implementation in GLIBC, and a way to disable it. Would it be feasible to
> add a exp10 symbol to libgcc/libgfortran in case the math libraries don't
> support it?

Not sure, it depends on whether we can make sure libm is prefered over
libgfortran.  Maybe we have existing cases in libgfortran already.
Comment 12 Jakub Jelinek 2017-12-05 13:08:50 UTC
So, any progress here?
Has SPEC accepted some fix, or is this still broken?
Comment 13 Andrey Guskov 2017-12-05 14:04:46 UTC
Nope. The issue persists.

I don`t know whether anyone reported this to SPEC already, but even if so I share Richard`s pessimism.
Comment 14 Jakub Jelinek 2018-02-19 16:54:28 UTC
If glibc implements a faster exp10/exp10f, we could do what I've done for
PR84309 and expand pow (10, x) as exp10 (x) and pow (C, x) where C is pow (10, N) for integral N as exp10 (log10 (C) * x).
Even if it doesn't, there is some chance that perhaps 628.pop2_s propagates a constant into that x of pow (10, x), so in theory even just deferring the folding to after vectorization like we do for C which is pow (2, N) for integral N might help.

Looking at
http://oceans11.lanl.gov/trac/POP/browser/branches/ice_shelf_partial_cells/source/sw_absorption.F90?rev=373&order=name&desc=True
I think that actually is the case here.
From that source (haven't looked at SPEC2k17) it seems it is essentially:
! PR middle-end/82004
! { dg-do run }
! { dg-options "-Ofast" }

  integer, parameter :: r8 = selected_real_kind(13), i4 = kind(1)
  integer (i4), parameter :: a = 400, b = 2
  real (r8), parameter, dimension(b) :: c = (/ .001_r8, 10.00_r8 /)
  real (r8) :: d, e, f, g, h, j

  d = c(1)
  e = c(b)
  f = (log10(e)-log10(d))/real(a)
  g = log10(d) - f
  h = 10**(g)
  j = 10
  j = j**(g)
  if (h.ne.j) stop 1
end

(with the j and j = 10; j = j**(g); if (h.ne.j) stop 1 added by me to make it fail when miscompiled).
Comment 15 Jakub Jelinek 2018-02-19 17:03:17 UTC
Created attachment 43464 [details]
gcc8-pr82004.patch

Thus, does this fix the miscompare?  I'll bootstrap/regtest it on x86_64-linux and i686-linux, but don't have SPEC2k17 around.
Comment 16 Andrey Guskov 2018-02-19 17:05:05 UTC
Oh wow, that was unexpected!
Will test and report back ASAP.
Comment 17 Wilco 2018-02-19 17:23:29 UTC
(In reply to Jakub Jelinek from comment #15)
> Created attachment 43464 [details]
> gcc8-pr82004.patch
> 
> Thus, does this fix the miscompare?  I'll bootstrap/regtest it on
> x86_64-linux and i686-linux, but don't have SPEC2k17 around.

It can't since the pow is in a loop:

   do n=0,nsub
     logchl = logchl + dlogchl
     chlamnt = 10**(logchl)

As noted above the code is very badly written, it expects pow(10,log10(x)) to be bit-identical for any x. Even perfect rounding cannot avoid 0.5ULP worst-case error.
Comment 18 Jakub Jelinek 2018-02-19 17:39:58 UTC
(In reply to Wilco from comment #17)
> (In reply to Jakub Jelinek from comment #15)
> > Created attachment 43464 [details]
> > gcc8-pr82004.patch
> > 
> > Thus, does this fix the miscompare?  I'll bootstrap/regtest it on
> > x86_64-linux and i686-linux, but don't have SPEC2k17 around.
> 
> It can't since the pow is in a loop:
> 
>    do n=0,nsub
>      logchl = logchl + dlogchl
>      chlamnt = 10**(logchl)

Ugh, you're right.  Despite that, I think my patch is useful anyway if the x folds into a constant, or even just as a simpler/smaller representation through most of the optimization passes.

With the delayed folding of this, perhaps some loop optimization could note that we have:
  for (n = 0; n < nsub - 1; n++)
    {
      chlamnt = pow (10, constant + (n + 1) * constant2);
...
    }
and turn that (-Ofast) into
  chlamnt = pow (10, constant);
  chlamnt_step = pow (10, constant2);
  for (n = 0; n < nsub - 1; n++)
    {
      chlamnt = chlamnt * chlamnt_step;
...
    }
Comment 19 Jakub Jelinek 2018-02-19 17:51:08 UTC
In any case, having a fast and at least for double precise exp10 in glibc would be also beneficial, then this issue would go away at least on glibc targets.
Comment 20 Wilco 2018-02-19 18:07:58 UTC
(In reply to Jakub Jelinek from comment #19)
> In any case, having a fast and at least for double precise exp10 in glibc
> would be also beneficial, then this issue would go away at least on glibc
> targets.

Yes, indeed. The key question is how do we know we are linking with GLIBC (including from Fortran, where there are no headers)?
Comment 21 Jakub Jelinek 2018-02-19 18:13:36 UTC
Any way we like, like a target hook (e.g. extend libc_has_function targhook, though it is more about presence of APIs rather than their performance; on the other side it is the closest API we have right now), macro, etc.
Comment 22 Jakub Jelinek 2018-02-20 13:56:31 UTC
Author: jakub
Date: Tue Feb 20 13:56:00 2018
New Revision: 257846

URL: https://gcc.gnu.org/viewcvs?rev=257846&root=gcc&view=rev
Log:
	PR middle-end/82004
	* match.pd (pow(C,x) -> exp(log(C)*x)): Delay all folding until
	after vectorization.

	* gfortran.dg/pr82004.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr82004.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
Comment 23 Andrey Guskov 2018-02-20 14:42:39 UTC
Tested.

gcc8-pr82004.patch does not work on CPU2017::628.
Comment 24 Martin Liška 2018-03-14 12:55:46 UTC
I can confirm that the Jakub's commit is fixing that. Revering the patch I see a miscomparison. That said, I would close it as resolved. Andrey?
Comment 25 Jakub Jelinek 2018-03-14 13:27:19 UTC
(In reply to Martin Liška from comment #24)
> I can confirm that the Jakub's commit is fixing that. Revering the patch I
> see a miscomparison. That said, I would close it as resolved. Andrey?

That is weird, because we've figured that in the actual SPEC test it is used in a loop with a variable exponent.  If it no longer reproduces, isn't that some other change?  For a real fix I was hoping somebody would provide new, faster and precise exp10{,f,l} implementations to glibc and then we could just use exp10 rather than exp.
Comment 26 Andrey Guskov 2018-03-14 14:53:19 UTC
Martin, Jakub, for the Base optset (-Ofast -funroll-loops -march=haswell) there is no change, 628 keeps failing, but for O2 (-O2 -ffast-math -march=haswell) it did start passing.
Comment 27 Jakub Jelinek 2018-03-23 12:54:48 UTC
So, I've tried to create a small reproducer that actually shows what sw_absorption.F90 is doing (from the above URL, so not actually from the SPEC provided sources) and can't reproduce a problem:

! PR middle-end/82004
! { dg-do run }
! { dg-options "-Ofast" }

  integer, parameter :: r8 = selected_real_kind(13), i4 = kind(1)
  integer (i4), parameter :: a = 400, b = 31
  real (r8), parameter, dimension(b) :: c = (/ .001_r8, .005_r8, &
    .01_r8, .02_r8, .03_r8, .05_r8, .10_r8, .15_r8, .20_r8,      &
    .25_r8, .30_r8, .35_r8, .40_r8, .45_r8, .50_r8, .60_r8,      &
    .70_r8, .80_r8, .90_r8, 1.00_r8, 1.50_r8, 2.00_r8, 2.50_r8,  &
    3.00_r8, 4.00_r8, 5.00_r8, 6.00_r8, 7.00_r8, 8.00_r8,        &
    9.00_r8,10.00_r8 /)
  integer (i4), parameter, dimension(b) :: p = (/ 70, 31, 30,    &
    17, 22, 31, 17, 13, 9, 8, 7, 6, 5, 4, 8, 7, 6, 5, 5, 17, 13, &
    9, 8, 13, 9, 8, 7, 6, 5, 5, 0 /)
  real (r8) :: d, e, f, g, h
  integer (i4) :: n, m
  integer (i4), dimension(b) :: o

  d = c(1)
  e = c(b)
  o = 0
  f = (log10(e)-log10(d))/real(a)
  g = log10(d) - f
  do n = 0, a
    g = g + f
    h = 10**(g)
    do m = 1, b - 1
      if (c(m) .le. h .and. h .le. c(m + 1)) then
        o(m) = o(m) + 1
      end if
    end do
  end do
  if (any (o .ne. p)) stop 1
end

Maybe we peel off the first iteration (which is all we care about here, all other values are not within ulps of the bounds) on this smaller testcase and not on 628.pop2_s?
Comment 28 Jakub Jelinek 2018-03-23 13:05:37 UTC
Ah, I can actually reproduce the failure with the #c27 testcase with -Ofast -fno-tree-pre.

So, to workaround this issue with SPEC, either we need to find out why PRE peels off the first iteration on this small testcase and not on the original 628.pop2_s and see if we can convince it to do so, or we need some pattern match to handle the -Ofast -fno-tree-pre #c27 testcase the way I've outlined in #c18.
Comment 29 Jakub Jelinek 2018-03-23 13:36:49 UTC
Tried to enlarge the testcase, so it better matches what sw_absorption.F90 does, but still it works with -Ofast, just not with -Ofast -fno-tree-pre:

! PR middle-end/82004
! { dg-do run }
! { dg-options "-Ofast" }

  integer, parameter :: r8 = selected_real_kind(13), i4 = kind(1)
  integer (i4), parameter :: a = 400, b = 31
  real (r8), parameter, dimension(b) :: c = (/ .001_r8, .005_r8, &
    .01_r8, .02_r8, .03_r8, .05_r8, .10_r8, .15_r8, .20_r8,      &
    .25_r8, .30_r8, .35_r8, .40_r8, .45_r8, .50_r8, .60_r8,      &
    .70_r8, .80_r8, .90_r8, 1.00_r8, 1.50_r8, 2.00_r8, 2.50_r8,  &
    3.00_r8, 4.00_r8, 5.00_r8, 6.00_r8, 7.00_r8, 8.00_r8,        &
    9.00_r8,10.00_r8 /)
  real (r8), parameter, dimension(b) :: a3 = (/                  &
    0.4421_r8, 0.4451_r8, 0.4488_r8, 0.4563_r8,                  &
    0.4622_r8, 0.4715_r8, 0.4877_r8, 0.4993_r8,                  &
    0.5084_r8, 0.5159_r8, 0.5223_r8, 0.5278_r8,                  &
    0.5326_r8, 0.5369_r8, 0.5408_r8, 0.5474_r8,                  &
    0.5529_r8, 0.5576_r8, 0.5615_r8, 0.5649_r8,                  &
    0.5757_r8, 0.5802_r8, 0.5808_r8, 0.5788_r8,                  &
    0.56965_r8, 0.55638_r8, 0.54091_r8, 0.52442_r8,              &
    0.50766_r8,0.49110_r8,0.47505_r8 /)
  real (r8), parameter, dimension(b) :: a4 = (/                  &
    0.2981_r8, 0.2963_r8, 0.2940_r8, 0.2894_r8,                  &
    0.2858_r8, 0.2800_r8, 0.2703_r8, 0.2628_r8,                  &
    0.2571_r8, 0.2523_r8, 0.2481_r8, 0.2444_r8,                  &
    0.2411_r8, 0.2382_r8, 0.2356_r8, 0.2309_r8,                  &
    0.2269_r8, 0.2235_r8, 0.2206_r8, 0.2181_r8,                  &
    0.2106_r8, 0.2089_r8, 0.2113_r8, 0.2167_r8,                  &
    0.23357_r8, 0.25504_r8, 0.27829_r8, 0.30274_r8,              &
    0.32698_r8, 0.35056_r8, 0.37303_r8 /)
  real (r8), parameter, dimension(b) :: b3 = (/                  &
    0.0287_r8, 0.0301_r8, 0.0319_r8, 0.0355_r8,                  &
    0.0384_r8, 0.0434_r8, 0.0532_r8, 0.0612_r8,                  &
    0.0681_r8, 0.0743_r8, 0.0800_r8, 0.0853_r8,                  &
    0.0902_r8, 0.0949_r8, 0.0993_r8, 0.1077_r8,                  &
    0.1154_r8, 0.1227_r8, 0.1294_r8, 0.1359_r8,                  &
    0.1640_r8, 0.1876_r8, 0.2082_r8, 0.2264_r8,                  &
    0.25808_r8, 0.28498_r8, 0.30844_r8, 0.32932_r8,              &
    0.34817_r8, 0.36540_r8, 0.38132_r8 /)
  real (r8), parameter, dimension(b) :: b4 = (/                  &
    0.3192_r8, 0.3243_r8, 0.3306_r8, 0.3433_r8,                  &
    0.3537_r8, 0.3705_r8, 0.4031_r8, 0.4262_r8,                  &
    0.4456_r8, 0.4621_r8, 0.4763_r8, 0.4889_r8,                  &
    0.4999_r8, 0.5100_r8, 0.5191_r8, 0.5347_r8,                  &
    0.5477_r8, 0.5588_r8, 0.5682_r8, 0.5764_r8,                  &
    0.6042_r8, 0.6206_r8, 0.6324_r8, 0.6425_r8,                  &
    0.66172_r8, 0.68144_r8, 0.70086_r8, 0.72144_r8,              &
    0.74178_r8, 0.76190_r8, 0.78155_r8 /)
  real (r8) :: d, e, f, g, h, w1, w2, a1, a2, b1, b2
  integer (i4) :: n, m, o
  integer, volatile :: p

  d = c(1)
  e = c(b)
  o = 0
  p = 0
  f = (log10(e)-log10(d))/real(a)
  g = log10(d) - f
  do n = 0, a
    g = g + f
    h = 10**(g)
    do m = 1, b - 1
      if (c(m) .le. h .and. h .le. c(m + 1)) then
        o = m
        goto 10
      end if
    end do
    if (p == 0) then
      print *, 'Failed', n, h
      stop 1
    end if
10  continue
    w2 = (h - c(o)) / (c(o + 1) - c(o))
    w1 = 1.0_r8 - w2
    a1 = a3(o) * w1 + a3(o + 1) * w2
    a2 = a4(o) * w1 + a4(o + 1) * w2
    b1 = b3(o) * w1 + b3(o + 1) * w2
    b2 = b4(o) * w1 + b4(o + 1) * w2
    if (p == 0) then
      print *, n, h, a1, a2, b1, b2
    endif
  end do
end
Comment 30 Richard Biener 2018-03-23 13:37:10 UTC
(In reply to Jakub Jelinek from comment #28)
> Ah, I can actually reproduce the failure with the #c27 testcase with -Ofast
> -fno-tree-pre.
> 
> So, to workaround this issue with SPEC, either we need to find out why PRE
> peels off the first iteration on this small testcase and not on the original
> 628.pop2_s and see if we can convince it to do so, or we need some pattern
> match to handle the -Ofast -fno-tree-pre #c27 testcase the way I've outlined
> in #c18.

For me, with -Ofast -march=core-avx2 PRE gets us to

  dlogchl = 1.00000000000000002081668171172168513294309377670288085938e-2;
  pretmp_1370 = my_task;
  pretmp_1372 = master_task;

  <bb 35> [local count: 85892]:
  # logchl_645 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(34), logchl_747(44)>
  # mc_647 = PHI <mc_841(D)(34), mc_761(44)>
  # n_651 = PHI <0(34), _835(44)>
  # prephitmp_1174 = PHI <1.00000000000000002081668171172168513294309377670288085938e-2(34), pretmp_1173(44)>
  # prephitmp_1177 = PHI <1.00000000000000002081668171172168513294309377670288085938e-3(34), _1176(44)>
  # prephitmp_1371 = PHI <pretmp_1370(34), prephitmp_1242(44)>
  # prephitmp_1373 = PHI <pretmp_1372(34), prephitmp_1245(44)>
  logchl_747 = logchl_645 + prephitmp_1174;
  chlamnt = prephitmp_1177;
...
  <bb 43> [local count: 85892]:
  _835 = n_651 + 1;
  if (_835 == 401)
    goto <bb 45>; [12.36%]
  else
    goto <bb 44>; [87.64%]

  <bb 44> [local count: 75276]:
  pretmp_1173 = dlogchl;
  _1175 = logchl_747 + pretmp_1173;
  _1176 = __builtin_pow (1.0e+1, _1175);
  goto <bb 35>; [100.00%]
Comment 31 Jakub Jelinek 2018-03-23 13:43:41 UTC
Andrey, can you still reproduce and if yes, see what you get for the sw_absorption.F90 loop after PRE (i.e. if the value of __builtin_pow (10, ...) is precomputed for the first iteration and pow moved to latch or not)?
Thanks.
Comment 32 Andrey Guskov 2018-03-26 14:54:00 UTC
Created attachment 43761 [details]
Dump as requested in #c31

Jakub, see the attachment. This is the log of what I get, packed in 7Z.
At a glance, the importatnt part looks quite similar to what Richard has shown.
Comment 33 Andrey Guskov 2018-03-26 14:57:32 UTC
This is the full execution line I used to produce the log in question, and with which the test continues failing:

$ gfortran -fdump-tree-all -fdump-rtl-all -m64 -c -o sw_absorption.fppized.o -march=core-avx2 -mfpmath=sse -Ofast -fno-associative-math -funroll-loops -flto -fopenmp -fconvert=big-endian sw_absorption.fppized.f90
Comment 34 Jakub Jelinek 2018-03-26 17:40:40 UTC
Ok, I can now reproduce, but only with -flto, not without that.
Without -flto, before pre I see:
  <bb 34> [local count: 85892]:
  # logchl_591 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_701(129)>
  # mc_799 = PHI <mc_795(D)(33), mc_715(129)>
  # n_623 = PHI <0(33), _789(129)>
  # DEBUG n => n_623
  # DEBUG mc => mc_799
  # DEBUG logchl => logchl_591
  dlogchl.345_699 = dlogchl;
  logchl_701 = logchl_591 + dlogchl.345_699;
  # DEBUG logchl => logchl_701
  _702 = __builtin_pow (1.0e+1, logchl_701);
  chlamnt = _702;
  # DEBUG m => 1
  # DEBUG m => 1
...
in -fdump-tree-pre-details dump I see:
SCC consists of 67: logchl_591 .MEM_621 dlogchl.345_699 logchl_701 _702 .MEM_913 .MEM_914 .MEM_915 .MEM_916 .MEM_917 .MEM_918 stdout.350_714 .MEM_9
19 .MEM_920 .MEM_921 .MEM_922 .MEM_923 .MEM_924 .MEM_807 _718 _719 _723 _724 w2_725 w1_726 _727 _728 _729 _730 _731 .MEM_925 _732 _733 _734 _735 _7
36 .MEM_926 _737 _738 _739 _740 _741 .MEM_927 _742 _743 _744 _745 _746 .MEM_928 .MEM_944 mpercm.360_758 .MEM_622 _755 _757 _759 M.119_814 _761 _762
 _765 _767 _768 M.120_202 _770 _771 _773 _774 .MEM_946
Starting iteration 1
Value numbering logchl_591 stmt = logchl_591 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_701(129)>
Setting value number of logchl_591 to -3.0099999999999997868371792719699442386627197265625e+0 (changed)
Value numbering .MEM_621 stmt = .MEM_621 = PHI <.MEM_898(33), .MEM_946(129)>
Setting value number of .MEM_621 to .MEM_898 (changed)
Value numbering dlogchl.345_699 stmt = dlogchl.345_699 = dlogchl;
Setting value number of dlogchl.345_699 to 1.00000000000000002081668171172168513294309377670288085938e-2 (changed)
Value numbering logchl_701 stmt = logchl_701 = logchl_591 + dlogchl.345_699;
Match-and-simplified logchl_591 + dlogchl.345_699 to -3.0e+0
RHS logchl_591 + dlogchl.345_699 simplified to -3.0e+0
Setting value number of logchl_701 to -3.0e+0 (changed)
Value numbering _702 stmt = _702 = __builtin_pow (1.0e+1, logchl_701);
Match-and-simplified __builtin_pow (1.0e+1, logchl_701) to 1.00000000000000002081668171172168513294309377670288085938e-3
call __builtin_pow (1.0e+1, logchl_701) simplified to 1.00000000000000002081668171172168513294309377670288085938e-3
Setting value number of _702 to 1.00000000000000002081668171172168513294309377670288085938e-3 (changed)

With -flto, I see just similar:
  <bb 34> [local count: 16255]:
  # n_925 = PHI <0(33), _1128(129)>
  # logchl_926 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_1040(129)>
  # mc_928 = PHI <mc_1134(D)(33), mc_1054(129)>
  # a1_lsm.5953_134 = PHI <a1_lsm.5953_1135(33), a1_lsm.5953_1120(129)>
  # a2_lsm.5954_1153 = PHI <a2_lsm.5954_135(33), a2_lsm.5954_127(129)>
  # b1_lsm.5955_1099 = PHI <b1_lsm.5955_136(33), b1_lsm.5955_128(129)>
  # b2_lsm.5956_139 = PHI <b2_lsm.5956_1233(33), b2_lsm.5956_129(129)>
  # DEBUG n => n_925
  # DEBUG mc => mc_928
  # DEBUG logchl => logchl_926
  logchl_1040 = logchl_926 + 1.00000000000000002081668171172168513294309377670288085938e-2;
  # DEBUG logchl => logchl_1040
  _1041 = __builtin_pow (1.0e+1, logchl_1040);
  chlamnt_lsm.5952_1150 = _1041;
  # DEBUG m => 1
  # DEBUG m => 1
before pre, but instead:
SCC consists of 2: logchl_926 logchl_1040
Starting iteration 1
Value numbering logchl_926 stmt = logchl_926 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_1040(129)>
Setting value number of logchl_926 to -3.0099999999999997868371792719699442386627197265625e+0 (changed)
Value numbering logchl_1040 stmt = logchl_1040 = logchl_926 + 1.00000000000000002081668171172168513294309377670288085938e-2;
Match-and-simplified logchl_926 + 1.00000000000000002081668171172168513294309377670288085938e-2 to -3.0e+0
RHS logchl_926 + 1.00000000000000002081668171172168513294309377670288085938e-2 simplified to -3.0e+0
Setting value number of logchl_1040 to -3.0e+0 (changed)
Starting iteration 2
Value numbering logchl_926 stmt = logchl_926 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_1040(129)>
Setting value number of logchl_926 to logchl_926 (changed)
Value numbering logchl_1040 stmt = logchl_1040 = logchl_926 + 1.00000000000000002081668171172168513294309377670288085938e-2;
Setting value number of logchl_1040 to logchl_1040 (changed)
Processing SCC needed 3 iterations
Value numbering _1093 stmt = _1093 = (long int) k_1137;
Setting value number of _1093 to _1093 (changed)
Value numbering _1103 stmt = _1103 = _1091 + _1093;
Setting value number of _1103 to _1103 (changed)
Value numbering _1041 stmt = _1041 = __builtin_pow (1.0e+1, logchl_1040);
Setting value number of _1041 to _1041 (changed)

so this precomputation of the first iteration doesn't happen in that case.  Richard, any way to debug why?  I'll attach the log files (lim2 and pre-details).
Comment 35 Jakub Jelinek 2018-03-26 17:50:08 UTC
Created attachment 43763 [details]
pr82004_dumps.tar.xz

Dumps.  For lto I've just added the init_sw_absorption function parts of the dump, the dumps are too large.
Comment 36 Richard Biener 2018-03-27 11:28:48 UTC
(In reply to Jakub Jelinek from comment #35)
> Created attachment 43763 [details]
> pr82004_dumps.tar.xz
> 
> Dumps.  For lto I've just added the init_sw_absorption function parts of the
> dump, the dumps are too large.

Skipping partial redundancy for expression {plus_expr,logchl_926,1.00000000000000002081668171172168513294309377670288085938e-2} (0812), no redundancy on to be optimized for speed edge
Skipping partial redundancy for expression {call_expr<__builtin_pow>,real_cst<1.0e+1>,logchl_1040} (0813), no redundancy on to be optimized for speed edge

so with LTO we have "better" profile estimates and the entry edge is considered cold...

LTO:

  <bb 33> [local count: 3813]:
...
  <bb 34> [local count: 16255]:
  # n_925 = PHI <0(33), _1128(129)>
  # logchl_926 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_1040(129)>

non-LTO:

  <bb 33> [local count: 10616]:
...
  <bb 34> [local count: 85892]:
  # logchl_591 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_701(129)>

in general optimizing the redundancy on the entry edge isn't worth it given
it often increases register pressure by introducing loop-carried dependences.
So I think LTO is "correct" here, even if that's unfortunate... :/
Comment 37 Jakub Jelinek 2018-03-27 16:12:00 UTC
Created attachment 43771 [details]
gcc8-pr82004.patch

Untested hack.  With this it works even with -flto.  Though, the rounding errors because we do 400 multiplications get perhaps way too high even for -Ofast (the benchmark doesn't care that much except for the first iteration, but I think it is too much).  So, as an alternative I'll just try to something different.
Comment 38 Jakub Jelinek 2018-03-27 17:08:02 UTC
Created attachment 43776 [details]
gcc8-pr82004.patch

Variant fix, instead of trying to optimize it some way (which is even less precise than what we do right now), this just reverts to GCC 7 behavior if we detect a SPEC2k17-like pow call.  If/once glibc has a fast exp10, we can use exp10 in that case instead of leaving pow around.
Comment 39 Jakub Jelinek 2018-03-28 19:16:12 UTC
Author: jakub
Date: Wed Mar 28 19:15:39 2018
New Revision: 258930

URL: https://gcc.gnu.org/viewcvs?rev=258930&root=gcc&view=rev
Log:
	PR tree-optimization/82004
	* gimple-match-head.c (optimize_pow_to_exp): New function.
	* match.pd (pow(C,x) -> exp(log(C)*x)): Wrap with #if GIMPLE.
	Don't fold to exp if optimize_pow_to_exp is false.

	* gcc.dg/pr82004.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr82004.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-match-head.c
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
Comment 40 Jakub Jelinek 2018-03-28 19:25:06 UTC
Workaround applied, should work now even with -flto.  Would be nice to report it to SPEC though, to make the code accept a few ulps difference.
Comment 41 Jakub Jelinek 2018-04-03 13:06:07 UTC
.
Comment 42 john henning 2020-08-06 11:26:24 UTC
A comment, an invitation, and some questions from one of those "SPEC people".

The comment: 

SPEC CPU starts from real applications (rather than artificial kernels).  Therefore, yes, it begins with code that includes practices that are less than perfect - including standards violations and mistakes in numerical methods.  

SPEC then weeds out benchmark candidates when they are discovered to be inadequate.  Many have been weeded out due to numerical malpractice, when they cannot be relied upon to produce correct answers within some predefined tolerance.  

The remaining candidates get pushed toward the standard, get pushed toward better numerical practices.

But all this is driven by testing.   If there's nobody pushing hard on compiler X advanced optimization feature Y, and if Y quietly relies on source code to be virtuous as regards Z, it is entirely possible that sins against virtue Z will escape notice until after release of a SPEC benchmark suite.

Pre-release, fixes are easy to sponsor and apply.   Post-release, it is much harder, because of SPEC's preference for "no moving targets". 

The invitation:

Thus I would invite any of the individuals who commented about "those SPEC people" to consider becoming one, and helping with the testing of benchmark candidates for the next benchmark suite.  SPEC has a mechanism to accept free volunteer labor :-) and if you would like to volunteer, please write to info at SPEC dot org.  If you like, you could mention my name and this bug when you do.

The questions:

(q1) Do I correctly gather that the fix described here would have gone into GCC 8?

(q2) On aarch64 with GCC 8.2, GCC 9.3, and GCC 10.1 (Red Hat 7.8, glibc 2.17-307.0.2.el7.1) I am seeing the following 628.pop2_s miscompare and abort.  Is this the same problem or a different problem as the one described in this bug?

Miscompare:
   1513:    Chlorophyll transmission table computed
           Could not find range for chlamnt =  1.0000E-03

Abort: 

   Contents of pop2_s.err
   ****************************************
   Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG
   ------------------------------------------------------------------------
   POP aborting...
     set_chl_trans range error for chlamnt

(q3) The text mentions that -funsafe-math-optimizations as a workaround.  Was this tested?  Was it narrowed to any of the components of -funsafe-math-optimizations which apparently are -fno-signed-zeros, -fno-trapping-math, -fassociative-math and -freciprocal-math?  

I am just about to start such testing on my aarch64 system, but if someone has already done such, it would be useful to know.

(q4) *Is* there a proposed patch for 628.pop2_s?   Although such may be unlikely to be applied to CPU 2017, it is conceivable that there might be an updated version of 628.pop2_s in a future suite.  So, it would be good to have the proposed patch on record.
Comment 43 Richard Biener 2020-08-06 12:03:53 UTC
(In reply to john henning from comment #42)
> A comment, an invitation, and some questions from one of those "SPEC people".
> 
> The comment: 
> 
> SPEC CPU starts from real applications (rather than artificial kernels). 
> Therefore, yes, it begins with code that includes practices that are less
> than perfect - including standards violations and mistakes in numerical
> methods.  
> 
> SPEC then weeds out benchmark candidates when they are discovered to be
> inadequate.  Many have been weeded out due to numerical malpractice, when
> they cannot be relied upon to produce correct answers within some predefined
> tolerance.  
> 
> The remaining candidates get pushed toward the standard, get pushed toward
> better numerical practices.
> 
> But all this is driven by testing.   If there's nobody pushing hard on
> compiler X advanced optimization feature Y, and if Y quietly relies on
> source code to be virtuous as regards Z, it is entirely possible that sins
> against virtue Z will escape notice until after release of a SPEC benchmark
> suite.
> 
> Pre-release, fixes are easy to sponsor and apply.   Post-release, it is much
> harder, because of SPEC's preference for "no moving targets". 
> 
> The invitation:
> 
> Thus I would invite any of the individuals who commented about "those SPEC
> people" to consider becoming one, and helping with the testing of benchmark
> candidates for the next benchmark suite.  SPEC has a mechanism to accept
> free volunteer labor :-) and if you would like to volunteer, please write to
> info at SPEC dot org.  If you like, you could mention my name and this bug
> when you do.

Thanks.  But all testing effort is limited.  We do test SPEC before the
release (but I am not aware that sources are available freely before
release - I mean, including the data and scripts).  But the issue required
flags to be reproduced that we didn't test in time.  A full review of
the source packages is of course not possible.

> The questions:
> 
> (q1) Do I correctly gather that the fix described here would have gone into
> GCC 8?

Yes.
 
> (q2) On aarch64 with GCC 8.2, GCC 9.3, and GCC 10.1 (Red Hat 7.8, glibc
> 2.17-307.0.2.el7.1) I am seeing the following 628.pop2_s miscompare and
> abort.  Is this the same problem or a different problem as the one described
> in this bug?
> 
> Miscompare:
>    1513:    Chlorophyll transmission table computed
>            Could not find range for chlamnt =  1.0000E-03
> 
> Abort: 
> 
>    Contents of pop2_s.err
>    ****************************************
>    Note: The following floating-point exceptions are signalling:
> IEEE_UNDERFLOW_FLAG
>    ------------------------------------------------------------------------
>    POP aborting...
>      set_chl_trans range error for chlamnt

It looks like the same issue, yes.

> (q3) The text mentions that -funsafe-math-optimizations as a workaround. 
> Was this tested?  Was it narrowed to any of the components of
> -funsafe-math-optimizations which apparently are -fno-signed-zeros,
> -fno-trapping-math, -fassociative-math and -freciprocal-math?  
> 
> I am just about to start such testing on my aarch64 system, but if someone
> has already done such, it would be useful to know.

I'm not aware that anyone tried to narrow down a workaround.
 
> (q4) *Is* there a proposed patch for 628.pop2_s?   Although such may be
> unlikely to be applied to CPU 2017, it is conceivable that there might be an
> updated version of 628.pop2_s in a future suite.  So, it would be good to
> have the proposed patch on record.

From what I understand the proposed workaround is something like the
following if not rewriting the problematic loop in another way.

diff --git a/benchspec/CPU/628.pop2_s/src/sw_absorption.F90 b/benchspec/CPU/628.pop2_s/src/sw_absorption.F90
index 1aa25904..489e88f2 100644
--- a/benchspec/CPU/628.pop2_s/src/sw_absorption.F90
+++ b/benchspec/CPU/628.pop2_s/src/sw_absorption.F90
@@ -134,7 +134,7 @@
 
    real (r8), parameter, dimension(nchl) :: &!chl for table look-up
      chlcnc = (/                            &
-     .001_r8, .005_r8, .01_r8,  .02_r8,     &
+     .00099999999_r8, .005_r8, .01_r8,  .02_r8,     &
      .03_r8,  .05_r8,  .10_r8,  .15_r8,     &
      .20_r8,  .25_r8,  .30_r8,  .35_r8,     &
      .40_r8,  .45_r8,  .50_r8,  .60_r8,     &
Comment 44 Jakub Jelinek 2020-08-06 12:39:20 UTC
Sorry if my comment sounded harsh.
Anyway, the older issue I was referring to was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53073 with SPEC2k6 464.h264ref.
Comment 45 john henning 2020-08-07 22:23:16 UTC
I had promised to do some more testing.  

There were miscompares during the training run (thus causing the build to fail) using

     -Ofast -flto -fprofile-generate/-fprofile-use 

on aarch64 with GCC 8.2, 9.3, and 10.1 and
on x86_64  with GCC 8.3, 9.3, and 10.1.

The above succeeded with GCC 7.3 on both aarch64 and x86_64.

The attachment 628 [details].pop2.moretests.7aug2020.txt digs into 10.1 a bit deeper (posted as an attachment to reduce the chances of unwanted line wraps or white space mangling for the tables)
Comment 46 john henning 2020-08-07 22:24:31 UTC
Created attachment 49027 [details]
More testing 7-aug-2020
Comment 47 john henning 2020-08-11 14:17:24 UTC
SPEC next step: Because the performance differences were small (in my limited testing) no matter which workaround I picked (-O3, or remove Feedback Directed Optimization, or add -fno-unsafe-math-optimizations), I will probably update the CPU 2017 Docs to recommend simply that GCC users set "basepeak" for 628.pop2. This turns off FDO, and is easy for them to do.  (https://www.spec.org/cpu2017/Docs/config.html#basepeak)

A loose end: is it worth filing a documentation bug for the fact that using the alleged components of -fno-unsafe-math-optimizations does not have the same effect as -fno-unsafe-math-optimizations itself?  (See the attachment from comment 46)