Bug 84201

Summary: 549.fotonik3d_r from SPEC2017 fails verification with recent Intel and AMD CPUs
Product: gcc
Component: target
Status: NEW
Severity: normal
Priority: P3
Version: 8.0
Target Milestone: ---
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Reporter: Martin Jambor <jamborm>
Assignee: Not yet assigned to anyone <unassigned>
CC: dimhen, mailboxnotfound, rguenth, sje
Known to work:
Known to fail: 6.4.0, 7.3.0, 8.2.0, 9.0
Last reconfirmed: 2018-09-09 00:00:00
Bug Depends on:
Bug Blocks: 26163, 84613

Description Martin Jambor 2018-02-04 22:52:57 UTC
I'm not sure if I can post any more details than what the subject
says. On AMD EPYC and Ryzen machines (at least), 549.fotonik3d_r from
SPEC 2017 fails with a verification error when compiled with any of
the following combination of options:

  -Ofast -g -march=native -mtune=native -mprefer-vector-width=256

  -Ofast -g -march=native -mtune=native -mprefer-vector-width=512

Unfortunately, it does not fail the test run, only the reference run.
Comment 1 Martin Jambor 2018-05-11 15:42:20 UTC
When benchmarking GCC 8 on an older Ivy Bridge Xeon, I also got 549.fotonik3d_r verification error just with -Ofast -g -march=native -mtune=native
Comment 2 Martin Liška 2018-09-09 10:00:24 UTC
The same happens on an Intel machine with -march=skylake. The issue is as old as the skylake option (GCC 6.1+), so bisection will not help us here. I can also reproduce it on a Haswell machine with that -march option.
Comment 3 Martin Liška 2018-09-10 12:55:03 UTC
If I'm right, it's also problematic on Haswell with -march=native.
In particular, the only affected file is power.fppized.o.
I also tried -mno-fma4 and -mno-fma, and neither helps. From the assembly diff, there is additional vectorization happening.
Comment 4 Martin Liška 2018-09-10 13:14:58 UTC
I see it failing even with r216027, a revision about four years old, on the Haswell machine.
Looking at portability issues: https://www.spec.org/cpu2017/Docs/benchmarks/549.fotonik3d_r.html

It is perhaps worth mentioning that some calculations generate 'subnormal' numbers (wikipedia) which may cause slower operation than normal numbers on some hardware platforms. On such platforms, performance may be improved if "flush to zero on underflow" (FTZ) is enabled. During SPEC's testing of Fotonik3d, the output validated correctly whether or not FTZ was enabled.

Maybe that's causing the -Ofast issues?
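(Editorial sketch, not from the report.) Python exposes IEEE double behavior directly, so a few lines can show what a subnormal is and why flushing it to zero (FTZ) perturbs results enough to matter against a fixed verification tolerance:

```python
import sys

# Smallest positive *normal* double; everything between it and zero is subnormal.
tiny = sys.float_info.min      # 2**-1022 ~= 2.225e-308
sub = tiny / 4                 # a subnormal value (2**-1024)

print(sub > 0.0)               # True: IEEE gradual underflow keeps it nonzero
print(sub + sub == tiny / 2)   # True: subnormal arithmetic is still exact here

# Under FTZ the hardware would deliver exactly 0.0 instead of 'sub', so any
# downstream result that depends on it changes -- one way a fixed 1e-10
# verification tolerance can pass on one platform/flag set and fail on another.
```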
Comment 5 Steve Ellcey 2019-02-05 16:57:45 UTC
Has anyone looked into this any more to see what optimization is causing this failure?

In my testing:

-Ofast fails
-Ofast -fno-unsafe-math-optimizations works
-Ofast -fno-tree-loop-vectorize works
-O3 works

So it seems to be some combination of unsafe math optimizations and vectorization that is causing the failure.
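(Editorial sketch, not from the report.) -funsafe-math-optimizations, which -Ofast implies, licenses the compiler to reassociate floating-point expressions. IEEE 754 addition is not associative, so a reassociated evaluation order, such as a vectorizer might pick, yields a differently rounded result:

```python
# IEEE 754 addition is not associative: reassociation changes the rounding.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c        # one evaluation order
right = a + (b + c)       # the reassociated order

print(left == right)      # False
print(abs(left - right))  # ~1.1e-16: tiny, but such errors compound across
                          # millions of operations until a 1e-10 tolerance trips
```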
Comment 6 Richard Biener 2019-02-06 09:55:13 UTC
If Martin's bisection to power.fppized.o is correct you can bisect the loop
via the vect_loop or vect_slp debug counters (or first try with just
-fno-tree-{loop,slp}-vectorize to narrow down to loop vs. BB vectorization).
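(Editorial sketch, not from the report.) The vect_loop debug counter caps how many loops get vectorized, so one can binary-search the cap to isolate the first loop whose vectorization breaks verification, which is essentially what comment 10 does by hand. Here `build_and_verify` is a hypothetical stand-in for "rebuild with -fdbg-cnt=vect_loop:<n>, run, check verification":

```python
def bisect_vect_loop(build_and_verify, max_loops):
    """Binary-search the vect_loop debug counter cap.

    build_and_verify(n) stands in for: rebuild with -fdbg-cnt=vect_loop:<n>
    (vectorize only the first n loops), run the benchmark, and return True
    if verification passes.
    """
    lo, hi = 0, max_loops            # invariant: passes at lo, fails at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_and_verify(mid):
            lo = mid                 # still passes: culprit is later
        else:
            hi = mid                 # already fails: culprit is at or before mid
    return hi                        # index of the first failing loop

# Simulated run: pretend loop #5 (as in comment 10) is the culprit, so
# verification passes only while fewer than 5 loops are vectorized.
print(bisect_vect_loop(lambda n: n < 5, max_loops=100))   # 5
```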
Comment 7 Steve Ellcey 2019-02-06 16:49:24 UTC
(In reply to Richard Biener from comment #6)
> If Martin's bisection to power.fppized.o is correct you can bisect the loop
> via the vect_loop or vect_slp debug counters (or first try with just
> -fno-tree-{loop,slp}-vectorize to narrow down to loop vs. BB vectorization).

I will let one of the x86 experts try that.  I was just surprised to find that one of the most popular benchmarks fails on one of the most popular targets
and that it has been that way for about a year.
Comment 8 Martin Liška 2019-03-26 11:28:49 UTC
For some reason, I can't reproduce that now on Haswell with both GCC 8 and 9. I'll retry with a Zen machine.
Comment 9 Martin Liška 2019-03-26 13:40:21 UTC
(In reply to Martin Liška from comment #8)
> For some reason, I can't reproduce that now on Haswell with both GCC 8 and
> 9. I'll retry with a Zen machine.

The reason is that I had adjusted the tolerance for the test from 1e-10 to 1e-9.
Comment 10 Martin Liška 2019-03-27 09:45:35 UTC
Using the -fdbg-cnt option:
-fdbg-cnt=vect_loop:0 -fdbg-cnt=vect_loop:4:5:power.fppized

I was able to track that to a single vectorization that happens here:

   308  !! Set up frequency vector(s)
   309  tmppower => first_power
   310  ipower = 1
   311  do while(associated(tmppower))
   312    if ( tmppower%nofreq>1 ) then
   313      freqstep = (tmppower%freqlast-tmppower%freqfirst)                       &
   314               / real(tmppower%nofreq - 1, kind=rfp)
   315    else
   316      freqstep = 0.0_rfp
   317    end if
   318    freq = tmppower%freqfirst
   319    do ifreq = 1, tmppower%nofreq <------ HERE
   320      frequency(ifreq,ipower) = freq
   321      freq = freq + freqstep
   322    end do
   323    tmppower => tmppower%next
   324    ipower = ipower + 1
   325  end do

vect dump:
power.fppized.f90:319:0: note:    Runtime profitability threshold = 4
power.fppized.f90:319:0: note:    Static estimate profitability threshold = 9
power.fppized.f90:319:0: note:  epilog loop required
power.fppized.f90:319:0: note:  vect_can_advance_ivs_p:
power.fppized.f90:319:0: note:  Analyze phi: freq_20 = PHI <pretmp_1948(136), freq_728(180)>
power.fppized.f90:319:0: note:  Analyze phi: ifreq_1653 = PHI <1(136), ifreq_729(180)>
power.fppized.f90:319:0: note:  Analyze phi: .MEM_137 = PHI <.MEM_134(136), .MEM_727(180)>
power.fppized.f90:319:0: note:  reduc or virtual phi. skip.
***dbgcnt: upper limit 5 reached for vect_loop.***
power.fppized.f90:319:0: optimized: loop vectorized using 32 byte vectors
power.fppized.f90:319:0: note:  === vec_transform_loop ===
power.fppized.f90:319:0: note:  Profitability threshold is 4 loop iterations.

which leads to following assembly:

	movl	88(%rdi), %esi
	vmovsd	96(%rdi), %xmm2
	cmpl	$1, %esi
	jle	.L243
	vmovsd	104(%rdi), %xmm1
	leal	-1(%rsi), %eax
	vcvtsi2sdl	%eax, %xmm5, %xmm3
	vsubsd	%xmm2, %xmm1, %xmm1
	testl	%esi, %esi
	movl	%r10d, %r9d
	vdivsd	%xmm3, %xmm1, %xmm3
	cmovg	%esi, %r9d
	cmpl	$3, %esi
	jle	.L371
	vaddsd	%xmm2, %xmm3, %xmm0
	movl	%r9d, %ecx
	shrl	$2, %ecx
	vaddsd	%xmm3, %xmm0, %xmm1
	salq	$5, %rcx
	vunpcklpd	%xmm0, %xmm2, %xmm0
	vaddsd	%xmm3, %xmm1, %xmm4
	addq	%r8, %rcx
	movq	%r8, %rax
	vunpcklpd	%xmm4, %xmm1, %xmm1
	vmulsd	%xmm6, %xmm3, %xmm4
	vinsertf128	$0x1, %xmm1, %ymm0, %ymm0
	vbroadcastsd	%xmm4, %ymm4
	.p2align 4,,10
	.p2align 3
.L250:
	vmovapd	%ymm0, %ymm1
	vmovupd	%ymm1, (%rax)
	addq	$32, %rax
	vaddpd	%ymm4, %ymm0, %ymm0
	cmpq	%rax, %rcx
	jne	.L250
	movl	%r9d, %eax
	andl	$-4, %eax
	vcvtsi2sdl	%eax, %xmm5, %xmm0
	leal	1(%rax), %ecx
	vfmadd231sd	%xmm3, %xmm0, %xmm2
	cmpl	%r9d, %eax
	je	.L251
	movslq	%ecx, %rcx
	addq	%rdx, %rcx
	addl	$2, %eax
	vmovsd	%xmm2, (%r11,%rcx,8)
	vaddsd	%xmm3, %xmm2, %xmm2
	cmpl	%esi, %eax
	jg	.L251
	movslq	%eax, %rcx
	addq	%rdx, %rcx
	incl	%eax
	vmovsd	%xmm2, (%r11,%rcx,8)
	vaddsd	%xmm3, %xmm2, %xmm2
	cmpl	%eax, %esi
	jl	.L251
	addq	%rdx, %rax
	vmovsd	%xmm2, (%r11,%rax,8)
	movq	136(%rdi), %rdi
	addq	%r15, %rdx
	addq	%r12, %r8
	testq	%rdi, %rdi
	jne	.L248

So there's probably nothing we can do right now? I'm removing the wrong-code keyword, as this is about floating-point precision
when using -Ofast and vectors, which is a known limitation if I'm correct.
Comment 11 Richard Biener 2019-03-27 10:20:32 UTC
Yeah, it should even add _extra_ precision because it performs fewer rounding steps.

I wonder what the difference in frequency(ifreq,ipower) is when comparing
vectorization vs. non-vectorization.
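(Editorial sketch, not from the report; function names are mine.) One way to see the difference Richard asks about is to model both schedules for this loop: the scalar Fortran loop does one rounded addition per element, while the vectorized code above seeds four lanes and advances them by a broadcast 4*freqstep (the vbroadcastsd/vaddpd pair), recomputing freqfirst + n*freqstep for the epilogue (the vfmadd231sd):

```python
def freq_scalar(freqfirst, freqstep, nofreq):
    # The original Fortran loop: repeated addition, one rounding per element.
    out = []
    freq = freqfirst
    for _ in range(nofreq):
        out.append(freq)
        freq = freq + freqstep
    return out

def freq_vectorized(freqfirst, freqstep, nofreq):
    # Sketch of the 4-lane schedule from the assembly: seed four lanes,
    # then add a broadcast 4*freqstep each iteration.
    lanes = [freqfirst + i * freqstep for i in range(4)]
    step4 = 4 * freqstep
    out = []
    while len(out) + 4 <= nofreq:
        out.extend(lanes)
        lanes = [x + step4 for x in lanes]
    # Scalar epilogue: the vectorized code recomputes the induction value
    # as freqfirst + n*freqstep rather than continuing the running sum.
    freq = freqfirst + len(out) * freqstep
    while len(out) < nofreq:
        out.append(freq)
        freq = freq + freqstep
    return out

a = freq_scalar(1.0e9, 0.1e9 / 3.0, 100)
b = freq_vectorized(1.0e9, 0.1e9 / 3.0, 100)
diff = max(abs(x - y) for x, y in zip(a, b))
print(diff)   # typically nonzero, but tiny relative to the values themselves
```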
Comment 12 john henning 2020-09-17 20:46:33 UTC
I contributed to the development of benchmark 549.fotonik3d_r.  The opinions herein are my own, not necessarily SPEC's.

Martin Liška wrote in comment 9:

> adjusted tolerance for the test from 1e-10 to 1e-9

That change would have been highly desirable if this problem had been found prior to the release of CPU 2017. Unfortunately, post-release, it is very difficult to change a SPEC CPU benchmark because of the philosophy of "no moving targets".

To be clear, a rule-compliant SPEC CPU run is not allowed to change the tolerance.

Why wasn't the problem found before release?

Although GCC was tested prior to the release of CPU 2017, the circumstances that led to this problem were not encountered.  As Steve Ellcey wrote in Comment 5, the problem comes and goes with various optimizations:

> -Ofast fails
> -Ofast -fno-unsafe-math-optimizations works
> -Ofast -fno-tree-loop-vectorize works
> -O3 works

In addition, it appears that the problem:

  - Depends on the particular architecture. In my tests today, it disappears when I remove -march=native

  - Does not happen in the presence of FDO (Feedback-Directed Optimization, i.e. -fprofile-generate / -fprofile-use), typically used by "peak" tuning (see https://www.spec.org/cpu2017/Docs/overview.html#Q16 for info on base vs. peak). 

SPEC CPU Examples 

SPEC CPU 2017 users who start from the Example config files for GCC in $SPEC/config are unlikely to hit the problem because most of the GCC Example config files use -O3 (not -Ofast) for base.  However, if a user modifies the example to use

   -Ofast -march=native 

then they will also need to add -fno-unsafe-math-optimizations or -fno-tree-loop-vectorize.

Working around the problem

The same-for-all rule -- https://www.spec.org/cpu2017/Docs/runrules.html#BaseFlags -- says that all benchmarks in a suite of a given language must use the same base flags.  Here are several examples of how a config file could obey that rule while working around the problem:

Option (a) - In base, avoid Ofast for Fortran 
      COPTIMIZE      = -Ofast -flto -march=native 
      CXXOPTIMIZE    = -Ofast -flto -march=native 
      FOPTIMIZE      = -O3    -flto -march=native

Option (b) - In base, avoid -march=native for Fortran
      COPTIMIZE      = -Ofast -flto -march=native 
      CXXOPTIMIZE    = -Ofast -flto -march=native 
      FOPTIMIZE      = -Ofast -flto

Option (c) - Turn off tree loop vectorizer for Fortran
      OPTIMIZE       = -Ofast -flto -march=native 
      FOPTIMIZE      = -fno-tree-loop-vectorize 

Option (d) - Turn off unsafe math for Fortran
      OPTIMIZE       = -Ofast -flto -march=native 
      FOPTIMIZE      = -fno-unsafe-math-optimizations

Performance impact

The performance impact of the options above will be system dependent, and may depend on how hard you exercise the system (e.g. one copy or many copies).   For a particular system tested by one particular person running only one copy, here are the results of the above 4 options, normalized to option (a):

  Performance of the Fortran benchmarks in SPECrate2017 FP
  Normalized to option (a).  Higher is better

               503.    507.   521.  527.  549.     554.
               bwaves  cactu  wrf   cam4  fotonik  roms  geomean
 (a) O3          1.00   1.00  1.00  1.00  1.00     1.00   1.000
 (b) no native   1.31   0.82  1.31  1.01  0.93     0.98   1.045
 (c) no vect     1.31   1.01  0.94  0.89  0.90     0.85   0.973
 (d) no unsafe   0.99   1.02  1.36  1.01  1.03     1.01   1.064

Given the above, at the moment option (d) seems best.
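(Editorial sketch, not from the report.) The geomean column is the geometric mean of the six per-benchmark ratios; recomputing it from the rounded values in the table lands close to (but not exactly on) the published figure:

```python
import math

def geomean(xs):
    # n-th root of the product, computed stably via logs
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Per-benchmark ratios for option (d), read off the table above
option_d = [0.99, 1.02, 1.36, 1.01, 1.03, 1.01]
print(round(geomean(option_d), 3))   # ~1.063; the table's 1.064 was
                                     # presumably computed from unrounded ratios
```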

Next steps

I will add a summary of this discussion to 

Thank you Martin, Martin, Steve, and Richard for the clarity in this report.