I'm not sure if I can post any more details than what the subject says. On AMD EPYC and Ryzen machines (at least), 549.fotonik3d_r from SPEC CPU 2017 fails with a verification error when compiled with either of the following option combinations:

-Ofast -g -march=native -mtune=native -mprefer-vector-width=256
-Ofast -g -march=native -mtune=native -mprefer-vector-width=512

Unfortunately, it does not fail the test run, only the reference run.
When benchmarking GCC 8 on an older Ivy Bridge Xeon, I also got a 549.fotonik3d_r verification error with just -Ofast -g -march=native -mtune=native.
The same happens on an Intel machine with -march=skylake. The issue is as old as the skylake option (GCC 6.1+), so bisection will not help us here. I can also reproduce it on a Haswell machine with that -march option.
If I'm right, it's also problematic on Haswell with -march=native. Notably, the only affected file is power.fppized.o. I also tried -mno-fma4 and -mno-fma, and neither helps. The assembly diff shows additional vectorization happening.
I see it failing with the four-year-old revision r216027 on the Haswell machine. Looking at portability issues: https://www.spec.org/cpu2017/Docs/benchmarks/549.fotonik3d_r.html

```
It is perhaps worth mentioning that some calculations generate 'subnormal'
numbers (wikipedia) which may cause slower operation than normal numbers on
some hardware platforms. On such platforms, performance may be improved if
"flush to zero on underflow" (FTZ) is enabled. During SPEC's testing of
Fotonik3d, the output validated correctly whether or not FTZ was enabled.
```

Maybe that's causing the -Ofast issues?
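For reference, a "subnormal" double is a nonzero value smaller in magnitude than the smallest normalized double (about 2.2e-308). A quick sketch of what the SPEC note is talking about, in Python (any IEEE-754 double-based language behaves the same; the variable names are just illustrative):

```python
import sys

smallest_normal = sys.float_info.min  # smallest positive normalized double, ~2.225e-308
smallest_subnormal = 5e-324           # smallest positive subnormal double (2**-1074)

# Subnormals fill the gap between 0 and the smallest normal number.
# With FTZ enabled in hardware, such values would instead be flushed to 0.0,
# which is why results can differ (slightly) depending on the FTZ setting.
print(smallest_subnormal > 0.0)              # True: nonzero
print(smallest_subnormal < smallest_normal)  # True: below the normal range
print(smallest_normal / 2 != 0.0)            # True: halving a normal yields a subnormal, not 0
```

Note that -Ofast does enable FTZ/DAZ on x86 (via crtfastmath), so this is one plausible source of -Ofast-only differences, though as the thread later shows, the actual culprit here is vectorization of a floating-point induction.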
Has anyone looked into this any more to see what optimization is causing this failure? In my testing:

-Ofast                                  fails
-Ofast -fno-unsafe-math-optimizations   works
-Ofast -fno-tree-loop-vectorize         works
-O3                                     works

So it seems to be some combination of unsafe math optimizations and vectorization that is causing the failure.
If Martin's bisection to power.fppized.o is correct, you can bisect the loop via the vect_loop or vect_slp debug counters (or first try with just -fno-tree-{loop,slp}-vectorize to narrow down to loop vs. BB vectorization).
(In reply to Richard Biener from comment #6)
> If Martins bisection to power.fppized.o is correct you can bisect the loop
> via the vect_loop or vect_slp debug counters (or first try with just
> -fno-tree-{loop,slp}-vectorize to narrow down to loop vs. BB vectorization).

I will let one of the x86 experts try that. I was just surprised to find that one of the most popular benchmarks fails on one of the most popular targets and that it has been that way for about a year.
For some reason, I can't reproduce it now on Haswell with either GCC 8 or 9. I'll retry on a Zen machine.
(In reply to Martin Liška from comment #8)
> For some reason, I can't reproduce that now on Haswell with both GCC 8 and
> 9. I'll retry with a Zen machine.

The reason is that I had adjusted the tolerance for the test from 1e-10 to 1e-9.
Using the -fdbg-cnt option:

-fdbg-cnt=vect_loop:0
-fdbg-cnt=vect_loop:4:5:power.fppized

I was able to track it down to a single vectorization that happens here:

```
308 !! Set up frequency vector(s)
309 tmppower => first_power
310 ipower = 1
311 do while(associated(tmppower))
312   if ( tmppower%nofreq>1 ) then
313     freqstep = (tmppower%freqlast-tmppower%freqfirst) &
314                / real(tmppower%nofreq - 1, kind=rfp)
315   else
316     freqstep = 0.0_rfp
317   end if
318   freq = tmppower%freqfirst
319   do ifreq = 1, tmppower%nofreq        <------ HERE
320     frequency(ifreq,ipower) = freq
321     freq = freq + freqstep
322   end do
323   tmppower => tmppower%next
324   ipower = ipower + 1
325 end do
```

vect dump:

```
...
power.fppized.f90:319:0: note: Runtime profitability threshold = 4
power.fppized.f90:319:0: note: Static estimate profitability threshold = 9
power.fppized.f90:319:0: note: epilog loop required
power.fppized.f90:319:0: note: vect_can_advance_ivs_p:
power.fppized.f90:319:0: note: Analyze phi: freq_20 = PHI <pretmp_1948(136), freq_728(180)>
power.fppized.f90:319:0: note: Analyze phi: ifreq_1653 = PHI <1(136), ifreq_729(180)>
power.fppized.f90:319:0: note: Analyze phi: .MEM_137 = PHI <.MEM_134(136), .MEM_727(180)>
power.fppized.f90:319:0: note: reduc or virtual phi. skip.
***dbgcnt: upper limit 5 reached for vect_loop.***
power.fppized.f90:319:0: optimized: loop vectorized using 32 byte vectors
power.fppized.f90:319:0: note: === vec_transform_loop ===
power.fppized.f90:319:0: note: Profitability threshold is 4 loop iterations.
...
```
which leads to the following assembly:

```
.L248:
	movl	88(%rdi), %esi
	vmovsd	96(%rdi), %xmm2
	cmpl	$1, %esi
	jle	.L243
	vmovsd	104(%rdi), %xmm1
	leal	-1(%rsi), %eax
	vcvtsi2sdl	%eax, %xmm5, %xmm3
	vsubsd	%xmm2, %xmm1, %xmm1
	testl	%esi, %esi
	movl	%r10d, %r9d
	vdivsd	%xmm3, %xmm1, %xmm3
	cmovg	%esi, %r9d
	cmpl	$3, %esi
	jle	.L371
	vaddsd	%xmm2, %xmm3, %xmm0
	movl	%r9d, %ecx
	shrl	$2, %ecx
	vaddsd	%xmm3, %xmm0, %xmm1
	salq	$5, %rcx
	vunpcklpd	%xmm0, %xmm2, %xmm0
	vaddsd	%xmm3, %xmm1, %xmm4
	addq	%r8, %rcx
	movq	%r8, %rax
	vunpcklpd	%xmm4, %xmm1, %xmm1
	vmulsd	%xmm6, %xmm3, %xmm4
	vinsertf128	$0x1, %xmm1, %ymm0, %ymm0
	vbroadcastsd	%xmm4, %ymm4
	.p2align 4,,10
	.p2align 3
.L250:
	vmovapd	%ymm0, %ymm1
	vmovupd	%ymm1, (%rax)
	addq	$32, %rax
	vaddpd	%ymm4, %ymm0, %ymm0
	cmpq	%rax, %rcx
	jne	.L250
	movl	%r9d, %eax
	andl	$-4, %eax
	vcvtsi2sdl	%eax, %xmm5, %xmm0
	leal	1(%rax), %ecx
	vfmadd231sd	%xmm3, %xmm0, %xmm2
	cmpl	%r9d, %eax
	je	.L251
	movslq	%ecx, %rcx
	addq	%rdx, %rcx
	addl	$2, %eax
	vmovsd	%xmm2, (%r11,%rcx,8)
	vaddsd	%xmm3, %xmm2, %xmm2
	cmpl	%esi, %eax
	jg	.L251
.L270:
	movslq	%eax, %rcx
	addq	%rdx, %rcx
	incl	%eax
	vmovsd	%xmm2, (%r11,%rcx,8)
	vaddsd	%xmm3, %xmm2, %xmm2
	cmpl	%eax, %esi
	jl	.L251
	cltq
	addq	%rdx, %rax
	vmovsd	%xmm2, (%r11,%rax,8)
.L251:
	movq	136(%rdi), %rdi
	addq	%r15, %rdx
	addq	%r12, %r8
	testq	%rdi, %rdi
	jne	.L248
```

So there's probably nothing we can do right now? I'm removing the wrong-code keyword, as this is about floating-point precision when using -Ofast and vectors, which is a known limitation if I'm correct.
Yeah, it should even add _extra_ precision because it does fewer rounding steps. I wonder what the difference in frequency(ifreq,ipower) is when comparing vectorization vs. non-vectorization.
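To make the rounding question concrete, here is a small sketch (in Python, not the benchmark itself) emulating the two evaluation orders seen in the assembly above: the scalar loop accumulates freq = freq + freqstep once per element, while the vectorized loop starts four lanes at first + k*step and advances each lane by 4*step, with the epilogue restarting from first + i*step. All names and values are illustrative, not taken from the benchmark:

```python
def scalar_induction(first, step, n):
    # one rounding per element: freq = freq + step
    out, freq = [], first
    for _ in range(n):
        out.append(freq)
        freq += step
    return out

def simd_induction(first, step, n, vf=4):
    # emulate the vectorized loop: lanes start at first + k*step,
    # each vector iteration adds vf*step to every lane
    out = [0.0] * n
    lanes = [first + k * step for k in range(vf)]
    vstep = vf * step
    i = 0
    while i + vf <= n:
        for k in range(vf):
            out[i + k] = lanes[k]
            lanes[k] += vstep
        i += vf
    freq = first + i * step  # scalar epilogue, fma-style restart
    while i < n:
        out[i] = freq
        freq += step
        i += 1
    return out

a = scalar_induction(0.0, 0.1, 1000)
b = simd_induction(0.0, 0.1, 1000)
diffs = [abs(x - y) for x, y in zip(a, b)]
print(max(diffs))  # tiny, but the two evaluation orders do round differently
```

For step = 0.1 (not exactly representable in binary) the maximum difference is on the order of 1e-12 over 1000 elements: far below the 1e-10 tolerance per element, but enough that downstream computations can drift past a fixed validation threshold.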
I contributed to the development of benchmark 549.fotonik3d_r. The opinions herein are my own, not necessarily SPEC's.

Martin Liška wrote in comment 9:
> adjusted tolerance for the test from 1e-10 to 1e-9

That change would have been highly desirable if this problem had been found prior to the release of CPU 2017. Unfortunately, post-release, it is very difficult to change a SPEC CPU benchmark, because of the philosophy of "no moving targets". To be clear, a rule-compliant SPEC CPU run is not allowed to change the tolerance.

Why wasn't the problem found before release?

Although GCC was tested prior to release of CPU 2017, the circumstances that led to this problem were not encountered. As Steve Ellcey wrote in comment 5, the problem comes and goes with various optimizations:

> -Ofast fails
> -Ofast -fno-unsafe-math-optimizations works
> -Ofast -fno-tree-loop-vectorize works
> -O3 works

In addition, it appears that the problem:

- Depends on the particular architecture. In my tests today, it disappears when I remove -march=native.
- Does not happen in the presence of FDO (Feedback-Directed Optimization, i.e. -fprofile-generate / -fprofile-use), typically used by "peak" tuning (see https://www.spec.org/cpu2017/Docs/overview.html#Q16 for info on base vs. peak).

SPEC CPU Examples

SPEC CPU 2017 users who start from the Example config files for GCC in $SPEC/config are unlikely to hit the problem, because most of the GCC Example config files use -O3 (not -Ofast) for base. However, if a user modifies an example to use -Ofast -march=native, then they will also need to add -fno-unsafe-math-optimizations or -fno-tree-loop-vectorize.

Working around the problem

The same-for-all rule -- https://www.spec.org/cpu2017/Docs/runrules.html#BaseFlags -- says that all benchmarks in a suite of a given language must use the same base flags.
Here are several examples of how a config file could obey that rule while working around the problem:

Option (a) - In base, avoid -Ofast for Fortran

```
default=base:
   COPTIMIZE   = -Ofast -flto -march=native
   CXXOPTIMIZE = -Ofast -flto -march=native
   FOPTIMIZE   = -O3 -flto -march=native
```

Option (b) - In base, avoid -march=native for Fortran

```
default=base:
   COPTIMIZE   = -Ofast -flto -march=native
   CXXOPTIMIZE = -Ofast -flto -march=native
   FOPTIMIZE   = -Ofast -flto
```

Option (c) - Turn off the tree loop vectorizer for Fortran

```
default=base:
   OPTIMIZE  = -Ofast -flto -march=native
fprate,fpspeed=base:
   FOPTIMIZE = -fno-tree-loop-vectorize
```

Option (d) - Turn off unsafe math for Fortran

```
default=base:
   OPTIMIZE  = -Ofast -flto -march=native
fprate,fpspeed=base:
   FOPTIMIZE = -fno-unsafe-math-optimizations
```

Performance impact

The performance impact of the options above will be system dependent, and may depend on how hard you exercise the system (e.g. one copy or many copies). For a particular system tested by one particular person running only one copy, here are the results of the above 4 options, normalized to option (a).

Performance of the Fortran benchmarks in SPECrate 2017 FP, normalized to option (a). Higher is better.

```
               503.    507.   521.  527.  549.     554.
               bwaves  cactu  wrf   cam4  fotonik  roms  geomean
(a) O3         1.00    1.00   1.00  1.00  1.00     1.00  1.000
(b) no native  1.31     .82   1.31  1.01   .93      .98  1.045
(c) no vect    1.31    1.01    .94   .89   .90      .85   .973
(d) no unsafe   .99    1.02   1.36  1.01  1.03     1.01  1.064
```

Given the above, at the moment option (d) seems best.

Next steps

I will add a summary of this discussion to https://www.spec.org/cpu2017/Docs/benchmarks/549.fotonik3d_r.html

Thank you Martin, Martin, Steve, and Richard for the clarity in this report.
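As a sanity check, the geomean column can be approximately reproduced from the rounded per-benchmark ratios in the table (small last-digit differences are expected, since the published geomean was presumably computed from unrounded data):

```python
from math import prod

# Rounded per-benchmark ratios for option (d) "no unsafe", from the table above
ratios_d = [0.99, 1.02, 1.36, 1.01, 1.03, 1.01]

def geomean(xs):
    # geometric mean: n-th root of the product of n values
    return prod(xs) ** (1.0 / len(xs))

g = geomean(ratios_d)
print(f"option (d) geomean ~ {g:.3f}")  # close to the reported 1.064
```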
One option is to introduce a less invasive optimization option to avoid the undesirable vectorization, for example -fno-vectorize-fp-inductions, noting that this particular loop is a floating-point induction. Such loops usually tend not to be performance critical. A more general -f[no-]vectorize-{in,re}ductions[={fp,int}] would be possible as well.
(In reply to Richard Biener from comment #13)
> One option is to introduce a less invasive optimization option[...]

I think that would be useful, yes. It could even be a param if we do not want to commit to having it forever.
OK, let me try cooking something up.
Created attachment 52582 [details]
simple --param patch

This adds the ability to turn off floating-point induction vectorization with --param vect-induction-float=0, which should be usable on top of -Ofast.

Can somebody with unaltered sources verify whether that helps?
(In reply to Richard Biener from comment #16)
> Created attachment 52582 [details]
> simple --param patch
>
> This adds the ability to turn off floating point induction vectorization
> with --param vect-induction-float=0 which should be usable ontop of -Ofast.
>
> Can somebody with unaltered sources verify if that helps?

I've verified it works to avoid the miscompare.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:058d19b42ad4c4c22635f70db6913a80884aedec

commit r12-7535-g058d19b42ad4c4c22635f70db6913a80884aedec
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Mar 8 12:07:07 2022 +0100

    tree-optimization/84201 - add --param vect-induction-float

    This adds a --param to allow disabling of vectorization of floating
    point inductions.  Ontop of -Ofast this should allow 549.fotonik3d_r
    to not miscompare.

    2022-03-08  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/84201
            * params.opt (-param=vect-induction-float): Add.
            * doc/invoke.texi (vect-induction-float): Document.
            * tree-vect-loop.cc (vectorizable_induction): Honor
            param_vect_induction_float.
            * gcc.dg/vect/pr84201.c: New testcase.
There's now --param vect-induction-float=0 available as a mitigation, which disables all vectorization of floating-point typed inductions. As for the original report, there's nothing wrong with GCC: though we try to make -Ofast usable for SPEC, this particular vectorization isn't doing anything invalid, and the rounding differences are well within what you'd expect. So, closing the bug as INVALID.
*** Bug 113570 has been marked as a duplicate of this bug. ***