I'm not sure if I can post any more details than what the subject says. On AMD EPYC and Ryzen machines (at least), 549.fotonik3d_r from SPEC CPU 2017 fails with a verification error when compiled with either of the following option combinations:

-Ofast -g -march=native -mtune=native -mprefer-vector-width=256
-Ofast -g -march=native -mtune=native -mprefer-vector-width=512

Unfortunately, it does not fail the test run, only the reference run.
When benchmarking GCC 8 on an older Ivy Bridge Xeon, I also got a 549.fotonik3d_r verification error with just -Ofast -g -march=native -mtune=native.
The same happens on an Intel machine with -march=skylake. The issue is as old as the skylake option (GCC 6.1+), so bisection will not help us here. I can also reproduce it on a Haswell machine with that -march option.
If I'm right, it's also problematic on Haswell with -march=native. Notably, the only affected file is power.fppized.o. I also tried -mno-fma4 and -mno-fma, and neither helps. The assembly diff shows additional vectorization happening.
I see it failing with the four-year-old revision r216027 on the Haswell machine. Looking at portability issues: https://www.spec.org/cpu2017/Docs/benchmarks/549.fotonik3d_r.html

```
It is perhaps worth mentioning that some calculations generate 'subnormal'
numbers (wikipedia) which may cause slower operation than normal numbers on
some hardware platforms. On such platforms, performance may be improved if
"flush to zero on underflow" (FTZ) is enabled. During SPEC's testing of
Fotonik3d, the output validated correctly whether or not FTZ was enabled.
```

Maybe that's causing the -Ofast issues?
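For reference, a "subnormal" double is a nonzero value smaller in magnitude than the smallest normalized double (about 2.2e-308). A quick sketch of what the SPEC note is talking about, in Python (any IEEE-754 double-based language behaves the same; the variable names are just illustrative):

```python
import sys

smallest_normal = sys.float_info.min  # smallest positive normalized double, ~2.225e-308
smallest_subnormal = 5e-324           # smallest positive subnormal double (2**-1074)

# Subnormals fill the gap between 0 and the smallest normal number.
# With FTZ enabled in hardware, such values would instead be flushed to 0.0,
# which is why results can differ (slightly) depending on the FTZ setting.
print(smallest_subnormal > 0.0)              # True: nonzero
print(smallest_subnormal < smallest_normal)  # True: below the normal range
print(smallest_normal / 2 != 0.0)            # True: halving a normal yields a subnormal, not 0
```

Note that -Ofast does enable FTZ/DAZ on x86 (via crtfastmath), so this is one plausible source of -Ofast-only differences, though as the thread later shows, the actual culprit here is vectorization of a floating-point induction.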
Has anyone looked into this any more to see what optimization is causing this failure? In my testing:

-Ofast                                  fails
-Ofast -fno-unsafe-math-optimizations   works
-Ofast -fno-tree-loop-vectorize         works
-O3                                     works

So it seems to be some combination of unsafe math optimizations and vectorization that is causing the failure.
If Martin's bisection to power.fppized.o is correct, you can bisect the loop via the vect_loop or vect_slp debug counters (or first try with just -fno-tree-{loop,slp}-vectorize to narrow down to loop vs. BB vectorization).
(In reply to Richard Biener from comment #6)
> If Martins bisection to power.fppized.o is correct you can bisect the loop
> via the vect_loop or vect_slp debug counters (or first try with just
> -fno-tree-{loop,slp}-vectorize to narrow down to loop vs. BB vectorization).

I will let one of the x86 experts try that. I was just surprised to find that one of the most popular benchmarks fails on one of the most popular targets and that it has been that way for about a year.
For some reason, I can't reproduce it now on Haswell with either GCC 8 or 9. I'll retry on a Zen machine.
(In reply to Martin Liška from comment #8)
> For some reason, I can't reproduce that now on Haswell with both GCC 8 and
> 9. I'll retry with a Zen machine.

The reason is that I had adjusted the tolerance for the test from 1e-10 to 1e-9.
Using the -fdbg-cnt option:

-fdbg-cnt=vect_loop:0
-fdbg-cnt=vect_loop:4:5:power.fppized

I was able to track it down to a single vectorization that happens here:

```
308 !! Set up frequency vector(s)
309 tmppower => first_power
310 ipower = 1
311 do while(associated(tmppower))
312   if ( tmppower%nofreq>1 ) then
313     freqstep = (tmppower%freqlast-tmppower%freqfirst) &
314                / real(tmppower%nofreq - 1, kind=rfp)
315   else
316     freqstep = 0.0_rfp
317   end if
318   freq = tmppower%freqfirst
319   do ifreq = 1, tmppower%nofreq        <------ HERE
320     frequency(ifreq,ipower) = freq
321     freq = freq + freqstep
322   end do
323   tmppower => tmppower%next
324   ipower = ipower + 1
325 end do
```

vect dump:

```
...
power.fppized.f90:319:0: note: Runtime profitability threshold = 4
power.fppized.f90:319:0: note: Static estimate profitability threshold = 9
power.fppized.f90:319:0: note: epilog loop required
power.fppized.f90:319:0: note: vect_can_advance_ivs_p:
power.fppized.f90:319:0: note: Analyze phi: freq_20 = PHI <pretmp_1948(136), freq_728(180)>
power.fppized.f90:319:0: note: Analyze phi: ifreq_1653 = PHI <1(136), ifreq_729(180)>
power.fppized.f90:319:0: note: Analyze phi: .MEM_137 = PHI <.MEM_134(136), .MEM_727(180)>
power.fppized.f90:319:0: note: reduc or virtual phi. skip.
***dbgcnt: upper limit 5 reached for vect_loop.***
power.fppized.f90:319:0: optimized: loop vectorized using 32 byte vectors
power.fppized.f90:319:0: note: === vec_transform_loop ===
power.fppized.f90:319:0: note: Profitability threshold is 4 loop iterations.
...
```
which leads to the following assembly:

```
.L248:
	movl	88(%rdi), %esi
	vmovsd	96(%rdi), %xmm2
	cmpl	$1, %esi
	jle	.L243
	vmovsd	104(%rdi), %xmm1
	leal	-1(%rsi), %eax
	vcvtsi2sdl	%eax, %xmm5, %xmm3
	vsubsd	%xmm2, %xmm1, %xmm1
	testl	%esi, %esi
	movl	%r10d, %r9d
	vdivsd	%xmm3, %xmm1, %xmm3
	cmovg	%esi, %r9d
	cmpl	$3, %esi
	jle	.L371
	vaddsd	%xmm2, %xmm3, %xmm0
	movl	%r9d, %ecx
	shrl	$2, %ecx
	vaddsd	%xmm3, %xmm0, %xmm1
	salq	$5, %rcx
	vunpcklpd	%xmm0, %xmm2, %xmm0
	vaddsd	%xmm3, %xmm1, %xmm4
	addq	%r8, %rcx
	movq	%r8, %rax
	vunpcklpd	%xmm4, %xmm1, %xmm1
	vmulsd	%xmm6, %xmm3, %xmm4
	vinsertf128	$0x1, %xmm1, %ymm0, %ymm0
	vbroadcastsd	%xmm4, %ymm4
	.p2align 4,,10
	.p2align 3
.L250:
	vmovapd	%ymm0, %ymm1
	vmovupd	%ymm1, (%rax)
	addq	$32, %rax
	vaddpd	%ymm4, %ymm0, %ymm0
	cmpq	%rax, %rcx
	jne	.L250
	movl	%r9d, %eax
	andl	$-4, %eax
	vcvtsi2sdl	%eax, %xmm5, %xmm0
	leal	1(%rax), %ecx
	vfmadd231sd	%xmm3, %xmm0, %xmm2
	cmpl	%r9d, %eax
	je	.L251
	movslq	%ecx, %rcx
	addq	%rdx, %rcx
	addl	$2, %eax
	vmovsd	%xmm2, (%r11,%rcx,8)
	vaddsd	%xmm3, %xmm2, %xmm2
	cmpl	%esi, %eax
	jg	.L251
.L270:
	movslq	%eax, %rcx
	addq	%rdx, %rcx
	incl	%eax
	vmovsd	%xmm2, (%r11,%rcx,8)
	vaddsd	%xmm3, %xmm2, %xmm2
	cmpl	%eax, %esi
	jl	.L251
	cltq
	addq	%rdx, %rax
	vmovsd	%xmm2, (%r11,%rax,8)
.L251:
	movq	136(%rdi), %rdi
	addq	%r15, %rdx
	addq	%r12, %r8
	testq	%rdi, %rdi
	jne	.L248
```

So there's probably nothing we can do right now? I'm removing the wrong-code keyword, as this is about floating-point precision when using -Ofast and vectors, which is a known limitation if I'm correct.
Yeah, it should even add _extra_ precision because it does fewer rounding steps. I wonder what the difference in frequency(ifreq,ipower) is when comparing vectorization vs. non-vectorization.
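To make the rounding question concrete, here is a small sketch (in Python, not the benchmark itself) emulating the two evaluation orders seen in the assembly above: the scalar loop accumulates freq = freq + freqstep once per element, while the vectorized loop starts four lanes at first + k*step and advances each lane by 4*step, with the epilogue restarting from first + i*step. All names and values are illustrative, not taken from the benchmark:

```python
def scalar_induction(first, step, n):
    # one rounding per element: freq = freq + step
    out, freq = [], first
    for _ in range(n):
        out.append(freq)
        freq += step
    return out

def simd_induction(first, step, n, vf=4):
    # emulate the vectorized loop: lanes start at first + k*step,
    # each vector iteration adds vf*step to every lane
    out = [0.0] * n
    lanes = [first + k * step for k in range(vf)]
    vstep = vf * step
    i = 0
    while i + vf <= n:
        for k in range(vf):
            out[i + k] = lanes[k]
            lanes[k] += vstep
        i += vf
    freq = first + i * step  # scalar epilogue, fma-style restart
    while i < n:
        out[i] = freq
        freq += step
        i += 1
    return out

a = scalar_induction(0.0, 0.1, 1000)
b = simd_induction(0.0, 0.1, 1000)
diffs = [abs(x - y) for x, y in zip(a, b)]
print(max(diffs))  # tiny, but the two evaluation orders do round differently
```

For step = 0.1 (not exactly representable in binary) the maximum difference is on the order of 1e-12 over 1000 elements: far below the 1e-10 tolerance per element, but enough that downstream computations can drift past a fixed validation threshold.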
I contributed to the development of benchmark 549.fotonik3d_r. The opinions herein are my own, not necessarily SPEC's.

Martin Liška wrote in comment 9:
> adjusted tolerance for the test from 1e-10 to 1e-9

That change would have been highly desirable if this problem had been found prior to the release of CPU 2017. Unfortunately, post-release, it is very difficult to change a SPEC CPU benchmark, because of the philosophy of "no moving targets". To be clear, a rule-compliant SPEC CPU run is not allowed to change the tolerance.

Why wasn't the problem found before release?

Although GCC was tested prior to release of CPU 2017, the circumstances that led to this problem were not encountered. As Steve Ellcey wrote in comment 5, the problem comes and goes with various optimizations:

> -Ofast fails
> -Ofast -fno-unsafe-math-optimizations works
> -Ofast -fno-tree-loop-vectorize works
> -O3 works

In addition, it appears that the problem:

- Depends on the particular architecture. In my tests today, it disappears when I remove -march=native.
- Does not happen in the presence of FDO (Feedback-Directed Optimization, i.e. -fprofile-generate / -fprofile-use), typically used by "peak" tuning (see https://www.spec.org/cpu2017/Docs/overview.html#Q16 for info on base vs. peak).

SPEC CPU Examples

SPEC CPU 2017 users who start from the Example config files for GCC in $SPEC/config are unlikely to hit the problem, because most of the GCC Example config files use -O3 (not -Ofast) for base. However, if a user modifies an example to use -Ofast -march=native, then they will also need to add -fno-unsafe-math-optimizations or -fno-tree-loop-vectorize.

Working around the problem

The same-for-all rule -- https://www.spec.org/cpu2017/Docs/runrules.html#BaseFlags -- says that all benchmarks in a suite of a given language must use the same base flags.
Here are several examples of how a config file could obey that rule while working around the problem:

Option (a) - In base, avoid -Ofast for Fortran

```
default=base:
   COPTIMIZE   = -Ofast -flto -march=native
   CXXOPTIMIZE = -Ofast -flto -march=native
   FOPTIMIZE   = -O3 -flto -march=native
```

Option (b) - In base, avoid -march=native for Fortran

```
default=base:
   COPTIMIZE   = -Ofast -flto -march=native
   CXXOPTIMIZE = -Ofast -flto -march=native
   FOPTIMIZE   = -Ofast -flto
```

Option (c) - Turn off the tree loop vectorizer for Fortran

```
default=base:
   OPTIMIZE  = -Ofast -flto -march=native
fprate,fpspeed=base:
   FOPTIMIZE = -fno-tree-loop-vectorize
```

Option (d) - Turn off unsafe math for Fortran

```
default=base:
   OPTIMIZE  = -Ofast -flto -march=native
fprate,fpspeed=base:
   FOPTIMIZE = -fno-unsafe-math-optimizations
```

Performance impact

The performance impact of the options above will be system dependent, and may depend on how hard you exercise the system (e.g. one copy or many copies). For a particular system tested by one particular person running only one copy, here are the results of the above 4 options, normalized to option (a).

Performance of the Fortran benchmarks in SPECrate 2017 FP, normalized to option (a). Higher is better.

```
               503.    507.   521.  527.  549.     554.
               bwaves  cactu  wrf   cam4  fotonik  roms  geomean
(a) O3         1.00    1.00   1.00  1.00  1.00     1.00  1.000
(b) no native  1.31     .82   1.31  1.01   .93      .98  1.045
(c) no vect    1.31    1.01    .94   .89   .90      .85   .973
(d) no unsafe   .99    1.02   1.36  1.01  1.03     1.01  1.064
```

Given the above, at the moment option (d) seems best.

Next steps

I will add a summary of this discussion to https://www.spec.org/cpu2017/Docs/benchmarks/549.fotonik3d_r.html

Thank you Martin, Martin, Steve, and Richard for the clarity in this report.
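As a sanity check, the geomean column can be approximately reproduced from the rounded per-benchmark ratios in the table (small last-digit differences are expected, since the published geomean was presumably computed from unrounded data):

```python
from math import prod

# Rounded per-benchmark ratios for option (d) "no unsafe", from the table above
ratios_d = [0.99, 1.02, 1.36, 1.01, 1.03, 1.01]

def geomean(xs):
    # geometric mean: n-th root of the product of n values
    return prod(xs) ** (1.0 / len(xs))

g = geomean(ratios_d)
print(f"option (d) geomean ~ {g:.3f}")  # close to the reported 1.064
```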
One option is to introduce a less invasive optimization option to avoid the undesirable vectorization, for example -fno-vectorize-fp-inductions, noting that this particular loop is a floating-point induction. Such loops usually tend not to be performance critical. A more general -f[no-]vectorize-{in,re}ductions[={fp,int}] would be possible as well.
(In reply to Richard Biener from comment #13)
> One option is to introduce a less invasive optimization option[...]

I think that would be useful, yes. It could even be a param if we do not want to commit to having it forever.
OK, let me try cooking something up.
Created attachment 52582 [details]
simple --param patch

This adds the ability to turn off floating-point induction vectorization with --param vect-induction-float=0, which should be usable on top of -Ofast.

Can somebody with unaltered sources verify whether that helps?
(In reply to Richard Biener from comment #16)
> Created attachment 52582 [details]
> simple --param patch
>
> This adds the ability to turn off floating point induction vectorization
> with --param vect-induction-float=0 which should be usable ontop of -Ofast.
>
> Can somebody with unaltered sources verify if that helps?

I've verified it works to avoid the miscompare.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:058d19b42ad4c4c22635f70db6913a80884aedec

commit r12-7535-g058d19b42ad4c4c22635f70db6913a80884aedec
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Mar 8 12:07:07 2022 +0100

    tree-optimization/84201 - add --param vect-induction-float

    This adds a --param to allow disabling of vectorization of floating
    point inductions.  Ontop of -Ofast this should allow 549.fotonik3d_r
    to not miscompare.

    2022-03-08  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/84201
            * params.opt (-param=vect-induction-float): Add.
            * doc/invoke.texi (vect-induction-float): Document.
            * tree-vect-loop.cc (vectorizable_induction): Honor
            param_vect_induction_float.
            * gcc.dg/vect/pr84201.c: New testcase.
There's now --param vect-induction-float=0 available as a mitigation, which disables all vectorization of floating-point typed inductions. As for the original report, there's nothing wrong with GCC: though we try to make -Ofast usable for SPEC, this particular vectorization isn't doing anything invalid, and the rounding differences are well within what you'd expect. So, closing the bug as INVALID.
*** Bug 113570 has been marked as a duplicate of this bug. ***