[Bug fortran/82362] New: [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713
alexander.nesterovskiy at intel dot com
gcc-bugzilla@gcc.gnu.org
Fri Sep 29 14:11:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362
Bug ID: 82362
Summary: [8 Regression] SPEC CPU2006 436.cactusADM ~7%
performance deviation with trunk@251713
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
r251713 brings a reasonable improvement to alloca. However, the patch has a side
effect: 436.cactusADM performance became unstable when compiled with
-Ofast -march=core-avx2 -mfpmath=sse -funroll-loops
The impact is more noticeable when compiled with auto-parallelization
(-ftree-parallelize-loops=N).
Comparing performance over 7 runs each
(relative to the median performance of r251711):
r251711: 99,5% 99,6% 99,8% 100,0% 100,3% 100,6% 100,6%
r251713: 92,8% 92,9% 93,0% 106,7% 107,0% 107,0% 107,2%
r251711 is pretty stable, while r251713 is +7% faster on some runs and -7%
slower on others.
There are a few dynamic arrays in the body of the Bench_StaggeredLeapfrog2
subroutine in StaggeredLeapfrog2.fppized.f.
When compiled with "-fstack-arrays" (the default for "-Ofast"), these arrays are
allocated with alloca.
In r251713 the allocated size is rounded up to a 16-byte multiple with code like
"size = (size + 15) & -16".
In prior revisions the rounding constant differed: "size = (size + 22) & -16",
which may just waste an extra 16 bytes per array depending on the
initial "size" value.
Actual r251713 code, built with
gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s
-O3 -fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops
-ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f
------------
lea rax, [15+r13*8] ; size = <...> + 15
shr rax, 4 ; zero-out
sal rax, 4 ; lower 4 bits
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp ; Array 1
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp ; Array 2
sub rsp, rax
mov QWORD PTR [rbp-4784], rsp ; Array 3 ... and so on
------------
Aligning rsp to the cache line size (on each allocation, or even just once at
the beginning) brings performance to stably high values:
------------
lea rax, [15+r13*8]
shr rax, 4
sal rax, 4
shr rsp, 6 ; Align rsp to
shl rsp, 6 ; 64-byte border
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp
sub rsp, rax
mov QWORD PTR [rbp-4784], rsp
------------
Performance of the 64-byte-aligned version,
compared to the same r251711 median:
106,7% 107,0% 107,0% 107,1% 107,1% 107,2% 107,4%
Maybe what is needed here is some kind of option to force array alignment in
gfortran (like "-align array64byte" in ifort)?