[Bug fortran/82362] New: [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713

alexander.nesterovskiy at intel dot com gcc-bugzilla@gcc.gnu.org
Fri Sep 29 14:11:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362

            Bug ID: 82362
           Summary: [8 Regression] SPEC CPU2006 436.cactusADM ~7%
                    performance deviation with trunk@251713
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.nesterovskiy at intel dot com
  Target Milestone: ---

r251713 brings a reasonable improvement to alloca. However, the patch has a
side effect: 436.cactusADM performance becomes unstable when compiled with
    -Ofast -march=core-avx2 -mfpmath=sse -funroll-loops
The impact is more noticeable when compiled with auto-parallelization:
    -ftree-parallelize-loops=N

Comparing performance across a particular set of 7 runs
(relative to the median performance of r251711):
r251711: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%
r251713: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%

r251713 is pretty stable, while r251711 is +7% faster on some runs and -7%
slower on others.

There are a few dynamic arrays in the body of the Bench_StaggeredLeapfrog2
subroutine in StaggeredLeapfrog2.fppized.f.
When compiled with "-fstack-arrays" (the default for "-Ofast"), these arrays
are allocated with alloca.
In r251713 the allocated size is rounded up to a multiple of 16 bytes with
code like "size = (size + 15) & -16".
Prior revisions differed only in the rounding constant: "size = (size + 22) & -16",
which may waste an extra 16 bytes per array, depending on the initial
"size" value.

Actual r251713 code, built with
    gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s
    -O3 -fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops
    -ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f
------------
lea rax, [15+r13*8]             ; size = <...> + 15
shr rax, 4                      ; zero-out
sal rax, 4                      ;     lower 4 bits
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp   ; Array 1
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp   ; Array 2
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp   ; Array 3 ... and so on
------------

Aligning rsp to the cache-line size (on each allocation, or even just once at
the beginning) brings performance to stable high values:
------------
lea rax, [15+r13*8] 
shr rax, 4
sal rax, 4
shr rsp, 6                      ; Align rsp to
shl rsp, 6                      ;     64-byte border
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp
------------

Performance of the 64-byte-aligned version,
relative to the same r251711 median:
106,7%  107,0%  107,0%  107,1%  107,1%  107,2%  107,4%

Perhaps what is needed here is an option to force array alignment in gfortran
(like "-align array64byte" in ifort)?

