Bug 82362 - [9/10/11/12 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713
Summary: [9/10/11/12 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation ...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 8.0
: P2 normal
Target Milestone: 9.5
Assignee: Not yet assigned to anyone
URL:
Keywords: deferred, missed-optimization
Depends on:
Blocks: spec 84613
  Show dependency treegraph
 
Reported: 2017-09-29 14:10 UTC by Alexander Nesterovskiy
Modified: 2021-06-01 08:09 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-10-02 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Nesterovskiy 2017-09-29 14:10:56 UTC
r251713 brings reasonable improvement to alloca. However there is a side effect of this patch - 436.cactusADM performance became unstable when compiled with
    -Ofast -march=core-avx2 -mfpmath=sse -funroll-loops
The impact is more noticeable when compiled with auto-parallelization
    -ftree-parallelize-loops=N

Comparing performance for particular 7-runs
(relative to median performance of r251711):
r251711: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%
r251713: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%

r251711 is prettty stable, while r251713 is +7% faster on some runs and -7% slower on other.

There are few dynamic arrays in the body of Bench_StaggeredLeapfrog2 sub in StaggeredLeapfrog2.fppized.f.
When compiled with "-fstack-arrays" (default for "-Ofast") arrays are allocated by alloca.
Allocated memory size is rounded-up to 16-bytes in r251713 with code like "size = (size + 15) & -16".
In prior revisions it differs in just one byte: "size = (size + 22) & -16"
Which actually may just waste extra 16 bytes for each array depending on initial "size" value.

Actual r251713 code, built with
    gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s
    -O3 -fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops
    -ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f
------------
lea rax, [15+r13*8]             ; size = <...> + 15
shr rax, 4                      ; zero-out
sal rax, 4                      ;     lower 4 bits
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp   ; Array 1
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp   ; Array 2
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp   ; Array 3 ... and so on
------------

Aligning rsp to cache line size (on each allocation or even once in the beginning) brings performance to stable high values:
------------
lea rax, [15+r13*8] 
shr rax, 4
sal rax, 4
shr rsp, 6                      ; Align rsp to
shl rsp, 6                      ;     64-byte border
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp
------------

64-byte aligned version performance
compared to the same median performance of r251711:
106,7%  107,0%  107,0%  107,1%  107,1%  107,2%  107,4%

Maybe what is necessary here is some kind of option to force array aligning for gfortran (like "-align array64byte" for ifort) ?
Comment 1 Dominique d'Humieres 2017-09-30 14:53:55 UTC
Is it really a gfortran bug?
Comment 2 Richard Biener 2017-10-02 08:12:13 UTC
I think the Fortran FE could help with using alloca-with-align with larger alignment (maybe using BIGGEST_ALIGNMENT, I see really no good reason for
using sth as large as 64 bytes by default, so yes, for this we'd need an
option that amends -fstack-arrays).

But that Fortran part isn't really a regression.

> Comparing performance for particular 7-runs
> (relative to median performance of r251711):
> r251711: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%
> r251713: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%

I suppose you swapped the revs here as here r251713 looks more stable.

The median looks unchanged here so the regression would be the instability.
Comment 3 Thomas Koenig 2017-10-02 08:34:33 UTC
The Fortran part is now PR82392.
Comment 4 Alexander Nesterovskiy 2017-10-02 09:13:13 UTC
(In reply to Richard Biener from comment #2)
> I suppose you swapped the revs here
Yep, sorry. It's supposed to be:
r251711: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%
r251713: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%
Comment 5 Richard Biener 2018-03-27 09:18:56 UTC
Deferred.  Not sure if this "regression" is worth tracking.
Comment 6 Jakub Jelinek 2019-05-03 09:18:09 UTC
GCC 9.1 has been released.
Comment 7 Jakub Jelinek 2019-08-12 08:57:45 UTC
GCC 9.2 has been released.
Comment 8 Jakub Jelinek 2020-03-12 11:58:49 UTC
GCC 9.3.0 has been released, adjusting target milestone.
Comment 9 Richard Biener 2021-06-01 08:09:21 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.