82362 – [11/12/13/14 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713

Bug 82362 - [11/12/13/14 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713

Summary: [11/12/13/14 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation...

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	8.0

Importance:	P2 normal
Target Milestone:	11.5
Assignee:	Not yet assigned to anyone

URL:
Keywords:	deferred, missed-optimization

Depends on:
Blocks:	spec 84613
	Show dependency tree / graph

Reported:	2017-09-29 14:10 UTC by Alexander Nesterovskiy
Modified:	2023-07-07 10:32 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:	x86_64--, i?86--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2017-10-02 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Alexander Nesterovskiy 2017-09-29 14:10:56 UTC

r251713 brings reasonable improvement to alloca. However there is a side effect of this patch - 436.cactusADM performance became unstable when compiled with
    -Ofast -march=core-avx2 -mfpmath=sse -funroll-loops
The impact is more noticeable when compiled with auto-parallelization
    -ftree-parallelize-loops=N

Comparing performance for particular 7-runs
(relative to median performance of r251711):
r251711: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%
r251713: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%

r251711 is prettty stable, while r251713 is +7% faster on some runs and -7% slower on other.

There are few dynamic arrays in the body of Bench_StaggeredLeapfrog2 sub in StaggeredLeapfrog2.fppized.f.
When compiled with "-fstack-arrays" (default for "-Ofast") arrays are allocated by alloca.
Allocated memory size is rounded-up to 16-bytes in r251713 with code like "size = (size + 15) & -16".
In prior revisions it differs in just one byte: "size = (size + 22) & -16"
Which actually may just waste extra 16 bytes for each array depending on initial "size" value.

Actual r251713 code, built with
    gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s
    -O3 -fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops
    -ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f
------------
lea rax, [15+r13*8]             ; size = <...> + 15
shr rax, 4                      ; zero-out
sal rax, 4                      ;     lower 4 bits
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp   ; Array 1
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp   ; Array 2
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp   ; Array 3 ... and so on
------------

Aligning rsp to cache line size (on each allocation or even once in the beginning) brings performance to stable high values:
------------
lea rax, [15+r13*8] 
shr rax, 4
sal rax, 4
shr rsp, 6                      ; Align rsp to
shl rsp, 6                      ;     64-byte border
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp
sub rsp, rax 
mov QWORD PTR [rbp-4784], rsp
------------

64-byte aligned version performance
compared to the same median performance of r251711:
106,7%  107,0%  107,0%  107,1%  107,1%  107,2%  107,4%

Maybe what is necessary here is some kind of option to force array aligning for gfortran (like "-align array64byte" for ifort) ?

Comment 1 Dominique d'Humieres 2017-09-30 14:53:55 UTC

Is it really a gfortran bug?

Comment 2 Richard Biener 2017-10-02 08:12:13 UTC

I think the Fortran FE could help with using alloca-with-align with larger alignment (maybe using BIGGEST_ALIGNMENT, I see really no good reason for
using sth as large as 64 bytes by default, so yes, for this we'd need an
option that amends -fstack-arrays).

But that Fortran part isn't really a regression.

> Comparing performance for particular 7-runs
> (relative to median performance of r251711):
> r251711: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%
> r251713: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%

I suppose you swapped the revs here as here r251713 looks more stable.

The median looks unchanged here so the regression would be the instability.

Comment 3 Thomas Koenig 2017-10-02 08:34:33 UTC

The Fortran part is now PR82392.

Comment 4 Alexander Nesterovskiy 2017-10-02 09:13:13 UTC

(In reply to Richard Biener from comment #2)
> I suppose you swapped the revs here
Yep, sorry. It's supposed to be:
r251711: 99,5%   99,6%   99,8%   100,0%  100,3%  100,6%  100,6%
r251713: 92,8%   92,9%   93,0%   106,7%  107,0%  107,0%  107,2%

Comment 5 Richard Biener 2018-03-27 09:18:56 UTC

Deferred.  Not sure if this "regression" is worth tracking.

Comment 6 Jakub Jelinek 2019-05-03 09:18:09 UTC

GCC 9.1 has been released.

Comment 7 Jakub Jelinek 2019-08-12 08:57:45 UTC

GCC 9.2 has been released.

Comment 8 Jakub Jelinek 2020-03-12 11:58:49 UTC

GCC 9.3.0 has been released, adjusting target milestone.

Comment 9 Richard Biener 2021-06-01 08:09:21 UTC

GCC 9.4 is being released, retargeting bugs to GCC 9.5.

Comment 10 Richard Biener 2022-05-27 09:37:36 UTC

GCC 9 branch is being closed

Comment 11 Jakub Jelinek 2022-06-28 10:33:42 UTC

GCC 10.4 is being released, retargeting bugs to GCC 10.5.

Comment 12 Richard Biener 2023-07-07 10:32:32 UTC

GCC 10 branch is being closed.