r251713 brings reasonable improvement to alloca. However there is a side effect of this patch - 436.cactusADM performance became unstable when compiled with -Ofast -march=core-avx2 -mfpmath=sse -funroll-loops The impact is more noticeable when compiled with auto-parallelization -ftree-parallelize-loops=N Comparing performance for particular 7-runs (relative to median performance of r251711): r251711: 92,8% 92,9% 93,0% 106,7% 107,0% 107,0% 107,2% r251713: 99,5% 99,6% 99,8% 100,0% 100,3% 100,6% 100,6% r251711 is prettty stable, while r251713 is +7% faster on some runs and -7% slower on other. There are few dynamic arrays in the body of Bench_StaggeredLeapfrog2 sub in StaggeredLeapfrog2.fppized.f. When compiled with "-fstack-arrays" (default for "-Ofast") arrays are allocated by alloca. Allocated memory size is rounded-up to 16-bytes in r251713 with code like "size = (size + 15) & -16". In prior revisions it differs in just one byte: "size = (size + 22) & -16" Which actually may just waste extra 16 bytes for each array depending on initial "size" value. Actual r251713 code, built with gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s -O3 -fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops -ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f ------------ lea rax, [15+r13*8] ; size = <...> + 15 shr rax, 4 ; zero-out sal rax, 4 ; lower 4 bits sub rsp, rax mov QWORD PTR [rbp-4984], rsp ; Array 1 sub rsp, rax mov QWORD PTR [rbp-4448], rsp ; Array 2 sub rsp, rax mov QWORD PTR [rbp-4784], rsp ; Array 3 ... and so on ------------ Aligning rsp to cache line size (on each allocation or even once in the beginning) brings performance to stable high values: ------------ lea rax, [15+r13*8] shr rax, 4 sal rax, 4 shr rsp, 6 ; Align rsp to shl rsp, 6 ; 64-byte border sub rsp, rax mov QWORD PTR [rbp-4984], rsp sub rsp, rax mov QWORD PTR [rbp-4448], rsp sub rsp, rax mov QWORD PTR [rbp-4784], rsp ------------ 64-byte aligned version performance compared to the same median performance of r251711: 106,7% 107,0% 107,0% 107,1% 107,1% 107,2% 107,4% Maybe what is necessary here is some kind of option to force array aligning for gfortran (like "-align array64byte" for ifort) ?
Is it really a gfortran bug?
I think the Fortran FE could help with using alloca-with-align with larger alignment (maybe using BIGGEST_ALIGNMENT, I see really no good reason for using sth as large as 64 bytes by default, so yes, for this we'd need an option that amends -fstack-arrays). But that Fortran part isn't really a regression. > Comparing performance for particular 7-runs > (relative to median performance of r251711): > r251711: 92,8% 92,9% 93,0% 106,7% 107,0% 107,0% 107,2% > r251713: 99,5% 99,6% 99,8% 100,0% 100,3% 100,6% 100,6% I suppose you swapped the revs here as here r251713 looks more stable. The median looks unchanged here so the regression would be the instability.
The Fortran part is now PR82392.
(In reply to Richard Biener from comment #2) > I suppose you swapped the revs here Yep, sorry. It's supposed to be: r251711: 99,5% 99,6% 99,8% 100,0% 100,3% 100,6% 100,6% r251713: 92,8% 92,9% 93,0% 106,7% 107,0% 107,0% 107,2%
Deferred. Not sure if this "regression" is worth tracking.
GCC 9.1 has been released.
GCC 9.2 has been released.
GCC 9.3.0 has been released, adjusting target milestone.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
GCC 11 branch is being closed.