[Bug fortran/78456] New: [6/7 Regression] 171.swim loops not interchanged, vectorized perf loss on aarch64
chris_s_jones at yahoo dot com
gcc-bugzilla@gcc.gnu.org
Mon Nov 21 22:21:00 GMT 2016
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78456
Bug ID: 78456
Summary: [6/7 Regression] 171.swim loops not interchanged,
vectorized perf loss on aarch64
Product: gcc
Version: 6.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: chris_s_jones at yahoo dot com
Target Milestone: ---
Created attachment 40102
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40102&action=edit
test case
While debugging a perf regression in 171.swim after moving to gfortran 6.2.0, I
noticed that a nested loop in MAIN is not being interchanged, which leads to
sub-optimal vectorization. A simplified test case is attached; an excerpt is
shown here:
      DO 3500 I = 1, MNMIN
         DO 4500 J = 1, MNMIN
            FOO = FOO + ABS(X0(I,J))
            BAR = BAR + ABS(X1(I,J))
            BAZ = BAZ + ABS(X2(I,J))
 4500    CONTINUE
         X1(I,I) = X1(I,I)
     1      * ( MOD (I, 100) /100.)
 3500 CONTINUE
In 4.8.2, the compiler generates the sequence:
230: 4cdf7e47 ld1 {v7.2d}, [x18], #16
234: 4cdf7fd0 ld1 {v16.2d}, [x30], #16
238: 4cdf7c34 ld1 {v20.2d}, [x1], #16
23c: 4ee0f8f5 fabs v21.2d, v7.2d
240: 4ee0fa16 fabs v22.2d, v16.2d
244: 4ee0fa97 fabs v23.2d, v20.2d
248: 4e75d400 fadd v0.2d, v0.2d, v21.2d
24c: 4e76d421 fadd v1.2d, v1.2d, v22.2d
250: 4e77d442 fadd v2.2d, v2.2d, v23.2d
In 6.2.0 and on the trunk, I'm seeing each vector assembled from two scalar
loads at separate locations: without the loop interchange, consecutive
inner-loop iterations do not touch adjacent values:
2c8: fc606834 ldr d20, [x1,x0]
2cc: 52800050 mov w16, #0x2 // #2
2d0: d294dc0e mov x14, #0xa6e0 // #42720
2d4: 6b14021f cmp w16, w20
2d8: fc606bd6 ldr d22, [x30,x0]
2dc: fc6068f7 ldr d23, [x7,x0]
2e0: 8b0d0000 add x0, x0, x13
2e4: fd69b835 ldr d21, [x1,#21360]
2e8: 6e0806b0 mov v16.d[0], v21.d[0]
2ec: 6e180690 mov v16.d[1], v20.d[0]
2f0: 4ee0fa19 fabs v25.2d, v16.2d
2f4: fd69bbd8 ldr d24, [x30,#21360]
2f8: 6e080706 mov v6.d[0], v24.d[0]
2fc: 6e1806c6 mov v6.d[1], v22.d[0]
300: 4ee0f8db fabs v27.2d, v6.2d
304: fd69b8fd ldr d29, [x7,#21360]
308: 6e0807a7 mov v7.d[0], v29.d[0]
30c: 6e1806e7 mov v7.d[1], v23.d[0]
310: 4ee0f8fe fabs v30.2d, v7.2d
314: 4e79d75a fadd v26.2d, v26.2d, v25.2d
318: 4e7bd79c fadd v28.2d, v28.2d, v27.2d
31c: 4e7ed7ff fadd v31.2d, v31.2d, v30.2d
Flags used: -O3 -march=armv8-a+crypto -mcpu=cortex-a57+crypto -ffast-math
-funroll-loops -fvect-cost-model=unlimited -floop-interchange -g -c -o sink.o
sink.f
I understand -floop-interchange is now an alias for -floop-nest-optimize but am
wondering why this case wasn't interchanged. The perf difference seems
significant for this case. Manually swapping the loop indices in the source
causes the better code sequence to be generated.
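For reference, here is a minimal sketch of what such a manual swap might look
like. It is not the attached test case: the program name SWAPIJ, the extent N,
and the fill values are placeholders chosen for illustration. Moving the
diagonal update to X1(J,J) is my assumption about how the statement follows
the new outer index. The file is assumed to be saved with a .f extension so
that gfortran treats it as fixed form.

```fortran
      PROGRAM SWAPIJ
C     Hypothetical stand-in for the attached test case: N and the
C     array contents are assumptions, not taken from 171.swim.
      INTEGER N
      PARAMETER (N = 64)
      DOUBLE PRECISION X0(N,N), X1(N,N), X2(N,N)
      DOUBLE PRECISION FOO, BAR, BAZ
      INTEGER I, J
C     Fill the arrays with arbitrary small values.
      DO 20 J = 1, N
         DO 10 I = 1, N
            X0(I,J) = DBLE(I - J)
            X1(I,J) = DBLE(MOD(I + J, 7) - 3)
            X2(I,J) = DBLE(J)
   10    CONTINUE
   20 CONTINUE
      FOO = 0.0D0
      BAR = 0.0D0
      BAZ = 0.0D0
C     Interchanged nest: the inner loop now runs over the first
C     subscript, which is contiguous in column-major storage.
      DO 3500 J = 1, N
         DO 4500 I = 1, N
            FOO = FOO + ABS(X0(I,J))
            BAR = BAR + ABS(X1(I,J))
            BAZ = BAZ + ABS(X2(I,J))
 4500    CONTINUE
C        Diagonal update now follows the outer index J (assumption).
         X1(J,J) = X1(J,J)
     1      * ( MOD (J, 100) /100.)
 3500 CONTINUE
      PRINT *, FOO, BAR, BAZ
      END
```

Each diagonal element is still scaled only after it has been accumulated into
BAR, so the sums should match the original ordering up to floating-point
reassociation.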
gfortran 6.2.0 and trunk behave the same way; both were built using:
configure 'CFLAGS_FOR_TARGET=-march=armv8-a -mcpu=cortex-a57 -O3'
'CXXFLAGS_FOR_TARGET=-march=armv8-a -mcpu=cortex-a57 -O3'
--prefix=/home/gcc-aarch64/6.2.0-linux-gnu --target=aarch64-linux-gnu
--with-sysroot=/home/gcc-aarch64/6.2.0-linux-gnu/sysroot --enable-__cxa_atexit
--with-gnu-as --with-gnu-ld --enable-shared --disable-libssp
--disable-libmudflap --enable-languages=c,c++,fortran --disable-libsanitizer
--disable-nls