g:bc484e250990393e887f7239157cc85ce6fadcce, r11-205

make -k check-gcc RUNTESTFLAGS=vect.exp=gcc.dg/vect/bb-slp-pr68892.c

FAIL: gcc.dg/vect/bb-slp-pr68892.c scan-tree-dump-times slp2 "Basic block will be vectorized" 1
FAIL: gcc.dg/vect/bb-slp-pr68892.c -flto -ffat-lto-objects scan-tree-dump-times slp2 "Basic block will be vectorized" 1

# of expected passes 4
# of unexpected failures 2

Seeing this on powerpc64, both BE and LE, and on all Power versions.
The testcase probably needs to move to costmodel/*/ because its outcome now depends on the actual costing. On x86_64:

  0x54db890 _1 2 times vector_store costs 24 in body
  0x54db890 <unknown> 1 times vec_construct costs 8 in prologue
  0x54db890 <unknown> 1 times vec_construct costs 8 in prologue
  0x54dcba0 _1 1 times scalar_store costs 12 in body
  0x54dcba0 _2 1 times scalar_store costs 12 in body
  0x54dcba0 _3 1 times scalar_store costs 12 in body
  0x54dcba0 _4 1 times scalar_store costs 12 in body

while ppc64le has

  0x42edf00 _1 2 times vector_store costs 2 in body
  0x42edf00 <unknown> 1 times vec_construct costs 2 in prologue
  0x42edf00 <unknown> 1 times vec_construct costs 2 in prologue
  0x42ef850 _1 1 times scalar_store costs 1 in body
  0x42ef850 _2 1 times scalar_store costs 1 in body
  0x42ef850 _3 1 times scalar_store costs 1 in body
  0x42ef850 _4 1 times scalar_store costs 1 in body

so for ppc64le it's 6 vector vs. 4 scalar while on x86_64 it's 36 vector vs. 48 scalar.

As the comment in the testcase explains, the vectorization is considered a "bug" (well, I'd say if write-combining is profitable we should of course do it):

  /* ??? Due to the gaps we fall back to scalar loads which makes the vectorization profitable. */
  /* { dg-final { scan-tree-dump "not profitable" "slp2" { xfail *-*-* } } } */
  /* { dg-final { scan-tree-dump-times "BB vectorization with gaps at the end of a load is not supported" 1 "slp2" } } */
  /* { dg-final { scan-tree-dump-times "Basic block will be vectorized" 1 "slp2" } } */

On x86_64 we get

  movsd a+2048(%rip), %xmm0
  movsd a(%rip), %xmm1
  movhpd a+3072(%rip), %xmm0
  movhpd a+1024(%rip), %xmm1
  movaps %xmm1, b(%rip)
  movaps %xmm0, b+16(%rip)

vs.

  movsd a(%rip), %xmm0
  movsd %xmm0, b(%rip)
  movsd a+1024(%rip), %xmm0
  movsd %xmm0, b+8(%rip)
  movsd a+2048(%rip), %xmm0
  movsd %xmm0, b+16(%rip)
  movsd a+3072(%rip), %xmm0
  movsd %xmm0, b+24(%rip)

where it looks profitable (larger stores are also always good for STLF), while on ppc64le we have

  0: addis 2,12,.TOC.-.LCF0@ha
     addi 2,2,.TOC.-.LCF0@l
     .localentry foo,.-foo
     addis 9,2,.LANCHOR0+1024@toc@ha
     lfd 10,.LANCHOR0+1024@toc@l(9)
     addis 9,2,.LANCHOR0+2048@toc@ha
     lfd 11,.LANCHOR0+2048@toc@l(9)
     addis 9,2,.LANCHOR0+3072@toc@ha
     lfd 12,.LANCHOR0+3072@toc@l(9)
     addis 9,2,.LANCHOR0+4096@toc@ha
     lfd 0,.LANCHOR0+4096@toc@l(9)
     addis 9,2,.LANCHOR0@toc@ha
     stfd 10,.LANCHOR0@toc@l(9)
     addis 9,2,.LANCHOR0+8@toc@ha
     stfd 11,.LANCHOR0+8@toc@l(9)
     addis 9,2,.LANCHOR0+16@toc@ha
     stfd 12,.LANCHOR0+16@toc@l(9)
     addis 9,2,.LANCHOR0+24@toc@ha
     stfd 0,.LANCHOR0+24@toc@l(9)
     blr

vs. (cost model disabled):

  0: addis 2,12,.TOC.-.LCF0@ha
     addi 2,2,.TOC.-.LCF0@l
     .localentry foo,.-foo
     addis 9,2,.LANCHOR0+2048@toc@ha
     addis 8,2,.LANCHOR0@toc@ha
     li 10,16
     lfd 10,.LANCHOR0+2048@toc@l(9)
     lfd 11,.LANCHOR0@toc@l(8)
     addis 9,2,.LANCHOR0+3072@toc@ha
     addis 8,2,.LANCHOR0+1024@toc@ha
     lfd 12,.LANCHOR0+3072@toc@l(9)
     lfd 0,.LANCHOR0+1024@toc@l(8)
     addis 9,2,.LANCHOR0+131072@toc@ha
     addi 9,9,.LANCHOR0+131072@toc@l
     xxpermdi 12,10,12,0
     xxpermdi 0,11,0,0
     stxvd2x 12,9,10
     stxvd2x 0,0,9
     blr

Both look comparatively ugly due to the loads of .LANCHOR uses. I'd have expected a lea of &a[0][0] and then offsetted addressing of that. At least it would avoid a ton of relocations. Looks like 131072 wouldn't fit in the 16-bit offset though. Anyway, off-topic. I don't know whether the xxpermdi makes it unprofitable to vectorize.
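For reference, the vector vs. scalar comparison above can be redone by hand from the quoted dump lines. The following is only a sketch: the struct and function names are hypothetical (not GCC internals), each "costs N" figure is assumed to already be the total for that line, and the "vector must be strictly cheaper" rule is a simplification of what the vectorizer checks. Summed this way, the ppc64le lines give 6 vs. 4 as stated, while the x86_64 vector lines as quoted add up to 40 rather than 36; either way the vector side stays below the scalar 48.

```c
#include <stddef.h>

/* One line of the cost dump: "<times> times <kind> costs <cost> in ...".
   Hypothetical struct for illustration, not GCC's internal representation.  */
struct cost_line
{
  const char *kind;
  int times;
  int cost;	/* assumed to be the total for the line, not per statement */
};

/* Figures copied from the dump lines quoted in this thread.  */
const struct cost_line x86_64_vector[] = {
  { "vector_store", 2, 24 },
  { "vec_construct", 1, 8 },
  { "vec_construct", 1, 8 },
};
const struct cost_line x86_64_scalar[] = {
  { "scalar_store", 1, 12 },
  { "scalar_store", 1, 12 },
  { "scalar_store", 1, 12 },
  { "scalar_store", 1, 12 },
};
const struct cost_line ppc64le_vector[] = {
  { "vector_store", 2, 2 },
  { "vec_construct", 1, 2 },
  { "vec_construct", 1, 2 },
};
const struct cost_line ppc64le_scalar[] = {
  { "scalar_store", 1, 1 },
  { "scalar_store", 1, 1 },
  { "scalar_store", 1, 1 },
  { "scalar_store", 1, 1 },
};

/* Add up the "costs" column of N dump lines.  */
int
sum_costs (const struct cost_line *lines, size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    total += lines[i].cost;
  return total;
}

/* Assumed, simplified rule: vectorize only if strictly cheaper.  */
int
vector_wins (int vector_cost, int scalar_cost)
{
  return vector_cost < scalar_cost;
}
```

With these tables, vector_wins (sum_costs (ppc64le_vector, 3), sum_costs (ppc64le_scalar, 4)) is false and the x86_64 equivalent is true, which is consistent with the test failing only on powerpc64.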
The testcase morphed in a way that it no longer tests what it was originally supposed to, and slightly altering it shows the original issue isn't fixed (anymore). The limit set as a result of PR91403 (and dups) prevents the issue for larger arrays, but the testcase has

  double a[128][128];

which results in a group size of "just" 512 (the limit is 4096). Avoiding the 'BB vectorization with gaps at the end of a load is not supported' by altering it to do

  void foo(void)
  {
    b[0] = a[0][0];
    b[1] = a[1][0];
    b[2] = a[2][0];
    b[3] = a[3][127];
  }

shows that costing has improved further to not account for the dead loads, making the previous test inefficient. In fact the underlying issue isn't fixed (we do code-generate dead loads). In fact the vector permute load is even profitable; just the excessive code-generation issue exists (and is "fixed" by capping it at a constant boundary, which is just too high for this particular testcase).

The testcase now has "dups", so I'll simply remove it.
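A self-contained version of the altered testcase, for anyone wanting to reproduce this, might look like the sketch below. The declaration of b (its size in particular) is an assumption, since only a and the body of foo are quoted above. The helper load_group_span is hypothetical and only spells out the arithmetic: assuming the group size counts elements from the first to the last load, a[0][0] through a[3][127] covers 3*128 + 127 + 1 = 512 doubles, which is below the 4096 cap.

```c
#include <stddef.h>

#define ROWS 128
#define COLS 128

double a[ROWS][COLS];
double b[128];		/* assumed size; only b[0..3] are written */

/* Altered testcase body as quoted above: the last load is moved to the
   end of its row so there is no gap at the end of the load group.  */
void
foo (void)
{
  b[0] = a[0][0];
  b[1] = a[1][0];
  b[2] = a[2][0];
  b[3] = a[3][127];
}

/* Position of a[row][col] in the flattened array, in elements.  */
size_t
flat_index (size_t row, size_t col)
{
  return row * COLS + col;
}

/* Hypothetical helper: number of elements spanned by the loads in foo,
   from the first (a[0][0]) to the last (a[3][127]), inclusive.  */
size_t
load_group_span (void)
{
  return flat_index (3, 127) - flat_index (0, 0) + 1;
}
```

Compiling this with something like -O2 -ftree-slp-vectorize -fdump-tree-slp2-details should show what the slp2 pass decides on a given target; -fno-vect-cost-model disables the cost model for comparison.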
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:cb60334b7162ec5ae560be482cd7a33402470bb4

commit r11-6710-gcb60334b7162ec5ae560be482cd7a33402470bb4
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Jan 15 13:31:28 2021 +0100

    testsuite/96098 - remove redundant testcase

    The testcase morphed in a way no longer testing what it was
    originally supposed to do and slightly altering it shows the
    original issue isn't fixed (anymore).

    The limit as set as result of PR91403 (and dups) prevents the
    issue for larger arrays but the testcase has

      double a[128][128];

    which results in a group size of "just" 512 (the limit is 4096).

    Avoiding the 'BB vectorization with gaps at the end of a load is
    not supported' by altering it to do

      void foo(void)
      {
        b[0] = a[0][0];
        b[1] = a[1][0];
        b[2] = a[2][0];
        b[3] = a[3][127];
      }

    shows that costing has improved further to not account the dead
    loads making the previous test inefficient.  In fact the underlying
    issue isn't fixed (we do code-generate dead loads).

    In fact the vector permute load is even profitable, just the
    excessive code-generation issue exists (and is "fixed" by capping
    it a constant boundary, just too high for this particular testcase).

    The testcase now has "dups", so I'll simply remove it.

    2021-01-15  Richard Biener  <rguenther@suse.de>

            PR testsuite/96098
            * gcc.dg/vect/bb-slp-pr68892.c: Remove.
Fixed.