[Bug tree-optimization/51179] poor vectorization on interlagos.
ubizjak at gmail dot com
gcc-bugzilla@gcc.gnu.org
Tue Nov 22 12:31:00 GMT 2011
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
Uros Bizjak <ubizjak at gmail dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-11-22
                 CC|                            |irar at il dot ibm.com
          Component|target                      |tree-optimization
     Ever Confirmed|0                           |1
--- Comment #2 from Uros Bizjak <ubizjak at gmail dot com> 2011-11-22 11:33:24 UTC ---
We can start here with something that hopefully resembles your original Fortran
code:
--cut here--
double C[10][4], B[10][10], A[10][4];

void test (void)
{
  int i = 0, j = 0, l = 0;

  //for (; j < 10; j += 2)
  //  for (; l < 10; l++)
      for (; i < 4; i++)
        {
          C[j+0][i] = C[j+0][i] + A[l][i] * B[j+0][l];
          C[j+1][i] = C[j+1][i] + A[l][i] * B[j+1][l];
        }
}
--cut here--
gcc -O3 -ffast-math -mfma4 -mavx:
test:
        vmovapd         A(%rip), %ymm0
        vbroadcastsd    B(%rip), %ymm1
        vfmaddpd        C(%rip), %ymm1, %ymm0, %ymm1
        vmovapd         %ymm1, C(%rip)
        vbroadcastsd    B+80(%rip), %ymm1
        vfmaddpd        C+32(%rip), %ymm1, %ymm0, %ymm0
        vmovapd         %ymm0, C+32(%rip)
        vzeroupper
        ret
Nice.
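
For readers who do not want to decode the assembly: the listing above loads
one row of A, broadcasts B[0][0] and B[1][0] (B+80 is ten doubles past B),
and does one fused multiply-add per row of C. Below is a rough hand-written
equivalent using GCC's generic vector extensions. This is only a sketch; the
names v4df and fma_rows are made up for illustration and do not come from the
report, and memcpy is used for the loads and stores so that nothing is
assumed about alignment.

```c
#include <string.h>

/* Hypothetical 4 x double vector type (GCC vector extension). */
typedef double v4df __attribute__ ((vector_size (32)));

double C[10][4], B[10][10], A[10][4];

/* Hand-vectorized form of the "i" loop for one fixed j and l.
   fma_rows is an illustrative name, not part of the test case.  */
void fma_rows (int j, int l)
{
  v4df a, c0, c1;
  double s0 = B[j + 0][l], s1 = B[j + 1][l];
  v4df b0 = { s0, s0, s0, s0 };   /* broadcast of B[j+0][l] */
  v4df b1 = { s1, s1, s1, s1 };   /* broadcast of B[j+1][l] */

  memcpy (&a, A[l], sizeof a);            /* row A[l][0..3]   */
  memcpy (&c0, C[j + 0], sizeof c0);      /* row C[j+0][0..3] */
  memcpy (&c1, C[j + 1], sizeof c1);      /* row C[j+1][0..3] */

  c0 = c0 + a * b0;   /* may be contracted to an FMA by the compiler */
  c1 = c1 + a * b1;

  memcpy (C[j + 0], &c0, sizeof c0);
  memcpy (C[j + 1], &c1, sizeof c1);
}
```

Calling fma_rows (0, 0) performs the same updates to C as test () above.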
Now uncomment the second loop (the "l" index) and the vectorized code gets much worse:
< ... lots of code deleted ... >
.L3:
        vmovupd         (%r8,%rax), %xmm1
        addl            $1, %esi
        vinsertf128     $0x1, 16(%r8,%rax), %ymm1, %ymm1
        vfmaddpd        %ymm0, %ymm5, %ymm1, %ymm0
        vmovapd         %ymm0, (%rbx,%rax)
        vmovupd         (%rcx,%rax), %xmm0
        vinsertf128     $0x1, 16(%rcx,%rax), %ymm0, %ymm0
        vfmaddpd        %ymm0, %ymm4, %ymm1, %ymm0
        vmovupd         %xmm0, (%rcx,%rax)
        vextractf128    $0x1, %ymm0, 16(%rcx,%rax)
        addq            $32, %rax
        cmpl            %r10d, %esi
        jb              .L3
< ... lots of code deleted ... >
This already happens in the tree optimizers (the vectorizer); RTL is just
following this trail.
Confirmed as a vectorizer problem.
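
For anyone who wants to try this, a possible standalone form of the kernel
with the "l" loop enabled is sketched below. Note the assumptions: the name
test_l is made up, and the loop indices are initialized in the for headers so
that i restarts for every l, whereas the test case above declares them once
at the top of the function. It may therefore not be exactly what was compiled
to get the listing above.

```c
double C[10][4], B[10][10], A[10][4];

/* Variant of the kernel with the "l" loop enabled.
   Assumption: i is reset to 0 for every l (not so in the
   original test case, which initializes i only once).  */
void test_l (void)
{
  int j = 0;

  for (int l = 0; l < 10; l++)
    for (int i = 0; i < 4; i++)
      {
        C[j+0][i] = C[j+0][i] + A[l][i] * B[j+0][l];
        C[j+1][i] = C[j+1][i] + A[l][i] * B[j+1][l];
      }
}
```

Compiling with the flags used above (gcc -O3 -ffast-math -mfma4 -mavx, plus
-S to get the assembly) should make it possible to compare the loop body
against the one-loop version.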