This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/69848] New: poor vectorization of a loop from SPEC2006 464.h264ref
- From: "wilson at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 17 Feb 2016 01:34:41 +0000
- Subject: [Bug tree-optimization/69848] New: poor vectorization of a loop from SPEC2006 464.h264ref
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69848
Bug ID: 69848
Summary: poor vectorization of a loop from SPEC2006 464.h264ref
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wilson at gcc dot gnu.org
Target Milestone: ---
This is a continuation of bug 69282, which reported an ICE on the same loop.
The ICE has since been fixed, but the code is still poorly optimized. The
problems can be seen on both armhf and aarch64, and there are several of
them.
The testcase is
#include <stdlib.h>
int fn1 (int) __attribute__ ((noinline));
int a[32];
int fn1(int d) {
  int c = 1;
  for (int b = 0; b < 32; b++)
    if (a[b])
      c = 0;
  return c;
}
int
main (void)
{
  int i;
  for (i = 0; i < 32; i++)
    a[i] = 0;
  if (fn1(10) != 1)
    abort ();
  a[3] = 2;
  a[24] = 1;
  if (fn1(10) != 0)
    abort ();
  return 0;
}
Compiled for aarch64 with -O2 -ftree-vectorize, the loop in fn1 is
.L2:
        ldr     q0, [x0, x1]
        add     x0, x0, 16
        cmp     x0, 128
        cmeq    v0.4s, v0.4s, #0
        not     v0.16b, v0.16b
        cmlt    v0.4s, v0.4s, #0
        bit     v1.16b, v2.16b, v0.16b
        bic     v3.16b, v3.16b, v0.16b
        add     v2.4s, v2.4s, v4.4s
        bne     .L2
The cmlt instruction serves no useful purpose. Its input is already a mask
whose lanes are either all ones or all zeros, so the output is the same as the
input. This can be fixed by adding the missing vcond_mask* patterns to the
armhf and aarch64 ports.
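
The claim about cmlt can be checked with a small scalar model of one 32-bit
lane. This is only a sketch: the helper names (lane_cmeq_zero, lane_cmlt_zero,
cmlt_is_identity_on_masks) are made up for illustration and model the
instruction semantics in plain C; they are not GCC or ACLE functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one lane of "cmeq v0.4s, v0.4s, #0": all ones if the lane is 0. */
static uint32_t lane_cmeq_zero(uint32_t x)
{
    return x == 0 ? UINT32_MAX : 0;
}

/* Model of one lane of "cmlt v0.4s, v0.4s, #0": all ones if the lane,
   read as a signed value, is negative (sign bit set). */
static uint32_t lane_cmlt_zero(uint32_t x)
{
    return (x & 0x80000000u) ? UINT32_MAX : 0;
}

/* Returns 1 if cmlt leaves every mask produced by cmeq + not unchanged.
   Usage: assert(cmlt_is_identity_on_masks()); */
static int cmlt_is_identity_on_masks(void)
{
    const uint32_t samples[] = { 0u, 1u, 2u, 0x7fffffffu, 0x80000000u, UINT32_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t mask = ~lane_cmeq_zero(samples[i]);   /* cmeq, then not */
        if (lane_cmlt_zero(mask) != mask)
            return 0;
    }
    return 1;
}
```

The mask is all ones or all zeros in every lane, and a signed "less than zero"
test maps all ones to all ones and zero to zero, which is why the instruction
is redundant here.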
The not instruction is unnecessary. It can be eliminated by changing the
bit/bic instructions into bif/and. This might be possible via combine, and
might require rewriting some aarch64/armhf patterns to use vector rtl instead
of unspecs.
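
The bit/bic to bif/and rewrite rests on two bitwise identities, which can be
sketched in plain C. The lane_* helpers below are hypothetical names that model
the documented per-bit behaviour of the instructions on a single 32-bit lane;
they say nothing about how combine or the port patterns would need to change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* bit d, n, m: insert bits of n into d where m is 1. */
static uint32_t lane_bit(uint32_t d, uint32_t n, uint32_t m) { return (d & ~m) | (n & m); }
/* bif d, n, m: insert bits of n into d where m is 0. */
static uint32_t lane_bif(uint32_t d, uint32_t n, uint32_t m) { return (d & m) | (n & ~m); }
/* bic d, n, m: d = n AND NOT m. */
static uint32_t lane_bic(uint32_t n, uint32_t m) { return n & ~m; }
/* and d, n, m: d = n AND m. */
static uint32_t lane_and(uint32_t n, uint32_t m) { return n & m; }

/* Returns 1 if "not" followed by bit/bic matches bif/and on the
   un-inverted mask for every sample combination.
   Usage: assert(not_can_be_folded()); */
static int not_can_be_folded(void)
{
    const uint32_t s[] = { 0u, 1u, 0x0000ffffu, 0xa5a5a5a5u, UINT32_MAX };
    const size_t n = sizeof s / sizeof s[0];
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++) {
                uint32_t inverted = ~s[k];              /* the "not" */
                if (lane_bit(s[i], s[j], inverted) != lane_bif(s[i], s[j], s[k]))
                    return 0;
                if (lane_bic(s[j], inverted) != lane_and(s[j], s[k]))
                    return 0;
            }
    return 1;
}
```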
The v2 iterator is computing the index in the array as a vector, which is info
we don't need. We only need the info in v3. We can eliminate the instructions
setting v1 and v2, plus the instructions before the loop setting v1, v2, and
v4, and the instructions after the loop using v1.
Also, related to that, after the loop, we have two reductions.
        umaxv   s0, v1.4s
        dup     v0.4s, v0.s[0]
        cmeq    v1.4s, v1.4s, v0.4s
        and     v1.16b, v3.16b, v1.16b
        umaxv   s1, v1.4s
        umov    w0, v1.s[0]
We only need one reduction here, and we only need the info in v3. This can be
simplified to
        uminv   s1, v3.4s
        umov    w0, v1.s[0]
I don't know offhand what vectorizer changes are necessary to make these last
two transformations.
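
As a rough illustration of why the single uminv reduction is enough, here is a
scalar C model of the simplified loop, assuming 4 lanes that each start at 1
and are cleared when a nonzero element is seen. The names scalar_ref,
lane_model and models_agree are invented for this sketch; it models the
intended result only and is not what the vectorizer emits.

```c
#include <assert.h>
#include <stdint.h>

#define N 32
#define LANES 4

/* Scalar reference: same logic as fn1, with the array passed in. */
static int scalar_ref(const int *a)
{
    int c = 1;
    for (int b = 0; b < N; b++)
        if (a[b])
            c = 0;
    return c;
}

/* Lane model of the simplified code: only the v3 accumulator is kept. */
static int lane_model(const int *a)
{
    uint32_t v3[LANES] = { 1, 1, 1, 1 };
    for (int b = 0; b < N; b += LANES)
        for (int l = 0; l < LANES; l++) {
            uint32_t is_zero = a[b + l] == 0 ? UINT32_MAX : 0; /* cmeq #0 */
            v3[l] &= is_zero;                                  /* and     */
        }
    uint32_t m = v3[0];
    for (int l = 1; l < LANES; l++)                            /* uminv   */
        if (v3[l] < m)
            m = v3[l];
    return (int) m;
}

/* Returns 1 if both versions agree on the two inputs used by main in
   the testcase. Usage: assert(models_agree()); */
static int models_agree(void)
{
    int a[N] = { 0 };
    if (scalar_ref(a) != 1 || lane_model(a) != 1)
        return 0;
    a[3] = 2;
    a[24] = 1;
    if (scalar_ref(a) != 0 || lane_model(a) != 0)
        return 0;
    return 1;
}
```

Because every lane is either 1 or 0, the minimum across lanes is 0 exactly when
some element was nonzero, so neither the index vector nor the second reduction
is needed for this loop.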
I verified that these transformations work on aarch64. Before the
transformations, we have 8 instructions before the loop, 10 instructions inside
the loop, and 6 instructions after the loop. After the transformations, we
have 4 instructions before the loop, 6 instructions inside the loop, and 2
instructions after the loop. So it is half the size statically, and roughly
60% of the original size dynamically.
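
The size figures can be reproduced with simple arithmetic. The sketch below
assumes that "dynamically" means instructions executed in one call to fn1, and
that the loop runs 8 times (32 elements, 4 per vector, matching the
"cmp x0, 128" with a step of 16). Those assumptions and the function names are
mine, not part of the measurement above.

```c
#include <assert.h>

/* Instructions executed for one call: prologue + loop body * trips + epilogue. */
static int dynamic_count(int before_loop, int in_loop, int after_loop, int iterations)
{
    return before_loop + in_loop * iterations + after_loop;
}

/* Percentage (rounded down) of the original dynamic count that remains. */
static int dynamic_percent_after(void)
{
    const int iterations = 32 / 4;                        /* 8 trips    */
    int before = dynamic_count(8, 10, 6, iterations);     /* 8+80+6=94  */
    int after  = dynamic_count(4, 6, 2, iterations);      /* 4+48+2=54  */
    return after * 100 / before;
}

/* Returns 1 if the static size halves (24 -> 12 instructions).
   Usage: assert(static_size_halves()); */
static int static_size_halves(void)
{
    int before = 8 + 10 + 6;
    int after  = 4 + 6 + 2;
    return after * 2 == before;
}
```

Under these assumptions the dynamic count drops from 94 to 54 instructions,
about 57%, which is consistent with the "roughly 60%" estimate.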