[Bug target/106340] New: flag set from SVE svwhilelt intrinsic not reused in loop
yyc1992 at gmail dot com
gcc-bugzilla@gcc.gnu.org
Mon Jul 18 13:01:58 GMT 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340
Bug ID: 106340
Summary: flag set from SVE svwhilelt intrinsic not reused in loop
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---
I'm experimenting with manually writing VLA loops and trying to match the
assembly code I expect, or that the autovectorizer produces. One of the main
areas where I can't get this to work is setting the loop predicate using the
svwhilelt intrinsics. The instruction they correspond to sets the flags, which
can be used directly to terminate the loop. Indeed, when using the
autovectorizer, this is exactly what happens.
```
void set1(uint32_t *__restrict__ out, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        out[i] = 1;
    }
}
```
compiles to
```
        cbz x1, .L1
        mov x2, 0
        cntw x3
        whilelo p0.s, xzr, x1
        mov z0.s, #1
        .p2align 3,,7
.L3:
        st1w z0.s, p0, [x0, x2, lsl 2]
        add x2, x2, x3
        whilelo p0.s, x2, x1
        b.any .L3
.L1:
        ret
```
(Here I believe the flags set by the whilelo in the loop header could also be
used for the jump, but that doesn't save much in this case.)
However, no matter how I try to replicate this with manually written code
using the SVE intrinsics, an additional cmp instruction is always generated.
The closest I can get is by replicating the structure of the auto-vectorized
loop as closely as possible:
```
void set2(uint32_t *__restrict__ out, size_t m)
{
    auto svelen = svcntw();
    auto v = svdup_u32(1);
    if (m != 0) {
        auto pg = svwhilelt_b32(0ul, m);
        for (size_t i = 0; i < m; i += svelen, pg = svwhilelt_b32(i, m)) {
            svst1(pg, &out[i], v);
        }
    }
}
```
which is compiled to
```
        cbz x1, .L9
        mov x2, 0
        cntw x3
        whilelo p0.s, xzr, x1
        mov z0.s, #1
        .p2align 3,,7
.L11:
        st1w z0.s, p0, [x0, x2, lsl 2]
        add x2, x2, x3
        whilelo p0.s, x2, x1
        cmp x1, x2
        bhi .L11
.L9:
        ret
```
which is literally the same code down to register allocation except that the
branch following the `whilelo` instruction is replaced with another comparison
and branch.
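
To illustrate why the extra comparison looks redundant, here is a rough scalar model in plain C++ (no SVE required). It is only a sketch: `model_whilelo`, `model_any` and `exit_conditions_agree` are hypothetical helper names, not part of any SVE API, and the model assumes `i + lanes` never overflows. It treats lane k of the predicate as active when `i + k < m` (unsigned) and checks that "any lane active" (the `b.any` condition) agrees with `m > i` (the `cmp x1, x2` / `bhi` condition) on every iteration of the `set2` loop structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of `whilelo p.s, i, m`:
// lane k is active iff i + k < m, compared as unsigned.
// Assumption: i + lanes does not overflow.
std::vector<bool> model_whilelo(std::uint64_t i, std::uint64_t m, std::size_t lanes)
{
    std::vector<bool> pred(lanes);
    for (std::size_t k = 0; k < lanes; ++k)
        pred[k] = (i + k) < m;
    return pred;
}

// Models the condition tested by `b.any`: at least one lane is active.
bool model_any(const std::vector<bool> &pred)
{
    for (bool lane : pred)
        if (lane)
            return true;
    return false;
}

// Walks the loop structure of set2 and returns true if the flag-based exit
// test and the explicit `i < m` test agree on every iteration.
bool exit_conditions_agree(std::uint64_t m, std::size_t lanes)
{
    assert(lanes > 0);
    if (m == 0)
        return true; // the loop is skipped entirely, as with the cbz
    std::uint64_t i = 0;
    for (;;) {
        i += lanes;                                            // add x2, x2, x3
        bool by_flags = model_any(model_whilelo(i, m, lanes)); // whilelo + b.any
        bool by_cmp = m > i;                                   // cmp x1, x2 + bhi
        if (by_flags != by_cmp)
            return false;
        if (!by_cmp)
            return true; // loop exits
    }
}
```

For example, `exit_conditions_agree(1000, 4)` models a 128-bit vector length (four 32-bit lanes). Under the stated assumptions, lane 0 is active exactly when `i < m`, so the two exit tests should never disagree.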