[Bug rtl-optimization/82237] New: [AArch64] Destructive operations result in poor register allocation after scheduling
jgreenhalgh at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Sep 18 13:25:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82237
Bug ID: 82237
Summary: [AArch64] Destructive operations result in poor
register allocation after scheduling
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jgreenhalgh at gcc dot gnu.org
Target Milestone: ---
A destructive operation is one in which an input operand is both read and
written. For example, in the vector FMLA instruction in AArch64:
  FMLA v0.4s, v1.4s, v2.4s
The first operand is used for the accumulator value (the operation is v0 = v0 +
v1 * v2) and is both read and written by the instruction.
In RTL terms, this is:
  (define_insn "fma<mode>4"
    [(set (match_operand:VHSDF 0 "register_operand" "=w")
         (fma:VHSDF (match_operand:VHSDF 1 "register_operand" "w")
                    (match_operand:VHSDF 2 "register_operand" "w")
                    (match_operand:VHSDF 3 "register_operand" "0")))]
    "TARGET_SIMD"
    "fmla\\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
    [(set_attr "type" "neon_fp_mla_<stype><q>")]
  )
from config/aarch64/aarch64-simd.md.
We can get suboptimal code where the value in a read/write operand is used by
both a destructive operation and a non-destructive operation, and the
destructive operation is scheduled before the non-destructive operation. For
example, with this auto-vectorizable code (with trunk, -O3 -mcpu=cortex-a57):
  void
  foo (float* __restrict__ in1, float* __restrict__ in2,
       float* __restrict__ out1, float* __restrict__ out2)
  {
    for (int i = 0; i < 1024; i++)
      {
        float t = out1[i];
        out1[i] = t + in1[i] * in2[i];
        out2[i] = t + in1[i];
      }
  }
        ldr     q1, [x2, x4]
        ldr     q0, [x0, x4]
        ldr     q2, [x1, x4]
        mov     v3.16b, v1.16b           // <<<<<< 1)
        fmla    v3.4s, v2.4s, v0.4s      // <<<<<< 2)
        fadd    v0.4s, v0.4s, v1.4s      // <<<<<< 3)
        str     q3, [x2, x4]
        str     q0, [x3, x4]
The scheduling of 2) before 3) forces a copy from v1 into v3 at 1), because 3)
still needs the original value of v1 after 2) has overwritten its accumulator.
With an improved schedule, this could be:
        ldr     q1, [x2, x4]
        ldr     q0, [x0, x4]
        ldr     q2, [x1, x4]
        fadd    v4.4s, v0.4s, v1.4s      // <<<<<< 3)
        fmla    v1.4s, v2.4s, v0.4s      // <<<<<< 2)
        str     q1, [x2, x4]
        str     q4, [x3, x4]
In larger loops, we can end up in this situation more frequently than we would
like - the cost of the extra move instructions can be high.
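
To make the ordering argument concrete, here is a rough scalar model in C of the
two schedules. It is only a sketch: model_fmla, schedule_fmla_first,
schedule_fadd_first and moves_saved are hypothetical names made up for this
illustration, one float stands in for a whole vector register, and the "moves"
counter simply counts the explicit copies the model has to make. It says nothing
about how the GCC scheduler or register allocator actually work.

```c
#include <assert.h>

/* Hypothetical model of a destructive multiply-accumulate: the
   accumulator is both read and overwritten, as with FMLA.  */
void
model_fmla (float *acc, float a, float b)
{
  *acc = *acc + a * b;
}

/* Destructive operation first.  The add still needs the old value of
   t, so t must be copied before it is accumulated into.  Returns the
   number of copies made.  */
int
schedule_fmla_first (float t, float a, float b, float *out1, float *out2)
{
  int moves = 0;
  float copy = t;               /* models: mov v3, v1 */
  moves++;
  model_fmla (&copy, a, b);     /* models: fmla v3, v2, v0 */
  *out2 = t + a;                /* models: fadd v0, v0, v1 */
  *out1 = copy;
  return moves;
}

/* Non-destructive add first.  After the add, t is dead, so the
   accumulate can overwrite it in place and no copy is needed.  */
int
schedule_fadd_first (float t, float a, float b, float *out1, float *out2)
{
  int moves = 0;
  *out2 = t + a;                /* models: fadd v4, v0, v1 */
  model_fmla (&t, a, b);        /* models: fmla v1, v2, v0 */
  *out1 = t;
  return moves;
}

/* Checks that both orderings compute the same results and returns how
   many copies the second ordering saves.  */
int
moves_saved (float t, float a, float b)
{
  float o1a, o2a, o1b, o2b;
  int first = schedule_fmla_first (t, a, b, &o1a, &o2a);
  int second = schedule_fadd_first (t, a, b, &o1b, &o2b);
  assert (o1a == o1b && o2a == o2b);
  return first - second;
}
```

In this model, calling moves_saved with any inputs reports one copy saved per
iteration, which corresponds to the mov at 1) above.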