[Bug rtl-optimization/82237] New: [AArch64] Destructive operations result in poor register allocation after scheduling

jgreenhalgh at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Mon Sep 18 13:25:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82237

            Bug ID: 82237
           Summary: [AArch64] Destructive operations result in poor
                    register allocation after scheduling
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jgreenhalgh at gcc dot gnu.org
  Target Milestone: ---

A destructive operation is one in which an input operand is both read and
written. For example, consider the vector FMLA instruction on AArch64:

  FMLA v0.4s, v1.4s, v2.4s

The first operand is used for the accumulator value (the operation is v0 = v0 +
v1 * v2) and is both read and written by the instruction.

In RTL terms, this is:

  (define_insn "fma<mode>4"
    [(set (match_operand:VHSDF 0 "register_operand" "=w")
         (fma:VHSDF (match_operand:VHSDF 1 "register_operand" "w")
                  (match_operand:VHSDF 2 "register_operand" "w")
                    (match_operand:VHSDF 3 "register_operand" "0")))]
    "TARGET_SIMD"
   "fmla\\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
    [(set_attr "type" "neon_fp_mla_<stype><q>")]
  )

from config/aarch64/aarch64-simd.md. The matching constraint "0" on operand 3
forces the accumulator input into the same register as operand 0 (the result),
which is what makes the pattern destructive at the RTL level.
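
At the source level the same operation is exposed through the NEON intrinsic
vfmaq_f32; a minimal sketch (not part of the test case below, and the function
name here is invented) looks like:

  #include <arm_neon.h>

  /* Returns acc + a * b.  With optimization enabled this is expected to
     compile to a single FMLA, with acc in the read/written register.  */
  float32x4_t
  fmla_example (float32x4_t acc, float32x4_t a, float32x4_t b)
  {
    return vfmaq_f32 (acc, a, b);
  }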

We can get suboptimal code when a read/write operand is used by both a
destructive operation and a non-destructive operation, and the destructive
operation is scheduled before the non-destructive one. For example, with this
auto-vectorizable code (compiled with trunk at -O3 -mcpu=cortex-a57):

  void
  foo (float* __restrict__ in1, float* __restrict__ in2,
       float* __restrict__ out1, float* __restrict__ out2)
  {
    for (int i = 0; i < 1024; i++)
      {
        float t = out1[i];
        out1[i] = t + in1[i] * in2[i];
        out2[i] = t + in1[i];
      }
  }
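
A command line along these lines reproduces the problem (the cross-compiler
name and source file name are just examples; the flags are those given above):

  aarch64-none-linux-gnu-gcc -O3 -mcpu=cortex-a57 -S foo.c

The vectorized inner loop we currently generate is: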

        ldr     q1, [x2, x4]
        ldr     q0, [x0, x4]
        ldr     q2, [x1, x4]
        mov     v3.16b, v1.16b          // <<<<<< 1)
        fmla    v3.4s, v2.4s, v0.4s     // <<<<<< 2)
        fadd    v0.4s, v0.4s, v1.4s     // <<<<<< 3)
        str     q3, [x2, x4]
        str     q0, [x3, x4]


Scheduling 2) before 3) keeps v1 live at the fmla (it is still needed by the
fadd), so the register allocator has to copy v1 into v3 at 1). With an
improved schedule, this could be:

        ldr     q1, [x2, x4]
        ldr     q0, [x0, x4]
        ldr     q2, [x1, x4]
        fadd    v4.4s, v0.4s, v1.4s     // <<<<<< 3)
        fmla    v1.4s, v2.4s, v0.4s     // <<<<<< 2)
        str     q1, [x2, x4]
        str     q4, [x3, x4]

In larger loops, we can end up in this situation more frequently than we would
like; the cost of the extra move instructions can be high.
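
As a purely illustrative sketch (not taken from the report's test case; the
function and array names here are invented), a kernel with several such
accumulators shows how the pattern repeats: each accumulator that is both
reused by an FMA and read by another operation can pick up its own extra mov
if the FMA is scheduled first:

  void
  bar (float* __restrict__ in1, float* __restrict__ in2,
       float* __restrict__ out1, float* __restrict__ out2,
       float* __restrict__ out3, float* __restrict__ out4)
  {
    for (int i = 0; i < 1024; i++)
      {
        float t1 = out1[i];
        float t2 = out2[i];
        out1[i] = t1 + in1[i] * in2[i];
        out3[i] = t1 + in1[i];
        out2[i] = t2 + in2[i] * in1[i];
        out4[i] = t2 + in2[i];
      }
  }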

