[PATCH/AARCH64] Add scheduler for vulcan.

James Greenhalgh james.greenhalgh@arm.com
Mon Jul 25 10:34:00 GMT 2016


On Wed, Jul 20, 2016 at 03:07:45PM +0530, Virendra Pathak wrote:
> Hi gcc-patches group,
> 
> Please find the patch for adding the basic scheduler for vulcan
> in the aarch64 port.
> 
> Tested the patch with compiling cross aarch64-linux-gcc,
> bootstrapped native aarch64-unknown-linux-gnu and
> run gcc regression.
> 
> Kindly review and merge the patch to trunk, if the patch is okay.
> 
> There are few TODO in this patch which we have planned to
> submit in the next submission e.g. crc and crypto
> instructions, further improving integer & fp load/store
> based on addressing mode of the address.

Hi Virendra,

Thanks for the patch. I have some concerns about the size of the
automata that this description generates. As you can see from the
statistics below (use (automata_option "stats") in the description to
enable them), this description generates an automaton for Vulcan more
than ten times larger than the second-largest description we have for
AArch64 (cortex_a53_advsimd):

  Automaton `cortex_a53_advsimd'
     9072 NDFA states,          49572 NDFA arcs
     9072 DFA states,           49572 DFA arcs
     4050 minimal DFA states,   23679 minimal DFA arcs
      368 all insns         11 insn equivalence classes
    0 locked states
  28759 transition comb vector els, 44550 trans table els: use simple vect
  44550 min delay table els, compression factor 2

  Automaton `vulcan'
    103223 NDFA states,          651918 NDFA arcs
    103223 DFA states,           651918 DFA arcs
    45857 minimal DFA states,   352255 minimal DFA arcs
      368 all insns         28 insn equivalence classes
    0 locked states
  429671 transition comb vector els, 1283996 trans table els: use comb vect
  1283996 min delay table els, compression factor 2

Such a large automaton increases compiler build time and memory consumption,
often for little scheduling benefit.
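For reference, enabling the statistics is a one-line addition anywhere
in the machine description (the placement within vulcan.md is up to
you); genautomata then prints the counts while the compiler is built:

  (automata_option "stats")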

Normally such a large automaton comes from using large repeat
expressions (*). For example, in your modelling of divisions:

> +(define_insn_reservation "vulcan_div" 13
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "sdiv,udiv"))
> +  "vulcan_i1*13")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivs,fsqrts"))
> +  "vulcan_f0*8|vulcan_f1*8")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivd,fsqrtd"))
> +  "vulcan_f0*12|vulcan_f1*12")

In other pipeline models, we try to keep these repeat numbers low to avoid
the large state-space growth they cause. For example, the Cortex-A57
pipeline model describes them as:

  (define_insn_reservation "cortex_a57_fp_divd" 16
    (and (eq_attr "tune" "cortexa57")
         (eq_attr "type" "fdivd, fsqrtd, neon_fp_div_d, neon_fp_sqrt_d"))
    "ca57_cx2_block*3")

The lower accuracy is acceptable because of the nature of the
scheduling model. For a machine with an issue rate of "4", like
Vulcan, the compiler tries to find four instructions to schedule in
each cycle it models before it advances the state of the automaton.
If an instruction is modelled as blocking the "vulcan_i1" unit for 13
cycles, the scheduler would have to find up to 52 other instructions
to issue before the next instruction that uses vulcan_i1. Because
scheduling works within basic blocks, the chance of finding that many
independent instructions is extremely low, so you would never see the
benefit of the 13-cycle block.
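The arithmetic, for reference:

  13 blocked cycles * issue rate of 4 = up to 52 issue slots
  before vulcan_i1 is modelled as free again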

I tried lowering the repeat expressions like so:

> +(define_insn_reservation "vulcan_div" 13
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "sdiv,udiv"))
> +  "vulcan_i1*3")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivs,fsqrts"))
> +  "vulcan_f0*3|vulcan_f1*3")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivd,fsqrtd"))
> +  "vulcan_f0*5|vulcan_f1*5")

This more than halves the size of the generated automaton:

  Automaton `vulcan'
    45370 NDFA states,          319261 NDFA arcs
    45370 DFA states,           319261 DFA arcs
    20150 minimal DFA states,   170824 minimal DFA arcs
      368 all insns         28 insn equivalence classes
    0 locked states
  215565 transition comb vector els, 564200 trans table els: use comb vect
  564200 min delay table els, compression factor 2

The other technique we use to reduce the size of the generated
automata is to split the AdvSIMD/FP model off from the main pipeline
description (the thunderx_main, thunderx_mult, thunderx_divide, and
thunderx_simd automata take this approach even further and achieve
very small automata as a result).

Making the vulcan_f0 and vulcan_f1 reservations cpu_units of a new
define_automaton "vulcan_advsimd" would cut the size of the automaton
in half again:

  Automaton `vulcan'
     8520 NDFA states,          52754 NDFA arcs
     8520 DFA states,           52754 DFA arcs
     2414 minimal DFA states,   19882 minimal DFA arcs
      368 all insns         19 insn equivalence classes
    0 locked states
  21062 transition comb vector els, 45866 trans table els: use simple vect
  45866 min delay table els, compression factor 2

  Automaton `vulcan_simd'
    12231 NDFA states,          85833 NDFA arcs
    12231 DFA states,           85833 DFA arcs
     9246 minimal DFA states,   66554 minimal DFA arcs
      368 all insns         11 insn equivalence classes
    0 locked states
  84074 transition comb vector els, 101706 trans table els: use simple vect
  101706 min delay table els, compression factor 2
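Concretely, the split would look something like the following sketch
(unit and automaton names as suggested above; the rest of the
description would of course need updating to match):

  (define_automaton "vulcan_advsimd")

  (define_cpu_unit "vulcan_f0" "vulcan_advsimd")
  (define_cpu_unit "vulcan_f1" "vulcan_advsimd")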

Finally, simplifying some of the remaining large reservations
(vulcan_asimd_load*_mult, vulcan_asimd_load*_elts) can bring the size
down by half again, making it much more in line with the sizes of the
other AArch64 automata.
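One way to do that simplification is a shared define_reservation with
a lower repeat count that the load reservations then reference; the
unit names below are hypothetical, purely to illustrate the shape:

  (define_reservation "vulcan_asimd_load_rsv"
                      "vulcan_ls0+vulcan_f0*2")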

Would you mind looking at some of these techniques to see whether you
can reduce the size of the generated automata without hurting code
generation for Vulcan too much?

Ideally we want to keep the size of all models to a reasonable level to
avoid bugs like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70473 .

Thanks,
James


> Virendra Pathak  <virendra.pathak@broadcom.com>
> Julian Brown  <julian@codesourcery.com>
> 
>         * config/aarch64/aarch64-cores.def: Change the scheduler
>         to vulcan.
>         * config/aarch64/aarch64.md: Include vulcan.md.
>         * config/aarch64/vulcan.md: New file.
> 
> 
