[PATCH, GCC, AArch64] Branch Dilution Pass

Fri Nov 23 17:57:00 GMT 2018

Hi all,

Thank you for the comments. For the same reasons as stated by Kyrill and Richard E., I do believe it has to be the compiler that implements this. I will look to split the patch and address the points made by Andrew.

________________________________
From: Richard Earnshaw (lists) <Richard.Earnshaw@arm.com>
Sent: 12 November 2018 15:55:58
To: Kyrill Tkachov; Richard Biener; Sudakshina Das
Cc: GCC Patches; nd; Carey Williams
Subject: Re: [PATCH, GCC, AArch64] Branch Dilution Pass

On 12/11/2018 15:13, Kyrill Tkachov wrote:
> Hi Richard,
>
> On 12/11/18 14:13, Richard Biener wrote:
>> On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das <Sudi.Das@arm.com> wrote:
>> >
>> > Hi
>> >
>> > I am posting this patch on behalf of Carey (cc'ed). I also have some
>> > review comments that I will make as a reply to this later.
>> >
>> >
>> > This implements a new AArch64 specific back-end pass that helps
>> optimize
>> > branch-dense code, which can be a bottleneck for performance on some
>> Arm
>> > cores. This is achieved by padding out the branch-dense sections of the
>> > instruction stream with nops.
>>
>> Wouldn't this be more suitable for implementing inside the assembler?
>>
>
> The number of NOPs to insert to get the performance benefits varies from
> core to core,
> I don't think we want to add such CPU-specific optimisation logic to the
> assembler.

Additionally, the compiler has to keep track of branch ranges.  It can't
do this properly if the assembler is emitting more instructions than the
compiler thinks it is.

R.

>
> Thanks,
> Kyrill
>
>> > This has proven to show up to a 2.61%~ improvement on the Cortex A-72
>> > (SPEC CPU 2006: sjeng).
>> >
>> > The implementation includes the addition of a new RTX instruction class
>> > FILLER_INSN, which has been white listed to allow placement of NOPs
>> > outside of a basic block. This is to allow padding after unconditional
>> > branches. This is favorable so that any performance gained from
>> > diluting branches is not paid straight back via excessive eating of
>> nops.
>> >
>> > It was deemed that a new RTX class was less invasive than modifying
>> > behavior in regards to standard UNSPEC nops.
>> >
>> > ## Command Line Options
>> >
>> > Three new target-specific options are provided:
>> > - mbranch-dilution
>> > - mbranch-dilution-granularity={num}
>> > - mbranch-dilution-max-branches={num}
>> >
>> > A number of cores known to be able to benefit from this pass have been
>> > given default tuning values for their granularity and max-branches.
>> > Each affected core has a very specific granule size and associated
>> > max-branch limit. This is a microarchitecture specific optimization.
>> > Typical usage should be -mdilute-branches with a specificed -mcpu.
>> Cores
>> > with a granularity tuned to 0 will be ignored. Options are provided for
>> > experimentation.
>> >
>> > ## Algorithm and Heuristic
>> >
>> > The pass takes a very simple 'sliding window' approach to the problem.
>> > We crawl through each instruction (starting at the first branch) and
>> > keep track of the number of branches within the current "granule" (or
>> > window). When this exceeds the max-branch value, the pass will dilute
>> > the current granule, inserting nops to push out some of the branches.
>> > The heuristic will favour unconditonal branches (for performance
>> > reasons), or branches that are between two other branches (in order to
>> > decrease the likelihood of another dilution call being needed).
>> >
>> > Each branch type required a different method for nop insertion due to
>> > RTL/basic_block restrictions:
>> >
>> > - Returning calls do not end a basic block so can be handled by
>> emitting
>> > a generic nop.
>> > - Unconditional branches must be the end of a basic block, and nops
>> > cannot be outside of a basic block.
>> >    Thus the need for FILLER_INSN, which allows placement outside of a
>> > basic block - and translates to a nop.
>> > - For most conditional branches we've taken a simple approach and only
>> > handle the fallthru edge for simplicity,
>> >    which we do by inserting a "nop block" of nops on the fallthru edge,
>> > mapping that back to the original destination block.
>> > - asm gotos and pcsets are going to be tricky to analyse from a
>> dilution
>> > perspective so are ignored at present.
>> >
>> >
>> > ## Changelog
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams <Carey.Williams@arm.com>
>> >
>> >         * gcc.target/aarch64/branch-dilution-off.c: New test.
>> >         * gcc.target/aarch64/branch-dilution-on.c: New test.
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams <Carey.Williams@arm.com>
>> >
>> >         * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
>> >         * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN
>> outside
>> >         basic blocks.
>> >         * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>> >         * config/aarch64/aarch64-branch-dilution.c: New file.
>> >         * config/aarch64/aarch64-passes.def (branch-dilution): Register
>> >         pass.
>> >         * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>> >         tuning parameters bdilution_gsize and bdilution_maxb.
>> >         (make_pass_branch_dilution): New declaration.
>> >         * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
>> > cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
>> > cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
>> >         thunderx_tunings,tsv110_tunings,xgene1_tunings,
>> > qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
>> >         Provide default tunings for bdilution_gsize and bdilution_maxb.
>> >         * config/aarch64/aarch64.md (filler_insn): Define new insn.
>> >         * config/aarch64/aarch64.opt (mbranch-dilution,
>> >         mbranch-dilution-granularity,
>> >         mbranch-dilution-max-branches): Define new branch dilution
>> >         options.
>> >         * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New
>> rule
>> >         for aarch64-branch-dilution.c.
>> >         * coretypes.h (rtx_filler_insn): New rtx class.
>> >         * doc/invoke.texi (mbranch-dilution,
>> >         mbranch-dilution-granularity,
>> >         mbranch-dilution-max-branches): Document branch dilution
>> >         options.
>> >         * emit-rtl.c (emit_filler_after): New emit function.
>> >         * rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
>> >         * rtl.h (class GTY): New class for rtx_filler_insn.
>> >         (is_a_helper ::test): New test helper for rtx_filler_insn.
>> >         (macro FILLER_INSN_P(X)): New predicate.
>> >         * target-insns.def (filler_insn): Add target insn def.
>> >
>> > ### Testing
>> > - Successful compilation of 3 stage bootstrap with the pass forced on
>> > (for stage 2, 3)
>> > - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
>> > - No 'make check' regressions
>> >
>> > Is this ok for trunk?
>> >
>> > Thanks
>> > Sudi
>