This is the mail archive of the gcc-patches at gcc dot gnu dot org
mailing list for the GCC project.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: "Bin.Cheng" <amker dot cheng at gmail dot com>
- Cc: Oleg Endo <oleg dot endo at t-online dot de>, "bin.cheng" <bin dot cheng at arm dot com>, "<gcc-patches at gcc dot gnu dot org>" <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 16 May 2014 12:51:12 +0200
- Subject: Re: [GCC RFC]A new and simple pass merging paired load store instructions
- References: <004d01cf700e$ef1e30e0$cd5a92a0$ at arm dot com> <9446DE1C-BBEC-407F-8F14-3E7D9B781905 at t-online dot de> <CAHFci2_oP587otaOCUa5oQGDV+U5fLpv6Jn-mT03sHE70Gdm8g at mail dot gmail dot com>
On Fri, May 16, 2014 at 12:10 PM, Bin.Cheng <email@example.com> wrote:
> On Thu, May 15, 2014 at 6:31 PM, Oleg Endo <firstname.lastname@example.org> wrote:
>> On 15 May 2014, at 09:26, "bin.cheng" <email@example.com> wrote:
>>> Targets like ARM and AArch64 support double-word load/store instructions,
>>> which are generally faster than the corresponding two separate
>>> loads/stores. GCC currently uses peephole2 to merge paired loads/stores
>>> into a single instruction, which has a disadvantage: it can only handle
>>> simple cases where the two instructions are adjacent in the instruction
>>> stream, and is too weak to handle cases where unrelated instructions sit
>>> between the two loads/stores.
>>> Here is a new GCC pass that looks through each basic block and merges
>>> paired loads/stores even if they are not adjacent to each other. The
>>> algorithm is pretty simple:
>>> 1) In an initialization pass over the instruction stream, it collects
>>> relevant memory access information for each instruction.
>>> 2) It iterates over each basic block and tries to find a possible partner
>>> instruction for each memory access instruction. While doing so, it checks
>>> dependencies between the two candidate instructions and also records
>>> information indicating how to pair them. To avoid quadratic behavior of
>>> the algorithm, it introduces a new parameter,
>>> max-merge-paired-loadstore-distance, with a default value of 4, which is
>>> large enough to catch the major part of the opportunities on ARM/Cortex-A15.
>>> 3) For each candidate pair, it calls the back-end's hook to do a
>>> target-dependent check and merges the two instructions if possible.
>>> Even with the parameter set to 4, on miscellaneous benchmarks this pass
>>> merges numerous opportunities beyond those already merged by peephole2
>>> (roughly the same number of opportunities as peephole2 catches). A GCC
>>> bootstrap confirms this finding.
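Steps 1) to 3) of the quoted proposal can be sketched roughly as below. This is a toy model under stated assumptions, not the code from the patch: the Insn struct and the functions adjacent, independent and find_pairs are hypothetical names, "distance" is assumed to mean the difference in instruction positions, and the dependency check only looks for an overwritten base register and possibly aliasing memory accesses in between. A real pass would also have to check register data dependencies and leave the final decision to the back-end hook.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of one instruction in a basic block (hypothetical, not GCC RTL).
struct Insn {
    bool is_mem;    // true for a load or store
    bool is_load;   // meaningful only when is_mem is true
    int  base_reg;  // base address register, -1 if not a memory access
    long offset;    // byte offset from base_reg
    int  size;      // access size in bytes
    int  clobbers;  // register written by this insn, -1 for none
};

// Same kind of access, same base and size, and the two slots touch.
static bool adjacent(const Insn& a, const Insn& b) {
    return a.is_mem && b.is_mem && a.is_load == b.is_load &&
           a.base_reg == b.base_reg && a.size == b.size &&
           (a.offset + a.size == b.offset || b.offset + b.size == a.offset);
}

// Deliberately conservative and incomplete dependency check (assumption):
// the base register must survive from insn i to insn j, and no access that
// might alias may sit between them.
static bool independent(const std::vector<Insn>& bb,
                        std::size_t i, std::size_t j) {
    const Insn& first = bb[i];
    if (first.is_load && first.clobbers == first.base_reg) return false;
    for (std::size_t k = i + 1; k < j; ++k) {
        const Insn& mid = bb[k];
        if (mid.clobbers == first.base_reg) return false;
        // Any store in between blocks pairing; a load blocks a store pair.
        if (mid.is_mem && (!mid.is_load || !first.is_load)) return false;
    }
    return true;
}

// Search only a bounded window after each access to avoid quadratic work.
std::vector<std::pair<std::size_t, std::size_t>>
find_pairs(const std::vector<Insn>& bb, std::size_t max_distance) {
    assert(max_distance > 0);
    std::vector<bool> used(bb.size(), false);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < bb.size(); ++i) {
        if (!bb[i].is_mem || used[i]) continue;
        for (std::size_t j = i + 1;
             j < bb.size() && j - i <= max_distance; ++j) {
            if (used[j] || !adjacent(bb[i], bb[j]) || !independent(bb, i, j))
                continue;
            used[i] = used[j] = true;
            pairs.emplace_back(i, j);  // a target hook would be consulted here
            break;
        }
    }
    return pairs;
}
```

With a window of 4, two stores that sit five positions apart are not paired, while a larger window finds them. That is the trade-off the parameter makes against search cost.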
>> This is interesting. E.g. on SH there are insns to load/store SFmode pairs. However, these insns require a mode switch and have some constraints on register usage. So in the SH case the load/store pairing would need to be done before reg alloc and before mode switching.
>>> There is still an open issue about when we should run this new pass. Though
>>> register renaming is disabled by default now, I put this pass after it,
>>> because renaming can resolve some false dependencies and thus benefit this
>>> pass. Another finding is that it can capture a lot more opportunities if it
>>> runs after sched2, but I am not sure whether that would mess up the
>>> scheduling results.
>> How about the following.
>> Instead of adding new hooks and inserting the pass to the general pass list, make the new
>> pass class take the necessary callback functions directly. Then targets can just instantiate
>> the pass, passing their impl of the callbacks, and insert the pass object into the pass list at
>> a place that fits best for the target.
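The shape of that suggestion might look like the sketch below. It is a stand-alone illustration, not GCC's pass manager: Insn, InsnList, PairMergePass, NamedPass and insert_pass_after are all hypothetical names, and for brevity the pass only looks at neighbouring instructions. The point is only that the target supplies the callbacks when it constructs the pass object and then chooses where the object goes in the pass list.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for instructions and instruction streams.
struct Insn { std::string text; };
using InsnList = std::vector<Insn>;

// A pass object that receives its target-specific behaviour as callbacks
// instead of going through global target hooks.
class PairMergePass {
public:
    using CanMerge = std::function<bool(const Insn&, const Insn&)>;
    using Merge    = std::function<Insn(const Insn&, const Insn&)>;

    PairMergePass(CanMerge can_merge, Merge merge)
        : can_merge_(std::move(can_merge)), merge_(std::move(merge)) {
        assert(can_merge_ && merge_);  // both callbacks are required
    }

    // Merge neighbouring instructions whenever the target allows it.
    InsnList run(const InsnList& in) const {
        InsnList out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (i + 1 < in.size() && can_merge_(in[i], in[i + 1])) {
                out.push_back(merge_(in[i], in[i + 1]));
                ++i;  // skip the partner that was just merged
            } else {
                out.push_back(in[i]);
            }
        }
        return out;
    }

private:
    CanMerge can_merge_;
    Merge    merge_;
};

// A minimal named pass list the target can edit.
struct NamedPass {
    std::string name;
    std::function<InsnList(const InsnList&)> run;
};

// Insert `pass` right after the pass called `anchor`; false if not found.
bool insert_pass_after(std::vector<NamedPass>& passes,
                       const std::string& anchor, NamedPass pass) {
    for (std::size_t i = 0; i < passes.size(); ++i) {
        if (passes[i].name == anchor) {
            passes.insert(passes.begin() + static_cast<std::ptrdiff_t>(i) + 1,
                          std::move(pass));
            return true;
        }
    }
    return false;
}
```

A target that needs the pairing before register allocation, as in the SH case mentioned earlier in the thread, would pick a different anchor than one that wants it after sched2.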
> Oh, I didn't know we could do this in GCC. But yes, a target may want to
> run it at some place that fits best for the target.
Btw, the bswap pass enhancements that are currently in review may
also be an opportunity to catch these. They can merge adjacent
loads that are used "composed" (but not yet composed by storing
into adjacent memory). The basic-block vectorizer should also
handle this (if the composition happens to be by storing into
adjacent memory) - of course it needs vector modes available and
it has to be enabled.
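To make the two source patterns concrete, here is a small illustration. The function names are hypothetical, and whether a given GCC version actually turns either function into a single wider access depends on the target, the options and the pass in question, so this shows only the shape of the input, not a guaranteed result.

```cpp
#include <cassert>
#include <cstdint>

// Adjacent byte loads that are "composed" with shifts and ORs into one
// value, without being stored back to adjacent memory.
std::uint32_t read_le32(const unsigned char* p) {
    assert(p != nullptr);
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Adjacent loads whose results are composed by storing them into adjacent
// memory, the shape a basic-block vectorizer looks for.
void copy_pair(int* dst, const int* src) {
    assert(dst != nullptr && src != nullptr);
    dst[0] = src[0];
    dst[1] = src[1];
}
```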
>>> So, any comments about this?
>>> 2014-05-15 Bin Cheng <firstname.lastname@example.org>
>>> * common.opt (flag_merge_paired_loadstore): New option.
>>> * merge-paired-loadstore.c: New file.
>>> * Makefile.in: Support new file.
>>> * config/arm/arm.c (TARGET_MERGE_PAIRED_LOADSTORE): New macro.
>>> (load_latency_expanded_p, arm_merge_paired_loadstore): New functions.
>>> * params.def (PARAM_MAX_MERGE_PAIRED_LOADSTORE_DISTANCE): New param.
>>> * doc/invoke.texi (-fmerge-paired-loadstore): New.
>>> (max-merge-paired-loadstore-distance): New.
>>> * doc/tm.texi.in (TARGET_MERGE_PAIRED_LOADSTORE): New.
>>> * doc/tm.texi: Regenerated.
>>> * target.def (merge_paired_loadstore): New.
>>> * tree-pass.h (make_pass_merge_paired_loadstore): New decl.
>>> * passes.def (pass_merge_paired_loadstore): New pass.
>>> * timevar.def (TV_MERGE_PAIRED_LOADSTORE): New time var.
>>> 2014-05-15 Bin Cheng <email@example.com>
>>> * gcc.target/arm/merge-paired-loadstore.c: New test.
> Best Regards.