This is the mail archive of the
mailing list for the GCC project.
Re: [PATCH RFC]Pair load store instructions using a generic scheduling fusion pass
- From: "Bin.Cheng" <amker dot cheng at gmail dot com>
- To: Jeff Law <law at redhat dot com>
- Cc: Bin Cheng <bin dot cheng at arm dot com>, gcc-patches List <gcc-patches at gcc dot gnu dot org>, Mike Stump <mikestump at comcast dot net>
- Date: Tue, 21 Oct 2014 13:43:58 +0800
- Subject: Re: [PATCH RFC]Pair load store instructions using a generic scheduling fusion pass
- References: <000001cfdc90$1d95c670$58c15350$ at arm dot com> <54384C12 dot 6060401 at redhat dot com>
On Sat, Oct 11, 2014 at 5:13 AM, Jeff Law <email@example.com> wrote:
> On 09/30/14 03:22, Bin Cheng wrote:
>> many load/store pairs as my old patch. Then I decided to take one step
>> forward to introduce a generic instruction fusion infrastructure in GCC,
>> because in essence, load/store pairing is no different from other
>> instruction fusion; all these optimizations want is to push instructions
>> together in the instruction flow.
> Great generalization. And yes, you're absolutely right, what you're doing
> is building a fairly generic mechanism to mark insns that might fuse.
> So, some questions. Let's assume I've got 3 kinds of insns. A B & C.
> I can fuse AB or AC, but not BC. In fact, moving B & C together may
> significantly harm performance.
> So my question is can a given insn have different fusion priorities
> depending on its scheduling context?
> So perhaps an example. Let's say I have an insn stream with the following
> kinds of instructions, all ready at the same time.
> Can I create 8 distinct fusion priorities such that I ultimately schedule
> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
> I guess another way to ask the question, are fusion priorities static based
> on the insn/alternative, or can they vary? And if they can vary, can they
> vary each tick of the scheduler?
> Now the next issue is I really don't want all those to fire
> back-to-back-to-back. I'd like some other insns to be inserted between each
> fusion pair if they're in the ready list. I guess the easiest way to get
> that is to assign the same fusion priority to other insns in the ready
> queue, even though they don't participate in fusion. So
> ABX(1) ABY(2).....
> Where X & Y are some other arbitrary insns that don't participate in the AB
> fusion, but will issue in the same cycle as the AB fused insn.
> Though I guess if we run fusion + peep2 between sched1 and sched2, that
> problem would just resolve itself as we'd have fused AB together into a new
> insn and we'd schedule normally with the fused insns and X, Y.
>> So here comes this patch. It adds a new sched_fusion pass just before
>> peephole2. The methodology is like:
>> 1) The priority in scheduler is extended into [fusion_priority, priority]
>> pair, with fusion_priority as the major key and priority as the minor key.
>> 2) The back-end assigns a priority pair to each instruction; instructions
>> that want to be fused together get the same fusion_priority assigned.
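To make the [fusion_priority, priority] ordering in steps 1) and 2) concrete, here is a minimal sketch of a two-level comparison. It is not code from the patch: `struct insn_info`, `compare_insns` and `sort_ready_list` are hypothetical names, and the choice that higher values are scheduled first is an assumption of this sketch.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a scheduler insn; not GCC's rtx_insn.  */
struct insn_info
{
  int uid;              /* identifier used only for illustration */
  int fusion_priority;  /* major key: equal values want to be adjacent */
  int priority;         /* minor key: breaks ties within a fusion group */
};

/* Compare two insns by fusion_priority first, then by priority.
   Higher values sort first (an assumption of this sketch).  */
static int
compare_insns (const void *pa, const void *pb)
{
  const struct insn_info *a = pa;
  const struct insn_info *b = pb;

  if (a->fusion_priority != b->fusion_priority)
    return (a->fusion_priority < b->fusion_priority)
           - (a->fusion_priority > b->fusion_priority);

  return (a->priority < b->priority) - (a->priority > b->priority);
}

/* Sort a ready list so that insns sharing a fusion_priority end up
   next to each other, ordered by priority within each group.  */
static void
sort_ready_list (struct insn_info *insns, size_t n)
{
  qsort (insns, n, sizeof insns[0], compare_insns);
}
```

With two loads that share a fusion_priority and two unrelated insns that share a lower one, the loads come out adjacent, and the minor key decides which of them is first.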
> I think the bulk of my questions above are targeted at this part of the
> change. When are these assignments made and how much freedom does the
> backend have to make/change those assignments?
> So another question, given a fused pair, is there any way to guarantee
> ordering within the fused pair? This is useful to cut down on the number of
> peep2 patterns. I guess we could twiddle the priority in those cases to
> force a particular ordering of the fused pair, right?
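One way to picture "twiddling the priority" to force an order inside a pair is sketched below. It is only an illustration under stated assumptions: `struct mem_access` and `assign_fusion_priority` are hypothetical names, not taken from the patch, and the sketch assumes higher values are scheduled first, small register numbers, and offsets in the range 0 .. max_pri - 1.

```c
/* Hypothetical description of a simple load or store; not a GCC type.  */
struct mem_access
{
  int is_load;      /* nonzero for a load, zero for a store */
  int base_regno;   /* base register number */
  long offset;      /* byte offset from the base register */
};

/* Assign a [fusion_priority, priority] pair to one memory access.
   Accesses of the same kind off the same base register get the same
   fusion_priority, so they are pushed together.  Within such a group
   the access with the lower offset gets the higher priority, which
   fixes the order of the two halves of a would-be pair.  */
static void
assign_fusion_priority (const struct mem_access *m, int max_pri,
                        int *fusion_pri, int *pri)
{
  /* Major key: one distinct value per (kind, base register).  */
  *fusion_pri = max_pri - 1 - (m->base_regno * 2 + (m->is_load ? 0 : 1));

  /* Minor key: lower offset means higher priority.  */
  *pri = max_pri - 1 - (int) m->offset;
}
```

If the order within a pair is fixed this way, a peep2 pattern would only need to match the low-offset access followed by the high-offset one, not both orders.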
> I wonder if we could use this to zap all the hair I added to caller-save
> back in the early 90s to try and widen the save/restore modes. So instead
> of st; st; call; ld; ld, we'd generate std; call; ldd. It was a huge win
> for floating point on the sparc processors of that time. I don't expect you
> to do that investigation. Just thinking out loud.
>> I collected performance data for both cortex-a15 and cortex-a57 (with a
>> local peephole ldp/stp patch); the benchmarks are clearly improved on
>> arm/aarch64. I also collected instrumentation data on how many load/store
>> pairs are found. For the four versions of load/store pair patches:
>> 0) Mike's original patch.
>> 1) My original prototype patch.
>> 2) Cleaned up pass of Mike (with implementation bugs resolved).
>> 3) This new prototype fusion pass.
>> The numbers of pairing opportunities satisfy the relations below:
>> 3 * N0 ~ N1 ~ N2 < N3
>> For example, for one benchmark suite, we have:
>> N0 ~= 1300
>> N1/N2 ~= 5000
>> N3 ~= 7500
> Nice. Very nice.
> Overall it's a fairly simple change. I'll look deeper into it next week.
Any new comments?