This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH RFC]Pair load store instructions using a generic scheduling fusion pass
- From: Mike Stump <mikestump at comcast dot net>
- To: Jeff Law <law at redhat dot com>
- Cc: Bin Cheng <bin dot cheng at arm dot com>, gcc-patches at gcc dot gnu dot org
- Date: Fri, 10 Oct 2014 16:10:11 -0700
- Subject: Re: [PATCH RFC]Pair load store instructions using a generic scheduling fusion pass
- Authentication-results: sourceware.org; auth=none
- References: <000001cfdc90$1d95c670$58c15350$ at arm dot com> <54384C12 dot 6060401 at redhat dot com>
[ I’ll give the state of the code that I finished with, Bin’s answers should be similar to mine, but, if he improved things, they could be better ]
On Oct 10, 2014, at 2:13 PM, Jeff Law <email@example.com> wrote:
> So, some questions. Let's assume I've got 3 kinds of insns. A B & C.
> I can fuse AB or AC, but not BC. In fact, moving B & C together may significantly harm performance.
We can only choose one ordering for fusion consideration, so you have to pick either AB or AC as the ordering. Once you decide, the other is hidden from consideration.
The priorities are merely used to select an instruction ordering in which to consider pair-wise (adjacent) opportunities.
They can chain and stack, so if you want A, B, and C to fuse into a new ABC instruction, they will.
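To illustrate the ordering-selection idea, here is a rough, hypothetical Python sketch (the insn names and priority values are invented; this is not GCC code): sorting the ready insns by fusion priority brings candidates with equal priorities next to each other, and pairing then only looks at adjacent entries.

```python
# Hypothetical sketch, not GCC code: fusion priorities only select an
# ordering; pairing then considers adjacent entries with equal priority.
insns = [
    ("A1", 1), ("C", 9), ("B1", 1),  # (name, fusion_priority)
    ("A2", 2), ("B2", 2),
]
# Stable sort by fusion_priority: fusion candidates become neighbors.
ordered = sorted(insns, key=lambda insn: insn[1])
pairs = [(a[0], b[0]) for a, b in zip(ordered, ordered[1:])
         if a[1] == b[1]]
```

Once an AB ordering is chosen this way, an alternative AC pairing is simply never seen by the adjacent-pair scan.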
> So my question is can a given insn have different fusion priorities depending on its scheduling context?
Area for future improvement. In theory you can feed any other state into the creation of the priority, but today only the single insn feeds into it.
> So perhaps an example. Let's say I have an insn stream with the following kinds of instructions, all ready at the same time.
> Can I create 8 distinct fusion priorities such that I ultimately schedule
> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
> I guess another way to ask the question, are fusion priorities static based on the insn/alternative, or can they vary? And if they can vary, can they vary each tick of the scheduler?
So, it wasn’t created to solve that problem. If the scheduler would tend to put them next to each other anyway, because structurally they fit that way, then peephole should already have a chance to see them that way.
> Now the next issue is I really don't want all those to fire back-to-back-to-back. I'd like some other insns to be inserted between each fusion pair if they're in the ready list.
Scheduling figures out how to lay out the instructions, and that will still be done. Fusion runs before scheduling and is separate; it doesn’t replace or change scheduling.
> I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion. So
> ABX(1) ABY(2).....
> Where X & Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.
You’re conflating fusion with scheduling; they are separate. Fusion doesn’t deal with time or cycles. It exists only to fuse two separate instructions together. Those products will be scheduled later, and post scheduling, all the holes will be filled to the extent the scheduler can find work to do.
For example, given a stream of loads where we want to fuse each ld <even> with ld <even>+1, fusion first reorders the stream so the paired loads sit next to each other; then, post scheduling, the holes are filled by the scheduler.
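A rough, hypothetical Python sketch of that even/odd pairing (the register numbers and the "ldp" name are invented for illustration; this is not GCC code):

```python
# Hypothetical sketch, not GCC code: merge ld <even> with ld <even>+1
# into one double-width load after the fusion ordering sorts them.
loads = [0, 2, 1, 3]            # register numbers being loaded
ordered = sorted(loads)         # fusion ordering puts pairs adjacent
fused, i = [], 0
while i < len(ordered):
    if (i + 1 < len(ordered) and ordered[i] % 2 == 0
            and ordered[i + 1] == ordered[i] + 1):
        fused.append(("ldp", ordered[i], ordered[i + 1]))  # fused pair
        i += 2
    else:
        fused.append(("ld", ordered[i]))                   # lone load
        i += 1
```

Here the four single loads collapse into two paired loads; any gaps left behind are the scheduler's problem, not fusion's.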
> Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.
Yes, in my version, I ran it really early, before sched. I needed to run before ira, and before other passes changed the code so much that they obscured the instructions I wanted to fuse. One thing I saw was that once we got too far along, the offsets were loaded into registers, and instead of being r+n the address was r1+r2, and this blew the sorting.
>> So here comes this patch. It adds a new sched_fusion pass just before
Hum… In my version, that was just way too late. I needed to generate pseudos and count on ira and others to clean up the register moves for me. I’ve not tried his patch to see if it works as well as my version on my problem (combining adjacent loads and stores).
>> The methodology is like:
>> 1) The priority in scheduler is extended into [fusion_priority, priority]
>> pair, with fusion_priority as the major key and priority as the minor key.
>> 2) The back-end assigns a priority pair to each instruction; instructions
>> that want to be fused together get the same fusion_priority assigned.
> I think the bulk of my questions above are targeted at this part of the change. When are these assignments made, and how much freedom does the backend have to make or change those assignments?
It is free to do anything it wants with the numbers. But since the port writer has carefully chosen the priorities to generate the best code, it is unlikely it would ever be wise to change the numbers.
> So another question, given a fused pair, is there any way to guarantee ordering within the fused pair.
Sure, ensure one instruction has a lower number than the other. That will order them reliably.
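A minimal, hypothetical sketch of that ordering guarantee (the instruction text and numbers are invented): within a fused pair the fusion priorities are equal, so the minor priority decides which member comes first.

```python
# Hypothetical sketch, not GCC code: the (fusion_priority, priority)
# pair sorts as major/minor keys, so with equal fusion priorities the
# lower minor key reliably comes first within a fused pair.
pair = [
    ("ld r1, [base, 4]", (7, 2)),
    ("ld r0, [base, 0]", (7, 1)),
]
pair.sort(key=lambda insn: insn[1])
first = pair[0][0]   # the insn given the lower minor priority
```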
> This is useful to cut down on the number of peep2 patterns. I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right?
Right, assuming that there is an ordering for the instructions that makes sense. On a machine that can do a V2 mul, fusing a pair of muls involves extra moves to ensure a register pair (n, n+1 for the mulv2 instruction) and a temporary to hold the tuple (r8, r9). In my peephole patterns I add these moves to ensure the registers have the right register numbers. I do this before register allocation, since it can choose the registers best, make all the other instructions around it use those registers, and leave the moves in if it can’t do better. The moves tend to fill scheduling holes, so even if they remain, their cost usually shouldn’t be too bad.
On the other hand, some sequences can’t be handled: if there is no way to sort the instructions so that the fusion candidates end up adjacent, then fusing them is impossible.
> I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes. So instead of st; st; call; ld; ld, we'd generate std; call; ldd. It was a huge win for floating point on the sparc processors of that time. I don't expect you to do that investigation. Just thinking out loud.
Well, curiously enough, that sort of thing is exactly why I wrote the code. I was thinking of cases where the user loads lots of registers from memory, plays a bit with them, and then puts the data back into memory. I wanted to put all the loads together, find the loads that are next to each other, and combine them into a single, larger load. On a machine that can load n registers at once, you then save n-1 instructions. Likewise for stores. Conceptually, there is nothing port specific about the optimization I wanted to do.
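That counting argument can be sketched hypothetically in Python (the word size and offsets are invented): coalescing each run of loads at consecutive addresses into one wide load saves n-1 instructions per run of n.

```python
# Hypothetical sketch, not GCC code: group loads whose addresses are
# consecutive (4-byte slots assumed) and count the instructions saved.
offsets = sorted([8, 0, 4, 12, 24])
runs, run = [], [offsets[0]]
for off in offsets[1:]:
    if off == run[-1] + 4:      # extends the current consecutive run
        run.append(off)
    else:                       # gap: start a new run
        runs.append(run)
        run = [off]
runs.append(run)
saved = sum(len(r) - 1 for r in runs)   # n-1 saved per run of n
```

With these offsets, the four consecutive loads merge into one wide load and the lone load at 24 stays single, saving three instructions.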