This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH RFC]Pair load store instructions using a generic scheduling fusion pass


Mike already gave great answers; here are just some of my thoughts on
the specific questions, embedded below.

On Sat, Oct 11, 2014 at 7:10 AM, Mike Stump <mikestump@comcast.net> wrote:
> [ I'll give the state of the code that I finished with, Bin's answers should be similar to mine, but, if he improved things, they could be better ]
>
> On Oct 10, 2014, at 2:13 PM, Jeff Law <law@redhat.com> wrote:
>> So, some questions.  Let's assume I've got 3 kinds of insns.  A B & C.
>>
>> I can fuse AB or AC, but not BC.  In fact, moving B & C together may significantly harm performance.
>
> We can only choose one ordering for fusion consideration, so you have to pick between AB and AC as the ordering.  Once you decide, the other is hidden from consideration.
>
> The priorities are merely used to select an instruction ordering in which to consider pairwise (adjacent) opportunities.
>
> They can chain and stack, so, if you want A B C to fuse into a new ABC instruction, they will.
>
>> So my question is can a given insn have different fusion priorities depending on its scheduling context?
>
> Area for future improvement.  In theory you can feed any other state into the creation of the priority but only the single insn feeds into it today.
>
>> So perhaps an example.  Let's say I have an insn stream with the following kinds of instructions, all ready at the same time.
>>
>> AAAAAAAABBBBCCCC
>>
>> Can I create 8 distinct fusion priorities such that I ultimately schedule
>> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
>>
>> I guess another way to ask the question, are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary each tick of the scheduler?

Though this pass works on predefined fusion types and priorities now,
there might be two possible fixes for this specific problem.
1) Introduce another key, exclusive_pri, so it becomes "fusion_pri,
priority, exclusive_pri".  The first one is assigned to mark
instructions belonging to the same fusion type.  The second is assigned
to fuse each pair of consecutive instructions together.  The last one
is assigned to prevent a specific pair of instructions from being
fused, like the "BC" case mentioned.
2) Extend the idea by using the hook functions
TARGET_SCHED_REORDER/TARGET_SCHED_REORDER2.  We would still assign
fusion_pri in the first place, making sure instructions of the same
fusion type end up adjacent to each other, and then change priority
(thus reordering the ready list) at the back end's wish, even on each
tick of the scheduler.  (See the toy sketch after this list.)
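To make the (fusion_pri, priority) pair concrete, here is a self-contained
toy in C (not the patch code; every name and number is made up) that sorts a
small ready list with fusion_pri as the major key and priority as the minor
key, and marks where fix 1's exclusive_pri would slot in:

#include <stdio.h>
#include <stdlib.h>

struct toy_insn
{
  char kind;       /* 'A', 'B' or 'C'.  */
  int fusion_pri;  /* Major key: equal values mark fusion candidates.  */
  int pri;         /* Minor key: orders insns inside a fusion group.  */
  /* Fix 1 would add a third key here, e.g. "int exclusive_pri;",
     compared last, to veto a specific pairing such as "BC".  */
};

static int
toy_rank (const void *xp, const void *yp)
{
  const struct toy_insn *x = xp;
  const struct toy_insn *y = yp;

  if (x->fusion_pri != y->fusion_pri)
    return y->fusion_pri - x->fusion_pri;  /* Higher fusion_pri first.  */
  return y->pri - x->pri;                  /* Then higher priority first.  */
}

int
main (void)
{
  /* The A meant to pair with B shares fusion_pri 2 with it; the A meant
     to pair with C shares fusion_pri 1 with it.  After sorting, each A
     sits next to its partner and B never touches C.  */
  struct toy_insn ready[] = {
    { 'B', 2, 1 }, { 'C', 1, 1 }, { 'A', 2, 2 }, { 'A', 1, 2 },
  };
  size_t i, n = sizeof ready / sizeof ready[0];

  qsort (ready, n, sizeof ready[0], toy_rank);
  for (i = 0; i < n; i++)
    printf ("%c (fusion_pri=%d, pri=%d)\n",
            ready[i].kind, ready[i].fusion_pri, ready[i].pri);
  return 0;
}

The printed order is A B A C: the AB and AC pairs come out adjacent, and B
and C are never placed next to each other.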


>
> So, it wasn't created to solve that problem.  If the scheduler tends to put them next to each other because, structurally, they fit that way, then peephole should already have a chance to see them that way.
>
>> Now the next issue is I really don't want all those to fire back-to-back-to-back.  I'd like some other insns to be inserted between each fusion pair if they're in the ready list.
>
> Scheduling figures out the way to lay out instructions and will still be done.  Fusion runs before scheduling and is separate; it doesn't replace or change scheduling.
>
>> I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion.  So
>>
>> ABX(1) ABY(2).....
>>
>> Where X & Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.
>
> You're conflating fusion with scheduling; they are separate.  Fusion doesn't deal with time or cycles.  It exists only to fuse two separate instructions together.  Those products will be scheduled later, and post scheduling, all the holes will be filled to the extent the scheduler is able to find work to do.
>
> For example, given:
>
> ld 0
> inc
> ld 1
> inc
> ld 2
> inc
> ld 3
> inc
>
> and we want to fuse ld <even> with ld <even>+1, we then get:
>
> ld 01
> ld 23
> inc
> inc
>
> then, post scheduling, we get:
>
> ld 01
> inc
> ld 23
> inc
>
> by the scheduler to fill the holes.
>
>> Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.
>
> Yes, in my version, I ran it really early, before sched.  I needed to run before ira and before the other passes that

There are two reasons why I run it late in the compilation process.
1) IRA is a pass I tend not to disturb, since it changes the code
dramatically.  With fusion after IRA, the improvement from fusion is
clear-cut; there is less noise here.
2) Spilling generates many load/store pair opportunities on ARM,
which I don't want to miss.

> changed the code so much that they obscured the instructions I wanted to fuse.  One thing that I
> saw was that once we got too far along, the offsets used were loaded into registers, and instead of
> being r+n it was r1+r2, and this blew the sorting.

I saw this problem too; it's the fwprop pass that is to blame.
Currently it just forwards computations into address expressions even
when there is no benefit.  For example, if we have:

add rx, ry, rz
ldr   r1, [rx]
ldr   r2, [rx+4]
ldr   r3, [rx+8]

It will be transformed into:

add rx, ry, rz
ldr   r1, [ry+rz]
ldr   r2, [rx+4]
ldr   r3, [rx+8]

It corrupts consecutive load pairs, gives no reduction in computation,
and increases register pressure.  The root cause is that the pass works
reg_use by reg_use and doesn't take global information into
consideration, such as register live ranges, whether there is a real
benefit, or whether it hurts other optimizations.
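For what it's worth, the missing check can be stated in a couple of lines.
The helper below is only a sketch of the heuristic I have in mind, not the
actual fwprop code, and both parameters are hypothetical:

/* Sketch only, not fwprop code.  def_becomes_dead: propagating this use
   would let the defining add be deleted, i.e. there is a real reduction
   in computation.  breaks_consecutive_access: the rewritten address no
   longer lines up with its neighbours, so a load/store pair is lost.  */
static int
worth_propagating_into_address (int def_becomes_dead,
                                int breaks_consecutive_access)
{
  /* No reduction in computation plus a lost pair is a pure loss (and
     the reg+reg form also lengthens live ranges), so refuse it.  */
  if (!def_becomes_dead && breaks_consecutive_access)
    return 0;
  return 1;
}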

>
>>> So here comes this patch.  It adds a new sched_fusion pass just before
>>> peephole2.
>
> Hum...  In my version, that was just way too late.  I needed to generate pseudos and count on ira and others to clean up the register moves for me.  I've not tried his patch to see if it works as well as my version on my problem (combining adjacent loads and stores).

Putting it after IRA has its own disadvantage.  For example, a false
dependency is introduced, as in the example below:

ldr  r1, [rx + 4]
ldr  r2, [rx + 12]
ldr  rx, [rx + 8]
...
use r1/r2/rx

Here we can't fuse the first and third ldr instructions because the
third one modifies rx, which the second one still uses as its base.
This is why I even tried to push this pass (along with peephole2) after
register renaming and to enable the register renaming pass in my
experiments: renaming the destination of the third load away from rx
removes that clobber and lets the pair form.  It does help a lot, at
least on ARM.


>
>>>  The methodology is as follows:
>>> 1) The priority in the scheduler is extended into a [fusion_priority, priority]
>>> pair, with fusion_priority as the major key and priority as the minor key.
>>> 2) The back end assigns a priority pair to each instruction; instructions
>>> that want to be fused together get the same fusion_priority assigned.
>> I think the bulk of my questions above are targeted at this part of the change.  When are these assignments made and how much freedom does the backend have to make/change those assignments?
>
> It is free to do anything with the numbers it wants.  But the port writer has carefully chosen the priorities to generate the best code, so it is unlikely it would ever be wise to change the numbers.
>
>> So another question, given a fused pair, is there any way to guarantee ordering within the fused pair.
>
> Sure, ensure one instruction has a lower number than the other.  That will order them reliably.
>
>> This is useful to cut down on the number of peep2 patterns.  I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right?
>
> Right, assuming that there is an ordering for the instructions that makes sense.  Given:
>
> mul r1,#13
> inc r3
> mul r5,#13
> inc r4
>
> and a machine that could do a V2 mul, given a pair:
>
> mov r8,r1
> mov r9,r5
> mulv2 r8,#13
> inc r3
> inc r4
> mov r1,r8
> mov r5,r9
>
> Here, we can see all the extra moves to ensure a register pair (n, n+1 for the mulv2 instruction) and the temporary created to hold the tuple (r8,r9).  In my peephole patterns I do this to ensure registers have the right register number.  I do this before register allocation, as it can choose the registers best, ensure all the other instructions around it use those registers, and leave the moves in if it can't do better.  The moves tend to be able to fill scheduling holes, so, even if they remain, usually the cost for them shouldn't be too bad.
>
> On the other hand, if you have
>
> left
> left
> right
> right
>
> There is no way to sort them to get:
>
> left
> right
> left
> right
>
> and then fuse:
>
> left_right
> left_right
>
> This would be impossible.

I can't understand this very well; this is exactly one case we want to
fuse on ARM, for movw/movt.  Given:
movw  r1, const_1
movw  r2, const_2
movw  r3, const_3
movt  r1, const_1
movt  r2, const_2
movt  r3, const_3

I would expect:
movw  r1, const_1
movt  r1, const_1
movw  r2, const_2
movt  r2, const_2
movw  r3, const_3
movt  r3, const_3
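If fusion_pri is keyed on the destination register, the sort should give
exactly that interleaving.  The sketch below is illustrative only (the
helper, the flag and the constants are made up; this is not the ARM hook),
showing one way a backend could hand back the pair for movw/movt:

/* Illustrative sketch, not the ARM implementation.  Give each movw/movt
   the same fusion_pri as its destination register and let movw outrank
   movt on the minor key, so the (fusion_pri, priority) sort emits
   movw rN immediately followed by movt rN.  */
static void
toy_movw_movt_fusion_pri (int dest_regno, int is_movw,
                          int *fusion_pri, int *pri)
{
  /* Same destination register, same fusion group.  */
  *fusion_pri = 100 - dest_regno;
  /* Within the group, movw comes first.  */
  *pri = is_movw ? 2 : 1;
}

With the six instructions above, r1's pair then sorts ahead of r2's, which
sorts ahead of r3's, giving the expected sequence.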

Thanks,
bin

>
>> I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes.  So instead of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win for floating point on the sparc processors of that time.  I don't expect you to do that investigation.  Just thinking out loud.
>
> Well, curiously enough, that sort of thing is exactly why I wrote the code.  I was thinking of cases where the user loads lots of registers from memory, plays a bit with them, and then puts the data back into memory.  I wanted to put all the loads together and then try to find the loads that are next to each other and combine them into a single, larger load.  On a machine that can load n registers at once, you can then save n-1 instructions.  Likewise for stores.  Conceptually, there is nothing port specific about the optimization I wanted to do.

