[GCC RFC] A new and simple pass merging paired load store instructions

Thu May 15 16:52:00 GMT 2014

On May 15, 2014, at 12:26 AM, bin.cheng <bin.cheng@arm.com> wrote:
> Here comes up with a new GCC pass looking through each basic block and
> merging paired load store even they are not adjacent to each other.

So I have a target with load and store multiple support that handles a large number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass.  It uses a different cost function which arranges all loads first, then all stores, then everything else.  Within a group of loads or stores the secondary key is the base register, and the next key is the offset.  The net result: all loads off the same register are sorted in increasing offset order.  We can then use some define_insns and some peephole patterns to take the seemingly unrelated instructions, which are now adjacent, and knock them down into single instructions, instead of the mass of instructions they were before.  A peephole pass then runs early to allow the patterns to do the heavy lifting.  This scheme can handle unlimited complexity in the addressing forms just by ensuring the cost function for the new scheduling pass looks at all relevant bits (target dependent, if the simple machine independent form reg + n is not enough).  The sched0 and the peephole pass run early:

+      NEXT_PASS (pass_sched0);
+      NEXT_PASS (pass_peephole1);
       NEXT_PASS (pass_web);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_cse2);

(before register allocation) so, it can arrange to have things in adjacent registers for the load and store multiple instructions.  The register allocator can then arrange all the code to use those registers directly.
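
To make the ordering above concrete, here is a minimal sketch of the three-level sort key (loads, then stores, then everything else; within loads or stores by base register, then by offset).  This is not the code from the attached patch: the struct toy_insn type and the sched0_cmp and sched0_sort names are hypothetical, and it sorts a flat array with qsort, whereas a real scheduling pass would apply the same priority only among instructions whose dependencies allow them to move.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an instruction; not GCC's rtx. */
enum mem_kind { MEM_LOAD = 0, MEM_STORE = 1, MEM_OTHER = 2 };

struct toy_insn {
    enum mem_kind kind;  /* primary key: loads, then stores, then the rest */
    int base_reg;        /* secondary key: base register number */
    long offset;         /* third key: constant offset from the base */
    int orig_pos;        /* original position, used as the final tie-break */
};

/* Three-way compare that cannot overflow. */
static int cmp_long(long a, long b)
{
    return (a > b) - (a < b);
}

/* Comparator implementing the ordering described in the text. */
static int sched0_cmp(const void *pa, const void *pb)
{
    const struct toy_insn *a = pa;
    const struct toy_insn *b = pb;
    int c;

    if ((c = cmp_long(a->kind, b->kind)) != 0)
        return c;
    if (a->kind != MEM_OTHER) {
        /* Only loads and stores are ordered by address. */
        if ((c = cmp_long(a->base_reg, b->base_reg)) != 0)
            return c;
        if ((c = cmp_long(a->offset, b->offset)) != 0)
            return c;
    }
    /* Keep everything else in its original relative order. */
    return cmp_long(a->orig_pos, b->orig_pos);
}

/* Sort a block of toy instructions in place.  Dependencies are ignored
   here; that is an assumption of this sketch, not of a real scheduler. */
static void sched0_sort(struct toy_insn *insns, size_t n)
{
    qsort(insns, n, sizeof insns[0], sched0_cmp);
}
```

After the sort, loads off the same base register sit next to each other in increasing offset order, which is what lets adjacent-instruction peephole patterns match them.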

So, having done all this work, I think it would be nicer if there were a pass that managed it, so that one doesn’t have to write any of the peepholes or the define_insns (you need like 3*n patterns, and each pattern is O(n) in size, so you need around n*4/2 lines of code, which is annoying for large n).  A pass could use the existing load store multiple patterns directly, so no additional port work.  In my work, since I extend lifetimes of values in registers, pretty much without limit, this could be a bad thing.  The code is naturally limited to extending the lives of things only where load and store multiple are used; if they aren’t used, the regular scheduling pass would undo all the sched0 motion.  I chose a light copy of sched, as I don’t care about compile time, and I wanted a very small patch that was easy to maintain.  If this pass went into trunk, we’d run the new passes _only_ if a port asked for them.  99% of the ports likely don’t want either, though peephole before register allocation might be interesting for others to solve other issues.
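
As a rough illustration of what a dedicated pass would have to recognize once the accesses are sorted, the sketch below finds runs of accesses that share a base register and sit at contiguous offsets; each run is a candidate for one load or store multiple.  Everything here is an assumption made for illustration: struct toy_access, run_length, count_after_merge and the fixed word size are hypothetical and are not taken from the attached patch or from GCC.

```c
#include <stddef.h>

/* Hypothetical, simplified memory access; not GCC's rtx. */
struct toy_access {
    int base_reg;  /* base register number */
    long offset;   /* constant byte offset from the base */
};

/* Length of the run starting at index start: all elements share one
   base register and each offset is exactly word bytes past the previous
   one.  Assumes acc is already sorted by (base_reg, offset).  Returns 0
   if start is out of range. */
static size_t run_length(const struct toy_access *acc, size_t n,
                         size_t start, long word)
{
    size_t len = 1;

    if (start >= n)
        return 0;
    while (start + len < n
           && acc[start + len].base_reg == acc[start].base_reg
           && acc[start + len].offset == acc[start + len - 1].offset + word)
        len++;
    return len;
}

/* Number of instructions left if every run collapses into a single
   instruction (a run of length 1 simply stays as it is). */
static size_t count_after_merge(const struct toy_access *acc, size_t n,
                                long word)
{
    size_t i = 0;
    size_t out = 0;

    while (i < n) {
        i += run_length(acc, n, i, word);
        out++;
    }
    return out;
}
```

Because the run length is not capped at two, four contiguous words off one base register count as a single candidate instruction rather than two pairs.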

I wanted this to run before register allocation as my load and store multiple instructions only take consecutive register ranges (n-m), and I need the register allocator to make that true.  I do reg to reg moves to shuffle things around as needed, but I expect the allocator to get rid of almost all of them.  Very complex cases might wind up with a few extra moves, but I have nice bubbles that I can fit these extra moves into.

In my scheme, no new options, no documentation for new options, no new param options, no silly constants, no hard to write/maintain pass, no new weird targets interface, no limit on just pairs, works on stores as well, runs earlier, 430 lines instead of 1086 lines, conceptually much simpler, added benefit of peephole before register allocation that can be used for other things by the port…

So, my question is, does my scheme on your target find more or fewer things?  Would your scheme pair pairs (so that 4 registers would go into 1 instruction)?

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ldm.diffs.txt
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20140515/aa66b904/attachment.txt>