This is the mail archive of the
mailing list for the GCC project.
Re: [GCC RFC]A new and simple pass merging paired load store instructions
- From: "Bin.Cheng" <amker dot cheng at gmail dot com>
- To: Mike Stump <mikestump at comcast dot net>
- Cc: "bin.cheng" <bin dot cheng at arm dot com>, gcc-patches List <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 16 May 2014 18:07:14 +0800
- Subject: Re: [GCC RFC]A new and simple pass merging paired load store instructions
- Authentication-results: sourceware.org; auth=none
- References: <004d01cf700e$ef1e30e0$cd5a92a0$ at arm dot com> <32B4330F-1D0F-4D4E-BF7A-2E5B2148B893 at comcast dot net>
On Fri, May 16, 2014 at 12:51 AM, Mike Stump <firstname.lastname@example.org> wrote:
> On May 15, 2014, at 12:26 AM, bin.cheng <email@example.com> wrote:
>> Here comes up with a new GCC pass looking through each basic block and
>> merging paired load store even they are not adjacent to each other.
> So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else. Within a group of loads or stores the secondary key is the base register, the next key is the offset. The net result, all loads off the same register are sorted in increasing order. We then can use some define_insns and some peephole to patterns to take the seemingly unrelated instructions, which are now adjacent to knock them down into single instructions, instead of the mass of instructions they were before. And then a peephole pass that runs early to allow the patterns to do the heavy lifting. This scheme can handle unlimited complexity on the addressing forms just by ensuring the cost function for the new scheduling pass looks at all relevant bits (target dependent, if the simple machine independent form reg + n is not enough). The sched0 and the peephole pass run early:
> + NEXT_PASS (pass_sched0);
> + NEXT_PASS (pass_peephole1);
> NEXT_PASS (pass_web);
> NEXT_PASS (pass_rtl_cprop);
> NEXT_PASS (pass_cse2);
> (before register allocation) so, it can arrange to have things in adjacent registers for the load and store multiple instructions. The register allocator can then arrange all the code to use those registers directly.
> So, having done all this work, I think it would be nicer if there were a pass that managed it (so that one doesn't have to write any of the peephole or the define_insns (you need like 3*n patterns, and the patterns of O(n), so, you need around n*4/2 lines of code, which is annoying for large n. A pass could use the existing load store multiple patterns directly, so, no additional port work. In my work, since I extend life times of values into registers, pretty much without limit, this could be a bad thing. The code is naturally limited to only extending the lives of things where load and store multiple are used, as if they aren't used, the regular scheduling pass would undo all the sched0 motion. I choose a light copy of sched, as I don't care about compile time, and I wanted a very small patch that was easy to maintain. If this pass when into trunk, we'd run the new passes _only_ if a port asked for them. 99% of the ports likely don't want either, though, peephole before register allocation might be interesting for others to solve other issues.
> I wanted this to run before register allocation as my load and store multiple instructions only take consecutive register ranges (n-m), and I need the register allocator to manage to make it true. I do reg to reg motion to move things around as needed, but almost all I expect the allocator to get rid of. Very complex cases might wind up with a few extra moves, but I have nice bubbles that I can fit these extra moves into.
> In my scheme, no new options, no documentation for new options, no new param options, no silly constants, no hard to write/maintain pass, no new weird targets interface, no limit on just pairs, works on stores as well, runs earlier, 430 lines instead of 1086 lines, conceptually much simpler, added benefit of peephole before register allocation that can be used for other things by the port...
> So, my question is, does my scheme on your target find more or fewer things? Would your scheme pair pairs (so that 4 registers would go into 1 instruction)?
Thanks for bringing up this new method. I had to admit that I was not
very much into this method at the first glance especially it requires
another pass in cooperation with scheduler.
For the first question, unfortunately, I can't apply the patch to svn
trunk, could you please update it thus I can do some experiment with
it? From my experience, lots of opportunities on ARM are generated by
RA, putting it before RA would miss many cases. Another possible
issue is about interaction with other optimizations. Putting it at
early stage of RTL, we may disturb other optimizations like
/gcse/-fgcse-lm/-fgcse-sm/dse. Of course, it has advantages too, for
example, fwprop_addr sometimes corrupt load/store pair opportunities
on ARM, it won't be a problem for your patch.
For the second question, the answer is no for current implementation.
For ARM the most important opportunity is the paired one, so I just
started with pair of two instructions, which is much simpler. I do
have a draft idea how to merge more than two instructions, but haven't
worked on it yet.