This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [RFC] Tree loop unroller pass
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Kugan Vivekanandarajah <kugan dot vivekanandarajah at linaro dot org>
- Cc: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, nd <nd at arm dot com>
- Date: Fri, 16 Feb 2018 12:56:13 +0100
- Subject: Re: [RFC] Tree loop unroller pass
- Authentication-results: sourceware.org; auth=none
- References: <DB6PR0801MB205363C0CDF8D756E2C2F0D383F60@DB6PR0801MB2053.eurprd08.prod.outlook.com> <CAELXzTPTYH-QMYijxoGD_T=CeqK0p3H5X5FLiqzr9+Hvm76P8g@mail.gmail.com>
On Thu, Feb 15, 2018 at 11:30 PM, Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org> wrote:
> Hi Wilko,
> Thanks for your comments.
> On 14 February 2018 at 00:05, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> Hi Kugan,
>>> Based on the previous discussions, I tried to implement a tree loop
>>> unroller for partial unrolling. I would like to queue this RFC patches
>>> for next stage1 review.
>> This is a great plan - GCC urgently requires a good unroller!
>>> * Cost-model for selecting the loop uses the same params used
>>> elsewhere in related optimizations. I was told that keeping this same
>>> would allow better tuning for all the optimizations.
>> I'd advise against using the existing params as is. Unrolling by 8x by default is
>> way too aggressive and counterproductive. It was perhaps OK for in-order cores
>> 20 years ago, but not today. The goal of unrolling is to create more ILP in small
>> loops, not to generate huge blocks of repeated code which definitely won't fit in
>> micro-op caches and loop buffers...
> OK, I will create separate params. It is possible that I misunderstood
> it in the first place.
To generate more ILP for modern out-of-order processors you need to be
able to do followup transforms that remove dependences. So rather than
inventing magic params we should look at those transforms and key
unrolling on them, as we do in predictive commoning or other passes
that end up performing unrolling as part of their transform.
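As a concrete (hypothetical) illustration of keying unrolling on a dependence-removing transform: a simple reduction carries a dependence through its single accumulator, so plain unrolling by itself removes nothing; splitting the accumulator is the followup transform that actually creates the ILP. A source-level sketch (the compiler would do this on the IL, and the split assumes reassociation is acceptable):

```c
#include <stddef.h>

/* Naive reduction: every add depends on the previous one via 'sum',
   so the chain serializes even on a wide out-of-order core. */
long sum_naive(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* 4x unroll with accumulator splitting: the four add chains are
   independent, so they can execute in parallel.  Illustrative
   sketch only; assumes reassociating the sum is permitted. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* scalar epilogue for the remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Without the accumulator split, the 4x-unrolled body is still one serial dependence chain, which is why keying the unroll factor on the availability of such transforms makes more sense than a standalone param.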
Our measurements on x86 concluded that unrolling isn't worth it, in fact
it very often hurts. That was of course with saner params than the defaults
of the RTL unroller.
Often you even have to fight with followup passes doing things that end up
increasing register pressure too much, so we end up spilling.
>> Also we need to enable this by default, at least with -O3, maybe even for small
>> (or rather tiny) loops in -O2 like LLVM does.
> It is enabled for -O3 and above now.
So _please_ first get testcases where we know unrolling will be beneficial,
and _also_ include a thorough description of _why_.
>>> * I have also implemented an option to limit loops based on memory
>>> streams. i.e., some micro-architectures where limiting the resulting
>>> memory streams is preferred and used to limit unrolling factor.
>> I'm not convinced this is needed once you tune the parameters for unrolling.
>> If you have say 4 read streams you must have > 10 instructions already so
>> you may want to unroll this 2x in -O3, but definitely not 8x. So I see the streams
>> issue as a problem caused by too aggressive unroll settings. I think if you
>> address that first, you're unlikely going to have an issue with too many streams.
> I will experiment with some microbenchmarks. I still think that it
> will be useful for some micro-architectures. That's why it is not
> enabled by default. If a back-end thinks that it is useful, it can
> enable limiting the unroll factor based on memory streams.
Note that without doing scheduling at the same time (basically interleaving
iterations rather than pasting them after each other) I have a hard time
believing that maximizing memory streams is any good on any microarchitecture.
So transform-wise you'd end up with "vectorizing" without "vectorizing" and you
can share dependence analysis.
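To make the pasting-vs-interleaving distinction concrete, here is a hypothetical source-level sketch of a 2x unroll; in practice the interleaving would be done by the compiler on the IL rather than by hand:

```c
#include <stddef.h>

/* 2x unroll, iterations pasted after each other: in source order,
   iteration i is completed before iteration i+1 begins. */
void add_pasted(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i + 2 <= n; i += 2) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}

/* 2x unroll with iterations interleaved: both loads are issued first,
   then both adds, then both stores -- grouping operations across
   iterations the way a scheduler (or a vectorizer) would. */
void add_interleaved(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i + 2 <= n; i += 2) {
        int b0 = b[i], b1 = b[i + 1];
        int c0 = c[i], c1 = c[i + 1];
        int t0 = b0 + c0;
        int t1 = b1 + c1;
        a[i]     = t0;
        a[i + 1] = t1;
    }
}
```

The interleaved form is exactly the "vectorizing without vectorizing" shape: the same cross-iteration grouping the vectorizer needs, which is why the dependence analysis can be shared.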
>>> * I expect that there will be some cost-model changes might be needed
>>> to handle (or provide ability to handle) various loop preferences of
>>> the micro-architectures. I am sending this patch for review early to
>>> get feedbacks on this.
>> Yes it should be feasible to have settings based on backend preference
>> and optimization level (so O3/Ofast will unroll more than O2).
>>> * Position of the pass in passes.def can also be changed. Example,
>>> unrolling before SLP.
>> As long as it runs before IVOpt so we get base+immediate addressing modes.
> That's what I am doing now.
Note I believe that IVOPTs should be moved a bit later than it is
placed right now.
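For what running before IVOpts buys an unrolled loop, a hypothetical C-level sketch: with array indexing, each unrolled access has its own index computation, whereas the pointer form IVOpts tends to produce has a single pointer induction variable with the unrolled copies as constant offsets, which map directly to base+immediate addressing modes on targets that have them:

```c
#include <stddef.h>

/* Index form: the 4x-unrolled body addresses a[i], a[i+1], a[i+2],
   a[i+3], each needing an index/address computation of its own if
   nothing cleans this up. */
long sum_indexed(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}

/* Pointer form: one pointer induction variable advances per
   iteration, and the unrolled accesses become constant offsets
   p[0]..p[3] -- i.e. base+immediate addressing. */
long sum_pointer(const int *a, size_t n)
{
    long s = 0;
    const int *p = a;
    for (size_t i = 0; i + 4 <= n; i += 4, p += 4)
        s += p[0] + p[1] + p[2] + p[3];
    return s;
}
```

(Both variants deliberately sum only the full 4-element groups, to keep the sketch minimal.) This is the shape the RFC wants IVOpts to see, hence the constraint that the unroller run before it.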