[PATCH 2/4][AArch64] Increase the loop peeling limit

Wed Feb 3 19:46:00 GMT 2016

On 01/08/16 16:55, Evandro Menezes wrote:
> On 12/16/2015 02:11 PM, Evandro Menezes wrote:
>> On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
>>> On 15/12/15 23:34, Evandro Menezes wrote:
>>>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>>>> 2015-11-05  Evandro Menezes <e.menezes@samsung.com>
>>>>>>>>>
>>>>>>>>>     gcc/
>>>>>>>>>
>>>>>>>>>         * config/aarch64/aarch64.c (aarch64_override_options_internal):
>>>>>>>>>         Increase loop peeling limit.
>>>>>>>>>
>>>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>>>> With this change, I noticed no major regression in either
>>>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>>>>> ones, improved significantly.
>>>>>>>>>
>>>>>>>>> I tested this tuning on Exynos M1 and on A57. ThunderX seems to
>>>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>>>> from other stakeholders.
>>>>>>>>
>>>>>>>> Ping.
>>>>>>> I'd like to leave this for a call from the port maintainers. I can see
>>>>>>> why this leads to more opportunities for vectorization, but I'm concerned
>>>>>>> about the wider impact on code size. Certainly I wouldn't expect this to
>>>>>>> be our default at -O2 and below.
>>>>>>>
>>>>>>> My gut feeling is that this doesn't really belong in the back-end (there
>>>>>>> are presumably good reasons why the default for this parameter across GCC
>>>>>>> has fluctuated from 400 to 100 to 200 over recent years), but as I say,
>>>>>>> I'd like Marcus or Richard to make the call as to whether or not we take
>>>>>>> this patch.
>>>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>>>> not a concern, for this heuristic is only active when unrolling
>>>>>> loops, when code size is already of secondary importance.
>>>>> My understanding was that loop peeling is enabled from -O2 upwards, and
>>>>> is also used to partially peel unaligned loops for vectorization
>>>>> (allowing the vector code to be well aligned), or to completely peel
>>>>> inner loops which may then become amenable to SLP vectorization.
>>>>>
>>>>> If I'm wrong then I take back these objections. But I was sure this
>>>>> parameter was used in a number of situations outside of just
>>>>> -funroll-loops/-funroll-all-loops. Certainly I remember seeing
>>>>> performance sensitivities to this parameter at -O3 in some internal
>>>>> workloads I was analysing.
>>>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>>>> seems to me that peeling is only used by optimizations which already
>>>> lead to potential increase in code size.
>>>>
>>>> For instance, with "-Ofast -funroll-all-loops", the total text size for
>>>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>>>> without it; with just "-O2", it is the same at 23.1MB regardless of this
>>>> setting.
>>>>
>>>> So it seems to me that this proposal should be neutral for up to -O2.
>>>>
>>>> Thank you,
>>>>
>>> My preference would be to not diverge from the global parameter
>>> settings.  I haven't looked in detail at this parameter but it seems to
>>> me there are two possible paths:
>>>
>>> 1) We could get agreement globally that the parameter should be increased.
>>> 2) We could agree that this specific use of the parameter is distinct
>>> from some other uses and deserves a new param in its own right with a
>>> higher value.
>>>
>>
>> Here's what I have observed, not only in AArch64: architectures 
>> benefit differently from certain loop optimizations, especially those 
>> dealing with vectorization.  Be it because some have plenty of 
>> registers for more aggressive loop unrolling, or because some have
>> lower costs to vectorize.  With this, I'm trying to imply that there
>> may be a case for tweaking this parameter to suit loop optimizations
>> better to specific targets.  While it is not the only parameter 
>> related to loop optimizations, it seems to be the one with the 
>> desired effects, as exemplified by PPC, S390 and x86 (AOSP).  Though 
>> there is the possibility that they are actually side-effects, as 
>> Richard Biener perhaps implied in another reply.
>>
>
>
> Gents,
>
> Any new thoughts on this proposal?
>

Ping?

-- 
Evandro Menezes