[PATCH][AArch64] Increase code alignment

Wilco Dijkstra Wilco.Dijkstra@arm.com
Thu Jun 30 15:14:00 GMT 2016


Evandro Menezes wrote:
On 06/29/16 07:59, James Greenhalgh wrote:
> On Tue, Jun 21, 2016 at 02:39:23PM +0100, Wilco Dijkstra wrote:
>> ping
>>
>>
>> From: Wilco Dijkstra
>> Sent: 03 June 2016 11:51
>> To: GCC Patches
>> Cc: nd; philipp.tomsich@theobroma-systems.com; pinskia@gmail.com; jim.wilson@linaro.org; benedikt.huber@theobroma-systems.com; Evandro Menezes
>> Subject: [PATCH][AArch64] Increase code alignment
>>      
>> Increase loop alignment on Cortex cores to 8 and set function alignment to
>> 16.  This makes things consistent across big.LITTLE cores, improves
>> performance of benchmarks with tight loops and reduces performance variations
>> due to small changes in code layout.  It looks like almost all AArch64 cores
>> agree on an alignment of 16 for functions and 8 for loops and branches, so we
>> should change -mcpu=generic as well if there is no disagreement - feedback welcome.
>>
>> OK for commit?
> Hi Wilco,
>
> Sorry for the delay.
>
> This patch is OK for trunk.
>
> I hope we can continue the discussion as to whether there is a set of
> values for -mcpu=generic that better suits the range of cores we now
> support.
>
> After Wilco's patch, and using the values in the proposed vulcan and
> qdf24xx structures, and with the comments from this thread, the state of
> the tuning structures will be:
>
> -mcpu=    : function-jump-loop alignment
>
> cortex-a35: 16-8-8
> cortex-a53: 16-8-8
> cortex-a57: 16-8-8
> cortex-a72: 16-8-8
> cortex-a73: 16-8-4 (Though I'm guessing this is just an ordering issue on
>                      when Kyrill copied the cost table. Kyrill/Wilco do you
>                      want to spin an update for the Cortex-A73 alignment?
>                      Consider it preapproved if you do)
> exynos-m1 : 4-4-4  (2: 16-4-16, 3: 8-4-4)
> thunderx  : 8-8-8  (But 16-8-8 acceptable/maybe better)
> xgene1    : 16-8-16
> qdf24xx   : 16-8-16
> vulcan    : 16-8-16
>
> Generic is currently set to 8-8-4, which doesn't look representative of
> these individual cores at all.
>
> Running down that list, I see a very compelling case for 16 for function
> alignment (from comments in this thread, it is second choice, but not too
> bad for exynos-m1, thunderx is maybe better at 16, but should not be
> worse). I also see a close to unanimous case for 8 for jump alignment,
> though I note that in Evandro's experiments 8 never came out as "best"
> for exynos-m1. Did you get a feel in your experiments for what the
> performance penalty of aligning jumps to 8 would be? That generic is
> currently set to 8 seems like a good tie-breaker if the performance impact
> for exynos-m1 would be minimal, as it gives no change from the current
> behaviour.
>
> For loop alignment I see a split straight down the middle between 8
> and 16 (exynos-m1/cortex-a73 are the outliers at 4, but second place for
> exynos-m1 was 16-4-16, and cortex-a73 might just be an artifact of when the
> table was copied).
>
> From that, and as a starting point for discussion, I'd suggest that
> 16-8-8 or 16-8-16 are most representative from the core support that has
> been added to GCC.
>
> Wilco, the Cortex-A cores tip the balance in favour of 16-8-8. As you've
> most recently looked at the cost of setting loop alignment to 16, do you
> have any comments on why you chose to standardize on 16-8-8 rather than
> 16-8-16, or what the cost would be of 16-8-16?
>
> I've widened the CC list, and I'd appreciate any comments on what we all
> think will be the right alignment parameters for -mcpu=generic.
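Tallying the per-core triples James lists makes the "most representative" choice easy to check; a quick sketch (illustrative only, not GCC code):

```python
from collections import Counter

# Function-jump-loop alignment triples (bytes) from the summary above.
alignments = {
    "cortex-a35": (16, 8, 8),
    "cortex-a53": (16, 8, 8),
    "cortex-a57": (16, 8, 8),
    "cortex-a72": (16, 8, 8),
    "cortex-a73": (16, 8, 4),
    "exynos-m1":  (4, 4, 4),
    "thunderx":   (8, 8, 8),
    "xgene1":     (16, 8, 16),
    "qdf24xx":    (16, 8, 16),
    "vulcan":     (16, 8, 16),
}

# Most common value per position, as a crude "representative" triple.
representative = tuple(
    Counter(t[i] for t in alignments.values()).most_common(1)[0][0]
    for i in range(3)
)
```

By this simple majority count the representative triple comes out as 16-8-8, matching the suggestion above, though a plain vote ignores how costly a non-preferred value is for each core.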

This is a very good summary.  However, I think that it should also 
consider the effects on code size.

Evidently, an alignment of 16 has the greatest probability of increasing 
code size, whereas an alignment of 4 doesn't increase it.  Likewise, 
what is being aligned also matters, since functions, jumps and loops 
each occur with a different frequency.  Arguably, function alignment has 
the least effect on code size, followed by jumps and then loops, based 
on typical frequencies.  In specific cases the weights may shift 
somewhat, of course.

As I stated in my previous reply, I also tracked the code size when 
experimenting with different alignments.  It was clear from my data that 
the jump alignment was the most critical to code size (~3% on average), 
followed by loop alignment (<1%) and function alignment (<1%).  
Therefore, I'd argue against setting the alignment for jumps to 16.  A 
case can be made for an alignment of 8 for them, since the cost to code 
size is more modest.

From a performance perspective, generic alignments of 16-8-8 would rank 
4th on Exynos M1, with a negligible code size penalty (<1%) over the 
current 8-4-4.

------------------

Hi Evandro,

Agreed, the codesize impact is important as well, so there is a balance to
be struck.  Looking at the codesize cost, branch alignment has the largest
cost but also the least benefit.  It isn't helped by GCC's very simplistic
way of aligning: our goal is to improve fetching of hot code, yet GCC
appears to align cold code as well.  Also, I haven't found a way to align
a loop/branch only when a fetch would otherwise return just 1 or 2
instructions.  Improving this would reduce the cost of aligning loops and
branches without losing the benefits.
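The smarter heuristic described here - pad only when the first fetch of the target would deliver just one or two useful instructions - could look like the following sketch (assuming a 16-byte fetch window and 4-byte instructions; this is not existing GCC code):

```python
FETCH_BYTES = 16  # assumed fetch window size
INSN_BYTES = 4    # fixed AArch64 instruction size

def worth_aligning(target_offset):
    """Return True if the loop/branch target at target_offset sits so
    late in its fetch window that the first fetch would return only
    1 or 2 instructions, i.e. alignment padding likely pays off."""
    slot = (target_offset % FETCH_BYTES) // INSN_BYTES      # 0..3
    insns_in_first_fetch = (FETCH_BYTES // INSN_BYTES) - slot
    return insns_in_first_fetch <= 2
```

A target at offset 0 or 4 already gets 3-4 instructions from its first fetch and would be left unaligned, while targets at offsets 8 or 12 would be padded, roughly halving the number of padded sites compared with aligning unconditionally.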

Picking a few benchmarks from SPEC, I get the following codesize costs
for various function, branch and loop alignments:

4_4_4	100% (current setting for exynos-m1)
8_4_4	+0.3%
16_4_4	+0.8%
16_4_8	+1.1%
16_4_16	+1.6%
16_8_8	+2.4%  (current setting for Cortex-A*)
16_8_16	+3.0%  (current setting for xgene1, qdf24xx, vulcan)

Based on this we should consider setting the branch alignment back to 4, as
this reduces codesize by 1.4%!  Also, 16_4_16 might be a better option than
the current 16_8_8 if the performance is similar or better.
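For reference, the 1.4% figure follows directly from the table above (16_8_16 at +3.0% versus 16_4_16 at +1.6%):

```python
# Codesize cost (percent over the 4_4_4 baseline) from the table above.
codesize_cost = {
    "4_4_4": 0.0, "8_4_4": 0.3, "16_4_4": 0.8, "16_4_8": 1.1,
    "16_4_16": 1.6, "16_8_8": 2.4, "16_8_16": 3.0,
}

# Moving branch alignment from 8 back to 4, keeping 16-byte function
# and loop alignment:
saving = codesize_cost["16_8_16"] - codesize_cost["16_4_16"]
```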

Wilco
