[PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding

Denys Vlasenko dvlasenk@redhat.com
Fri Aug 12 15:20:00 GMT 2016



On 08/12/2016 04:28 PM, Michael Matz wrote:
> Hi,
>
> On Fri, 12 Aug 2016, Denys Vlasenko wrote:
>
>>> Have you tested the performance impact of your patch?  Note that the
>>> macro you changed is used for function and code label alignment.  So,
>>> unless I misunderstand something that means that if the large
>>> alignment can't be achieved for e.g. a loop start label, you won't
>>> align it at all anymore. This should be fairly catastrophic for any
>>> loopy benchmark, so anything like this would have to be checked on a
>>> couple benchmarks from cpu2000 (possibly cpu2006), which has some that
>>> are extremely alignment sensitive.
>>>
>>> Even for function labels I'd find no alignment at all strange, and I
>>> don't see why you'd want this.
>>
>> For many generations now, x86 CPUs have at least 32, and usually 64 byte
>> cachelines. Decoders fetch instructions in blocks of 32 or 64 bytes. Not
>> less. Instructions which are "misaligned" (for example, starting at byte
>> 5) within a cacheline but still fitting into one cacheline are fetched
>> in one go, with no penalty.
>
> Yes, I know all that.  Fetching is one thing.  Loop cache is for instance
> another (more important) thing.  Not aligning the loop head increases
> chance of the whole loop being split over more cache lines than necessary.
> Jump predictors also don't necessarily decode/remember the whole
> instruction address.  And so on.
>
>> Aligning to 8 bytes within a cacheline does not speed things up. It
>> simply wastes bytes without speeding up anything.
>
> It's not that easy, which is why I have asked if you have _measured_ the
> correctness of your theory of it not mattering?  All the alignment
> adjustments in GCC were included after measurements.  In particular the
> align-by-8-always (for loop heads) was included after some large
> regressions on cpu2000, in 2007 (core2 duo at that time).
>
> So, I'm never much thrilled about listing reasons for why performance
> can't possibly be affected, especially when we know that it once _was_
> affected, when there's an easy way to show that it's not affected.

z.S:

# compile with: gcc -nostartfiles -nostdlib
_start:         .globl _start
                .p2align 8
                mov     $4000*1000*1000, %eax # 5-byte insn
                nop     # 6 bytes so far
                nop     # 7 bytes so far
                nop     # 8 bytes: loop starts at offset 8
loop:           dec     %eax
                lea     (%ebx), %ebx    # 3-byte insn, effectively a no-op
                jnz     loop
                push    $0
                ret     # pops the 0 and jumps to address 0 => SEGV
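
For reference, it can be assembled directly with gcc as the comment says;
the output name here just matches the binary used below:

$ gcc -nostartfiles -nostdlib -o z8 z.S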

This program loops 4 billion times, then exits (by crashing).

I built two executables from it: z8, shown above, whose loop is 8-byte aligned:

$ objdump -dr z8
z8:     file format elf64-x86-64
Disassembly of section .text:
0000000000400100 <_start>:
   400100:	b8 00 28 6b ee       	mov    $0xee6b2800,%eax
   400105:	90                   	nop
   400106:	90                   	nop
   400107:	90                   	nop
0000000000400108 <loop>:
   400108:	ff c8                	dec    %eax
   40010a:	67 8d 1b             	lea    (%ebx),%ebx
   40010d:	75 f9                	jne    400108 <loop>
   40010f:	6a 00                	pushq  $0x0
   400111:	c3                   	retq

and z7, which has one NOP removed, so its loop starts at address 0x400107
and is not 8-byte aligned.
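
The only change in z7's source is the dropped NOP (a sketch; which of the
three NOPs is removed does not matter, the loop label lands at offset 7
either way):

_start:         .globl _start
                .p2align 8
                mov     $4000*1000*1000, %eax # 5 bytes
                nop     # 6
                nop     # 7: loop now starts at offset 7
loop:           dec     %eax
                ...     # rest identical to z8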


$ perf stat -r20 ./z7
  Performance counter stats for './z7' (20 runs):
        1204.217409      task-clock (msec)         #    0.972 CPUs utilized            ( +-  0.19% )
                 10      context-switches          #    0.009 K/sec                    ( +- 15.69% )
                  0      cpu-migrations            #    0.000 K/sec                    ( +- 77.80% )
                  3      page-faults               #    0.003 K/sec                    ( +-  2.87% )
      4,220,236,037      cycles                    #    3.505 GHz                      ( +-  0.20% )
     12,030,574,486      instructions              #    2.85  insn per cycle           ( +-  0.00% )
      4,005,827,208      branches                  # 3326.498 M/sec                    ( +-  0.00% )
             22,338      branch-misses             #    0.00% of all branches          ( +-  4.10% )

        1.238638386 seconds time elapsed                                          ( +-  0.19% )

$ perf stat -r20 ./z8
  Performance counter stats for './z8' (20 runs):
        1203.453938      task-clock (msec)         #    0.973 CPUs utilized            ( +-  0.27% )
                  8      context-switches          #    0.007 K/sec                    ( +- 14.46% )
                  0      cpu-migrations            #    0.000 K/sec                    ( +- 54.61% )
                  3      page-faults               #    0.003 K/sec                    ( +-  2.60% )
      4,233,994,227      cycles                    #    3.518 GHz                      ( +-  0.27% )
     12,030,085,275      instructions              #    2.84  insn per cycle           ( +-  0.00% )
      4,005,715,106      branches                  # 3328.516 M/sec                    ( +-  0.00% )
             21,486      branch-misses             #    0.00% of all branches          ( +-  4.42% )

        1.236360951 seconds time elapsed                                          ( +-  0.26% )


z8 is about 0.2% faster here (1.2386 s vs. 1.2364 s elapsed). Let's try another run:


  Performance counter stats for './z7' (20 runs):

        1217.476778      task-clock (msec)         #    0.972 CPUs utilized            ( +-  0.30% )
                  8      context-switches          #    0.006 K/sec                    ( +- 10.98% )
                  0      cpu-migrations            #    0.000 K/sec                    ( +- 27.14% )
                  3      page-faults               #    0.003 K/sec                    ( +-  3.06% )
      4,252,346,035      cycles                    #    3.493 GHz                      ( +-  0.17% )
     12,030,474,923      instructions              #    2.83  insn per cycle           ( +-  0.00% )
      4,005,793,752      branches                  # 3290.242 M/sec                    ( +-  0.00% )
             22,640      branch-misses             #    0.00% of all branches          ( +-  6.52% )

        1.252268537 seconds time elapsed                                          ( +-  0.32% )

  Performance counter stats for './z8' (20 runs):

        1220.024012      task-clock (msec)         #    0.973 CPUs utilized            ( +-  0.35% )
                  8      context-switches          #    0.006 K/sec                    ( +- 12.55% )
                  0      cpu-migrations            #    0.000 K/sec                    ( +- 39.74% )
                  3      page-faults               #    0.003 K/sec                    ( +-  2.87% )
      4,247,690,562      cycles                    #    3.482 GHz                      ( +-  0.27% )
     12,032,460,554      instructions              #    2.83  insn per cycle           ( +-  0.01% )
      4,006,219,524      branches                  # 3283.722 M/sec                    ( +-  0.01% )
             26,651      branch-misses             #    0.00% of all branches          ( +-  7.73% )

        1.253366584 seconds time elapsed                                          ( +-  0.36% )


Now z7 is about 0.1% faster (1.2523 s vs. 1.2534 s elapsed).

Looks like aligning this loop to 8 bytes does not matter, at least in this
particular example: the difference flips sign between runs and stays within
the ±0.2-0.4% run-to-run variation that perf reports for the elapsed time.
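
With the N[,M] form proposed in this patch, that trade-off can be requested
explicitly. A hypothetical invocation (the numbers are only illustrative):

$ gcc -O2 -falign-loops=16,7 file.c

i.e. align loop heads to 16 bytes, but only when that needs at most 7 bytes
of padding; labels that would need more padding are left unaligned.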


