[PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding
Denys Vlasenko
dvlasenk@redhat.com
Fri Aug 12 15:20:00 GMT 2016
On 08/12/2016 04:28 PM, Michael Matz wrote:
> Hi,
>
> On Fri, 12 Aug 2016, Denys Vlasenko wrote:
>
>>> Have you tested the performance impact of your patch? Note that the
>>> macro you changed is used for function and code label alignment. So,
>>> unless I misunderstand something that means that if the large
>>> alignment can't be achieved for e.g. a loop start label, you won't
>>> align it at all anymore. This should be fairly catastrophic for any
>>> loopy benchmark, so anything like this would have to be checked on a
>>> couple benchmarks from cpu2000 (possibly cpu2006), which has some that
>>> are extremely alignment sensitive.
>>>
>>> Even for function labels I'd find no alignment at all strange, and I
>>> don't see why you'd want this.
>>
>> For many generations now, x86 CPUs have had cache lines of at least 32,
>> and usually 64 bytes. Decoders fetch instructions in blocks of 32 or 64
>> bytes, not less. Instructions which are "misaligned" within a cache line
>> (for example, starting at byte 5) but still fit into one cache line are
>> fetched in one go, with no penalty.
>
> Yes, I know all that. Fetching is one thing. The loop cache, for
> instance, is another (more important) thing. Not aligning the loop head
> increases the chance of the whole loop being split over more cache lines
> than necessary. Jump predictors also don't necessarily decode/remember
> the whole instruction address. And so on.
>
>> Aligning to 8 bytes within a cacheline does not speed things up. It
>> simply wastes bytes without speeding up anything.
>
> It's not that easy, which is why I asked whether you have _measured_
> your theory that it doesn't matter. All the alignment adjustments in
> GCC were included after measurements. In particular, the
> align-by-8-always rule (for loop heads) was added after some large
> regressions on cpu2000 in 2007 (on Core 2 Duo at that time).
>
> So I'm never much thrilled by lists of reasons why performance can't
> possibly be affected, especially when we know that it once _was_
> affected and there is an easy way to show that it isn't.
OK, let's measure. Here is a small test, z.S:

# compile with: gcc -nostartfiles -nostdlib
        .globl  _start
_start:
        .p2align 8
        mov     $4000*1000*1000, %eax   # 5-byte insn
        nop                             # 6 bytes so far
        nop                             # 7
        nop                             # 8: "loop" lands on an 8-byte boundary
loop:
        dec     %eax
        lea     (%ebx), %ebx            # 3-byte do-nothing insn
        jnz     loop
        push    $0                      # push a zero "return address"
        ret                             # jumps to address 0 -> SEGV
This program loops 4 billion times, then exits (by crashing).
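(Sanity check: 4*10^9 iterations times 3 instructions per loop iteration is ~12*10^9 executed instructions, which matches the ~12.03 billion instructions perf reports below.)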
I build two executables from it: z8, shown above, whose loop is 8-byte aligned:
$ objdump -dr z8
z8: file format elf64-x86-64
Disassembly of section .text:
0000000000400100 <_start>:
400100: b8 00 28 6b ee mov $0xee6b2800,%eax
400105: 90 nop
400106: 90 nop
400107: 90 nop
0000000000400108 <loop>:
400108: ff c8 dec %eax
40010a: 67 8d 1b lea (%ebx),%ebx
40010d: 75 f9 jne 400108 <loop>
40010f: 6a 00 pushq $0x0
400111: c3 retq
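(Note how the addresses confirm the byte counts in the source comments: the mov occupies 0x400100-0x400104, the three NOPs sit at 0x400105-0x400107, and loop starts at 0x400108.)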
and z7, which has one NOP removed and therefore its loop starts
at 0000000000400107.
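Both binaries are built the same way; only the source differs by one NOP (I call the trimmed source z7.S here purely for clarity):

$ gcc -nostartfiles -nostdlib -o z8 z.S
$ gcc -nostartfiles -nostdlib -o z7 z7.S   # z.S with one "nop" line deleted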
$ perf stat -r20 ./z7
Performance counter stats for './z7' (20 runs):
1204.217409 task-clock (msec) # 0.972 CPUs utilized ( +- 0.19% )
10 context-switches # 0.009 K/sec ( +- 15.69% )
0 cpu-migrations # 0.000 K/sec ( +- 77.80% )
3 page-faults # 0.003 K/sec ( +- 2.87% )
4,220,236,037 cycles # 3.505 GHz ( +- 0.20% )
12,030,574,486 instructions # 2.85 insn per cycle ( +- 0.00% )
4,005,827,208 branches # 3326.498 M/sec ( +- 0.00% )
22,338 branch-misses # 0.00% of all branches ( +- 4.10% )
1.238638386 seconds time elapsed ( +- 0.19% )
$ perf stat -r20 ./z8
Performance counter stats for './z8' (20 runs):
1203.453938 task-clock (msec) # 0.973 CPUs utilized ( +- 0.27% )
8 context-switches # 0.007 K/sec ( +- 14.46% )
0 cpu-migrations # 0.000 K/sec ( +- 54.61% )
3 page-faults # 0.003 K/sec ( +- 2.60% )
4,233,994,227 cycles # 3.518 GHz ( +- 0.27% )
12,030,085,275 instructions # 2.84 insn per cycle ( +- 0.00% )
4,005,715,106 branches # 3328.516 M/sec ( +- 0.00% )
21,486 branch-misses # 0.00% of all branches ( +- 4.42% )
1.236360951 seconds time elapsed ( +- 0.26% )
z8 is 0.2% faster. Let's try another pair of runs:
Performance counter stats for './z7' (20 runs):
1217.476778 task-clock (msec) # 0.972 CPUs utilized ( +- 0.30% )
8 context-switches # 0.006 K/sec ( +- 10.98% )
0 cpu-migrations # 0.000 K/sec ( +- 27.14% )
3 page-faults # 0.003 K/sec ( +- 3.06% )
4,252,346,035 cycles # 3.493 GHz ( +- 0.17% )
12,030,474,923 instructions # 2.83 insn per cycle ( +- 0.00% )
4,005,793,752 branches # 3290.242 M/sec ( +- 0.00% )
22,640 branch-misses # 0.00% of all branches ( +- 6.52% )
1.252268537 seconds time elapsed ( +- 0.32% )
Performance counter stats for './z8' (20 runs):
1220.024012 task-clock (msec) # 0.973 CPUs utilized ( +- 0.35% )
8 context-switches # 0.006 K/sec ( +- 12.55% )
0 cpu-migrations # 0.000 K/sec ( +- 39.74% )
3 page-faults # 0.003 K/sec ( +- 2.87% )
4,247,690,562 cycles # 3.482 GHz ( +- 0.27% )
12,032,460,554 instructions # 2.83 insn per cycle ( +- 0.01% )
4,006,219,524 branches # 3283.722 M/sec ( +- 0.01% )
26,651 branch-misses # 0.00% of all branches ( +- 7.73% )
1.253366584 seconds time elapsed ( +- 0.36% )
Now z7 is 0.1% faster.
It looks like aligning the loop to 8 bytes does not matter (in this particular example, at least).
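This is exactly what the proposed -falign-FOO=N,M syntax is for: align to N bytes only when that costs at most M bytes of padding, otherwise emit no padding at all. A made-up usage example (the numbers are arbitrary):

$ gcc -O2 -falign-loops=8,5 foo.c   # 8-byte-align loop labels, but only if <= 5 padding bytes suffice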