This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding
On Fri, Aug 12, 2016 at 9:00 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/12/2016 05:20 PM, Denys Vlasenko wrote:
>>>
>>> Yes, I know all that. Fetching is one thing. Loop cache is for instance
>>> another (more important) thing. Not aligning the loop head increases
>>> chance of the whole loop being split over more cache lines than
>>> necessary.
>>> Jump predictors also don't necessarily decode/remember the whole
>>> instruction address. And so on.
>>>
>>>> Aligning to 8 bytes within a cacheline does not speed things up. It
>>>> simply wastes bytes without speeding up anything.
>>>
>>>
>>> It's not that easy, which is why I asked whether you have _measured_
>>> your theory that it doesn't matter. All the alignment
>>> adjustments in GCC were included after measurements. In particular the
>>> align-by-8-always (for loop heads) was included after some large
>>> regressions on cpu2000, in 2007 (core2 duo at that time).
>>>
>>> So, I'm never much thrilled about listing reasons for why performance
>>> can't possibly be affected, especially when we know that it once _was_
>>> affected, when there's an easy way to show that it's not affected.
>>
>>
>> z.S:
>>
>> #compile with: gcc -nostartfiles -nostdlib
>> _start: .globl _start
>> .p2align 8
>> mov $4000*1000*1000, %eax # 5-byte insn
>> nop # byte 6
>> nop # byte 7
>> nop # byte 8 of the block; loop starts right after
>> loop: dec %eax
>> lea (%ebx), %ebx # no-op filler
>> jnz loop
>> push $0
>> ret # returns to address 0 -> SEGV
>>
>> This program loops 4 billion times, then exits (by crashing).
>
> ...
>>
>> Looks like loop alignment to 8 bytes does not matter (in this particular
>> example).
>
>
>
> I looked into it more. I read Agner Fog's microarchitecture manual:
> http://www.agner.org/optimize/microarchitecture.pdf
>
> Since Nehalem, Intel CPUs have a loop buffer (loop stream detector),
> implemented differently on different CPUs.
>
> I used the following code, a 4-billion-iteration loop preceded by
> varying numbers of padding NOPs:
>
> 0000000000400100 <_start>:
> 400100: b8 00 28 6b ee mov $0xee6b2800,%eax
> 400105: 90 nop
> 400106: 90 nop
> 0000000000400107 <loop>:
> 400107: ff c8 dec %eax
> 400109: 8d 88 d2 04 00 00 lea 0x4d2(%rax),%ecx
> 40010f: 75 f6 jne 400107 <loop>
>
> 400111: b8 e7 00 00 00 mov $0xe7,%eax
> 400116: 0f 05 syscall
>
> On Skylake, the loop slows down if its body crosses a 16-byte boundary
> (as shown above - the last JNE insn no longer fits).
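That crossing can be double-checked arithmetically from the instruction sizes in the disassembly; a small sketch, with the sizes and addresses copied from the listing (nothing measured here):

```python
# Loop body from the disassembly: dec %eax (2 bytes),
# lea 0x4d2(%rax),%ecx (6 bytes), jne (2 bytes) -- 10 bytes total.
INSN_SIZES = [2, 6, 2]

def blocks_touched(start, sizes):
    """Set of 16-byte fetch blocks occupied by code of the given
    instruction sizes starting at address `start`."""
    end = start + sum(sizes)                    # first byte after the body
    return set(range(start // 16, (end - 1) // 16 + 1))

# Loop at 0x400106: last byte is 0x40010f, the body fits in one block.
print(len(blocks_touched(0x400106, INSN_SIZES)))   # 1
# Loop at 0x400107 (as disassembled above): last byte is 0x400110,
# so the jne spills into the next 16-byte block.
print(len(blocks_touched(0x400107, INSN_SIZES)))   # 2
```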
>
> With loop starting at 0000000000400106 and fitting into an aligned 16-byte
> block:
>
> Performance counter stats for './z6' (10 runs):
>     1209.051244 task-clock (msec)    # 0.999 CPUs utilized    ( +- 0.99% )
>               5 context-switches     # 0.004 K/sec            ( +- 11.11% )
>               2 page-faults          # 0.002 K/sec            ( +- 4.76% )
>   4,101,694,215 cycles               # 3.392 GHz              ( +- 0.51% )
>  12,027,931,896 instructions         # 2.93 insn per cycle    ( +- 0.00% )
>   4,005,295,446 branches             # 3312.759 M/sec         ( +- 0.00% )
>          15,828 branch-misses        # 0.00% of all branches  ( +- 4.49% )
>
>     1.209910890 seconds time elapsed                          ( +- 0.99% )
>
> With loop starting at 0000000000400107:
>
> Performance counter stats for './z7' (10 runs):
>     1408.362422 task-clock (msec)    # 0.999 CPUs utilized    ( +- 1.23% )
>               5 context-switches     # 0.004 K/sec            ( +- 15.59% )
>               2 page-faults          # 0.001 K/sec            ( +- 4.76% )
>   4,749,031,319 cycles               # 3.372 GHz              ( +- 0.34% )
>  12,032,488,082 instructions         # 2.53 insn per cycle    ( +- 0.00% )
>   4,006,159,536 branches             # 2844.552 M/sec         ( +- 0.00% )
>           6,946 branch-misses        # 0.00% of all branches  ( +- 3.88% )
>
>     1.409459099 seconds time elapsed                          ( +- 1.23% )
>
> With loop starting at 0000000000400108:
>
> Performance counter stats for './z8' (10 runs):
>     1407.127953 task-clock (msec)    # 0.999 CPUs utilized    ( +- 1.09% )
>               6 context-switches     # 0.004 K/sec            ( +- 15.70% )
>               2 page-faults          # 0.002 K/sec            ( +- 6.64% )
>   4,747,410,967 cycles               # 3.374 GHz              ( +- 0.39% )
>  12,032,462,223 instructions         # 2.53 insn per cycle    ( +- 0.00% )
>   4,006,154,637 branches             # 2847.044 M/sec         ( +- 0.00% )
>           7,324 branch-misses        # 0.00% of all branches  ( +- 3.40% )
>
>     1.408205377 seconds time elapsed                          ( +- 1.08% )
>
> The difference is significant and reproducible.
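Putting your cycle counts next to the 4e9 iterations makes the cost per iteration explicit (the numbers below are copied from the perf output above):

```python
ITERATIONS = 4_000_000_000

# Cycle counts from the three perf runs above.
cycles = {
    "fits (0x400106)":    4_101_694_215,
    "crosses (0x400107)": 4_749_031_319,
    "crosses (0x400108)": 4_747_410_967,
}

for name, c in cycles.items():
    print(f"{name}: {c / ITERATIONS:.3f} cycles/iteration")
# Roughly 1.03 cycles/iteration when the loop fits in one 16-byte
# block vs. ~1.19 when it crosses -- about a 16% slowdown.
```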
>
> Thus, for this CPU, aligning loops to 8 bytes is wrong: it helps if it
> happens to align a loop to 16 bytes, but it can in fact hurt performance
> if it happens to align a loop to 16+8 bytes and that pushes the end of
> the loop body across the next 16-byte boundary, as happens in the
> example above.
>
> I suspect something similar was seen some time ago on a different,
> earlier CPU, and on _that_ CPU the decoder/loop-buffer idiosyncrasies
> are such that it likes 8-byte alignment.
>
> It's not true that such alignment is always a win.
It looks to me that all you want is to drop the 8-byte alignment on
entities that are smaller than a cacheline. So you should implement that,
rather than dropping the 8-byte alignment on every entity, even those
larger than a cacheline.

In fact, a new '.align-to-8-or-to-make-N-bytes-fit-into-the-current-cacheline'
directive may help here. Of course the compiler needs to compute N, or
compute it via labels.
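For reference, gas's existing `.p2align pow2, fill, max-skip` form already expresses a bounded-padding policy (align to 2^pow2 only when it costs at most max-skip bytes), which is also the N[,M] policy in the patch subject; a rough model of that rule, as an illustration only:

```python
def p2align_padding(offset, pow2, max_skip):
    """Padding bytes a '.p2align pow2,,max_skip' directive inserts at
    the given section offset: pad to the next 2**pow2 boundary, but
    only if that takes at most max_skip bytes."""
    pad = (-offset) % (1 << pow2)
    return pad if pad <= max_skip else 0

# '.p2align 4,,7': align to 16 bytes unless it would waste > 7 bytes.
print(p2align_padding(0x400109, 4, 7))   # 7 -> boundary is reachable
print(p2align_padding(0x400108, 4, 7))   # 0 -> 8 bytes needed, skip it
```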
Richard.