[PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding

Mon Aug 15 13:30:00 GMT 2016

On Mon, Aug 15, 2016 at 1:53 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/15/2016 11:45 AM, Richard Biener wrote:
>>>
>>> Thus. For this CPU, alignment of loops to 8 bytes is wrong: it helps if it
>>> happens to align a loop to 16 bytes, but it may in fact hurt performance
>>> if it happens to align a loop to 16+8 bytes and this pushes loop's body end
>>> over the next 16-byte boundary, as it happens in the above example.
>>>
>>> I suspect something similar was seen sometime ago on a different, earlier
>>> CPU, and on _that_ CPU decoder/loop buffer idiosyncrasies are such that it
>>> likes 8 byte alignment.
>>>
>>> It's not true that such alignment is always a win.
>>
>>
>> It looks to me that all you want is to drop the 8-byte alignment on
>> entities that are smaller than a cacheline.
>
>
> I don't think it can be simplified to this.
>
> An example. A loop 122 bytes long fits into either two or three 64-byte
> cachelines, depending on where it starts. If it starts in bytes 0..6 in a
> cacheline, it fits into two cachelines. If it starts at 7 bytes or more into
> cacheline, it doesn't fit.
>
> 8-byte alignment is worse for such a loop than not doing it.
>
> It's even worse for the use case which prompted me to create these patches:
> -falign-functions. Linux kernel people want to align all functions
> to 64 bytes, but only if the necessary padding is, say, 9 bytes or less.
> The rationale is that function calls are often "cold", i.e. function body
> is not in L1, and it would be even slower if first insn(s) would require
> two L1 loads, not one, to be decoded.
>
> Hence -falign-functions=64,10. This would be a very efficient packing:
> only ~15% of all functions would need any padding (the remaining 85%
> would start 10 or more bytes before end of cacheline and thus need
> no padding), and among those 15% the average padding length would be
> only 5 bytes. With very small code size increase, we'd gain a lot
> in speed.
>
> This nice optimistic picture is currently destroyed by unnecessary
> and not-asked-for "subalignment" to 8 bytes, which now adds 4.5 bytes
> of padding on average *to every function*, as a "bonus" making
> it *less* efficient versus instruction fetch, not more efficient!
>
>
> IOW: I am proposing to remove this code because it seems arbitrary: it helped
> on one particular CPU model, and maybe only on some particular benchmarks.
> On other CPUs, or in other scenarios, it's harmful.
> It should not be now done for all CPUs and all programs.
>
> If there is a value in the ability to do a "subalignment" within a larger
> alignment, maybe we can make it a separate option, and let user specify it
> if he wants?

Controlling this separately makes sense IMHO.  Changing the default for
generic tuning has to be backed up with measurements and old CPUs not
benchmarked should retain the old value when tuned for them.

Let me rephrase the desire again.  The desire is to maximize the number
of instructions fetched with the first cacheline for any label that is branched
(forward) to.  A side-effect may be avoiding penalties for CPUs that have
an instruction started at only N-byte aligned space (not sure that exists
for an ISA with 1-byte opcodes).  For labels that are branched backward to
(thus loops) the desire is to minimize the number of cachelines that need
to be fetched to get the whole loop covered - ISTR CPUs have limits on that
number when it comes to handling loops with loop caches.  Branch target
buffers may also not like too many targets per cache-line -- I expect
8 2-byte functions in a cache-line to be very bad here.
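
To make the loop case concrete, here is a rough sketch of the cacheline
count for a loop body, given its start offset and length.  The helper name
lines_spanned and the 64-byte line size are assumptions for illustration,
not anything taken from GCC.

```python
LINE = 64  # assumed cacheline size in bytes


def lines_spanned(start, length, line=LINE):
    """Number of cachelines touched by `length` bytes starting at `start`."""
    first = start // line
    last = (start + length - 1) // line  # line holding the final byte
    return last - first + 1


# Start offsets (within a cacheline) at which the 122-byte loop from the
# quoted example still fits into two cachelines.
two_line_starts = [s for s in range(LINE) if lines_spanned(s, 122) == 2]

# Of those, the starts that an 8-byte alignment can actually produce.
aligned_two_line_starts = [s for s in two_line_starts if s % 8 == 0]
```

Under this count the 122-byte loop occupies two lines only when it starts
within the first few bytes of a line, and offset 0 is the only 8-byte
aligned start that qualifies.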

If the situation cannot be improved on any of the above, any additional
"artificial" alignment only makes things worse (by enlarging code).

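The code-size side can be sketched the same way.  This assumes function
start offsets are uniformly distributed within a 64-byte line, that
-falign-functions=64,10 means "pad to 64 only if at most 9 bytes are
needed" (as in the quoted text), and that the "subalignment" falls back to
8-byte alignment whenever the 64-byte padding is skipped.  All function
names are made up for the illustration.

```python
from fractions import Fraction


def pad_to(offset, align):
    """Bytes of padding needed to bring `offset` up to a multiple of `align`."""
    return -offset % align


def pad_bounded(offset, align=64, max_skip=9):
    """Pad to `align` only when that costs at most `max_skip` bytes."""
    pad = pad_to(offset, align)
    return pad if pad <= max_skip else 0


def pad_with_subalign(offset, align=64, max_skip=9, sub=8):
    """As pad_bounded, but fall back to `sub`-byte alignment (assumed model)."""
    pad = pad_to(offset, align)
    return pad if pad <= max_skip else pad_to(offset, sub)


offsets = range(64)  # assumption: every start offset is equally likely

padded = [pad_bounded(o) for o in offsets if pad_bounded(o) > 0]
share_padded = Fraction(len(padded), 64)               # functions padded at all
mean_when_padded = Fraction(sum(padded), len(padded))  # bytes, padded ones only

mean_bounded = Fraction(sum(pad_bounded(o) for o in offsets), 64)
mean_subaligned = Fraction(sum(pad_with_subalign(o) for o in offsets), 64)
```

With these assumptions share_padded and mean_when_padded reproduce the
"~15%" and "5 bytes" figures from the quoted mail, and mean_subaligned
comes out several times larger than mean_bounded.
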
Richard.