This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding


On 08/15/2016 11:45 AM, Richard Biener wrote:
Thus, for this CPU, aligning loops to 8 bytes is wrong: it helps if it
happens to align a loop to 16 bytes, but it may in fact hurt performance
if it happens to align a loop to 16+8 bytes and this pushes the end of the
loop body over the next 16-byte boundary, as happens in the above example.

I suspect something similar was seen some time ago on a different, earlier
CPU, and on _that_ CPU the decoder/loop buffer idiosyncrasies are such that
it likes 8-byte alignment.

It's not true that such alignment is always a win.

It looks to me that all you want is to drop the 8-byte alignment on
entities that are smaller than a cacheline.

I don't think it can be simplified to this.

An example. A loop 122 bytes long fits into either two or three 64-byte cachelines,
depending on where it starts. If it starts in bytes 0..6 of a cacheline, it fits
into two cachelines. If it starts 7 or more bytes into a cacheline, it needs three.
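
To make the dependence on the start offset concrete, here is a minimal
sketch. The helper name lines_spanned is hypothetical (it is not part of
GCC); it simply counts how many 64-byte cachelines a block of code touches
for a given start offset, and prints the count for the first few offsets
so the boundary can be read off directly.

```python
CACHELINE = 64  # bytes, as in the example above


def lines_spanned(start_offset, length, line_size=CACHELINE):
    """Count the cachelines touched by `length` bytes of code that start
    `start_offset` bytes into a cacheline (hypothetical helper)."""
    first = start_offset // line_size
    last = (start_offset + length - 1) // line_size
    return last - first + 1


if __name__ == "__main__":
    # Show how a 122-byte loop behaves for start offsets 0..15.
    for off in range(16):
        print(off, lines_spanned(off, 122))
```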

8-byte alignment is worse for such a loop than not doing it.

It's even worse for the use case which prompted me to create these patches:
-falign-functions. Linux kernel people want to align all functions
to 64 bytes, but only if the necessary padding is, say, 9 bytes or less.
The rationale is that function calls are often "cold", i.e. the function body
is not in L1, and it would be even slower if the first insn(s) required
two L1 loads, not one, to be decoded.

Hence -falign-functions=64,10. This would be a very efficient packing:
only ~15% of all functions would need any padding (the remaining 85%
would start 10 or more bytes before the end of a cacheline and thus need
no padding), and among those 15% the average padding length would be
only 5 bytes. With a very small code size increase, we'd gain a lot
in speed.
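
The percentages above can be checked with a short sketch. It assumes that
function start offsets are uniformly distributed within a cacheline (an
assumption of this sketch, not something measured), and that N,M means
"pad to N bytes only if fewer than M bytes of padding are needed", which
matches the "9 bytes or less" reading for 64,10. The name padding_stats is
hypothetical.

```python
from fractions import Fraction


def padding_stats(align, max_pad_exclusive):
    """Return (fraction of functions padded, average padding among those),
    assuming start offsets are uniform over 0..align-1 (an assumption)."""
    pads = []
    for off in range(align):
        pad = (-off) % align  # bytes needed to reach the next boundary
        if 0 < pad < max_pad_exclusive:
            pads.append(pad)
    fraction_padded = Fraction(len(pads), align)
    avg_pad = Fraction(sum(pads), len(pads)) if pads else Fraction(0)
    return fraction_padded, avg_pad


if __name__ == "__main__":
    frac, avg = padding_stats(64, 10)
    print(float(frac), float(avg))
```

Under these assumptions, 9 of the 64 possible start offsets get padded
(about 14%), with an average of 5 bytes each.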

This nice optimistic picture is currently destroyed by the unnecessary
and not-asked-for "subalignment" to 8 bytes, which now adds 4.5 bytes
of padding on average *to every function* and, as a "bonus", makes
it *less* efficient with respect to instruction fetch, not more efficient!


IOW: I am proposing to remove this code because it seems arbitrary: it helped
on one particular CPU model, and maybe only on some particular benchmarks.
On other CPUs, or in other scenarios, it's harmful.
It should not be done unconditionally for all CPUs and all programs.

If there is value in the ability to do a "subalignment" within a larger alignment,
maybe we can make it a separate option, and let the user specify it if they want?

