This is the mail archive of the
mailing list for the GCC project.
Re: Builtin expansion versus headers optimization: Reductions
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Andi Kleen <andi at firstfloor dot org>
- Cc: gcc at gcc dot gnu dot org, law at redhat dot org, libc-alpha at sourceware dot org
- Date: Fri, 5 Jun 2015 11:02:03 +0200
- Subject: Re: Builtin expansion versus headers optimization: Reductions
- Authentication-results: sourceware.org; auth=none
- References: <20150604105929 dot GA19141 at domone> <87fv67nonj dot fsf at tassilo dot jf dot intel dot com>
On Thu, Jun 04, 2015 at 02:34:40PM -0700, Andi Kleen wrote:
> The compiler has much more information than the headers.
> - It can do alias analysis, so to avoid needing to handle overlap
> and similar.
It could, but it could also export that information, which would benefit
> - It can (sometimes) determine alignment, which is important
> information for tuning.
In the general case yes, but here it's useless. For most functions the
arguments are aligned to 16 bytes in less than 10% of calls, so you
shouldn't add a cold branch to handle aligned data.
Also, as I mentioned in bugs before, gcc currently doesn't handle
alignment well, so it doesn't optimize the following to zero when x is
known to be aligned:
align = ((uintptr_t) x) % 16;
If it did, you wouldn't need to go into gcc at all; a header could just
check alignment with
__builtin_constant_p(((uintptr_t) x) % 16) && ((uintptr_t) x) % 16 == 0
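A minimal sketch of how a header could use that check, assuming GCC or
Clang builtins. The names KNOWN_ALIGNED16 and zero_block are
hypothetical and exist in neither glibc nor gcc. The aligned branch is
only taken when the compiler can prove the alignment at compile time,
so no runtime branch is added for the common unaligned case:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero only when the compiler can prove at compile time that P is
   16-byte aligned.  If it cannot prove it, this folds to 0 and the
   generic path is used.  (Hypothetical helper, not a glibc macro.)  */
#define KNOWN_ALIGNED16(p) \
  (__builtin_constant_p (((uintptr_t) (p)) % 16) \
   && ((uintptr_t) (p)) % 16 == 0)

/* Hypothetical header-level wrapper around memset (p, 0, n).  */
static inline void *
zero_block (void *p, size_t n)
{
  if (KNOWN_ALIGNED16 (p))
    /* Alignment is proven: pass that fact on so aligned stores can be
       used when the call is expanded.  */
    return memset (__builtin_assume_aligned (p, 16), 0, n);

  /* Alignment unknown: plain call, no extra runtime check.  */
  return memset (p, 0, n);
}
```

Both paths give the same result; which one is selected depends on
optimization level and on how well the compiler tracks alignment, which
is exactly the weakness described above.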
> - With profile feedback it can use value histograms to determine the
> best code.
The problem is that histograms are not enough, as I mentioned before.
For profiling you need to measure useful data, which differs per
function, and that should be done in userspace.
For the best code you need to know things like the percentage of cache
lines in L1, L2 and L3 cache to select the correct memset.
On Ivy Bridge I measured that using rep stosq for memset(x,0,4096) is
20% slower than the libcall for L1-cache-resident data, while it is 50%
faster for data outside the cache. How do you teach the compiler that?
Switch to 16 byte blocks here to see graphs.
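To illustrate why a value histogram alone cannot make this choice, here
is a rough sketch of a selection driven by a userspace profile. The
struct, its fields and choose_memset are all hypothetical; the costs
are meant to be measured per call site, and nothing here is an existing
gcc or glibc interface:

```c
/* Hypothetical per-call-site profile collected in userspace.  Costs
   are in whatever unit the profiler measures, e.g. cycles.  */
struct memset_profile
{
  double frac_cached;       /* fraction of calls with cache-resident data */
  double cost_libcall_hit;  /* libcall cost, data in cache */
  double cost_libcall_miss; /* libcall cost, data outside cache */
  double cost_stosq_hit;    /* rep stosq cost, data in cache */
  double cost_stosq_miss;   /* rep stosq cost, data outside cache */
};

enum memset_impl { MEMSET_LIBCALL, MEMSET_REP_STOSQ };

/* Pick the variant with the lower expected cost for this call site.  */
static inline enum memset_impl
choose_memset (const struct memset_profile *p)
{
  double miss = 1.0 - p->frac_cached;
  double libcall = p->frac_cached * p->cost_libcall_hit
                   + miss * p->cost_libcall_miss;
  double stosq = p->frac_cached * p->cost_stosq_hit
                 + miss * p->cost_stosq_miss;
  return stosq < libcall ? MEMSET_REP_STOSQ : MEMSET_LIBCALL;
}
```

The size histogram only says the call is memset(x,0,4096); the deciding
input is frac_cached, which has to be measured for the running program.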
Likewise for memcpy I measured that rte_memcpy is faster on copies of
L1-cache data. That isn't very useful, as you cannot have many 8kB
input and output buffers both in L1 cache. The reason is that it uses a
256-byte loop. That advantage becomes nil for L2 cache and a problem for
L3 cache, where it is slower.
Likewise for strcmp and co. you need to know the probabilities of the
byte position at which the comparison is decided, and depending on that
first add 0-4 bytewise checks, followed by maybe 8-byte checks and a
libcall.
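A sketch of what such a header expansion might look like, with the
number of inline bytewise checks fixed at 4. The name strcmp_prefix4
and the constant are hypothetical; in practice the count would come
from profile data, and the optional 8-byte step is left out here:

```c
#include <string.h>

/* Number of bytes compared inline before falling back to libc.
   Hypothetical tuning knob; 0-4 depending on the profile.  */
#define INLINE_PREFIX 4

/* Compare the first INLINE_PREFIX bytes inline, then call strcmp on
   the rest.  Returns negative, zero or positive like strcmp.  */
static inline int
strcmp_prefix4 (const char *a, const char *b)
{
  for (int i = 0; i < INLINE_PREFIX; i++)
    {
      unsigned char ca = (unsigned char) a[i];
      unsigned char cb = (unsigned char) b[i];
      if (ca != cb)
        return ca - cb;   /* decided within the inline prefix */
      if (ca == '\0')
        return 0;         /* both strings ended, they are equal */
    }
  /* All prefix bytes matched and none was NUL, so both strings have at
     least INLINE_PREFIX bytes and the offset is valid.  */
  return strcmp (a + INLINE_PREFIX, b + INLINE_PREFIX);
}
```

If most comparisons are decided in the first few bytes, the libcall is
rarely reached; if they are not, the inline checks are pure overhead,
which is why the count has to be chosen from measured data.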
> It may not use all of this today, but it could.