This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: Core 2 and Core i7 tuning
- From: Andi Kleen <andi at firstfloor dot org>
- To: Bernd Schmidt <bernds at codesourcery dot com>
- Cc: Andi Kleen <andi at firstfloor dot org>, GCC Patches <gcc-patches at gcc dot gnu dot org>, "H.J. Lu" <hjl dot tools at gmail dot com>, Maxim Kuvyrkov <maxim at codesourcery dot com>, Paul Brook <paul at codesourcery dot com>
- Date: Mon, 23 Aug 2010 15:55:29 +0200
- Subject: Re: Core 2 and Core i7 tuning
- References: <4C6EE072.4070802@codesourcery.com> <87eidpwjrh.fsf@basil.nowhere.org> <4C7278A7.8080407@codesourcery.com>
On Mon, Aug 23, 2010 at 03:33:27PM +0200, Bernd Schmidt wrote:
> Not sure it's the same one, but I have an Intel optimization manual
> which only seems to have general information about which instructions go
> to which ports; the Agner Fog document has tables which at least try to
> provide full information. In the end, it may not be relevant since I
> doubt there's much to be gained from trying to get this 100% accurate.
Maybe.
>
> > As a general comment Core i7 is not a good name to use here because
> > it's a marketing name used for different micro architectures
> > (already the case). I made this mistake in another project
> > and still suffering from it :-)
>
> Most of these points also apply to Core 2, which has two different
> variants and a couple of Xeons with the same basic core.
Yes, but that doesn't mean that the mistake has to be repeated.
>
> > Comparing costs with my own model:
>
> The i7 table is just copied from the Core 2 table for the moment. I've
> only adjusted the L2 cache size.
Well, as a minimal change you should at least fix the vector alignment;
that's a big win (just need to make sure AVX still uses it).
Some of the other parameters can also be tweaked.
I believe the string operation tuning in particular helps quite a lot.
> > 1 now. Inter unit moves got a lot cheaper.
>
> As far as I know there are still stalls?
I thought it was pretty cheap now. The manual even recommends spilling
to XMM registers, because that's far faster than going through the L1 cache.
>
> >> + 32, /* size of l1 cache. */
> >> + 256, /* size of l2 cache. */
> >
> > I used the L3 here. Makes more sense?
>
> No idea.
I think it does; ignoring the L3 completely for cache blocking
of loops would be a poor decision.
That said, there is still the problem of resource sharing with
multithreading, but AFAIK that's currently ignored everywhere in gcc.
> >> + COSTS_N_INSNS (58), /* cost of FSQRT
> >> instruction. */
> >
> > I suspect some of these costs are also outdated, but needs measurements.
>
> FADD and FMUL are correct, I think, but Maxim pointed me at an earlier
> patch from Vlad which got better results by changing them.
>
> >> /* X86_TUNE_PAD_RETURNS */
> >> - m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
> >> + m_AMD_MULTIPLE | m_GENERIC,
> >
> > Not sure why?
>
> Everything I looked at seemed to say this is an AMD-only thing.
The jump-to-ret padding is AMD-only, but it can still help the Intel
branch predictor indirectly by keeping the number of branches per
16-byte window below its limit.
I thought that is why it was originally added for Core 2 too.
A special pass for this would probably be better. IIRC
there's already some code for it, but it's likely not fully correct.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.