Re: Core 2 and Core i7 tuning


On Mon, Aug 23, 2010 at 03:33:27PM +0200, Bernd Schmidt wrote:
> Not sure it's the same one, but I have an Intel optimization manual
> which only seems to have general information about which instructions go
> to which ports; the Agner Fog document has tables which at least try to
> provide full information.  In the end, it may not be relevant since I
> doubt there's much to be gained from trying to get this 100% accurate.

Maybe.

> 
> > As a general comment Core i7 is not a good name to use here because
> > it's a marketing name used for different micro architectures
> > (already the case). I made this mistake in another project
> > and still suffering from it :-)
> 
> Most of these points also apply to Core 2, which has two different
> variants and a couple of Xeons with the same basic core.

Yes, but that doesn't mean that the mistake has to be repeated.


> 
> > Comparing costs with my own model: 
> 
> The i7 table is just copied from the Core 2 table for the moment.  I've
> only adjusted the L2 cache size.

Well, as a minimum change you should at least fix the vector alignment;
that's a big win (just need to make sure AVX still uses it).

But some of the other parameters can also be tweaked.
I believe the string op tuning in particular helps quite a lot; a rough
sketch of what such an entry looks like is below.
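
Roughly what a string op entry looks like (the layout follows struct
stringop_algs, but the thresholds and algorithm picks here are only
placeholders, not tuned Nehalem values):

  /* Placeholder entry, for illustration only: inline loop for small
     blocks, rep movsd for medium ones, library call for everything
     larger.  The size thresholds are made up.  */
  {libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},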

> > 1 now. Inter unit moves got a lot cheaper.
> 
> As far as I know there are still stalls?

I thought it was pretty cheap. The manual even recommends doing
XMM spilling, because it's far faster than spilling through L1.
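
Roughly what I mean (a minimal sketch using intrinsics, not what the
register allocator actually emits):

#include <emmintrin.h>

/* Sketch only: keep a 64-bit value in an XMM register across a region
   where the integer registers are needed, instead of spilling it to
   the stack (which would go through L1).  */
static inline unsigned long long
stash_in_xmm (unsigned long long x)
{
  __m128i scratch = _mm_cvtsi64_si128 ((long long) x); /* movq GPR -> XMM */
  /* ... the integer register holding x can be reused here ... */
  return (unsigned long long) _mm_cvtsi128_si64 (scratch); /* movq XMM -> GPR */
}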

> 
> >> +  32,					/* size of l1 cache.  */
> >> +  256,					/* size of l2 cache.  */
> > 
> > I used the L3 here. Makes more sense?
> 
> No idea.

I think it does; ignoring the L3 completely for cache blocking
of loops would be a poor decision.

That said, there is still the problem of resource sharing with
multi-threading, but afaik that's ignored everywhere in gcc currently.
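
For illustration, this is the kind of decision the cache size feeds:
blocking a loop nest so each tile stays resident in the targeted cache
level (the block and cache sizes below are made up, not what gcc
computes):

/* Rough sketch of cache blocking for a transpose.  A 128 x 128 tile
   of doubles is 128 KB, i.e. sized for a 256 KB L2 rather than the
   shared L3 -- the numbers are illustrative only.  */
#define BLOCK 128

void
transpose_blocked (double *dst, const double *src, int n)
{
  for (int ii = 0; ii < n; ii += BLOCK)
    for (int jj = 0; jj < n; jj += BLOCK)
      for (int i = ii; i < n && i < ii + BLOCK; i++)
        for (int j = jj; j < n && j < jj + BLOCK; j++)
          dst[j * n + i] = src[i * n + j];
}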

> >> +  COSTS_N_INSNS (58),			/* cost of FSQRT
> >> instruction.  */
> > 
> > I suspect some of these costs are also outdated, but needs measurements.
> 
> FADD and FMUL are correct, I think, but Maxim pointed me at an earlier
> patch from Vlad which got better results by changing them.
> 
> >>    /* X86_TUNE_PAD_RETURNS */
> >> -  m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
> >> +  m_AMD_MULTIPLE | m_GENERIC,
> > 
> > Not sure why?
> 
> Everything I looked at seemed to say this is an AMD-only thing.

The jump-to-ret padding is AMD-only, but it can still indirectly help
the Intel branch predictor by avoiding exceeding the maximum number of
branches per 16-byte window.

I thought that was why it was originally added for Core 2 too.

It would probably be better to use a special pass for this. IIRC
there's already some code for it, but it's likely not fully correct.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

