This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Core 2 and Core i7 tuning


Bernd Schmidt <bernds@codesourcery.com> writes:

Hi Bernd,

FWIW I have my own private Core i7 target, but it wasn't as fancy
as yours.

First, I'm surprised that you wrote that the pipeline description
in the optimization manual wasn't good enough. Did you use
section 2.1 in http://www.intel.com/assets/pdf/manual/248966.pdf
as a reference?

Also, I think you forgot to update driver-i386.c.

> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi	(revision 162821)
> +++ doc/invoke.texi	(working copy)
> @@ -11937,6 +11937,9 @@ SSE2 and SSE3 instruction set support.
>  @item core2
>  Intel Core2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
>  instruction set support.
> +@item corei7
> +Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3,
> SSSE3, SSE4.1

As a general comment, Core i7 is not a good name to use here because
it's a marketing name used for different microarchitectures
(which is already the case). I made this mistake in another project
and am still suffering from it :-)

The Intel manual uses "Enhanced Core in 45nm".

Also there are CPUs like the Xeon 5500 or 7500 that use a similar core
but have a different name. I also don't think anyone still cares about
"with 64-bit extensions". In my version I added aliases for xeon5500 etc.
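The alias idea is nothing fancy; a minimal sketch of what I mean (the names and the table here are made up for illustration — in GCC itself this would just be extra rows in processor_alias_table in config/i386/i386.c):

```c
#include <string.h>

/* Hypothetical sketch: map marketing names and Xeon aliases onto one
   canonical microarchitecture, so "corei7", "xeon5500" and "xeon7500"
   all select the same tuning.  "nehalem" as the canonical name is an
   assumption, not what the patch uses.  */
struct cpu_alias { const char *name; const char *canonical; };

static const struct cpu_alias aliases[] = {
  { "core2",    "core2"   },
  { "corei7",   "nehalem" },
  { "xeon5500", "nehalem" },
  { "xeon7500", "nehalem" },
};

static const char *
canonical_arch (const char *name)
{
  size_t i;
  for (i = 0; i < sizeof aliases / sizeof aliases[0]; i++)
    if (strcmp (aliases[i].name, name) == 0)
      return aliases[i].canonical;
  return NULL;  /* unknown CPU name */
}
```

That way the user-visible names can track the product lines without multiplying tuning tables.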

Also there are already 32nm variants, which are mostly the same,
but have different cache sizes and a few extensions.


> Index: config/i386/i386-c.c
> ===================================================================
> --- config/i386/i386-c.c	(revision 162821)
> +++ config/i386/i386-c.c	(working copy)
> @@ -122,6 +122,10 @@ ix86_target_macros_internal (int isa_fla
>        def_or_undef (parse_in, "__core2");
>        def_or_undef (parse_in, "__core2__");
>        break;
> +    case PROCESSOR_COREI7:
> +      def_or_undef (parse_in, "__corei7");
> +      def_or_undef (parse_in, "__corei7__");

Again the name is not good.
>  
>  static const

Comparing costs with my own model: 

> +  0,					/* cost of multiply per each bit set */
> +  {COSTS_N_INSNS (22),			/* cost of a divide/mod for QI */
> +   COSTS_N_INSNS (22),			/* HI */

AFAIK these costs are no longer accurate for the new divider since
Penryn. The cost is variable based on the operand bits, so fully
expressing it would need a few changes in the high-level check.
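To make that concrete: an early-out divider means a cost hook would have to look at the operand instead of returning a constant. A rough, hypothetical sketch (the "base plus cycles per quotient bit" model and its constants are assumptions, not measurements; GCC's real hook would go through rtx_costs):

```c
/* COSTS_N_INSNS as defined in GCC's rtl.h: cost in units of a
   single fast instruction.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Number of significant bits in X.  */
static int
significant_bits (unsigned long long x)
{
  int bits = 0;
  while (x) { bits++; x >>= 1; }
  return bits;
}

/* Hypothetical data-dependent divide cost: an assumed base latency
   plus roughly one cycle per two quotient bits.  Illustrative only --
   a real model would need per-CPU measurements.  */
static int
idiv_cost (unsigned long long dividend)
{
  return COSTS_N_INSNS (10 + significant_bits (dividend) / 2);
}
```

The point is just that small dividends come out much cheaper than the flat 22 above.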

> +					   in SFmode, DFmode and XFmode */
> +  2,					/* cost of moving MMX register */
> +  {6, 6},				/* cost of loading MMX registers
> +					   in SImode and DImode */
> +  {4, 4},				/* cost of storing MMX registers
> +					   in SImode and DImode */
> +  2,					/* cost of moving SSE register */

Too high?

> +  {6, 6, 6},				/* cost of loading SSE registers
> +					   in SImode, DImode and TImode */

And I suspect that's also too high.

> +  {4, 4, 4},				/* cost of storing SSE registers
> +					   in SImode, DImode and TImode */
> +  2,					/* MMX or SSE register to integer */

It's 1 now; inter-unit moves got a lot cheaper.

> +  32,					/* size of l1 cache.  */
> +  256,					/* size of l2 cache.  */

I used the L3 size here. Does that make more sense?

BTW, I was always wondering whether there should be a flag for
multithreading; with two threads sharing the cache, the values should
be halved.
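The flag would just scale the reported size by the number of hardware threads sharing the cache; a trivial sketch (the halving policy is my assumption — real sharing depends on the workload):

```c
/* Hypothetical: effective per-thread cache size in KB when the cache
   is shared between hardware threads (2 with Hyper-Threading).
   Dividing evenly is an assumed policy, not a measured one.  */
static unsigned
effective_cache_kb (unsigned cache_kb, unsigned threads_sharing)
{
  return threads_sharing ? cache_kb / threads_sharing : cache_kb;
}
```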

> +  128,					/* size of prefetch block */

I don't think that's true.

> +  8,					/* number of parallel prefetches */

I believe this number is too low.

> +  3,					/* Branch cost */
> +  COSTS_N_INSNS (3),			/* cost of FADD and FSUB insns.  */
> +  COSTS_N_INSNS (5),			/* cost of FMUL instruction.  */
> +  COSTS_N_INSNS (32),			/* cost of FDIV instruction.  */
> +  COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
> +  COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
> +  COSTS_N_INSNS (58),			/* cost of FSQRT instruction.  */

I suspect some of these costs are also outdated, but that needs measurements.

> +  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
> +	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +  {{libcall, {{8, loop}, {15, unrolled_loop},
> +	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {libcall, {{24, loop}, {32, unrolled_loop},
> +	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},

This is certainly not correct for Nehalem; see section 2.2.6 in the
optimization manual.
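For anyone reading along, the way these tables work: each {max, alg} pair means "use alg for block sizes up to max", with -1 meaning no upper limit. A sketch of the selection logic, loosely modeled on decide_alg in config/i386/i386.c (the enum and struct names here are simplified stand-ins, not the real GCC types):

```c
#include <stddef.h>

/* Simplified stand-ins for GCC's stringop strategy types.  */
enum stringop_alg { alg_loop, alg_unrolled_loop, alg_rep_prefix_4_byte,
                    alg_rep_prefix_8_byte, alg_libcall };

struct stringop_range { long long max; enum stringop_alg alg; };

/* The first row of the patch's memcpy table (32-bit case):
   loop up to 11 bytes, rep movsd beyond that.  */
static const struct stringop_range memcpy_32[] = {
  { 11, alg_loop }, { -1, alg_rep_prefix_4_byte },
};

/* Pick the first range whose limit covers SIZE; -1 is "no limit".  */
static enum stringop_alg
choose_alg (const struct stringop_range *table, size_t n, long long size)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (table[i].max == -1 || size <= table[i].max)
      return table[i].alg;
  return alg_libcall;
}
```

So fixing the Nehalem crossover points is just a matter of changing the thresholds in the rows above.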

> +  1,					/* scalar_stmt_cost.  */
> +  1,					/* scalar load_cost.  */
> +  1,					/* scalar_store_cost.  */
> +  1,					/* vec_stmt_cost.  */
> +  1,					/* vec_to_scalar_cost.  */
> +  1,					/* scalar_to_vec_cost.  */
> +  1,					/* vec_align_load_cost.  */
> +  2,					/* vec_unalign_load_cost.  */

It should actually be the same as aligned. This gives a big improvement
because the vectorizer does not generate all the explicit alignment code.

The only problem I ran into is that it has to be redone for AVX again :/
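Functionally the unaligned load is a drop-in replacement anyway; the change is purely about telling the cost model that on Nehalem movups/movdqu on data that doesn't split a cache line runs as fast as the aligned form (per the optimization manual), so peeling for alignment isn't worth it. A small illustration with SSE2 intrinsics (correctness only, not a benchmark):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum four ints starting at P, which may be misaligned.  With
   vec_unalign_load_cost == vec_align_load_cost the vectorizer can
   emit this kind of unaligned load directly instead of generating
   peeling/versioning code to reach an aligned pointer.  */
static int
sum4_unaligned (const int *p)
{
  __m128i v = _mm_loadu_si128 ((const __m128i *) p);
  int out[4];
  _mm_storeu_si128 ((__m128i *) out, v);
  return out[0] + out[1] + out[2] + out[3];
}
```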

>    /* X86_TUNE_PAD_RETURNS */
> -  m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
> +  m_AMD_MULTIPLE | m_GENERIC,

I'm not sure why?

The return padding can still help to avoid exceeding the maximum
density the branch predictor supports. However, it would probably be
better to have a separate pass for that.
  

-andi
-- 
ak@linux.intel.com -- Speaking for myself only.

