Core 2 and Core i7 tuning

Sat Aug 21 10:28:00 GMT 2010

> 	* doc/invoke.texi (i386 and x86-64 Options): Document corei7 cpu type.
> 	* config/i386/i386.h (TARGET_COREI7): New macro.
> 	(enum ix86_tune_indices): Add X86_TUNE_PROMOTE_HI_CONSTANTS.
> 	(enum target_cpu_default): Add TARGET_CPU_DEFAULT_corei7.
> 	(enum processor_type): Add PROCESSOR_COREI7.
> 	* config/i386/i386.md: Include "core2.md".
> 	(attr "cpu"): Add "corei7".
> 	(mul_operands): New attribute.
> 	(mul<mode>3_1, mulsi3_1_zext, mulhi3_1, mulqi3_1, <u>mul<mode><dwi>3_1,
> 	<u>mulqihi3_1, <s>muldi3_highpart_1, <s>mulsi3_highpart_1,
> 	<s>mulsi3_highpart_zext): Set it.
> 	* config/i386/core2.md: New file.
> 	* config/i386/i386-c.c (ix86_target-macros_internal): Handle
> 	PROCESSOR_COREI7.
> 	* config/i386/i386.c (corei7_cost): New static variable.
> 	(m_COREI7, m_CORE2I7): New macros.
> 	(initial_ix86_tune_features): Use them.  Disable X86_TUNE_USE_LEAVE,
> 	X86_TUNE_PAD_RETURNS and X86_TUNE_USE_INCDEC, and enable
> 	X86_TUNE_PROMOTE_HI_REGS and X86_TUNE_PROMOTE_HI_CONSTANTS for Core 2
> 	and Core i7.
> 	(x86_accumulate_outgoing_args, x86_arch_always_fancy_math_387): Use
> 	m_CORE2I7 instead of m_CORE2.
> 	(processor_target_table): Add entry for corei7_cost.
> 	(cpu_names): Add "corei7" entry.
> 	(override_options): Add entry for Core i7.
> 	(ix86_fixup_binary_operands, ix86_binary_operator_ok): Handle
> 	TARGET_PROMOTE_HI_CONSTANTS.
> 	(ix86_issue_rate): 4 for Core i7.
> 	(ix86_adjust_cost): Try to do something sensible about domains for
> 	PROCESSOR_COREI7.
> 
> Index: config/i386/core2.md
> ===================================================================
> --- config/i386/core2.md	(revision 0)
> +++ config/i386/core2.md	(revision 0)

What is the effect of your pipeline model on the cc1 binary size?
I am asking because Core has a lot of parallelism, which tends to blow up the
automaton size a lot.
> @@ -2173,6 +2251,7 @@ static const struct ptt processor_target
>    {&k8_cost, 16, 7, 16, 7, 16},
>    {&nocona_cost, 0, 0, 0, 0, 0},
>    {&core2_cost, 16, 10, 16, 10, 16},
> +  {&corei7_cost, 16, 10, 16, 10, 16},

You mentioned reducing the alignments, but they seem to be the same as for core2 in the patch?
> @@ -14291,6 +14374,12 @@ ix86_fixup_binary_operands (enum rtx_cod
>    if (MEM_P (src1) && !rtx_equal_p (dst, src1))
>      src1 = force_reg (mode, src1);
>  
> +  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONSTANT_P (src2)
> +      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
> +      && (code != AND
> +	  || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
> +    src2 = gen_lowpart (HImode, force_reg (SImode, src2));
> +

I am concerned about this, especially on 32-bit, since we force another register
to hold the constant.
An option would be a post-reload peep2 that offloads constants to registers, but then
we would miss PRE on those.  Perhaps we can break up the patch so we have
a chance to see how it works.

The pipeline model seems reasonable, as do the tuning flag changes, so perhaps they
should go in first.
>    operands[1] = src1;
>    operands[2] = src2;
>    return dst;
> @@ -14377,6 +14466,12 @@ ix86_binary_operator_ok (enum rtx_code c
>    if (MEM_P (src1) && !rtx_equal_p (dst, src1))
>      return 0;
>  
> +  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONSTANT_P (src2)
> +      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
> +      && (code != AND
> +	  || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
> +    return 0;
> +
>    return 1;
>  }
>  
> @@ -20569,6 +20665,7 @@ ix86_adjust_cost (rtx insn, rtx link, rt
>  {
>    enum attr_type insn_type, dep_insn_type;
>    enum attr_memory memory;
> +  enum attr_i7_domain domain1, domain2;
>    rtx set, set2;
>    int dep_insn_code_number;
>  
> @@ -20711,6 +20808,19 @@ ix86_adjust_cost (rtx insn, rtx link, rt
>  	  else
>  	    cost = 0;
>  	}
> +      break;
> +
> +    case PROCESSOR_COREI7:
> +      memory = get_attr_memory (insn);
> +
> +      domain1 = get_attr_i7_domain (insn);
> +      domain2 = get_attr_i7_domain (dep_insn);
> +      if (domain1 != domain2
> +	  && !ix86_agi_dependent (dep_insn, insn))
> +	cost += ((domain1 == I7_DOMAIN_SIMD && domain2 == I7_DOMAIN_INT)
> +		 || (domain1 == I7_DOMAIN_INT && domain2 == I7_DOMAIN_SIMD)
> +		 ? 1 : 2);

This number is supposed to be the load latency; is it still 1/2 on Core when reading from cache?
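
For reference, this is the adjustment as I understand it from the hunk, written
as a standalone sketch.  `I7_DOMAIN_OTHER` is a placeholder I made up for
whatever domains exist besides INT and SIMD (the attribute definition is not
quoted here), and the `ix86_agi_dependent` check is reduced to a flag:

```c
/* Hypothetical copy of the domain attribute; OTHER is a placeholder.  */
enum i7_domain { I7_DOMAIN_INT, I7_DOMAIN_SIMD, I7_DOMAIN_OTHER };

/* Extra cost added when INSN and DEP_INSN are in different domains.
   AGI_DEPENDENT is nonzero when the dependency is an address
   generation one, in which case no extra cost is added.  */
static int
domain_crossing_penalty (enum i7_domain domain1, enum i7_domain domain2,
			 int agi_dependent)
{
  if (domain1 == domain2 || agi_dependent)
    return 0;

  /* INT <-> SIMD crossings cost 1, any other crossing costs 2.  */
  if ((domain1 == I7_DOMAIN_SIMD && domain2 == I7_DOMAIN_INT)
      || (domain1 == I7_DOMAIN_INT && domain2 == I7_DOMAIN_SIMD))
    return 1;

  return 2;
}
```

If that reading is right, the 1 and 2 here are added to the existing cost
and are not a load latency on their own.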

Honza
> +      break;
>  
>      default:
>        break;