[PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Jan Hubicka hubicka@ucw.cz
Wed Dec 12 18:30:00 GMT 2012


Concerning one push per cycle: I think it is the same as what the K7 hardware did, so a
move prologue should be a win.
> > Index: config/i386/i386.c
> > ===================================================================
> > --- config/i386/i386.c  (revision 194452)
> > +++ config/i386/i386.c  (working copy)
> > @@ -1620,14 +1620,14 @@ struct processor_costs core_cost = {
> >    COSTS_N_INSNS (8),                   /* cost of FABS instruction.  */
> >    COSTS_N_INSNS (8),                   /* cost of FCHS instruction.  */
> >    COSTS_N_INSNS (40),                  /* cost of FSQRT instruction.  */
> > -  {{libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {128, rep_prefix_8_byte, true},
> > +  {{libcall, {{8192, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},
> >                {-1, libcall, false}}}},
> >    {{libcall, {{6, loop_1_byte, true},
> >                {24, loop, true},
> >                {8192, rep_prefix_4_byte, true},
> >                {-1, libcall, false}}},
> > -   {libcall, {{24, loop, true}, {512, rep_prefix_8_byte, true},
> > +   {libcall, {{24, loop, true}, {8192, rep_prefix_8_byte, true},

A libcall is not faster than the rep sequence up to 8KB, and the rep sequence is better
for the register allocator and code cache than a full-blown function call.
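
To make the numbers concrete, a sketch of the kind of call this affects
(the function name and exact expansion are illustrative):

  #include <string.h>

  /* With a known size below the 8192 byte threshold the copy is expanded
     inline as a rep movsd/movsq sequence instead of a call to memcpy,
     avoiding the call overhead and the register clobbers of the libcall.  */
  void
  copy_page (char *dst, const char *src)
  {
    memcpy (dst, src, 4096);
  }
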
> > @@ -1806,7 +1806,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_PPRO,
> >
> >    /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
> > -  m_CORE2I7 | m_GENERIC,
> > +  m_GENERIC | m_CORE2,

This disables shifts that set just some of the flags. According to Agner's manual, the i7
handles this well.

Partial flags stall
The Sandy Bridge uses the method of an extra µop to join partial registers not only for
general purpose registers but also for the flags register, unlike previous processors which
used this method only for general purpose registers. This occurs when a write to a part of
the flags register is followed by a read from a larger part of the flags register. The partial
flags stall of previous processors (See page 75) is therefore replaced by an extra µop. The
Sandy Bridge also generates an extra µop when reading the flags after a rotate instruction.

This is cheaper than the 7-cycle delay on Core that this flag is trying to avoid.
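
As a sketch of what keeping the flag set costs (instruction sequences are
approximate):

  /* With the flag set:            with it cleared:
         sall  %cl, %eax               sall  %cl, %eax
         testl %eax, %eax              jne   .L1
         jne   .L1
     On Sandy Bridge the shorter form costs at most an extra merge uop,
     not a multi-cycle stall.  */
  int
  shifted_nonzero (int x, int c)
  {
    return (x << c) != 0;
  }
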
> >
> >    /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
> >     * on 16-bit immediate moves into memory on Core2 and Corei7.  */
> > @@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_K6,
> >
> >    /* X86_TUNE_USE_CLTD */
> > -  ~(m_PENT | m_ATOM | m_K6),
> > +  ~(m_PENT | m_ATOM | m_K6 | m_GENERIC),

None of the CPUs that generic cares about are !USE_CLTD now after your change.
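
For reference, the two sequences being chosen between, roughly:

  /* USE_CLTD:                   !USE_CLTD:
         movl  %edi, %eax            movl  %edi, %eax
         cltd                        movl  %edi, %edx
         idivl %esi                  sarl  $31, %edx
                                     idivl %esi
     cltd is one byte and sign-extends %eax into %edx before the divide.  */
  int
  quot (int a, int b)
  {
    return a / b;
  }
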
> > @@ -1910,10 +1910,10 @@ static unsigned int initial_ix86_tune_fe
> >    m_ATHLON_K8,
> >
> >    /* X86_TUNE_SSE_TYPELESS_STORES */
> > -  m_AMD_MULTIPLE,
> > +  m_AMD_MULTIPLE | m_CORE2I7, /*????*/

Hmm, I cannot seem to find this in the manual now, but I believe that stores are also
typeless, so a movaps store is preferred over a movapd store because it is shorter.  If not,
this change should produce a lot of slowdowns.
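
If the assumption holds, the win is only instruction size; a sketch:

  #include <emmintrin.h>

  /* With SSE_TYPELESS_STORES the store below may be emitted as movaps
     instead of movapd: one byte shorter (no 66 prefix), same effect if
     stores really are typeless on these chips.  */
  void
  store2 (double *p, __m128d v)
  {
    _mm_store_pd (p, v);
  }
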
> >
> >    /* X86_TUNE_SSE_LOAD0_BY_PXOR */
> > -  m_PPRO | m_P4_NOCONA,
> > +  m_PPRO | m_P4_NOCONA | m_CORE2I7, /*????*/

Agner:
A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The
Core2 and Nehalem processors recognize that certain instructions are independent of the
prior value of the register if the source and destination registers are the same.

This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, and all
variants of PSUBxxx and PCMPxxx except PCMPEQQ.
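
So the usual zeroing idiom is dependency-breaking on Core2/Nehalem as well,
e.g.:

  #include <xmmintrin.h>

  /* xorps %xmm0, %xmm0 -- recognized as independent of the previous
     value of the register, so it is cheaper than loading 0.0 from
     memory and creates no false dependency.  */
  __m128
  load_zero (void)
  {
    return _mm_setzero_ps ();
  }
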
> >
> >    /* X86_TUNE_MEMORY_MISMATCH_STALL */
> >    m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1938,7 +1938,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
> >       than 4 branch instructions in the 16 byte window.  */
> > -  m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,

This is a special pass to handle limitations of AMD's K7/K8/K10 branch prediction.
Intel never had a similar design, so this flag is pointless there.

We apparently ought to disable it for K10, at least per Agner's manual.
> >
> >    /* X86_TUNE_SCHEDULE */
> >    m_PENT | m_PPRO | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC,
> > @@ -1947,10 +1947,10 @@ static unsigned int initial_ix86_tune_fe
> >    m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC,
> >
> >    /* X86_TUNE_USE_INCDEC */
> > -  ~(m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_GENERIC),
> > +  ~(m_P4_NOCONA | m_ATOM | m_GENERIC),

Skipping inc/dec avoids a partial flag stall, which happens on P4 only.
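
The difference, for the record:

  /* With USE_INCDEC:    incl %eax       (shorter encoding, but leaves CF
                                          untouched -- the partial flag
                                          write P4 stalled on)
     without:            addl $1, %eax   (writes all flags)  */
  int
  bump (int x)
  {
    return x + 1;
  }
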
> >
> >    /* X86_TUNE_PAD_RETURNS */
> > -  m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC,
> > +  m_AMD_MULTIPLE | m_GENERIC,

Again this deals specifically with AMD K7/K8/K10 branch prediction.  I am not even
sure this should be enabled for K10.
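
As a reminder of what the pass does: it pads returns the K7/K8 predictor
mishandles, typically by emitting the two-byte rep ret (sketch, details
from memory):

  /* The early return below lands right after a conditional jump; with
     PAD_RETURNS it is emitted as "rep ret" for the AMD predictor's sake.  */
  int
  clamp_positive (int x)
  {
    if (x < 0)
      return 0;
    return x;
  }
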
> >
> >    /* X86_TUNE_PAD_SHORT_FUNCTION: Pad short funtion.  */
> >    m_ATOM,
> > @@ -1959,7 +1959,7 @@ static unsigned int initial_ix86_tune_fe
> >    m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC,
> >
> >    /* X86_TUNE_AVOID_VECTOR_DECODE */
> > -  m_CORE2I7 | m_K8 | m_GENERIC64,
> > +  m_K8 | m_GENERIC64,

This avoids AMD vector-decoded instructions; again, if it helped, it did so by accident.
> >
> >    /* X86_TUNE_PROMOTE_HIMODE_IMUL: Modern CPUs have same latency for HImode
> >       and SImode multiply, but 386 and 486 do HImode multiply faster.  */
> > @@ -1967,11 +1967,11 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
> >       vector path on AMD machines.  */
> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,
> >
> >    /* X86_TUNE_SLOW_IMUL_IMM8: Imul of 8-bit constant is vector path on AMD
> >       machines.  */
> > -  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC64,
> > +  m_CORE2I7 | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER,

This is similarly targeted at AMD hardware only. I did not find a similar limitation
in the optimization manual.
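
For completeness, the shape the flag splits, sketched:

  /* With the flag set:                without it:
         movl  (%rdi), %eax                imull $100000, (%rdi), %eax
         imull $100000, %eax, %eax
     i.e. the constant-times-memory multiply is broken up so it does not
     go down the AMD vector path.  */
  int
  scale (const int *p)
  {
    return *p * 100000;
  }
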
> >
> >    /* X86_TUNE_MOVE_M1_VIA_OR: On pentiums, it is faster to load -1 via OR
> >       than a MOV.  */
> > @@ -1988,7 +1988,7 @@ static unsigned int initial_ix86_tune_fe
> >
> >    /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
> >       from FP to FP. */
> > -  m_CORE2I7 | m_AMDFAM10 | m_GENERIC,
> > +  m_AMDFAM10 | m_GENERIC,

This is a quite specific feature of AMD chips, which prefer packed converts over
scalar ones. Nothing like this is documented for the Cores.
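
The choice in question, roughly:

  /* With USE_VECTOR_FP_CONVERTS:     without it:
         cvtps2pd %xmm0, %xmm0            cvtss2sd %xmm0, %xmm0
     The packed form converts the two low elements; the scalar form
     merges into its destination and therefore depends on the old value
     of the register.  */
  double
  widen (float f)
  {
    return f;
  }
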
> >
> >    /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
> >       from integer to FP. */
> > @@ -1997,7 +1997,7 @@ static unsigned int initial_ix86_tune_fe
> >    /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
> >       with a subsequent conditional jump instruction into a single
> >       compare-and-branch uop.  */
> > -  m_BDVER,
> > +  m_BDVER | m_CORE2I7,

Core implements fusion similar to what AMD does, so I think this just applies here.
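
An illustration of the pairs this keeps adjacent (the exact fusion rules
have operand restrictions I am glossing over):

  void g (void);

  /* cmpl $10, %edi ; jge .L1 -- the compare and the conditional jump
     decode as a single macro-fused uop on these cores, so keeping them
     together is free.  */
  void
  maybe_call (int x)
  {
    if (x < 10)
      g ();
  }
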
> >
> >    /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
> >       will impact LEA instruction selection. */
> > @@ -2052,7 +2052,7 @@ static unsigned int initial_ix86_arch_fe
> >  };
> >
> >  static const unsigned int x86_accumulate_outgoing_args
> > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_CORE2I7 | m_AMD_MULTIPLE | m_GENERIC;
> > +  = m_PPRO | m_P4_NOCONA | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;

The stack engine should make this cheap, just like the moves in prologues.
This definitely needs some validation; the accumulate-outgoing-args
codegen differs quite a lot. It also leads to unwind table bloat.
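
The codegen difference to validate, sketched for a 32-bit call:

  /* accumulated (area allocated       push based:
     once in the prologue):
         movl $2, 4(%esp)                  pushl $2
         movl $1, (%esp)                   pushl $1
         call f                            call  f
     The push form is shorter and the stack engine hides the %esp
     updates, but every stack adjustment needs unwind info -- hence the
     table bloat.  */
  extern void f (int, int);

  void
  caller (void)
  {
    f (1, 2);
  }
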
> >
> >  static const unsigned int x86_arch_always_fancy_math_387
> >    = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE2I7 | m_ATOM | m_AMD_MULTIPLE | m_GENERIC;

Honza


