[Bug target/81614] Should -mtune-ctrl=partial_reg_stall be turned by default?

Mon Jul 31 10:16:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
There are two flags (which I believe were introduced by me):
 - partial-reg-stall, which models the behavior of PentiumPro, where partial
   writes to a register were cheap as long as the partially written register
   was never then used in a wider mode, since that use causes the stall.
 - partial-reg-dependency, which models later CPUs where partial writes are
   handled as read-modify-write instructions and are thus slow even if the
   result is used only in the same width as the write.

   This was the design of Athlons.

The first flag avoids random optimizations which replace a full-sized
instruction by a part-sized one (for example, xorl $1, %eax is not changed to
xorb to save size).
Still, we could generate partial register stalls out of combine.

The second tries to make sure we always read the full register (by movzx).
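
To make the two flags concrete, here is a small C sketch of the kind of source
where the choice between full-width and partial-width instructions comes up.
The function names are hypothetical, and the assembly in the comments is only
what a compiler might plausibly pick; actual output depends on the GCC version
and options.

```c
#include <assert.h>
#include <stdint.h>

/* Toggle the low bit of a 32-bit value.  Only the low byte changes, so a
   compiler could encode this either as a full-width  xorl $1, %eax  or as
   the shorter byte-wide  xorb $1, %al.  The byte form is a partial
   register write, which is what partial_reg_stall tuning tries to avoid.  */
uint32_t toggle_low_bit(uint32_t x)
{
    return x ^ 1u;
}

/* Load one byte and widen it to 32 bits.  A zero-extending load such as
   movzbl (%rdi), %eax  defines the whole register, whereas a plain
   movb (%rdi), %al  would touch only the low byte and keep a dependency
   on whatever was in the register before.  */
uint32_t load_byte(const uint8_t *p)
{
    return *p;
}
```

Compiling such a file with -S and different -mtune-ctrl= settings is one way to
see which form is chosen.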

We set those as: 

DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
          m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL
          | m_KNL | m_AMD_MULTIPLE | m_GENERIC)

I would say that it makes no sense to have both X86_TUNE_PARTIAL_REG_STALL and
X86_TUNE_PARTIAL_REG_DEPENDENCY set on one chip.
According to Agner Fog's manual, Core and later chips can indeed rename partial
registers again, so they should be moved to the X86_TUNE_PARTIAL_REG_STALL
category, and we should try to fix possible regressions.
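
The point that no chip should have both flags set can be sketched as a check
over selector masks.  This is a simplified model that assumes each tuning
entry is a bitmask with one bit per processor; the enum, the macro and the
function names below are hypothetical and are not GCC's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical processor indices; the real list in GCC is much longer.  */
enum cpu { CPU_PPRO, CPU_CORE, CPU_ATHLON, CPU_GENERIC, CPU_COUNT };

#define CPU_MASK(c) (UINT32_C(1) << (c))

/* Selector masks loosely mirroring the two DEF_TUNE entries quoted above.  */
static const uint32_t tune_stall = CPU_MASK(CPU_PPRO);
static const uint32_t tune_dependency =
    CPU_MASK(CPU_CORE) | CPU_MASK(CPU_ATHLON) | CPU_MASK(CPU_GENERIC);

/* Return nonzero if the tuning described by SELECTOR applies to processor C.  */
int tune_enabled(uint32_t selector, enum cpu c)
{
    return (selector & CPU_MASK(c)) != 0;
}

/* Count the processors for which both tunings are enabled at once.  */
int count_conflicts(void)
{
    int conflicts = 0;
    for (int c = 0; c < CPU_COUNT; c++)
        if (tune_enabled(tune_stall, (enum cpu) c)
            && tune_enabled(tune_dependency, (enum cpu) c))
            conflicts++;
    return conflicts;
}
```

In this model, moving a processor from one category to the other means
clearing its bit in one mask and setting it in the other, which keeps the
conflict count at zero.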

In the testcase given, for X86_TUNE_PARTIAL_REG_DEPENDENCY we ought to emit a
dependency-breaking instruction to clear the full register before the partial
write when optimizing for speed.
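
As an illustration of the dependency-breaking idea (this is not the testcase
from the bug report, just a hypothetical function that typically leads to a
setcc instruction), consider:

```c
#include <assert.h>

/* Comparison result returned as an int.  On x86-64 this is usually
   implemented with sete, which writes only the low byte of the register.

   One possible sequence, widening after the partial write:
       cmpl    %esi, %edi
       sete    %al
       movzbl  %al, %eax

   A sequence with the dependency broken first, by clearing the full
   register before the partial write:
       xorl    %eax, %eax
       cmpl    %esi, %edi
       sete    %al

   Which one a given GCC emits depends on version, tuning and options.  */
int is_equal(int a, int b)
{
    return a == b;
}
```

The xorl has to come before the cmpl in the second sequence because it
clobbers the flags that sete reads.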

All AMD chips since Athlon are, however, of the X86_TUNE_PARTIAL_REG_DEPENDENCY
design, so for generic we will need to check what the tradeoffs are.  I would
say that X86_TUNE_PARTIAL_REG_DEPENDENCY is in general more conservative and
works well for X86_TUNE_PARTIAL_REG_STALL chips, as the cases where we produce
a partial write (sete) are relatively rare.