[RFC] Cse reducing performance of register allocation with -O2

Dominik Vogt vogt@linux.vnet.ibm.com
Mon Nov 23 09:39:00 GMT 2015


On Tue, Oct 13, 2015 at 11:06:48AM -0600, Jeff Law wrote:
> On 10/13/2015 07:12 AM, Dominik Vogt wrote:
> >In some cases, the work of the cse1 pass is counterproductive, as
> >we noticed on s390x.  The effect described below is present since
> >at least 4.8.0.  Note that this may not manifest as a
> >performance problem on all platforms.  Also note that -O1
> >does not show this behaviour because the responsible code is only
> >executed with -O2 or higher.
> >
> >The core of the problem is the way cse1 sometimes handles function
> >parameters.  Roughly, the observed situation is
> >
> >Before cse1
> >
> >   start of function
> >   set pseudoreg Rp to the first argument from hardreg R2
> >   (some code that uses Rp)
> >   set R2 to Rp
> >
> >After cse1:
> >
> >   start of function
> >   set pseudoreg Rp to the first argument from hardreg R2
> >   (some code that uses Rp)  <--- The use of Rp is still present
> >   set R2 to R2              <--- cse1 has replaced Rp with R2
> >
> >After that, the set pattern is removed completely, and now we have
> >both Rp and R2 live in the code snippet above.  Because R2 is
> >still supposed to be live later on, the ira pass chooses a
> >different hard register (R1) for Rp, and code to copy R1 back to
> >R2 is added later.  (See further down for Rtl and assembly code.)
...
> >So, I've made an experimental hack (see attachment) and tried
> >that.  In a larger test suite, register copies could be saved in
> >quite a few places (including the test program below), but in other
> >places new register copies were introduced, resulting in about
> >twice as many "issues" as without the patch.
> >
> >Maybe the patch is just too coarse.  In general I'd assume that
> >the register allocator does a better job of assigning hard
> >registers to pseudo registers.  Is it possible to better describe
> >when cse1 should keep its hands off pseudo registers?
> We don't really have a way to describe this.
> 
> I know Vlad looked at problems in this space -- essentially knowing
> when two registers had the same value in the allocators/reload and
> exploiting that information.
> 
> My recollection was it didn't help in any measurable way -- I think
> he discussed it during one of the old GCC summit conferences.  That
> was also in the reload era.
> 
> Ultimately this feels like all the issues around coalescing and
> copy-propagation. With that in mind, if we had lifetime & conflict
> information, then we'd be able to query that and perhaps be able to
> make different choices.
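For reference, the kind of source that triggers the pattern quoted above looks roughly like this (a hypothetical, reduced example; on s390x the first integer argument and the return value both use hard register r2):

```c
#include <assert.h>

/* Some use of the argument that keeps its pseudo register live.  */
static long
use (long y)
{
  return y * 2;
}

/* Hypothetical reduced example: x arrives in hard register r2 and is
   copied into a pseudo Rp; the call keeps Rp live; "return x" emits
   the final "set r2 := Rp" that cse1 may rewrite into "set r2 := r2".  */
long
f (long x)
{
  long tmp = use (x + 1);
  (void) tmp;
  return x;
}
```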

I've spent some more time trying out the naive approach of
detecting this situation in cse_insn().

1. In cse_insn()

  IF current "set" is "set Hardreg H := Pseudoreg P"
  AND  P is generated as a copy of C further up in the extended BB
  AND  P and H still contain the same value
  AND  cse considers replacing the set with "set H := H"
  AND  P is still live at the end of the EBB
       (In the test program this prevents all uses of P from
       being replaced by H.)
  THEN do not replace
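
The condition list above can be modelled as a predicate (a
self-contained toy model; the struct and all names are invented for
illustration, not GCC's actual interfaces):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed check in cse_insn().  The fields mirror
   the conditions from the list above; in a real patch each one would
   be computed from the RTL and the liveness information.  */
struct set_info
{
  bool dest_is_hardreg;        /* the set is "H := ...", H a hard reg  */
  bool src_is_pseudo;          /* ... and the source is a pseudo P     */
  bool pseudo_born_in_ebb;     /* P was generated as a copy in the EBB */
  bool same_value;             /* P and H still hold the same value    */
  bool pseudo_live_after_ebb;  /* P is live at the end of the EBB      */
};

/* Return true if cse should leave "set H := P" alone instead of
   rewriting it into the removable no-op "set H := H".  */
static bool
keep_copy_p (const struct set_info *s)
{
  return s->dest_is_hardreg
         && s->src_is_pseudo
         && s->pseudo_born_in_ebb
         && s->same_value
         && s->pseudo_live_after_ebb;
}
```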

  => Testing this with the Spec 2006 suite on S390 results in a
  small gain in some cases, a small loss in lots of cases, a
  substantial win in two cases and a substantial loss in one.  On
  average there is a small win.  I've not tested that on x86, but
  assuming that x86 does not suffer from the original problem I
  expect to see mostly losses.

  This patch requires that a per-register bitmap is created for
  each EBB to record which pseudo registers have been generated
  inside the EBB.
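
  One way to keep such a per-EBB record is a plain bit array indexed
  by pseudo register number (a minimal sketch; GCC itself would use
  its bitmap/sbitmap machinery, which this does not reproduce):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Minimal fixed-size bitmap sketch for "which pseudos were generated
   inside this EBB".  All limits and names are illustrative.  */
#define MAX_PSEUDOS 1024
#define WORD_BITS (sizeof (unsigned long) * CHAR_BIT)

static unsigned long ebb_pseudos[MAX_PSEUDOS / WORD_BITS + 1];

/* Clear the record at the start of each extended basic block.  */
static void
ebb_start (void)
{
  memset (ebb_pseudos, 0, sizeof ebb_pseudos);
}

/* Record that pseudo REGNO was generated inside the current EBB.  */
static void
record_pseudo (unsigned regno)
{
  ebb_pseudos[regno / WORD_BITS] |= 1UL << (regno % WORD_BITS);
}

/* Was pseudo REGNO generated inside the current EBB?  */
static bool
pseudo_born_in_ebb_p (unsigned regno)
{
  return (ebb_pseudos[regno / WORD_BITS] >> (regno % WORD_BITS)) & 1;
}
```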

2. In cse_insn()

  IF current "set" is "set Hardreg H := Pseudoreg P"
  AND  P is generated as a copy of C further up in the extended BB
  AND  P and H still contain the same value
  AND  cse considers replacing the set with "set H := H"
  AND  P is still live at the end of the EBB
  AND  P is used between generation and the current instruction.
  THEN do not replace

  => Has fewer win situations and fewer loss situations, and is
     only slightly better on average than (1).  No real improvement.

  This patch requires scanning every insn in cse_insn() for all
  uses of all pseudo registers.  At the moment there is no
  function in rtlanal.c to do this in one call, so I've just
  scanned for each one individually, causing a dramatic increase
  in compile time (a factor of two or more).
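
  The single-pass scan that rtlanal.c is said to lack could follow
  the callback style of the existing walkers (toy sketch; the "insn"
  here is just a list of operand register numbers, and all names are
  invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Visit every register use in an "insn" once, invoking a callback,
   so callers can fill a use-set in a single pass instead of
   re-scanning the insn once per pseudo register.  Real RTL walking
   is considerably more involved than this flat operand list.  */
typedef void (*use_callback) (unsigned regno, void *data);

static void
for_each_reg_use (const unsigned *operands, size_t n,
                  use_callback fn, void *data)
{
  for (size_t i = 0; i < n; i++)  /* one pass over the insn */
    fn (operands[i], data);
}

/* Example callback: mark used registers in a flag array.  */
static void
mark_use (unsigned regno, void *data)
{
  ((bool *) data)[regno] = true;
}
```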

So, my conclusion is that the attempt to fix this by patching
cse_insn() is more or less futile.  Replacing the pseudo register
with the hard register early is actually often a *good* thing,
and to determine whether it's good or bad the code in cse_insn()
would have to correctly guess what later passes do.

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany


