This is the mail archive of the mailing list for the GCC project.


Re: [RFC] Cse reducing performance of register allocation with -O2

On 10/13/2015 01:06 PM, Jeff Law wrote:
On 10/13/2015 07:12 AM, Dominik Vogt wrote:
In some cases, the work of the cse1 pass is counterproductive, as
we noticed on s390x.  The effect described below is present since
at least 4.8.0.  Note that this may not manifest as a
performance problem on all platforms.  Also note that -O1
does not show this behaviour because the responsible code is only
executed with -O2 or higher.

The core of the problem is the way cse1 sometimes handles function
parameters.  Roughly, the observed situation is

Before cse1:

   start of function
   set pseudoreg Rp to the first argument from hardreg R2
   (some code that uses Rp)
   set R2 to Rp

After cse1:

   start of function
   set pseudoreg Rp to the first argument from hardreg R2
   (some code that uses Rp)  <--- The use of Rp is still present
   set R2 to R2              <--- cse1 has replaced Rp with R2

After that, the set pattern is removed completely, and now we have
both Rp and R2 live in the sketched code snippet.  Because R2 is
still supposed to be live later on, the ira pass chooses a
different hard register (R1) for Rp, and code to copy R1 back to
R2 is added later.  (See further down for Rtl and assembly code.)
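
To make the effect concrete, here is a minimal sketch of the liveness argument in C. It is a toy model, not GCC code: the register masks, struct toy_insn and all toy_* functions are made up for illustration, the "later use of R2" instruction is an assumption standing in for whatever keeps R2 live, and the conflict rule (two registers conflict if they are live at the same program point) is a simplification of what ira really does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy register masks (hypothetical numbering, not s390x or GCC).  */
enum { TOY_R2 = 1u << 0, TOY_RP = 1u << 1 };

/* One instruction, reduced to the registers it defines and uses.  */
struct toy_insn
{
  unsigned defs;
  unsigned uses;
};

/* Return 1 if registers A and B are live at the same program point
   anywhere in the straight-line sequence INSNS, else 0.  */
int
toy_live_together (const struct toy_insn *insns, size_t n,
                   unsigned a, unsigned b)
{
  unsigned live = 0;   /* Assume nothing is live after the last insn.  */

  assert (a != b);
  for (size_t i = n; i-- > 0; )
    {
      /* LIVE is the live-out set of insn I at this point.  */
      if ((live & a) && (live & b))
        return 1;
      live = (live & ~insns[i].defs) | insns[i].uses;
    }
  /* LIVE is now the live-in set of the function.  */
  return (live & a) && (live & b);
}

/* The sequence before cse1.  */
int
toy_conflict_before_cse1 (void)
{
  static const struct toy_insn insns[] = {
    { TOY_RP, TOY_R2 },   /* set Rp to R2 */
    { 0, TOY_RP },        /* some code that uses Rp */
    { TOY_R2, TOY_RP },   /* set R2 to Rp */
    { 0, TOY_R2 },        /* assumed later use of R2 */
  };
  return toy_live_together (insns, sizeof insns / sizeof insns[0],
                            TOY_R2, TOY_RP);
}

/* The sequence after cse1, with the no-op "set R2 to R2" removed.  */
int
toy_conflict_after_cse1 (void)
{
  static const struct toy_insn insns[] = {
    { TOY_RP, TOY_R2 },   /* set Rp to R2 */
    { 0, TOY_RP },        /* some code that uses Rp */
    { 0, TOY_R2 },        /* assumed later use of R2 */
  };
  return toy_live_together (insns, sizeof insns / sizeof insns[0],
                            TOY_R2, TOY_RP);
}
```

In this model the two registers never overlap before cse1, because the final copy redefines R2.  Once the copy is gone, R2 stays live across the uses of Rp, so the two overlap and an allocator using this simple rule cannot put Rp into R2.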


There seems to be code to prevent this in cse.c:hash_rtx_cb() as a
comment from that function suggests:

     /* On some machines, we can't record any non-fixed hard register,
        because extending its life will cause reload problems. We
        consider ap, fp, sp, gp to be fixed for this purpose.

This refers to the inability to reload those objects.  For those registers it is a correctness concern, not a performance concern.


Unfortunately this is not caused by hashing but by the code
dealing with src_related in cse_insn().  When cse_insn() handles
the "copy Rp to R2" instruction, it does nothing up to line 5020
and sets src_related there:

   /* This is the same as the destination of the insns, we want
      to prefer it.  Copy it to src_related.  The code below will
      then give it a negative cost.  */
   if (GET_CODE (dest) == code && rtx_equal_p (p->exp, dest))
     src_related = dest;

Eventually, the term src_related is used to replace the source
expression of the set pattern.  So, while the above comment may be
applicable to hashed expressions that are considered for
replacement, there's no such "safety net" for the expressions
src_related, src_folded etc.  I guess if there was, that would fix
the issue.
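
For illustration only, the kind of "safety net" meant here could look like the predicate below. This is a self-contained toy in C and not the attached patch: the register numbering, the toy_fixed_regs table and the toy_* function names are all hypothetical, and the rule only mirrors the comment quoted from hash_rtx_cb() (non-fixed hard registers are not recorded), not the actual conditions in cse.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical target: register numbers below this are hard registers,
   everything else is a pseudo.  */
enum { TOY_FIRST_PSEUDO_REGISTER = 16 };

/* Hypothetical table of fixed hard registers; here only register 15
   (think "stack pointer") is marked fixed.  */
static const bool toy_fixed_regs[TOY_FIRST_PSEUDO_REGISTER] = { [15] = true };

bool
toy_is_hard_reg (unsigned regno)
{
  return regno < TOY_FIRST_PSEUDO_REGISTER;
}

/* Return true if register REGNO may be used to replace the source of a
   set.  Pseudos and fixed hard registers are accepted; non-fixed hard
   registers are rejected so that their live ranges are not extended.  */
bool
toy_ok_as_replacement (unsigned regno)
{
  if (!toy_is_hard_reg (regno))
    return true;
  assert (regno < TOY_FIRST_PSEUDO_REGISTER);   /* table bounds */
  return toy_fixed_regs[regno];
}
```

With such a check applied to candidates like src_related, "set R2 to Rp" would keep Rp as its source, since R2 is a non-fixed hard register in this model.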


So, I've made an experimental hack (see attachment) and tried
that.  In a larger test suite, register copies could be saved in
quite a few places (including the test program below), but in other
places new register copies were introduced, resulting in about
twice as many "issues" as without the patch.

Maybe the patch is just too coarse.  In general I'd assume that
the register allocator does a better job of assigning hard
registers to pseudo registers.  Is it possible to better describe
when cse1 should keep its hands off pseudo registers?

We don't really have a way to describe this.

I know Vlad looked at problems in this space -- essentially knowing when two registers had the same value in the allocators/reload and exploiting that information.

Yes, I tried a simple GVN and used it to improve the conflict graph.
My recollection was it didn't help in any measurable way -- I think he discussed it during one of the old GCC summit conferences. That was also in the reload era.

I checked my article

and GVN gave mostly 0.2% on eon only. The current environment is quite different (IRA, LRA) so the results might be different too.

Also, as I remember, I implemented GVN only for pseudos.

LRA also checks values, but again only for reload and original pseudos.

It is a known problem. I have seen many cases where optimizations propagate hard registers, and it truly hurts RA. I guess such a practice should be discouraged. RA can perfectly well remove the copy between the hard reg and the pseudo itself, as in the example above, by assigning the same hard reg to the pseudo.

We could implement GVN for conflicts that takes hard registers into account, but it would be compile-time intensive, especially in LRA, which changes RTL considerably during its work and rebuilds live info several times.
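
As a rough sketch of what using value numbers for conflicts means, the C toy below ignores an overlap between two registers when both are known to hold the same value. It is not the IRA or LRA implementation: it handles only straight-line code, the struct vn_insn encoding and the vn_* names are invented for this example, and the "later use of R2" instruction is an assumption.

```c
#include <assert.h>
#include <stddef.h>

enum { VN_NREGS = 4, VN_MAX_INSNS = 32 };
enum { VN_R2 = 0, VN_RP = 1 };   /* hypothetical register indices */

/* "dst = src" if IS_COPY, else "dst = <new value computed from src>".
   DST or SRC may be -1 when the insn has no definition or no use.  */
struct vn_insn
{
  int dst;
  int src;
  int is_copy;
};

/* Return 1 if A and B are live together at a point where their value
   numbers differ, else 0.  */
int
vn_conflict (const struct vn_insn *insns, size_t n, int a, int b)
{
  int vn_after[VN_MAX_INSNS][VN_NREGS];
  int vn[VN_NREGS];
  int next_vn = 0;
  unsigned both = (1u << a) | (1u << b);
  unsigned live = 0;

  assert (n <= VN_MAX_INSNS && a != b);

  /* Every register starts with its own unknown value.  */
  for (int r = 0; r < VN_NREGS; r++)
    vn[r] = next_vn++;

  /* Forward pass: copies propagate value numbers, other definitions
     create a fresh one.  */
  for (size_t i = 0; i < n; i++)
    {
      if (insns[i].dst >= 0)
        vn[insns[i].dst] = (insns[i].is_copy && insns[i].src >= 0
                            ? vn[insns[i].src] : next_vn++);
      for (int r = 0; r < VN_NREGS; r++)
        vn_after[i][r] = vn[r];
    }

  /* Backward pass: liveness, checked against the value numbers.  */
  for (size_t i = n; i-- > 0; )
    {
      if ((live & both) == both && vn_after[i][a] != vn_after[i][b])
        return 1;
      if (insns[i].dst >= 0)
        live &= ~(1u << insns[i].dst);
      if (insns[i].src >= 0)
        live |= 1u << insns[i].src;
    }
  /* At function entry all values are distinct.  */
  return (live & both) == both;
}

/* Rp = R2; use Rp; use R2: the registers overlap but hold one value.  */
int
vn_example_unmodified (void)
{
  static const struct vn_insn insns[] = {
    { VN_RP, VN_R2, 1 }, { -1, VN_RP, 0 }, { -1, VN_R2, 0 },
  };
  return vn_conflict (insns, sizeof insns / sizeof insns[0], VN_R2, VN_RP);
}

/* Same, but Rp is recomputed in between, so the values differ.  */
int
vn_example_modified (void)
{
  static const struct vn_insn insns[] = {
    { VN_RP, VN_R2, 1 }, { VN_RP, VN_RP, 0 },
    { -1, VN_RP, 0 }, { -1, VN_R2, 0 },
  };
  return vn_conflict (insns, sizeof insns / sizeof insns[0], VN_R2, VN_RP);
}
```

In the first example the overlap is harmless and the pseudo could share the hard register; in the second the conflict is real.  The cost mentioned above comes from having to keep such value information up to date every time the RTL and the live info change.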

Ultimately this feels like all the issues around coalescing and copy-propagation. With that in mind, if we had lifetime & conflict information, then we'd be able to query that and perhaps be able to make different choices.

I wonder if the web-izer pass could help here or something based on it. Essentially what you want to do is a range split.
