Created attachment 34285 [details]
Reproducer

For the attached test compiled with -O2 -m32 -fPIE -pie, after r218059 we generate:

	call	__x86.get_pc_thunk.ax
	addl	$_GLOBAL_OFFSET_TABLE_, %eax
	subl	$28, %esp
	.cfi_def_cfa_offset 48
	movl	48(%esp), %edi
	movl	%eax, 12(%esp)     <--- PIC reg spill
	testl	%edi, %edi
	je	.L8
	movl	12(%esp), %eax     <--- PIC reg fill
	xorl	%esi, %esi
	movl	c@GOT(%eax), %ebp
	.p2align 4,,10
	.p2align 3
.L4:
	movl	12(%esp), %ebx     <--- PIC reg fill
	addl	$1, %esi
	call	bar@PLT

while for r218058 there is no spill and only reg-reg fill:

	call	__x86.get_pc_thunk.di
	addl	$_GLOBAL_OFFSET_TABLE_, %edi
	subl	$12, %esp
	.cfi_def_cfa_offset 32
	movl	32(%esp), %eax
	testl	%eax, %eax
	je	.L8
	movl	c@GOT(%edi), %ebp
	xorl	%esi, %esi
	.p2align 4,,10
	.p2align 3
.L4:
	movl	%edi, %ebx
	addl	$1, %esi
	call	bar@PLT
Confirmed.
Author: vmakarov
Date: Fri Jan 23 20:15:56 2015
New Revision: 220060

URL: https://gcc.gnu.org/viewcvs?rev=220060&root=gcc&view=rev
Log:
2015-01-23  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/64317
	* lra-lives.c (make_hard_regno_born): Add parameter.  Don't make
	REAL_PIC_OFFSET_TABLE_REGNUM conflicting with pic offset pseudo.
	(mark_regno_live, process_bb_lives): Pass new parameter value to
	make_hard_regno_born.

2015-01-23  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/64317
	* gcc.target/i386/pr64317.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr64317.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-lives.c
    trunk/gcc/testsuite/ChangeLog
Created attachment 34675 [details]
Another reproducer

I found one more reproducer for the problem. Generated code for multiple 'output' calls inlined into 'test':

...
.L5:
	movl	28(%esp), %edx        <-- load PIC reg
	movl	%edi, (%ecx)
	leal	(%ebx,%eax), %ecx
	cmpl	%ecx, %ebp
	movl	%ebx, out_pos@GOTOFF(%edx)
	movl	val3@GOTOFF(%edx), %edi
	jnb	.L6
	movl	28(%esp), %ebx        <-- have value in EDX
	movl	out@GOTOFF(%ebx), %ebx
	leal	(%ebx,%eax), %ecx
.L6:
	movl	28(%esp), %edx        <-- NOP
	movl	%edi, (%ebx)
	leal	(%ecx,%eax), %ebx
	cmpl	%ebx, %ebp
	movl	%ecx, out_pos@GOTOFF(%edx)
	movl	val4@GOTOFF(%edx), %edi
	jnb	.L7
	movl	28(%esp), %ecx        <-- have value in EDX
	movl	out@GOTOFF(%ecx), %ecx
	leal	(%ecx,%eax), %ebx
.L7:
	movl	28(%esp), %edx        <-- NOP
	movl	%edi, (%ecx)
...

BTW, if I put __attribute__((noinline)) on the crc32 function, then the mentioned code becomes better and we don't have these two useless instructions in each function instance.

Compilation string: gcc -Ofast -funroll-loops -m32 -march=slm -fPIE test.i -S
Used compiler: gcc version 5.0.0 20150203 (experimental) (GCC)
Does #c2 fix this, or is #c3 an unrelated bugreport that still needs fixing?
(In reply to Jakub Jelinek from comment #4)
> Does #c2 fix this, or is #c3 an unrelated bugreport that still needs fixing?

The problem is still seen after the fix. I put the test here because of the same symptom. Should I open a new one?
p107 is a pic pseudo. IRA spills it (most probably it improves the overall allocation). Every usage of p107 needs a reload. Reloads of p107 are decreased in BB scope through inheritance. LRA can do inheritance in EBB scope too (the reload pass was never able to do this).

LRA does not form an EBB containing more than one BB here because the probability of each branch in the code in question is exactly 50%: LRA adds a BB to an EBB only if the fall-through probability is more than 50%. Therefore each EBB in this code contains one block. If we permitted BBs with an exact 50% fall-through probability, we would get the following code:

  103: L103: -> .L5
  104: NOTE_INSN_BASIC_BLOCK 9
  105: [cx:SI]=di:SI
  850: bp:SI=[sp:SI+0x1c]
  585: di:SI=bp:SI
  106: [di:SI+const(unspec[`out_pos'] 1)]=bx:SI
  586: di:SI=bp:SI
  107: di:SI=[di:SI+const(unspec[`val3'] 1)]
  108: {cx:SI=bx:SI+ax:SI;clobber flags:CC;}
  109: flags:CC=cmp(dx:SI,cx:SI)
  110: pc={(geu(flags:CC,0))?L114:pc}
        REG_BR_PROB 5000
  111: NOTE_INSN_BASIC_BLOCK 10
  587: bx:SI=bp:SI
  112: bx:SI=[bx:SI+const(unspec[`out'] 1)]
  113: {cx:SI=bx:SI+ax:SI;clobber flags:CC;}
  114: L114: -> .L6
  115: NOTE_INSN_BASIC_BLOCK 11
  116: [bx:SI]=di:SI
  852: bp:SI=[sp:SI+0x1c]
  588: bx:SI=bp:SI
  117: [bx:SI+const(unspec[`out_pos'] 1)]=cx:SI
  589: bx:SI=bp:SI
  118: di:SI=[bx:SI+const(unspec[`val4'] 1)]
  119: {bx:SI=cx:SI+ax:SI;clobber flags:CC;}
  120: flags:CC=cmp(dx:SI,bx:SI)
  121: pc={(geu(flags:CC,0))?L125:pc}
        REG_BR_PROB 5000
  122: NOTE_INSN_BASIC_BLOCK 12
  590: bx:SI=bp:SI
  123: cx:SI=[bx:SI+const(unspec[`out'] 1)]
  124: {bx:SI=cx:SI+ax:SI;clobber flags:CC;}

As you can see, the pic pseudo is loaded once for each EBB (BBs 9-10 and BBs 11-12). For some reason GCSE after RA lifts the pic loads up, and the final code looks like:

	leal	(%ebx,%eax), %ecx
.L5:
	movl	%edi, (%ecx)
	leal	(%ebx,%eax), %ecx
	movl	%ebx, out_pos@GOTOFF(%ebp)
	cmpl	%ecx, %edx
	movl	val3@GOTOFF(%ebp), %edi
	jnb	.L6
	movl	out@GOTOFF(%ebp), %ebx
	movl	28(%esp), %ebp        ! lifted load
	leal	(%ebx,%eax), %ecx
.L6:
	movl	%edi, (%ebx)
	leal	(%ecx,%eax), %ebx
	movl	%ecx, out_pos@GOTOFF(%ebp)
	cmpl	%ebx, %edx
	movl	val4@GOTOFF(%ebp), %edi
	jnb	.L7
	movl	out@GOTOFF(%ebp), %ecx
	movl	28(%esp), %ebp        ! lifted load
	leal	(%ecx,%eax), %ebx

It looks better, but we could still remove one redundant load. That needs inheritance beyond the EBB scope, which is very difficult to implement. Please remember that inheritance in EBB scope is already a step forward in comparison with the old reload pass.

The question is: should we change the fall-through probability cutoff to get the code above? I don't know. It needs at least some good benchmarking, as changing a heuristic to improve one case might result in performance degradation in more cases. RA is all about heuristics; there will always be some cases where RA can be improved more, but fixing them by adding new heuristics might worsen other cases, and the PR cycle will continue. So I don't think this case will be solved for GCC-5.0 or even the next releases. Sorry.
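The cutoff at work here can be pictured with a small schematic model. The macro name mirrors EBB_PROBABILITY_CUTOFF in lra-constraints.c, but the function and comparison below are an illustration of the described behavior, not the actual LRA code:

```c
#include <stdbool.h>

/* Schematic model (not the real lra-constraints.c code) of the
   EBB-formation test described above.  Probabilities are in
   REG_BR_PROB units, 0..10000.  */
#define EBB_PROBABILITY_CUTOFF 5000   /* 50% */

static bool
extend_ebb_p (int fallthru_prob)
{
  /* With a strict '>' comparison, an exact 50/50 branch
     (REG_BR_PROB 5000, as on insns 110 and 121 above) never
     extends the EBB, so every EBB degenerates to a single BB.  */
  return fallthru_prob > EBB_PROBABILITY_CUTOFF;
}
```

In this model, lowering the cutoff (or using a non-strict comparison) is exactly what lets the 50/50 branches in this test form multi-block EBBs.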
Vlad,

What's the rationale behind the 50% probability cutoff for forming an EBB? For the purposes of inheritance, ISTM you want the biggest EBBs possible to give you a maximal window in which to find an insn to inherit from? Or is it the case that EBB formation impacts both spilling and inheritance in IRA/LRA?
And for GCC 5, ISTM the question that hasn't been answered, particularly with regard to the second reproducer, is whether or not this is a regression for the overall performance of that code.

It's certainly possible that IRA determined that %ebx was better used to hold a different value and that the PIC register might end up in a call-clobbered register. If the object in %ebx is heavily used, the benefits of keeping that value in %ebx may outweigh the cost of having the PIC value in a different register (say, perhaps one that is call-clobbered).
(In reply to Jeffrey A. Law from comment #7)
> Vlad,
> 
> What's the rationale behind the 50% probability cutoff for forming an EBB?
> For the purposes of inheritance, ISTM you want the biggest EBBs possible to
> give you a maximal window in which to find an insn to inherit from?  Or is
> it the case that EBB formation impacts both spilling and inheritance in
> IRA/LRA?

I remember I tried different cut-off probabilities. But LRA is changing quickly; maybe it is time to check this again.

I'd agree with you that for inheritance, the bigger the EBB, the better. But there are also optional reloads. If we had an EBB consisting of BB1 and a very low probability BB2:

  BB1:
    op with pseudo spilled
  BB2:
    the spilled pseudo use

optional reload and inheritance would transform it into:

  BB1:
    load hr, pseudo spilled
    op with hr
  BB2:
    hr use

With the 50% cut-off we would have:

  BB1:
    op with pseudo spilled
  BB2:
    the spilled pseudo use

which is better code if the probability of BB1 >> BB2.
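The tradeoff in this example comes down to expected-cost arithmetic. The sketch below uses made-up unit costs (load_cost, mem_op_penalty, fill_cost are illustrative parameters, not real LRA cost values) purely to show the shape of the comparison:

```c
/* Illustrative expected-cost comparison for the BB1/BB2 example
   above.  All costs are arbitrary units invented for the example,
   not values from LRA's cost model.  */

/* With inheritance: the hard reg is loaded unconditionally in BB1
   and BB2 simply reuses it.  */
static double
cost_inherit (double load_cost)
{
  return load_cost;
}

/* Without inheritance: BB1's op keeps a memory operand (a small
   penalty), and the fill is paid only when BB2 actually executes.  */
static double
cost_no_inherit (double prob_bb2, double mem_op_penalty, double fill_cost)
{
  return mem_op_penalty + prob_bb2 * fill_cost;
}
```

When BB2 is rarely executed, leaving the pseudo spilled wins; when BB2 is hot, inheritance wins — which is exactly why a single probability cutoff can only approximate the right answer.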
(In reply to Jeffrey A. Law from comment #8)
> And for GCC 5, ISTM the question that hasn't been answered, particularly
> with regard to the second reproducer is whether or this is a regression for
> the overall performance of that code.
> 
> It's certainly possible that IRA determined that %ebx was better used to
> hold a different value and that the PIC register might end up in a
> call-clobbered register.  If the object in %ebx is heavily used, the
> benefits of keeping that value in %ebx may outweigh the cost of having the
> PIC value in a different register (say perhaps one that is call-clobbered).

I guess it is easy to check by preventing pic pseudo generation.
(In reply to Vladimir Makarov from comment #10)
> I guess it is easy to check by preventing pic pseudo generation.

The i386 back-end doesn't support a fixed PIC register any more. This test case demonstrates a performance regression in some EEMBC 1.1 tests caused by the pseudo PIC register introduction.

It is unclear why RA decided to spill the PIC register: if we look at the loop's code, the PIC register is used in each line of code and seems to be the most used register. It also seems weird to me that the code for the first loop becomes much better (with no PIC reg fills) if we restrict inlining for the other one. How does the second loop affect allocation in the first one?
I'm very aware that the x86 backend doesn't support a fixed PIC register anymore. RA was going to have to spill something.

The PIC register is needed in three different loops: L0, L1 and L2. L0 needs 9 general purpose registers, L1 needs 8 general purpose registers and L2 needs 9 general purpose registers. ie, there's going to be spills; there's simply no way around it.

r107 (PIC pseudo) gets split into 3 allocnos: A1, A11 and A237, covering Loops 0, 2, 1 respectively. It's live throughout most of the resultant function by way of explicit references and the need to have %ebx set up prior to external calls.

For Loop 1 & Loop 2, the respective allocnos (A237, A11) are not used/set within the loop at all, ie, they are transparent within their respective loops. IRA does exactly what we want here by keeping the PIC register in memory, which frees up a register within those loops for other objects that are used within the loop. Of course, to do that we have to reload the value for the uses outside the boundary of those loops.

Loop 0 (bbs 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) is the most interesting and is the one where we have those annoying reloads.

One of the things I notice is that LRA is generating sequences like:

(insn 581 89 90 6 (set (reg:SI 3 bx [107])
        (mem/c:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) j.c:19 90 {*movsi_internal}
     (nil))
(insn 90 581 91 6 (set (reg/f:SI 3 bx [orig:142 D.2145 ] [142])
        (mem/f/c:SI (plus:SI (reg:SI 3 bx [107])
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("out") [flags 0x2] <var_decl 0x7ffff670bc60 out>)
                        ] UNSPEC_GOTOFF))) [1 out+0 S4 A32])) j.c:19 90 {*movsi_internal}
     (nil))

Note how we load %ebx from memory, then use/clobber it in the next insn. That makes it impossible for the post-reload optimizers to help clean this up. How hard would it be to generate code in LRA where those two insns set different registers in those local snippets of code?
In this particular case, %ebp is locally available and there's no strong reason why we have to use %ebx. By using different destinations for insns 581 and 90, the source MEM of insn 581 would be available after insn 90. And by making the value available postreload-gcse would be able to commonize those loads.
(In reply to Jeffrey A. Law from comment #12)
> 
> One of the things I notice is that LRA is generating sequences like:
> (insn 581 89 90 6 (set (reg:SI 3 bx [107])
>         (mem/c:SI (plus:SI (reg/f:SI 7 sp)
>                 (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) j.c:19 90
> {*movsi_internal}
>      (nil))
> (insn 90 581 91 6 (set (reg/f:SI 3 bx [orig:142 D.2145 ] [142])
>         (mem/f/c:SI (plus:SI (reg:SI 3 bx [107])
>                 (const:SI (unspec:SI [
>                             (symbol_ref:SI ("out") [flags 0x2] <var_decl
> 0x7ffff670bc60 out>)
>                         ] UNSPEC_GOTOFF))) [1 out+0 S4 A32])) j.c:19 90
> {*movsi_internal}
>      (nil))
> 
> Note how we load %ebx from memory, then use/clobber it in the next insn.
> That makes it impossible for the post-reload optimizers to help clean this
> up.  How hard would it be to generate code in LRA where those two insns set
> different registers in those local snippets of code?
> 
> In this particular case, %ebp is locally available and there's no strong
> reason why we have to use %ebx.  By using different destinations for insns
> 581 and 90, the source MEM of insn 581 would be available after insn 90.
> And by making the value available postreload-gcse would be able to commonize
> those loads.

Jeff, thanks for the analysis. The PIC pseudo really can be placed in different regs. There are a lot of places in RA where it is decided in what reg the pic pseudo will be placed for a particular live range. Probably there is a place where the decision is made for ebx without considering other alternatives. I'll investigate this problem.
Just trying to help out where I can. It's similar to the round-robin use of reload regs we've had in reload for a while. The idea was to hopefully have reloaded values lying around in a useful place more often to improve inheritance.
(In reply to Jeffrey A. Law from comment #14)
> Just trying to help out where I can.  It's similar to the round robin use
> of reload regs we've had in reload for a while.  THe idea was to hopefully
> have reloaded values lying around in a useful place more often to improve
> inheritance.

Yes, I remember this. LRA has an analogous technique called hard register levelling. The idea is the same: to use all hard registers with equal probability. I think there is some possibility to improve this by making it more CFG-sensitive, although that is harder to do in LRA because hard register allocation in LRA is done in a more global scope than in reload.
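The shared idea behind reload's round-robin reload regs and LRA's levelling can be sketched as follows; the bitmask representation and function name are invented for illustration and bear no relation to the actual reload/LRA data structures:

```c
/* Toy round-robin chooser over hard regs 0..31.  Each call starts
   scanning one past the previously chosen register, so repeated
   reloads spread across all free registers instead of always
   reusing the same one -- leaving more reloaded values lying
   around in distinct registers for later inheritance.  */
static int last_chosen = -1;

static int
choose_reload_reg (unsigned avail_mask)
{
  for (int i = 1; i <= 32; i++)
    {
      int r = (last_chosen + i) % 32;
      if (avail_mask & (1u << r))
        {
          last_chosen = r;
          return r;
        }
    }
  return -1;  /* no register available */
}
```

With regs 0-2 free, successive calls return 0, 1, 2, 0, ... instead of 0 every time.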
Author: vmakarov
Date: Fri Feb 27 22:02:05 2015
New Revision: 221070

URL: https://gcc.gnu.org/viewcvs?rev=221070&root=gcc&view=rev
Log:
2015-02-27  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/64317
	* params.def (PARAM_LRA_INHERITANCE_EBB_PROBABILITY_CUTOFF): New.
	* params.h (LRA_INHERITANCE_EBB_PROBABILITY_CUTOFF): New.
	* lra-constraints.c: Include "params.h".
	(EBB_PROBABILITY_CUTOFF): Use
	LRA_INHERITANCE_EBB_PROBABILITY_CUTOFF.
	(lra_inheritance): Use '<' instead of '<=' for
	EBB_PROBABILITY_CUTOFF.
	* doc/invoke.texi (lra-inheritance-ebb-probability-cutoff): Document
	change.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/lra-constraints.c
    trunk/gcc/params.def
    trunk/gcc/params.h
Thanks Vlad, that patch helped. Prior to your patch we had 15 reloads of the PIC register from memory; after your patch we have just 9. However, several of the remaining 9 seem to be redundant.

LRA never considers a block starting with a label as participating in an EBB. That's overly conservative. A block can participate in an EBB if all of its predecessors are part of the same EBB.

That's particularly useful in CFGs like

   A
  /|
 / |
B  |
 \ |
  \|
   C

[ Flow downward of course. ]

If we assume that B is the fallthru path, then LRA will try to make AB into an EBB. But it will not consider C because C will start with a label. That ultimately causes missed inheritances in this example.

As an example, we have this in the .reload dump:

[ BB 7 ]
(insn 848 94 582 7 (set (reg:SI 6 bp [107])
        (mem/c:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) k.c:22 90 {*movsi_internal}
     (nil))
[ ... ]
(jump_insn 99 98 100 7 (set (pc)
        (if_then_else (geu (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref 103)
            (pc))) k.c:18 613 {*jcc_1}
     (int_list:REG_BR_PROB 5000 (nil)))

[ BB 8 ]
(insn 584 100 101 8 (set (reg:SI 3 bx [107])
        (reg:SI 6 bp [107])) k.c:19 90 {*movsi_internal}
     (nil))

[ Note we inherited the value in %ebp, this is good. ]
[ ... ]

[ BB 9 ]
(code_label 103 102 104 9 5 "" [1 uses])
[ ... ]
(insn 850 105 585 9 (set (reg:SI 6 bp [107])
        (mem/c:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) k.c:22 90 {*movsi_internal}
     (nil))

Note insn 850, which reloads the value from memory again. We correctly formed an EBB with BB7 and BB8, but we really should have extended that to BB9. It appears that fixing this would improve things further.

It's also the case that the post-reload passes aren't doing as good of a job at cleaning things up as they should, so I'll look at that in parallel if you can investigate the LRA side of things.
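The relaxed membership test proposed here — a label-started block may still join the EBB when all of its predecessors are already inside it — can be sketched like this; the bitmask CFG encoding is a toy invented for the example, not LRA's representation:

```c
#include <stdbool.h>

/* Toy CFG encoding: pred_mask has bit p set when block p is a
   predecessor of the candidate block; in_ebb has a bit set for
   each block already placed in the EBB being formed.  */
static bool
can_join_ebb (unsigned pred_mask, unsigned in_ebb)
{
  /* A block starting with a label may join the EBB when every one
     of its predecessors is already inside it (and it has at least
     one predecessor).  */
  return pred_mask != 0 && (pred_mask & ~in_ebb) == 0;
}
```

For the diamond above, with A = bit 0 and B = bit 1: once A and B are in the EBB, C (preds {A,B}) qualifies; if only A were in, C would not.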
Sigh... reload_cse_regs doesn't help us because of the CODE_LABEL at the start of block C in the mini-cfg referenced in c#17. In essence it's using the same extended block definition as LRA. postreload-gcse doesn't help us because it computes local properties, but never propagates them through the CFG. So a memory load that occurs in block A and is transparent through B is not seen as available at the end of B. Thus the load is only considered partially redundant at the start of C. SIGH.
Making reload_cse_regs less pessimistic at CODE_LABELs really isn't feasible. cselib isn't easily turned into a scoped hash table, and the multiple-assignment nature of RTL means that a simple scoped hash table isn't sufficient anyway.

I'm currently poking at postreload-gcse. The basic idea is to detect when an occurrence is transparent through a block and an occurrence exists in the block's one and only predecessor. In that case we pretend the occurrence also exists at the end of the current block. I don't expect this to occur often, thus I'm not computing transparency for each expression in all blocks and doing a traditional dataflow propagation.

With that special case in place, I get just 5 reloads of the PIC register, which appears minimal given the other register allocation decisions. I need to ponder a couple implementation details, but this may be the final resolution to this BZ :-)
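The special case described here can be modeled as a small recursive availability check: an expression is available at the end of a block if a local occurrence exists, or if the block is transparent and the expression is available in its single predecessor. This is an illustrative model (the struct and field names are invented), not the actual postreload-gcse.c code, and it assumes the single-predecessor chain is acyclic:

```c
#include <stdbool.h>

/* Per-block facts for one expression: a local occurrence reaches
   the block end, nothing in the block clobbers the expression
   (transparency), and the index of the unique predecessor
   (-1 when the block has zero or multiple predecessors).  */
struct bb_info
{
  bool local_avail;
  bool transp;
  int single_pred;
};

static bool
avail_at_end (const struct bb_info *bb, int b)
{
  if (bb[b].local_avail)
    return true;
  /* Walk up through transparent single-predecessor blocks; this
     is the extension discussed above, without full dataflow.  */
  if (bb[b].transp && bb[b].single_pred >= 0)
    return avail_at_end (bb, bb[b].single_pred);
  return false;
}
```

In the A/B/C diamond from c#17, a load occurring in A and transparent through B is then seen as available at the end of B, making it fully redundant at C.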
Still poking, but downgrading to P2.
(In reply to Jeffrey A. Law from comment #17)
> Thanks Vlad, that patch helped.  Prior to your patch we had 15 reloads of
> the PIC register from memory, after your patch we have just 9.  However,
> several of the remaining 9 seem to be redundant.
> 
> LRA never considers a block starting with a label as participating in an
> EBB.  That's overly conservative.  A block can participate in an EBB if all
> of its predecessors are part of the same EBB.
> 
> That's particularly useful in CFGs like
> 
>    A
>   /|
>  / |
> B  |
>  \ |
>   \|
>    C
> 
> [ Flow downward of course. ]
> 
> If we assume that B is the fallthru path, then LRA will try to make AB into
> an EBB.  But it will not consider C because C will start with a label.  That
> ultimately causes missed inheritances in this example.

I agree the bigger scope would help, but it is hard to implement. If we consider the hammock structure of the graph, the value of pseudo P inherited from A into C might be changed in B. It works if we don't change it in B, but then the structure can depend on which pseudos are considered for inheritance. It is also harder to update live info, and to undo the inheritance transformations if they fail.

The biggest problem with the PIC pseudo is that it cannot be treated like the rest of the pseudos from a live range splitting perspective, because of the possible creation of memory referring to the pic pseudo value. It is hard to decide what pseudo (or hard register) to use for such memory in a particular place, as the value can be in different pseudos (hard registers) if we permit usual pseudo live range splitting for the pic pseudo.
Yes, absolutely, we have to compensate for the case where B clobbers something useful, undoing transformations, etc. It may not be reasonably feasible within the current structure of LRA. I'm getting pretty close on getting postreload-gcse.c to handle this. Most of the complexity has been in untangling the bits from gcse.c that I want to re-use. There's obviously cleanup still to do and some instrumentation/analysis of the generated code on a wider codebase, but it looks promising.
So with a functional prototype in place, I can get good code for this test. However, deeper testing of that prototype is proving difficult simply because the postreload-gcse.c code very rarely does anything useful, at least on x86 or x86-64.

I put in some dumping in gcse_after_reload_main to always dump its statistics if its expression table has > 0 elements. Believe it or not, this *never* triggers during a bootstrap on x86_64. I was so amazed by this I fired up a VM and bootstrapped i686 as well; nope, never fires there either. Concerned that maybe my other changes were somehow getting in the way, I put in just the dumping code; it still doesn't fire. Yet it fires semi-regularly in the testsuite.

The original author of the code in postreload-gcse.c was doing their development on PPC boxes, so I'm going to grab one and see if I can get some useful testing on it.
Author: law
Date: Mon Mar 23 05:21:04 2015
New Revision: 221585

URL: https://gcc.gnu.org/viewcvs?rev=221585&root=gcc&view=rev
Log:
	PR rtl-optimization/64317
	* Makefile.in (OBJS): Add gcse-common.c
	* gcse.c: Include gcse-common.h
	(struct modify_pair_s): Move structure definition to gcse-common.h
	(compute_transp): Move function to gcse-common.c.
	(canon_list_insert): Similarly.
	(record_last_mem_set_info): Break out some code and put it into
	gcse-common.c.  Call into the new common code.
	(compute_local_properties): Pass additional arguments to
	compute_transp.
	* postreload-gcse.c: Include gcse-common.h and df.h
	(modify_mem_list_set, blocks_with_calls): New variables.
	(modify_mem_list, canon_modify_mem_list, transp): Likewise.
	(get_bb_avail_insn): Pass in the expression index too.
	(alloc_mem): Allocate memory for the new bitmaps and lists.
	(free_mem): Free memory for the new bitmaps and lists.
	(insert_expr_in_table): Record a bitmap index for each entry we
	add to the table.
	(record_last_mem_set_info): Call into common code in gcse-common.c.
	(get_bb_avail_insn): If no available insn was found in the requested
	BB.  If BB has a single predecessor, see if the expression is
	transparent in BB and available in that single predecessor.
	(compute_expr_transp): New wrapper for compute_transp.
	(eliminate_partially_redundant_load): Pass expression's bitmap_index
	to get_bb_avail_insn.  Compute next_pred_bb_end a bit later.
	(gcse_after_reload_main): If there are elements in the hash table,
	then compute transparency for all the elements in the hash table.
	* gcse-common.h: New file.
	* gcse-common.c: New file.

Added:
    trunk/gcc/gcse-common.c
    trunk/gcc/gcse-common.h
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/gcse.c
    trunk/gcc/postreload-gcse.c
Unwanted loads have been eliminated by the postreload-gcse improvements.
Forgot to change BZ state after committing fix.