Created attachment 34285 [details]
Reproducer

For the attached test compiled with -O2 -m32 -fPIE -pie, after r218059 we generate:

	call	__x86.get_pc_thunk.ax
	addl	$_GLOBAL_OFFSET_TABLE_, %eax
	subl	$28, %esp
	.cfi_def_cfa_offset 48
	movl	48(%esp), %edi
	movl	%eax, 12(%esp)     <--- PIC reg spill
	testl	%edi, %edi
	je	.L8
	movl	12(%esp), %eax     <--- PIC reg fill
	xorl	%esi, %esi
	movl	c@GOT(%eax), %ebp
	.p2align 4,,10
	.p2align 3
.L4:
	movl	12(%esp), %ebx     <--- PIC reg fill
	addl	$1, %esi
	call	bar@PLT

while for r218058 there is no spill and only reg-reg fill:

	call	__x86.get_pc_thunk.di
	addl	$_GLOBAL_OFFSET_TABLE_, %edi
	subl	$12, %esp
	.cfi_def_cfa_offset 32
	movl	32(%esp), %eax
	testl	%eax, %eax
	je	.L8
	movl	c@GOT(%edi), %ebp
	xorl	%esi, %esi
	.p2align 4,,10
	.p2align 3
.L4:
	movl	%edi, %ebx
	addl	$1, %esi
	call	bar@PLT
Confirmed.
Author: vmakarov
Date: Fri Jan 23 20:15:56 2015
New Revision: 220060

URL: https://gcc.gnu.org/viewcvs?rev=220060&root=gcc&view=rev
Log:
2015-01-23  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/64317
	* lra-lives.c (make_hard_regno_born): Add parameter.  Don't make
	REAL_PIC_OFFSET_TABLE_REGNUM conflicting with pic offset pseudo.
	(mark_regno_live, process_bb_lives): Pass new parameter value to
	make_hard_regno_born.

2015-01-23  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/64317
	* gcc.target/i386/pr64317.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr64317.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-lives.c
    trunk/gcc/testsuite/ChangeLog
Created attachment 34675 [details]
Another reproducer

I found one more reproducer for the problem. Generated code for multiple 'output' calls inlined into 'test':

...
.L5:
	movl	28(%esp), %edx        <-- load PIC reg
	movl	%edi, (%ecx)
	leal	(%ebx,%eax), %ecx
	cmpl	%ecx, %ebp
	movl	%ebx, out_pos@GOTOFF(%edx)
	movl	val3@GOTOFF(%edx), %edi
	jnb	.L6
	movl	28(%esp), %ebx        <-- have value in EDX
	movl	out@GOTOFF(%ebx), %ebx
	leal	(%ebx,%eax), %ecx
.L6:
	movl	28(%esp), %edx        <-- NOP
	movl	%edi, (%ebx)
	leal	(%ecx,%eax), %ebx
	cmpl	%ebx, %ebp
	movl	%ecx, out_pos@GOTOFF(%edx)
	movl	val4@GOTOFF(%edx), %edi
	jnb	.L7
	movl	28(%esp), %ecx        <-- have value in EDX
	movl	out@GOTOFF(%ecx), %ecx
	leal	(%ecx,%eax), %ebx
.L7:
	movl	28(%esp), %edx        <-- NOP
	movl	%edi, (%ecx)
...

BTW, if I put __attribute__((noinline)) on the crc32 function, then the mentioned code becomes better and we don't have these two useless instructions in each function instance.

Compilation string: gcc -Ofast -funroll-loops -m32 -march=slm -fPIE test.i -S
Used compiler: gcc version 5.0.0 20150203 (experimental) (GCC)
Does #c2 fix this, or is #c3 an unrelated bugreport that still needs fixing?
(In reply to Jakub Jelinek from comment #4)
> Does #c2 fix this, or is #c3 an unrelated bugreport that still needs fixing?

The problem is still seen after the fix. I put the test here because of the same symptom. Should I open a new one?
p107 is a pic pseudo. IRA spills it (most probably it improves the overall allocation). Every usage of p107 needs a reload. Reloads of p107 are decreased in BB scope through inheritance. LRA can do inheritance in EBB scope too (the reload pass was never able to do this).

LRA does not form an EBB containing more than one BB here because the probability of each branch in the code in question is exactly 50%: LRA adds a BB to an EBB only if the fall-through probability is more than 50%. Therefore each EBB in this code contains one block. If we permitted BBs with an exact 50% fall-through probability, we would get the following code:

  103: L103: -> .L5
  104: NOTE_INSN_BASIC_BLOCK 9
  105: [cx:SI]=di:SI
  850: bp:SI=[sp:SI+0x1c]
  585: di:SI=bp:SI
  106: [di:SI+const(unspec[`out_pos'] 1)]=bx:SI
  586: di:SI=bp:SI
  107: di:SI=[di:SI+const(unspec[`val3'] 1)]
  108: {cx:SI=bx:SI+ax:SI;clobber flags:CC;}
  109: flags:CC=cmp(dx:SI,cx:SI)
  110: pc={(geu(flags:CC,0))?L114:pc}
        REG_BR_PROB 5000
  111: NOTE_INSN_BASIC_BLOCK 10
  587: bx:SI=bp:SI
  112: bx:SI=[bx:SI+const(unspec[`out'] 1)]
  113: {cx:SI=bx:SI+ax:SI;clobber flags:CC;}
  114: L114: -> .L6
  115: NOTE_INSN_BASIC_BLOCK 11
  116: [bx:SI]=di:SI
  852: bp:SI=[sp:SI+0x1c]
  588: bx:SI=bp:SI
  117: [bx:SI+const(unspec[`out_pos'] 1)]=cx:SI
  589: bx:SI=bp:SI
  118: di:SI=[bx:SI+const(unspec[`val4'] 1)]
  119: {bx:SI=cx:SI+ax:SI;clobber flags:CC;}
  120: flags:CC=cmp(dx:SI,bx:SI)
  121: pc={(geu(flags:CC,0))?L125:pc}
        REG_BR_PROB 5000
  122: NOTE_INSN_BASIC_BLOCK 12
  590: bx:SI=bp:SI
  123: cx:SI=[bx:SI+const(unspec[`out'] 1)]
  124: {bx:SI=cx:SI+ax:SI;clobber flags:CC;}

As you can see, the pic pseudo is loaded once for each EBB (BBs 9-10 and BBs 11-12). For some reason GCSE after RA lifts the pic loads up, and the final code looks like:

	leal	(%ebx,%eax), %ecx
.L5:
	movl	%edi, (%ecx)
	leal	(%ebx,%eax), %ecx
	movl	%ebx, out_pos@GOTOFF(%ebp)
	cmpl	%ecx, %edx
	movl	val3@GOTOFF(%ebp), %edi
	jnb	.L6
	movl	out@GOTOFF(%ebp), %ebx
	movl	28(%esp), %ebp        ! lifted load
	leal	(%ebx,%eax), %ecx
.L6:
	movl	%edi, (%ebx)
	leal	(%ecx,%eax), %ebx
	movl	%ecx, out_pos@GOTOFF(%ebp)
	cmpl	%ebx, %edx
	movl	val4@GOTOFF(%ebp), %edi
	jnb	.L7
	movl	out@GOTOFF(%ebp), %ecx
	movl	28(%esp), %ebp        ! lifted load
	leal	(%ecx,%eax), %ebx

It looks better, but we could still remove one redundant load. That needs inheritance beyond the EBB scope, which is very difficult to implement. Please remember that inheritance in EBB scope is already a step forward in comparison with the old reload pass.

The question is: should we change the fall-through probability cutoff to get the code above? I don't know. It needs at least some good benchmarking, as changing a heuristic to improve one case might result in performance degradation in more cases. RA is all about heuristics; there will always be some cases where RA can be improved more, but fixing them by adding new heuristics might worsen other cases, and the PR cycle will continue. So I don't think this case will be solved for GCC-5.0 or even the next releases. Sorry.
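The cutoff at work here can be pictured with a small schematic model. The macro name mirrors EBB_PROBABILITY_CUTOFF in lra-constraints.c, but the function and comparison below are an illustration of the described behavior, not the actual LRA code:

```c
#include <stdbool.h>

/* Schematic model (not the real lra-constraints.c code) of the
   EBB-formation test described above.  Probabilities are in
   REG_BR_PROB units, 0..10000.  */
#define EBB_PROBABILITY_CUTOFF 5000   /* 50% */

static bool
extend_ebb_p (int fallthru_prob)
{
  /* With a strict '>' comparison, an exact 50/50 branch
     (REG_BR_PROB 5000, as on insns 110 and 121 above) never
     extends the EBB, so every EBB degenerates to a single BB.  */
  return fallthru_prob > EBB_PROBABILITY_CUTOFF;
}
```

In this model, lowering the cutoff (or using a non-strict comparison) is exactly what lets the 50/50 branches in this test form multi-block EBBs.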
Vlad,

What's the rationale behind the 50% probability cutoff for forming an EBB? For the purposes of inheritance, ISTM you want the biggest EBBs possible to give you a maximal window in which to find an insn to inherit from? Or is it the case that EBB formation impacts both spilling and inheritance in IRA/LRA?
And for GCC 5, ISTM the question that hasn't been answered, particularly with regard to the second reproducer, is whether or not this is a regression for the overall performance of that code.

It's certainly possible that IRA determined that %ebx was better used to hold a different value and that the PIC register might end up in a call-clobbered register. If the object in %ebx is heavily used, the benefits of keeping that value in %ebx may outweigh the cost of having the PIC value in a different register (say, perhaps one that is call-clobbered).
(In reply to Jeffrey A. Law from comment #7)
> Vlad,
> 
> What's the rationale behind the 50% probability cutoff for forming an EBB?
> For the purposes of inheritance, ISTM you want the biggest EBBs possible to
> give you a maximal window in which to find an insn to inherit from?  Or is
> it the case that EBB formation impacts both spilling and inheritance in
> IRA/LRA?

I remember I tried different cut-off probabilities. But LRA is changing quickly; maybe it is time to check this again.

I'd agree with you that for inheritance, the bigger the EBB, the better. But there are also optional reloads. If we had an EBB consisting of BB1 and a very low probability BB2:

  BB1:
    op with pseudo spilled
  BB2:
    the spilled pseudo use

optional reload and inheritance would transform it into:

  BB1:
    load hr, pseudo spilled
    op with hr
  BB2:
    hr use

With the 50% cut-off we would have:

  BB1:
    op with pseudo spilled
  BB2:
    the spilled pseudo use

which is better code if the probability of BB1 >> BB2.
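The tradeoff in this example comes down to expected-cost arithmetic. The sketch below uses made-up unit costs (load_cost, mem_op_penalty, fill_cost are illustrative parameters, not real LRA cost values) purely to show the shape of the comparison:

```c
/* Illustrative expected-cost comparison for the BB1/BB2 example
   above.  All costs are arbitrary units invented for the example,
   not values from LRA's cost model.  */

/* With inheritance: the hard reg is loaded unconditionally in BB1
   and BB2 simply reuses it.  */
static double
cost_inherit (double load_cost)
{
  return load_cost;
}

/* Without inheritance: BB1's op keeps a memory operand (a small
   penalty), and the fill is paid only when BB2 actually executes.  */
static double
cost_no_inherit (double prob_bb2, double mem_op_penalty, double fill_cost)
{
  return mem_op_penalty + prob_bb2 * fill_cost;
}
```

When BB2 is rarely executed, leaving the pseudo spilled wins; when BB2 is hot, inheritance wins — which is exactly why a single probability cutoff can only approximate the right answer.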
(In reply to Jeffrey A. Law from comment #8)
> And for GCC 5, ISTM the question that hasn't been answered, particularly
> with regard to the second reproducer is whether or this is a regression for
> the overall performance of that code.
> 
> It's certainly possible that IRA determined that %ebx was better used to
> hold a different value and that the PIC register might end up in a
> call-clobbered register.  If the object in %ebx is heavily used, the
> benefits of keeping that value in %ebx may outweigh the cost of having the
> PIC value in a different register (say perhaps one that is call-clobbered).

I guess it is easy to check by preventing pic pseudo generation.
(In reply to Vladimir Makarov from comment #10)
> I guess it is easy to check by preventing pic pseudo generation.

The i386 back-end doesn't support a fixed PIC register any more. This test case demonstrates a performance regression in some EEMBC 1.1 tests caused by the pseudo PIC register introduction.

It is unclear why RA decided to spill the PIC register: if we look at the loop's code, the PIC register is used in each line of code and seems to be the most used register. It also seems weird to me that the code for the first loop becomes much better (with no PIC reg fills) if we restrict inlining for the other one. How does the second loop affect allocation in the first one?
I'm very aware that the x86 backend doesn't support a fixed PIC register anymore. RA was going to have to spill something.

The PIC register is needed in three different loops: L0, L1 and L2. L0 needs 9 general purpose registers, L1 needs 8 general purpose registers and L2 needs 9 general purpose registers. ie, there's going to be spills; there's simply no way around it.

r107 (PIC pseudo) gets split into 3 allocnos: A1, A11 and A237, covering Loops 0, 2, 1 respectively. It's live throughout most of the resultant function by way of explicit references and the need to have %ebx set up prior to external calls.

For Loop 1 & Loop 2, the respective allocnos (A237, A11) are not used/set within the loop at all, ie, they are transparent within their respective loops. IRA does exactly what we want here by keeping the PIC register in memory, which frees up a register within those loops for other objects that are used within the loop. Of course, to do that we have to reload the value for the uses outside the boundary of those loops.

Loop 0 (bbs 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19) is the most interesting and is the one where we have those annoying reloads.

One of the things I notice is that LRA is generating sequences like:

(insn 581 89 90 6 (set (reg:SI 3 bx [107])
        (mem/c:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) j.c:19 90 {*movsi_internal}
     (nil))
(insn 90 581 91 6 (set (reg/f:SI 3 bx [orig:142 D.2145 ] [142])
        (mem/f/c:SI (plus:SI (reg:SI 3 bx [107])
                (const:SI (unspec:SI [
                            (symbol_ref:SI ("out") [flags 0x2] <var_decl 0x7ffff670bc60 out>)
                        ] UNSPEC_GOTOFF))) [1 out+0 S4 A32])) j.c:19 90 {*movsi_internal}
     (nil))

Note how we load %ebx from memory, then use/clobber it in the next insn. That makes it impossible for the post-reload optimizers to help clean this up. How hard would it be to generate code in LRA where those two insns set different registers in those local snippets of code?
In this particular case, %ebp is locally available and there's no strong reason why we have to use %ebx. By using different destinations for insns 581 and 90, the source MEM of insn 581 would be available after insn 90. And by making the value available postreload-gcse would be able to commonize those loads.
(In reply to Jeffrey A. Law from comment #12)
> 
> One of the things I notice is that LRA is generating sequences like:
> (insn 581 89 90 6 (set (reg:SI 3 bx [107])
>         (mem/c:SI (plus:SI (reg/f:SI 7 sp)
>                 (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) j.c:19 90
> {*movsi_internal}
>      (nil))
> (insn 90 581 91 6 (set (reg/f:SI 3 bx [orig:142 D.2145 ] [142])
>         (mem/f/c:SI (plus:SI (reg:SI 3 bx [107])
>                 (const:SI (unspec:SI [
>                             (symbol_ref:SI ("out") [flags 0x2] <var_decl
> 0x7ffff670bc60 out>)
>                         ] UNSPEC_GOTOFF))) [1 out+0 S4 A32])) j.c:19 90
> {*movsi_internal}
>      (nil))
> 
> Note how we load %ebx from memory, then use/clobber it in the next insn.
> That makes it impossible for the post-reload optimizers to help clean this
> up.  How hard would it be to generate code in LRA where those two insns set
> different registers in those local snippets of code?
> 
> In this particular case, %ebp is locally available and there's no strong
> reason why we have to use %ebx.  By using different destinations for insns
> 581 and 90, the source MEM of insn 581 would be available after insn 90.
> And by making the value available postreload-gcse would be able to commonize
> those loads.

Jeff, thanks for the analysis. The PIC pseudo really can be placed in different regs. There are a lot of places in RA where it is decided in what reg the pic pseudo will be placed for a particular live range. Probably there is a place where the decision is made for ebx without considering other alternatives. I'll investigate this problem.
Just trying to help out where I can. It's similar to the round-robin use of reload regs we've had in reload for a while. The idea was to hopefully have reloaded values lying around in a useful place more often to improve inheritance.
(In reply to Jeffrey A. Law from comment #14)
> Just trying to help out where I can.  It's similar to the round robin use
> of reload regs we've had in reload for a while.  THe idea was to hopefully
> have reloaded values lying around in a useful place more often to improve
> inheritance.

Yes, I remember this. LRA has an analogous technique called hard register levelling. The idea is the same: to use all hard registers with equal probability. I think there is some possibility to improve this by making it more CFG-sensitive, although that is harder to do in LRA because hard register allocation in LRA is done in a more global scope than in reload.
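The shared idea behind reload's round-robin reload regs and LRA's levelling can be sketched as follows; the bitmask representation and function name are invented for illustration and bear no relation to the actual reload/LRA data structures:

```c
/* Toy round-robin chooser over hard regs 0..31.  Each call starts
   scanning one past the previously chosen register, so repeated
   reloads spread across all free registers instead of always
   reusing the same one -- leaving more reloaded values lying
   around in distinct registers for later inheritance.  */
static int last_chosen = -1;

static int
choose_reload_reg (unsigned avail_mask)
{
  for (int i = 1; i <= 32; i++)
    {
      int r = (last_chosen + i) % 32;
      if (avail_mask & (1u << r))
        {
          last_chosen = r;
          return r;
        }
    }
  return -1;  /* no register available */
}
```

With regs 0-2 free, successive calls return 0, 1, 2, 0, ... instead of 0 every time.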
Author: vmakarov
Date: Fri Feb 27 22:02:05 2015
New Revision: 221070

URL: https://gcc.gnu.org/viewcvs?rev=221070&root=gcc&view=rev
Log:
2015-02-27  Vladimir Makarov  <vmakarov@redhat.com>

	PR target/64317
	* params.def (PARAM_LRA_INHERITANCE_EBB_PROBABILITY_CUTOFF): New.
	* params.h (LRA_INHERITANCE_EBB_PROBABILITY_CUTOFF): New.
	* lra-constraints.c: Include "params.h".
	(EBB_PROBABILITY_CUTOFF): Use
	LRA_INHERITANCE_EBB_PROBABILITY_CUTOFF.
	(lra_inheritance): Use '<' instead of '<=' for
	EBB_PROBABILITY_CUTOFF.
	* doc/invoke.texi (lra-inheritance-ebb-probability-cutoff): Document
	change.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/lra-constraints.c
    trunk/gcc/params.def
    trunk/gcc/params.h
Thanks Vlad, that patch helped. Prior to your patch we had 15 reloads of the PIC register from memory; after your patch we have just 9. However, several of the remaining 9 seem to be redundant.

LRA never considers a block starting with a label as participating in an EBB. That's overly conservative. A block can participate in an EBB if all of its predecessors are part of the same EBB.

That's particularly useful in CFGs like

   A
  /|
 / |
B  |
 \ |
  \|
   C

[ Flow downward of course. ]

If we assume that B is the fallthru path, then LRA will try to make AB into an EBB. But it will not consider C because C will start with a label. That ultimately causes missed inheritances in this example.

As an example, we have this in the .reload dump:

[ BB 7 ]
(insn 848 94 582 7 (set (reg:SI 6 bp [107])
        (mem/c:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) k.c:22 90 {*movsi_internal}
     (nil))
[ ... ]
(jump_insn 99 98 100 7 (set (pc)
        (if_then_else (geu (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref 103)
            (pc))) k.c:18 613 {*jcc_1}
     (int_list:REG_BR_PROB 5000 (nil)))

[ BB 8 ]
(insn 584 100 101 8 (set (reg:SI 3 bx [107])
        (reg:SI 6 bp [107])) k.c:19 90 {*movsi_internal}
     (nil))

[ Note we inherited the value in %ebp, this is good. ]
[ ... ]

[ BB 9 ]
(code_label 103 102 104 9 5 "" [1 uses])
[ ... ]
(insn 850 105 585 9 (set (reg:SI 6 bp [107])
        (mem/c:SI (plus:SI (reg/f:SI 7 sp)
                (const_int 28 [0x1c])) [4 %sfp+-4 S4 A32])) k.c:22 90 {*movsi_internal}
     (nil))

Note insn 850, which reloads the value from memory again. We correctly formed an EBB with BB7 and BB8, but we really should have extended that to BB9. It appears that fixing this would improve things further.

It's also the case that the post-reload passes aren't doing as good of a job at cleaning things up as they should, so I'll look at that in parallel if you can investigate the LRA side of things.
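The relaxed membership test proposed here — a label-started block may still join the EBB when all of its predecessors are already inside it — can be sketched like this; the bitmask CFG encoding is a toy invented for the example, not LRA's representation:

```c
#include <stdbool.h>

/* Toy CFG encoding: pred_mask has bit p set when block p is a
   predecessor of the candidate block; in_ebb has a bit set for
   each block already placed in the EBB being formed.  */
static bool
can_join_ebb (unsigned pred_mask, unsigned in_ebb)
{
  /* A block starting with a label may join the EBB when every one
     of its predecessors is already inside it (and it has at least
     one predecessor).  */
  return pred_mask != 0 && (pred_mask & ~in_ebb) == 0;
}
```

For the diamond above, with A = bit 0 and B = bit 1: once A and B are in the EBB, C (preds {A,B}) qualifies; if only A were in, C would not.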
Sigh... reload_cse_regs doesn't help us because of the CODE_LABEL at the start of block C in the mini-cfg referenced in c#17. In essence it's using the same extended block definition as LRA. postreload-gcse doesn't help us because it computes local properties, but never propagates them through the CFG. So a memory load that occurs in block A and is transparent through B is not seen as available at the end of B. Thus the load is only considered partially redundant at the start of C. SIGH.
Making reload_cse_regs less pessimistic at CODE_LABELs really isn't feasible. cselib isn't easily turned into a scoped hash table, and the multiple-assignment nature of RTL means that a simple scoped hash table isn't sufficient anyway.

I'm currently poking at postreload-gcse. The basic idea is to detect when an occurrence is transparent through a block and an occurrence exists in the block's one and only predecessor. In that case we pretend the occurrence also exists at the end of the current block. I don't expect this to occur often, thus I'm not computing transparency for each expression in all blocks and doing a traditional dataflow propagation.

With that special case in place, I get just 5 reloads of the PIC register, which appears minimal given the other register allocation decisions. I need to ponder a couple implementation details, but this may be the final resolution to this BZ :-)
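The special case described here can be modeled as a small recursive availability check: an expression is available at the end of a block if a local occurrence exists, or if the block is transparent and the expression is available in its single predecessor. This is an illustrative model (the struct and field names are invented), not the actual postreload-gcse.c code, and it assumes the single-predecessor chain is acyclic:

```c
#include <stdbool.h>

/* Per-block facts for one expression: a local occurrence reaches
   the block end, nothing in the block clobbers the expression
   (transparency), and the index of the unique predecessor
   (-1 when the block has zero or multiple predecessors).  */
struct bb_info
{
  bool local_avail;
  bool transp;
  int single_pred;
};

static bool
avail_at_end (const struct bb_info *bb, int b)
{
  if (bb[b].local_avail)
    return true;
  /* Walk up through transparent single-predecessor blocks; this
     is the extension discussed above, without full dataflow.  */
  if (bb[b].transp && bb[b].single_pred >= 0)
    return avail_at_end (bb, bb[b].single_pred);
  return false;
}
```

In the A/B/C diamond from c#17, a load occurring in A and transparent through B is then seen as available at the end of B, making it fully redundant at C.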
Still poking, but downgrading to P2.
(In reply to Jeffrey A. Law from comment #17)
> Thanks Vlad, that patch helped.  Prior to your patch we had 15 reloads of
> the PIC register from memory, after your patch we have just 9.  However,
> several of the remaining 9 seem to be redundant.
> 
> LRA never considers a block starting with a label as participating in an
> EBB.  That's overly conservative.  A block can participate in an EBB if all
> of its predecessors are part of the same EBB.
> 
> That's particularly useful in CFGs like
> 
>    A
>   /|
>  / |
> B  |
>  \ |
>   \|
>    C
> 
> [ Flow downward of course. ]
> 
> If we assume that B is the fallthru path, then LRA will try to make AB into
> an EBB.  But it will not consider C because C will start with a label.  That
> ultimately causes missed inheritances in this example.

I agree the bigger scope would help, but it is hard to implement. If we consider the hammock structure of the graph, the value of pseudo P inherited from A into C might be changed in B. It works if we don't change it in B, but then the structure can depend on which pseudos are considered for inheritance. It is also harder to update live info, and to undo the inheritance transformations if they fail.

The biggest problem with the PIC pseudo is that it cannot be treated like the rest of the pseudos from a live range splitting perspective, because of the possible creation of memory referring to the pic pseudo value. It is hard to decide what pseudo (or hard register) to use for such memory in a particular place, as the value can be in different pseudos (hard registers) if we permit usual pseudo live range splitting for the pic pseudo.
Yes, absolutely, we have to compensate for the case where B clobbers something useful, undoing transformations, etc. It may not be reasonably feasible within the current structure of LRA. I'm getting pretty close on getting postreload-gcse.c to handle this. Most of the complexity has been in untangling the bits from gcse.c that I want to re-use. There's obviously cleanup still to do and some instrumentation/analysis of the generated code on a wider codebase, but it looks promising.
So with a functional prototype in place, I can get good code for this test. However, deeper testing of that prototype is proving difficult simply because the postreload-gcse.c code very rarely does anything useful, at least on x86 or x86-64.

I put in some dumping in gcse_after_reload_main to always dump its statistics if its expression table has > 0 elements. Believe it or not, this *never* triggers during a bootstrap on x86_64. I was so amazed by this I fired up a VM and bootstrapped i686 as well; nope, never fires there either. Concerned that maybe my other changes were somehow getting in the way, I put in just the dumping code; it still doesn't fire. Yet it fires semi-regularly in the testsuite.

The original author of the code in postreload-gcse.c was doing their development on PPC boxes, so I'm going to grab one and see if I can get some useful testing on it.
Author: law
Date: Mon Mar 23 05:21:04 2015
New Revision: 221585

URL: https://gcc.gnu.org/viewcvs?rev=221585&root=gcc&view=rev
Log:
	PR rtl-optimization/64317
	* Makefile.in (OBJS): Add gcse-common.c
	* gcse.c: Include gcse-common.h
	(struct modify_pair_s): Move structure definition to gcse-common.h
	(compute_transp): Move function to gcse-common.c.
	(canon_list_insert): Similarly.
	(record_last_mem_set_info): Break out some code and put it into
	gcse-common.c.  Call into the new common code.
	(compute_local_properties): Pass additional arguments to
	compute_transp.
	* postreload-gcse.c: Include gcse-common.h and df.h
	(modify_mem_list_set, blocks_with_calls): New variables.
	(modify_mem_list, canon_modify_mem_list, transp): Likewise.
	(get_bb_avail_insn): Pass in the expression index too.
	(alloc_mem): Allocate memory for the new bitmaps and lists.
	(free_mem): Free memory for the new bitmaps and lists.
	(insert_expr_in_table): Record a bitmap index for each entry we
	add to the table.
	(record_last_mem_set_info): Call into common code in gcse-common.c.
	(get_bb_avail_insn): If no available insn was found in the requested
	BB.  If BB has a single predecessor, see if the expression is
	transparent in BB and available in that single predecessor.
	(compute_expr_transp): New wrapper for compute_transp.
	(eliminate_partially_redundant_load): Pass expression's bitmap_index
	to get_bb_avail_insn.  Compute next_pred_bb_end a bit later.
	(gcse_after_reload_main): If there are elements in the hash table,
	then compute transparency for all the elements in the hash table.
	* gcse-common.h: New file.
	* gcse-common.c: New file.

Added:
    trunk/gcc/gcse-common.c
    trunk/gcc/gcse-common.h
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/gcse.c
    trunk/gcc/postreload-gcse.c
Unwanted loads have been eliminated by the postreload-gcse improvements.
Forgot to change BZ state after committing fix.