#define wipememory2(_ptr,_set,_len) do { \
    volatile char *_vptr=(volatile char *)(_ptr); \
    unsigned long _vlen=(_len); \
    while(_vlen) { *_vptr=(_set); _vptr++; _vlen--; } \
  } while(0)
#define wipememory(_ptr,_len) wipememory2(_ptr,0,_len)

__attribute__((noinline, noclone)) void
_gcry_burn_stack (int bytes)
{
  char buf[64];

  wipememory (buf, sizeof buf);
  bytes -= sizeof buf;
  if (bytes > 0)
    _gcry_burn_stack (bytes);
}

is one of the hot parts of the gcrypt camellia256 ECB benchmark, which apparently slowed down in 4.7 compared to 4.6.  The routine is called many times, usually with bytes equal to 372.
The above first slowed down the whole benchmark from ~ 2040ms to ~ 2400ms (-O2 -march=corei7-avx -fPIC) in
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=178104
and then (somewhat surprisingly) became a tiny bit faster again with
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=181172
which no longer tail-recursion optimizes the function (in this case surprisingly a win, but generally IMHO a mistake).
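When the compiler does perform tail recursion elimination here (as it did before r181172), the function is effectively turned into a loop over the same 64-byte buffer, roughly like the following (an illustrative sketch only; it reuses the wipememory macro from above, and the _as_loop name is made up):

void
_gcry_burn_stack_as_loop (int bytes)
{
  char buf[64];                      /* a single 64-byte frame slot */
  do
    {
      wipememory (buf, sizeof buf);  /* clears the same buffer every iteration */
      bytes -= sizeof buf;
    }
  while (bytes > 0);
}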
While it will slow down this exact testcase even more, I think tailr/tailc passes should ignore CLOBBER stmts.
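For reference, the blocker is the end-of-scope clobber that is now emitted for buf: the tail of _gcry_burn_stack looks roughly like this in the GIMPLE dump (schematic sketch only; exact SSA names and basic-block layout will differ):

  _gcry_burn_stack (bytes_3);   /* candidate tail recursion */
  buf ={v} {CLOBBER};           /* clobber stmt between the call and the return */
  return;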
This is also a libgcrypt bug, because clearly if it wants to burn some portion of the stack (presumably for security reasons), then:
1) if it doesn't prevent tail recursion or a tail call to itself, it doesn't do what it is trying to do; in particular, it keeps overwriting the same portion of memory with zeros over and over;
2) even if it isn't tail recursion/call optimized, on most targets it will still leave gaps on the stack that are not overwritten.
So, all in all, a quite questionable way of burning cycles in the crypto library.
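For illustration only, a non-recursive variant along these lines (a sketch under the above assumptions, not libgcrypt's actual code; the burn_stack_sketch name and the use of alloca are mine) would at least clear one contiguous block of the requested size, rather than rewriting the same 64 bytes and leaving per-frame gaps:

#include <alloca.h>   /* alloca; some systems declare it in <stdlib.h> instead */
#include <stddef.h>

void
burn_stack_sketch (int bytes)
{
  if (bytes > 0)
    {
      /* Allocate the whole block at once, so there is no tail recursion
         to (not) optimize and no gaps between the cleared regions.  */
      volatile char *p = (volatile char *) alloca ((size_t) bytes);
      size_t len = (size_t) bytes;
      while (len)
        {
          *p = 0;
          p++;
          len--;
        }
    }
}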
Created attachment 26681 [details]
gcc47-pr52285.patch

Patch to ignore clobber stmts in tailc/tailr passes.  This turns this testcase back into a loop, for which the bad IVOPT decision matters.
Author: jakub
Date: Thu Feb 16 22:20:27 2012
New Revision: 184317

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=184317
Log:
	PR tree-optimization/52285
	* tree-tailcall.c (find_tail_calls): Ignore gimple_clobber_p
	stmts when deciding if a call is a tail call or tail recursion.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-tailcall.c
void
foo (int x)
{
  char buf[64];
  do
    {
      volatile char *p = buf;
      unsigned long l = sizeof buf;
      while (l)
        {
          *p = 0;
          p++;
          l--;
        }
      x -= sizeof buf;
    }
  while (x > 0);
}

is a testcase without a tail call.  With r184317 we keep the &buf[64] inside of the loop, but unfortunately &buf[64] is (reg:DI 20 frame) here (&buf[0] is frame-64), and therefore RTL lim doesn't do anything with it.  And when reload reloads that (reg:DI frame) into (minus:DI (reg:DI sp) (const_int -8)), it doesn't place the lea before the loop, even when the register pressure is low.  And no further pass does the loop invariant move either.
I wonder if we shouldn't, in RTL lim, just load (frame) and other most likely eliminable registers before the loop into some new pseudo (with REG_EQUIV) and use that in the loop; I think if register pressure is high the RA should rematerialize those inside of the loop, shouldn't it?
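At the source level, the motion being asked for corresponds roughly to the following (an illustrative sketch only; the names are mine, and the actual transformation would of course happen on RTL, where &buf[64] is the frame register):

void
foo_hoisted (int x)
{
  char buf[64];
  /* The loop-invariant end-of-buffer address (&buf[64]) is computed
     once, before the loops, instead of being re-derived from the
     frame pointer on every comparison.  */
  volatile char *const end = buf + sizeof buf;
  do
    {
      volatile char *p = buf;
      while (p != end)
        {
          *p = 0;
          p++;
        }
      x -= sizeof buf;
    }
  while (x > 0);
}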
Confirmed.
GCC 4.7.0 is being released, adjusting target milestone.
GCC 4.7.1 is being released, adjusting target milestone.
GCC 4.7.2 has been released.
There are several reasons why RTL LIM cannot currently hoist the (frame) rtx.

The first is that in general it stays away from any HARD_REGISTER_P reg with a 10-foot pole.  For most hard registers this is probably a good strategy: anything that's in a "real" hard register at this point is there for a reason (function return, global reg, whatever) and almost certainly not invariant in a loop.

Second, RTL LIM only hoists expressions that can be assigned to the original SET_DEST of a single set.  In the case of this PR, the insn in question is:

;; UD chains for insn luid 2 uid 15
;;    reg 20 { }
;;    reg 63 { d2(bb 4 insn 12) }
(insn 15 12 16 4 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg/v/f:DI 63 [ p ])
            (reg/f:DI 20 frame))) PR52285.c:10 8 {*cmpdi_1}
     (nil))

This fails in may_assign_reg_p because (reg:CCZ 17) can't be assigned to (it is a hard register, and I suppose it has class NO_REGS), so the SET_SRC is not even considered by find_invariant_insn as a potential invariant.  I think this condition can be relaxed with something like:

Index: loop-invariant.c
===================================================================
--- loop-invariant.c	(revision 193454)
+++ loop-invariant.c	(working copy)
@@ -874,11 +874,11 @@
   dest = SET_DEST (set);
 
   if (!REG_P (dest)
-      || HARD_REGISTER_P (dest))
+      || HARD_REGISTER_P (dest)
+      || !may_assign_reg_p (dest))
     simple = false;
 
-  if (!may_assign_reg_p (SET_DEST (set))
-      || !check_maybe_invariant (SET_SRC (set)))
+  if (!check_maybe_invariant (SET_SRC (set)))
     return;
 
   /* If the insn can throw exception, ...

Finally, RTL LIM cannot hoist parts of expressions.  It only hoists the SET_SRC as a whole, or nothing at all.  I have patches for that, originally developed to hoist addresses out of MEMs.  I'll dust them off and see if I can make them handle "(reg:frame + CONST_INT)" and other expressions that involve eliminable regs.
I think (plus (frame) (const_int)) is likely not an issue (at least in the common case; it could only be a problem if eliminated into something that needs a much bigger offset that doesn't fit into the instruction anymore).  The problem is an eliminable register alone, used somewhere where (plus (eliminate_to) (const_int small_int)) wouldn't be handled and thus would need to be reloaded.
If loops are still around at LRA time, perhaps LRA should consider putting it before the loop if register pressure is low, or LIM could just have extra code for this (first handle normal IV motions and just record whether there are any eliminable regs not used inside of a plus with a const_int, and at the end, if register pressure still isn't too high, consider just creating a new insn that sets a pseudo to (frame) or another eliminable register before the loop and replacing all uses of (frame) in the loop with that).
I'm not saying it must be LIM, I'm just looking for suggestions on where to perform this.
Created attachment 28678 [details]
Gross hack

(In reply to comment #11)
> If loops are still around at LRA time, perhaps LRA should consider putting
> it before loop if register pressure is low, or LIM could just have extra
> code for this

Unfortunately, loops are destroyed _just_ before LRA, at the end of IRA.  IRA has its own loop tree but that is destroyed before LRA, too.

> I'm not saying it must be LIM, I'm
> just looking for suggestions where to perform this.

LIM may be too early.  I've experimented with the attached patch (based off some other patch for invariant addresses that was bit-rotting on a shelf) and I had to resort to some crude hacks to make loop-invariant even just consider moving the bare frame_pointer_rtx, like manually setting the cost to something high because set_src_cost(frame_pointer_rtx)==0.

The result is this code:

foo:
        leaq    -72(%rsp), %rcx
        leaq    -8(%rsp), %rdx   // A Pyrrhic victory...
        .p2align 4,,10
        .p2align 3
.L5:
        movq    %rcx, %rax
        .p2align 4,,10
        .p2align 3
.L3:
        movb    $0, (%rax)
        addq    $1, %rax
        cmpq    %rdx, %rax
        jne     .L3
        subl    $64, %edi
        testl   %edi, %edi
        jg      .L5
        rep ret

Need to think about this a bit more; perhaps postreload-gcse can be used for this instead of LIM...
GCC 4.7.3 is being released, adjusting target milestone.
The 4.7 branch is being closed, moving target milestone to 4.8.4.
GCC 4.8.4 has been released.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
GCC 4.9 branch is being closed
GCC 5 branch is being closed
Vlad, your thoughts on this? Can it be done in LRA or postreload-gcse or some other post-LRA pass (if they do have loops)?
(In reply to Jakub Jelinek from comment #20)
> Vlad, your thoughts on this?  Can it be done in LRA or postreload-gcse or
> some other post-LRA pass (if they do have loops)?

I don't think it can be done easily after RA.  There are no optimizations working with loops after IRA.  postreload-gcse is a very simple pass working with loads/stores; it would be hard to modify it (e.g. into PRE) to do invariant motion.  Probably we could repeat the loop invariant motion pass again after LRA, but that needs investigation, and running such a heavy pass for rare cases may be doubtful.
I don't think this PR will be fixed for GCC-8, or even later.  It will be hard to justify the compilation slowdown without visible performance improvements, for example on SPEC.
GCC 6 branch is being closed
Won't be working on this any time soon...
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.