Bug 52285 - [9/10/11/12/13 Regression] libgcrypt _gcry_burn_stack slowdown
Summary: [9/10/11/12/13 Regression] libgcrypt _gcry_burn_stack slowdown
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.7.0
: P2 normal
Target Milestone: 9.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2012-02-16 18:22 UTC by Jakub Jelinek
Modified: 2022-04-28 14:48 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-02-20 00:00:00


Attachments
gcc47-pr52285.patch (444 bytes, patch)
2012-02-16 18:50 UTC, Jakub Jelinek
Details | Diff
Gross hack (3.08 KB, patch)
2012-11-13 23:37 UTC, Steven Bosscher
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jakub Jelinek 2012-02-16 18:22:40 UTC
#define wipememory2(_ptr,_set,_len) do { \
              volatile char *_vptr=(volatile char *)(_ptr); \
              unsigned long _vlen=(_len); \
              while(_vlen) { *_vptr=(_set); _vptr++; _vlen--; } \
                  } while(0)
#define wipememory(_ptr,_len) wipememory2(_ptr,0,_len)

__attribute__((noinline, noclone)) void
_gcry_burn_stack (int bytes)
{
  char buf[64];
    
    wipememory (buf, sizeof buf);
    bytes -= sizeof buf;
    if (bytes > 0)
        _gcry_burn_stack (bytes);
}

is one of the hot parts of gcrypt camellia256 ECB benchmark which apparently slowed down in 4.7 compared to 4.6.  The routine is called many times, usually with bytes equal to 372.

The above first slowed down the whole benchmark from ~ 2040ms to ~ 2400ms (-O2 -march=corei7-avx -fPIC) in http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=178104
and then (somewhat surprisingly) became tiny bit faster again with
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=181172 , which no longer tail recursion optimizes the function (in this case suprisingly a win, but generally IMHO a mistake).
Comment 1 Jakub Jelinek 2012-02-16 18:25:24 UTC
While it will slow down this exact testcase even more, I think tailr/tailc passes should ignore CLOBBER stmts.
Comment 2 Jakub Jelinek 2012-02-16 18:39:57 UTC
This is also a libgcrypt bug, because clearly if it wants to burn some portion of the stack (assuming for security reasons), then
1) if it doesn't prevent tail recursion or tail call to itself, it doesn't do what it is trying to do, in particular it keeps overwriting with zeros the same portion of memory over and over
2) even if it isn't tail recursion/call optimized, on most targets it will still leave gaps on the stack not overwritten
So, all in all, quite questionable way of burning cycles in the crypto library.
Comment 3 Jakub Jelinek 2012-02-16 18:50:27 UTC
Created attachment 26681 [details]
gcc47-pr52285.patch

Patch to ignore clobber stmts in tailc/tailr passes.  This turns this testcase back into a loop, for which the bad IVOPT decision matters.
Comment 4 Jakub Jelinek 2012-02-16 22:20:31 UTC
Author: jakub
Date: Thu Feb 16 22:20:27 2012
New Revision: 184317

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=184317
Log:
	PR tree-optimization/52285
	* tree-tailcall.c (find_tail_calls): Ignore gimple_clobber_p stmts
	when deciding if a call is a tail call or tail recursion.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-tailcall.c
Comment 5 Jakub Jelinek 2012-02-20 11:13:33 UTC
void
foo (int x)
{
  char buf[64];
  do
    {
      volatile char *p = buf;
      unsigned long l = sizeof buf;
      while (l)
        {
          *p = 0;
          p++;
          l--;
        }
      x -= sizeof buf;
    }
  while (x > 0);
}

is a testcase without tail call.  With r184317 we keep the &buf[64] inside of the loop, but unfortunately &buf[64] is (reg:DI 20 frame) here (&buf[0] is frame-64), and therefore RTL lim doesn't do anything with it.
And when reload reloads that (reg:DI frame) into (minus:DI (reg:DI sp) (const_int -8)), it doesn't place the lea before the loop, even when the register pressure is low.  And no further pass does the loop invariant move either.

I wonder if we shouldn't in RTL lim just load (frame) and other most likely
eliminable registers before the loop into some new pseudo (with REG_EQUIV) and use it in the loop, I think if register pressure is high RA should rematerialize those inside of the loop, shouldn't it?
Comment 6 Andrew Pinski 2012-02-20 21:45:53 UTC
Confirmed.
Comment 7 Richard Biener 2012-03-22 08:26:47 UTC
GCC 4.7.0 is being released, adjusting target milestone.
Comment 8 Richard Biener 2012-06-14 08:29:16 UTC
GCC 4.7.1 is being released, adjusting target milestone.
Comment 9 Jakub Jelinek 2012-09-20 10:18:54 UTC
GCC 4.7.2 has been released.
Comment 10 Steven Bosscher 2012-11-13 10:30:27 UTC
There are several reasons why RTL LIM cannot currently hoist the (frame) rtx.


The first is that in general it stays away from any HARD_REGISTER_P reg with
a 10-foot pole.  For most hard registers this is probably a good strategy:
Anything that's in a "real" hard register at this point is there for a reason
(function return, global reg, whatever) and almost certainly not invariant in
a loop.


Second, RTL LIM only hoists expression that can be assigned to the original
SET_DEST of a single set. In the case of this PR, the insn in case is:

;;   UD chains for insn luid 2 uid 15
;;      reg 20 { }
;;      reg 63 { d2(bb 4 insn 12) }
(insn 15 12 16 4 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg/v/f:DI 63 [ p ])
            (reg/f:DI 20 frame))) PR52285.c:10 8 {*cmpdi_1}
     (nil))

This fails in may_assign_reg_p because (reg:CCZ 17) can't be assigned to (it
is a hard register, and I suppose it has class NO_REGS), so the SET_SRC is
not even considered by find_invariant_insn as a potential invariant.  I think
this condition can be relaxed with something like,

Index: loop-invariant.c
===================================================================
--- loop-invariant.c    (revision 193454)
+++ loop-invariant.c    (working copy)
@@ -874,11 +874,11 @@
   dest = SET_DEST (set);

   if (!REG_P (dest)
-      || HARD_REGISTER_P (dest))
+      || HARD_REGISTER_P (dest)
+      || !may_assign_reg_p (dest))
     simple = false;

-  if (!may_assign_reg_p (SET_DEST (set))
-      || !check_maybe_invariant (SET_SRC (set)))
+  if (!check_maybe_invariant (SET_SRC (set)))
     return;

   /* If the insn can throw exception, ...


Finally, RTL LIM cannot hoist parts of expressions.  It only hoists the
SET_SRC as a whole, or nothing at all.  I have patches for that, originally
developed to hoist addresses out of MEMs.  I'll dust them off and see if
I can make it handle "(reg:frame + CONST_INT)" and other expressions that 
involve eliminable regs.
Comment 11 Jakub Jelinek 2012-11-13 10:43:58 UTC
I think (plus (frame) (const_int)) is likely not an issue (at least in the common case, could be only problem if eliminated into something that needs much bigger offset that doesn't fit into the instruction anymore), the problem is an eliminable register alone used somewhere where (plus (eliminate_to) (const_int small_int)) wouldn't be handled and thus would need to be reloaded.
If loops are still around at LRA time, perhaps LRA should consider putting it before loop if register pressure is low, or LIM could just have extra code for this (first handle normal IV motions and just record if there are any eliminable regs not used inside of plus with const_int, and at the end if register pressure still isn't too high consider just creating a new insn that sets a pseudo to (frame) or other eliminable register before loop and replacing all uses of (frame) in the loop with that.  I'm not saying it must be LIM, I'm just looking for suggestions where to perform this.
Comment 12 Steven Bosscher 2012-11-13 23:37:52 UTC
Created attachment 28678 [details]
Gross hack

(In reply to comment #11)
> If loops are still around at LRA time, perhaps LRA should consider putting
> it before loop if register pressure is low, or LIM could just have extra
> code for this

Unfortunately, loop are destroyed _just_ before LRA, at the end of IRA.
IRA has its own loop tree but that is destroyed before LRA, too.


> I'm not saying it must be LIM, I'm
> just looking for suggestions where to perform this.

LIM may be too early. I've experimented with the attached patch (based off
some other patch for invariant addresses that was bit-rotting on a shelf)
and I had to resort to some crude hacks to make loop-invariant even just
consider moving the bare frame_pointer_rtx, like manually setting the cost
to something high because set_src_cost(frame_pointer_rtx)==0.  The result
is this code:

foo:
        leaq    -72(%rsp), %rcx
        leaq    -8(%rsp), %rdx     // A Pyrrhic victory...
        .p2align 4,,10
        .p2align 3
.L5:
        movq    %rcx, %rax
        .p2align 4,,10
        .p2align 3
.L3:
        movb    $0, (%rax)
        addq    $1, %rax
        cmpq    %rdx, %rax
        jne     .L3
        subl    $64, %edi
        testl   %edi, %edi
        jg      .L5
        rep ret


Need to think about this a bit more, perhaps postreload-gcse can be used
for this instead of LIM...
Comment 13 Richard Biener 2013-04-11 07:59:23 UTC
GCC 4.7.3 is being released, adjusting target milestone.
Comment 14 Richard Biener 2014-06-12 13:46:25 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 15 Jakub Jelinek 2014-12-19 13:35:55 UTC
GCC 4.8.4 has been released.
Comment 16 Richard Biener 2015-06-23 08:18:39 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 17 Jakub Jelinek 2015-06-26 19:59:18 UTC
GCC 4.9.3 has been released.
Comment 18 Richard Biener 2016-08-03 10:55:16 UTC
GCC 4.9 branch is being closed
Comment 19 Jakub Jelinek 2017-10-10 13:28:03 UTC
GCC 5 branch is being closed
Comment 20 Jakub Jelinek 2018-02-28 17:33:17 UTC
Vlad, your thoughts on this?  Can it be done in LRA or postreload-gcse or some other post-LRA pass (if they do have loops)?
Comment 21 Vladimir Makarov 2018-02-28 21:29:38 UTC
(In reply to Jakub Jelinek from comment #20)
> Vlad, your thoughts on this?  Can it be done in LRA or postreload-gcse or
> some other post-LRA pass (if they do have loops)?

I don't think it can be done easily after RA.  There are no optimizations working with loops after IRA.  postreload-gcse is very simple pass working with loads/stores.  It is hard to modify it (e.g. into PRE) to do invariant motion. Probably we could repeat loop invariant motion pass again after LRA but it needs investigation and running such heavy pass for rare cases may be doubtful.

I don't think that this PR will be fixed for GCC-8 or even later.  It will be hard to justify compilation slowdown without visible performance improvements for example on SPEC.
Comment 22 Jakub Jelinek 2018-10-26 10:13:17 UTC
GCC 6 branch is being closed
Comment 23 Steven Bosscher 2019-03-05 08:16:56 UTC
Won't be working on this any time soon...
Comment 24 Richard Biener 2019-11-14 07:58:07 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 25 Jakub Jelinek 2020-03-04 09:39:23 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 26 Jakub Jelinek 2021-05-14 09:46:36 UTC
GCC 8 branch is being closed.
Comment 27 Richard Biener 2021-06-01 08:05:35 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.