63191 – [5/6 Regression] 32-bit gcc uses excessive memory during dead store elimination with -fPIC

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	4.9.1

Importance:	P2 normal
Target Milestone:	7.0
Assignee:	Steven Bosscher

URL:
Keywords:	memory-hog

Depends on:
Blocks:	47344

Reported:	2014-09-05 20:12 UTC by woodfin
Modified:	2017-03-23 06:47 UTC
CC List:	7 users

See Also:
Host:
Target:	i?86--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2014-09-08 00:00:00

Attachments
Source triggering memory usage (21.57 KB, text/plain) 2014-09-05 20:12 UTC, woodfin

Description woodfin 2014-09-05 20:12:50 UTC

Created attachment 33453
Source triggering memory usage

Every version of gcc I have tried, from 4.4 through 4.9.1, on 32-bit Linux uses excessive memory (eventually exhausting all 3GB of virtual address space) when I compile the attached (generated, simplified) source with

g++ -c -O2 -fPIC badmatch.cpp

4.3 does not exhibit the issue.

64-bit Linux (gcc 4.4.4 in that case) does not exhibit the issue, and memory usage tops out at ~440MB. gcc 4.4.4 on 32-bit Linux does exhibit the issue.

Dropping -fPIC causes memory usage to top out at ~175MB and the compile to succeed.

-fno-dse eliminates the problem as well.

A look at the mallocs in the debugger suggests that the explosion of allocations is happening under rest_of_handle_dse(), e.g.:

#0  0x00332e46 in malloc () from /lib/libc.so.6
#1  0x08a26148 in xmalloc ()
#2  0x0829285b in pool_alloc(alloc_pool_def*) () at ../.././gcc/alloc-pool.c:281
#3  0x082f076e in cselib_lookup(rtx_def*, machine_mode, int, machine_mode) () at ../.././gcc/cselib.c:1303
#4  0x089536e2 in canon_address(rtx_def*, int*, int*, long long*, cselib_val**) () at ../.././gcc/dse.c:1182
#5  0x08954770 in record_store(rtx_def*, bb_info*) () at ../.././gcc/dse.c:1443
#6  0x08955e6a in rest_of_handle_dse() () at ../.././gcc/dse.c:2616
#7  0x084bd923 in execute_one_pass(opt_pass*) () at ../.././gcc/passes.c:2233

The most recent version with which I have tested:

>g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/nethome/woodfin/opt/lnxrhx86/gcc-4.9.1/libexec/gcc/i686-pc-linux-gnu/4.9.1/lto-wrapper
Target: i686-pc-linux-gnu
Configured with: ./configure --prefix=/nethome/woodfin/opt/lnxrhx86/gcc-4.9.1 --enable-languages=c,c++ --with-mpfr=/nethome/woodfin/opt/lnxrhx86/mpfr-2.4.2 --with-gmp=/nethome/woodfin/opt/lnxrhx86/gmp-4.3.2 --with-mpc=/nethome/woodfin/opt/lnxrhx86/mpc-1.0.2
Thread model: posix
gcc version 4.9.1 (GCC)

Comment 1 Richard Biener 2014-09-08 08:58:28 UTC

Confirmed.  Possibly excessive value_rtx expansion from dse.c:canon_address.

The testcase is a function with a single basic-block and 30000 stores
(the static initializer function) with the pattern

  D.94947 = (struct Z *) &Zs;
  D.94947->x1_ = &Xs1[0];
  D.94947->x2_ = 1;
  D.94947->x3_ = 1;
  temp.20397 = D.94947 + 12;
  temp.20397->x1_ = &Xs90[0];
  temp.20397->x2_ = 2;
  temp.20397->x3_ = 1;
...
  temp.30587 = temp.30586 + 12;
  temp.30587->x1_ = &Xs611[0];
  temp.30587->x2_ = 2;
  temp.30587->x3_ = 1;

thus groups of three stores followed by an address adjustment.  The above
is from a GCC 4.3 IL dump.

The GCC 4.9 IL dump shows

  MEM[(struct Z *)&Zs].x1_ = &Xs1;
  MEM[(struct Z *)&Zs].x2_ = 1;
  MEM[(struct Z *)&Zs].x3_ = 1;
  MEM[(struct Z *)&Zs + 12B].x1_ = &Xs90;
  MEM[(struct Z *)&Zs + 12B].x2_ = 2;
  MEM[(struct Z *)&Zs + 12B].x3_ = 1;
  MEM[(struct Z *)&Zs + 24B].x1_ = &Xs91;
  MEM[(struct Z *)&Zs + 24B].x2_ = 2;
  MEM[(struct Z *)&Zs + 24B].x3_ = 1;
...
  MEM[(struct Z *)&Zs + 122292B].x1_ = &Xs611;
  MEM[(struct Z *)&Zs + 122292B].x2_ = 2;
  MEM[(struct Z *)&Zs + 122292B].x3_ = 1;

which causes each store to be expanded via something like

(insn 71298 71297 71299 2 (set (reg:SI 40822)
        (const:SI (unspec:SI [
                    (symbol_ref:SI ("_ZL2Zs") [flags 0x2]  <var_decl 0x7ffff5c4a098 Zs>)
                ] UNSPEC_GOTOFF))) t.C:5 -1
     (nil))
(insn 71299 71298 71300 2 (set (mem/c:SI (plus:SI (plus:SI (reg:SI 3 bx)
                    (reg:SI 40822))
                (const_int 122216 [0x1dd68])) [4 MEM[(struct Z *)&Zs + 122208B].x3_+0 S4 A64])
        (const_int 1 [0x1])) t.C:5 -1
     (nil))

I suppose "lowering" PIC addresses somewhere before RTL expansion (and
CSEing the addresses) would help here.  Lowering as in not treating
them as is_gimple_min_invariant.

With 4.3 we have a single address load for &Zs (but of course we retain
the individual stored addresses loads - thus still very many PIC addresses
in this function).

Why is CSE not able to CSE the UNSPEC_GOTOFF addresses?  Does it not do
it because of the (const:SI ...) wrapping (as in, not profitable)?  Or is
it confused about the other intermediate UNSPEC_GOTOFF uses?

That said, cse1 should be able to turn the RTL into sth equivalent to
what 4.3 produced.
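
As a source-level analogy of the two IL shapes above, here is a minimal sketch. It is not the attached testcase: the struct layout, the array names ZsBumped and ZsOffset, and the helper same_contents are hypothetical, and only three elements are written. The first function mirrors the 4.3 dump (one base address, then a temporary bumped after each group of three stores); the second mirrors the 4.9 dump (every store names the symbol plus a constant offset).

```cpp
// Hypothetical stand-in for the struct in the attached testcase.
struct Z {
    const int* x1_;
    int x2_;
    int x3_;
};

// Hypothetical targets for the x1_ pointers.
static const int Xs1[1] = {0}, Xs90[1] = {0}, Xs91[1] = {0};
static Z ZsBumped[3];
static Z ZsOffset[3];

// 4.3-like shape: take the base address once, then advance a temporary
// by sizeof(Z) after each group of three stores.
void fill_bumped() {
    Z* p = ZsBumped;
    p->x1_ = Xs1;  p->x2_ = 1; p->x3_ = 1;
    p = p + 1;
    p->x1_ = Xs90; p->x2_ = 2; p->x3_ = 1;
    p = p + 1;
    p->x1_ = Xs91; p->x2_ = 2; p->x3_ = 1;
}

// 4.9-like shape: every store addresses the symbol plus a constant offset.
void fill_offset() {
    ZsOffset[0].x1_ = Xs1;  ZsOffset[0].x2_ = 1; ZsOffset[0].x3_ = 1;
    ZsOffset[1].x1_ = Xs90; ZsOffset[1].x2_ = 2; ZsOffset[1].x3_ = 1;
    ZsOffset[2].x1_ = Xs91; ZsOffset[2].x2_ = 2; ZsOffset[2].x3_ = 1;
}

// Both shapes must leave identical contents behind.
bool same_contents() {
    for (int i = 0; i < 3; ++i) {
        if (ZsBumped[i].x1_ != ZsOffset[i].x1_ ||
            ZsBumped[i].x2_ != ZsOffset[i].x2_ ||
            ZsBumped[i].x3_ != ZsOffset[i].x3_)
            return false;
    }
    return true;
}
```

The optimizers are free to turn either function into the other shape, so this only illustrates what the dumps show; it is not expected to reproduce the memory growth.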

Comment 2 Richard Biener 2014-09-08 09:01:45 UTC

With

int a, b, c, d;
struct X { int a; int b; void *p; } z[4];
void foo (void)
{
  z[0].a = 1;
  z[0].b = 2;
  z[0].p = &a;
  z[1].a = 1;
  z[1].b = 2;
  z[1].p = &b;
  z[2].a = 1;
  z[2].b = 2;
  z[2].p = &c;
  z[3].a = 1;
  z[3].b = 2;
  z[3].p = &d;
}

CSEing of the GOT load of z works.

Comment 3 Jakub Jelinek 2014-12-19 13:26:41 UTC

GCC 4.8.4 has been released.

Comment 4 Steven Bosscher 2015-04-07 20:23:32 UTC

How is one to reproduce this bug with GCC5? I've tried:

$ ./xg++ --version
xg++ (GCC) 5.0.0 20150407 (experimental) [trunk revision 221906]
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ./xg++ -B. -S -O2 -m32 -fPIC PR63191.cc -fdump-tree-optimized
$ cat PR63191.cc.190t.optimized

;; Function (static initializers for PR63191.cc) (_GLOBAL__sub_I_PR63191.cc, funcdef_no=4, decl_uid=14028, cgraph_uid=4, symbol_order=1500) (executed once)

(static initializers for PR63191.cc) ()
{
  <bb 2>:
  return;

}


$ 

So AFAICT GCC5 optimizes the test case of comment #0 to an empty file.
I'm sure there's a way to avoid optimizing this to empty, but I'm not
quite a C++ guru ;-)

Comment 5 woodfin 2015-04-07 20:42:38 UTC

You could try adding a non-static function that returns an address inside Zs.

const Z* getzs() {
  return &Zs[0];
}

I'd think that would force it to actually perform the initialization if the contents can be externally accessed.

Sorry, I don't have a gcc 5.0 environment yet. I'll set one up if you still can't reproduce this there.

Comment 6 Steven Bosscher 2015-04-07 21:30:09 UTC

(In reply to woodfin from comment #5)
> You could try adding a non-static function that returns an address inside Zs.
> 
> const Z* getzs() {
>   return &Zs[0];
> }

Yes, that does the trick:
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
25244 stevenb   20   0 5964m 5.8g  30m R   100  9.3  25:03.60 cc1plus
(and counting)

Now let's see if I can come up with a more reasonable test case...

Comment 7 Steven Bosscher 2015-04-07 21:39:03 UTC

(In reply to Steven Bosscher from comment #6)
> Now let's see if I can come up with a more reasonable test case...

Like so:

----- 8< -----
typedef int X;

struct Z {
    Z(const X* x1, X x2, X x3) :
      x1_(x1), x2_(x2), x3_(x3) {}
    const X* x1_;
    X x2_;
    X x3_;
};

#undef X____1
#undef X___10
#undef X__100
#undef X_1000
#undef X10000
#define X____1(N) \
  static const X Xs##N[] = {};
#define X___10(N) \
  X____1(N##0) X____1(N##1) X____1(N##2) X____1(N##3) X____1(N##4) \
  X____1(N##5) X____1(N##6) X____1(N##7) X____1(N##8) X____1(N##9)
#define X__100(N) \
  X___10(N##0) X___10(N##1) X___10(N##2) X___10(N##3) X___10(N##4) \
  X___10(N##5) X___10(N##6) X___10(N##7) X___10(N##8) X___10(N##9)
#define X_1000(N) \
  X__100(N##0) X__100(N##1) X__100(N##2) X__100(N##3) X__100(N##4) \
  X__100(N##5) X__100(N##6) X__100(N##7) X__100(N##8) X__100(N##9)
#define X10000(N) \
  X_1000(N##0) X_1000(N##1) X_1000(N##2) X_1000(N##3) X_1000(N##4) \
  X_1000(N##5) X_1000(N##6) X_1000(N##7) X_1000(N##8) X_1000(N##9)

X10000(0)

#undef Z____1
#undef Z___10
#undef Z__100
#undef Z_1000
#undef Z10000
#define Z____1(N,I,J) \
  Z(Xs##N,1,1),
#define Z___10(N) \
  Z____1(N##0,1,1) Z____1(N##0,1,1) \
  Z____1(N##0,1,1) Z____1(N##1,2,1) \
  Z____1(N##0,1,1) Z____1(N##2,1,2) \
  Z____1(N##0,1,1) Z____1(N##3,6,3) \
  Z____1(N##0,1,1) Z____1(N##4,7,2) \
  Z____1(N##0,1,1) Z____1(N##5,1,3) \
  Z____1(N##0,1,1) Z____1(N##6,5,9) \
  Z____1(N##0,1,1) Z____1(N##7,7,1) \
  Z____1(N##0,1,1) Z____1(N##8,3,3) \
  Z____1(N##0,1,1) Z____1(N##9,2,2)
#define Z__100(N) \
  Z___10(N##0) Z___10(N##1) Z___10(N##2) Z___10(N##3) Z___10(N##4) \
  Z___10(N##5) Z___10(N##6) Z___10(N##7) Z___10(N##8) Z___10(N##9)
#define Z_1000(N) \
  Z__100(N##0) Z__100(N##1) Z__100(N##2) Z__100(N##3) Z__100(N##4) \
  Z__100(N##5) Z__100(N##6) Z__100(N##7) Z__100(N##8) Z__100(N##9)
#define Z10000(N) \
  Z_1000(N##0) // Z_1000(N##1) Z_1000(N##2) Z_1000(N##3) Z_1000(N##4) \
  // Z_1000(N##5) Z_1000(N##6) Z_1000(N##7) Z_1000(N##8) Z_1000(N##9)

static const X XsLast[] = {};
static const Z Zs[] = { Z10000(0) Z(XsLast,1,1) };

const Z* getzs() {
    return &Zs[0];
}

----- 8< -----

exploding in DSE:
 dead store elim1        :  45.34 (15%) usr   0.19 (28%) sys  45.53 (15%) wall 1016985 kB (45%) ggc

Comment 8 Richard Biener 2015-06-23 08:16:05 UTC

The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.

Comment 9 Jakub Jelinek 2015-06-26 19:53:09 UTC

GCC 4.9.3 has been released.

Comment 10 Richard Biener 2016-08-03 11:43:19 UTC

GCC 4.9 branch is being closed

Comment 11 Jakub Jelinek 2017-03-10 11:57:20 UTC

Given say:
typedef int X;
struct Z {
  Z(const X* x1, X x2, X x3) : x1_(x1), x2_(x2), x3_(x3) {}
  const X* x1_;
  X x2_;
  X x3_;
};
static const X Xs0[] = {};
static const X Xs1[] = {};
static const X Xs2[] = {};
static const X Xs3[] = {};
static const X Xs4[] = {};
static const X Xs5[] = {};
static const X Xs6[] = {};
static const X Xs7[] = {};
static const X Xs8[] = {};
static const X Xs9[] = {};
static const Z Zs[] = {
Z(Xs1,1,1),
Z(Xs2,2,1),
Z(Xs3,2,1),
Z(Xs4,1,1),
Z(Xs5,1,1),Z(Xs6,8,1),
Z(Xs7,1,1),Z(Xs8,5,1),
Z(Xs9,1,1),Z(Xs0,7,1) };
const Z *p = &Zs[0];
(The last line is there so that not everything is optimized away.) On x86_64-linux with -m64 -O2 -fpic I'm seeing that we actually CSE all those (symbol_ref:DI "_ZL2Zs"), because it has a cost of 6 (rtx_cost on that (symbol_ref, DImode, SET, 1) yields 3), while with -m32 -O2 -fpic we don't, because it has a cost of 0.

But it is more confusing that we actually return 3 on -m64 -fpic, x86_64_nonimmediate_operand has:
      /* For certain code models, the symbolic references are known to fit.
         in CM_SMALL_PIC model we know it fits if it is local to the shared
         library.  Don't count TLS SYMBOL_REFs here, since they should fit
         only if inside of UNSPEC handled below.  */
      return (ix86_cmodel == CM_SMALL || ix86_cmodel == CM_KERNEL
              || (ix86_cmodel == CM_MEDIUM && !SYMBOL_REF_FAR_ADDR_P (op)));
So it talks about ix86_cmodel of CM_SMALL_PIC, but then actually doesn't do anything for it.
Perhaps it is right that foo(%rip) is not considered x86_64_nonimmediate_operand, but perhaps it should still have zero cost.
That would of course make this PR likely worse even on x86_64 -m64.

Comment 12 Jakub Jelinek 2017-03-10 12:12:14 UTC

For the C++ FE, the question here is why we actually emit dynamic initialization at all.  If constexpr is added to the ctor, then we just emit the initializer, but even without the constexpr I'd think that if the ctor has empty body and trivial mem initializers and if all arguments of the ctors are constants, as an optimization we should handle it as if it was declared constexpr.  Jason?

Comment 13 Richard Biener 2017-03-10 12:21:44 UTC

(In reply to Jakub Jelinek from comment #12)
> For the C++ FE, the question here is why we actually emit dynamic
> initialization at all.  If constexpr is added to the ctor, then we just emit
> the initializer, but even without the constexpr I'd think that if the ctor
> has empty body and trivial mem initializers and if all arguments of the
> ctors are constants, as an optimization we should handle it as if it was
> declared constexpr.  Jason?

I think there is a PR somewhere suggesting C++ should (for all initializers)
try constexpr evaluation.  It's much harder to do this in the middle-end.

Comment 14 Jakub Jelinek 2017-03-10 12:24:21 UTC

(In reply to Richard Biener from comment #13)
> (In reply to Jakub Jelinek from comment #12)
> > For the C++ FE, the question here is why we actually emit dynamic
> > initialization at all.  If constexpr is added to the ctor, then we just emit
> > the initializer, but even without the constexpr I'd think that if the ctor
> > has empty body and trivial mem initializers and if all arguments of the
> > ctors are constants, as an optimization we should handle it as if it was
> > declared constexpr.  Jason?
> 
> I think there is a PR somewhere suggesting C++ should (for all initializers)
> try constexpr evaluation.  It's much harder to do this in the middle-end.

Note clang++ seems to implement that (i.e. constexpr isn't needed there in order to get a static initializer).
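
The constexpr variant discussed in comments 12 to 14 can be sketched as follows. This is a reduced illustration, not the attached testcase: it keeps only two elements, and the Xs arrays are given one element each because an empty `= {}` initializer for an array of unknown bound relies on a GNU extension. Marking Zs itself constexpr is an addition made here so that the compiler rejects the definition if it cannot be constant-initialized.

```cpp
typedef int X;

struct Z {
    // constexpr is the only change relative to the constructor in the testcase.
    constexpr Z(const X* x1, X x2, X x3) : x1_(x1), x2_(x2), x3_(x3) {}
    const X* x1_;
    X x2_;
    X x3_;
};

// One element each, to stay within standard C++ (assumption of this sketch).
static const X Xs1[] = {0};
static const X Xs2[] = {0};

// constexpr on the variable requires constant initialization, so no
// dynamic initializer function is needed for Zs.
static constexpr Z Zs[] = { Z(Xs1, 1, 1), Z(Xs2, 2, 1) };

// External accessor, as in comment 5, so that Zs is not discarded.
const Z* getzs() {
    return &Zs[0];
}

// Compile-time check that the table really is a constant expression.
static_assert(Zs[1].x2_ == 2, "Zs is constant-initialized");
```

With this form the table should end up as initialized data rather than as thousands of stores in a static-initializer function, which is the effect comment 12 describes for the constexpr case.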

Comment 15 Jakub Jelinek 2017-03-10 13:25:53 UTC

Anyway, as far as memory consumption goes (compile time is still the same), the following patch helps a lot:

--- gcc/config/i386/i386.c.jj	2017-03-07 20:04:52.000000000 +0100
+++ gcc/config/i386/i386.c	2017-03-10 13:46:12.482704787 +0100
@@ -17257,8 +17257,9 @@ ix86_delegitimize_tls_address (rtx orig_
    necessary to remove references to the PIC label from RTL stored by
    the DWARF output code.  */
 
-static rtx
-ix86_delegitimize_address (rtx x)
+template <bool base_term>
+static inline rtx
+ix86_delegitimize_address_1 (rtx x)
 {
   rtx orig_x = delegitimize_mem_from_attrs (x);
   /* addend is NULL or some rtx if x is something+GOTOFF where
@@ -17361,7 +17362,7 @@ ix86_delegitimize_address (rtx x)
   if (! result)
     return ix86_delegitimize_tls_address (orig_x);
 
-  if (const_addend)
+  if (const_addend && !base_term)
     result = gen_rtx_CONST (Pmode, gen_rtx_PLUS (Pmode, result, const_addend));
   if (reg_addend)
     result = gen_rtx_PLUS (Pmode, reg_addend, result);
@@ -17399,6 +17400,12 @@ ix86_delegitimize_address (rtx x)
   return result;
 }
 
+static rtx
+ix86_delegitimize_address (rtx x)
+{
+  return ix86_delegitimize_address_1<false> (x);
+}
+
 /* If X is a machine specific address (i.e. a symbol or label being
    referenced as a displacement from the GOT implemented using an
    UNSPEC), then return the base term.  Otherwise return X.  */
@@ -17424,7 +17431,7 @@ ix86_find_base_term (rtx x)
       return XVECEXP (term, 0, 0);
     }
 
-  return ix86_delegitimize_address (x);
+  return ix86_delegitimize_address_1<true> (x);
 }
 

 static void

Without the patch (just the major time or memory consumers):
 tree DSE                :  40.53 ( 9%) usr   0.00 ( 0%) sys  40.51 ( 9%) wall       0 kB ( 0%) ggc
 dead store elim1        : 244.65 (55%) usr   1.10 (46%) sys 245.75 (55%) wall 5879136 kB (47%) ggc
 dead store elim2        :   3.12 ( 1%) usr   0.01 ( 0%) sys   3.12 ( 1%) wall  252045 kB ( 2%) ggc
 reload CSE regs         : 106.15 (24%) usr   0.01 ( 0%) sys 106.15 (24%) wall 4496830 kB (36%) ggc
 TOTAL                 : 444.45             2.38           447.46           12477770 kB
and with the patch:
 tree DSE                :  40.52 (10%) usr   0.00 ( 0%) sys  40.51 (10%) wall       0 kB ( 0%) ggc
 dead store elim1        : 223.84 (55%) usr   0.00 ( 0%) sys 223.84 (55%) wall    4653 kB ( 0%) ggc
 dead store elim2        :   2.92 ( 1%) usr   0.00 ( 0%) sys   2.92 ( 1%) wall  175766 kB ( 7%) ggc
 reload CSE regs         :  98.58 (24%) usr   0.46 (53%) sys  99.04 (24%) wall 2130309 kB (83%) ggc
 TOTAL                 : 407.95             0.86           409.33            2558609 kB
(both completely unoptimized compilers with checking etc.).

The thing is that ix86_find_base_term calls ix86_delegitimize_address that often creates some RTL that the caller then immediately throws away.
ix86_find_base_term is called a lot on expressions like:
(plus:SI (value:SI 1:1 @0x2c60f50/0x2c50f40)
    (const:SI (plus:SI (unspec:SI [
                    (symbol_ref:SI ("_ZL2Zs") [flags 0x2] <var_decl 0x7fffefc19900 Zs>)
                ] UNSPEC_GOTOFF)
            (const_int 8 [0x8]))))
on which it returns
(const:SI (plus:SI (symbol_ref:SI ("_ZL2Zs") [flags 0x2] <var_decl 0x7fffefc19900 Zs>)
        (const_int 8 [0x8])))
but in reality the caller only cares about the SYMBOL_REF; find_base_term ignores the CONST_INT operand of the PLUS.
The other option is to duplicate ix86_delegitimize_address into ix86_find_base_term and adjust the copy.
With the above template, we can share the code and just skip building the RTL that would be thrown away (for now in one spot, but likely in more spots later).

As for more spots later, e.g. both find_base_value and find_base_term (the only users of ix86_find_base_term)
only care about a MEM whose address is arg_pointer_rtx or a PLUS of arg_pointer_rtx and something.  So, in other cases it doesn't
make sense to replace_equiv_address_nv.  Thus I think
      if (GET_CODE (x) == CONST
          && GET_CODE (XEXP (x, 0)) == PLUS
          && GET_MODE (XEXP (x, 0)) == Pmode
          && CONST_INT_P (XEXP (XEXP (x, 0), 1))
          && GET_CODE (XEXP (XEXP (x, 0), 0)) == UNSPEC
          && XINT (XEXP (XEXP (x, 0), 0), 1) == UNSPEC_PCREL)
        {
          rtx x2 = XVECEXP (XEXP (XEXP (x, 0), 0), 0, 0);
          x = gen_rtx_PLUS (Pmode, XEXP (XEXP (x, 0), 1), x2);
          if (MEM_P (orig_x))
            x = replace_equiv_address_nv (orig_x, x);
          return x;
        }
isn't really useful if base_term && MEM_P (orig_x).
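
The shape of the patch above (one shared body, with a compile-time flag that skips work the base-term caller would discard) can be illustrated outside GCC. The sketch below is a toy model under stated assumptions: Node, g_pool, strip_1, strip and find_base are hypothetical names, and a std::deque stands in for compiler-managed RTL storage. It is not the i386.c code.

```cpp
#include <deque>

// Toy expression node (hypothetical; not GCC's rtx).
struct Node {
    const char* name;   // symbol name, or null for a wrapper node
    const Node* base;   // wrapped symbol for a "symbol + constant" node
    long addend;        // constant offset carried by a wrapper node
};

// Stand-in for allocated RTL; deque keeps references to old elements valid.
static std::deque<Node> g_pool;

// Shared body. The flag is a template parameter, so each instantiation
// contains only the code its caller needs.
template <bool base_term>
static inline const Node* strip_1(const Node* sym, long addend) {
    const Node* result = sym;
    // Only the full variant builds the "symbol + constant" wrapper. The
    // base-term variant skips it because its caller ignores the offset.
    if (addend != 0 && !base_term) {
        g_pool.push_back(Node{nullptr, sym, addend});
        result = &g_pool.back();
    }
    return result;
}

// Full form: returns the symbol wrapped with its offset.
const Node* strip(const Node* sym, long addend) {
    return strip_1<false>(sym, addend);
}

// Base-term form: returns just the symbol and allocates nothing.
const Node* find_base(const Node* sym, long addend) {
    return strip_1<true>(sym, addend);
}
```

In this model, calling find_base on a symbol with offset 8 returns the symbol itself and leaves g_pool untouched, while strip adds one wrapper node per call. Comment 15 attributes the memory growth to throwaway allocations of that kind, repeated for every store.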

Comment 16 Jakub Jelinek 2017-03-22 18:34:10 UTC

Author: jakub
Date: Wed Mar 22 18:33:37 2017
New Revision: 246398

URL: https://gcc.gnu.org/viewcvs?rev=246398&root=gcc&view=rev
Log:
	PR rtl-optimization/63191
	* config/i386/i386.c (ix86_delegitimize_address): Turn into small
	wrapper function, moved the whole old content into ...
	(ix86_delegitimize_address_1): ... this.  New inline function.
	(ix86_find_base_term): Use ix86_delegitimize_address_1 with
	true as last argument instead of ix86_delegitimize_address.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c

Comment 17 Jakub Jelinek 2017-03-22 18:58:03 UTC

Fixed.  Not planning to backport.

Comment 18 __vic 2017-03-23 06:05:02 UTC

(In reply to Jakub Jelinek from comment #17)
> Fixed.  Not planning to backport.

So Target Milestone is still 5.5?