Created attachment 33453
Source triggering memory usage

Every version of gcc I have tried from 4.4 through 4.9.1 on 32-bit Linux uses excessive memory (eventually exhausting all 3GB of virtual address space) when I compile the attached (generated, simplified) source with

  g++ -c -O2 -fPIC badmatch.cpp

4.3 does not exhibit the issue. 64-bit Linux (gcc 4.4.4 in that case) does not exhibit the issue, and memory usage tops out at ~440MB. gcc 4.4.4 on 32-bit Linux does exhibit the issue.

Dropping -fPIC causes memory usage to top out at ~175MB and the compile to succeed. -fno-dse eliminates the problem as well.

A look at the mallocs in the debugger suggests that the explosion of allocations is happening under rest_of_handle_dse(), e.g.:

#0 0x00332e46 in malloc () from /lib/libc.so.6
#1 0x08a26148 in xmalloc ()
#2 0x0829285b in pool_alloc(alloc_pool_def*) () at ../.././gcc/alloc-pool.c:281
#3 0x082f076e in cselib_lookup(rtx_def*, machine_mode, int, machine_mode) () at ../.././gcc/cselib.c:1303
#4 0x089536e2 in canon_address(rtx_def*, int*, int*, long long*, cselib_val**) () at ../.././gcc/dse.c:1182
#5 0x08954770 in record_store(rtx_def*, bb_info*) () at ../.././gcc/dse.c:1443
#6 0x08955e6a in rest_of_handle_dse() () at ../.././gcc/dse.c:2616
#7 0x084bd923 in execute_one_pass(opt_pass*) () at ../.././gcc/passes.c:2233

The most recent version with which I have tested:

>g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/nethome/woodfin/opt/lnxrhx86/gcc-4.9.1/libexec/gcc/i686-pc-linux-gnu/4.9.1/lto-wrapper
Target: i686-pc-linux-gnu
Configured with: ./configure --prefix=/nethome/woodfin/opt/lnxrhx86/gcc-4.9.1 --enable-languages=c,c++ --with-mpfr=/nethome/woodfin/opt/lnxrhx86/mpfr-2.4.2 --with-gmp=/nethome/woodfin/opt/lnxrhx86/gmp-4.3.2 --with-mpc=/nethome/woodfin/opt/lnxrhx86/mpc-1.0.2
Thread model: posix
gcc version 4.9.1 (GCC)
Confirmed. Possibly excessive value_rtx expansion from dse.c:canon_address.

The testcase is a function with a single basic-block and 30000 stores (the static initializer function) with the pattern

  D.94947 = (struct Z *) &Zs;
  D.94947->x1_ = &Xs1[0];
  D.94947->x2_ = 1;
  D.94947->x3_ = 1;
  temp.20397 = D.94947 + 12;
  temp.20397->x1_ = &Xs90[0];
  temp.20397->x2_ = 2;
  temp.20397->x3_ = 1;
  ...
  temp.30587 = temp.30586 + 12;
  temp.30587->x1_ = &Xs611[0];
  temp.30587->x2_ = 2;
  temp.30587->x3_ = 1;

thus groups of three stores followed by an address adjustment. The above is from a GCC 4.3 IL dump. The GCC 4.9 IL dump shows

  MEM[(struct Z *)&Zs].x1_ = &Xs1;
  MEM[(struct Z *)&Zs].x2_ = 1;
  MEM[(struct Z *)&Zs].x3_ = 1;
  MEM[(struct Z *)&Zs + 12B].x1_ = &Xs90;
  MEM[(struct Z *)&Zs + 12B].x2_ = 2;
  MEM[(struct Z *)&Zs + 12B].x3_ = 1;
  MEM[(struct Z *)&Zs + 24B].x1_ = &Xs91;
  MEM[(struct Z *)&Zs + 24B].x2_ = 2;
  MEM[(struct Z *)&Zs + 24B].x3_ = 1;
  ...
  MEM[(struct Z *)&Zs + 122292B].x1_ = &Xs611;
  MEM[(struct Z *)&Zs + 122292B].x2_ = 2;
  MEM[(struct Z *)&Zs + 122292B].x3_ = 1;

which causes each store to be expanded via something like

(insn 71298 71297 71299 2 (set (reg:SI 40822)
        (const:SI (unspec:SI [
                    (symbol_ref:SI ("_ZL2Zs") [flags 0x2] <var_decl 0x7ffff5c4a098 Zs>)
                ] UNSPEC_GOTOFF))) t.C:5 -1
     (nil))

(insn 71299 71298 71300 2 (set (mem/c:SI (plus:SI (plus:SI (reg:SI 3 bx)
                    (reg:SI 40822))
                (const_int 122216 [0x1dd68])) [4 MEM[(struct Z *)&Zs + 122208B].x3_+0 S4 A64])
        (const_int 1 [0x1])) t.C:5 -1
     (nil))

I suppose "lowering" PIC addresses somewhere before RTL expansion (and CSEing the addresses) would help here. Lowering as in not treating them as is_gimple_min_invariant.

With 4.3 we have a single address load for &Zs (but of course we retain the individual loads of the stored addresses - thus still very many PIC addresses in this function).

Why is CSE not able to CSE the UNSPEC_GOTOFF addresses? Does it not do it because of the (const:SI ...) wrapping (as in, not profitable)? Or is it confused about the other intermediate UNSPEC_GOTOFF uses?

That said, cse1 should be able to turn the RTL into something equivalent to what 4.3 produced.
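As a rough cross-check of the numbers in the dumps above, the following sketch recomputes the element and store counts from the 12-byte stride and the last offset shown. Z32 is a hypothetical model of struct Z as laid out on a 32-bit target (4-byte pointer, 4-byte int); it is an assumption for illustration, not taken from the attachment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of struct Z on a 32-bit target: one 4-byte pointer
// slot followed by two 4-byte ints (assumed ILP32 layout).
struct Z32 {
  std::uint32_t x1_;
  std::int32_t x2_;
  std::int32_t x3_;
};

// The dumps step the address by 12 bytes per element.
static_assert(sizeof(Z32) == 12, "stride seen in the IL dumps");
// The RTL stores x3_ at 122216 for the element at Zs + 122208B.
static_assert(offsetof(Z32, x3_) == 8, "x3_ is 8 bytes into the element");
static_assert(122208 + offsetof(Z32, x3_) == 122216, "matches the const_int");

constexpr long kStride = sizeof(Z32);
constexpr long kLastOffset = 122292;                   // last offset in the 4.9 dump
constexpr long kElements = kLastOffset / kStride + 1;  // elements 0 .. last
constexpr long kStores = 3 * kElements;                // three stores per element

static_assert(kLastOffset % kStride == 0, "last offset is a whole element");
```

Under these assumptions the counts come out at roughly ten thousand elements and thirty thousand stores, which is consistent with the "30000 stores" figure quoted above.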
With

int a, b, c, d;
struct X { int a; int b; void *p; } z[4];
void foo (void)
{
  z[0].a = 1;
  z[0].b = 2;
  z[0].p = &a;
  z[1].a = 1;
  z[1].b = 2;
  z[1].p = &b;
  z[2].a = 1;
  z[2].b = 2;
  z[2].p = &c;
  z[3].a = 1;
  z[3].b = 2;
  z[3].p = &d;
}

CSEing of the GOT load of z works.
GCC 4.8.4 has been released.
How is one to reproduce this bug with GCC5? I've tried:

$ ./xg++ --version
xg++ (GCC) 5.0.0 20150407 (experimental) [trunk revision 221906]
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ./xg++ -B. -S -O2 -m32 -fPIC PR63191.cc -fdump-tree-optimized
$ cat PR63191.cc.190t.optimized

;; Function (static initializers for PR63191.cc) (_GLOBAL__sub_I_PR63191.cc, funcdef_no=4, decl_uid=14028, cgraph_uid=4, symbol_order=1500) (executed once)

(static initializers for PR63191.cc) ()
{
  <bb 2>:
  return;

}

$

So AFAICT GCC5 optimizes the test case of comment #0 to an empty file. I'm sure there's a way to avoid optimizing this to empty, but I'm not quite a C++ guru ;-)
You could try adding a non-static function that returns an address inside Zs.

const Z* getzs() {
  return &Zs[0];
}

I'd think that would force it to actually perform the initialization if the contents can be externally accessed.

Sorry, I don't have a gcc 5.0 environment yet. I'll set one up if you still can't reproduce this there.
(In reply to woodfin from comment #5)
> You could try adding a non-static function that returns an address inside Zs.
>
> const Z* getzs() {
>   return &Zs[0];
> }

Yes, that does the trick:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25244 stevenb   20   0 5964m 5.8g  30m R  100  9.3 25:03.60 cc1plus

(and counting)

Now let's see if I can come up with a more reasonable test case...
(In reply to Steven Bosscher from comment #6)
> Now let's see if I can come up with a more reasonable test case...

Like so:

----- 8< -----
typedef int X;
struct Z {
  Z(const X* x1, X x2, X x3) : x1_(x1), x2_(x2), x3_(x3) {}
  const X* x1_;
  X x2_;
  X x3_;
};

#undef X____1
#undef X___10
#undef X__100
#undef X_1000
#undef X10000
#define X____1(N) \
  static const X Xs##N[] = {};
#define X___10(N) \
  X____1(N##0) X____1(N##1) X____1(N##2) X____1(N##3) X____1(N##4) \
  X____1(N##5) X____1(N##6) X____1(N##7) X____1(N##8) X____1(N##9)
#define X__100(N) \
  X___10(N##0) X___10(N##1) X___10(N##2) X___10(N##3) X___10(N##4) \
  X___10(N##5) X___10(N##6) X___10(N##7) X___10(N##8) X___10(N##9)
#define X_1000(N) \
  X__100(N##0) X__100(N##1) X__100(N##2) X__100(N##3) X__100(N##4) \
  X__100(N##5) X__100(N##6) X__100(N##7) X__100(N##8) X__100(N##9)
#define X10000(N) \
  X_1000(N##0) X_1000(N##1) X_1000(N##2) X_1000(N##3) X_1000(N##4) \
  X_1000(N##5) X_1000(N##6) X_1000(N##7) X_1000(N##8) X_1000(N##9)

X10000(0)

#undef Z____1
#undef Z___10
#undef Z__100
#undef Z_1000
#undef Z10000
#define Z____1(N,I,J) \
  Z(Xs##N,1,1),
#define Z___10(N) \
  Z____1(N##0,1,1) Z____1(N##0,1,1) \
  Z____1(N##0,1,1) Z____1(N##1,2,1) \
  Z____1(N##0,1,1) Z____1(N##2,1,2) \
  Z____1(N##0,1,1) Z____1(N##3,6,3) \
  Z____1(N##0,1,1) Z____1(N##4,7,2) \
  Z____1(N##0,1,1) Z____1(N##5,1,3) \
  Z____1(N##0,1,1) Z____1(N##6,5,9) \
  Z____1(N##0,1,1) Z____1(N##7,7,1) \
  Z____1(N##0,1,1) Z____1(N##8,3,3) \
  Z____1(N##0,1,1) Z____1(N##9,2,2)
#define Z__100(N) \
  Z___10(N##0) Z___10(N##1) Z___10(N##2) Z___10(N##3) Z___10(N##4) \
  Z___10(N##5) Z___10(N##6) Z___10(N##7) Z___10(N##8) Z___10(N##9)
#define Z_1000(N) \
  Z__100(N##0) Z__100(N##1) Z__100(N##2) Z__100(N##3) Z__100(N##4) \
  Z__100(N##5) Z__100(N##6) Z__100(N##7) Z__100(N##8) Z__100(N##9)
#define Z10000(N) \
  Z_1000(N##0) // Z_1000(N##1) Z_1000(N##2) Z_1000(N##3) Z_1000(N##4) \
//  Z_1000(N##5) Z_1000(N##6) Z_1000(N##7) Z_1000(N##8) Z_1000(N##9)

static const X XsLast[] = {};
static const Z Zs[] = {
  Z10000(0)
  Z(XsLast,1,1)
};

const Z* getzs() { return &Zs[0]; }
----- 8< -----

exploding in DSE:

 dead store elim1        :  45.34 (15%) usr   0.19 (28%) sys  45.53 (15%) wall 1016985 kB (45%) ggc
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
The GCC 4.9 branch is being closed.
Given say:

typedef int X;
struct Z {
  Z(const X* x1, X x2, X x3) : x1_(x1), x2_(x2), x3_(x3) {}
  const X* x1_;
  X x2_;
  X x3_;
};
static const X Xs0[] = {};
static const X Xs1[] = {};
static const X Xs2[] = {};
static const X Xs3[] = {};
static const X Xs4[] = {};
static const X Xs5[] = {};
static const X Xs6[] = {};
static const X Xs7[] = {};
static const X Xs8[] = {};
static const X Xs9[] = {};
static const Z Zs[] = {
  Z(Xs1,1,1), Z(Xs2,2,1), Z(Xs3,2,1), Z(Xs4,1,1), Z(Xs5,1,1),Z(Xs6,8,1),
  Z(Xs7,1,1),Z(Xs8,5,1), Z(Xs9,1,1),Z(Xs0,7,1) };
const Z *p = &Zs[0];

(the last line is there so that everything is not optimized away), I'm seeing that on x86_64-linux with -m64 -O2 -fpic we actually CSE all those (symbol_ref:DI "_ZL2Zs"), because it has cost of 6 (rtx_cost on that (symbol_ref, DImode, SET, 1) yields 3), while with -m32 -O2 -fpic we don't, because it has cost of 0.

But it is more confusing that we actually return 3 on -m64 -fpic, x86_64_nonimmediate_operand has:

      /* For certain code models, the symbolic references are known to fit.
         in CM_SMALL_PIC model we know it fits if it is local to the shared
         library.  Don't count TLS SYMBOL_REFs here, since they should fit
         only if inside of UNSPEC handled below.  */
      return (ix86_cmodel == CM_SMALL || ix86_cmodel == CM_KERNEL
              || (ix86_cmodel == CM_MEDIUM && !SYMBOL_REF_FAR_ADDR_P (op)));

So it talks about ix86_cmodel of CM_SMALL_PIC, but then actually doesn't do anything for it. Perhaps it is right that foo(%rip) is not considered x86_64_nonimmediate_operand, but perhaps it should still have zero cost. That would of course make this PR likely worse even on x86_64 -m64.
For the C++ FE, the question here is why we actually emit dynamic initialization at all. If constexpr is added to the ctor, then we just emit the initializer, but even without the constexpr I'd think that if the ctor has empty body and trivial mem initializers and if all arguments of the ctors are constants, as an optimization we should handle it as if it was declared constexpr. Jason?
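A minimal sketch of the constexpr variant mentioned above, assuming C++11 or later. XsA and XsB are hypothetical stand-ins for the generated Xs<N> arrays (given one element each to avoid the zero-length-array extension), and the table is cut down to two entries; only the constexpr keywords differ from the structure of the test case.

```cpp
#include <cassert>

typedef int X;

// Same members as Z in the test case; the constructor is marked constexpr.
struct Z {
  constexpr Z(const X* x1, X x2, X x3) : x1_(x1), x2_(x2), x3_(x3) {}
  const X* x1_;
  X x2_;
  X x3_;
};

// Hypothetical stand-ins for the generated Xs<N> arrays.
static const X XsA[] = {1};
static const X XsB[] = {2};

// Declaring the table constexpr makes the compiler reject the code unless
// every element can be constant-initialized, so no dynamic initializer
// function is needed for it.
static constexpr Z Zs[] = { Z(XsA, 1, 1), Z(XsB, 2, 1) };

// Evaluated at compile time: the values exist before any code runs.
static_assert(Zs[1].x2_ == 2, "Zs is constant-initialized");

// Keeps the table externally reachable, as in the test case.
const Z* getzs() { return &Zs[0]; }

// Runtime view of the same data through the accessor.
inline void check_zs() { assert(getzs()[0].x1_ == XsA); }
```

Whether a given compiler performs the same constant initialization without the constexpr keywords is exactly the open question in this comment; the sketch only shows the explicit form.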
(In reply to Jakub Jelinek from comment #12)
> For the C++ FE, the question here is why we actually emit dynamic
> initialization at all. If constexpr is added to the ctor, then we just emit
> the initializer, but even without the constexpr I'd think that if the ctor
> has empty body and trivial mem initializers and if all arguments of the
> ctors are constants, as an optimization we should handle it as if it was
> declared constexpr. Jason?

I think there is a PR somewhere suggesting C++ should (for all initializers) try constexpr evaluation. It's much harder to do this in the middle-end.
(In reply to Richard Biener from comment #13)
> (In reply to Jakub Jelinek from comment #12)
> > For the C++ FE, the question here is why we actually emit dynamic
> > initialization at all. If constexpr is added to the ctor, then we just emit
> > the initializer, but even without the constexpr I'd think that if the ctor
> > has empty body and trivial mem initializers and if all arguments of the
> > ctors are constants, as an optimization we should handle it as if it was
> > declared constexpr. Jason?
>
> I think there is a PR somewhere suggesting C++ should (for all initializers)
> try constexpr evaluation. It's much harder to do this in the middle-end.

Note clang++ seems to implement that (i.e. constexpr isn't needed there in order to get a static initializer).
Anyway, as far as memory consumption goes (compile time is still the same), the following patch helps a lot:

--- gcc/config/i386/i386.c.jj	2017-03-07 20:04:52.000000000 +0100
+++ gcc/config/i386/i386.c	2017-03-10 13:46:12.482704787 +0100
@@ -17257,8 +17257,9 @@ ix86_delegitimize_tls_address (rtx orig_
    necessary to remove references to the PIC label from RTL stored by
    the DWARF output code.  */
 
-static rtx
-ix86_delegitimize_address (rtx x)
+template <bool base_term>
+static inline rtx
+ix86_delegitimize_address_1 (rtx x)
 {
   rtx orig_x = delegitimize_mem_from_attrs (x);
   /* addend is NULL or some rtx if x is something+GOTOFF where
@@ -17361,7 +17362,7 @@ ix86_delegitimize_address (rtx x)
   if (! result)
     return ix86_delegitimize_tls_address (orig_x);
 
-  if (const_addend)
+  if (const_addend && !base_term)
     result = gen_rtx_CONST (Pmode, gen_rtx_PLUS (Pmode, result, const_addend));
   if (reg_addend)
     result = gen_rtx_PLUS (Pmode, reg_addend, result);
@@ -17399,6 +17400,12 @@ ix86_delegitimize_address (rtx x)
   return result;
 }
 
+static rtx
+ix86_delegitimize_address (rtx x)
+{
+  return ix86_delegitimize_address_1<false> (x);
+}
+
 /* If X is a machine specific address (i.e. a symbol or label being
    referenced as a displacement from the GOT implemented using an
    UNSPEC), then return the base term.  Otherwise return X.  */
@@ -17424,7 +17431,7 @@ ix86_find_base_term (rtx x)
       return XVECEXP (term, 0, 0);
     }
 
-  return ix86_delegitimize_address (x);
+  return ix86_delegitimize_address_1<true> (x);
 }
 
 static void

Without the patch (just the major time or memory consumers):

 tree DSE                :  40.53 ( 9%) usr   0.00 ( 0%) sys  40.51 ( 9%) wall       0 kB ( 0%) ggc
 dead store elim1        : 244.65 (55%) usr   1.10 (46%) sys 245.75 (55%) wall 5879136 kB (47%) ggc
 dead store elim2        :   3.12 ( 1%) usr   0.01 ( 0%) sys   3.12 ( 1%) wall  252045 kB ( 2%) ggc
 reload CSE regs         : 106.15 (24%) usr   0.01 ( 0%) sys 106.15 (24%) wall 4496830 kB (36%) ggc
 TOTAL                   : 444.45             2.38           447.46           12477770 kB

and with the patch:

 tree DSE                :  40.52 (10%) usr   0.00 ( 0%) sys  40.51 (10%) wall       0 kB ( 0%) ggc
 dead store elim1        : 223.84 (55%) usr   0.00 ( 0%) sys 223.84 (55%) wall    4653 kB ( 0%) ggc
 dead store elim2        :   2.92 ( 1%) usr   0.00 ( 0%) sys   2.92 ( 1%) wall  175766 kB ( 7%) ggc
 reload CSE regs         :  98.58 (24%) usr   0.46 (53%) sys  99.04 (24%) wall 2130309 kB (83%) ggc
 TOTAL                   : 407.95             0.86           409.33            2558609 kB

(both completely unoptimized compilers with checking etc.).

The thing is that ix86_find_base_term calls ix86_delegitimize_address that often creates some RTL that the caller then immediately throws away. ix86_find_base_term is called a lot on expressions like:

(plus:SI (value:SI 1:1 @0x2c60f50/0x2c50f40)
    (const:SI (plus:SI (unspec:SI [
                    (symbol_ref:SI ("_ZL2Zs") [flags 0x2] <var_decl 0x7fffefc19900 Zs>)
                ] UNSPEC_GOTOFF)
            (const_int 8 [0x8]))))

on which it returns

(const:SI (plus:SI (symbol_ref:SI ("_ZL2Zs") [flags 0x2] <var_decl 0x7fffefc19900 Zs>)
        (const_int 8 [0x8])))

but in reality, the caller only cares about the SYMBOL_REF, CONST_INT operand on PLUS is ignored by find_base_term.

The other option is to duplicate and adjust ix86_delegitimize_address into ix86_find_base_term. With the above template, we can share the code, just (for now in one spot, but likely in more spots later).

As for more spots later, e.g. both find_base_value and find_base_term (the only users of ix86_find_base_term) only care about MEM with arg_pointer_rtx or plus arg_pointer_rtx something. So, in other cases it doesn't make sense to replace_equiv_address_nv. Thus I think

      if (GET_CODE (x) == CONST
          && GET_CODE (XEXP (x, 0)) == PLUS
          && GET_MODE (XEXP (x, 0)) == Pmode
          && CONST_INT_P (XEXP (XEXP (x, 0), 1))
          && GET_CODE (XEXP (XEXP (x, 0), 0)) == UNSPEC
          && XINT (XEXP (XEXP (x, 0), 0), 1) == UNSPEC_PCREL)
        {
          rtx x2 = XVECEXP (XEXP (XEXP (x, 0), 0), 0, 0);
          x = gen_rtx_PLUS (Pmode, XEXP (XEXP (x, 0), 1), x2);
          if (MEM_P (orig_x))
            x = replace_equiv_address_nv (orig_x, x);
          return x;
        }

isn't really useful if base_term && MEM_P (orig_x).
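The sharing technique in the patch (one worker templated on a bool, with two thin wrappers) can be sketched outside of GCC as follows. Everything here is hypothetical and for illustration only: Addr, strip_1, strip, base_term_of and the wrappers_built counter are not GCC names, and strings stand in for RTL.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an address: a symbol plus a constant offset.
struct Addr {
  std::string symbol;
  long offset;
};

// Counts how often the "symbol+offset" wrapper is actually built, as a
// stand-in for the RTL that the base-term caller would throw away.
static int wrappers_built = 0;

// Shared worker. With base_term == true the caller only wants the symbol,
// so the offset wrapper is never constructed; the condition is resolved
// per instantiation at compile time.
template <bool base_term>
static inline std::string strip_1(const Addr& a) {
  std::string result = a.symbol;
  if (a.offset != 0 && !base_term) {
    ++wrappers_built;
    result += "+" + std::to_string(a.offset);
  }
  return result;
}

// Full form, analogous to the small wrapper that passes <false>.
static std::string strip(const Addr& a) { return strip_1<false>(a); }

// Base term only, analogous to the caller that passes <true>.
static std::string base_term_of(const Addr& a) { return strip_1<true>(a); }
```

Calling strip on {"_ZL2Zs", 8} builds the wrapper and yields "_ZL2Zs+8", while base_term_of on the same value returns just the symbol and leaves the counter untouched, which is the saving the patch is after.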
Author: jakub
Date: Wed Mar 22 18:33:37 2017
New Revision: 246398

URL: https://gcc.gnu.org/viewcvs?rev=246398&root=gcc&view=rev
Log:
	PR rtl-optimization/63191
	* config/i386/i386.c (ix86_delegitimize_address): Turn into small
	wrapper function, moved the whole old content into ...
	(ix86_delegitimize_address_1): ... this. New inline function.
	(ix86_find_base_term): Use ix86_delegitimize_address_1 with
	true as last argument instead of ix86_delegitimize_address.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
Fixed. Not planning to backport.
(In reply to Jakub Jelinek from comment #17)
> Fixed. Not planning to backport.

So Target Milestone is still 5.5?