The following generator program

#include <stdio.h>
#include <stdlib.h>

int main()
{
  int i;
  setvbuf (stdout, malloc(32768), _IOFBF, 32768);
  for (i = 0; i < 10000000; ++i)
    printf ("#ifdef M%d\nchar c%d[] = M%d;\n#endif\n", i, i, i);
  return 0;
}

builds a 483MB file which contains nothing but #ifdefs against undefined symbols. The preprocessed file should contain nothing but cpp line notes.

With gcc 3.2, cpp0 has peak memory usage of 930MB.
With gcc 3.4 and 4.0, cc1 has peak memory usage of 2010MB.

While this does strike me as a bit silly, apparently there are users trying this sort of thing.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=68634
I think this is GC related.
I was the original reporter. I had generated a file of this format in a programmatic search to determine all the gcc predefined macros. I was looping through all the possible macro names. Problem was my search kept failing due to a crashing compiler. Since it looked like a memory leak to me, I reported it.
A better way is to do "~/local/bin/gcc -x c /dev/null -dD -o - -E" and then look for "#define".
Postponed until GCC 3.4.4.
Confirmed.
The first thing is that read_file_guts mallocs the whole file, which seems wrong. That accounts for 500M.

The next problem is that we keep every identifier we parsed even though we don't need it.

3014 calls for 12,273,008 bytes:
thread_a000a1ec | 0x0 | _dyld_start | _start | main | toplev_main | do_compile | compile_file | c_common_parse_file | c_parse_file | yyparse | yylex | _yylex | c_lex | c_lex_with_flags | cpp_get_token | _cpp_lex_token | _cpp_handle_directive | do_ifdef | lex_macro_node | _cpp_lex_token | _cpp_lex_direct | lex_identifier | ht_lookup_with_hash | _obstack_newchunk | xmalloc | malloc | malloc_zone_malloc

And this is where the problem comes from. No, there is no leak: we keep a reference to all of these identifiers, but it seems like we should not.
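To illustrate the retention pattern described above, here is a minimal sketch of an identifier-interning table. It is not cpplib's code: the names intern, intern_generated, struct ident, live_idents and live_bytes are hypothetical, and it uses plain malloc rather than obstacks. It only shows why a table that hands out one permanent node per distinct spelling grows without bound when every identifier in the input is unique, even though nothing is leaked in the strict sense.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an identifier hash table.
   Nodes are never freed, so each distinct spelling stays live. */
#define NBUCKETS 4096u

struct ident {
    struct ident *next;   /* next node in the same bucket */
    char *name;           /* private copy of the spelling */
};

static struct ident *buckets[NBUCKETS];
static size_t live_idents;   /* number of nodes allocated so far */
static size_t live_bytes;    /* bytes held by nodes and spellings */

static unsigned hash_name(const char *s)
{
    unsigned h = 5381u;
    while (*s)
        h = h * 33u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the unique node for NAME, allocating it on first sight. */
static struct ident *intern(const char *name)
{
    unsigned h;
    size_t len;
    struct ident *id;

    assert(name != NULL);
    h = hash_name(name);
    for (id = buckets[h]; id != NULL; id = id->next)
        if (strcmp(id->name, name) == 0)
            return id;            /* seen before: no new memory */

    len = strlen(name) + 1;
    id = malloc(sizeof *id);
    if (id == NULL)
        abort();
    id->name = malloc(len);
    if (id->name == NULL)
        abort();
    memcpy(id->name, name, len);
    id->next = buckets[h];
    buckets[h] = id;
    live_idents++;
    live_bytes += sizeof *id + len;
    return id;
}

/* Intern M0 .. M(count-1), as the generated #ifdef lines would,
   and report how many nodes are live afterwards. */
static size_t intern_generated(int count)
{
    char buf[32];
    int i;

    for (i = 0; i < count; ++i) {
        snprintf(buf, sizeof buf, "M%d", i);
        intern(buf);
    }
    return live_idents;
}
```

In this sketch, repeated lookups of the same name cost nothing extra, which is why the strategy works well for ordinary sources. A file with ten million distinct names forces ten million permanent nodes.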
Subject: Re: [3.4/4.0 Regression] high cpp memory usage with undefined symbols

pinskia at gcc dot gnu dot org wrote:-

>
> ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-12-14 05:53 -------
> The first thing is that read_file_guts mallocs the whole file which seems wrong. That accounts for
> 500M.
> The next problem is that keep every identifier we parsed even though we don't need it.
> 3014 calls for 12,273,008 bytes: thread_a000a1ec |0x0 | _dyld_start | _start | main | toplev_main |
> do_compile | compile_file | c_common_parse_file | c_parse_file | yyparse | yylex | _yylex | c_lex |
> c_lex_with_flags | cpp_get_token | _cpp_lex_token | _cpp_handle_directive | do_ifdef | lex_macro_node
> | _cpp_lex_token | _cpp_lex_direct | lex_identifier | ht_lookup_with_hash | _obstack_newchunk |
> xmalloc | malloc | malloc_zone_malloc
>
> And this is where the problem comes from.
> No there is no leak we keep a reference to all of thes identifiers but this seems like we should not.

Not doing either of these involves a major rework of cpplib FWIW. I happen to think it would be beneficial, but I also think that the whole approach CPP takes needs rethinking.

Neil.
This is nowhere near release-critical; it's an intentional extreme corner case. As for the facts noted in the audit trail (i.e., that we lex the whole file up front, and that we keep all identifiers around the entire time), those are very sound strategies for most programs. I've removed the target milestone, and closed as WONTFIX. If someone chooses to reopen this, please do not reset the target milestone.