With the same CFLAGS/CXXFLAGS, GCC 4.3.0 produces code that is on average 1-5% bigger than the code produced by GCC 4.2.3. My usual C/C++ flags are: -march=pentium2 -O2 -pipe -ftree-vectorize. I've tested wine 0.9.58 and KDE 3.5.9.
Does the code size difference go away when you compare without -ftree-vectorize? It might be the case that we are now vectorizing more, which means we have peeled more loops and such.
I'll check this out very soon - at least on the Wine sources. Recompiling KDE isn't a task I'm very fond of.
Counting all .so and .a files in the lib/wine directory:

Without -ftree-vectorize:
GCC 4.2.3: 46,884,566 bytes in 305 files
GCC 4.3.0: 47,375,178 bytes in 305 files

With -ftree-vectorize:
GCC 4.2.3: 46,889,486 bytes in 305 files
GCC 4.3.0: 47,397,130 bytes in 305 files

That by itself is not too much to care about; I still have to test with different sources. The real grief is that Windows archivers run slower in Wine compiled with GCC 4.3.
Test configuration:

Software: Linux kernel 2.6.28.9 x86, GCC 4.2.4, GCC 4.4.0 RC, http://www.rarlab.com/rar/unrarsrc-3.8.5.tar.gz
Hardware: AMD64 Dual Core CPU 5600, 1MB x 2 level 2 cache
RAM: DDR2 800MHz 4GB

unrarsrc-3.8.5.tar.gz compiled binary (compilation flags: -march=pentium2 -O2 -ftree-vectorize):
GCC 4.2.4: 180492 bytes
GCC 4.4.0: 196288 bytes

Uncompressing a 1GB archive (several hundred WAV files):

time rar-4.2.4 t -inul archive.rar
real 1m18.413s
user 1m17.758s
sys 0m0.580s

time rar-4.4.0 t -inul archive.rar
real 1m28.021s
user 1m27.344s
sys 0m0.627s

(Average results over five runs - disk I/O has zero effect, since the file resides on a RAM disk.)

Summary: the 4.4.0 binary is 9% larger and runs 14% slower.
(In reply to comment #4)
> Test configuration:
>
> Software: Linux kernel 2.6.28.9 x86, GCC 4.2.4, GCC 4.4.0 RC,
> http://www.rarlab.com/rar/unrarsrc-3.8.5.tar.gz
>
> Hardware: AMD64 Dual Core CPU 5600, 1MB x 2 level 2 cache
> RAM: DDR2 800MHz 4GB
>
> unrarsrc-3.8.5.tar.gz compiled binary (compilation flags: -march=pentium2 -O2
> -ftree-vectorize):

Why do you use -march=pentium2 on an AMD64 Dual Core CPU 5600? Remove -march=pentium2 and report what you get.
Many Linux distros compile binaries for the lowest common denominator so that the distro can run on very old computers and CPUs - their developers in most cases choose -march=i686 or -march=i586. I compile binaries which have to run on very old computers like a Pentium 2, which is why I chose -march=pentium2 in the first place.

Back to the results:
________________________________________________________
-march=i386 -O2 -pipe -ftree-vectorize
unrar-424 size: 169384  time: 1m34.372s
unrar-440 size: 175836  time: 1m32.014s
Without CPU-specific optimizations the binary produced by GCC 4.4.0 is larger but a bit faster.
________________________________________________________
-march=native -O2 -pipe -ftree-vectorize
unrar-424 size: 180488  time: 1m17.608s
unrar-440 size: 188348  time: 1m27.211s
With native CPU optimizations the binary produced by GCC 4.4.0 is again larger but noticeably slower.
________________________________________________________
The pentium2 results have already been posted.

This is the second major release of GCC in a row that produces subpar results ...
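For completeness, here is how one can check which instruction set and tuning -march=native actually resolved to on this CPU, by asking the gcc driver to print the cc1 command line (a sketch - the output obviously differs per machine, and --help=target is only available on GCC 4.3 or newer):

gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
gcc -march=native -Q --help=target | grep -E 'march|mtune'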
For better speed with -march=pentium2 you should add -mtune=generic, which will use only pentium2 features but tunes the code so it does not pessimize newer processors. That said, without a testcase and maybe some analysis (like a profile comparison) there is nothing we can do. If you want to play with some flags I would suggest trying -fno-tree-pre and/or -fno-ivopts and/or -funroll-loops. Using profile-feedback will also help reduce code size and increase performance.
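For reference, a profile-feedback build of the unrar testcase would look roughly like this. This is only a sketch: the makefile.unix name and the CXXFLAGS override (including that it reaches the link step, which -fprofile-generate needs) are assumptions about the unrar 3.8.5 build system; the -fprofile-generate/-fprofile-use flags themselves are standard GCC:

make -f makefile.unix clean
make -f makefile.unix CXXFLAGS="-march=pentium2 -O2 -ftree-vectorize -fprofile-generate"
./unrar t -inul archive.rar    # training run, writes *.gcda profile data
make -f makefile.unix clean    # assuming 'clean' removes only the objects, not the *.gcda files
make -f makefile.unix CXXFLAGS="-march=pentium2 -O2 -ftree-vectorize -fprofile-use"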
If anyone cares to repeat my test results, here's a simple test case (a consolidated command sequence follows below):

1) Obtain a large enough collection of WAV files (I'm sure any other compressible material will also do for this test). If you have Wine installed you can get many large WAV files by running this application (http://www.scene.org/file.php?file=/demos/groups/farb-rausch/fr-028.zip&fileinfo) and "recording" all files to the same folder.
2) Compress your test folder with the RAR archiver (http://rarlabs.com/rar/rarlinux-3.8.0.tar.gz) using these parameters: rar -r -m5 -mdG archive_name.rar folder
3) Compile unrar (http://www.rarlab.com/rar/unrarsrc-3.8.5.tar.gz) with the parameters already given.

See :)

(In reply to comment #7)
> For better speed with -march=pentium2 you should add -mtune=generic which
> will use only pentium2 features but tunes the code to not pessimize newer
> processors.

As you can see, 1) GCC 4.2 "pessimizes" code less than GCC 4.4, and 2) I'm sure no new pentium2 optimizations have been introduced in the last two years - so I'm sure that plain -O2 code produced by GCC 4.4 (and 4.3) is less optimized than code produced by GCC 4.2. At least it's quite obvious that most binaries are larger - and the performance benefit is not so clear.
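To spell the steps above out as one command sequence (the g++-4.2/g++-4.4 driver names, the makefile.unix target, the CXX/CXXFLAGS overrides and the /dev/shm RAM-disk path are assumptions on my side; the flags are the ones already given):

rar -r -m5 -mdG archive.rar wav_folder/                        # step 2: pack the WAV folder
tar xzf unrarsrc-3.8.5.tar.gz && cd unrar                      # step 3: build unrar with both compilers
make -f makefile.unix CXX=g++-4.2 CXXFLAGS="-march=pentium2 -O2 -pipe -ftree-vectorize" && cp unrar /dev/shm/unrar-42
make -f makefile.unix clean
make -f makefile.unix CXX=g++-4.4 CXXFLAGS="-march=pentium2 -O2 -pipe -ftree-vectorize" && cp unrar /dev/shm/unrar-44
cp ../archive.rar /dev/shm/ && cd /dev/shm                     # keep everything on a RAM disk
time ./unrar-42 t -inul archive.rar
time ./unrar-44 t -inul archive.rar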
The Qt 4.5.2 lib directory (without *.debug files) occupies:

GCC 4.2.4: 43,649,379 bytes in 107 files
GCC 4.4.0: 46,544,895 bytes in 107 files

I don't like it at all. Compilation flags are still the same: -march=pentium2 -O2 -pipe -ftree-vectorize

I hope GCC 4.5.0 will become sane again.
A bit more comprehensive GCC 4.2.4 vs 4.3.3 vs 4.4.0 vs 4.4.1 comparison using nbench:

Hardware: Intel Celeron 320 (Prescott, SSE3, 256KB L2, socket 478) @ 2970 MHz
Kernel: 2.6.29.1, specially optimized with Intel Compiler 10 (LinuxDNA)

http://manoa.flnet.org/nbench-celron-results.txt
Forgot to mention executable sizes: for all tested GCC versions (4.2, 4.3 and 4.4) the executable was 100kb; the Intel executables were 68 and 72 kb respectively (versions 10 and 11). Executable size in memory (both VIRT and RSS) did not change between versions (the program is very small).
One more note about executable size in memory: while there was no difference in memory size between the GCC versions, for the ICC versions there was a great difference:

VERSION      VIRT     RSS
gcc (all)    ~2000kb  500kb
icc10 (all)  ~5500kb  800kb

That should be obvious, though, due to the additional libraries ICC links in (libimf.so, libintlc.so.5, libsvml.so, libgcc_s.so.1, libpthread.so.0 and libdl.so.2). This could possibly be countered, to make the memory usage comparison between ICC and GCC fairer, by modifying my icc.cfg configuration files, which were, when compiling this program: http://manoa.flnet.org/icc10.cfg http://manoa.flnet.org/icc11.cfg
One last thing: try not to take the LU DECOMPOSITION results too seriously when comparing the various GCC runs - there was a large variance even when running the same executable several times - except, of course, for the huge gap between Intel's and GCC's numbers.
One more thing to mention about GCC: the configure options used when building the compilers (though this may not mean much, since those options never really had an effect to the anticipated extent):

../gcc-4.a.b/configure --prefix=/opt/gcc4ab --libexecdir=/opt/gcc4ab/lib --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-languages=c,c++ --disable-libstdcxx-pch --disable-bootstrap --disable-stage1-languages --disable-objc-gc --disable-libssp --disable-libada --with-gmp=/opt/gmp --with-mpfr=/opt/mpfr --with-ppl=/opt/ppl --with-cloog=/opt/cloog --with-mpc=/opt/mpc

Just in case there was any hope for CLooG and PPL helping this program, I made sure they were properly included in the compiler, but unfortunately, as you can see from the tests, those options had no effect on nbench.
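A side note on that, as far as I understand it: configuring with --with-ppl/--with-cloog only builds the Graphite infrastructure into GCC 4.4; the Graphite loop transformations are not enabled at any -O level and only kick in when requested explicitly. To actually exercise them, something along these lines would have to go into the benchmark's CFLAGS (a sketch, not a recommendation):

CFLAGS="-O2 -march=native -floop-block -floop-interchange -floop-strip-mine"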
BTW, these results also show something else of interest: PGO degrades performance here.
(In reply to comment #8)
> If anyone cares to repeat my test results, here's a simple test case:

This is not a simple testcase. A simple testcase is a sufficiently small, self-contained, compilable piece of code that shows the problem in a way that can be reliably and consistently reproduced. The ideal testcase would be the smallest possible one that still shows the problem, but anything below 100 lines of preprocessed code is probably small enough.

There are currently 3615 open bugs. Many of them have simple testcases and some of them even describe the difference in the generated code that leads to worse performance. This problem report provides neither. Guess which kind is more likely to be worked on.

Each of the following steps increases the chances of getting a problem fixed:

1. Provide a self-contained testcase.
2. Provide a very small self-contained testcase.
3. Describe the differences in the generated code that lead to worse performance.
4. Find the exact flags that trigger the loss in performance.
5. Find the exact revision that introduced the loss in performance.
6. Find the exact code in GCC that contains the bug.

(In reply to comment #9)
> I hope GCC 4.5.0 will become sane again.

You mean that you hope your current problems get fixed by chance. If you cannot wait for the surprise, just grab a recent snapshot or the current SVN trunk and check it out. Good luck!
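For anyone unsure how to get from "the whole application is slower" to step 1: compiling the suspect source file with -save-temps keeps the preprocessed source (*.ii for C++), which is self-contained by construction and can then be cut down by hand (or with a reduction tool such as delta) until the size or performance difference disappears. A rough sketch - the versioned compiler names are just examples for the two releases being compared:

g++ -march=pentium2 -O2 -ftree-vectorize -save-temps -c suspect.cpp   # leaves suspect.ii behind
g++-4.2 -march=pentium2 -O2 -ftree-vectorize -S suspect.ii -o suspect-42.s
g++-4.4 -march=pentium2 -O2 -ftree-vectorize -S suspect.ii -o suspect-44.s
diff suspect-42.s suspect-44.s   # inspect the difference in the generated assembly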
You can find a nicer version of the results (and potentially future updates) here: http://anonym.to?http://manoa.flnet.org/linux/compilers.html
(In reply to comment #16)
>
> This is not a simple testcase. A simple testcase is a sufficiently small
> self-contained compilable code that shows the problem in a way that can be
> reliably and consistently reproduced. The ideal testcase would be the smallest
> possible still showing the problem but anything below 100 lines of preprocessed
> code is probably small enough.
>

OK, let's be blunt. 99% of the applications and libraries that I use regularly have bigger binaries and run slower when compiled with GCC >= 4.3. You can _easily_ check it on your own. And I cannot come up with a really simple testcase, because the new compilation infrastructure introduced in GCC 4.3 made everything across the board not so brilliant. Last but not least, I'm not a developer at all and I have no knowledge of assembler, so I have no way to analyze the code produced by different GCC versions. All I see is the end result, and it's far from remarkable. It seems like GCC developers are busy implementing new features while forgetting the core mission of any compiler - creating the most efficient code for all supported architectures.

I'm closing this bug since I feel no one will step up to even confirm it.
Note that after a GCC version is released, fixes for runtime regressions are usually not considered because of their impact on stability (which is the most important point). Instead, if you care about the performance of a specific application (we _do_ monitor SPEC CPU 2000 and 2006 and some other benchmarks and try to work hard to improve there), you should monitor the performance of your area of interest with the current development snapshots. Then there is sufficient time to address regressions. Note that a testcase and some analysis are usually still required - but at least it is far more likely that somebody will look at and analyze a regression on the development trunk than on an already released version. Remember that GCC is a volunteer-driven project and benchmark analysis is time-consuming.

At least both nbench and scimark are simple enough, so I'll add them to our periodic monitoring of GCC trunk (http://gcc.opensuse.org/).