Bug 35671 - GCC 4.4.x vs. 4.2.x performance regression
Summary: GCC 4.4.x vs. 4.2.x performance regression
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: regression
Version: 4.4.0
Importance: P3 major
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-03-22 23:31 UTC by Artem S. Tashkinov
Modified: 2009-08-08 16:21 UTC
CC: 2 users

See Also:
Host: i686-pc-linux-gnu
Target:
Build:
Known to work: 4.2.4
Known to fail: 4.4.0 4.3.3
Last reconfirmed:


Attachments

Description Artem S. Tashkinov 2008-03-22 23:31:32 UTC
With the same CFLAGS/CXXFLAGS, GCC 4.3.0 produces code that is on average 1-5% bigger than the code produced by GCC 4.2.3.

My usual C/C++ flags are: -march=pentium2 -O2 -pipe -ftree-vectorize

I've tested wine 0.9.58 and KDE 3.5.9.
Comment 1 Andrew Pinski 2008-03-24 16:10:13 UTC
Does the code size difference go away when you compare without -ftree-vectorize? It might be that we are now vectorizing more, which means we have peeled more loops and such.
Comment 2 Artem S. Tashkinov 2008-03-24 22:24:52 UTC
I'll check this out very soon, at least on the Wine sources. Recompiling KDE isn't a task I'm very fond of.
Comment 3 Artem S. Tashkinov 2008-03-24 23:01:47 UTC
Counting all .so and .a files in the lib/wine directory:

Without -ftree-vectorize:

GCC 4.2.3: 46,884,566 bytes in 305 files
GCC 4.3.0: 47,375,178 bytes in 305 files

With -ftree-vectorize:

GCC 4.2.3: 46,889,486 bytes in 305 files
GCC 4.3.0: 47,397,130 bytes in 305 files

It's not a big enough difference to care much about. I'll have to test with different sources. The real issue is that Windows archivers run slower under Wine compiled with GCC 4.3; that was the real grief.
Comment 4 Artem S. Tashkinov 2009-04-18 01:44:00 UTC
Test configuration:

Software: Linux kernel 2.6.28.9 x86, GCC 4.2.4, GCC 4.4.0 RC, http://www.rarlab.com/rar/unrarsrc-3.8.5.tar.gz

Hardware: AMD64 Dual Core CPU 5600, 1MB x 2 level 2 cache
RAM: DDR2 800MHz 4GB

unrarsrc-3.8.5.tar.gz compiled binary (compilation flags: -march=pentium2 -O2 -ftree-vectorize):

GCC 4.2.4: 180492 bytes
GCC 4.4.0: 196288 bytes

Uncompressing a 1GB archive (several hundred WAV files):

time rar-4.2.4 t -inul archive.rar
real    1m18.413s
user    1m17.758s
sys     0m0.580s

time rar-4.4.0 t -inul archive.rar
real    1m28.021s
user    1m27.344s
sys     0m0.627s

(Average results over five runs; disk I/O has no effect, since the file resides on a RAM disk.)

Summary: the 4.4.x binary is 9% larger and runs 14% slower.
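The five-run averaging described above can be scripted; this is only a sketch, with `mycmd` as a stand-in for the actual `rar ... t -inul archive.rar` invocation:

```shell
# Average wall-clock time over five runs.
# 'mycmd' is a placeholder for the real workload,
# e.g. rar-4.4.0 t -inul archive.rar
mycmd() { true; }

total=0
for run in 1 2 3 4 5; do
    start=$(date +%s%N)          # nanoseconds (GNU date)
    mycmd
    end=$(date +%s%N)
    total=$(( total + end - start ))
done
echo "average: $(( total / 5 )) ns"
```

Keeping the archive on a RAM disk, as done here, is what makes wall-clock averages meaningful, since disk I/O variance is removed.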
Comment 5 H.J. Lu 2009-04-18 02:24:42 UTC
(In reply to comment #4)
> Test configuration:
> 
> Software: Linux kernel 2.6.28.9 x86, GCC 4.2.4, GCC 4.4.0 RC,
> http://www.rarlab.com/rar/unrarsrc-3.8.5.tar.gz
> 
> Hardware: AMD64 Dual Core CPU 5600, 1MB x 2 level 2 cache
> RAM: DDR2 800MHz 4GB
> 
> unrarsrc-3.8.5.tar.gz compiled binary (compilation flags: -march=pentium2 -O2
> -ftree-vectorize):
> 

Why do you use -march=pentium2 on an AMD64 Dual Core CPU 5600? Remove
-march=pentium2 and report what you get.
Comment 6 Artem S. Tashkinov 2009-04-18 08:18:02 UTC
Many Linux distros compile binaries for a lowest common denominator so that the distro can run on very old computers and CPUs; their developers in most cases choose -march=i686 or -march=i586.

I compile binaries that have to run on very old computers such as the Pentium 2; that's why I chose -march=pentium2 in the first place. Back to the results:

________________________________________________________

-march=i386 -O2 -pipe -ftree-vectorize

unrar-424 size: 169384 time: 1m34.372s
unrar-440 size: 175836 time: 1m32.014s

Without CPU-specific optimizations, the binary produced by GCC 4.4.0 is larger but a bit faster.
________________________________________________________

-march=native -O2 -pipe -ftree-vectorize

unrar-424 size: 180488 time: 1m17.608s
unrar-440 size: 188348 time: 1m27.211s

With native CPU optimizations, the binary produced by GCC 4.4.0 is again larger, but noticeably slower.
________________________________________________________

The pentium2 results have already been posted.

This is the second major release of GCC which produces subpar results ...
Comment 7 Richard Biener 2009-04-18 10:01:30 UTC
For better speed with -march=pentium2 you should add -mtune=generic, which
will use only pentium2 features but tunes the code so as not to pessimize
newer processors.

That said, without a testcase and maybe some analysis (like a profile
comparison) there is nothing we can do.

If you want to experiment with flags, I would suggest trying
-fno-tree-pre, -fno-ivopts, and/or -funroll-loops.

Using profile feedback will also help reduce code size and increase
performance.
Comment 8 Artem S. Tashkinov 2009-04-19 13:51:24 UTC
If anyone cares to repeat my test results, here's a simple test case:

1) Obtain a large enough collection of WAV files (though I'm sure any other compressible material will also work for this test). If you have Wine installed you can obtain many large WAV files by running this application (http://www.scene.org/file.php?file=/demos/groups/farb-rausch/fr-028.zip&fileinfo) and "recording" all files to the same folder.

2) Compress your test folder with the RAR archiver (http://rarlabs.com/rar/rarlinux-3.8.0.tar.gz) using these parameters:

rar a -r -m5 -mdG archive_name.rar folder

3) Compile unrar (http://www.rarlab.com/rar/unrarsrc-3.8.5.tar.gz) with the parameters given above.

See :)

(In reply to comment #7)
> For better speed with -march=pentium2 you should add -mtune=generic which
> will use only pentium2 features but tunes the code to not pessimize newer
> processors.
> 

As you can see: 1) GCC 4.2 "pessimizes" code less than GCC 4.4 does, and 2) I'm sure no new pentium2 optimizations have been introduced in the last two years, so I'm sure the general -O2 code produced by GCC 4.4 (and 4.3) is less optimized than the code produced by GCC 4.2.

At least it's quite obvious that most binaries are larger, while the performance benefit is not so clear.
Comment 9 Artem S. Tashkinov 2009-07-07 18:45:50 UTC
Qt 4.5.2 /lib directory (without *.debug files) occupies

GCC 4.2.4: 43,649,379 bytes in 107 files
GCC 4.4.0: 46,544,895 bytes in 107 files

I don't like it at all. Compilation flags are still the same: -march=pentium2 -O2 -pipe -ftree-vectorize

I hope GCC 4.5.0 will become sane again.
Comment 10 manoa 2009-07-30 04:16:13 UTC
A somewhat more comprehensive GCC 4.2.4 vs. 4.3.3 vs. 4.4.0 vs. 4.4.1 comparison using nbench:

hardware: Intel Celeron 320 (Prescott, SSE3, 256KB L2, Socket 478) @ 2970 MHz
kernel: 2.6.29.1, specially optimized with Intel Compiler 10 (LinuxDNA)

http://manoa.flnet.org/nbench-celron-results.txt
Comment 11 manoa 2009-07-30 04:24:21 UTC
Forgot to mention executable sizes:
for all tested GCC versions (4.2, 4.3 and 4.4) the executable was 100 KB; the Intel executables were 68 KB and 72 KB respectively (versions 10 and 11).

Executable size in memory (both VIRT and RSS) did not change between versions (the program is very small).
Comment 12 manoa 2009-07-30 04:34:34 UTC
One more note about executable size in memory:
while there was no difference in memory size across GCC versions,
the ICC versions differed greatly:

VERSION     VIRT    RSS
gcc (all)   ~2000kb 500kb
icc10 (all) ~5500kb 800kb

That should be expected, though, given the additional libraries ICC links against: libimf.so, libintlc.so.5, libsvml.so, libgcc_s.so.1, libpthread.so.0 and libdl.so.2.

A fairer memory-usage comparison between ICC and GCC could possibly be made by modifying my icc.cfg configuration files, which, when compiling this program, were:

http://manoa.flnet.org/icc10.cfg
http://manoa.flnet.org/icc11.cfg
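For reference, VIRT and RSS figures like the ones quoted above can be read straight from /proc on Linux; a generic sketch, not tied to nbench:

```shell
# VmSize corresponds to VIRT and VmRSS to RSS; values are in kB.
# $$ is the current shell's PID; substitute the PID of the benchmark.
grep -E 'VmSize|VmRSS' /proc/$$/status
```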
Comment 13 manoa 2009-07-30 04:39:00 UTC
One last thing: try not to take the LU DECOMPOSITION test too seriously when comparing the various GCC runs; there was great variance even when running the same executable several times. The exception, of course, is the huge gap between Intel's results and GCC's.
Comment 14 manoa 2009-07-30 05:09:10 UTC
One more thing to mention about GCC: the configuration used when building the compilers (though this may not matter much, as those options never really had the anticipated effect):

../gcc-4.a.b/configure --prefix=/opt/gcc4ab --libexecdir=/opt/gcc4ab/lib --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-languages=c,c++ --disable-libstdcxx-pch --disable-bootstrap --disable-stage1-languages --disable-objc-gc --disable-libssp --disable-libada --with-gmp=/opt/gmp --with-mpfr=/opt/mpfr --with-ppl=/opt/ppl --with-cloog=/opt/cloog --with-mpc=/opt/mpc

Just in case there was any hope of CLooG and PPL helping this program, I made sure they were properly included in the compiler, but unfortunately, as you can see from the tests, those options had no effect on nbench.
Comment 15 manoa 2009-07-30 06:37:44 UTC
By the way, these results also show something else of interest: here, PGO degrades performance.
Comment 16 Manuel López-Ibáñez 2009-07-30 12:02:12 UTC
(In reply to comment #8)
> If anyone cares to repeat my test results, here's a simple test case:

This is not a simple testcase. A simple testcase is a sufficiently small self-contained compilable code that shows the problem in a way that can be reliably and consistently reproduced. The ideal testcase would be the smallest possible still showing the problem but anything below 100 lines of preprocessed code is probably small enough.

There are currently 3615 open bugs. Many of them have simple testcases, and some even identify the difference in the generated code that leads to worse performance. This problem report provides neither. Guess which kind is more likely to be worked on.

Each of the following steps increases the chances of getting a problem fixed:

1. Provide a self-contained testcase.
2. Provide a very small self-contained testcase.
3. Describe the differences in the generated code that lead to worse performance.
4. Find the exact flags that trigger the loss in performance.
5. Find the exact revision that introduced the loss in performance.
6. Find the exact code in GCC that contains the bug.

(In reply to comment #9)
> I hope GCC 4.5.0 will become sane again.

You mean that you hope your current problems get fixed by chance. If you cannot wait for the surprise, just grab a recent snapshot or the current SVN trunk and check it out. Good luck!
Comment 17 manoa 2009-07-30 23:58:35 UTC
You can find a nicer version of the results (and potentially future updates) here:
http://anonym.to?http://manoa.flnet.org/linux/compilers.html
Comment 18 Artem S. Tashkinov 2009-08-08 14:14:32 UTC
(In reply to comment #16)
> 
> This is not a simple testcase. A simple testcase is a sufficiently small
> self-contained compilable code that shows the problem in a way that can be
> reliably and consistently reproduced. The ideal testcase would be the smallest
> possible still showing the problem but anything below 100 lines of preprocessed
> code is probably small enough.
> 

OK, let's be blunt.

99% of the applications and libraries (that I use regularly) compiled with GCC >= 4.3 have bigger binary code and run slower. You can _easily_ check this on your own. And I cannot come up with a really simple testcase, because the new compilation infrastructure introduced in GCC 4.3 made everything not so brilliant.

Last but not least: I'm not a developer at all and have no knowledge of assembler, so I have no way to analyze the code produced by different GCC versions. All I see is the end result, and it's far from remarkable.

It seems like GCC developers are busy implementing new features while forgetting the core mission of any compiler: creating the most efficient code for all supported architectures. I'm closing this bug since I feel no one will step up to even confirm it.
Comment 19 Richard Biener 2009-08-08 16:21:21 UTC
Note that after a GCC version is released, fixes for runtime regressions are
usually not considered because of their impact on stability (which is the
most important point). Instead, if you care about the performance of a
specific application (we _do_ monitor SPEC CPU 2000 and 2006 and some other
benchmarks, and try hard to improve there), you should monitor the
performance of your area of interest on the current development snapshots.
Then there is sufficient time to address regressions.

Note that a testcase and some analysis are usually still required, but the
chance that somebody will look at and analyze a regression is far higher
for the development trunk than for released versions. Remember that GCC is
a volunteer-driven project and benchmark analysis is time-consuming.

At least both nbench and scimark are simple enough, so I'll add them to our
periodic monitoring of GCC trunk (http://gcc.opensuse.org/).