[Bug rtl-optimization/21395] New: Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0
asuraparaju at gmail dot com
gcc-bugzilla@gcc.gnu.org
Thu May 5 09:11:00 GMT 2005
gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.0.0/configure --prefix=/usr/local/gcc-4.0.0
Thread model: posix
gcc version 4.0.0
Compile line:
g++ -mmmx -g -O3 test_mmx_diff4.cpp
Background: The Dirac video codec project uses MMX to speed up the encoding
process. With gcc 3.3.x and gcc-3.4.x, MMX gives a performance gain of
20-30%, depending on the platform Dirac is built on. However, there was a huge
performance dip when the Dirac project was built using gcc-4.0.0. In fact, on 32
bit systems Dirac performed worse with MMX optimisations enabled than
with them turned off.
I've incorporated a scaled down version of a Dirac class that uses MMX opts in
the attached test_mmx_diff4.cpp and compared the performance of gcc-4.0.0 with
gcc-3.4.3 / gcc-3.3.3 on different architectures. The performance comparison
results are as follows:
1. AMD Dual Opteron Processor, Suse 9.2 (32 bit)
Results:
        gcc-3.4.3    gcc-4.0.1 20050503 (prerelease)
real    1.25         2.87
user    1.24         2.87
sys     0.00         0.00
2. Intel Dual Xeon 3.0 GHz, Suse 9.2 64 bit
Results:
        gcc-3.4.3    gcc-4.0.0
real    1.09         1.58
user    1.09         1.54
sys     0.00         0.00
3. Pentium 4 2.66GHz, Suse 9.2
Results:
        gcc3.3 20030226    gcc-4.0.0
real    1.35               4.98
user    1.32               4.96
sys     0.00               0.00
gcc-4.0.0 performed worse than gcc-3.3.3 or gcc-3.4.3 even for this simple
program. The test results using Dirac were similar.
I posted a message on the gcc mailing list and here's an excerpt from one of the
replies.
---
I took a quick look at it. It appears to be a register allocation
issue. The gcc mainline compiled code I looked at uses 3 mmx registers,
and ends up putting one variable on the stack, thus needing two extra
loads and stores in the inner loop. The gcc-3.3.3 compiled code I
looked at put everything in registers, using 7 mmx registers, and no
unnecessary loads/stores in the inner loop.
----
--
Summary: Performance degradation when building code that uses MMX
intrinsics with gcc-4.0.0
Product: gcc
Version: 4.0.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: asuraparaju at gmail dot com
CC: asuraparaju at gmail dot com,gcc-bugs at gcc dot gnu dot org
GCC build triplet: i686-pc-linux-gnu
GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu configured with: ../gcc-4.0.0/configure --pref
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395