[Bug rtl-optimization/21395] New: Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

asuraparaju at gmail dot com gcc-bugzilla@gcc.gnu.org
Thu May 5 09:11:00 GMT 2005


gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.0.0/configure --prefix=/usr/local/gcc-4.0.0
Thread model: posix
gcc version 4.0.0

Compile line:
g++ -mmmx -g -O3  test_mmx_diff4.cpp

Background: The Dirac video codec project uses MMX to speed up the encoding
process. With gcc-3.3.x and gcc-3.4.x, MMX gives a performance gain of
20-30%, depending on the platform Dirac is built on. However, there was a large
performance drop when the Dirac project was built using gcc-4.0.0. In fact, on
32-bit systems Dirac performed worse with MMX optimisations enabled than
with them turned off.

I've incorporated a scaled-down version of a Dirac class that uses MMX
optimisations in the attached test_mmx_diff4.cpp and compared the performance
of gcc-4.0.0 with gcc-3.4.3 / gcc-3.3.3 on different architectures. The
performance comparison results are as follows:


1. AMD Dual Opteron Processor, Suse 9.2 (32 bit)

Results:

    gcc-3.4.3          gcc-4.0.1 20050503 (prerelease)

    real 1.25          real 2.87
    user 1.24          user 2.87
    sys 0.00           sys 0.00


2. Intel Dual Xeon 3.0 GHz, Suse 9.2 (64 bit)

Results:

    gcc-3.4.3          gcc-4.0.0

    real 1.09          real 1.58
    user 1.09          user 1.54
    sys 0.00           sys 0.00

3. Pentium 4 2.66GHz, Suse 9.2

Results:

    gcc3.3 20030226    gcc-4.0.0

    real 1.35          real 4.98
    user 1.32          user 4.96
    sys 0.00           sys 0.00


The gcc-4.0.x builds ran slower than the gcc-3.3.3 or gcc-3.4.3 builds even for
this simple program: about 1.45 to 3.7 times slower in real time across the
three machines above. The test results using Dirac itself were similar.

I posted a message on the gcc mailing list and here's an excerpt from one of the
replies.

---
I took a quick look at it.  It appears to be a register allocation
issue.  The gcc mainline compiled code I looked at uses 3 mmx registers,
and ends up putting one variable on the stack, thus needing two extra
loads and stores in the inner loop.  The gcc-3.3.3 compiled code I
looked at put everything in registers, using 7 mmx registers, and no
unnecessary loads/stores in the inner loop.
---
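
For readers without the attachment, the kind of inner loop being discussed
can be sketched roughly as below. This is not the code from
test_mmx_diff4.cpp; the function name sad_mmx, the 16-bit sample layout and
the choice of a sum-of-absolute-differences kernel are all assumptions made
for illustration. The point is only that each iteration keeps several __m64
values live at once (accumulator, constant, two loads, difference, sign
mask), so a compiler that spills one of them to the stack pays extra loads
and stores on every pass through the loop.

```cpp
#include <mmintrin.h>   // MMX intrinsics (__m64, _mm_*_pi16, ...)
#include <cassert>
#include <cstring>

// Hypothetical kernel, NOT taken from test_mmx_diff4.cpp:
// sum of absolute differences of two arrays of 16-bit samples.
// Assumes n is a multiple of 4 and every difference fits in a signed 16-bit value.
int sad_mmx(const short* a, const short* b, int n)
{
    assert(n % 4 == 0);
    const __m64 ones = _mm_set1_pi16(1);
    __m64 acc = _mm_setzero_si64();              // two 32-bit partial sums

    for (int i = 0; i < n; i += 4) {
        __m64 va, vb;
        std::memcpy(&va, a + i, sizeof va);      // 64-bit loads, alignment-safe
        std::memcpy(&vb, b + i, sizeof vb);

        __m64 d    = _mm_sub_pi16(va, vb);       // four 16-bit differences
        __m64 sign = _mm_srai_pi16(d, 15);       // 0xFFFF where d < 0, else 0
        __m64 absd = _mm_sub_pi16(_mm_xor_si64(d, sign), sign);  // |d|
        acc = _mm_add_pi32(acc, _mm_madd_pi16(absd, ones));      // widen, add pairs
    }

    int lo = _mm_cvtsi64_si32(acc);                       // low 32-bit sum
    int hi = _mm_cvtsi64_si32(_mm_srli_si64(acc, 32));    // high 32-bit sum
    _mm_empty();                                 // clear MMX state before FP code
    return lo + hi;
}
```

Compiling a file like this with g++ -mmmx -O3 -S and looking for
stack-relative movq instructions inside the loop body is one way to check
whether a given compiler version keeps everything in MMX registers.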

-- 
           Summary: Performance degradation when building code that uses MMX
                    intrinsics with gcc-4.0.0
           Product: gcc
           Version: 4.0.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: asuraparaju at gmail dot com
                CC: asuraparaju at gmail dot com,gcc-bugs at gcc dot gnu dot
                    org
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu configured with: ../gcc-
                    4.0.0/configure --pref


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395


