Bug 21395 - Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0
Summary: Performance degradation when building code that uses MMX intrinsics with gcc-...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.0.0
: P2 normal
Target Milestone: 4.4.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-05-05 09:11 UTC by Anuradha Suraparaju
Modified: 2008-03-23 10:46 UTC (History)
4 users (show)

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:


Attachments
A class that uses MMX intrinsics to compute block differences between two video frames (979 bytes, text/plain)
2005-05-05 09:14 UTC, Anuradha Suraparaju
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Anuradha Suraparaju 2005-05-05 09:11:37 UTC
gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.0.0/configure --prefix=/usr/local/gcc-4.0.0
Thread model: posix
gcc version 4.0.0
chandra.anuradha% gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.0.0/configure --prefix=/usr/local/gcc-4.0.0
Thread model: posix
gcc version 4.0.0

Compile line:
g++ -mmmx -g -O3  test_mmx_diff4.cpp

Background: Dirac video codec project uses MMX  to speed up the encoding
process. When using gcc 3.3.x and gcc-3.4.x there is a performance gain between
20-30% depending on the platform Dirac is built on. However, there was a huge
performance dip when the Dirac project was built using gcc-4.0.0. In fact, on 32
bit systems the Dirac system performed worse with MMX optimisations enabled than
with them turned off. 

I've incorporated a scaled down version of a Dirac class that uses MMX opts in
the attached test_mmx_diff4.cpp and compared the performance of gcc-4.0.0 with
gcc-3.4.3 / gcc-3.3.3 on different architectures. The performance comparison
results are as follows:


Compile line
g++ -mmmx -g -O3  test_mmx_diff4.cpp

1. AMD Dual Opteron Processor, Suse 9.2 (32 bit)

Results:

    gcc-3.4.3          gcc-4.0.1 20050503 (prerelease)

    real 1.25          real 2.87
    user 1.24          user 2.87
    sys 0.00           sys 0.00


2. Intel Dual Xeon 3.0 GHz, Suse 9.2 64 bit

Results:

    gcc-3.4.3          gcc-4.0.0

    real 1.09          real 1.58
    user 1.09          user 1.54
    sys 0.00           sys 0.00

3. Pentium 4 2.66GHz, Suse 9.2

Results:

    gcc3.3 20030226    gcc-4.0.0

    real 1.35          real 4.98
    user 1.32          user 4.96
    sys 0.00           sys 0.00


gcc-4.0.0 performed worse than gcc-3.3.3 or gcc3.4.3 even for this simple 
program. The test results using Dirac were similar to this.

I posted a message on the gcc mailing list and here's an excerpt from one of the
replies.

---
I took a quick look at it.  It appears to be a register allocation
issue.  The gcc mainline compiled code I looked at uses 3 mmx registers,
and ends up putting one variable on the stack, thus needing two extra
loads and stores in the inner loop.  The gcc-3.3.3 compiled code I
looked at put everything in registers, using 7 mmx registers, and no
unnecessary loads/stores in the inner loop.
----
Comment 1 Anuradha Suraparaju 2005-05-05 09:14:30 UTC
Created attachment 8822 [details]
A class that uses MMX intrinsics to compute block differences between two video frames
Comment 2 Andrew Pinski 2005-05-05 16:53:49 UTC
Note C and C++ aliasing rules are being violated in the source.
Comment 3 Andrew Pinski 2005-05-05 17:00:21 UTC
Note with the following code, I get back to what it is without -mmx:
union b
{
  int i[2];
  __m64 j;
}a;
    __m64 sum = _mm_set_pi32(0, 0);
    
        for (int j=0 ; j < yl ; j++)
        {
                short *p = &pic_data[j][0];
                short *r = &ref_data[j][0];
                
                for (int i=0 ; i < xl ; i+=4, p +=4, r+=4 )
        {       
           __m64 pic = *(__m64 *)p;
           __m64 ref = *(__m64 *)r;
            // pic - ref
            pic = _mm_sub_pi16 (pic, ref);
            // abs (pic - ref)
            ref = _mm_srai_pi16(pic, 15);
            pic = _mm_xor_si64(pic, ref);
            pic = _mm_sub_pi16 (pic, ref);
            // sum += abs(pic -ref)
            ref = _mm_xor_si64(ref, ref);
            ref = _mm_unpackhi_pi16(pic, ref);
            pic = _mm_unpacklo_pi16(pic, pic);
            pic = _mm_srai_pi32 (pic, 16);
            //ref = _mm_srai_pi32 (ref, 16);
            pic = _mm_add_pi32 (pic, ref);
            sum = _mm_add_pi32 (sum, pic);
        }
    }
    a.j = sum;
   // int *result = (int *) &sum;
    _mm_empty();

   // return result[0] + result[1];
   return a.i[0] + a.i[1];
Comment 4 Richard Biener 2008-01-26 23:02:21 UTC
How is the situation with 4.1, 4.2 or 4.3?
Comment 5 Uroš Bizjak 2008-03-21 10:17:16 UTC
Inner loop, generated with -O2 -mmmx -fno-strict-aliasing,
gcc version 4.0.2 20051125 (Red Hat 4.0.2-8):

.L45:
	movq	(%ebx), %mm0
	psubw	(%ecx), %mm0
	movq	%mm0, %mm1
	psraw	$15, %mm1
	pxor	%mm1, %mm0
	psubw	%mm1, %mm0
	movq	%mm0, %mm1
	punpckhwd	%mm2, %mm1
	punpcklwd	%mm0, %mm0
	psrad	$16, %mm0
	paddd	%mm1, %mm0
	movl	%eax, -56(%ebp)
	movl	%edx, -52(%ebp)
	movq	-56(%ebp), %mm1
	paddd	%mm1, %mm0
	movq	%mm0, -56(%ebp)
	movl	-56(%ebp), %eax
	movl	-52(%ebp), %edx
	movl	%eax, -24(%ebp)
	movl	%edx, -20(%ebp)
	addl	$4, %esi
	addl	$8, %ebx
	addl	$8, %ecx
	cmpl	%esi, %edi
	jg	.L45

time ./a.out
144

real    0m4.587s
user    0m4.584s
sys     0m0.004s


Inner loop, generated with -O2 -mmmx -fno-strict-aliasing,
gcc version 4.4.0 20080318 (experimental) [trunk revision 133304] (GCC)
(this one has improved MMX move instructions):

.L23:
	movq	(%ecx,%eax,2), %mm0
	psubw	(%edx,%eax,2), %mm0
	addl	$4, %eax
	cmpl	%eax, %ebx
	movq	%mm0, %mm1
	psraw	$15, %mm1
	pxor	%mm1, %mm0
	psubw	%mm1, %mm0
	movq	%mm0, %mm1
	punpcklwd	%mm0, %mm0
	punpckhwd	%mm3, %mm1
	psrad	$16, %mm0
	paddd	%mm1, %mm0
	paddd	%mm0, %mm2
	movq	%mm2, -24(%ebp)
	jg	.L23


time ./a.out
144

real    0m0.755s
user    0m0.752s
sys     0m0.000s

Current mainline is _SIX_ times faster.

Unfortunately, there are no plans to backport this functionality to anything older than 4.4, so fixed for 4.4.
Comment 6 michaelni 2008-03-22 02:15:14 UTC
As Uros has "challenged me to beat performance of gcc-4.4 generated code by hand-crafted assembly using the example of PR 21395" here's my entry. Sadly I only have gcc-4.3 compiled ATM for comparison, but 4.3 generates better code than 4.4, so I guess that's OK. Its inner loop is:
.L23:
        movq    (%ecx,%eax,2), %mm0
        psubw   (%edx,%eax,2), %mm0
        addl    $4, %eax
        cmpl    %eax, %ebx
        movq    %mm0, %mm1
        psraw   $15, %mm0
        pxor    %mm0, %mm1
        psubw   %mm0, %mm1
        movq    %mm1, %mm0
        punpcklwd       %mm1, %mm1
        punpckhwd       %mm3, %mm0
        psrad   $16, %mm1
        paddd   %mm0, %mm1
        paddd   %mm1, %mm2
        movq    %mm2, -24(%ebp)
        jg      .L23

It's better because the psraw doesn't depend on the previous movq result.

Now here's my code (this is naively written and not unrolled or hand-scheduled; it also uses hardcoded registers, so I suspect it can be improved further ...)
int SimpleBlockDiff::Diff ()
{
#ifdef __MMX__
    int sum;
    int x1b=-2*xl;
    int ylb= yl;

    asm volatile(
        "xorl %%edx, %%edx          \n\t"
        "pcmpeqw %%mm6, %%mm6       \n\t"
        "pxor %%mm7, %%mm7          \n\t"
        "psrlw $15, %%mm6           \n\t"
        "1:                         \n\t"
        "movl (%1, %%edx, 4), %%eax \n\t"
        "movl (%2, %%edx, 4), %%esi \n\t"
        "movl %3, %%ecx             \n\t"
        "subl %%ecx, %%eax          \n\t"
        "subl %%ecx, %%esi          \n\t"
        "2:                         \n\t"
        "pxor %%mm1, %%mm1          \n\t"
        "movq  (%%eax, %%ecx), %%mm0\n\t"
        "psubw (%%esi, %%ecx), %%mm0\n\t"
#if 0
        "psubw %%mm0, %%mm1         \n\t"
        "pmaxsw %%mm1, %%mm0        \n\t"
#else
        "pcmpgtw %%mm0, %%mm1       \n\t"
        "pxor %%mm1, %%mm0          \n\t"
        "psubw %%mm1, %%mm0         \n\t"
#endif
        "pmaddwd %%mm6, %%mm0       \n\t"
        "paddd %%mm0, %%mm7         \n\t"
        "addl $8, %%ecx             \n\t"
        " jnz 2b                    \n\t"
        "incl %%edx                 \n\t"
        "cmpl %%edx, %4             \n\t"
        " jnz 1b                    \n\t"
        "movq %%mm7, %%mm0          \n\t"
        "psrlq $32, %%mm7           \n\t"
        "paddd %%mm7, %%mm0         \n\t"
        "movd %%mm0, %0             \n\t"
        :"=g" (sum)
        :"r" (pic_data), "r" (ref_data), "m"(x1b), "m"(ylb)
        : "%eax", "%esi", "%ecx", "%edx"
    );
    return sum;
--------------
and benchmarks:

on a duron:
gcc-4.3:
real	0m2.034s
user	0m1.882s
sys	0m0.017s

asm:
real	0m1.312s
user	0m1.208s
sys	0m0.016s

on a 500mhz pentium3:
gcc-4.3
real	0m4.021s
user	0m3.767s
sys	0m0.009s

asm:
real	0m2.827s
user	0m2.565s
sys	0m0.055s
Comment 7 michaelni 2008-03-22 02:51:44 UTC
You can also replace the inner loop by:

        "2:                         \n\t"
        "pxor %%mm1, %%mm1          \n\t"
        "movq  (%%eax, %%ecx), %%mm0\n\t"
        "psubw (%%esi, %%ecx), %%mm0\n\t"
        "pcmpgtw %%mm0, %%mm1       \n\t"
        "por     %%mm6, %%mm1       \n\t"
        "pmaddwd %%mm1, %%mm0       \n\t"
        "paddd %%mm0, %%mm7         \n\t"
        "addl $8, %%ecx             \n\t"
        " jnz 2b                    \n\t"

Which has one instruction less; it's a hair faster on my P3 but a little slower on my Duron.
And of course the most obvious optimization is to unroll this and do a bunch of them at once.
Comment 8 Uroš Bizjak 2008-03-22 11:01:54 UTC
(In reply to comment #6)
> As Uros has "challenged me to beat performance of gcc-4.4 generated code by
> hand-crafted assembly using the example of PR 21395" heres my entry, sadly i
> only have gcc-4.3 compiled ATM for comparission but 4.3 generates better code
> than 4.4 so i guess thats ok its inner loop is:

Not!

This is the comparison of runtimes for the original test, comparing 4.3.0 vs 4.4.0 compiled code on core2D EE:

$ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
$ time ./a.out
144

real    0m0.619s
user    0m0.620s
sys     0m0.000s

$ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
$ time ./a.out
144

real    0m0.398s
user    0m0.400s
sys     0m0.000s

gcc 4.4.0 with your modified computation kernel:

$ g++ -m32 -march=core2 -O2 mmx-1.cpp
$ time ./a.out
144

real    0m0.309s
user    0m0.308s
sys     0m0.000s

To be honest, I didn't expect you to completely rewrite the computation kernel, so we are comparing apples to oranges. However, you can rewrite your ASM code using intrinsic functions from mmintrin.h, and you will get all optimizations (scheduling, unrolling, etc.) for free, while you are still in control of code generation on a fairly low level. Using intrinsics, you leave to the compiler things that the compiler is good at (loop handling, register allocation, scheduling).

Are you interested in this experiment? The results of this experiment would perhaps be interesting to ffmpeg people to consider rewriting their asm blocks into intrinsics.

And really thanks for your detailed benchmark results! And since your computation kernel is already 30% faster than current implementation, I'm sure that Dirac people (in CC of this PR) will be very interested in your computational kernel.
Comment 9 michaelni 2008-03-23 02:49:50 UTC
Subject: Re:  Performance degradation when
	building code that uses MMX intrinsics with gcc-4.0.0

On Sat, Mar 22, 2008 at 11:01:55AM -0000, ubizjak at gmail dot com wrote:
> 
> 
> ------- Comment #8 from ubizjak at gmail dot com  2008-03-22 11:01 -------
> (In reply to comment #6)
> > As Uros has "challenged me to beat performance of gcc-4.4 generated code by
> > hand-crafted assembly using the example of PR 21395" heres my entry, sadly i
> > only have gcc-4.3 compiled ATM for comparission but 4.3 generates better code
> > than 4.4 so i guess thats ok its inner loop is:
> 
> Not!
> 
> This is the comparison of runtimes for the original test, comparing 4.3.0 vs
> 4.4.0 compiled code on core2D EE:
> 
> $ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
> 
> real    0m0.619s
> user    0m0.620s
> sys     0m0.000s
> 
> $ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
> 
> real    0m0.398s
> user    0m0.400s
> sys     0m0.000s

On my duron with -O2 -mmmx i get
g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144

real	0m2.077s
user	0m1.912s
sys	0m0.019s


g++-4.4 (GCC) 4.4.0 20080321 (experimental)
144

real	0m2.172s
user	0m2.004s
sys	0m0.021s


with -m32 -march=core2 (incorrect as it doesn't match the CPU!)
g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144

real	0m3.644s
user	0m3.389s
sys	0m0.022s


g++-4.4 (GCC) 4.4.0 20080321 (experimental)
Illegal instruction         (yes yes i know i asked for it)

real	0m0.011s
user	0m0.003s
sys	0m0.007s


So on my duron 4.3 seems to beat 4.4 as i expected from the generated asm.



> 
> gcc 4.4.0 with your modified computation kernel:
> 
> $ g++ -m32 -march=core2 -O2 mmx-1.cpp
> $ time ./a.out
> 144
> 
> real    0m0.309s
> user    0m0.308s
> sys     0m0.000s
> 
> To be honest, I didn't expect you to completely rewrite the computation kernel,
> so we are comparing apples to oranges. 

Well nothing stops gcc from rewriting the intrinsics either :)


> However, you can rewrite your ASM code
> using intrinsic functions from __mmintrin.h, and you will get all optimizations
> (scheduling, unrolling, etc) for free, while you are still in control of code
> generation on a fairly low level. Using intrinsics, you leave to the compiler
> things that the compiler is good at (loop handling, register allocation,
> scheduling).
> 
> Are you interested in this experiment? 

I am surely interested but I am a little busy with Google Summer of Code
students currently. We have to choose wisely which applications and students
we select for ffmpeg this summer ... that means a lot of code reviewing from
what the students submit as qualification tasks ...
So I won't rewrite this in intrinsics, at least not anytime soon.


> The results of this experiment would
> perhaps be interesting to ffmpeg people to consider rewriting their asm blocks
> into intrinsics.

well ...
I am not a friend of intrinsics, but I think you guessed that already :)
The thing I like about asm() is that it produces the same performance and code
with every compiler. It's largely a write-once-and-forget thing. A problem
with asm() is almost always of the compile-time-error sort, like
"can't find register in class blah"; these things are visible and can be dealt
with ...
With intrinsics it's all a gamble; just look at this PR, how hugely performance
differs between gcc versions. If ffmpeg were using intrinsics instead of
asm we would have to spend considerable time dealing with such variations
somehow.


> 
> And really thanks for your detailed benchmark results! And since your
> computation kernel is already 30% faster than current implementation, I'm sure
> that Dirac people (in CC of this PR) will be very interested in your
> computational kernel.

Yes, I am also fine with them using it under whichever FOSS license they want.

[...]
Comment 10 Uroš Bizjak 2008-03-23 10:46:41 UTC
(In reply to comment #9)

> So on my duron 4.3 seems to beat 4.4 as i expected from the generated asm.

Can you tell from code dumps of 4.4 vs 4.3, where you think that 4.4 code is worse than 4.3 for Duron? For Core2, 4.4 avoids store forwarding stall, but I'm not sure why Duron prefers moves via memory instead of keeping values in %mm registers.
Comment 11 michaelni 2008-03-24 00:08:27 UTC
Subject: Re:  Performance degradation when
	building code that uses MMX intrinsics with gcc-4.0.0

On Sun, Mar 23, 2008 at 10:46:41AM -0000, ubizjak at gmail dot com wrote:
> 
> 
> ------- Comment #10 from ubizjak at gmail dot com  2008-03-23 10:46 -------
> (In reply to comment #9)
> 
> > So on my duron 4.3 seems to beat 4.4 as i expected from the generated asm.
> 
> Can you tell from code dumps of 4.4 vs 4.3, where you think that 4.4 code is
> worse than 4.3 for Duron? For Core2, 4.4 avoids store forwarding stall, but I'm
> not sure why Duron prefers moves via memory instead of keeping values in %mm
> registers.

--- freaky_mmx_code-4.3.s       2008-03-24 00:48:11.000000000 +0100
+++ freaky_mmx_code-4.4.s       2008-03-24 00:48:03.000000000 +0100
...
 .L24:
        movl    -36(%ebp), %eax
        testl   %ebx, %ebx
        movl    (%edi,%esi,4), %edx
        movl    (%eax,%esi,4), %ecx
@@ -182,113 +183,102 @@
        xorl    %eax, %eax
        movq    -24(%ebp), %mm2
        .p2align 4,,7
        .p2align 3

 .L23:
        movq    (%ecx,%eax,2), %mm0
        psubw   (%edx,%eax,2), %mm0
        addl    $4, %eax
        cmpl    %eax, %ebx
        movq    %mm0, %mm1
-       psraw   $15, %mm0
-       pxor    %mm0, %mm1
-       psubw   %mm0, %mm1
-       movq    %mm1, %mm0
-       punpcklwd       %mm1, %mm1
-       punpckhwd       %mm3, %mm0
-       psrad   $16, %mm1
-       paddd   %mm0, %mm1
-       paddd   %mm1, %mm2
+       psraw   $15, %mm1
+       pxor    %mm1, %mm0
+       psubw   %mm1, %mm0
+       movq    %mm0, %mm1
+       punpcklwd       %mm0, %mm0
+       punpckhwd       %mm3, %mm1
+       psrad   $16, %mm0
+       paddd   %mm1, %mm0
+       paddd   %mm0, %mm2
        movq    %mm2, -24(%ebp)
        jg      .L23
 .L22:
        addl    $1, %esi
-       cmpl    %esi, -40(%ebp)
-       jg      .L24
+       cmpl    -40(%ebp), %esi
+       jl      .L24

...
-       .ident  "GCC: (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)"
+       .ident  "GCC: (GNU) 4.4.0 20080321 (experimental)"
------------------
What I _think_ makes 4.4 slower on the Duron is that
psraw   $15, %mm1
reads a register which has been written in the previous instruction,
while 4.3 chose the other register which contains the same value.
4.4 simply has a longer dependency chain than 4.3.

PS: both compiled with -mmmx -O2 -S
PS2: 4.3 is from debian, 4.4 is from gcc svn

[...]