Bug 11327 - Non-optimal code when using MMX/SSE builtins
Summary: Non-optimal code when using MMX/SSE builtins
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 3.3
: P2 normal
Target Milestone: 4.0.0
Assignee: Richard Henderson
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2003-06-26 12:44 UTC by Kevin Atkinson
Modified: 2005-01-07 14:15 UTC (History)
1 user (show)

See Also:
Host:
Target: i686-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2003-12-18 09:05:44


Attachments
Test Code (365 bytes, text/plain)
2003-06-26 12:45 UTC, Kevin Atkinson
Details
Generated Code (479 bytes, text/plain)
2003-06-26 12:46 UTC, Kevin Atkinson
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Atkinson 2003-06-26 12:44:06 UTC
Gcc generates non optimal code when the builtin MMX/SSE functions are used.  In
particular it has a tendency to insert unnecessary movq and sometimes
unnecessary memory reads.

For example in the following code:

#include <stddef.h>

typedef int v8qi __attribute__ ((mode(V8QI)));
typedef long long unsigned int ullint;

#define peq  __builtin_ia32_pcmpeqb
#define pmin __builtin_ia32_pminub
#define por(a,b) (v8qi)__builtin_ia32_por((ullint)a, (ullint)b)
#define psubs __builtin_ia32_psubusb

void foo(v8qi * a, v8qi * b, v8qi * c, size_t s)
{
  size_t i;
  v8qi thres;
  memset(&thres, 10, 8);
  for (i = 0; i != s; ++i)
  {
    c[i] = peq( pmin( por(psubs(a[i],b[i]), psubs(b[i],a[i])),
                      thres),
                thres);
  }
}

When compiled with "-O2 -march=pentium3" Gcc generates:

        # %mm2 is the constant thres
        ...
	movq	(%ecx,%eax,8), %mm0
	psubusb	(%edx,%eax,8), %mm0
	movq	%mm0, %mm1
	movq	(%edx,%eax,8), %mm0
	psubusb	(%ecx,%eax,8), %mm0
	por	%mm0, %mm1
	movq	%mm1, %mm0
        pminub	%mm2, %mm0
	pcmpeqb	%mm2, %mm0
	movq	%mm0, (%esi,%eax,8)
        ...

which involves 2 unnecessary memory reads and 1 unnecessary movq.  An optimal
version of the above code:

	movq	(%ecx,%eax,8), %mm0
        movq    (%edx,%eax,8), %mm1
        movq    %mm0, %mm3
        psubusb %mm1, %mm0
        psubusb %mm3, %mm1
        por     %mm1, %mm0
        pminub	%mm2, %mm0
	pcmpeqb	%mm2, %mm0
	movq	%mm0, (%esi,%eax,8)

This is just a simple example.  In more complex code there are more unnecessary
movq.  Spelling out exactly what to do for the inner loop:

    m1 = a[i];
    m2 = b[i];
    m3 = m1;
    m1 = psubs(m1, m2);
    m2 = psubs(m2, m3);
    m1 = por(m1,m2);
    m1 = pmin(m1, thres);
    m1 = peq(m1,thres);
    c[i] = m1;

Does not help.  It avoids the unnecessary memory reads but adds several
unnecessary movq.

The attached files include the example code and the generated code gcc produces
with a "diff" to my optimal version.  The ineffect code is marked with a '-'
while my code is marked with a '+'.
Comment 1 Kevin Atkinson 2003-06-26 12:45:04 UTC
Created attachment 4287 [details]
Test Code
Comment 2 Kevin Atkinson 2003-06-26 12:46:06 UTC
Created attachment 4288 [details]
Generated Code
Comment 3 Dara Hazeghi 2003-06-27 17:50:34 UTC
Checking your simpler testcase with gcc mainline (20030620), I get:

.L5:
        movq    (%ecx,%eax,8), %mm0
        movq    (%edx,%eax,8), %mm1
        psubusb (%edx,%eax,8), %mm0
        psubusb (%ecx,%eax,8), %mm1
        por     %mm1, %mm0
        pminub  %mm2, %mm0
        pcmpeqb %mm2, %mm0
        movq    %mm0, (%esi,%eax,8)
        incl    %eax
        cmpl    %ebx, %eax
        jne     .L5


This looks a lot like the optimal code you suggested, correct? Would you mind sending an example 
of the better code you'd like to see generated for foo2, and/or trying gcc cvs to see if the problem 
is fixed there? Thanks,

Dara
Comment 4 Falk Hueffner 2003-06-27 17:54:32 UTC
Be sure to check out -fnew-ra, too, since this seems to be related to
register allocation, and -fnew-ra might become default in the future.
Comment 5 Kevin Atkinson 2003-06-27 23:22:43 UTC
The generated code in gcc mainline (20030620) is better but it still involves
two unnecessary memory reads.

The second test case should produce the exact same code as the first as both
functions are doing the exact same thing.

I tried -fnew-ra with gcc 3.3 and it didn't seam to help.
Comment 6 Dara Hazeghi 2003-06-28 00:13:44 UTC
Just checked mainline, and foo2 has one more unnecessary movq than foo1. -fnew-ra doesn't help 
(actually makes foo1 worse). Confirmed.
Comment 7 GCC Commits 2005-01-06 06:22:44 UTC
Subject: Bug 11327

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rth@gcc.gnu.org	2005-01-06 06:22:36

Modified files:
	gcc            : ChangeLog 
	gcc/config/i386: i386.c 

Log message:
	PR target/11327
	* config/i386/i386.c (BUILTIN_DESC_SWAP_OPERANDS): New.
	(bdesc_2arg): Use it.
	(ix86_expand_binop_builtin): Force operands into registers
	when optimizing.
	(ix86_expand_unop_builtin, ix86_expand_unop1_builtin,
	ix86_expand_sse_compare, ix86_expand_sse_comi,
	ix86_expand_builtin): Likewise.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7044&r2=2.7045
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&r1=1.769&r2=1.770

Comment 8 Richard Henderson 2005-01-06 06:26:20 UTC
http://gcc.gnu.org/ml/gcc-patches/2005-01/msg00331.html
Should be fixed.