
Bug 14552 - compiled trivial vector intrinsic code is inefficient
Summary: compiled trivial vector intrinsic code is inefficient
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 3.4.0
Importance: P2 minor
Target Milestone: ---
Assignee: Uroš Bizjak
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization, patch, ssemmx
Duplicates: 25277 32301
Depends on: 19161 19391
Blocks: 22152 25277
 
Reported: 2004-03-12 13:15 UTC by michaelni
Modified: 2008-04-21 08:21 UTC
CC List: 6 users

See Also:
Host: pentium3-debian-linux
Target: pentium3-debian-linux
Build:
Known to work:
Known to fail:
Last reconfirmed: 2006-07-19 16:37:59


Attachments
source to generate the well optimized code (121 bytes, text/x-csrc)
2004-03-12 13:20 UTC, michaelni

Description michaelni 2004-03-12 13:15:36 UTC
See the attached source; gcc -O3 -mtune=pentium3 -march=pentium3 -S generates:
test: 
        movq    w, %mm1 
        pushl   %ebp 
        movl    %esp, %ebp 
        popl    %ebp 
        psllw   $1, %mm1 
        movq    %mm1, w 
        movq    w, %mm0 
        movq    %mm0, dw 
        ret 
 
human generates: 
movq w, %mm1 
paddw %mm1,%mm1 
movq %mm1, w 
movq %mm1,dw 
ret
Comment 1 michaelni 2004-03-12 13:20:30 UTC
Created attachment 5906 [details]
source to generate the well optimized code
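For reference, the attached source is essentially the testcase quoted later in comment 24 (my reconstruction; the exact attachment may differ in details):

typedef short mmxw  __attribute__ ((mode(V4HI)));
typedef int   mmxdw __attribute__ ((mode(V2SI)));

mmxdw dw;
mmxw w;

void test(void)
{
    w += w;          /* vector add on the V4HI global */
    dw = (mmxdw) w;  /* copy the result into the V2SI global as well */
}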
Comment 2 Andrew Pinski 2004-03-12 15:46:59 UTC
The problem is that you also need -fomit-frame-pointer to get the same code as the human-generated code:
test:
        movq    w, %mm1
        psllw   $1, %mm1
        movq    %mm1, w
        movq    w, %mm0
        movq    %mm0, dw
        ret
Comment 3 michaelni 2004-03-12 16:26:04 UTC
Sorry, no, that's not the same code: it has one instruction more, uses a shift
instead of an addition, and writes the value to memory only to read it back
immediately afterwards. Anyway, I am not surprised that the bug report got
closed immediately.
 
gcc: 
movq    w, %mm1 
psllw   $1, %mm1     <------- 
movq    %mm1, w 
movq    w, %mm0     <------ 
movq    %mm0, dw 
 
human: 
movq w, %mm1  
paddw %mm1,%mm1  
movq %mm1, w  
movq %mm1,dw  
Comment 4 Andrew Pinski 2004-03-12 16:30:10 UTC
Okay so reopening it.
Comment 5 Andrew Pinski 2004-03-12 16:38:01 UTC
Using a temporary variable I can get it down to 5 instructions (including the return):
        movq    w, %mm0
        psllw   $1, %mm0
        movq    %mm0, dw
        movq    %mm0, w
        ret
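Presumably the source change was along these lines (a sketch only, using the typedefs from the attached testcase; not necessarily the exact code used above):

void test(void)
{
    mmxw tmp = w;      /* load the global once into a temporary */
    tmp += tmp;
    dw = (mmxdw) tmp;  /* both stores now come from the register value */
    w = tmp;
}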

The problem is that the global variables cause the pessimized code, so this is a dup of bug 12395.

*** This bug has been marked as a duplicate of 12395 ***
Comment 6 michaelni 2004-03-12 17:11:38 UTC
And the addition vs. shift issue? On the P3, MMX additions can be executed
on port 0 or 1, while MMX shifts can only execute on port 1.
Comment 7 Andrew Pinski 2004-03-12 17:15:28 UTC
That is a tuning issue.  The problem is that CSE selects the shift.
Comment 8 Andrew Pinski 2004-04-07 03:00:14 UTC
I already confirmed this.
Comment 9 Andrew Pinski 2005-01-12 06:26:54 UTC
I will have to file a new bug for this, as we now produce much worse code on the mainline; that is
because we expand the + to do all four elements separately instead of using the SSE/MMX unit, which is just
plainly wrong.
Comment 10 Richard Henderson 2005-01-18 11:34:10 UTC
No, Andrew, mainline is not plainly wrong.  We are correctly not using the 
MMX unit when <mmintrin.h> is not in use.  The instruction selection thing
can still be seen with the SSE unit though, if you widen the vectors to 16
bytes.

The problem is that ix86_rtx_costs has no idea about the cost of vector
operations.  For what little it's worth, K8 thinks paddw and psllw are
equivalent -- both can be issued to fadd or fmul pipelines.
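A 16-byte-wide variant that would exercise the SSE unit instead might look like this (a sketch under that assumption; the names sse_w, sse_dw, w16, dw16 and test16 are mine, not from the report):

typedef short sse_w  __attribute__ ((mode(V8HI)));
typedef int   sse_dw __attribute__ ((mode(V4SI)));

sse_dw dw16;
sse_w  w16;

void test16(void)
{
    w16 += w16;           /* same add; instruction selection may still pick a shift */
    dw16 = (sse_dw) w16;  /* same double-store pattern, now on 16-byte vectors */
}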
Comment 11 Uroš Bizjak 2005-06-22 10:14:30 UTC
Just for fun, I have compiled the testcase with the MMX/x87 mode-switching patch
included, to check MMX vector extensions. This little patch is needed to enable
MMX vector extensions (only the MMX vector add expander is shown):

diff -upr /export/home/uros/gcc-back/gcc/config/i386/i386.h i386/i386.h
--- /export/home/uros/gcc-back/gcc/config/i386/i386.h	2005-06-08 07:05:22.000000000 +0200
+++ i386/i386.h	2005-06-22 10:41:31.000000000 +0200
@@ -843,7 +845,8 @@ do {							\
 
 /* ??? No autovectorization into MMX or 3DNOW until we can reliably
    place emms and femms instructions.  */
-#define UNITS_PER_SIMD_WORD (TARGET_SSE ? 16 : UNITS_PER_WORD)
+#define UNITS_PER_SIMD_WORD						\
+    (TARGET_SSE ? 16 : TARGET_MMX ? 8 : UNITS_PER_WORD)
 
 #define VALID_FP_MODE_P(MODE)						\
     ((MODE) == SFmode || (MODE) == DFmode || (MODE) == XFmode		\
diff -upr /export/home/uros/gcc-back/gcc/config/i386/mmx.md i386/mmx.md
--- /export/home/uros/gcc-back/gcc/config/i386/mmx.md	2005-04-20 21:56:15.000000000 +0200
+++ i386/mmx.md	2005-06-22 11:00:35.000000000 +0200
@@ -553,6 +553,13 @@
 ;;
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
+(define_expand "add<mode>3"
+  [(set (match_operand:MMXMODEI 0 "register_operand" "")
+	(plus:MMXMODEI (match_operand:MMXMODEI 1 "nonimmediate_operand" "")
+		       (match_operand:MMXMODEI 2 "nonimmediate_operand" "")))]
+  "TARGET_MMX"
+  "ix86_fixup_binary_operands_no_copy (PLUS, <MODE>mode, operands);")
+
 (define_insn "mmx_add<mode>3"
   [(set (match_operand:MMXMODEI 0 "register_operand" "=y")
         (plus:MMXMODEI

After that, the testcase from the description is compiled to (with
-fomit-frame-pointer):

test:
	movq	w, %mm0
	paddw	%mm0, %mm0
	movq	%mm0, w
	movq	%mm0, dw
	emms
	ret

Comment 12 Uroš Bizjak 2005-07-21 08:42:14 UTC
You can patch the mainline 4.1 compiler with the patch at
http://gcc.gnu.org/ml/gcc-patches/2005-07/msg01128.html. The patch (which is
currently awaiting review) makes gcc produce optimal code:

'gcc -O2 -mmmx -fomit-frame-pointer'

test:
	movq	w, %mm0
	paddw	%mm0, %mm0
	movq	%mm0, w
	movq	%mm0, dw
	emms
	ret


Comment 13 Fariborz Jahanian 2005-09-13 21:09:50 UTC
Hello,

What is the status of Uros's patches in:

http://gcc.gnu.org/ml/gcc-patches/2005-07/msg01128.html

It looks like they did not make it to the FSF mainline. Are there remaining issues with them?

Comment 14 Andrew Pinski 2005-09-13 21:13:21 UTC
(In reply to comment #13)
> Are there remaining issues with them?

Yes, it does not work when configuring gcc with --with-cpu=pentium4; see PR 19161.
Comment 15 Uroš Bizjak 2005-09-15 11:39:37 UTC
(In reply to comment #14)

> Yes, it does not work when configuring gcc with --with-cpu=pentium4; see PR 19161.

No, the patch works OK for pentium4. The remaining problem is in the
optimize_mode_switching() function. For a certain loop layout, o_m_s could
insert emms and efpu insns in such a way that both register sets are blocked.

Because emms/efpu insertion depends heavily on o_m_s functionality, this 
infrastructure should be upgraded as explained in PR 19161.

(BTW: One of the design goals was to ICE instead of generating wrong code. It
looks like this goal was achieved :)
Comment 16 Pawel Sikora 2005-11-21 11:29:51 UTC
Without Uros' MMX patch, gcc-4.1.0-20051113 generates amazing code:
(gcc -O3 -march=pentium3 -S -fomit-frame-pointer pr14552.c)

test:   subl    $20, %esp
        movl    w, %eax
        movl    w+4, %edx
        movl    %ebx, 8(%esp)
        movl    %esi, 12(%esp)
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        movswl  (%esp),%esi
        movl    %edi, 16(%esp)
        movswl  4(%esp),%ecx
        movswl  2(%esp),%edi
        movswl  6(%esp),%ebx
        addl    %esi, %esi
        addl    %ecx, %ecx
        movzwl  %si, %esi
        sall    $17, %edi
        movzwl  %cx, %ecx
        sall    $17, %ebx
        movl    %edi, %eax
        movl    16(%esp), %edi
        movl    %ebx, %edx
        orl     %esi, %eax
        movl    8(%esp), %ebx
        orl     %ecx, %edx
        movl    12(%esp), %esi
        movl    %eax, w
        movl    %edx, w+4
        movl    w, %eax
        movl    w+4, %edx
        movl    %eax, dw
        movl    %edx, dw+4
        addl    $20, %esp
        ret
        .size   test, .-test
        .comm   dw,8,8
        .comm   w,8,8
        .ident  "GCC: (GNU) 4.1.0 20051113 (experimental)"
        .section        .note.GNU-stack,"",@progbits
Comment 17 Paolo Carlini 2005-11-21 11:34:15 UTC
Sorry.
Comment 18 Pawel Sikora 2005-11-21 15:05:09 UTC
gcc-3.3.6 produces better code:

test:   movq    w, %mm1
        psllw   $1, %mm1
        movq    %mm1, w
        movq    w, %mm1
        movq    %mm1, dw
        ret

        .comm   dw,8,8
        .comm   w,8,8


can we classify this as a code size regression?
Comment 19 Andrew Pinski 2005-11-21 15:09:11 UTC
(In reply to comment #18)
> can we classify this as a code size regression?

No because 3.3.x was also wrong in the sense it did not emit an emms.
Comment 20 Pawel Sikora 2005-11-21 18:38:30 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > can we classify this as a code size regression?
> 
> No because 3.3.x was also wrong in the sense it did not emit an emms.

ok.

gcc 4.1.0/20051113 with x87/mmx mode switch patch produces:

test:   movq    w, %mm0
        paddw   %mm0, %mm0
        movq    %mm0, w
        movl    w, %eax
        movl    w+4, %edx
        movl    %eax, dw
        movl    %edx, dw+4
        emms
        ret

        .comm   dw,8,8
        .comm   w,8,8

It isn't optimal, but it is correct (emms opcode) and smaller than the pure 4.1 output.
Comment 21 Pawel Sikora 2005-12-01 00:52:37 UTC
I'm wondering: is it possible to implement transformations
of vector arithmetic into vector builtins?

e.g.

#include <mmintrin.h>
__v8qi foo(const __v8qi x, const __v8qi y) { return x + y; }
__v8qi bar(const __v8qi x, const __v8qi y) { return _mm_add_pi8(x, y); }

I expect the same code from the compiler for both functions,
but it produces insane code for foo() :/

foo (x, y)
{
  unsigned int D.2377;
  unsigned int D.2376;
  unsigned int D.2369;
  unsigned int D.2368;
<bb 0>:
  D.2368 = BIT_FIELD_REF <x, 32, 0>;
  D.2369 = BIT_FIELD_REF <y, 32, 0>;
  D.2376 = BIT_FIELD_REF <x, 32, 32>;
  D.2377 = BIT_FIELD_REF <y, 32, 32>;
  return VIEW_CONVERT_EXPR<__v8qi>(
    {(D.2368 ^ D.2369) & 080808080 ^ (D.2369 & 2139062143) +
     (D.2368 & 2139062143),
     (D.2376 ^ D.2377) & 080808080 ^ (D.2377 & 2139062143) +
     (D.2376 & 2139062143)});
}

bar (x, y)
{
  vector signed char D.2448;
<bb 0>:
  D.2448 = __builtin_ia32_paddb (
    VIEW_CONVERT_EXPR<vector signed char>(VIEW_CONVERT_EXPR<__m64>(x)),
    VIEW_CONVERT_EXPR<vector signed char>(VIEW_CONVERT_EXPR<__m64>(y)));
  return VIEW_CONVERT_EXPR<__v8qi>(VIEW_CONVERT_EXPR<vector int>(D.2448));
}

# gcc -O2 -march=pentium3 -fomit-frame-pointer -mregparm=3

foo:
        subl    $44, %esp
        movq    %mm0, 24(%esp)
        movl    %ebx, 32(%esp)
        movl    24(%esp), %ebx
        movl    %esi, 36(%esp)
        movl    28(%esp), %esi
        movq    %mm1, 24(%esp)
        movl    24(%esp), %eax
        movl    28(%esp), %edx
        movl    %edi, 40(%esp)
        movl    %ebx, %edi
        andl    $2139062143, %edi
        movl    %eax, %ecx
        xorl    %eax, %ebx
        andl    $2139062143, %ecx
        movl    %esi, %eax
        addl    %edi, %ecx
        xorl    %edx, %eax
        movl    40(%esp), %edi
        andl    $2139062143, %esi
        andl    $-2139062144, %ebx
        andl    $2139062143, %edx
        xorl    %ecx, %ebx
        addl    %esi, %edx
        andl    $-2139062144, %eax
        movl    36(%esp), %esi
        movl    %ebx, 20(%esp)
        xorl    %edx, %eax
        movl    32(%esp), %ebx
        movss   20(%esp), %xmm0
        movl    %eax, 20(%esp)
        movss   20(%esp), %xmm1
        unpcklps        %xmm1, %xmm0
        movlps  %xmm0, 8(%esp)
        movl    8(%esp), %eax
        movl    12(%esp), %edx
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        movq    (%esp), %mm1
        addl    $44, %esp
        movq    %mm1, %mm0
        ret

bar:
        paddb   %mm1, %mm0
        ret

Comment 22 Uroš Bizjak 2008-03-08 07:29:54 UTC
*** Bug 25277 has been marked as a duplicate of this bug. ***
Comment 23 Uroš Bizjak 2008-03-19 10:45:24 UTC
As said in PR 19161:

The LCM infrastructure doesn't support mode switching in the way that would be usable for emms. Additionally, there are MANY problems expected when sharing x87 and MMX registers (e.g. handling of uninitialized x87 registers at the beginning of the function; this is the reason we don't implement an x87 register-passing ABI).

Automatic MMX vectorization is not exactly a very useful feature nowadays (we
have SSE, which works quite well here). Due to recent changes in the MMX register
allocation area, excellent code is produced using MMX intrinsics, so I'm closing
this bug as WONTFIX.

Also, auto-vectorization would produce either MMX or SSE code, but not both of them:

#define UNITS_PER_SIMD_WORD (TARGET_SSE ? 16 : UNITS_PER_WORD)
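For comparison, the intrinsics formulation referred to above would be roughly the following (a sketch; _mm_add_pi16 and _mm_empty are the <mmintrin.h> intrinsics for paddw and emms, and the globals here merely mirror the original testcase):

#include <mmintrin.h>

__m64 dw;  /* stand-ins for the dw/w globals of the original testcase */
__m64 w;

void test(void)
{
    __m64 tmp = _mm_add_pi16(w, w);  /* paddw */
    w  = tmp;
    dw = tmp;
    _mm_empty();                     /* emms, placed by the programmer */
}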
Comment 24 Alexander Strange 2008-03-19 19:21:36 UTC
For
typedef short mmxw  __attribute__ ((mode(V4HI)));
typedef int   mmxdw __attribute__ ((mode(V2SI)));

mmxdw dw;
mmxw w;

void test(){
    w+=w;
    dw= (mmxdw)w;
}

void test2(){
	w= __builtin_ia32_paddw(w,w);
	dw= (mmxdw)w;
}

gcc SVN generates the expected code for test2(), but not test(). I don't think using += on an MMX variable should count as autovectorization - if you're doing either you should know where to put emms yourself.

For test() we get:
        subl    $28, %esp
        movq    _w, %mm0
        movq    %mm0, 8(%esp)
        movzwl  8(%esp), %eax
        movzwl  10(%esp), %edx
        movzwl  12(%esp), %ecx
        addl    %eax, %eax
        addl    %edx, %edx
        movw    %ax, _w
        movw    %dx, _w+2
        movzwl  14(%esp), %eax
        addl    %ecx, %ecx
        addl    %eax, %eax
        movw    %cx, _w+4
        movw    %ax, _w+6
        movq    _w, %mm0
        movq    %mm0, _dw
        addl    $28, %esp
        ret

which touches %mm0 (requiring emms, I think) but does not use paddw (so it is slow and silly-looking).
LLVM generates expected code for both of them.
Comment 25 Alexander Strange 2008-03-19 19:39:06 UTC
Actually, the first one generates:
        subl    $12, %esp
        movq    _w, %mm0
        paddw   %mm0, %mm0
        movq    %mm0, _w
        movq    _w, %mm0
        movq    %mm0, _dw
        addl    $12, %esp
        ret

which is better than the code in the original report but still has a useless store/reload.
Comment 26 uros 2008-03-19 23:39:17 UTC
Subject: Bug 14552

Author: uros
Date: Wed Mar 19 23:38:35 2008
New Revision: 133354

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133354
Log:
        PR target/14552
        * config/i386/mmx.md ("*mov<mode>_internal_rex64"): Adjust register
        allocator preferences for "y" and "r" class registers.
        ("*mov<mode>_internal"): Ditto.
        ("*movv2sf_internal_rex64"): Ditto.
        ("*movv2sf_internal"): Ditto.

testsuite/ChangeLog:

        PR target/14552
        * gcc.target/i386/pr14552.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr14552.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/mmx.md
    trunk/gcc/testsuite/ChangeLog

Comment 27 Uroš Bizjak 2008-03-19 23:46:57 UTC
(In reply to comment #25)
> Actually the first generates-
>         subl    $12, %esp
>         movq    _w, %mm0
>         paddw   %mm0, %mm0
>         movq    %mm0, _w
>         movq    _w, %mm0
>         movq    %mm0, _dw
>         addl    $12, %esp
>         ret
> 
> which is better than the code in the original report but still has a useless
> store/reload.

The store is not useless. Reload from "_w" is how gcc handles double stores nowadays and is not MMX-specific. It looks like some pass forgot to check where the value came from.
Comment 28 Andrew Pinski 2008-03-19 23:49:22 UTC
(In reply to comment #27)
> The store is not useless. Reload from "_w" is how gcc handles double stores
> nowadays and is not MMX-specific. It looks like some pass forgot to check where
> the value came from.

Do you happen to know if there are two different modes at work here?  If so there are patches which fix this up in DSE and post-reload CSE.
Comment 29 Uroš Bizjak 2008-03-20 00:01:26 UTC
Now we generate:

-m32 -mmmx -msse2:

test:
        subl    $20, %esp
        movl    w, %eax
        movl    w+4, %edx
        movl    %ebx, 12(%esp)
        movl    %esi, 16(%esp)
        movl    %eax, (%esp)
        movzwl  (%esp), %ecx
        movl    %edx, 4(%esp)
        movzwl  2(%esp), %ebx
        movzwl  4(%esp), %esi
        movzwl  6(%esp), %eax
        addl    %ecx, %ecx
        addl    %ebx, %ebx
        addl    %esi, %esi
        addl    %eax, %eax
        movw    %bx, w+2
        movl    12(%esp), %ebx
        movw    %si, w+4
        movl    16(%esp), %esi
        movw    %ax, w+6
        movl    w+4, %edx
        movw    %cx, w
        movl    w, %eax
        movl    %edx, dw+4
        movl    %eax, dw
        addl    $20, %esp
        ret

-m64 -mmmx -msse2:

test:
        movabsq $9223231297218904063, %rax
        andq    w(%rip), %rax
        addq    %rax, %rax
        movq    %rax, w(%rip)
        movq    w(%rip), %rax
        movq    %rax, dw(%rip)
        ret

The issue with useless reload is PR 12395, as mentioned in Comment #5.
Comment 30 Uroš Bizjak 2008-03-20 00:04:05 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > The store is not useless. Reload from "_w" is how gcc handles double stores
> > nowadays and is not MMX-specific. It looks like some pass forgot to check where
> > the value came from.
> 
> Do you happen to know if there are two different modes at work here?  If so
> there are patches which fix this up in DSE and post-reload CSE.

Yes, from comment #24 (slightly changed):

typedef short mmxw  __attribute__ ((vector_size (8)));
typedef int   mmxdw __attribute__ ((vector_size (8)));

mmxdw dw;
mmxw w;

so, we have V4HI and V2SI.
Comment 31 pinskia@gmail.com 2008-03-20 00:23:30 UTC
Subject: Re: compiled trivial vector intrinsic code is inefficient

See PR 33790.

Comment 32 Alexander Strange 2008-03-20 00:39:29 UTC
This is missed on trees:
mmxdw dw;
mmxw w;

void test2(){
	w= __builtin_ia32_paddw(w,w); w= (mmxdw)w;
}

void test3(){
	mmxw w2= __builtin_ia32_paddw(w,w); dw= (mmxdw)w2;
}

test2 ()
{
  vector short int w.4;
  vector short int w.3;

<bb 2>:
  w.3 = w;
  w.4 = __builtin_ia32_paddw (w.3, w.3);
  w = w.4;
  dw = VIEW_CONVERT_EXPR<vector int>(w);
  return;
}

test3 ()
{
  mmxw w2;
  vector short int w.6;

<bb 2>:
  w.6 = w;
  w2 = __builtin_ia32_paddw (w.6, w.6);
  dw = VIEW_CONVERT_EXPR<vector int>(w2);
  return;
}
Comment 33 michaelni 2008-03-20 01:37:07 UTC
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Wed, Mar 19, 2008 at 11:39:18PM -0000, uros at gcc dot gnu dot org wrote:
[...]
>         * gcc.target/i386/pr14552.c: New test.
> 
> 
> Added:
>     trunk/gcc/testsuite/gcc.target/i386/pr14552.c

Thanks, I was already scared that the inversely proportional relation between
version number and performance, which has been followed so nicely since 2.95,
would stop.
Adding a test to the testsuite to ensure that MMX intrinsics don't use
MMX registers is, well, just brilliant.
I am already eagerly awaiting the testcase which will check that floating-point
code doesn't use the FPU; I assume that will happen in gcc 5.0?

Anyway, I am glad ffmpeg compiles fine under icc.

[...]
Comment 34 Uroš Bizjak 2008-03-20 09:49:22 UTC
(In reply to comment #33)

> Anyway, I am glad ffmpeg compiles fine under icc.

Me too. Now you will troll in their support lists.
Comment 35 michaelni 2008-03-20 17:18:07 UTC
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Thu, Mar 20, 2008 at 09:49:22AM -0000, ubizjak at gmail dot com wrote:
> (In reply to comment #33)
> 
> > Anyway, I am glad ffmpeg compiles fine under icc.
> 
> Me too. Now you will troll in their support lists.

No, truth be told, I don't plan to switch to icc yet. Somehow I do prefer to use
free tools. Of course, if the gap becomes too big, I as well as most others
will switch to icc ...
Also, ffmpeg uses almost entirely asm() instead of intrinsics, so this alone is
not so much a problem for ffmpeg as it is for others who followed the
recommendation that "intrinsics are better than asm".

About trolling, well, I made no attempt to reply politely and diplomatically, no.
But "solving" a "problem" in some use case by dropping support for that use
case is kinda extreme.

The way I see it is that
* it is non-trivial to place emms optimally and automatically
* there needs to be an emms between MMX code and FPU code

The solutions to this would be any one of:
A. let the programmer place emms, as has been done in the past (sketched below)
B. don't support MMX at all
C. don't support the x87 FPU at all
D. place emms after every bunch of MMX instructions
E. solve a quite non-trivial problem and place emms optimally

The solution which has apparently been selected is B.; why was that chosen
instead of, let's say, A.?

If I write SIMD code then I know that I need an emms on x86. It is
trivial for the programmer to place it optimally.
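A minimal sketch of option A, assuming the generic-vector testcase from this PR and a programmer-placed emms (written as inline asm here; an explicit _mm_empty() from <mmintrin.h> would do the same job):

void test(void)
{
    w += w;                         /* the compiler is free to use MMX here ... */
    dw = (mmxdw) w;
    __asm__ __volatile__ ("emms");  /* ... because the programmer clears MMX state manually */
}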

[...]

Comment 36 Uroš Bizjak 2008-03-21 10:33:59 UTC
(In reply to comment #35)

> [...]

I don't know where you get the idea that MMX support was dropped in any way. I won't engage in a discussion about autovectorisation, intrinsics, builtins, generic vectorisation, etc, etc with you, but please look at PR 21395 for how a performance PR should be filed. The MMX code in that PR is _far_ from trivial, but since it is well written using intrinsic instructions, it enables a jaw-dropping performance increase that is simply not possible when ASM blocks are used.

Now, I'm sure that you have your numbers ready to back up your claims from Comment #33 about performance of generated code, and I challenge you to beat performance of gcc-4.4 generated code by hand-crafted assembly using the example of PR 21395.
Comment 37 michaelni 2008-03-22 02:39:13 UTC
Subject: Re: compiled trivial vector intrinsic code is inefficient

On Fri, Mar 21, 2008 at 10:34:00AM -0000, ubizjak at gmail dot com wrote:
> [...]
> 
> I don't know where you get the idea that MMX support was dropped in any way. I

Maybe because the SIMD code in this PR, compiled with -mmmx, does not use MMX
but significantly less efficient integer instructions. And you added a
test to gcc which ensures that this case does not use MMX instructions.

This is pretty much the definition of dropping MMX support (for this specific
case).


> won't engage in a discussion about autovectorisation, intrinsics, builtins,
> generic vectorisation, etc, etc with you,

And somehow I am glad about that.


> but please look at PR 21395 for how
> a performance PR should be filed.

> The MMX code in that PR is _far_ from trivial,

Well, that is something I would disagree with.


> but since it is well written using intrinsic instructions, it enables a
> jaw-dropping performance increase that is simply not possible when ASM blocks
> are used.
> 
> Now, I'm sure that you have your numbers ready to back up your claims from
> Comment #33 about performance of generated code, and I challenge you to beat
> performance of gcc-4.4 generated code by hand-crafted assembly using the
> example of PR 21395.

Done:
jaw-dropping intrinsics need
2.034s

stinky hand-written asm needs
1.312s

But you can read the details in PR 21395.

[...]
Comment 38 Uroš Bizjak 2008-04-21 08:21:07 UTC
*** Bug 32301 has been marked as a duplicate of this bug. ***