
Bug 22152 - Poor loop optimization when using mmx builtins
Summary: Poor loop optimization when using mmx builtins
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.1.0
Importance: P2 normal
Target Milestone: 4.4.0
Assignee: Not yet assigned to anyone
Keywords: missed-optimization, ssemmx
Depends on: 14552 19161 22076
Reported: 2005-06-22 20:05 UTC by Fariborz Jahanian
Modified: 2010-09-06 18:13 UTC (History)
1 user

See Also:
Target: i?86-*-*, x86_64-*-*
Known to work:
Known to fail:
Last reconfirmed: 2005-12-16 01:53:41


Description Fariborz Jahanian 2005-06-22 20:05:53 UTC
In the following trivial test case, gcc-4.1 produces very inefficient code for the loop. gcc-3.3 produces much better code.

typedef int __m64 __attribute__ ((__vector_size__ (8)));

__m64 unsigned_add3( const __m64 *a, const __m64 *b, unsigned long count )
{
                __m64 sum;
                unsigned int i;

                for( i = 1; i < count; i++ )
                        sum = (__m64) __builtin_ia32_paddq ((long long)a[i], (long long)b[i]);
                return sum;
}

1) Loop when compiled with gcc-4.1 -O2 -msse2 (note in particular the extra movq to memory):

        movl    12(%ebp), %esi
        movq    (%eax,%edx,8), %mm0
        paddq   (%esi,%edx,8), %mm0
        incl    %edx
        cmpl    %edx, %ecx
        movq    %mm0, -16(%ebp)
        movl    -16(%ebp), %esi
        movl    -12(%ebp), %edi
        jne     L4

2) Loop using gcc-3.3 compiled with -O2 -msse2:

        movq    (%esi,%edx,8), %mm0
        paddq   (%eax,%edx,8), %mm0
        addl    $1, %edx
        cmpl    %ecx, %edx
        jb      L6

AFAICT, the culprit is reload, which generates an extra load and store of %mm0:

(insn 62 30 63 2 (set (mem:V2SI (plus:SI (reg/f:SI 6 bp)
                (const_int -16 [0xfffffffffffffff0])) [0 S8 A8])
        (reg:V2SI 29 mm0)) 736 {*movv2si_internal} (nil)

(insn 63 62 32 2 (set (reg/v:V2SI 4 si [orig:61 sum ] [61])
        (mem:V2SI (plus:SI (reg/f:SI 6 bp)
                (const_int -16 [0xfffffffffffffff0])) [0 S8 A8])) 736 {*movv2si_internal} (nil)

Here is the larger test case from which above test was extracted:

#include <xmmintrin.h>

__m64 unsigned_add3( const __m64 *a, const __m64 *b, __m64 *result, unsigned long count )
{
        __m64 carry, temp, sum, one, onesCarry, _a, _b;
        unsigned int i;

        if( count > 0 )
        {
                _a = a[0];
                _b = b[0];

                one = _mm_cmpeq_pi8( _a, _a );  //-1
                one = _mm_sub_si64( _mm_xor_si64( one, one ), one );    //1
                sum = _mm_add_si64( _a, _b );

                onesCarry = _mm_and_si64( _a, _b );             //the 1's bit is set only if the 1's bit add generates a carry
                onesCarry = _mm_and_si64( onesCarry, one );             //onesCarry &= 1

                //Trim off the one's bit on both vA and vB to make room for a carry bit at the top after the add
                _a = _mm_srli_si64( _a, 1 );                                            //vA >>= 1
                _b = _mm_srli_si64( _b, 1 );                                            //vB >>= 1

                //Add vA to vB and add the carry bit
                carry = _mm_add_si64( _a, _b );
                carry = _mm_add_si64( carry, onesCarry );

                //right shift by 63 bits to get the carry bit for the high 64 bit quantity
                carry = _mm_srli_si64( carry, 63 );

                for( i = 1; i < count; i++ )
                {
                        result[i-1] = sum;
                        _a = a[i];
                        _b = b[i];
                        onesCarry = _mm_and_si64( _a, _b );
                        onesCarry = _mm_and_si64( onesCarry, one );
                        sum = _mm_add_si64( _a, _b );
                        _a = _mm_add_si64( _a, onesCarry );
                        onesCarry = _mm_and_si64( carry, _a );  //find low bit carry
                        sum = _mm_add_si64( sum, carry );               //add in carry bit to low word sum
                        carry = _mm_add_si64( _a, onesCarry );  //add in low bit carry to high result
                }

                result[i-1] = sum;
        }

        return carry;
}

Again, gcc-3.3 produces much better code for this loop.
Comment 1 Andrew Pinski 2005-06-22 20:16:00 UTC
This is an MMX builtin, not an SSE builtin.
Comment 2 Andrew Pinski 2005-06-22 20:26:22 UTC
This comes down to PR 14552 and PR 19161.  Since this is MMX code, it is very hard to get correct: GCC does not currently emit the required emms instruction after MMX vector code (you have to do it yourself), which is what PR 19161 is about.  This is a target problem.  Also see PR 22076.
Comment 3 Fariborz Jahanian 2005-09-13 00:52:01 UTC
Has there been any progress toward fixing the problems addressed by these PRs?
- thanks.
Comment 4 Uroš Bizjak 2007-03-01 13:47:36 UTC
Current mainline produces really horrible code:

        movl    (%edx), %ebx
        addl    $1, %eax
        movl    4(%edx), %esi
        addl    $8, %edx
        movl    %ebx, 8(%esp)
        movl    %esi, 12(%esp)
        movq    8(%esp), %mm0
        paddq   (%ecx), %mm0
        addl    $8, %ecx
        cmpl    %edi, %eax
        movq    %mm0, 8(%esp)
        movl    8(%esp), %ebx
        movl    12(%esp), %esi
        jne     .L4

This is due to two problems:

1) For some reason, ivopts doesn't use fancy i386 addressing modes. -fno-ivopts produces slightly better code:

        movl    (%edi,%eax,8), %edx
        movl    4(%edi,%eax,8), %ecx
        movl    %edx, 8(%esp)
        movl    %ecx, 12(%esp)
        movq    8(%esp), %mm0
        paddq   (%esi,%eax,8), %mm0
        addl    $1, %eax
        cmpl    %eax, %ebx
        movq    %mm0, 8(%esp)
        movl    8(%esp), %edx
        movl    12(%esp), %ecx
        ja      .L4

2) A DImode register is used in the middle of the RTL stream, leading to this reload sequence:

(insn:HI 21 20 53 4 (set (reg:DI 1 dx)
        (mem:DI (plus:SI (mult:SI (reg/v:SI 0 ax [orig:59 i ] [59])
                    (const_int 8 [0x8]))
                (reg/v/f:SI 5 di [orig:64 a ] [64])) [2 S8 A64])) 56 {*movdi_2} (nil)

(insn 53 21 54 4 (set (mem/c:DI (plus:SI (reg/f:SI 7 sp)
                (const_int 8 [0x8])) [5 S8 A8])
        (reg:DI 1 dx)) 56 {*movdi_2} (nil)

(insn 54 53 22 4 (set (reg:DI 29 mm0)
        (mem/c:DI (plus:SI (reg/f:SI 7 sp)
                (const_int 8 [0x8])) [5 S8 A8])) 56 {*movdi_2} (nil)

(insn:HI 22 54 55 4 (set (reg:DI 29 mm0)
        (unspec:DI [
                (plus:DI (reg:DI 29 mm0)
                    (mem:DI (plus:SI (mult:SI (reg/v:SI 0 ax [orig:59 i ] [59])
                                (const_int 8 [0x8]))
                            (reg/v/f:SI 4 si [orig:65 b ] [65])) [2 S8 A64]))
            ] 38)) 612 {mmx_adddi3} (insn_list:REG_DEP_TRUE 21 (nil))

The DImode register in insn 21 gets allocated to the dx/cx DImode pair, but insn 22 wants an MMX register. Reload then inserts insns 53 and 54 to satisfy the input and output constraints. The same story repeats at the end of the loop, but this time dx/cx gets allocated to a V2SImode pseudo (?!):

(insn:HI 24 55 26 4 (set (reg/v:V2SI 1 dx [orig:60 sum ] [60])
        (mem/c:V2SI (plus:SI (reg/f:SI 7 sp)
                (const_int 8 [0x8])) [5 S8 A8])) 581 {*movv2si_internal} (insn_list:REG_DEP_TRUE 22 (nil))

In the above case, the mm0 register (in DImode) gets reloaded from mm0 (V2SImode) via memory. It looks like MMX DImode _really_ upsets the register allocator, since it can be allocated either to an integer register pair or to an MMX register. Perhaps we need a V1DI mode to separate pure DImode values (either 2*32bit for i686 or 64bit for x86_64) from MMX DImode values.

It is possible to change the delicate allocation balance by changing register preferences in the movdi_2 and mov<mode>_internal MMX move patterns, but we really need a more robust solution for this problem.
Comment 5 uros 2008-03-08 07:00:14 UTC
Subject: Bug 22152

Author: uros
Date: Sat Mar  8 06:59:33 2008
New Revision: 133023

2008-03-08  Uros Bizjak  <>

        PR target/22152
        * config/i386/i386-modes.def (V1DI): New vector mode.
        * config/i386/i386.h (VALID_MMX_REG_MODE): Add V1DImode.
        * config/i386/ (MMXMODEI8): New mode iterator.
        (MMXMODE248): Ditto.
        (MMXMODE): Add V1DI mode.
        (mmxvecsize): Change DI mode to V1DI mode.
        ("mov<mode>"): Use MMXMODEI8 mode iterator.
        ("*mov<mode>_internal_rex64"): Ditto.
        ("*mov<mode>_internal"): Ditto.
        ("mmx_add<mode>3"): Ditto.  Handle V1DImode for TARGET_SSE2.
        ("mmx_sub<mode>3"): Ditto.
        ("mmx_adddi3"): Remove insn pattern.
        ("mmx_subdi3"): Ditto.
        ("mmx_ashr<mode>3"): Use SImode and "yN" constraint for operand 2.
        ("mmx_lshr<mode>3"): Ditto. Use MMXMODE248 mode iterator.
        ("mmx_ashl<mode>3"): Ditto.
        ("mmx_lshrdi3"): Remove insn pattern.
        ("mmx_ashldi3"): Ditto.
        * config/i386/i386.c (classify_argument): Handle V1DImode.
        (function_arg_advance_32): Ditto.
        (function_arg_32): Ditto.
        (struct builtin_description) [IX86_BUILTIN_PADDQ]: Use
        mmx_addv1di3 insn pattern.
        [IX86_BUILTIN_PSUBQ]: Use mmx_subv1di3 insn pattern.
        Remove definitions of built-in functions.
        (V1DI_type_node): New node.
        (v1di_ftype_v1di_int): Ditto.
        (v1di_ftype_v1di_v1di): Ditto.
        (v2si_ftype_v2si_si): Ditto.
        (v4hi_ftype_v4hi_di): Remove node.
        (v2si_ftype_v2si_di): Ditto.
        (ix86_init_mmx_sse_builtins): Handle V1DImode.
        (__builtin_ia32_psll?, __builtin_ia32_psrl?, __builtin_ia32_psra?):
        Redefine builtins using def_builtin_const with *_ftype_*_int node.
        (__builtin_ia32_psll?i, __builtin_ia32_psrl?i, __builtin_ia32_psra?i):
        Add new builtins using def_builtin_const.
        (ix86_expand_builtin) [IX86_BUILTIN_PSLL?, IX86_BUILTIN_PSRL?,
        IX86_BUILTIN_PSRA?I]: Handle builtin definitions.
        * config/i386/mmintrin.h (__v1di): New typedef.
        (_mm_add_si64): Cast arguments to __v1di type.
        (_mm_sub_si64): Ditto.
        (_mm_sll_pi16): Cast __count to __v4hi type.
        (_mm_sll_pi32): Cast __count to __v2si type.
        (_mm_sll_si64): Cast arguments to __v1di type.
        (_mm_srl_pi16): Cast __count to __v4hi type.
        (_mm_srl_pi32): Cast __count to __v2si type.
        (_mm_srl_si64): Cast arguments to __v1di type.
        (_mm_sra_pi16): Cast __count to __v4hi type.
        (_mm_sra_pi32): Cast __count to __v2si type.
        (_mm_slli_pi16): Use __builtin_ia32_psllwi.
        (_mm_slli_pi32): Use __builtin_ia32_pslldi.
        (_mm_slli_si64): Use __builtin_ia32_psllqi. Cast __m to __v1di type.
        (_mm_srli_pi16): Use __builtin_ia32_psrlwi.
        (_mm_srli_pi32): Use __builtin_ia32_psrldi.
        (_mm_srli_si64): Use __builtin_ia32_psrlqi. Cast __m to __v1di type.
        (_mm_srai_pi16): Use __builtin_ia32_psrawi.
        (_mm_srai_pi32): Use __builtin_ia32_psradi.
        * config/i386/ (UNSPEC_NOP): Remove unspec definition.
        * doc/extend.texi (X86 Built-in Functions) [__builtin_ia32_psll?,
        __builtin_ia32_psrl?, __builtin_ia32_psra?, __builtin_ia32_psll?i,
        __builtin_ia32_psrl?i, __builtin_ia32_psra?i]: Add new builtins.


Comment 6 Uroš Bizjak 2008-03-08 07:09:07 UTC
Patch at

New asm for the original testcase is now much better [see patch post].
Comment 7 Uroš Bizjak 2008-03-08 07:21:50 UTC
For updated testcase:

typedef int __m64 __attribute__ ((__vector_size__ (8)));
typedef long long __v1di __attribute__ ((__vector_size__ (8)));

__m64
unsigned_add3 (const __m64 * a, const __m64 * b, unsigned long count)
{
  __m64 sum;
  unsigned int i;

  for (i = 1; i < count; i++)
    sum = (__m64) __builtin_ia32_paddq ((__v1di) a[i], (__v1di) b[i]);

  return sum;
}

we now generate:

        addl    $1, %eax
        movq    (%edx), %mm0
        addl    $8, %edx
        paddq   (%ecx), %mm0
        addl    $8, %ecx
        cmpl    %eax, %ebx
        ja      .L3

and adding -fno-ivopts we are back to optimal code:

        movq    (%ebx,%eax,8), %mm0
        paddq   (%ecx,%eax,8), %mm0
        addl    $1, %eax
        cmpl    %eax, %edx
        ja      .L3

So, fixed.

Comment 8 Uroš Bizjak 2008-03-08 07:22:33 UTC
Fixed for real.
Comment 9 uros 2008-03-08 12:44:06 UTC
Subject: Bug 22152

Author: uros
Date: Sat Mar  8 12:43:13 2008
New Revision: 133034

        PR target/22152
        * New test.
        * Ditto.


Comment 10 Uroš Bizjak 2008-03-08 12:51:51 UTC
BTW: The larger testcase from Comment 0 should add two numbers together, but the carry propagation logic in its loop is fatally flawed. The testcase that was added to the testsuite [1] fixes this problem.

Comment 11 Uroš Bizjak 2010-09-06 12:05:14 UTC
Reopened due to 4.6 regression, see [1].

Comment 12 uros 2010-09-06 17:51:30 UTC
Subject: Bug 22152

Author: uros
Date: Mon Sep  6 17:51:12 2010
New Revision: 163926

	PR target/22152
	* config/i386/ (*mov<mode>_internal_rex64,
	*mov<mode>_internal_avx, *mov<mode>_internal,
	*movv2sf_internal_rex64_avx, *movv2sf_internal_rex64,
	*movv2sf_internal_avx, *movv2sf_internal): Split out !y-!y alternative.


Comment 13 uros 2010-09-06 17:55:08 UTC
Subject: Bug 22152

Author: uros
Date: Mon Sep  6 17:54:46 2010
New Revision: 163927

	PR target/22152
	* (add3): Change "count" to unsigned int.


Comment 14 Uroš Bizjak 2010-09-06 18:13:57 UTC
Fixed again.