38671 – [4.4/4.5/4.6/4.7 Regression] selecting one IV instead of three

Bug 38671 - [4.4/4.5/4.6/4.7 Regression] selecting one IV instead of three

Summary: [4.4/4.5/4.6/4.7 Regression] selecting one IV instead of three

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	4.4.0

Importance:	P2 normal
Target Milestone:	4.7.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2008-12-30 12:57 UTC by tim blechmann
Modified:	2012-01-16 12:54 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:	x86_64-linux-gnu
Build:
Known to work:	4.2.4, 4.7.0
Known to fail:	4.3.2, 4.5.0
Last reconfirmed:	2010-07-24 23:52:38

Attachments
preprocessed source (gcc-4.4) (85.14 KB, application/x-bzip) 2008-12-30 12:58 UTC, tim blechmann	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description tim blechmann 2008-12-30 12:57:11 UTC

i experience some speed regressions with gcc-4.4, with sse intrinsics on a core2 (x86_64). the code is:

namespace detail
{
/** compute x1 * (1 + x2 * amount)  */
__m128 inline amp_mod4_loop(__m128 x1, __m128 x2, __m128 amount, __m128 one)
{
    return _mm_mul_ps(x1,
                      _mm_add_ps(one,
                                 _mm_mul_ps(x2, amount)));
}
} /* namespace detail */

template <>
inline void amp_mod4(float * out, const float * in1, const float * in2,
                     const float amount, unsigned int n)
{
    n = n >> 2;
    const __m128 one = detail::gen_one();
    const __m128 amnt = _mm_set_ps1(amount);

    do
    {
        const __m128 x1 = _mm_load_ps(in1);
        in1 += 4;
        const __m128 x2 = _mm_load_ps(in2);
        in2 += 4;

        const __m128 result = detail::amp_mod4_loop(x1, x2, amnt, one);

        _mm_store_ps(out, result);
        out += 4;
    }
    while (--n);
}

the results for different compilers (using hardware performance counters) are:
gcc-4.4:
cycles: 1416276094
branch misses: 425897

gcc-4.4 -march=core2:
cycles: 1520034636
branch misses: 3263912

gcc-4.3:
cycles: 1548838336
branch misses: 5990424

gcc-4.3 -march=core2:
cycles: 1386605444
branch misses: 5609

gcc-4.2:
cycles: 1321697674
branch misses: 3682

it seems that gcc-4.3 with -march core2 and gcc-4.2 generate code, which is more friendly to the branch predictor. tuning for core2 on gcc-4.4 actually seems to generate worse code.

the best code (gcc-4.2) is:
0000000000400de0 <bench_1_simd(unsigned int)>:
  400de0:       66 0f ef c0             pxor   %xmm0,%xmm0
  400de4:       c1 ef 02                shr    $0x2,%edi
  400de7:       0f 28 15 32 0f 00 00    movaps 0xf32(%rip),%xmm2        # 401d20 <_IO_stdin_used+0xb0>
  400dee:       31 c0                   xor    %eax,%eax
  400df0:       66 0f 76 c0             pcmpeqd %xmm0,%xmm0
  400df4:       66 0f 72 d0 19          psrld  $0x19,%xmm0
  400df9:       66 0f 72 f0 17          pslld  $0x17,%xmm0
  400dfe:       0f 28 c8                movaps %xmm0,%xmm1
  400e01:       0f 28 80 e0 26 60 00    movaps 0x6026e0(%rax),%xmm0
  400e08:       0f 59 c2                mulps  %xmm2,%xmm0
  400e0b:       0f 58 c1                addps  %xmm1,%xmm0
  400e0e:       0f 59 80 e0 25 60 00    mulps  0x6025e0(%rax),%xmm0
  400e15:       0f 29 80 e0 24 60 00    movaps %xmm0,0x6024e0(%rax)
  400e1c:       48 83 c0 10             add    $0x10,%rax
  400e20:       83 ef 01                sub    $0x1,%edi
  400e23:       75 dc                   jne    400e01 <bench_1_simd(unsigned int)+0x21>
  400e25:       f3 c3                   repz retq 
  400e27:       90                      nop    
  400e28:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)

the worst code (gcc-4.4, -march=core2) is 15% slower:
0000000000400e70 <bench_1_simd(unsigned int)>:
  400e70:       66 0f ef d2             pxor   %xmm2,%xmm2
  400e74:       89 fa                   mov    %edi,%edx
  400e76:       66 0f 76 d2             pcmpeqd %xmm2,%xmm2
  400e7a:       c1 ea 02                shr    $0x2,%edx
  400e7d:       66 0f 72 d2 19          psrld  $0x19,%xmm2
  400e82:       ff ca                   dec    %edx
  400e84:       66 0f 72 f2 17          pslld  $0x17,%xmm2
  400e89:       48 ff c2                inc    %rdx
  400e8c:       0f 28 0d 7d 17 00 00    movaps 0x177d(%rip),%xmm1        # 402610 <_IO_stdin_used+0xb0>
  400e93:       48 c1 e2 04             shl    $0x4,%rdx
  400e97:       31 c0                   xor    %eax,%eax
  400e99:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400ea0:       0f 28 c1                movaps %xmm1,%xmm0
  400ea3:       0f 59 80 e0 36 60 00    mulps  0x6036e0(%rax),%xmm0
  400eaa:       0f 58 c2                addps  %xmm2,%xmm0
  400ead:       0f 59 80 e0 35 60 00    mulps  0x6035e0(%rax),%xmm0
  400eb4:       0f 29 80 e0 34 60 00    movaps %xmm0,0x6034e0(%rax)
  400ebb:       48 83 c0 10             add    $0x10,%rax
  400ebf:       48 39 d0                cmp    %rdx,%rax
  400ec2:       75 dc                   jne    400ea0 <bench_1_simd(unsigned int)+0x30>
  400ec4:       f3 c3                   repz retq 
  400ec6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400ecd:       00 00 00

Comment 1 tim blechmann 2008-12-30 12:58:18 UTC

Created attachment 17013 [details]
preprocessed source (gcc-4.4)

Comment 2 Andrew Pinski 2008-12-31 07:47:39 UTC

sys_perf_counter_open always returns less than zero for me. 
This is with:
Linux gcc13 2.6.18-6-vserver-amd64 #1 SMP Sun Feb 10 17:55:04 UTC 2008 x86_64 GNU/Linux

What system call is it trying to do and why?

Comment 3 Andrew Pinski 2008-12-31 07:56:31 UTC

t.cc: In function ‘float __vector__ nova::detail::gen_one()’:
t.cc:34160: warning: ‘x’ is used uninitialized in this function

inline __m128 gen_one(void)
{
    __m128i x;
    __m128i ones = _mm_cmpeq_epi32(x, x);
    return (__m128)_mm_slli_epi32 (_mm_srli_epi32(ones, 25), 23);
}

Is undefined code I think.

Comment 4 Andrew Pinski 2008-12-31 08:10:02 UTC

  D.45587 = VIEW_CONVERT_EXPR<__v4si>(x);
  D.45589 = __builtin_ia32_pcmpeqd128 (D.45587, D.45587);
  D.45591 = __builtin_ia32_psrldi128 (D.45589, 25);
  D.45594 = __builtin_ia32_pslldi128 (D.45591, 23);
  one = VIEW_CONVERT_EXPR<__m128>(VIEW_CONVERT_EXPR<__m128i>(D.45594));
  D.45644 = (long unsigned int) ((n >> 2) + 4294967295) + 1 * 16;
  ivtmp.516 = 0;


So the inner loop is not the issue, only the setup code.

The extra subtract/add comes from D.45644.

Comment 5 Andrew Pinski 2008-12-31 08:12:50 UTC

Confirmed, though I don't have a fully reduced testcase yet.  Basically it comes down to using unsigned int rather than size_t.  If you had used size_t as the index, the code would have worked correctly.

Comment 6 tim blechmann 2008-12-31 09:20:07 UTC

> sys_perf_counter_open always returns less than zero for me. 
> This is with:
> Linux gcc13 2.6.18-6-vserver-amd64 #1 SMP Sun Feb 10 17:55:04 UTC 2008 x86_64
> GNU/Linux
> 
> What system call is it trying to do and why?
> 

it is trying to open the performance counters (http://lwn.net/Articles/310176/). it requires a patched kernel, though ...


(In reply to comment #3)
> t.cc: In function �float __vector__ nova::detail::gen_one()�:
> t.cc:34160: warning: �x� is used uninitialized in this function
> 
> inline __m128 gen_one(void)
> {
>     __m128i x;
>     __m128i ones = _mm_cmpeq_epi32(x, x);
>     return (__m128)_mm_slli_epi32 (_mm_srli_epi32(ones, 25), 23);
> }
> 
> Is undefined code I think.

this code is valid. the uninitialized xmm register x is compared with itself in order to set the register ones to ffffffffffffffffffffffffffffffff.

Comment 7 Andrew Pinski 2010-03-01 23:32:28 UTC

Still getting:
  D.41749_31 = n_13 + 4294967295;
  D.41750_32 = (long unsigned int) D.41749_31;
  D.41751_49 = D.41750_32 + 1;

Reduced testcase:
int f(int *a, int n, int *b)
{
  n = n >> 2;
  do {
   *b = *a;
    a += 4;
   b += 4;
  } while (--n);
}

--- CUT ---
I want to say this was introduced by POINTER plus work :(.

Comment 8 Andrew Pinski 2010-03-01 23:34:55 UTC

For 4.2, we use three IVs; while from 4.3 and above, we use one IV.

Comment 9 Richard Biener 2012-01-16 12:54:21 UTC

We're back to exactly the code from 4.2 on trunk:

.L2:
        movaps  in2(%rax), %xmm0
        mulps   %xmm1, %xmm0
        addps   %xmm2, %xmm0
        mulps   in1(%rax), %xmm0
        movaps  %xmm0, out(%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L2

but I can't reproduce the originally reported assembly with 4.4.0 either
(the report lacks information on flags used besides -march=core2, so I used
both -O2 and -O3 with the same result).

I can confirm that for the testcase in comment #7 we, since 4.3.x and
up to 4.6.x generate sth like

f:
.LFB0:
        .cfi_startproc
        sarl    $2, %esi
        xorl    %eax, %eax
        subl    $1, %esi
        addq    $1, %rsi
        salq    $4, %rsi
        .p2align 4,,10
        .p2align 3
.L2:
        movl    (%rdi,%rax), %ecx
        movl    %ecx, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L2

instead of what we generated with 4.2:

f:
.LFB2:
        sarl    $2, %esi
        .p2align 4,,7
.L2:
        movl    (%rdi), %eax
        addq    $16, %rdi
        movl    %eax, (%rdx)
        addq    $16, %rdx
        subl    $1, %esi
        jne     .L2
        rep ; ret

But that, even while not using decrement-and-branch looks superior to me.
For trunk we now create

f:
.LFB0:
        .cfi_startproc
        sarl    $2, %esi
        .p2align 4,,10
        .p2align 3
.L2:
        movl    (%rdi), %eax
        addq    $16, %rdi
        movl    %eax, (%rdx)
        addq    $16, %rdx
        subl    $1, %esi
        jne     .L2

again.

So, closing as fixed for trunk (or WORKSFORME for the original report).