Bug 38671 - [4.4/4.5/4.6/4.7 Regression] selecting one IV instead of three
Summary: [4.4/4.5/4.6/4.7 Regression] selecting one IV instead of three
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.4.0
: P2 normal
Target Milestone: 4.7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2008-12-30 12:57 UTC by tim blechmann
Modified: 2012-01-16 12:54 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work: 4.2.4, 4.7.0
Known to fail: 4.3.2, 4.5.0
Last reconfirmed: 2010-07-24 23:52:38


Attachments
preprocessed source (gcc-4.4) (85.14 KB, application/x-bzip)
2008-12-30 12:58 UTC, tim blechmann
Details

Note You need to log in before you can comment on or make changes to this bug.
Description tim blechmann 2008-12-30 12:57:11 UTC
i experience some speed regressions with gcc-4.4, with sse intrinsics on a core2 (x86_64). the code is:

namespace detail
{
/** compute x1 * (1 + x2 * amount)  */
__m128 inline amp_mod4_loop(__m128 x1, __m128 x2, __m128 amount, __m128 one)
{
    return _mm_mul_ps(x1,
                      _mm_add_ps(one,
                                 _mm_mul_ps(x2, amount)));
}
} /* namespace detail */

template <>
inline void amp_mod4(float * out, const float * in1, const float * in2,
                     const float amount, unsigned int n)
{
    n = n >> 2;
    const __m128 one = detail::gen_one();
    const __m128 amnt = _mm_set_ps1(amount);

    do
    {
        const __m128 x1 = _mm_load_ps(in1);
        in1 += 4;
        const __m128 x2 = _mm_load_ps(in2);
        in2 += 4;

        const __m128 result = detail::amp_mod4_loop(x1, x2, amnt, one);

        _mm_store_ps(out, result);
        out += 4;
    }
    while (--n);
}

the results for different compilers (using hardware performance counters) are:
gcc-4.4:
cycles: 1416276094
branch misses: 425897

gcc-4.4 -march=core2:
cycles: 1520034636
branch misses: 3263912

gcc-4.3:
cycles: 1548838336
branch misses: 5990424

gcc-4.3 -march=core2:
cycles: 1386605444
branch misses: 5609

gcc-4.2:
cycles: 1321697674
branch misses: 3682

it seems that gcc-4.3 with -march core2 and gcc-4.2 generate code, which is more friendly to the branch predictor. tuning for core2 on gcc-4.4 actually seems to generate worse code.

the best code (gcc-4.2) is:
0000000000400de0 <bench_1_simd(unsigned int)>:
  400de0:       66 0f ef c0             pxor   %xmm0,%xmm0
  400de4:       c1 ef 02                shr    $0x2,%edi
  400de7:       0f 28 15 32 0f 00 00    movaps 0xf32(%rip),%xmm2        # 401d20 <_IO_stdin_used+0xb0>
  400dee:       31 c0                   xor    %eax,%eax
  400df0:       66 0f 76 c0             pcmpeqd %xmm0,%xmm0
  400df4:       66 0f 72 d0 19          psrld  $0x19,%xmm0
  400df9:       66 0f 72 f0 17          pslld  $0x17,%xmm0
  400dfe:       0f 28 c8                movaps %xmm0,%xmm1
  400e01:       0f 28 80 e0 26 60 00    movaps 0x6026e0(%rax),%xmm0
  400e08:       0f 59 c2                mulps  %xmm2,%xmm0
  400e0b:       0f 58 c1                addps  %xmm1,%xmm0
  400e0e:       0f 59 80 e0 25 60 00    mulps  0x6025e0(%rax),%xmm0
  400e15:       0f 29 80 e0 24 60 00    movaps %xmm0,0x6024e0(%rax)
  400e1c:       48 83 c0 10             add    $0x10,%rax
  400e20:       83 ef 01                sub    $0x1,%edi
  400e23:       75 dc                   jne    400e01 <bench_1_simd(unsigned int)+0x21>
  400e25:       f3 c3                   repz retq 
  400e27:       90                      nop    
  400e28:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)

the worst code (gcc-4.4, -march=core2) is 15% slower:
0000000000400e70 <bench_1_simd(unsigned int)>:
  400e70:       66 0f ef d2             pxor   %xmm2,%xmm2
  400e74:       89 fa                   mov    %edi,%edx
  400e76:       66 0f 76 d2             pcmpeqd %xmm2,%xmm2
  400e7a:       c1 ea 02                shr    $0x2,%edx
  400e7d:       66 0f 72 d2 19          psrld  $0x19,%xmm2
  400e82:       ff ca                   dec    %edx
  400e84:       66 0f 72 f2 17          pslld  $0x17,%xmm2
  400e89:       48 ff c2                inc    %rdx
  400e8c:       0f 28 0d 7d 17 00 00    movaps 0x177d(%rip),%xmm1        # 402610 <_IO_stdin_used+0xb0>
  400e93:       48 c1 e2 04             shl    $0x4,%rdx
  400e97:       31 c0                   xor    %eax,%eax
  400e99:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400ea0:       0f 28 c1                movaps %xmm1,%xmm0
  400ea3:       0f 59 80 e0 36 60 00    mulps  0x6036e0(%rax),%xmm0
  400eaa:       0f 58 c2                addps  %xmm2,%xmm0
  400ead:       0f 59 80 e0 35 60 00    mulps  0x6035e0(%rax),%xmm0
  400eb4:       0f 29 80 e0 34 60 00    movaps %xmm0,0x6034e0(%rax)
  400ebb:       48 83 c0 10             add    $0x10,%rax
  400ebf:       48 39 d0                cmp    %rdx,%rax
  400ec2:       75 dc                   jne    400ea0 <bench_1_simd(unsigned int)+0x30>
  400ec4:       f3 c3                   repz retq 
  400ec6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  400ecd:       00 00 00
Comment 1 tim blechmann 2008-12-30 12:58:18 UTC
Created attachment 17013 [details]
preprocessed source (gcc-4.4)
Comment 2 Andrew Pinski 2008-12-31 07:47:39 UTC
sys_perf_counter_open always returns less than zero for me. 
This is with:
Linux gcc13 2.6.18-6-vserver-amd64 #1 SMP Sun Feb 10 17:55:04 UTC 2008 x86_64 GNU/Linux

What system call is it trying to do and why?
Comment 3 Andrew Pinski 2008-12-31 07:56:31 UTC
t.cc: In function ‘float __vector__ nova::detail::gen_one()’:
t.cc:34160: warning: ‘x’ is used uninitialized in this function

inline __m128 gen_one(void)
{
    __m128i x;
    __m128i ones = _mm_cmpeq_epi32(x, x);
    return (__m128)_mm_slli_epi32 (_mm_srli_epi32(ones, 25), 23);
}

Is undefined code I think.
Comment 4 Andrew Pinski 2008-12-31 08:10:02 UTC
  D.45587 = VIEW_CONVERT_EXPR<__v4si>(x);
  D.45589 = __builtin_ia32_pcmpeqd128 (D.45587, D.45587);
  D.45591 = __builtin_ia32_psrldi128 (D.45589, 25);
  D.45594 = __builtin_ia32_pslldi128 (D.45591, 23);
  one = VIEW_CONVERT_EXPR<__m128>(VIEW_CONVERT_EXPR<__m128i>(D.45594));
  D.45644 = (long unsigned int) ((n >> 2) + 4294967295) + 1 * 16;
  ivtmp.516 = 0;


So the inner loop is not the issue, only the setup code.

The extra subtract/add comes from D.45644.
Comment 5 Andrew Pinski 2008-12-31 08:12:50 UTC
Confirmed, though I don't have a fully reduced testcase yet.  Basically it comes down to using unsigned int rather than size_t.  If you had used size_t as the index, the code would have worked correctly.
Comment 6 tim blechmann 2008-12-31 09:20:07 UTC
> sys_perf_counter_open always returns less than zero for me. 
> This is with:
> Linux gcc13 2.6.18-6-vserver-amd64 #1 SMP Sun Feb 10 17:55:04 UTC 2008 x86_64
> GNU/Linux
> 
> What system call is it trying to do and why?
> 

it is trying to open the performance counters (http://lwn.net/Articles/310176/). it requires a patched kernel, though ...


(In reply to comment #3)
> t.cc: In function �float __vector__ nova::detail::gen_one()�:
> t.cc:34160: warning: �x� is used uninitialized in this function
> 
> inline __m128 gen_one(void)
> {
>     __m128i x;
>     __m128i ones = _mm_cmpeq_epi32(x, x);
>     return (__m128)_mm_slli_epi32 (_mm_srli_epi32(ones, 25), 23);
> }
> 
> Is undefined code I think.

this code is valid. the uninitialized xmm register x is compared with itself in order to set the register ones to ffffffffffffffffffffffffffffffff.
Comment 7 Andrew Pinski 2010-03-01 23:32:28 UTC
Still getting:
  D.41749_31 = n_13 + 4294967295;
  D.41750_32 = (long unsigned int) D.41749_31;
  D.41751_49 = D.41750_32 + 1;

Reduced testcase:
int f(int *a, int n, int *b)
{
  n = n >> 2;
  do {
   *b = *a;
    a += 4;
   b += 4;
  } while (--n);
}

--- CUT ---
I want to say this was introduced by POINTER plus work :(.
Comment 8 Andrew Pinski 2010-03-01 23:34:55 UTC
For 4.2, we use three IVs; while from 4.3 and above, we use one IV.
Comment 9 Richard Biener 2012-01-16 12:54:21 UTC
We're back to exactly the code from 4.2 on trunk:

.L2:
        movaps  in2(%rax), %xmm0
        mulps   %xmm1, %xmm0
        addps   %xmm2, %xmm0
        mulps   in1(%rax), %xmm0
        movaps  %xmm0, out(%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L2

but I can't reproduce the originally reported assembly with 4.4.0 either
(the report lacks information on flags used besides -march=core2, so I used
both -O2 and -O3 with the same result).

I can confirm that for the testcase in comment #7 we, since 4.3.x and
up to 4.6.x generate sth like

f:
.LFB0:
        .cfi_startproc
        sarl    $2, %esi
        xorl    %eax, %eax
        subl    $1, %esi
        addq    $1, %rsi
        salq    $4, %rsi
        .p2align 4,,10
        .p2align 3
.L2:
        movl    (%rdi,%rax), %ecx
        movl    %ecx, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L2

instead of what we generated with 4.2:

f:
.LFB2:
        sarl    $2, %esi
        .p2align 4,,7
.L2:
        movl    (%rdi), %eax
        addq    $16, %rdi
        movl    %eax, (%rdx)
        addq    $16, %rdx
        subl    $1, %esi
        jne     .L2
        rep ; ret

But that, even while not using decrement-and-branch looks superior to me.
For trunk we now create

f:
.LFB0:
        .cfi_startproc
        sarl    $2, %esi
        .p2align 4,,10
        .p2align 3
.L2:
        movl    (%rdi), %eax
        addq    $16, %rdi
        movl    %eax, (%rdx)
        addq    $16, %rdx
        subl    $1, %esi
        jne     .L2

again.

So, closing as fixed for trunk (or WORKSFORME for the original report).