
Bug 79830 - GCC generates counterproductive code surrounding very simple loops (improvement request)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.0.1
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-03-03 13:08 UTC by Petr
Modified: 2017-05-10 11:51 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-03-03 00:00:00


Attachments

Description Petr 2017-03-03 13:08:33 UTC
It seems that GCC tries very hard to optimize loops, but in my case it's counterproductive. I have illustrated the problem in the following C++ code and disassembly.

Loops constructed this way need only one variable (`i`) as a loop counter and use the sign flag to check whether the loop is done. Typically such a loop requires only a simple check at the beginning (`sub` and `js`) and at the end. The point of writing the loop this way is to save registers and to keep the code surrounding the loop minimal.

However, it seems that GCC converts such a loop into something else, and needs a lot of extra operations to do so, resulting in bigger and slower code. With `-Os` GCC produces the code I would expect; however, I don't want to optimize for size globally.

It's not a compiler bug, but I think that in this case the optimization doesn't make sense and only adds to the executable/library size. I doubt it leads to any improvement, and it would be nice if GCC could somehow learn not to do this for this kind of loop.

Also, here is a Compiler Explorer URL for people who want to compare:

  https://godbolt.org/g/oeDGmy



Consider the following C++ code
-------------------------------


#include <stddef.h>  /* size_t */
#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

void transform(double* dst, const double* src, const double* matrix, size_t length) {
  __m256d m_00_11 = _mm256_castpd128_pd256(_mm_set_pd(matrix[3], matrix[0]));
  __m256d m_10_01 = _mm256_castpd128_pd256(_mm_set_pd(matrix[1], matrix[2]));
  __m256d m_20_21 = _mm256_broadcast_pd(reinterpret_cast<const __m128d*>(matrix + 4));
  
  m_00_11 = _mm256_permute2f128_pd(m_00_11, m_00_11, 0);
  m_10_01 = _mm256_permute2f128_pd(m_10_01, m_10_01, 0);
  
  intptr_t i = static_cast<intptr_t>(length);
  while ((i -= 8) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src +  0);
    __m256d s1 = _mm256_loadu_pd(src +  4);
    __m256d s2 = _mm256_loadu_pd(src +  8);
    __m256d s3 = _mm256_loadu_pd(src + 12);

    __m256d a0 = _mm256_add_pd(_mm256_mul_pd(s0, m_00_11), m_20_21);
    __m256d a1 = _mm256_add_pd(_mm256_mul_pd(s1, m_00_11), m_20_21);
    __m256d a2 = _mm256_add_pd(_mm256_mul_pd(s2, m_00_11), m_20_21);
    __m256d a3 = _mm256_add_pd(_mm256_mul_pd(s3, m_00_11), m_20_21);
    
    __m256d b0 = _mm256_mul_pd(_mm256_shuffle_pd(s0, s0, 0x1), m_10_01);
    __m256d b1 = _mm256_mul_pd(_mm256_shuffle_pd(s1, s1, 0x1), m_10_01);
    __m256d b2 = _mm256_mul_pd(_mm256_shuffle_pd(s2, s2, 0x1), m_10_01);
    __m256d b3 = _mm256_mul_pd(_mm256_shuffle_pd(s3, s3, 0x1), m_10_01);
    
    _mm256_storeu_pd(dst +  0, _mm256_add_pd(a0, b0));
    _mm256_storeu_pd(dst +  4, _mm256_add_pd(a1, b1));
    _mm256_storeu_pd(dst +  8, _mm256_add_pd(a2, b2));
    _mm256_storeu_pd(dst + 12, _mm256_add_pd(a3, b3));
    
    dst += 16;
    src += 16;
  }
  i += 8;

  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);

    __m256d a0 = _mm256_add_pd(_mm256_mul_pd(s0, m_00_11), m_20_21);
    __m256d b0 = _mm256_mul_pd(_mm256_shuffle_pd(s0, s0, 0x1), m_10_01);
 
    _mm256_storeu_pd(dst, _mm256_add_pd(a0, b0));
    
    dst += 4;
    src += 4;
  }
  
  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src +  0);

    __m128d a0 = _mm_add_pd(_mm_mul_pd(s0, _mm256_castpd256_pd128(m_00_11)), _mm256_castpd256_pd128(m_20_21));
    __m128d b0 = _mm_mul_pd(_mm_shuffle_pd(s0, s0, 0x1), _mm256_castpd256_pd128(m_10_01));
 
    _mm_storeu_pd(dst +  0, _mm_add_pd(a0, b0));
  }
}
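For comparison, here is a scalar reference of what the intrinsics above compute per point (a sketch; `transform_scalar` is a name introduced here, and the matrix layout `[m00, m01, m10, m11, m20, m21]` is inferred from how `matrix[0..5]` is loaded above):

```cpp
#include <stddef.h>
#include <assert.h>

// Scalar reference for the same per-point 2D affine transform (a sketch).
// From the loads above: m_00_11 pairs (matrix[0], matrix[3]), m_10_01 pairs
// (matrix[2], matrix[1]), and m_20_21 broadcasts (matrix[4], matrix[5]), so
//   x' = x*matrix[0] + y*matrix[2] + matrix[4]
//   y' = x*matrix[1] + y*matrix[3] + matrix[5]
// `length` counts points; each point is two doubles (x, y).
void transform_scalar(double* dst, const double* src,
                      const double* matrix, size_t length) {
  for (size_t i = 0; i < length; i++) {
    double x = src[i * 2 + 0];
    double y = src[i * 2 + 1];
    dst[i * 2 + 0] = x * matrix[0] + y * matrix[2] + matrix[4];
    dst[i * 2 + 1] = x * matrix[1] + y * matrix[3] + matrix[5];
  }
}
```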



Which is compiled to the following
----------------------------------

(-O2 -mavx -fno-exceptions -fno-tree-vectorize)

See the comments marking what I didn't like.

transform(double*, double const*, double const*, unsigned long):
        vmovsd  xmm4, QWORD PTR [rdx]
        mov     r9, rcx
        vmovsd  xmm5, QWORD PTR [rdx+16]
        sub     r9, 8
        vmovhpd xmm4, xmm4, QWORD PTR [rdx+24]
        vbroadcastf128  ymm6, XMMWORD PTR [rdx+32]
        mov     r8, rcx
        vmovhpd xmm5, xmm5, QWORD PTR [rdx+8]
        vperm2f128      ymm4, ymm4, ymm4, 0
        vperm2f128      ymm5, ymm5, ymm5, 0
        js      .L6

        ;; <----------------------- Weird
        mov     rax, r9
        sub     rcx, 16
        mov     r8, r9
        and     rax, -8
        mov     rdx, rsi
        sub     rcx, rax
        mov     rax, rdi
        ;; <----------------------- Weird
.L5:
        vmovupd xmm3, XMMWORD PTR [rdx]
        sub     r8, 8
        sub     rax, -128
        sub     rdx, -128
        vinsertf128     ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1
        vmovupd xmm2, XMMWORD PTR [rdx-96]
        vmovupd xmm1, XMMWORD PTR [rdx-64]
        vinsertf128     ymm2, ymm2, XMMWORD PTR [rdx-80], 0x1
        vmovupd xmm0, XMMWORD PTR [rdx-32]
        vshufpd ymm10, ymm3, ymm3, 1
        vmulpd  ymm3, ymm4, ymm3
        vinsertf128     ymm1, ymm1, XMMWORD PTR [rdx-48], 0x1
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rdx-16], 0x1
        vaddpd  ymm3, ymm3, ymm6
        vmulpd  ymm10, ymm5, ymm10
        vshufpd ymm9, ymm2, ymm2, 1
        vshufpd ymm8, ymm1, ymm1, 1
        vshufpd ymm7, ymm0, ymm0, 1
        vaddpd  ymm3, ymm3, ymm10
        vmulpd  ymm2, ymm4, ymm2
        vmulpd  ymm1, ymm4, ymm1
        vmulpd  ymm0, ymm4, ymm0
        vmovups XMMWORD PTR [rax-128], xmm3
        vextractf128    XMMWORD PTR [rax-112], ymm3, 0x1
        vaddpd  ymm2, ymm2, ymm6
        vmulpd  ymm3, ymm5, ymm9
        vaddpd  ymm1, ymm1, ymm6
        vmulpd  ymm8, ymm5, ymm8
        vaddpd  ymm0, ymm0, ymm6
        vmulpd  ymm7, ymm5, ymm7
        vaddpd  ymm2, ymm2, ymm3
        vaddpd  ymm1, ymm1, ymm8
        vaddpd  ymm0, ymm0, ymm7
        vmovups XMMWORD PTR [rax-96], xmm2
        vextractf128    XMMWORD PTR [rax-80], ymm2, 0x1
        vmovups XMMWORD PTR [rax-64], xmm1
        vextractf128    XMMWORD PTR [rax-48], ymm1, 0x1
        vmovups XMMWORD PTR [rax-32], xmm0
        vextractf128    XMMWORD PTR [rax-16], ymm0, 0x1
        cmp     rcx, r8
        jne     .L5

        ;; <----------------------- Weird
        mov     rax, r9
        shr     rax, 3
        lea     rdx, [rax+1]
        neg     rax
        lea     r8, [r9+rax*8]
        sal     rdx, 7
        add     rdi, rdx
        add     rsi, rdx
.L6:
        mov     rax, r8
        sub     rax, 2
        js      .L4
        shr     rax
        lea     rcx, [rax+1]
        mov     rdx, rax
        xor     eax, eax
        sal     rcx, 5
        ;; <----------------------- Weird

.L7:
        vmovupd xmm0, XMMWORD PTR [rsi+rax]
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rsi+16+rax], 0x1
        vshufpd ymm1, ymm0, ymm0, 1
        vmulpd  ymm0, ymm4, ymm0
        vmulpd  ymm1, ymm5, ymm1
        vaddpd  ymm0, ymm0, ymm6
        vaddpd  ymm0, ymm0, ymm1
        vmovups XMMWORD PTR [rdi+rax], xmm0
        vextractf128    XMMWORD PTR [rdi+16+rax], ymm0, 0x1
        add     rax, 32
        ;; <----------------------- Doesn't follow (i -= 2) >= 0
        cmp     rcx, rax
        jne     .L7
        ;; <----------------------- Weird
        mov     rax, rdx
        add     rdi, rcx
        add     rsi, rcx
        neg     rax
        lea     rax, [r8-4+rax*2]
        ;; <----------------------- Weird
.L4:
        test    al, 1
        je      .L14
        vmovupd xmm0, XMMWORD PTR [rsi]
        vmulpd  xmm4, xmm4, xmm0
        vshufpd xmm1, xmm0, xmm0, 1
        vaddpd  xmm6, xmm4, xmm6
        vmulpd  xmm5, xmm5, xmm1
        vaddpd  xmm5, xmm6, xmm5
        vmovups XMMWORD PTR [rdi], xmm5
.L14:
        vzeroupper
        ret
Comment 1 Richard Biener 2017-03-03 13:19:02 UTC
It is induction variable optimization (-fivopts) that re-writes the main induction variable.  We have

Original cost 17 (complexity 2)

Final cost 17 (complexity 2)

Selected IV set for loop 2 at t.C:44, 4 avg niters, 0 expressions, 1 IVs:
Candidate 5:
  Var befor: ivtmp.25_108
  Var after: ivtmp.25_107
  Incr POS: before exit test
  IV struct:
    Type:       sizetype
    Base:       0
    Step:       32
    Biv:        N
    Overflowness wrto loop niter:       No-overflow

Replacing exit test: if (i_32 >= 0)

but it doesn't seem to account the extra cost for the exit test replacement
when facing equal original/final cost.
Comment 2 Petr 2017-03-03 13:36:01 UTC
I'm not sure I follow regarding the exit test. I mean, the code should be correct, as each point has an x|y coordinate, which is two doubles, so length 8 means 16 doubles (I converted my production code into a simpler form that uses only native types).

However, I think the problem is also that if this code were handwritten it would use only 3 registers (dst, src, and i), but GCC uses:

  rax|rcx|rdx|rsi|rdi|r8|r9

which is a lot, and the same code in 32-bit mode contains one short spill of a GP register. Basically, if I needed more GP registers inside the function, the problem would be much bigger (though I have no idea whether GCC would use a different approach in that case).
Comment 3 Petr 2017-03-03 13:38:36 UTC
Sorry for the misunderstanding, I initially read your comment as saying that you had replaced the exit condition in the sample code :)
Comment 4 Petr 2017-03-04 23:17:15 UTC
I think the test case can be simplified to the following code. It still suffers from the same issues mentioned above.

#include <stddef.h>  /* size_t */
#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

void transform(double* dst, const double* src, const double* matrix, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);
  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);
    _mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));
    
    dst += 4;
    src += 4;
  }
  
  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src);
    _mm_storeu_pd(dst, _mm_add_pd(s0, s0));
  }
}
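As a side note on the loop idiom itself, the counter bookkeeping can be modelled in scalar form (a sketch; `LoopCounts` and `count_iterations` are hypothetical names): starting from `i = length`, the `while ((i -= 2) >= 0)` loop runs `length / 2` times, and afterwards `i` is -1 for odd lengths and -2 for even ones, so `i & 1` selects the scalar tail exactly when one point is left over.

```cpp
#include <stdint.h>

// Scalar model of the counter idiom from the test case (a sketch):
// counts how many 2-point vector iterations run and whether the
// 1-point scalar tail runs, for a given point count `length`.
struct LoopCounts { intptr_t vector_iters; bool tail; };

LoopCounts count_iterations(intptr_t length) {
  intptr_t i = length;
  intptr_t iters = 0;
  while ((i -= 2) >= 0)  // sign flag decides; one counter only
    iters++;
  // On exit, i == -1 (odd length) or i == -2 (even length).
  return { iters, (i & 1) != 0 };
}
```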
Comment 5 amker 2017-05-10 10:27:46 UTC
(In reply to Richard Biener from comment #1)
> It is induction variable optimization (-fivopts) that re-writes the main
> induction variable.  We have
> 
> Original cost 17 (complexity 2)
> 
> Final cost 17 (complexity 2)
> 
> Selected IV set for loop 2 at t.C:44, 4 avg niters, 0 expressions, 1 IVs:
> Candidate 5:
>   Var befor: ivtmp.25_108
>   Var after: ivtmp.25_107
>   Incr POS: before exit test
>   IV struct:
>     Type:       sizetype
>     Base:       0
>     Step:       32
>     Biv:        N
>     Overflowness wrto loop niter:       No-overflow
> 
> Replacing exit test: if (i_32 >= 0)
> 
> but it doesn't seem to account the extra cost for the exit test replacement
> when facing equal original/final cost.

For the iv elimination issue, I think it's simply a bug in computing the elimination cost:
  /* When the condition is a comparison of the candidate IV against
     zero, prefer this IV.

     TODO: The constant that we're subtracting from the cost should
     be target-dependent.  This information should be added to the
     target costs for each backend.  */
  if (!elim_cost.infinite_cost_p () /* Do not try to decrease infinite! */
      && integer_zerop (*bound_cst)
      && (operand_equal_p (*control_var, cand->var_after, 0)
	  || operand_equal_p (*control_var, cand->var_before, 0)))
    elim_cost -= 1;

Why does it compare against the current bound_cst for elim_cost?  After elimination, we (very likely) don't compare against bound_cst any more!
Comment 6 amker 2017-05-10 11:51:47 UTC
BTW, I don't see a problem in iv_elimination for the second loop, the .L7 one.  It eliminates three IVs into one IV.  Granted, the bloated loop header could be simplified further, but that's another issue requiring more vrp information, i.e., simplifying
  ((unsigned)i + 18446744073709551614) / 2 + 1
into
  (unsigned)i
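For what it's worth, the identity can be checked numerically (a sketch; `identity_holds` is a hypothetical name). With 64-bit wraparound the left-hand expression evaluates to `i / 2` for every `i >= 2`, the range the loop header guarantees, so `(unsigned)i` above presumably stands for the halved count:

```cpp
#include <stdint.h>

// Checks the header-simplification identity numerically (a sketch):
// 18446744073709551614 is 2^64 - 2, so with unsigned wraparound the
// expression is ((i - 2) / 2) + 1, which equals i / 2 whenever i >= 2.
// It does NOT hold for i < 2, which is why vrp range info is needed.
bool identity_holds(uint64_t i) {
  return (i + 18446744073709551614ULL) / 2 + 1 == i / 2;
}
```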