This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug target/68365] gfortran test case showing performance loss with vectorization


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68365

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, there are many loops here.  I looked at the following one (assuming the
interesting loops are the ones marked with safelen(1)):

      subroutine s111(ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
      use lcd_mod
C
C     linear dependence testing
C     no dependence - vectorizable
C
      integer ntimes,ld,n,i,nl
      real a(n),b(n),c(n),d(n),e(n),aa(ld,n),bb(ld,n),cc(ld,n)
      real t1,t2,chksum,ctime,dtime,cs1d
      call init(ld,n,a,b,c,d,e,aa,bb,cc,'s111 ')
      call forttime(t1)
      do nl= 1,2*ntimes
#ifndef __MIC__
!$omp simd safelen(1)
#endif
          do i= 2,n,2
            a(i)= a(i-1)+b(i)
            enddo
          call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.)
        enddo
      call forttime(t2)

and current trunk doesn't consider vectorizing this profitable unless -mavx is
given (it seems to need the larger vector size to reach profitability).
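
A hypothetical way to reproduce this (the PR doesn't give the exact command,
and the file name is a placeholder):

  gfortran -O3 -mavx -fopt-info-vec s111.f90

With -mavx the inner loop is vectorized into the code shown below; without it
the vectorizer gives up on profitability grounds.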

Because of the step of 2 it ends up using strided stores.  Instead of doing
interleaving on the loads and stores we could have just operated on all
elements (rather than only the even ones) and then used a masked store.  That
would waste half of the vector bandwidth but save all the shuffles (a sketch
of this alternative follows the generated code below).

.L8:
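        # load 64 bytes from each stride-2 input stream as four 16-byte pieces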
        vmovups (%rdx), %xmm0
        addl    $1, %r9d
        addq    $64, %rdx
        addq    $64, %r11
        vmovups -32(%rdx), %xmm2
        vinsertf128     $0x1, -48(%rdx), %ymm0, %ymm1
        vmovups -64(%r11), %xmm9
        vinsertf128     $0x1, -16(%rdx), %ymm2, %ymm3
        vmovups -32(%r11), %xmm11
        vinsertf128     $0x1, -48(%r11), %ymm9, %ymm10
        vinsertf128     $0x1, -16(%r11), %ymm11, %ymm12
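        # extract and reorder the even-indexed elements; vshufps only works
        # within 128-bit lanes, so vperm2f128/vinsertf128 are needed as well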
        vshufps $136, %ymm3, %ymm1, %ymm4
        vshufps $136, %ymm12, %ymm10, %ymm13
        vperm2f128      $3, %ymm4, %ymm4, %ymm5
        vperm2f128      $3, %ymm13, %ymm13, %ymm14
        vshufps $68, %ymm5, %ymm4, %ymm6
        vshufps $238, %ymm5, %ymm4, %ymm7
        vshufps $68, %ymm14, %ymm13, %ymm15
        vshufps $238, %ymm14, %ymm13, %ymm0
        vinsertf128     $1, %xmm7, %ymm6, %ymm8
        vinsertf128     $1, %xmm0, %ymm15, %ymm1
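        # all eight useful sums a(i-1)+b(i) are done by a single vaddps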
        vaddps  %ymm1, %ymm8, %ymm2
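        # strided store: the eight results go back one element at a time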
        vextractf128    $0x1, %ymm2, %xmm4
        vmovss  %xmm2, -60(%rdx)
        vextractps      $1, %xmm2, -52(%rdx)
        vextractps      $2, %xmm2, -44(%rdx)
        vextractps      $3, %xmm2, -36(%rdx)
        vmovss  %xmm4, -28(%rdx)
        vextractps      $1, %xmm4, -20(%rdx)
        vextractps      $2, %xmm4, -12(%rdx)
        vextractps      $3, %xmm4, -4(%rdx)
        cmpl    %r9d, %ecx
        ja      .L8

What we fail to realize here is that cross-lane interleaving doesn't really
work with AVX256 (the 256-bit shuffles operate within 128-bit lanes), and thus
the interleave for the loads is much more expensive than we think.
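
For illustration, a minimal sketch of the masked-store alternative mentioned
above, written with AVX intrinsics (the function name, the index handling and
the epilogue are my own; this is not what the vectorizer would emit, just the
idea for a(i) = a(i-1) + b(i), i = 2,n,2): both streams are loaded
contiguously, the odd lanes are computed and thrown away, and one vmaskmovps
replaces the per-element stores.

#include <immintrin.h>

void
s111_masked (float *a, const float *b, long n)
{
  /* vmaskmovps stores a lane iff the sign bit of the corresponding mask
     element is set; select lanes 0, 2, 4 and 6.  */
  const __m256i mask = _mm256_set_epi32 (0, -1, 0, -1, 0, -1, 0, -1);
  long i = 2;                                /* 1-based Fortran index.  */

  /* The b load reads up to b(i+7), hence the i + 7 <= n bound.  */
  for (; i + 7 <= n; i += 8)
    {
      __m256 va = _mm256_loadu_ps (&a[i - 2]);   /* a(i-1) ... a(i+6)  */
      __m256 vb = _mm256_loadu_ps (&b[i - 1]);   /* b(i)   ... b(i+7)  */
      /* Lane j holds a(i-1+j) + b(i+j); for even j that is exactly the
         value iteration i+j stores into a(i+j).  Odd lanes are wasted.  */
      __m256 vr = _mm256_add_ps (va, vb);
      /* One masked store instead of the vextractps/vmovss sequence.  */
      _mm256_maskstore_ps (&a[i - 1], mask, vr);
    }

  for (; i <= n; i += 2)                     /* scalar epilogue  */
    a[i - 1] = a[i - 2] + b[i - 1];
}

Per eight elements of a this is two unaligned loads, one vaddps and one
vmaskmovps, with no shuffles at all, at the cost of half the lanes doing
useless work.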

This mis-costing of the interleaving is a general vectorizer cost model issue:

      /* Uses an even and odd extract operations or shuffle operations
         for each needed permute.  */
      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
                                      stmt_info, 0, vect_body);

which (1) doesn't treat single-element interleaving any differently, and
(2) charges a flat vec_perm cost per permute, even though the real cost
heavily depends on the actual (constant) permutation used.
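
As a rough worked example (the numbers are assumptions for illustration, not
taken from a vectorizer dump): treating one of the stride-2 loads above as an
interleaving group of size 2 and assuming ncopies = 2, the formula charges

      nstmts = 2 * ceil_log2 (2) * 2 = 4

vec_perm statements, each at the target's single vec_perm cost, regardless of
whether the permutation ends up as one in-lane vshufps or as the
vshufps/vperm2f128/vinsertf128 sequence per stream in the AVX256 loop above.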
