Bug 68365

Summary:	gfortran test case showing performance loss with vectorization
Product:	gcc	Reporter:	Tim Prince <tprince>
Component:	target	Assignee:	Not yet assigned to anyone <unassigned>
Status:	NEW ---
Severity:	normal	CC:	rguenth, tprince
Priority:	P3
Version:	6.0
Target Milestone:	---
Host:	x86_64-pc-cygwin	Target:	x86_64-pc-cygwin
Build:	x86_64-pc-cygwin	Known to work:
Known to fail:		Last reconfirmed:	2015-11-15 00:00:00
Bug Depends on:
Bug Blocks:	53947
Attachments:	gzip tar file of Fortran and C source files; fortran source

Description Tim Prince 2015-11-15 19:24:14 UTC

Created attachment 36716 [details]
gzip tar file of Fortran and C source files

Just recently, it has become necessary to add the omp simd safelen(1) directive in subroutine s111 in order to prevent a vectorization which reduces performance on all known IA targets other than Intel Xeon Phi.
The same situation occurs in gcc/g++, and (for several years) icc/icpc (but not ifort).
To build the test case on cygwin:
make -j 3 -f Makefile.cygwin lcd_ffast
I haven't tested the latest gfortran build on linux, but I do have a Makefile for that, in case it's useful.
In the Makefile, CLOCK_RATE is set to enable accurate translation from rdtsc ticks to seconds.
The timings reported for VL=100 and VL=1000 show the reduced performance of s111 when it is vectorized, i.e. when safelen(1) is removed.
For gcc and g++, functions s128() and s4113() also need vectorization disabled for full performance, but gfortran doesn't exhibit that problem.  For this filing, you can ignore everything but subroutine s111.
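
For readers who prefer C, here is a minimal sketch of what the s111 loop looks like with the same directive. It is not taken from the attached sources; the function name s111_kernel and its signature are made up for illustration, and the indices are shifted to 0-based.

```c
#include <stddef.h>

/* Illustrative C counterpart of the s111 loop (hypothetical name).
   Fortran a(i), i = 2,4,... becomes a[i], i = 1,3,... in 0-based C. */
void s111_kernel(float *a, const float *b, size_t n)
{
    /* safelen(1) allows only one iteration in flight at a time; the
       report relies on this to keep the loop from being vectorized. */
#pragma omp simd safelen(1)
    for (size_t i = 1; i < n; i += 2)
        a[i] = a[i - 1] + b[i];
}
```

The pragma only takes effect when OpenMP SIMD support is enabled (for example with -fopenmp or -fopenmp-simd in gcc); otherwise it is ignored and the vectorizer decides on its own.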

Comment 1 Dominique d'Humieres 2015-11-15 19:37:23 UTC

make: *** No rule to make target 'lcdmod.o', needed by 'lcd_mod.mod'.  Stop.

or

Fatal Error: Can't open module file 'lcd_mod.mod' for reading at (1): No such file or directory

What should be done on non-cygwin platforms?

Comment 2 Tim Prince 2015-11-16 00:29:35 UTC

Created attachment 36722 [details]
fortran source

gfortran -c lcdmod.f90 should take care of the missing .mod

Comment 3 Richard Biener 2015-11-16 12:13:51 UTC

Hmm, there are many loops here.  I looked at the following (assuming the interesting loops are marked with safelen(1))

      subroutine s111(ntimes,ld,n,ctime,dtime,a,b,c,d,e,aa,bb,cc)
      use lcd_mod
C
C     linear dependence testing
C     no dependence - vectorizable
C
      integer ntimes,ld,n,i,nl
      real a(n),b(n),c(n),d(n),e(n),aa(ld,n),bb(ld,n),cc(ld,n)
      real t1,t2,chksum,ctime,dtime,cs1d
      call init(ld,n,a,b,c,d,e,aa,bb,cc,'s111 ')
      call forttime(t1)
      do nl= 1,2*ntimes
#ifndef __MIC__
!$omp simd safelen(1)
#endif
          do i= 2,n,2
            a(i)= a(i-1)+b(i)
            enddo
          call dummy(ld,n,a,b,c,d,e,aa,bb,cc,1.)
        enddo
      call forttime(t2)

and current trunk doesn't consider this profitable unless -mavx is given
(it needs the larger vector size for profitability it seems).

Because of the step 2 it ends up using strided stores.  Instead of
doing interleaving on the loads and stores we could have just operated
on all elements (rather than only even ones) and then use a masked
store.  That would waste half of the vector bandwidth but save all the
shuffles.
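
The "operate on all elements, then masked store" idea can be sketched in portable C roughly as follows. This is only an emulation under assumptions: the block width VLEN = 8 (one AVX register of floats), the function names, and the scalar tail are all illustrative, and the masked store is written as a plain loop rather than an actual masked-store instruction.

```c
#include <stddef.h>

enum { VLEN = 8 };  /* assumed block width: 8 floats, one AVX register */

/* Reference: update only the odd 0-based elements, one at a time. */
void s111_scalar(float *a, const float *b, size_t n)
{
    for (size_t i = 1; i < n; i += 2)
        a[i] = a[i - 1] + b[i];
}

/* Sketch: compute every lane of a contiguous block, then keep only the
   lanes the original loop would have written. */
void s111_all_lanes(float *a, const float *b, size_t n)
{
    size_t i = 1;
    for (; i + VLEN <= n; i += VLEN) {
        float sum[VLEN];
        /* Contiguous loads and adds: no even/odd de-interleaving. */
        for (size_t k = 0; k < VLEN; ++k)
            sum[k] = a[i + k - 1] + b[i + k];
        /* Emulated masked store: lanes 0, 2, 4, 6 are the odd indices. */
        for (size_t k = 0; k < VLEN; k += 2)
            a[i + k] = sum[k];
    }
    /* Scalar tail for the remaining elements; i is still odd here. */
    for (; i < n; i += 2)
        a[i] = a[i - 1] + b[i];
}
```

Half of the additions in each block are thrown away, which is the wasted bandwidth mentioned above, but the loads stay contiguous. The discarded lanes read elements the loop writes, which is harmless here because all loads of a block happen before its stores and the kept lanes only read even-indexed elements that are never written.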

.L8:
        vmovups (%rdx), %xmm0
        addl    $1, %r9d
        addq    $64, %rdx
        addq    $64, %r11
        vmovups -32(%rdx), %xmm2
        vinsertf128     $0x1, -48(%rdx), %ymm0, %ymm1
        vmovups -64(%r11), %xmm9
        vinsertf128     $0x1, -16(%rdx), %ymm2, %ymm3
        vmovups -32(%r11), %xmm11
        vinsertf128     $0x1, -48(%r11), %ymm9, %ymm10
        vinsertf128     $0x1, -16(%r11), %ymm11, %ymm12
        vshufps $136, %ymm3, %ymm1, %ymm4
        vshufps $136, %ymm12, %ymm10, %ymm13
        vperm2f128      $3, %ymm4, %ymm4, %ymm5
        vperm2f128      $3, %ymm13, %ymm13, %ymm14
        vshufps $68, %ymm5, %ymm4, %ymm6
        vshufps $238, %ymm5, %ymm4, %ymm7
        vshufps $68, %ymm14, %ymm13, %ymm15
        vshufps $238, %ymm14, %ymm13, %ymm0
        vinsertf128     $1, %xmm7, %ymm6, %ymm8
        vinsertf128     $1, %xmm0, %ymm15, %ymm1
        vaddps  %ymm1, %ymm8, %ymm2
        vextractf128    $0x1, %ymm2, %xmm4
        vmovss  %xmm2, -60(%rdx)
        vextractps      $1, %xmm2, -52(%rdx)
        vextractps      $2, %xmm2, -44(%rdx)
        vextractps      $3, %xmm2, -36(%rdx)
        vmovss  %xmm4, -28(%rdx)
        vextractps      $1, %xmm4, -20(%rdx)
        vextractps      $2, %xmm4, -12(%rdx)
        vextractps      $3, %xmm4, -4(%rdx)
        cmpl    %r9d, %ecx
        ja      .L8

What we fail to realize here is that AVX256 shuffles such as vshufps work within each 128-bit lane, so interleaving across lanes needs the extra vperm2f128/vinsertf128 steps seen above, and thus the interleave for the loads is very much more expensive than we think.

That's a general vectorizer cost model issue:

      /* Uses an even and odd extract operations or shuffle operations
         for each needed permute.  */
      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
                                      stmt_info, 0, vect_body);

which 1) doesn't treat single-element interleaving differently and
2) simply charges the generic vec_perm cost, although the real cost
depends heavily on the actual (constant) permutation used.
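
To see how little the quoted formula charges, here is a small sketch that evaluates it on its own. The helper ceil_log2_sketch stands in for GCC's ceil_log2, the function name permute_stmts is made up, and the inputs ncopies = 1 and group_size = 2 are assumptions for the stride-2 access in s111, not values read from a vectorizer dump.

```c
/* Smallest e such that 2^e >= x; a stand-in for GCC's ceil_log2. */
static unsigned ceil_log2_sketch(unsigned x)
{
    unsigned e = 0;
    while ((1u << e) < x)
        ++e;
    return e;
}

/* Number of vec_perm statements charged by the quoted formula:
   nstmts = ncopies * ceil_log2 (group_size) * group_size. */
unsigned permute_stmts(unsigned ncopies, unsigned group_size)
{
    return ncopies * ceil_log2_sketch(group_size) * group_size;
}
```

Under those assumptions permute_stmts(1, 2) is 2, i.e. two vec_perm statements per load group. The .L8 listing, by comparison, contains ten register-to-register shuffles for the two input streams (six vshufps, two vperm2f128, two vinsertf128 with a register source). Statements and instructions do not map one to one, so this is only a rough indication of the gap.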

Comment 4 n8tm 2015-11-16 13:11:07 UTC

Thanks for the interesting analysis.
icc/icpc take safelen(1) as preventing vectorization for this case, but
I found another stride 2 case where they still perform the unprofitable
AVX vectorization.  Maybe I'll submit an Intel PR (IPS).

Comment 5 n8tm 2015-11-29 21:13:33 UTC

Sorry, will attach that source file.
