52473 – CSHIFT slow - inline it?

Bug 52473 - CSHIFT slow - inline it?

Summary: CSHIFT slow - inline it?

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	fortran (show other bugs)
Version:	4.7.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Duplicates (1):	52934 (view as bug list)
Depends on:
Blocks:	36854
	Show dependency tree / graph

Reported:	2012-03-03 20:00 UTC by Tobias Burnus
Modified:	2020-06-21 17:15 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2013-06-16 00:00:00

Attachments
Benchmark of straight DO loops vs. library version (739 bytes, text/plain) 2017-05-21 12:00 UTC, Thomas Koenig	Details
What an unrolled cshift could look like (570 bytes, text/plain) 2017-06-08 16:55 UTC, Thomas Koenig	Details
Patch for the library, not yet quite correct (1.58 KB, patch) 2017-06-12 22:45 UTC, Thomas Koenig	Details \| Diff
Test case which still fails (808 bytes, text/plain) 2017-06-12 22:47 UTC, Thomas Koenig	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Tobias Burnus 2012-03-03 20:00:24 UTC

For the program http://www.cita.utoronto.ca/~pen/MHD/mhd.f90 GCC produces relatively slow code:

$ gfortran -Ofast mhd.f90; time ./a.out > /dev/null
real    0m0.534s

$ pathf95 -O3 mhd.f90; time ./a.out > /dev/null
real    0m0.169s

Which is 3.2 times faster. A closer analyis shows that it is sufficient to compile the functions advectbyzx, tvd1 and tvdb with pathf95 to gain the same speed. If one only compiles advectbyzx and tvd1 (and the rest with gfortran) the speed is 0m0.301s.

The performance difference seems to be due to the calls to _gfortran_cshift0_4.

Expected: CSHIFT should be inlined (at least for a scalar SHIFT) - as pathf95 and ifort do.

 * * *

13.7.43 CSHIFT (ARRAY, SHIFT [, DIM])
Description. Circular shift of an array.Description. Circular shift of an array.

Comment 1 Tobias Burnus 2012-04-10 15:22:30 UTC

Cf. also http://gcc.gnu.org/ml/fortran/2012-04/msg00041.html

Comment 2 Tobias Burnus 2012-04-11 12:51:34 UTC

*** Bug 52934 has been marked as a duplicate of this bug. ***

Comment 3 Dominique d'Humieres 2013-06-16 12:06:19 UTC

I don't have access to pathf95, but the following test (modified for some IDRIS benchmarks)

  subroutine test_cshift()
    !-----------------------------------------------------------------
    ! Test de la fonction intrinsèque "cshift"
    !-----------------------------------------------------------------
    implicit none
    integer, parameter             :: n=ordre
    integer                        :: i,j
    integer, dimension(n)          :: vect=(/ (-i, i=1,n) /)
    real(kind=prec),dimension(n,n) :: t,td,t2,t1
    real(kind=prec)                :: cste

    call impression_entete("test_cshift")
    print *, "Order of matrix:", n

    !---- Initialisations --------------------------
    cste=exp(1._prec)
    t(:,:)  = reshape( &
              source=(/ (i*cste,i=1,n*n) /) , shape = (/ n,n /) )

    td(:,:) = 0._prec
    call cpu_time( val_temps(1) )
    td(:,:) = cshift(array=t(:,:), shift=vect(:), dim=1)
    call cpu_time( val_temps(2) )
    t1(:,:)  = td(:,:)

    call temps_consomme("test_cshift> 1) CSHIFT")

    !---- Décalage selon les lignes via des boucles --------------------
    td(:,:) = 0._prec
    call cpu_time( val_temps(1) )
    do j=1,n
       do i=n+vect(j),1,-1
          td(i-vect(j),j) = t(i,j)
       end do
       do i=1,-vect(j)
          td(i,j)=t(n+vect(j)+i,j)
       end do
    end do
    call cpu_time( val_temps(2) )
    t2(:,:) = td(:,:)

    call temps_consomme("test__cshift> 2) DO loop")

    if (sum(abs(t2-t1)) /= 0._prec) then
       print *,'Mauvais résultats !!!!',sum(abs(t2-t1))
    else
        print *,'Results OK'
    end if    

  end subroutine test_cshift

gives

====================
 Call to test_cshift
====================

 Order of matrix:        1000
 test_cshift> 1) CSHIFT
                                       Used CPU time ==>         7.624 ms
 test__cshift> 2) DO loop
                                       Used CPU time ==>         4.482 ms
 Results OK

So CSHIFT is 50% slower than the DO loop.

Note that I get similar results for EOSHIFT and RESHAPE:

=====================
 Call to test_eoshift
=====================

 Order of matrix:        1000
 test_eoshift> 1) EOSHIFT
                                       Used CPU time ==>         7.443 ms
 test__eoshift> 2) DO loop
                                       Used CPU time ==>         3.775 ms
 Results OK

=====================
 Call to test_reshape
=====================

 Order of matrix:        1000
 test__reshape> 1) RESHAPE
                                       Used CPU time ==>        10.624 ms
 test__reshape> 2) DO loop
                                       Used CPU time ==>         4.442 ms
 Results OK

PR45689 should probably fixed at the same time.

Comment 4 Thomas Koenig 2016-12-26 12:25:37 UTC

Looking at the code, inlining cshift for a constant shift
could already be a good idea.

So, change

b = cshift(b,1) + cshift(b,-1)

to

a = b

b(1) = a(2) + a(n)
b(2:n-1) = a(1:n-2) + a(3:n)
b(n) = a(1) + a(n-1)

Comment 5 Thomas Koenig 2017-05-21 12:00:23 UTC

Created attachment 41394 [details]
Benchmark of straight DO loops vs. library version

Some numbers from

https://groups.google.com/forum/#!topic/comp.lang.fortran/AI0F1Vpkc3I

show that using straight DO loops could both help and hurt:

$ ./a.out
 Testing explicit DO loops
 Dim =            1  Elapsed CPU time =    2.82861114    
 Dim =            2  Elapsed CPU time =    2.93245506    
 Dim =            3  Elapsed CPU time =    2.94523525    
 Testing built-in cshift
 Dim =            1  Elapsed CPU time =    1.65619278    
 Dim =            2  Elapsed CPU time =    2.80988693    
 Dim =            3  Elapsed CPU time =    7.13671684 

I'll see what could be done using an explicit call to memcpy
for the innermost loops.

Comment 6 Thomas Koenig 2017-06-08 16:55:31 UTC

Created attachment 41508 [details]
What an unrolled cshift could look like

This is what an unrolled version of cshift could look like,
for a simple one-dimensional case.

If the shifts are constants, all the if statements are
optimized away at compile-time, so it is quite efficient.

Comment 7 Jerry DeLisle 2017-06-09 03:40:50 UTC

(In reply to Thomas Koenig from comment #6)
> Created attachment 41508 [details]
> What an unrolled cshift could look like
> 
> This is what an unrolled version of cshift could look like,
> for a simple one-dimensional case.
> 
> If the shifts are constants, all the if statements are
> optimized away at compile-time, so it is quite efficient.

How often or how important is this feature? Would it improve any known benchmark or real world application in a significant manner?

Comment 8 Thomas Koenig 2017-06-10 11:12:47 UTC

(In reply to Jerry DeLisle from comment #7)
> (In reply to Thomas Koenig from comment #6)
> > Created attachment 41508 [details]
> > What an unrolled cshift could look like
> > 
> > This is what an unrolled version of cshift could look like,
> > for a simple one-dimensional case.
> > 
> > If the shifts are constants, all the if statements are
> > optimized away at compile-time, so it is quite efficient.
> 
> How often or how important is this feature? Would it improve any known
> benchmark or real world application in a significant manner?

There's protein.f90 - see PR 52934.

Calculating a numerical derivative for a periodic function as

diff = (cshift(a,1) - cshift(a,-1))/(2*h)

seems natural, or the second derivative as

d2 = (cshift(a,1) - 2*a + cshift(a,-1))/(h**2)

Of course, to be really useful, this would also have to be
extended to eoshift...

Comment 9 Dominique d'Humieres 2017-06-10 12:07:54 UTC

> How often or how important is this feature? Would it improve any known
> benchmark or real world application in a significant manner?

Egg and chicken problem! If CSHIFT is slow, nobody will use it. If nobody use it, nobody will be motivated to make it fast.

Comment 10 Thomas Koenig 2017-06-11 08:10:58 UTC

Also see the discussion at

https://groups.google.com/forum/#!topic/comp.lang.fortran/AI0F1Vpkc3I

There is one thing that I do not understand.  For the following
test code, which compares straightforward DO loops for shifting
a three-dimensional array by 1, the library version is actually
faster than straightforward loops for dim=1.

module kinds
  integer, parameter :: sp = selected_real_kind(6) ! Single precision
  integer, parameter :: dp = selected_real_kind(15) ! Double precision
end module kinds

module replacements
  use kinds
contains
  subroutine cshift_sp_3_v1 (array, shift, dim, res)
    integer, parameter :: wp = sp
    real(kind=wp), dimension(:,:,:), intent(in), contiguous :: array
    integer, intent(in) :: shift, dim
    real(kind=wp), dimension(:,:,:), intent(out), contiguous :: res
    integer :: i,j,k
    integer :: sh, rsh
    integer :: n
    res = 0
    if (dim == 1) then
       n = size(array,1)
       sh = modulo(shift, n)
       rsh = n - sh
       do k=1, size(array,3)
          do j=1, size(array,2)
             do i=1, rsh
                res(i,j,k) = array(i+sh,j,k)
             end do
             do i=rsh+1,n
                res(i,j,k) = array(i-rsh,j,k)
             end do
          end do
       end do
    else if (dim == 2) then
       n = size(array,2)
       sh = modulo(shift,n)
       rsh = n - sh
       do k=1, size(array,3)
          do j=1, rsh
             do i=1, size(array,1)
                res(i,j,k) = array(i,j+sh, k)
             end do
          end do
          do j=rsh+1, n
             do i=1, size(array,1)
                res(i,j,k) = array(i,j-rsh, k)
             end do
          end do
       end do
    else if (dim == 3) then
       n = size(array,3)
       sh = modulo(shift, n)
       rsh = n - sh
       do k=1, rsh
          do j=1, size(array,2)
             do i=1, size(array,1)
                res(i,j,k) = array(i, j, k+sh)
             end do
          end do
       end do
       do k=rsh+1, n
          do j=1, size(array,2)
             do i=1, size(array,1)
                res(i,j, k) = array(i, j, k-rsh)
             end do
          end do          
       end do
    else
       stop "Wrong argument to dim"
    end if
  end subroutine cshift_sp_3_v1
end module replacements

program testme
  use kinds
  use replacements
  implicit none
  integer, parameter :: wp = sp  ! Working precision
  INTEGER, PARAMETER :: n = 200
  real(kind=wp) :: a(n,n,n), b(n,n,n)
  integer i, j, k
  real t1, t2

  print *,"Testing explicit DO loops"
  call random_number(a)
  do k = 1,3
     call cpu_time ( t1 )
     do j = 1, 100
        call cshift_sp_3_v1 (a, 1, k, b)
     end do
     call cpu_time ( t2 )
     write ( *, * ) 'Dim = ', k, ' Elapsed CPU time = ', t2-t1
  end do

  print *,"Testing built-in cshift"
  do k = 1,3
     call cpu_time ( t1 )
     do j = 1, 100
        b = cshift(a,1,k)
     end do
     call cpu_time ( t2 )
     write ( *, * ) 'Dim = ', k, ' Elapsed CPU time = ', t2-t1
  end do

end program testme

Comment 11 Thomas Koenig 2017-06-12 22:45:19 UTC

Created attachment 41541 [details]
Patch for the library, not yet quite correct

This is a patch which brings a dramatic speedup for
any cshift with dim>1, but it also causes a regression for
a test case which I will attach next.

Comment 12 Thomas Koenig 2017-06-12 22:47:53 UTC

Created attachment 41542 [details]
Test case which still fails

Comment 13 Thomas Koenig 2017-06-18 18:04:51 UTC

Author: tkoenig
Date: Sun Jun 18 18:04:19 2017
New Revision: 249350

URL: https://gcc.gnu.org/viewcvs?rev=249350&root=gcc&view=rev
Log:
2017-06-18  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/52473
	* m4/cshift0.m4:  For arrays that are contiguous up to
	shift, implement blocked algorighm for cshift.
	* generated/cshift0_c10.c:  Regenerated.
	* generated/cshift0_c16.c:  Regenerated.
	* generated/cshift0_c4.c:  Regenerated.
	* generated/cshift0_c8.c:  Regenerated.
	* generated/cshift0_i1.c:  Regenerated.
	* generated/cshift0_i16.c:  Regenerated.
	* generated/cshift0_i2.c:  Regenerated.
	* generated/cshift0_i4.c:  Regenerated.
	* generated/cshift0_i8.c:  Regenerated.
	* generated/cshift0_r10.c:  Regenerated.
	* generated/cshift0_r16.c:  Regenerated.
	* generated/cshift0_r4.c:  Regenerated.
	* generated/cshift0_r8.c:  Regenerated.

2017-06-18  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/52473
	* gfortran.dg/cshift_1.f90:  New test.


Added:
    trunk/gcc/testsuite/gfortran.dg/cshift_1.f90
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/generated/cshift0_c10.c
    trunk/libgfortran/generated/cshift0_c16.c
    trunk/libgfortran/generated/cshift0_c4.c
    trunk/libgfortran/generated/cshift0_c8.c
    trunk/libgfortran/generated/cshift0_i1.c
    trunk/libgfortran/generated/cshift0_i16.c
    trunk/libgfortran/generated/cshift0_i2.c
    trunk/libgfortran/generated/cshift0_i4.c
    trunk/libgfortran/generated/cshift0_i8.c
    trunk/libgfortran/generated/cshift0_r10.c
    trunk/libgfortran/generated/cshift0_r16.c
    trunk/libgfortran/generated/cshift0_r4.c
    trunk/libgfortran/generated/cshift0_r8.c
    trunk/libgfortran/m4/cshift0.m4

Comment 14 Thomas Koenig 2017-06-24 07:08:29 UTC

Author: tkoenig
Date: Sat Jun 24 07:07:56 2017
New Revision: 249620

URL: https://gcc.gnu.org/viewcvs?rev=249620&root=gcc&view=rev
Log:
2017-06-24  Thomas Koenig  <tkoenig@gcc.gnu.org>

        PR fortran/52473
	* Makefile.am:  Add i_cshift1a_c.  Add rules to generate files
	from cshift1a.m4.
	* Makefile.in: Regenerated.
	* m4/cshift1a.m4: New file.
	* m4/cshift.m4 (cshift1): Split up inner loop by removing
	condition. Use memcpy where possible.  Call helper functions
	based on dtype.
	* libgfortran.h: Add prototypes for cshift1_16_c10,
	cshift1_16_c16, cshift1_16_c4, cshift1_16_c8, cshift1_16_i1,
	cshift1_16_i16, cshift1_16_i2, cshift1_16_i4, cshift1_16_i8,
	cshift1_16_r10, cshift1_16_r16, cshift1_16_r4, cshift1_16_r8,
	cshift1_4_c10, cshift1_4_c16, cshift1_4_c4, cshift1_4_c8,
	cshift1_4_i1, cshift1_4_i16, cshift1_4_i2, cshift1_4_i4,
	cshift1_4_i8, cshift1_4_r10, cshift1_4_r16, cshift1_4_r4,
	cshift1_4_r8, cshift1_8_c10, cshift1_8_c16, cshift1_8_c4,
	cshift1_8_c8, cshift1_8_i1, cshift1_8_i16, cshift1_8_i2,
	cshift1_8_i4, cshift1_8_i8, cshift1_8_r10, cshift1_8_r16,
	cshift1_8_r4 and cshift1_8_r8.
	* generated/cshift1_16_c10.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_c16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_c4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_c8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_i1.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_i16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_i2.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_i4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_i8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_r10.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_r16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_r4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_16_r8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_c10.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_c16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_c4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_c8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_i1.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_i16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_i2.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_i4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_i8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_r10.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_r16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_r4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_4_r8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_c10.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_c16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_c4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_c8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_i1.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_i16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_i2.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_i4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_i8.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_r10.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_r16.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_r4.c: New file, generated from cshift1a.m4.
	* generated/cshift1_8_r8.c: New file, generated from cshift1a.m4.

2017-06-24  Thomas Koenig  <tkoenig@gcc.gnu.org>

        PR fortran/52473
	* gfortran.dg/cshift_2.f90:  New test.


Added:
    trunk/gcc/testsuite/gfortran.dg/cshift_2.f90
    trunk/libgfortran/generated/cshift1_16_c10.c
    trunk/libgfortran/generated/cshift1_16_c16.c
    trunk/libgfortran/generated/cshift1_16_c4.c
    trunk/libgfortran/generated/cshift1_16_c8.c
    trunk/libgfortran/generated/cshift1_16_i1.c
    trunk/libgfortran/generated/cshift1_16_i16.c
    trunk/libgfortran/generated/cshift1_16_i2.c
    trunk/libgfortran/generated/cshift1_16_i4.c
    trunk/libgfortran/generated/cshift1_16_i8.c
    trunk/libgfortran/generated/cshift1_16_r10.c
    trunk/libgfortran/generated/cshift1_16_r16.c
    trunk/libgfortran/generated/cshift1_16_r4.c
    trunk/libgfortran/generated/cshift1_16_r8.c
    trunk/libgfortran/generated/cshift1_4_c10.c
    trunk/libgfortran/generated/cshift1_4_c16.c
    trunk/libgfortran/generated/cshift1_4_c4.c
    trunk/libgfortran/generated/cshift1_4_c8.c
    trunk/libgfortran/generated/cshift1_4_i1.c
    trunk/libgfortran/generated/cshift1_4_i16.c
    trunk/libgfortran/generated/cshift1_4_i2.c
    trunk/libgfortran/generated/cshift1_4_i4.c
    trunk/libgfortran/generated/cshift1_4_i8.c
    trunk/libgfortran/generated/cshift1_4_r10.c
    trunk/libgfortran/generated/cshift1_4_r16.c
    trunk/libgfortran/generated/cshift1_4_r4.c
    trunk/libgfortran/generated/cshift1_4_r8.c
    trunk/libgfortran/generated/cshift1_8_c10.c
    trunk/libgfortran/generated/cshift1_8_c16.c
    trunk/libgfortran/generated/cshift1_8_c4.c
    trunk/libgfortran/generated/cshift1_8_c8.c
    trunk/libgfortran/generated/cshift1_8_i1.c
    trunk/libgfortran/generated/cshift1_8_i16.c
    trunk/libgfortran/generated/cshift1_8_i2.c
    trunk/libgfortran/generated/cshift1_8_i4.c
    trunk/libgfortran/generated/cshift1_8_i8.c
    trunk/libgfortran/generated/cshift1_8_r10.c
    trunk/libgfortran/generated/cshift1_8_r16.c
    trunk/libgfortran/generated/cshift1_8_r4.c
    trunk/libgfortran/generated/cshift1_8_r8.c
    trunk/libgfortran/m4/cshift1a.m4
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
    trunk/libgfortran/generated/cshift1_16.c
    trunk/libgfortran/generated/cshift1_4.c
    trunk/libgfortran/generated/cshift1_8.c
    trunk/libgfortran/libgfortran.h
    trunk/libgfortran/m4/cshift1.m4

Comment 15 Thomas Koenig 2017-06-25 21:23:03 UTC

What was done so far was an improvement for cshifts of large
arrays.

The code in the original program has quite small arrays, so
inlining would actually make sense there.

I don't think I will start this any time soon; unassigning.

Comment 16 Dominique d'Humieres 2018-03-28 18:26:31 UTC

For the original test compiled with -Ofast, I get

 Testing explicit DO loops
 Dim =            1  Elapsed CPU time =   0.846953988    
 Dim =            2  Elapsed CPU time =   0.724469006    
 Dim =            3  Elapsed CPU time =   0.680019855    
 Testing built-in cshift
 Dim =            1  Elapsed CPU time =   0.506424904    
 Dim =            2  Elapsed CPU time =   0.391590118    
 Dim =            3  Elapsed CPU time =   0.352367163    

An for the IDRIS test:

=====================
 Call to test_eoshift
=====================

 Order of matrix:        1000
 test_eoshift> 1) EOSHIFT
                                       Used CPU time ==>         2.976 ms
 test__eoshift> 2) DO loop
                                       Used CPU time ==>         0.832 ms
 Results OK

====================
 Call to test_cshift
====================

 Order of matrix:        1000
 test_cshift> 1) CSHIFT
                                       Used CPU time ==>         1.377 ms
 test__cshift> 2) DO loop
                                       Used CPU time ==>         1.051 ms
 Results OK

=====================
 Call to test_reshape
=====================

 Order of matrix:        1000
 test__reshape> 1) RESHAPE
                                       Used CPU time ==>         4.909 ms
 test__reshape> 2) DO loop
                                       Used CPU time ==>         0.877 ms
 Results OK

May be this PR could be closed.

Comment 17 Thomas Koenig 2018-03-30 14:01:14 UTC

Large-scale cshift is OK now.

The original test case, or inlining something like

b = cshift(a,-1) - 2*a + cshift(a,1)

to efficiently calculate a second derivative on a circular array would be
quite nice.

Comment 18 Jürgen Reuter 2018-08-07 04:17:40 UTC

The example by posted on May 20, 2017 on c.l.f. improved by Stefano Zaghi below shows a factor of 10-20 improvement now in gfortran 9.0.0 including the work by Thomas Koenig.
$ ./a.out
 Elapsed CPU time =   0.33764499999999997     
 Elapsed CPU time =   0.29224100000000003     
 Elapsed CPU time =   0.26565400000000006

Here is the code:

program testme
 
  use, intrinsic :: iso_fortran_env
  
  implicit none
  
  integer(int32), parameter :: n = 200
  real(real32) :: a(n,n,n), b(n,n,n)
  integer(int32) :: j, k
  real(real64) :: t1, t2
  
  call random_number(a)
  
  do k = 1, 3
     call cpu_time ( t1 )
     do j = 1, 100
        b = cshift(a, shift=1, DIM=k)
     end do
     call cpu_time ( t2 )
     write ( *, * ) 'Elapsed CPU time = ', t2-t1
  end do
end program testme

Comment 19 Martin Liška 2018-11-19 12:00:02 UTC

Can the bug be marked as resolved?

Comment 20 Thomas Koenig 2018-11-19 21:15:23 UTC

The original test case is still slow, so I guess we should
keep this open.

Comment 21 Dominique d'Humieres 2018-11-19 22:23:26 UTC

> The original test case is still slow, so I guess we should
> keep this open.

Which test?