For the program http://www.cita.utoronto.ca/~pen/MHD/mhd.f90 GCC produces relatively slow code: $ gfortran -Ofast mhd.f90; time ./a.out > /dev/null real 0m0.534s $ pathf95 -O3 mhd.f90; time ./a.out > /dev/null real 0m0.169s Which is 3.2 times faster. A closer analyis shows that it is sufficient to compile the functions advectbyzx, tvd1 and tvdb with pathf95 to gain the same speed. If one only compiles advectbyzx and tvd1 (and the rest with gfortran) the speed is 0m0.301s. The performance difference seems to be due to the calls to _gfortran_cshift0_4. Expected: CSHIFT should be inlined (at least for a scalar SHIFT) - as pathf95 and ifort do. * * * 13.7.43 CSHIFT (ARRAY, SHIFT [, DIM]) Description. Circular shift of an array.Description. Circular shift of an array.
Cf. also http://gcc.gnu.org/ml/fortran/2012-04/msg00041.html
*** Bug 52934 has been marked as a duplicate of this bug. ***
I don't have access to pathf95, but the following test (modified for some IDRIS benchmarks) subroutine test_cshift() !----------------------------------------------------------------- ! Test de la fonction intrinsèque "cshift" !----------------------------------------------------------------- implicit none integer, parameter :: n=ordre integer :: i,j integer, dimension(n) :: vect=(/ (-i, i=1,n) /) real(kind=prec),dimension(n,n) :: t,td,t2,t1 real(kind=prec) :: cste call impression_entete("test_cshift") print *, "Order of matrix:", n !---- Initialisations -------------------------- cste=exp(1._prec) t(:,:) = reshape( & source=(/ (i*cste,i=1,n*n) /) , shape = (/ n,n /) ) td(:,:) = 0._prec call cpu_time( val_temps(1) ) td(:,:) = cshift(array=t(:,:), shift=vect(:), dim=1) call cpu_time( val_temps(2) ) t1(:,:) = td(:,:) call temps_consomme("test_cshift> 1) CSHIFT") !---- Décalage selon les lignes via des boucles -------------------- td(:,:) = 0._prec call cpu_time( val_temps(1) ) do j=1,n do i=n+vect(j),1,-1 td(i-vect(j),j) = t(i,j) end do do i=1,-vect(j) td(i,j)=t(n+vect(j)+i,j) end do end do call cpu_time( val_temps(2) ) t2(:,:) = td(:,:) call temps_consomme("test__cshift> 2) DO loop") if (sum(abs(t2-t1)) /= 0._prec) then print *,'Mauvais résultats !!!!',sum(abs(t2-t1)) else print *,'Results OK' end if end subroutine test_cshift gives ==================== Call to test_cshift ==================== Order of matrix: 1000 test_cshift> 1) CSHIFT Used CPU time ==> 7.624 ms test__cshift> 2) DO loop Used CPU time ==> 4.482 ms Results OK So CSHIFT is 50% slower than the DO loop. Note that I get similar results for EOSHIFT and RESHAPE: ===================== Call to test_eoshift ===================== Order of matrix: 1000 test_eoshift> 1) EOSHIFT Used CPU time ==> 7.443 ms test__eoshift> 2) DO loop Used CPU time ==> 3.775 ms Results OK ===================== Call to test_reshape ===================== Order of matrix: 1000 test__reshape> 1) RESHAPE Used CPU time ==> 10.624 ms test__reshape> 2) DO loop Used CPU time ==> 4.442 ms Results OK PR45689 should probably fixed at the same time.
Looking at the code, inlining cshift for a constant shift could already be a good idea. So, change b = cshift(b,1) + cshift(b,-1) to a = b b(1) = a(2) + a(n) b(2:n-1) = a(1:n-2) + a(3:n) b(n) = a(1) + a(n-1)
Created attachment 41394 [details] Benchmark of straight DO loops vs. library version Some numbers from https://groups.google.com/forum/#!topic/comp.lang.fortran/AI0F1Vpkc3I show that using straight DO loops could both help and hurt: $ ./a.out Testing explicit DO loops Dim = 1 Elapsed CPU time = 2.82861114 Dim = 2 Elapsed CPU time = 2.93245506 Dim = 3 Elapsed CPU time = 2.94523525 Testing built-in cshift Dim = 1 Elapsed CPU time = 1.65619278 Dim = 2 Elapsed CPU time = 2.80988693 Dim = 3 Elapsed CPU time = 7.13671684 I'll see what could be done using an explicit call to memcpy for the innermost loops.
Created attachment 41508 [details] What an unrolled cshift could look like This is what an unrolled version of cshift could look like, for a simple one-dimensional case. If the shifts are constants, all the if statements are optimized away at compile-time, so it is quite efficient.
(In reply to Thomas Koenig from comment #6) > Created attachment 41508 [details] > What an unrolled cshift could look like > > This is what an unrolled version of cshift could look like, > for a simple one-dimensional case. > > If the shifts are constants, all the if statements are > optimized away at compile-time, so it is quite efficient. How often or how important is this feature? Would it improve any known benchmark or real world application in a significant manner?
(In reply to Jerry DeLisle from comment #7) > (In reply to Thomas Koenig from comment #6) > > Created attachment 41508 [details] > > What an unrolled cshift could look like > > > > This is what an unrolled version of cshift could look like, > > for a simple one-dimensional case. > > > > If the shifts are constants, all the if statements are > > optimized away at compile-time, so it is quite efficient. > > How often or how important is this feature? Would it improve any known > benchmark or real world application in a significant manner? There's protein.f90 - see PR 52934. Calculating a numerical derivative for a periodic function as diff = (cshift(a,1) - cshift(a,-1))/(2*h) seems natural, or the second derivative as d2 = (cshift(a,1) - 2*a + cshift(a,-1))/(h**2) Of course, to be really useful, this would also have to be extended to eoshift...
> How often or how important is this feature? Would it improve any known > benchmark or real world application in a significant manner? Egg and chicken problem! If CSHIFT is slow, nobody will use it. If nobody use it, nobody will be motivated to make it fast.
Also see the discussion at https://groups.google.com/forum/#!topic/comp.lang.fortran/AI0F1Vpkc3I There is one thing that I do not understand. For the following test code, which compares straightforward DO loops for shifting a three-dimensional array by 1, the library version is actually faster than straightforward loops for dim=1. module kinds integer, parameter :: sp = selected_real_kind(6) ! Single precision integer, parameter :: dp = selected_real_kind(15) ! Double precision end module kinds module replacements use kinds contains subroutine cshift_sp_3_v1 (array, shift, dim, res) integer, parameter :: wp = sp real(kind=wp), dimension(:,:,:), intent(in), contiguous :: array integer, intent(in) :: shift, dim real(kind=wp), dimension(:,:,:), intent(out), contiguous :: res integer :: i,j,k integer :: sh, rsh integer :: n res = 0 if (dim == 1) then n = size(array,1) sh = modulo(shift, n) rsh = n - sh do k=1, size(array,3) do j=1, size(array,2) do i=1, rsh res(i,j,k) = array(i+sh,j,k) end do do i=rsh+1,n res(i,j,k) = array(i-rsh,j,k) end do end do end do else if (dim == 2) then n = size(array,2) sh = modulo(shift,n) rsh = n - sh do k=1, size(array,3) do j=1, rsh do i=1, size(array,1) res(i,j,k) = array(i,j+sh, k) end do end do do j=rsh+1, n do i=1, size(array,1) res(i,j,k) = array(i,j-rsh, k) end do end do end do else if (dim == 3) then n = size(array,3) sh = modulo(shift, n) rsh = n - sh do k=1, rsh do j=1, size(array,2) do i=1, size(array,1) res(i,j,k) = array(i, j, k+sh) end do end do end do do k=rsh+1, n do j=1, size(array,2) do i=1, size(array,1) res(i,j, k) = array(i, j, k-rsh) end do end do end do else stop "Wrong argument to dim" end if end subroutine cshift_sp_3_v1 end module replacements program testme use kinds use replacements implicit none integer, parameter :: wp = sp ! Working precision INTEGER, PARAMETER :: n = 200 real(kind=wp) :: a(n,n,n), b(n,n,n) integer i, j, k real t1, t2 print *,"Testing explicit DO loops" call random_number(a) do k = 1,3 call cpu_time ( t1 ) do j = 1, 100 call cshift_sp_3_v1 (a, 1, k, b) end do call cpu_time ( t2 ) write ( *, * ) 'Dim = ', k, ' Elapsed CPU time = ', t2-t1 end do print *,"Testing built-in cshift" do k = 1,3 call cpu_time ( t1 ) do j = 1, 100 b = cshift(a,1,k) end do call cpu_time ( t2 ) write ( *, * ) 'Dim = ', k, ' Elapsed CPU time = ', t2-t1 end do end program testme
Created attachment 41541 [details] Patch for the library, not yet quite correct This is a patch which brings a dramatic speedup for any cshift with dim>1, but it also causes a regression for a test case which I will attach next.
Created attachment 41542 [details] Test case which still fails
Author: tkoenig Date: Sun Jun 18 18:04:19 2017 New Revision: 249350 URL: https://gcc.gnu.org/viewcvs?rev=249350&root=gcc&view=rev Log: 2017-06-18 Thomas Koenig <tkoenig@gcc.gnu.org> PR fortran/52473 * m4/cshift0.m4: For arrays that are contiguous up to shift, implement blocked algorighm for cshift. * generated/cshift0_c10.c: Regenerated. * generated/cshift0_c16.c: Regenerated. * generated/cshift0_c4.c: Regenerated. * generated/cshift0_c8.c: Regenerated. * generated/cshift0_i1.c: Regenerated. * generated/cshift0_i16.c: Regenerated. * generated/cshift0_i2.c: Regenerated. * generated/cshift0_i4.c: Regenerated. * generated/cshift0_i8.c: Regenerated. * generated/cshift0_r10.c: Regenerated. * generated/cshift0_r16.c: Regenerated. * generated/cshift0_r4.c: Regenerated. * generated/cshift0_r8.c: Regenerated. 2017-06-18 Thomas Koenig <tkoenig@gcc.gnu.org> PR fortran/52473 * gfortran.dg/cshift_1.f90: New test. Added: trunk/gcc/testsuite/gfortran.dg/cshift_1.f90 Modified: trunk/gcc/testsuite/ChangeLog trunk/libgfortran/ChangeLog trunk/libgfortran/generated/cshift0_c10.c trunk/libgfortran/generated/cshift0_c16.c trunk/libgfortran/generated/cshift0_c4.c trunk/libgfortran/generated/cshift0_c8.c trunk/libgfortran/generated/cshift0_i1.c trunk/libgfortran/generated/cshift0_i16.c trunk/libgfortran/generated/cshift0_i2.c trunk/libgfortran/generated/cshift0_i4.c trunk/libgfortran/generated/cshift0_i8.c trunk/libgfortran/generated/cshift0_r10.c trunk/libgfortran/generated/cshift0_r16.c trunk/libgfortran/generated/cshift0_r4.c trunk/libgfortran/generated/cshift0_r8.c trunk/libgfortran/m4/cshift0.m4
Author: tkoenig Date: Sat Jun 24 07:07:56 2017 New Revision: 249620 URL: https://gcc.gnu.org/viewcvs?rev=249620&root=gcc&view=rev Log: 2017-06-24 Thomas Koenig <tkoenig@gcc.gnu.org> PR fortran/52473 * Makefile.am: Add i_cshift1a_c. Add rules to generate files from cshift1a.m4. * Makefile.in: Regenerated. * m4/cshift1a.m4: New file. * m4/cshift.m4 (cshift1): Split up inner loop by removing condition. Use memcpy where possible. Call helper functions based on dtype. * libgfortran.h: Add prototypes for cshift1_16_c10, cshift1_16_c16, cshift1_16_c4, cshift1_16_c8, cshift1_16_i1, cshift1_16_i16, cshift1_16_i2, cshift1_16_i4, cshift1_16_i8, cshift1_16_r10, cshift1_16_r16, cshift1_16_r4, cshift1_16_r8, cshift1_4_c10, cshift1_4_c16, cshift1_4_c4, cshift1_4_c8, cshift1_4_i1, cshift1_4_i16, cshift1_4_i2, cshift1_4_i4, cshift1_4_i8, cshift1_4_r10, cshift1_4_r16, cshift1_4_r4, cshift1_4_r8, cshift1_8_c10, cshift1_8_c16, cshift1_8_c4, cshift1_8_c8, cshift1_8_i1, cshift1_8_i16, cshift1_8_i2, cshift1_8_i4, cshift1_8_i8, cshift1_8_r10, cshift1_8_r16, cshift1_8_r4 and cshift1_8_r8. * generated/cshift1_16_c10.c: New file, generated from cshift1a.m4. * generated/cshift1_16_c16.c: New file, generated from cshift1a.m4. * generated/cshift1_16_c4.c: New file, generated from cshift1a.m4. * generated/cshift1_16_c8.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i1.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i16.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i2.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i4.c: New file, generated from cshift1a.m4. * generated/cshift1_16_i8.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r10.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r16.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r4.c: New file, generated from cshift1a.m4. * generated/cshift1_16_r8.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c10.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c16.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c4.c: New file, generated from cshift1a.m4. * generated/cshift1_4_c8.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i1.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i16.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i2.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i4.c: New file, generated from cshift1a.m4. * generated/cshift1_4_i8.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r10.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r16.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r4.c: New file, generated from cshift1a.m4. * generated/cshift1_4_r8.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c10.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c16.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c4.c: New file, generated from cshift1a.m4. * generated/cshift1_8_c8.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i1.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i16.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i2.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i4.c: New file, generated from cshift1a.m4. * generated/cshift1_8_i8.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r10.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r16.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r4.c: New file, generated from cshift1a.m4. * generated/cshift1_8_r8.c: New file, generated from cshift1a.m4. 2017-06-24 Thomas Koenig <tkoenig@gcc.gnu.org> PR fortran/52473 * gfortran.dg/cshift_2.f90: New test. Added: trunk/gcc/testsuite/gfortran.dg/cshift_2.f90 trunk/libgfortran/generated/cshift1_16_c10.c trunk/libgfortran/generated/cshift1_16_c16.c trunk/libgfortran/generated/cshift1_16_c4.c trunk/libgfortran/generated/cshift1_16_c8.c trunk/libgfortran/generated/cshift1_16_i1.c trunk/libgfortran/generated/cshift1_16_i16.c trunk/libgfortran/generated/cshift1_16_i2.c trunk/libgfortran/generated/cshift1_16_i4.c trunk/libgfortran/generated/cshift1_16_i8.c trunk/libgfortran/generated/cshift1_16_r10.c trunk/libgfortran/generated/cshift1_16_r16.c trunk/libgfortran/generated/cshift1_16_r4.c trunk/libgfortran/generated/cshift1_16_r8.c trunk/libgfortran/generated/cshift1_4_c10.c trunk/libgfortran/generated/cshift1_4_c16.c trunk/libgfortran/generated/cshift1_4_c4.c trunk/libgfortran/generated/cshift1_4_c8.c trunk/libgfortran/generated/cshift1_4_i1.c trunk/libgfortran/generated/cshift1_4_i16.c trunk/libgfortran/generated/cshift1_4_i2.c trunk/libgfortran/generated/cshift1_4_i4.c trunk/libgfortran/generated/cshift1_4_i8.c trunk/libgfortran/generated/cshift1_4_r10.c trunk/libgfortran/generated/cshift1_4_r16.c trunk/libgfortran/generated/cshift1_4_r4.c trunk/libgfortran/generated/cshift1_4_r8.c trunk/libgfortran/generated/cshift1_8_c10.c trunk/libgfortran/generated/cshift1_8_c16.c trunk/libgfortran/generated/cshift1_8_c4.c trunk/libgfortran/generated/cshift1_8_c8.c trunk/libgfortran/generated/cshift1_8_i1.c trunk/libgfortran/generated/cshift1_8_i16.c trunk/libgfortran/generated/cshift1_8_i2.c trunk/libgfortran/generated/cshift1_8_i4.c trunk/libgfortran/generated/cshift1_8_i8.c trunk/libgfortran/generated/cshift1_8_r10.c trunk/libgfortran/generated/cshift1_8_r16.c trunk/libgfortran/generated/cshift1_8_r4.c trunk/libgfortran/generated/cshift1_8_r8.c trunk/libgfortran/m4/cshift1a.m4 Modified: trunk/gcc/testsuite/ChangeLog trunk/libgfortran/ChangeLog trunk/libgfortran/Makefile.am trunk/libgfortran/Makefile.in trunk/libgfortran/generated/cshift1_16.c trunk/libgfortran/generated/cshift1_4.c trunk/libgfortran/generated/cshift1_8.c trunk/libgfortran/libgfortran.h trunk/libgfortran/m4/cshift1.m4
What was done so far was an improvement for cshifts of large arrays. The code in the original program has quite small arrays, so inlining would actually make sense there. I don't think I will start this any time soon; unassigning.
For the original test compiled with -Ofast, I get Testing explicit DO loops Dim = 1 Elapsed CPU time = 0.846953988 Dim = 2 Elapsed CPU time = 0.724469006 Dim = 3 Elapsed CPU time = 0.680019855 Testing built-in cshift Dim = 1 Elapsed CPU time = 0.506424904 Dim = 2 Elapsed CPU time = 0.391590118 Dim = 3 Elapsed CPU time = 0.352367163 An for the IDRIS test: ===================== Call to test_eoshift ===================== Order of matrix: 1000 test_eoshift> 1) EOSHIFT Used CPU time ==> 2.976 ms test__eoshift> 2) DO loop Used CPU time ==> 0.832 ms Results OK ==================== Call to test_cshift ==================== Order of matrix: 1000 test_cshift> 1) CSHIFT Used CPU time ==> 1.377 ms test__cshift> 2) DO loop Used CPU time ==> 1.051 ms Results OK ===================== Call to test_reshape ===================== Order of matrix: 1000 test__reshape> 1) RESHAPE Used CPU time ==> 4.909 ms test__reshape> 2) DO loop Used CPU time ==> 0.877 ms Results OK May be this PR could be closed.
Large-scale cshift is OK now. The original test case, or inlining something like b = cshift(a,-1) - 2*a + cshift(a,1) to efficiently calculate a second derivative on a circular array would be quite nice.
The example by posted on May 20, 2017 on c.l.f. improved by Stefano Zaghi below shows a factor of 10-20 improvement now in gfortran 9.0.0 including the work by Thomas Koenig. $ ./a.out Elapsed CPU time = 0.33764499999999997 Elapsed CPU time = 0.29224100000000003 Elapsed CPU time = 0.26565400000000006 Here is the code: program testme use, intrinsic :: iso_fortran_env implicit none integer(int32), parameter :: n = 200 real(real32) :: a(n,n,n), b(n,n,n) integer(int32) :: j, k real(real64) :: t1, t2 call random_number(a) do k = 1, 3 call cpu_time ( t1 ) do j = 1, 100 b = cshift(a, shift=1, DIM=k) end do call cpu_time ( t2 ) write ( *, * ) 'Elapsed CPU time = ', t2-t1 end do end program testme
Can the bug be marked as resolved?
The original test case is still slow, so I guess we should keep this open.
> The original test case is still slow, so I guess we should > keep this open. Which test?