[PATCH][RFC] Add versioning for constant strides for vectorization

Sun Jan 25 12:12:00 GMT 2009

Richard,

> This patch adds the capability to the vectorizer to perform versioning
> for the case of a constant (suitable) stride.
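
For illustration only (this is not taken from the patch): versioning on a
stride can be pictured as the C sketch below. The function name
axpy_strided and its signature are hypothetical, and C is used only because
it makes the stride explicit. The loop is duplicated under a run-time test;
the copy guarded by stride == 1 has contiguous accesses that a vectorizer can
handle, and the other copy keeps the general strided form.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch:
   y[i*stride] += a * x[i*stride] for i = 0 .. n-1.  */
void axpy_strided(float *y, const float *x, float a,
                  ptrdiff_t n, ptrdiff_t stride)
{
    if (stride == 1) {
        /* Versioned copy: the stride is a known constant here, so the
           accesses are contiguous and the loop is a vectorization candidate.  */
        for (ptrdiff_t i = 0; i < n; i++)
            y[i] += a * x[i];
    } else {
        /* Fallback copy: general stride, left as a scalar loop.  */
        for (ptrdiff_t i = 0; i < n; i++)
            y[i * stride] += a * x[i * stride];
    }
}
```

Both copies compute the same result; only the run-time check decides which
one executes.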

I have applied the patch on i686-apple-darwin9 (Core2 2.1 GHz, 4 MB cache,
2 GB RAM). It regtested without regressions. However, the following test:

program mymatmul
  implicit none
  integer, parameter :: n = 2000
  real, dimension(n,n) :: rr, ri
  complex, dimension(n,n) :: a,b,c
  real :: t1, t2
  integer :: i, j, k

  call random_number (rr)
  call random_number (ri)
  a = cmplx (rr, ri)
  call random_number (rr)
  call random_number (ri)
  b = cmplx (rr, ri)

  call cpu_time (t1)

  c = cmplx (0., 0.)
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do

  call cpu_time (t2)
  write (*,'(F8.4)') t2-t1

end program mymatmul

did not vectorize:

[ibook-dhum] bug/timing% gfc -m64 -O3 -ffast-math -funroll-loops 
-fomit-frame-pointer -ftree-vectorizer-verbose=2 mymatmul_db.f90

mymatmul_db.f90:24: note: not vectorized: can't calculate alignment for data ref.
mymatmul_db.f90:14: note: not vectorized: complicated access pattern.
mymatmul_db.f90:14: note: not vectorized: can't calculate alignment for data ref.
mymatmul_db.f90:11: note: not vectorized: complicated access pattern.
mymatmul_db.f90:11: note: not vectorized: can't calculate alignment for data ref.
mymatmul_db.f90:1: note: vectorized 0 loops in function.

Is this expected?

> I didn't yet performance test this extensively, but it might need
> cost-model adjustments and/or need to wait until we have profile feedback
> to properly seed vectorizer analysis here.  A micro-benchmark based on
> the above loop shows around 15% improvement on AMD K10.

I can only report some timings with the Polyhedron test suite:

================================================================================
Test Name       : pbharness
Compile Command : gfc %n.f90 -m64 -O3 -ffast-math -funroll-loops 
                             -ftree-loop-linear -fomit-frame-pointer 
                             -finline-limit=600 --param min-vect-loop-bound=2 
                             -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct 
                  linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

Date & Time     : 21 Jan 2009 14:06:52          24 Jan 2009  9:16:15 (patched)

  Bench.  Comp.  Exec.  Ave Run  #   Estim    Comp.    Exec. Ave Run   #  Estim
    Name (secs) (bytes)  (secs) Run  Err %   (secs)  (bytes)  (secs) Run  Err %
-------- ------ ------- ------- --- ------   ------ -------- ------- --- ------
      ac   2.33   42560   12.27   2 0.0081     2.51    42560   12.43   5 0.3163
  aermod  86.99 1270544   29.94   3 0.1371    92.59+ 1331976+  30.08   3 0.1636
     air   5.60   77336    8.40   2 0.0060     5.49    77336    8.35   2 0.0060
capacita   3.46   72760   55.41   2 0.0794     5.41+  105528+  51.79-  2 0.1690
 channel   2.11   38648    2.26   2 0.0442     2.13    38648    2.28   5 0.0683
   doduc  11.65  200024   43.07   2 0.0441    11.67   200024   42.97   2 0.0093
 fatigue   5.13   89024   10.78   5 0.3519     4.95    89024   11.87+  5 0.3516
 gas_dyn   6.45  708584   10.32   5 0.3332     6.51   708584   10.28   5 0.7988
  induct  10.03  181168   34.37   2 0.1222    10.37   181168   34.30   2 0.0087
   linpk   1.64   42536   27.63   2 0.0290     1.54    42536   27.67   2 0.0397
    mdbx   3.37   73000   14.74   2 0.0000     3.29    73000   14.80   2 0.0169
      nf  24.10  161416   31.91   2 0.0627    18.61-  140936-  32.06   2 0.0764
 protein  10.55  126424   47.05   2 0.0000    10.34   126424   46.24   3 0.1754
  rnflow  11.09  179616   36.14   2 0.0982    13.15+  191904+  36.61   2 0.1065
test_fpu  10.16  166512   12.39   2 0.0403    10.05   162416-  12.43   2 0.1006
    tfft   1.14   26432    2.82   2 0.0177     1.15    26432    2.84   3 0.1960

Geom. Mean Exec. Time =   17.01s                               17.07s

================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved

The timings show a ~7% improvement for capacita.f90 (55.41s to 51.79s), offset
by a ~10% degradation for fatigue.f90 (10.78s to 11.87s). All the other times
are within the noise.
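
The percentages can be recomputed from the "Ave Run" columns of the table. A
minimal C sketch follows; pct_change is a hypothetical helper name, not part
of the benchmark harness.

```c
/* Relative change of a run time in percent.
   Negative means the patched build is faster, positive means slower.  */
double pct_change(double before, double after)
{
    return 100.0 * (after - before) / before;
}
```

With the table values, pct_change(55.41, 51.79) gives about -6.5 for capacita
and pct_change(10.78, 11.87) gives about +10.1 for fatigue.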

Thanks for the patch.

Dominique

PS: Most of the time in capacita and tfft is spent in FFT subroutines that
are not vectorized. Can anything be done to change that?