The following test case does not vectorize:

  integer, parameter :: n = 1000000
  integer :: i, j, k
  real(8) :: pi, sum1, sum2, theta, phi, sini, cosi, dotp
  real(8) :: a(3), b(9,3), c(3)
  pi = acos(-1.0d0)
  theta = pi/9.0d0
  phi = pi/4.5d0
  do k = 1, 9
     b(k,1) = 0.5d0*cos(k*phi)*sin(k*theta)
     b(k,2) = 0.5d0*sin(k*phi)*sin(k*theta)
     b(k,3) = 0.5d0*cos(k*theta)
  end do
  theta = pi/real(n,kind=8)
  sum2 = 0.0
  do i = 1, n
     sini = sin(i*theta)
     cosi = cos(i*theta)
     phi = pi/4.5d0
     sum1 = 0.0d0
     do j = 1, 9
        c(1) = 0.5d0*cos(j*phi)*sini
        c(2) = 0.5d0*sin(j*phi)*sini
        c(3) = 0.5d0*cosi
        do k = 1, 9
           a(1) = b(k,1) - c(1)
           a(2) = b(k,2) - c(2)
           a(3) = b(k,3) - c(3)
           ! a = b(k,:) - c
           dotp = a(1)*a(1) + a(2)*a(2) + a(3)*a(3)
           ! dotp = dot_product(a,a)
           sum1 = sum1 + dotp
        end do
     end do
     sum2 = sum2 + sum1/81.0d0
  end do
  print *, 3.0d0*sum2/(4.0d0*pi*real(n,kind=8))
end

[ibook-dhum] bug/timing% gfc -O3 -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorizer-verbose=2 test_vect.f90
test_vect.f90:20: note: not vectorized: unsupported data-type complex(kind=8)
test_vect.f90:8: note: not vectorized: unsupported data-type complex(kind=8)
test_vect.f90:1: note: vectorized 0 loops in function.

while it vectorizes for

        do k = 1, 9
           ! a(1) = b(k,1) - c(1)
           ! a(2) = b(k,2) - c(2)
           ! a(3) = b(k,3) - c(3)
           a = b(k,:) - c
           dotp = a(1)*a(1) + a(2)*a(2) + a(3)*a(3)
           ! dotp = dot_product(a,a)
           sum1 = sum1 + dotp
        end do

[ibook-dhum] bug/timing% gfc -O3 -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorizer-verbose=2 test_vect.f90
test_vect.f90:24: note: LOOP VECTORIZED.
test_vect.f90:8: note: not vectorized: unsupported data-type complex(kind=8)
test_vect.f90:1: note: vectorized 1 loops in function.

(See http://gcc.gnu.org/ml/gcc-patches/2008-05/msg00033.html and PR34265.)

Note that neither variant vectorizes on powerpc-apple-darwin9.
With `a = b(k,:) - c` manually unrolled, the loop over k is unrolled by the early loop unrolling pass, which exposes the unvectorizable calls to sin/cos (respectively, the complex temporaries introduced by the sincos pass) to the vectorizer, which then punts. The early unroller at -O3 is only limited by the maximum final loop size and the trip count (400 and 8), and the unroller estimates

Loop 4 iterates 8 times.
Loop size: 40
Estimated size after unrolling: 216

SLP also doesn't handle vectorization of register operations but needs memory source and destination operands(?). Likewise, SLP shouldn't be confused by unvectorizable data types?

On x86_64 you can reproduce the missed vectorization with -O3 -ffast-math.

<bb 7>:
  # ivtmp.40_261 = PHI <9(6), ivtmp.40_240(8)>
  # sum1_5 = PHI <0.0(6), sum1_90(8)>
  # j_2 = PHI <1(6), j_94(8)>
  D.1032_55 = (real(kind=8)) j_2;
  D.1033_56 = D.1032_55 * 6.9813170079773179121929160828585736453533172607421875e-1;
  sincostmp.16_28 = __builtin_cexpi (D.1033_56);
  D.1034_57 = REALPART_EXPR <sincostmp.16_28>;
  D.1035_58 = sini_48 * D.1034_57;
  D.1036_59 = D.1035_58 * 5.0e-1;
  D.1037_62 = IMAGPART_EXPR <sincostmp.16_28>;
  D.1038_63 = sini_48 * D.1037_62;
  D.1039_64 = D.1038_63 * 5.0e-1;
  D.1044_128 = pretmp.30_150 - D.1036_59;
  D.1047_132 = pretmp.30_154 - D.1039_64;
  D.1052_137 = __builtin_pow (D.1044_128, 2.0e+0);
  D.1054_138 = __builtin_pow (D.1047_132, 2.0e+0);
  D.1044_149 = pretmp.30_168 - D.1036_59;
  D.1047_153 = pretmp.30_172 - D.1039_64;
  D.1052_158 = __builtin_pow (D.1044_149, 2.0e+0);
  D.1054_159 = __builtin_pow (D.1047_153, 2.0e+0);
  D.1044_170 = pretmp.30_188 - D.1036_59;
  D.1047_174 = pretmp.30_192 - D.1039_64;
  D.1052_179 = __builtin_pow (D.1044_170, 2.0e+0);
  D.1054_180 = __builtin_pow (D.1047_174, 2.0e+0);
  D.1044_191 = pretmp.30_206 - D.1036_59;
  D.1047_195 = pretmp.30_210 - D.1039_64;
  D.1052_200 = __builtin_pow (D.1044_191, 2.0e+0);
  D.1054_201 = __builtin_pow (D.1047_195, 2.0e+0);
  D.1044_212 = pretmp.30_218 - D.1036_59;
  D.1047_216 = pretmp.30_230 - D.1039_64;
  D.1052_221 = __builtin_pow (D.1044_212, 2.0e+0);
  D.1054_222 = __builtin_pow (D.1047_216, 2.0e+0);
  D.1044_233 = pretmp.30_238 - D.1036_59;
  D.1047_237 = pretmp.30_248 - D.1039_64;
  D.1052_242 = __builtin_pow (D.1044_233, 2.0e+0);
  D.1054_243 = __builtin_pow (D.1047_237, 2.0e+0);
  D.1044_254 = pretmp.30_256 - D.1036_59;
  D.1047_258 = pretmp.30_260 - D.1039_64;
  D.1052_263 = __builtin_pow (D.1044_254, 2.0e+0);
  D.1054_264 = __builtin_pow (D.1047_258, 2.0e+0);
  D.1044_275 = pretmp.30_276 - D.1036_59;
  D.1047_279 = pretmp.30_280 - D.1039_64;
  D.1052_284 = __builtin_pow (D.1044_275, 2.0e+0);
  D.1054_285 = __builtin_pow (D.1047_279, 2.0e+0);
  D.1044_71 = pretmp.30_68 - D.1036_59;
  D.1047_76 = pretmp.30_73 - D.1039_64;
  D.1052_83 = __builtin_pow (D.1044_71, 2.0e+0);
  D.1054_85 = __builtin_pow (D.1047_76, 2.0e+0);
  D.1055_86 = D.1054_85 + D.1052_83;
  dotp_89 = D.1055_86 + pretmp.33_294;
  dotp_288 = dotp_89 + D.1052_137;
  D.1055_286 = dotp_288 + D.1054_138;
  sum1_289 = pretmp.33_249 + D.1055_286;
  dotp_267 = sum1_289 + D.1052_158;
  D.1055_265 = dotp_267 + D.1054_159;
  sum1_268 = pretmp.33_228 + D.1055_265;
  dotp_246 = sum1_268 + D.1052_179;
  D.1055_244 = dotp_246 + D.1054_180;
  sum1_247 = pretmp.33_207 + D.1055_244;
  dotp_225 = sum1_247 + D.1052_200;
  D.1055_223 = dotp_225 + D.1054_201;
  sum1_226 = pretmp.33_186 + D.1055_223;
  dotp_204 = sum1_226 + D.1052_221;
  D.1055_202 = dotp_204 + D.1054_222;
  sum1_205 = pretmp.33_165 + D.1055_202;
  dotp_183 = sum1_205 + D.1052_242;
  D.1055_181 = dotp_183 + D.1054_243;
  sum1_184 = pretmp.33_144 + D.1055_181;
  dotp_162 = sum1_184 + D.1052_263;
  D.1055_160 = dotp_162 + D.1054_264;
  sum1_163 = pretmp.33_123 + D.1055_160;
  dotp_141 = sum1_163 + D.1052_284;
  D.1055_139 = dotp_141 + D.1054_285;
  sum1_142 = D.1055_139 + pretmp.33_292;
  sum1_90 = sum1_142 + sum1_5;
  j_94 = j_2 + 1;
  ivtmp.40_240 = ivtmp.40_261 - 1;
  if (ivtmp.40_240 == 0)
    goto <bb 9>;
  else
    goto <bb 8>;
The vectorizer doesn't know how to vectorize __builtin_cexpi or {REAL,IMAG}PART_EXPR either.

IMHO, rather than somehow tweaking the early unroller, the vectorizer should know how to deal with complex types.
(In reply to comment #1)
> SLP also doesn't handle vectorization of register operations but needs
> memory source and destination operands(?).

Right.

> Likewise SLP shouldn't be confused by unvectorizable data types?

SLP does get confused by unvectorizable data types.

Ira
(In reply to comment #2)
> The vectorizer doesn't know to vectorize __builtin_cexpi or
> {REAL,IMAG}PART_EXPR either.
>
> IMHO rather than somehow tweaking the early unroller the vectorizer should
> know how to deal with complex types.

There already exists a missed-optimization PR for vectorization of {REAL,IMAG}PART_EXPR: PR35252.

Ira
I just noticed today that the vectorization of the variant induct.v2.f90 depends on the -m64 flag:

[ibook-dhum] source/dir_indu% gfc -m64 -O3 -ffast-math -funroll-loops -ftree-vectorizer-verbose=2 indu.v2.f90
...
indu.v2.f90:2322: note: not vectorized: unsupported use in stmt.
indu.v2.f90:2245: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2244: note: vectorizing stmts using SLP.
indu.v2.f90:2244: note: LOOP VECTORIZED.
indu.v2.f90:2146: note: not vectorized: unsupported use in stmt.
indu.v2.f90:2069: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2068: note: vectorizing stmts using SLP.
indu.v2.f90:2068: note: LOOP VECTORIZED.
indu.v2.f90:1976: note: not vectorized: complicated access pattern.
indu.v2.f90:1875: note: vectorized 2 loops in function.
indu.v2.f90:1816: note: not vectorized: unsupported use in stmt.
indu.v2.f90:1771: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1770: note: vectorizing stmts using SLP.
indu.v2.f90:1770: note: LOOP VECTORIZED.
indu.v2.f90:1682: note: not vectorized: unsupported use in stmt.
indu.v2.f90:1633: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1632: note: vectorizing stmts using SLP.
indu.v2.f90:1632: note: LOOP VECTORIZED.
indu.v2.f90:1543: note: not vectorized: complicated access pattern.
indu.v2.f90:1441: note: vectorized 2 loops in function.
...

[ibook-dhum] source/dir_indu% gfc -O3 -ffast-math -funroll-loops -ftree-vectorizer-verbose=2 indu.v2.f90
...
indu.v2.f90:2334: note: LOOP VECTORIZED.
indu.v2.f90:2245: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2244: note: vectorizing stmts using SLP.
indu.v2.f90:2244: note: LOOP VECTORIZED.
indu.v2.f90:2158: note: LOOP VECTORIZED.
indu.v2.f90:2069: note: not vectorized: unsupported unaligned store.
indu.v2.f90:2068: note: vectorizing stmts using SLP.
indu.v2.f90:2068: note: LOOP VECTORIZED.
indu.v2.f90:1976: note: not vectorized: complicated access pattern.
indu.v2.f90:1875: note: vectorized 4 loops in function.
indu.v2.f90:1825: note: LOOP VECTORIZED.
indu.v2.f90:1771: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1770: note: vectorizing stmts using SLP.
indu.v2.f90:1770: note: LOOP VECTORIZED.
indu.v2.f90:1691: note: LOOP VECTORIZED.
indu.v2.f90:1633: note: not vectorized: unsupported unaligned store.
indu.v2.f90:1632: note: vectorizing stmts using SLP.
indu.v2.f90:1632: note: LOOP VECTORIZED.
indu.v2.f90:1543: note: not vectorized: complicated access pattern.
indu.v2.f90:1441: note: vectorized 4 loops in function.
...

Where the nested loop vectorized without -m64 at line 1691 is:

...
do j = 1, 9
   c_vector(3) = 0.5_longreal * h_coil * z1gauss(j)
   !
   ! rotate coil vector into the global coordinate system and translate it
   !
   rot_c_vector(1) = rot_i_vector(1) + rotate_coil(1,3) * c_vector(3)
   rot_c_vector(2) = rot_i_vector(2) + rotate_coil(2,3) * c_vector(3)
   rot_c_vector(3) = rot_i_vector(3) + rotate_coil(3,3) * c_vector(3)
   !
   do k = 1, 9   ! <==== line 1691
      !
      ! rotate quad vector into the global coordinate system
      !
      rot_q_vector(1) = rot_q1_vector(k,1) - rot_c_vector(1)
      rot_q_vector(2) = rot_q1_vector(k,2) - rot_c_vector(2)
      rot_q_vector(3) = rot_q1_vector(k,3) - rot_c_vector(3)
      !
      ! compute and add in quadrature term
      !
      numerator = dotp * w1gauss(j) * w2gauss(k)
      dotp2 = rot_q_vector(1)*rot_q_vector(1) + rot_q_vector(2)*rot_q_vector(2) + &
              rot_q_vector(3)*rot_q_vector(3)
      denominator = sqrt(dotp2)
      l12_lower = l12_lower + numerator/denominator
   end do
end do
...
I had a closer look at the test case in the original bug report, and it appears that the vectorizer fails when it attempts to vectorize __builtin_cexpi().

Regarding the different behavior of the vectorizer with -m32 and -m64 described in comment #5: it appears that with -m32 the inner loop is not unrolled and thus is vectorized later, while with -m64 the inner loop gets completely unrolled, so there is no chance to vectorize it later.
Regarding the missing SLP vectorization of reductions, I have opened a new bug report: PR37027.

Regarding the non-vectorized call to __builtin_cexpi(), we already have bug report PR22226 about missing support for generic vector calls to the IBM MASSV and Intel VML vector libraries. Thus, I propose to close this PR as a duplicate of PR22226.
*** This bug has been marked as a duplicate of 22226 ***